Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Kyotaro Horiguchi
At Wed, 8 Dec 2021 11:47:30 +0530, Bharath Rupireddy 
 wrote in 
> On Wed, Dec 8, 2021 at 10:59 AM Bossart, Nathan  wrote:
> > >> Another option we might want to consider is to just skip updating the
> > >> state entirely for end-of-recovery checkpoints.  The state would
> > >> instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION.  I
> > >> don't know if it's crucial to have a dedicated control file state for
> > >> end-of-recovery checkpoints.

FWIW, I find it simple but sufficient, since I regarded the
end-of-recovery checkpoint as a part of recovery.  In that case, the
only strange thing here is that the state transition passes through
the DB_SHUTDOWN(ING/ED) states.

On the other hand, when a server is going to shut down, the state stays
at DB_IN_PRODUCTION while clients are still connected, even though the
shutdown procedure has already started and no new clients can connect
to the server.  So there's no reason we need to be so particular about
the states around the end of recovery.
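
For reference, the states involved, as a sketch of pg_control.h's DBState
with the addition proposed by the v1 patch marked (the new value's name and
position are as proposed in this thread, not committed code):

typedef enum DBState
{
	DB_STARTUP = 0,
	DB_SHUTDOWNED,
	DB_SHUTDOWNED_IN_RECOVERY,
	DB_SHUTDOWNING,
	DB_IN_CRASH_RECOVERY,
	DB_IN_END_OF_RECOVERY_CHECKPOINT,	/* proposed by the v1 patch */
	DB_IN_ARCHIVE_RECOVERY,
	DB_IN_PRODUCTION
} DBState;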

> > > Please note that end-of-recovery can take a while in production
> > > systems (we have observed such things working with our customers) and
> > > anything can happen during that period of time. The end-of-recovery
> > > checkpoint is not something that gets finished momentarily. Therefore,
> > > having a separate db state in the control file is useful.
> >
> > Is there some useful distinction between the states for users?  ISTM
> > that users will be waiting either way, and I don't know that an extra
> > control file state will help all that much.  The main reason I bring
> > up this option is that the list of states is pretty short and appears
> > to be intended to indicate the high-level status of the server.  Most
> > of the states are over 20 years old, and the newest one is over 10
> > years old, so I don't think new states can be added willy-nilly.
> 
> Firstly, updating the control file with "DB_SHUTDOWNING" and
> "DB_SHUTDOWNED" for end-of-recovery checkpoint is wrong. I don't think
> having DB_IN_CRASH_RECOVERY for end-of-recovery checkpoint is a great
> idea. We have a checkpoint (which most of the time takes a while) in
> between the states DB_IN_CRASH_RECOVERY and DB_IN_PRODUCTION. The state
> DB_IN_END_OF_RECOVERY_CHECKPOINT added by the v1 patch at [1] (in this
> thread) helps users to understand and clearly distinguish what state
> the db is in.
> 
> IMHO, the age of the code doesn't stop us adding/fixing/improving the code.
> 
> > Of course, I could be off-base and others might agree that this new
> > state would be nice to have.
> 
> Let's see what others have to say about this.

It looks a bit too complex to me for the advantage.  When the
end-of-recovery checkpoint takes that long, the state is visible in the
server log, which operators would look into before the control file.

> [1] - 
> https://www.postgresql.org/message-id/CALj2ACVn5M8xgQ3RD%3D6rSTbbXRBdBWZ%3DTTOBOY_5%2BedMCkWjHA%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Make pg_waldump report replication origin ID, LSN, and timestamp.

2021-12-07 Thread Michael Paquier
On Mon, Dec 06, 2021 at 11:24:09PM +0900, Masahiko Sawada wrote:
> On Mon, Dec 6, 2021 at 5:09 PM Michael Paquier  wrote:
>> Shouldn't you check for parsed.origin_lsn instead?  The replication
>> origin is stored there as far as I read EndPrepare().
> 
> Also, looking at PrepareRedoAdd(), we check the replication origin id.
> So I think that it'd be better to check origin_id for consistency.

Okay, this consistency makes sense, then.  Perhaps a comment should be
added to explain that?
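
For clarity, the consistency check we are talking about would be roughly
like this (just a sketch, not the actual patch; origin_id would come from
XLogRecGetOrigin(record), the other fields from the parsed commit record):

	if (origin_id != InvalidRepOriginId)
		appendStringInfo(buf, "; origin: node %u, lsn %X/%X, at %s",
						 origin_id,
						 LSN_FORMAT_ARGS(parsed.origin_lsn),
						 timestamptz_to_str(parsed.origin_timestamp));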

>> Commit records check after XACT_XINFO_HAS_ORIGIN, but
>> xact_desc_abort() may include this information for ROLLBACK PREPARED
>> transactions so we could use the same logic as xact_desc_commit() for
>> the abort case, no?
> 
> Good catch! I'll submit an updated patch.

Thanks!
--
Michael


signature.asc
Description: PGP signature


Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display

2021-12-07 Thread Michael Paquier
On Wed, Dec 08, 2021 at 10:47:57AM +0530, Bharath Rupireddy wrote:
> Thanks for taking a look at the patch. How about the attached v4?
> 
> I added a CF entry - https://commitfest.postgresql.org/36/3443/

+   else if (source == XLOG_FROM_STREAM)
+   ereport(LOG,
+   (errmsg("recovering WAL segment \"%s\" received from 
primary",
+   xlogfname)));
This is incorrect when using a cascading standby.  And perhaps this
could use a switch/case?

While quickly testing, I was reminded that the LOG message for a segment
retrieved from the local pg_wal would generate some noise when running,
for example, the bootstrap process.  Is this one really interesting
compared to the other two?
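
To be concrete about the switch/case idea, something like the following is
what I have in mind (a sketch only; the messages are illustrative and each
source would need its own fully-formed, translatable sentence):

	switch (source)
	{
		case XLOG_FROM_ARCHIVE:
			ereport(LOG,
					(errmsg("restoring WAL segment \"%s\" from archive",
							xlogfname)));
			break;
		case XLOG_FROM_PG_WAL:
			ereport(LOG,
					(errmsg("recovering WAL segment \"%s\" from pg_wal",
							xlogfname)));
			break;
		case XLOG_FROM_STREAM:
			ereport(LOG,
					(errmsg("recovering WAL segment \"%s\" received from upstream server",
							xlogfname)));
			break;
		default:
			break;
	}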
--
Michael


signature.asc
Description: PGP signature


Re: Skipping logical replication transactions on subscriber side

2021-12-07 Thread Masahiko Sawada
On Wed, Dec 8, 2021 at 3:50 PM Amit Kapila  wrote:
>
> On Wed, Dec 8, 2021 at 11:48 AM Masahiko Sawada  wrote:
> >
> > On Wed, Dec 8, 2021 at 2:16 PM Amit Kapila  wrote:
> > >
> > > On Tue, Dec 7, 2021 at 5:06 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Mon, Dec 6, 2021 at 2:17 PM Amit Kapila  
> > > > wrote:
> > > >
> > > > I'll submit the patch tomorrow.
> > > >
> > > > While updating the patch, I realized that skipping a transaction that
> > > > is prepared on the publisher will be tricky a bit;
> > > >
> > > > First of all, since skip-xid is in pg_subscription catalog, we need to
> > > > do a catalog update in a transaction and commit it to disable it. I
> > > > think we need to set origin-lsn and timestamp of the transaction being
> > > > skipped to the transaction that does the catalog update. That is,
> > > > during skipping the (not prepared) transaction, we skip all
> > > > data-modification changes coming from the publisher, do a catalog
> > > > update, and commit the transaction. If we do the catalog update in the
> > > > next transaction after skipping the whole transaction, skip_xid could
> > > > be left in case of a server crash between them.
> > > >
> > >
> > > But if we haven't updated origin_lsn/timestamp before the crash, won't
> > > it request the same transaction again from the publisher? If so, it
> > > will be again able to skip it because skip_xid is still not updated.
> >
> > Yes. I mean that if we update origin_lsn and origin_timestamp when
> > committing the skipped transaction and then update the catalog in the
> > next transaction it doesn't work in case of a crash. But it's not
> > possible in the first place since the first transaction is empty and
> > we cannot set origin_lsn and origin_timestamp to it.
> >
> > >
> > > > Also, we cannot set
> > > > origin-lsn and timestamp to an empty transaction.
> > > >
> > >
> > > But won't we update the catalog for skip_xid in that case?
> >
> > Yes. Probably my explanation was not clear. Even if we skip all
> > changes of the transaction, the transaction doesn't become empty since
> > we update the catalog.
> >
> > >
> > > Do we see any advantage of updating the skip_xid in the same
> > > transaction vs. doing it in a separate transaction? If not then
> > > probably we can choose either of those ways and add some comments to
> > > indicate the possibility of doing it another way.
> >
> > I think that since the skipped transaction is always empty there is
> > always one transaction. What we need to consider is when we update
> > origin_lsn and origin_timestamp. In non-prepared transaction cases,
> > the only option is when updating the catalog.
> >
>
> Your last sentence is not completely clear to me but it seems you
> agree that we can use one transaction instead of two to skip the
> changes, perform a catalog update, and update origin_lsn/timestamp.

Yes.
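
To spell out the flow for the non-prepared case, roughly (pseudo-code; only
the replorigin_session_* variables and the transaction calls are the real
interface, the other names are illustrative):

	StartTransactionCommand();
	/* all data-modification changes of the remote xact have been skipped */
	clear_subscription_skip_xid(MySubscription->oid);	/* hypothetical helper */
	replorigin_session_origin_lsn = remote_commit_lsn;	/* from the commit message */
	replorigin_session_origin_timestamp = remote_commit_time;
	CommitTransactionCommand();	/* catalog update and origin advance commit together */

This way a crash either leaves skip_xid set (so the transaction is resent
and skipped again) or clears it together with advancing the origin, and we
never lose track of which transaction to skip.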

>
> > >
> > > > In prepared transaction cases, I think that when handling a prepare
> > > > message, we need to commit the transaction to update the catalog,
> > > > instead of preparing it. And at the commit prepared and rollback
> > > > prepared time, we skip it since there is not the prepared transaction
> > > > on the subscriber.
> > > >
> > >
> > > Can't we think of just allowing prepare in this case and updating the
> > > skip_xid only at commit time? I see that in this case, we would be
> > > doing prepare for a transaction that has no changes but as such cases
> > > won't be common, isn't that acceptable?
> >
> > In this case, we will end up committing both the prepared (empty)
> > transaction and the transaction that updates the catalog, right?
> >
>
> Can't we do this catalog update before committing the prepared
> transaction? If so, both in prepared and non-prepared cases, our
> implementation could be the same and we have a reason to accomplish
> the catalog update in the same transaction for which we skipped the
> changes.

But in case of a crash between these two transactions, given that
skip_xid is already cleared, how do we know which prepared transaction
was supposed to be skipped?

>
> > If
> > so, since these are separate transactions it can be a problem in case
> > of a crash between these two commits.
> >
> > >
> > > > Currently, handling rollback prepared already
> > > > behaves so; it first checks whether we have prepared the transaction
> > > > or not and skip it if haven’t. So I think we need to do that also for
> > > > commit prepared case. With that, this requires protocol changes so
> > > > that the subscriber can get prepare-lsn and prepare-time when handling
> > > > commit prepared.
> > > >
> > > > So I’m writing a separate patch to add prepare-lsn and timestamp to
> > > > commit_prepared message, which will be a building block for skipping
> > > > prepared transactions. Actually, I think it’s beneficial even today;
> > > > we can skip preparing the transaction if it’s an empty transaction.
> > > > Although the comment it’s not a common case, 

Re: Make mesage at end-of-recovery less scary.

2021-12-07 Thread Kyotaro Horiguchi
At Tue, 09 Nov 2021 16:27:51 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> This is the updated version.
> 
> - emode_for_corrupt_record() now uses currentSource instead of
>   readSource.
> 
> - If zero record length is faced, make sure the whole header is zeroed
>   before deciding it as the end-of-WAL.
> 
> - Do not overwrite the existing message when a missing contrecord is
>   detected.  The message added here is seen in the TAP test log
>   026_overwrite_contrecord_standby.log

Commit d2ddfa681db27a138acb63c8defa8cc6fa588922 removed the global
variables ReadRecPtr and EndRecPtr.  This is a rebased version that reads
the LSNs directly from the xlogreader instead of the removed global
variables.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From cc521692a9f98fabde07e248b63f1222f8406de1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v7] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening.  Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary, as "reached end of WAL".  Instead, raise
the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c   | 89 +++--
 src/backend/access/transam/xlogreader.c | 42 
 src/backend/replication/walreceiver.c   |  3 +-
 src/include/access/xlogreader.h |  1 +
 4 files changed, 112 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d894af310a..fa435faec4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4469,6 +4469,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, );
 		if (record == NULL)
@@ -4484,6 +4485,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 abortedRecPtr = xlogreader->abortedRecPtr;
 missingContrecPtr = xlogreader->missingContrecPtr;
+ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+/*
+ * NULL ReadRecPtr means we could not read a record at
+ * beginning. In that case EndRecPtr is storing the LSN of the
+ * record we tried to read.
+ */
+ErrRecPtr =
+	xlogreader->ReadRecPtr ?
+	xlogreader->ReadRecPtr : xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -4493,12 +4506,11 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we get here for other than end-of-wal, emit the error message
+			 * right now. Otherwise the message if any is shown as a part of
+			 * the end-of-WAL message below.
 			 */
-			if (errormsg)
+			if (!xlogreader->EndOfWAL && errormsg)
 ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
 		(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
@@ -4530,11 +4542,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4547,11 +4560,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+/*
+ * We don't report this as LOG, since we don't stop recovery
+ * here
+ */
 ereport(DEBUG1,
-		(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+		(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+		 LSN_FORMAT_ARGS(ErrRecPtr),
+		 replayTLI,
+		 xlogSourceNames[currentSource])));
 InArchiveRecovery = true;
 if (StandbyModeRequested)
 	StandbyMode = true;
@@ -4599,12 +4618,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-continue;
-			else
-return NULL;
+			/*
+			 *  If we haven't emit an error message, we have safely reached the
+			 *  

Re: Skipping logical replication transactions on subscriber side

2021-12-07 Thread Amit Kapila
On Wed, Dec 8, 2021 at 11:48 AM Masahiko Sawada  wrote:
>
> On Wed, Dec 8, 2021 at 2:16 PM Amit Kapila  wrote:
> >
> > On Tue, Dec 7, 2021 at 5:06 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Dec 6, 2021 at 2:17 PM Amit Kapila  
> > > wrote:
> > >
> > > I'll submit the patch tomorrow.
> > >
> > > While updating the patch, I realized that skipping a transaction that
> > > is prepared on the publisher will be tricky a bit;
> > >
> > > First of all, since skip-xid is in pg_subscription catalog, we need to
> > > do a catalog update in a transaction and commit it to disable it. I
> > > think we need to set origin-lsn and timestamp of the transaction being
> > > skipped to the transaction that does the catalog update. That is,
> > > during skipping the (not prepared) transaction, we skip all
> > > data-modification changes coming from the publisher, do a catalog
> > > update, and commit the transaction. If we do the catalog update in the
> > > next transaction after skipping the whole transaction, skip_xid could
> > > be left in case of a server crash between them.
> > >
> >
> > But if we haven't updated origin_lsn/timestamp before the crash, won't
> > it request the same transaction again from the publisher? If so, it
> > will be again able to skip it because skip_xid is still not updated.
>
> Yes. I mean that if we update origin_lsn and origin_timestamp when
> committing the skipped transaction and then update the catalog in the
> next transaction it doesn't work in case of a crash. But it's not
> possible in the first place since the first transaction is empty and
> we cannot set origin_lsn and origin_timestamp to it.
>
> >
> > > Also, we cannot set
> > > origin-lsn and timestamp to an empty transaction.
> > >
> >
> > But won't we update the catalog for skip_xid in that case?
>
> Yes. Probably my explanation was not clear. Even if we skip all
> changes of the transaction, the transaction doesn't become empty since
> we update the catalog.
>
> >
> > Do we see any advantage of updating the skip_xid in the same
> > transaction vs. doing it in a separate transaction? If not then
> > probably we can choose either of those ways and add some comments to
> > indicate the possibility of doing it another way.
>
> I think that since the skipped transaction is always empty there is
> always one transaction. What we need to consider is when we update
> origin_lsn and origin_timestamp. In non-prepared transaction cases,
> the only option is when updating the catalog.
>

Your last sentence is not completely clear to me but it seems you
agree that we can use one transaction instead of two to skip the
changes, perform a catalog update, and update origin_lsn/timestamp.

> >
> > > In prepared transaction cases, I think that when handling a prepare
> > > message, we need to commit the transaction to update the catalog,
> > > instead of preparing it. And at the commit prepared and rollback
> > > prepared time, we skip it since there is not the prepared transaction
> > > on the subscriber.
> > >
> >
> > Can't we think of just allowing prepare in this case and updating the
> > skip_xid only at commit time? I see that in this case, we would be
> > doing prepare for a transaction that has no changes but as such cases
> > won't be common, isn't that acceptable?
>
> In this case, we will end up committing both the prepared (empty)
> transaction and the transaction that updates the catalog, right?
>

Can't we do this catalog update before committing the prepared
transaction? If so, both in prepared and non-prepared cases, our
implementation could be the same and we have a reason to accomplish
the catalog update in the same transaction for which we skipped the
changes.

> If
> so, since these are separate transactions it can be a problem in case
> of a crash between these two commits.
>
> >
> > > Currently, handling rollback prepared already
> > > behaves so; it first checks whether we have prepared the transaction
> > > or not and skip it if haven’t. So I think we need to do that also for
> > > commit prepared case. With that, this requires protocol changes so
> > > that the subscriber can get prepare-lsn and prepare-time when handling
> > > commit prepared.
> > >
> > > So I’m writing a separate patch to add prepare-lsn and timestamp to
> > > commit_prepared message, which will be a building block for skipping
> > > prepared transactions. Actually, I think it’s beneficial even today;
> > > we can skip preparing the transaction if it’s an empty transaction.
> > > Although the comment it’s not a common case, I think that it could
> > > happen quite often in some cases:
> > >
> > > * XXX, We can optimize such that at commit prepared time, we first 
> > > check
> > > * whether we have prepared the transaction or not but that doesn't 
> > > seem
> > > * worthwhile because such cases shouldn't be common.
> > > */
> > >
> > > For example, if the publisher has multiple subscriptions and there are
> > > many prepared 

Re: row filtering for logical replication

2021-12-07 Thread Amit Kapila
On Mon, Dec 6, 2021 at 6:04 PM Euler Taveira  wrote:
>
> On Mon, Dec 6, 2021, at 3:35 AM, Dilip Kumar wrote:
>
> On Mon, Dec 6, 2021 at 6:49 AM Euler Taveira  wrote:
> >
> > On Fri, Dec 3, 2021, at 8:12 PM, Euler Taveira wrote:
> >
> > PS> I will update the commit message in the next version. I barely changed 
> > the
> > documentation to reflect the current behavior. I probably missed some 
> > changes
> > but I will fix in the next version.
> >
> > I realized that I forgot to mention a few things about the UPDATE behavior.
> > Regardless of 0003, we need to define which tuple will be used to evaluate 
> > the
> > row filter for UPDATEs. We already discussed it circa [1]. This current 
> > version
> > chooses *new* tuple. Is it the best choice?
>
> But with 0003, we are using both the tuple for evaluating the row
> filter, so instead of fixing 0001, why we don't just merge 0003 with
> 0001?  I mean eventually, 0003 is doing what is the agreed behavior,
> i.e. if just OLD is matching the filter then convert the UPDATE to
> DELETE OTOH if only new is matching the filter then convert the UPDATE
> to INSERT.  Do you think that even we merge 0001 and 0003 then also
> there is an open issue regarding which row to select for the filter?
>
> Maybe I was not clear. IIUC we are still discussing 0003 and I would like to
> propose a different default based on the conclusion I came up. If we merged
> 0003, that's fine; this change will be useless. If we don't or it is optional,
> it still has its merit.
>
> Do we want to pay the overhead to evaluating both tuple for UPDATEs? I'm still
> processing if it is worth it. If you think that in general the row filter
> contains the primary key and it is rare to change it, it will waste cycles
> evaluating the same expression twice. It seems this behavior could be
> controlled by a parameter.
>

I think the first thing we should do here is to evaluate the
performance of both cases (applying the filter to both tuples vs. to
only one of the tuples). Even if the performance difference turns out
to be unacceptable, I think it would be better to still compare both
tuples by default, to avoid data inconsistency issues, and to provide
an option that allows comparing only one of the tuples.
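
For reference, the OLD/NEW handling in 0003 that we would be measuring
amounts to something like this (a sketch only; the names are illustrative,
not the actual patch code):

	bool	old_match = row_filter_matches(entry, old_tuple);	/* illustrative */
	bool	new_match = row_filter_matches(entry, new_tuple);

	if (old_match && new_match)
		action = REORDER_BUFFER_CHANGE_UPDATE;	/* replicate as UPDATE */
	else if (old_match && !new_match)
		action = REORDER_BUFFER_CHANGE_DELETE;	/* row leaves the filtered set */
	else if (!old_match && new_match)
		action = REORDER_BUFFER_CHANGE_INSERT;	/* row enters the filtered set */
	else
		skip = true;							/* neither tuple matches */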

-- 
With Regards,
Amit Kapila.




Re: GUC flags

2021-12-07 Thread Michael Paquier
On Mon, Dec 06, 2021 at 07:36:55AM -0600, Justin Pryzby wrote:
> The script checks that guc.c and sample config are consistent.
> 
> I think your understanding of INTENTIONALLY_NOT_INCLUDED is not right.
> That's a list of stuff it "avoids reporting" as a suspected error, not an
> additional list of stuff to check.  INTENTIONALLY_NOT_INCLUDED is a list of
> stuff like NOT_IN_SAMPLE, which is better done by parsing /NOT_IN_SAMPLE/.

Indeed.  I got that wrong, thanks for clarifying.

> I saw that Tom updated it within the last 12 months, which I took to mean that
> it was still being maintained. But I'm okay with removing it.

Yes, I saw that as of bf8a662.  With 42 incorrect reports, I see more
evidence in favor of removing it.  Before doing anything, let's wait and
gather some opinions.  I am adding Bruce (as the original author) and
Tom in CC, as they are the ones who have updated this script the most
over the last ~15 years.
--
Michael


signature.asc
Description: PGP signature


Re: Skipping logical replication transactions on subscriber side

2021-12-07 Thread Masahiko Sawada
On Wed, Dec 8, 2021 at 2:16 PM Amit Kapila  wrote:
>
> On Tue, Dec 7, 2021 at 5:06 PM Masahiko Sawada  wrote:
> >
> > On Mon, Dec 6, 2021 at 2:17 PM Amit Kapila  wrote:
> >
> > I'll submit the patch tomorrow.
> >
> > While updating the patch, I realized that skipping a transaction that
> > is prepared on the publisher will be tricky a bit;
> >
> > First of all, since skip-xid is in pg_subscription catalog, we need to
> > do a catalog update in a transaction and commit it to disable it. I
> > think we need to set origin-lsn and timestamp of the transaction being
> > skipped to the transaction that does the catalog update. That is,
> > during skipping the (not prepared) transaction, we skip all
> > data-modification changes coming from the publisher, do a catalog
> > update, and commit the transaction. If we do the catalog update in the
> > next transaction after skipping the whole transaction, skip_xid could
> > be left in case of a server crash between them.
> >
>
> But if we haven't updated origin_lsn/timestamp before the crash, won't
> it request the same transaction again from the publisher? If so, it
> will be again able to skip it because skip_xid is still not updated.

Yes. I mean that if we update origin_lsn and origin_timestamp when
committing the skipped transaction and then update the catalog in the
next transaction it doesn't work in case of a crash. But it's not
possible in the first place since the first transaction is empty and
we cannot set origin_lsn and origin_timestamp to it.

>
> > Also, we cannot set
> > origin-lsn and timestamp to an empty transaction.
> >
>
> But won't we update the catalog for skip_xid in that case?

Yes. Probably my explanation was not clear. Even if we skip all
changes of the transaction, the transaction doesn't become empty since
we update the catalog.

>
> Do we see any advantage of updating the skip_xid in the same
> transaction vs. doing it in a separate transaction? If not then
> probably we can choose either of those ways and add some comments to
> indicate the possibility of doing it another way.

I think that since the skipped transaction is always empty there is
always one transaction. What we need to consider is when we update
origin_lsn and origin_timestamp. In non-prepared transaction cases,
the only option is when updating the catalog.

>
> > In prepared transaction cases, I think that when handling a prepare
> > message, we need to commit the transaction to update the catalog,
> > instead of preparing it. And at the commit prepared and rollback
> > prepared time, we skip it since there is not the prepared transaction
> > on the subscriber.
> >
>
> Can't we think of just allowing prepare in this case and updating the
> skip_xid only at commit time? I see that in this case, we would be
> doing prepare for a transaction that has no changes but as such cases
> won't be common, isn't that acceptable?

In this case, we will end up committing both the prepared (empty)
transaction and the transaction that updates the catalog, right? If
so, since these are separate transactions it can be a problem in case
of a crash between these two commits.

>
> > Currently, handling rollback prepared already
> > behaves so; it first checks whether we have prepared the transaction
> > or not and skip it if haven’t. So I think we need to do that also for
> > commit prepared case. With that, this requires protocol changes so
> > that the subscriber can get prepare-lsn and prepare-time when handling
> > commit prepared.
> >
> > So I’m writing a separate patch to add prepare-lsn and timestamp to
> > commit_prepared message, which will be a building block for skipping
> > prepared transactions. Actually, I think it’s beneficial even today;
> > we can skip preparing the transaction if it’s an empty transaction.
> > Although the comment it’s not a common case, I think that it could
> > happen quite often in some cases:
> >
> > * XXX, We can optimize such that at commit prepared time, we first check
> > * whether we have prepared the transaction or not but that doesn't seem
> > * worthwhile because such cases shouldn't be common.
> > */
> >
> > For example, if the publisher has multiple subscriptions and there are
> > many prepared transactions that modify the particular table subscribed
> > by one publisher, many empty transactions are replicated to other
> > subscribers.
> >
>
> This is not clear to me. Why would one have multiple subscriptions
> for the same publication? I thought it is possible when, say, some
> publisher doesn't publish any data of a prepared transaction because
> the corresponding action is not published or something like that. I
> don't deny that someday we may want to optimize this case, but it
> might be better if we don't need to do it along with this patch.

I imagined that the publisher has two publications (say pub-A and
pub-B) that publish different sets of relations in the database and
there are two subscribers that are 

Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 10:59 AM Bossart, Nathan  wrote:
> >> Another option we might want to consider is to just skip updating the
> >> state entirely for end-of-recovery checkpoints.  The state would
> >> instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION.  I
> >> don't know if it's crucial to have a dedicated control file state for
> >> end-of-recovery checkpoints.
> >
> > Please note that end-of-recovery can take a while in production
> > systems (we have observed such things working with our customers) and
> > anything can happen during that period of time. The end-of-recovery
> > checkpoint is not something that gets finished momentarily. Therefore,
> > having a separate db state in the control file is useful.
>
> Is there some useful distinction between the states for users?  ISTM
> that users will be waiting either way, and I don't know that an extra
> control file state will help all that much.  The main reason I bring
> up this option is that the list of states is pretty short and appears
> to be intended to indicate the high-level status of the server.  Most
> of the states are over 20 years old, and the newest one is over 10
> years old, so I don't think new states can be added willy-nilly.

Firstly, updating the control file with "DB_SHUTDOWNING" and
"DB_SHUTDOWNED" for end-of-recovery checkpoint is wrong. I don't think
having DB_IN_CRASH_RECOVERY for end-of-recovery checkpoint is a great
idea. We have a checkpoint (which most of the time takes a while) in
between the states DB_IN_CRASH_RECOVERY and DB_IN_PRODUCTION. The state
DB_IN_END_OF_RECOVERY_CHECKPOINT added by the v1 patch at [1] (in this
thread) helps users to understand and clearly distinguish what state
the db is in.

IMHO, the age of the code doesn't stop us adding/fixing/improving the code.

> Of course, I could be off-base and others might agree that this new
> state would be nice to have.

Let's see what others have to say about this.

[1] - 
https://www.postgresql.org/message-id/CALj2ACVn5M8xgQ3RD%3D6rSTbbXRBdBWZ%3DTTOBOY_5%2BedMCkWjHA%40mail.gmail.com

Regards,
Bharath Rupireddy.




Re: Data is copied twice when specifying both child and parent table in publication

2021-12-07 Thread Amit Kapila
On Wed, Dec 8, 2021 at 11:30 AM vignesh C  wrote:
>
> On Wed, Dec 8, 2021 at 11:11 AM Amit Kapila  wrote:
> >
> > On Tue, Dec 7, 2021 at 5:53 PM vignesh C  wrote:
> > >
> > > On Fri, Dec 3, 2021 at 11:24 AM houzj.f...@fujitsu.com
> > >  wrote:
> > > >
> > >
> > > 2) Any particular reason why the code and tests are backbranched but
> > > not the document changes?
> > >
> >
> > I am not sure whether we need the doc change or not as this is not a
> > new feature and even if we need it as an improvement to docs, shall we
> > consider backpatching it? I felt that code changes are required to fix
> > a known issue so the case of backpatching it is clear.
>
> Thanks for the clarification, I got your point. I'm fine either way
> regarding the documentation change. The rest of the patch looks good
> to me.
>

Okay, I have also verified the code and test changes for all branches.
I'll wait for a day to see if anybody else has any comments and then
commit this.

-- 
With Regards,
Amit Kapila.




Re: row filtering for logical replication

2021-12-07 Thread Peter Smith
On Thu, Dec 2, 2021 at 2:59 PM houzj.f...@fujitsu.com
 wrote:
>
...
> Attach the v44-0005 top-up patch.
> This version addressed all the comments received so far,
> mainly including the following changes:
> 1) rename rfcol_valid_for_replica to rfcol_valid
> 2) Remove the struct PublicationInfo and add the rfcol_valid flag directly in 
> relation
> 3) report the invalid column number in the error message.
> 4) Rename some function to match the usage.
> 5) Fix some typos and add some code comments.
> 6) Fix a miss in testcase.

Below are my review comments for the most recent v44-0005 (top-up) patch:

==

1. src/backend/executor/execReplication.c
+ invalid_rfcol = RelationGetInvalRowFilterCol(rel);
+
+ /*
+ * It is only safe to execute UPDATE/DELETE when all columns of the row
+ * filters from publications which the relation is in are part of the
+ * REPLICA IDENTITY.
+ */
+ if (invalid_rfcol != InvalidAttrNumber)
+ {

It seemed confusing that when the invalid_rfcol is NOT invalid at all
then it is InvalidAttrNumber, so perhaps this code would be easier to
read if instead the condition was written just as:
---
if (invalid_rfcol)
{
...
}
---



2. invalid_rfcol var name
This variable name is used in a few places but I thought it was too
closely named with the "rfcol_valid" variable even though it has a
completely different meaning. IMO "invalid_rfcol" might be better
named "invalid_rfcolnum" or something like that to reinforce that it
is an AttributeNumber.



3. src/backend/utils/cache/relcache.c - function comment
+ * If not all the row filter columns are part of REPLICA IDENTITY, return the
+ * invalid column number, InvalidAttrNumber otherwise.
+ */

Minor rewording:
"InvalidAttrNumber otherwise." --> "otherwise InvalidAttrNumber."



4. src/backend/utils/cache/relcache.c - function name
+AttrNumber
+RelationGetInvalRowFilterCol(Relation relation)

IMO nothing was gained by saving 2 chars of the name.
"RelationGetInvalRowFilterCol" --> "RelationGetInvalidRowFilterCol"



5. src/backend/utils/cache/relcache.c
+/* For invalid_rowfilter_column_walker. */
+typedef struct {
+ AttrNumber invalid_rfcol;
+ Bitmapset  *bms_replident;
+} rf_context;
+

The members should be commented.
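
For example, something along these lines (the comment wording is my guess
at the intent, so adjust as appropriate):

+/* Context for invalid_rowfilter_column_walker. */
+typedef struct {
+	/* first row-filter column found that is not part of the replica identity */
+	AttrNumber	invalid_rfcol;
+	/* the columns making up the relation's replica identity */
+	Bitmapset  *bms_replident;
+} rf_context;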



6. src/include/utils/rel.h
  /*
+ * true if the columns of row filters from all the publications the
+ * relation is in are part of replica identity.
+ */
+ bool rd_rfcol_valid;

I felt the member comment is not quite telling the full story. e.g.
IIUC this member is also true when pubaction is something other than
update/delete - but that case doesn't even do replica identity
checking at all. There might not even be any replica identity.



7. src/test/regress/sql/publication.sql
 CREATE PUBLICATION testpub6 FOR TABLE rf_tbl_abcd_pk WHERE (a > 99);
+-- fail - "a" is in PK but it is not part of REPLICA IDENTITY INDEX
+update rf_tbl_abcd_pk set a = 1;
+DROP PUBLICATION testpub6;
 -- ok - "c" is not in PK but it is part of REPLICA IDENTITY INDEX
-SET client_min_messages = 'ERROR';
 CREATE PUBLICATION testpub6 FOR TABLE rf_tbl_abcd_pk WHERE (c > 99);
 DROP PUBLICATION testpub6;
-RESET client_min_messages;
--- fail - "a" is not in REPLICA IDENTITY INDEX
 CREATE PUBLICATION testpub6 FOR TABLE rf_tbl_abcd_nopk WHERE (a > 99);
+-- fail - "a" is not in REPLICA IDENTITY INDEX
+update rf_tbl_abcd_nopk set a = 1;

The "update" DML should be uppercase "UPDATE" for consistency with the
surrounding tests.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Data is copied twice when specifying both child and parent table in publication

2021-12-07 Thread vignesh C
On Wed, Dec 8, 2021 at 11:11 AM Amit Kapila  wrote:
>
> On Tue, Dec 7, 2021 at 5:53 PM vignesh C  wrote:
> >
> > On Fri, Dec 3, 2021 at 11:24 AM houzj.f...@fujitsu.com
> >  wrote:
> > >
> >
> > 2) Any particular reason why the code and tests are backbranched but
> > not the document changes?
> >
>
> I am not sure whether we need the doc change or not as this is not a
> new feature and even if we need it as an improvement to docs, shall we
> consider backpatching it? I felt that code changes are required to fix
> a known issue so the case of backpatching it is clear.

Thanks for the clarification, I got your point. I'm fine either way
regarding the documentation change. The rest of the patch looks good
to me.

Regards,
Vignesh




Re: Data is copied twice when specifying both child and parent table in publication

2021-12-07 Thread Amit Kapila
On Tue, Dec 7, 2021 at 5:53 PM vignesh C  wrote:
>
> On Fri, Dec 3, 2021 at 11:24 AM houzj.f...@fujitsu.com
>  wrote:
> >
>
> 2) Any particular reason why the code and tests are backbranched but
> not the document changes?
>

I am not sure whether we need the doc change or not as this is not a
new feature and even if we need it as an improvement to docs, shall we
consider backpatching it? I felt that code changes are required to fix
a known issue so the case of backpatching it is clear.

-- 
With Regards,
Amit Kapila.




RE: [PATCH]Comment improvement in publication.sql

2021-12-07 Thread tanghy.f...@fujitsu.com
On Wednesday, December 8, 2021 1:49 PM, vignesh C  wrote:

> The patch no longer applies, could you post a rebased patch.

Thanks for your kind reminder. Attached is a rebased patch.
Some changes in the v4 patch have already been addressed by 5a28324, so I deleted those changes.

Regards,
Tang 


v5-0001-Fix-comments-in-publication.sql.patch
Description: v5-0001-Fix-comments-in-publication.sql.patch


Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 8:42 PM, "Bharath Rupireddy" 
 wrote:
> On Wed, Dec 8, 2021 at 9:49 AM Bossart, Nathan  wrote:
>> I think that's alright.  The only other small suggestion I have would
>> be to say "during end-of-recovery checkpoint" instead of "while in
>> end-of-recovery checkpoint."
>
> "while in" is being used by DB_IN_CRASH_RECOVERY and
> DB_IN_ARCHIVE_RECOVERY messages. I don't think it's a good idea to
> deviate from that and use "during".

Fair enough.  I don't have a strong opinion about this.

>> Another option we might want to consider is to just skip updating the
>> state entirely for end-of-recovery checkpoints.  The state would
>> instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION.  I
>> don't know if it's crucial to have a dedicated control file state for
>> end-of-recovery checkpoints.
>
> Please note that end-of-recovery can take a while in production
> systems (we have observed such things working with our customers) and
> anything can happen during that period of time. The end-of-recovery
> checkpoint is not something that gets finished momentarily. Therefore,
> having a separate db state in the control file is useful.

Is there some useful distinction between the states for users?  ISTM
that users will be waiting either way, and I don't know that an extra
control file state will help all that much.  The main reason I bring
up this option is that the list of states is pretty short and appears
to be intended to indicate the high-level status of the server.  Most
of the states are over 20 years old, and the newest one is over 10
years old, so I don't think new states can be added willy-nilly.

Of course, I could be off-base and others might agree that this new
state would be nice to have.

Nathan



Re: row filtering for logical replication

2021-12-07 Thread Amit Kapila
On Tue, Dec 7, 2021 at 6:31 PM Ashutosh Bapat
 wrote:
>
> On Tue, Dec 7, 2021 at 12:18 PM tanghy.f...@fujitsu.com
>  wrote:
> >
> > I have another problem with your patch. The document says:
> >
> > ... If the subscription has several publications in
> > +   which the same table has been published with different filters, those
> > +   expressions get OR'ed together so that rows satisfying any of the 
> > expressions
> > +   will be replicated. Notice this means if one of the publications has no 
> > filter
> > +   at all then all other filters become redundant.
> >
> > Then, what if one of the publications is specified as 'FOR ALL TABLES' or 
> > 'FOR
> > ALL TABLES IN SCHEMA'.
> >
> > For example:
> > create table tbl (a int primary key);"
> > create publication p1 for table tbl where (a > 10);
> > create publication p2 for all tables;
> > create subscription sub connection 'dbname=postgres port=5432' publication 
> > p1, p2;
>
> Thanks for the example. I was wondering about this case myself.
>

I think we should handle this case.

> >
> > I think for "FOR ALL TABLE" publication(p2 in my case), table tbl should be
> > treated as no filter, and table tbl should have no filter in subscription 
> > sub. Thoughts?
> >
> > But for now, the filter(a > 10) works both when copying initial data and 
> > later changes.
> >
> > To fix it, I think we can check if the table is published in a 'FOR ALL 
> > TABLES'
> > publication or published as part of schema in function 
> > pgoutput_row_filter_init
> > (which was introduced in v44-0003 patch), also we need to make some changes 
> > in
> > tablesync.c.
>
> In order to check "FOR ALL_TABLES", we might need to fetch publication
> metadata.
>

Do we really need to perform a separate fetch for this? In
get_rel_sync_entry(), we already have this information; can't we
somehow stash it in the corresponding RelationSyncEntry so that it
can be used later for row filtering?

> Instead of that can we add a "TRUE" filter on all the tables
> which are part of FOR ALL TABLES publication?
>

How? We won't have an entry for such tables in pg_publication_rel
where we store row_filter information.

-- 
With Regards,
Amit Kapila.




Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 10:05 AM Michael Paquier  wrote:
>
> On Tue, Dec 07, 2021 at 06:28:24PM +0530, Bharath Rupireddy wrote:
> > Upon thinking further, having at least the messages at LOG level [1]
> > would be helpful to know what's happening with the system while in
> > recovery. Although these messages at LOG level seem to be filling up
> > the server logs, having a good log consumption and rotation mechanism
> > (I'm sure every major postgres vendor would have one) would be
> > sufficient to allay that concern.
>
> +  ereport(LOG,
> +  (errmsg("recovering WAL segment \"%s\" from source \"%s\"",
> +   xlogfname, srcname)));
> Isn't this approach an issue for translations?  It seems to me that
> terms like "stream" or "archive" had better be translated properly,
> meaning that this needs a fully-constructed sentence.

Thanks for taking a look at the patch. How about the attached v4?

I added a CF entry - https://commitfest.postgresql.org/36/3443/

Regards,
Bharath Rupireddy.


v4-0001-add-log-messages-for-recovery.patch
Description: Binary data


Re: Skipping logical replication transactions on subscriber side

2021-12-07 Thread Amit Kapila
On Tue, Dec 7, 2021 at 5:06 PM Masahiko Sawada  wrote:
>
> On Mon, Dec 6, 2021 at 2:17 PM Amit Kapila  wrote:
>
> I'll submit the patch tomorrow.
>
> While updating the patch, I realized that skipping a transaction that
> is prepared on the publisher will be tricky a bit;
>
> First of all, since skip-xid is in pg_subscription catalog, we need to
> do a catalog update in a transaction and commit it to disable it. I
> think we need to set origin-lsn and timestamp of the transaction being
> skipped to the transaction that does the catalog update. That is,
> during skipping the (not prepared) transaction, we skip all
> data-modification changes coming from the publisher, do a catalog
> update, and commit the transaction. If we do the catalog update in the
> next transaction after skipping the whole transaction, skip_xid could
> be left in case of a server crash between them.
>

But if we haven't updated origin_lsn/timestamp before the crash, won't
it request the same transaction again from the publisher? If so, it
will be again able to skip it because skip_xid is still not updated.

> Also, we cannot set
> origin-lsn and timestamp to an empty transaction.
>

But won't we update the catalog for skip_xid in that case?

Do we see any advantage of updating the skip_xid in the same
transaction vs. doing it in a separate transaction? If not then
probably we can choose either of those ways and add some comments to
indicate the possibility of doing it another way.

> In prepared transaction cases, I think that when handling a prepare
> message, we need to commit the transaction to update the catalog,
> instead of preparing it. And at the commit prepared and rollback
> prepared time, we skip it since there is not the prepared transaction
> on the subscriber.
>

Can't we think of just allowing prepare in this case and updating the
skip_xid only at commit time? I see that in this case, we would be
doing prepare for a transaction that has no changes but as such cases
won't be common, isn't that acceptable?

> Currently, handling rollback prepared already
> behaves so; it first checks whether we have prepared the transaction
> or not and skip it if haven’t. So I think we need to do that also for
> commit prepared case. With that, this requires protocol changes so
> that the subscriber can get prepare-lsn and prepare-time when handling
> commit prepared.
>
> So I’m writing a separate patch to add prepare-lsn and timestamp to
> commit_prepared message, which will be a building block for skipping
> prepared transactions. Actually, I think it’s beneficial even today;
> we can skip preparing the transaction if it’s an empty transaction.
> Although the comment it’s not a common case, I think that it could
> happen quite often in some cases:
>
> * XXX, We can optimize such that at commit prepared time, we first check
> * whether we have prepared the transaction or not but that doesn't seem
> * worthwhile because such cases shouldn't be common.
> */
>
> For example, if the publisher has multiple subscriptions and there are
> many prepared transactions that modify the particular table subscribed
> by one publisher, many empty transactions are replicated to other
> subscribers.
>

This is not clear to me. Why would one have multiple subscriptions
for the same publication? I thought it is possible when, say, some
publisher doesn't publish any data of a prepared transaction because
the corresponding action is not published or something like that. I
don't deny that someday we may want to optimize this case, but it
might be better if we don't need to do it along with this patch.

-- 
With Regards,
Amit Kapila.




Re: [PATCH]Comment improvement in publication.sql

2021-12-07 Thread vignesh C
On Sun, Aug 8, 2021 at 4:26 PM tanghy.f...@fujitsu.com
 wrote:
>
> On Sunday, August 8, 2021 6:34 PM, vignesh C  wrote
> >Thanks for the updated patch, your changes look good to me. You might
> >want to include the commit message in the patch, that will be useful.
>
> Thanks for your kindly suggestion.
> Updated patch attached. Added the patch commit message with no new fix.

The patch no longer applies, could you post a rebased patch.

Regards,
Vignesh




Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 9:49 AM Bossart, Nathan  wrote:
>
> On 12/7/21, 5:21 PM, "Bharath Rupireddy" 
>  wrote:
> > On Wed, Dec 8, 2021 at 2:50 AM Bossart, Nathan  wrote:
> >> I noticed that some (but not all) of the surrounding messages say
> >> "last known up at" the control file time.  I'm curious why you chose
> >> not to use that phrasing in this case.
> >
> > If state is DB_IN_END_OF_RECOVERY_CHECKPOINT that means the db was
> > interrupted while in end-of-recovery checkpoint, so I used the
> > phrasing similar to DB_IN_CRASH_RECOVERY and DB_IN_ARCHIVE_RECOVERY
> > cases. I would like to keep it as-is (in the v1 patch) unless anyone
> > has other thoughts here?
> > (errmsg("database system was interrupted while in recovery at %s",
> > (errmsg("database system was interrupted while in recovery at log time %s",
>
> I think that's alright.  The only other small suggestion I have would
> be to say "during end-of-recovery checkpoint" instead of "while in
> end-of-recovery checkpoint."

"while in" is being used by DB_IN_CRASH_RECOVERY and
DB_IN_ARCHIVE_RECOVERY messages. I don't think it's a good idea to
deviate from that and use "during".

> Another option we might want to consider is to just skip updating the
> state entirely for end-of-recovery checkpoints.  The state would
> instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION.  I
> don't know if it's crucial to have a dedicated control file state for
> end-of-recovery checkpoints.

Please note that end-of-recovery can take a while in production
systems (we have observed such things working with our customers) and
anything can happen during that period of time. The end-of-recovery
checkpoint is not something that gets finished momentarily. Therefore,
having a separate db state in the control file is useful.

Regards,
Bharath Rupireddy.




Re: parallel vacuum comments

2021-12-07 Thread Masahiko Sawada
On Wed, Dec 8, 2021 at 12:22 PM houzj.f...@fujitsu.com
 wrote:
>
> On Tuesday, December 7, 2021 1:42 PM Masahiko Sawada  
> wrote:
> > I've attached an updated patch. I've removed 0003 patch that added
> > regression tests as per discussion. Regarding the terminology like "bulkdel"
> > and "cleanup" you pointed out, I've done that in 0002 patch while moving the
> > code to vacuumparallel.c. In that file, we can consistently use the terms
> > "bulkdel" and "cleanup" instead of "vacuum"
> > and "cleanup".
> Hi,
>
> Thanks for updating the patch.
> I noticed a few minor things.

Thank you for the comments!

>
> 0001
> 1)
>
 * Skip processing indexes that are unsafe for workers (these are
-* processed in do_serial_processing_for_unsafe_indexes() by leader)
+* processed in parallel_vacuum_process_unsafe_indexes() by leader)
>
> It might be clearer to mention that the indexes to be skipped are unsafe OR not
> worthwhile.

Agreed. Will add the comments.

>
> 2)
> +   /* Set index vacuum status and mark as parallel safe or not */
> +   for (int i = 0; i < pvc->nindexes; i++)
> +   {
> ...
> +   pindstats->parallel_workers_can_process =
> +   parallel_vacuum_index_is_parallel_safe(vacrel,
> +  vacrel->indrels[i],
> +  vacuum);
>
> For the comment above the loop, maybe it's better to mention that we are
> marking whether the worker can process the index (not only safe/unsafe).

Right. Will fix.

>
> 0002
> 3)
>
> +   /*
> +* Skip indexes that are unsuitable target for parallel index vacuum
> +*/
> +   if (parallel_vacuum_should_skip_index(indrel))
> +   continue;
> +
>
> It seems we can use will_parallel_vacuum[] here instead of invoking the 
> function
> again.

Oops, I missed updating it in 0002 patch. Will fix.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display

2021-12-07 Thread Michael Paquier
On Tue, Dec 07, 2021 at 06:28:24PM +0530, Bharath Rupireddy wrote:
> Upon thinking further, having at least the messages at LOG level [1]
> would be helpful to know what's happening with the system while in
> recovery. Although these messages at LOG level seem to be filling up
> the server logs, having a good log consumption and rotation mechanism
> (I'm sure every major postgres vendor would have one) would be
> sufficient to allay that concern.

+  ereport(LOG,
+  (errmsg("recovering WAL segment \"%s\" from source \"%s\"",
+   xlogfname, srcname)));
Isn't this approach an issue for translations?  It seems to me that
terms like "stream" or "archive" had better be translated properly,
meaning that this needs a fully-constructed sentence.
--
Michael


signature.asc
Description: PGP signature


Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 5:21 PM, "Bharath Rupireddy" 
 wrote:
> On Wed, Dec 8, 2021 at 2:50 AM Bossart, Nathan  wrote:
>> I noticed that some (but not all) of the surrounding messages say
>> "last known up at" the control file time.  I'm curious why you chose
>> not to use that phrasing in this case.
>
> If state is DB_IN_END_OF_RECOVERY_CHECKPOINT that means the db was
> interrupted while in end-of-recovery checkpoint, so I used the
> phrasing similar to DB_IN_CRASH_RECOVERY and DB_IN_ARCHIVE_RECOVERY
> cases. I would like to keep it as-is (in the v1 patch) unless anyone
> has other thoughts here?
> (errmsg("database system was interrupted while in recovery at %s",
> (errmsg("database system was interrupted while in recovery at log time %s",

I think that's alright.  The only other small suggestion I have would
be to say "during end-of-recovery checkpoint" instead of "while in
end-of-recovery checkpoint."

Another option we might want to consider is to just skip updating the
state entirely for end-of-recovery checkpoints.  The state would
instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION.  I
don't know if it's crucial to have a dedicated control file state for
end-of-recovery checkpoints.

Nathan



Re: parse_subscription_options - suggested improvements

2021-12-07 Thread Peter Smith
On Wed, Dec 8, 2021 at 2:51 PM Michael Paquier  wrote:
>
> On Tue, Dec 07, 2021 at 08:12:59AM +1100, Peter Smith wrote:
> > On Tue, Dec 7, 2021 at 6:07 AM Bossart, Nathan  wrote:
> >> Attached a v14 with the initializations added back.
> >
> > LGTM.
>
> All the code paths previously covered still are, so applied this one.
> Thanks!

Thanks for pushing!

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 5:21 PM, "Bharath Rupireddy" 
 wrote:
> On Wed, Dec 8, 2021 at 4:17 AM Bossart, Nathan  wrote:
>> I agree with Tom.  I would just s/server/backend/ (as per the
>> attached) and call it a day.
>
> Thanks. v5 patch looks good to me.

I've marked the commitfest entry as ready-for-committer.

Nathan



Re: parse_subscription_options - suggested improvements

2021-12-07 Thread Michael Paquier
On Tue, Dec 07, 2021 at 08:12:59AM +1100, Peter Smith wrote:
> On Tue, Dec 7, 2021 at 6:07 AM Bossart, Nathan  wrote:
>> Attached a v14 with the initializations added back.
> 
> LGTM.

All the code paths previously covered still are, so applied this one.
Thanks!
--
Michael


signature.asc
Description: PGP signature


Re: 回复:Re: Is it worth pushing conditions to sublink/subplan?

2021-12-07 Thread Tomas Vondra
Hi,

On 12/7/21 10:44, 曾文旌(义从) wrote:
> Hi Hackers
> 
> For my previous proposal, I developed a prototype and passed
> regression testing. It works similarly to subquery's qual pushdown.
> We know that sublinks are expanded at the beginning of each query
> level. At this stage, the query's conditions and equivalence classes
> are not yet processed. But after generate_base_implied_equalities the
> conditions are processed, which is why a qual can be pushed down to a
> subquery but not to a sublink.
> 
> My POC implementation chose to delay the sublink expansion in the
> SELECT clause (targetList) and where clause. Specifically, it is
> delayed after generate_base_implied_equalities. Thus, the equivalent
> conditions already established in the Up level query can be easily
> obtained in the sublink expansion process (make_subplan).
> 
> For example, if the upper-level query has a.id = 10 and the sublink
> query has a.id = b.id, then we get b.id = 10 and push it down to the
> sublink query. If b is a partitioned table and is partitioned by id,
> then a large number of unrelated subpartitions are pruned out. This
> saves a significant amount of planner and SQL execution time,
> especially when the partitioned table has a large number of
> subpartitions, which is what I want.
> 
> Currently, there are two SQL failures in the regression tests,
> because the expansion order of sublinks has changed; this does not
> affect the SQL execution results.
> 
> Look forward to your suggestions on this proposal.
> 

I took a quick look, and while I don't see / can't think of any problems
with delaying it until after generating implied equalities, there seems
to be a number of gaps.


1) Are there any regression tests exercising this modified behavior?
Maybe there are, but if the only changes are due to a change in the order
of targetlist entries, that doesn't seem like clear proof.

It'd be good to add a couple tests exercising both the positive and
negative case (i.e. when we can and can't pushdown a qual).


2) apparently, contrib/postgres_fdw does crash like this:

  #3  0x0077b412 in adjust_appendrel_attrs_mutator
  (node=0x13f7ea0, context=0x7fffc3351b30) at appendinfo.c:470
  470  Assert(!IsA(node, SubLink));
  (gdb) p node
  $1 = (Node *) 0x13f7ea0
  (gdb) p *node
  $2 = {type = T_SubLink}

  Backtrace attached.

3) various parts of the patch really need at least some comments, like:

  - try_push_outer_qual_to_sublink_query really needs some docs

  - new stuff at the end of initsplan.c


4) generate_base_implied_equalities

   shouldn't this

if (ec->ec_processed)
;

   really be?

if (ec->ec_processed)
continue;

5) I'm not sure why we need the new ec_processed flag.

6) So we now have lazy_process_sublink callback? Does that mean we
expand sublinks in two places - sometimes lazily, sometimes not?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Core was generated by `postgres: user contrib_regression [local] EXPLAIN '.
Program terminated with signal SIGABRT, Aborted.
#0  0x79a07cfd79e5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install 
glibc-2.32-10.fc33.x86_64 libgcc-10.3.1-1.fc33.x86_64
(gdb) bt
#0  0x79a07cfd79e5 in raise () from /lib64/libc.so.6
#1  0x79a07cfc08a4 in abort () from /lib64/libc.so.6
#2  0x0094f22a in ExceptionalCondition 
(conditionName=conditionName@entry=0xab44ca "!IsA(node, SubLink)", 
errorType=errorType@entry=0x9a2017 "FailedAssertion", 
fileName=fileName@entry=0xab55cf "appendinfo.c", 
lineNumber=lineNumber@entry=470) at assert.c:69
#3  0x0077b412 in adjust_appendrel_attrs_mutator (node=0x13f7ea0, 
context=0x7fffc3351b30) at appendinfo.c:470
#4  0x0071f8f9 in expression_tree_mutator (node=0x13f7f90, 
mutator=0x77ae20 , context=0x7fffc3351b30) at 
nodeFuncs.c:3240
#5  0x0077b025 in adjust_appendrel_attrs_mutator (node=0x13f7f90, 
context=0x7fffc3351b30) at appendinfo.c:390
#6  0x00720066 in expression_tree_mutator (node=0x13eca78, 
mutator=0x77ae20 , context=0x7fffc3351b30) at 
nodeFuncs.c:3109
#7  0x0077b512 in adjust_appendrel_attrs (root=root@entry=0x13be5b0, 
node=, nappinfos=nappinfos@entry=1, 
appinfos=appinfos@entry=0x7fffc3351bd0) at appendinfo.c:210
#8  0x0073b88c in set_append_rel_size (rte=, rti=4, 
rel=0xf19538, root=0x13be5b0) at allpaths.c:1056
#9  set_rel_size (root=0x13be5b0, rel=0xf19538, rti=4, rte=) at 
allpaths.c:386
#10 0x0073e250 in set_base_rel_sizes (root=) at 
allpaths.c:326
#11 make_one_rel (root=root@entry=0x13be5b0, joinlist=joinlist@entry=0x13edf10) 
at allpaths.c:188
#12 0x00763fde in query_planner (root=root@entry=0x13be5b0, 
qp_callback=qp_callback@entry=0x764eb0 , 
qp_extra=qp_extra@entry=0x7fffc3351d90, lps_callback=) at 
planmain.c:286
#13 0x007694fa in grouping_planner (root=, 
tuple_fraction=) at 

RE: parallel vacuum comments

2021-12-07 Thread houzj.f...@fujitsu.com
On Tuesday, December 7, 2021 1:42 PM Masahiko Sawada  
wrote:
> I've attached an updated patch. I've removed 0003 patch that added
> regression tests as per discussion. Regarding the terminology like "bulkdel"
> and "cleanup" you pointed out, I've done that in 0002 patch while moving the
> code to vacuumparallel.c. In that file, we can consistently use the terms
> "bulkdel" and "cleanup" instead of "vacuum"
> and "cleanup".
Hi,

Thanks for updating the patch.
I noticed few minor things.

0001
1)

 * Skip processing indexes that are unsafe for workers (these 
are
-* processed in do_serial_processing_for_unsafe_indexes() by 
leader)
+* processed in parallel_vacuum_process_unsafe_indexes() by 
leader)

It might be clearer to mention that the indexes to be skipped are either
unsafe or not worthwhile.

2)
+   /* Set index vacuum status and mark as parallel safe or not */
+   for (int i = 0; i < pvc->nindexes; i++)
+   {
...
+   pindstats->parallel_workers_can_process =
+       parallel_vacuum_index_is_parallel_safe(vacrel,
+                                              vacrel->indrels[i],
+                                              vacuum);

For the comment above the loop, it might be better to mention that we are
marking whether a worker can process the index (not just whether it is safe).

0002
3)

+   /*
+* Skip indexes that are unsuitable target for parallel index 
vacuum
+*/
+   if (parallel_vacuum_should_skip_index(indrel))
+   continue;
+

It seems we can use will_parallel_vacuum[] here instead of invoking the function
again.
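
Something like the following, assuming the loop index i and the
will_parallel_vacuum array are still in scope at that point (untested
sketch, not from the patch):

    /* reuse the flag computed earlier instead of re-checking */
    if (!will_parallel_vacuum[i])
        continue;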

Best regards,
Hou zj


Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Tomas Vondra



On 12/8/21 02:54, Bharath Rupireddy wrote:
> On Wed, Dec 8, 2021 at 4:07 AM Justin Pryzby  wrote:
>>> I'd bet squashing all of this into a single string (not really a flag)
>>> will just mean people will have to parse it, etc. Keeping individual
>>> boolean flags (or even separate string fields) would be better, I guess.
>>
>> Since the size of controldata is a concern, I suggest to add an int16 to
>> populate with flags, which pg_controldata can parse for display.
> 
> +1. I will come up with a patch soon.
> 
>> Note that this other patch intends to add the timestamp and LSN of most 
>> recent
>> recovery to controldata.
>> https://commitfest.postgresql.org/36/3183/
> 
> Thanks. I will go through it separately.
> 
>>> We already have some checkpoint info in pg_stat_bgwriter, but that's
>>> just aggregated data, not very convenient for info about the current
>>> checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
>>> various info about the current checkpoint?
>>
>> It could go into the pg_stat_checkpointer view, which would be the 
>> culmination
>> of another patch (cc Melanie).
>> https://commitfest.postgresql.org/36/3272/
> 

I don't think the pg_stat_checkpointer view would be a good match - that's
going to be an aggregated view of all past checkpoints, not a good
source of info about the currently running one.

> +1 to have pg_stat_progress_checkpoint view. We have
> CheckpointStatsData already. What we need is to make this structure
> shared and add a little more info to represent the progress, so that
> the other backends can access it. I think we can discuss this in a
> separate thread to give it a fresh try rather than proposing this as a
> part of another thread. I will spend some time on
> pg_stat_progress_checkpoint proposal and try to come up with a
> separate thread to discuss.
> 

+1 to discuss it as part of this patch

I'm not sure whether the view should look at CheckpointStatsData, or do
the same thing as the other pg_stat_progress_* views - send the data to
the stats collector, and read it from there.
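
FWIW, a rough sketch of the latter approach, reusing the machinery behind
the existing pg_stat_progress_* views (the PROGRESS_COMMAND_CHECKPOINT
command type and the parameter indexes below are made up for illustration,
they don't exist today):

    /* in CreateCheckPoint(), purely illustrative */
    pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
                                 PROGRESS_CHECKPOINT_PHASE_BUFFERS);
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_TOTAL,
                                 num_to_scan);
    /* ... periodically, while writing buffers ... */
    pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_DONE,
                                 num_written);
    pgstat_progress_end_command();

pgstat_progress_start_command/update_param/end_command already exist and
are what the other progress views use, so the view itself would mostly be
a pg_stat_get_progress_info() wrapper in system_views.sql.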


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 7:34 AM Tomas Vondra
 wrote:
> >> I agree it might be useful to provide information about the nature of
> >> the checkpoint, and perhaps even PID of the backend that triggered it
> >> (although that may be tricky, if the backend terminates).
> >
> > Thanks. I agree to have pg_stat_progress_checkpoint and yes PID of the
> > triggered backend can possibly go there (we can mention in the
> > documentation that the backend that triggered the checkpoint can get
> > terminated).
> >
>
> My concern is someone might run something that requires a checkpoint, so
> we start it and put the PID into the catalog. And then the person aborts
> the command and starts doing something else. But that does not abort the
> checkpoint, and the backend now runs something that doesn't require a
> checkpoint, which is rather confusing.
>
> >> I'm not sure about adding it to control data, though. That doesn't seem
> >> like a very good match for something that's mostly for monitoring.
> >
> > Having it in the control data file (along with the existing checkpoint
> > information) will be helpful to know what was the last checkpoint
> > information and we can use the existing pg_control_checkpoint function
> > or the tool to emit that info. I plan to add an int16 flag as
> > suggested by Justin in this thread and come up with a patch.
> >
>
> OK, although I'm not sure it's all that useful (if we have that in some
> sort of system view).

If the server is down, the control file will help. Since we already
have the other checkpoint info in the control file, it's much more
useful and sensible to have this extra piece of missing information
(checkpoint kind) there.

> >> We already have some checkpoint info in pg_stat_bgwriter, but that's
> >> just aggregated data, not very convenient for info about the current
> >> checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
> >> various info about the current checkpoint?
> >
> > +1 to have pg_stat_progress_checkpoint view to know what's going on
> > with the current checkpoint.
> >
>
> Do you plan to add it to this patch, or should it be a separate patch?

No, I will put some more thought into it and start a new thread.

Regards,
Bharath Rupireddy.




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Tomas Vondra



On 12/8/21 02:54, Bharath Rupireddy wrote:
> On Wed, Dec 8, 2021 at 3:48 AM Tomas Vondra
>  wrote:
>>
>> On 12/7/21 15:36, Bharath Rupireddy wrote:
>>> Hi,
>>>
>>> Currently one can know the kind of on-going/last checkpoint (shutdown,
>>> end-of-recovery, immediate, force etc.) only via server logs, and that
>>> too only when the log_checkpoints GUC is on. At times, the users/service layer
>>> components would want to know the kind of checkpoint (along with other
>>> checkpoint related info) to take some actions and it will be a bit
>>> difficult to search through the server logs. The checkpoint info can
>>> be obtained from the control file (either by pg_control_checkpoint()
>>> or by pg_controldata tool) whereas checkpoint kind isn't available
>>> there.
>>>
>>> How about we add an extra string field to the control file alongside
>>> the other checkpoint info it already has? This way, the existing
>>> pg_control_checkpoint() or pg_controldata tool can easily be enhanced
>>> to show the checkpoint kind as well. One concern is that we don't want
>>> to increase the size of pg_controldata by more than the typical block
>>> size (of 8K) to avoid any torn-writes. With this change, we might add
>>> at max the strings specified at [1]. Adding it to the control file has
>>> an advantage of preserving the last checkpoint kind which might be
>>> useful.
>>>
>>> Thoughts?
>>>
>>
>> I agree it might be useful to provide information about the nature of
>> the checkpoint, and perhaps even PID of the backend that triggered it
>> (although that may be tricky, if the backend terminates).
> 
> Thanks. I agree to have pg_stat_progress_checkpoint and yes PID of the
> triggered backend can possibly go there (we can mention in the
> documentation that the backend that triggered the checkpoint can get
> terminated).
> 

My concern is someone might run something that requires a checkpoint, so
we start it and put the PID into the catalog. And then the person aborts
the command and starts doing something else. But that does not abort the
checkpoint, and the backend now runs something that doesn't require a
checkpoint, which is rather confusing.

>> I'm not sure about adding it to control data, though. That doesn't seem
>> like a very good match for something that's mostly for monitoring.
> 
> Having it in the control data file (along with the existing checkpoint
> information) will be helpful to know what was the last checkpoint
> information and we can use the existing pg_control_checkpoint function
> or the tool to emit that info. I plan to add an int16 flag as
> suggested by Justin in this thread and come up with a patch.
> 

OK, although I'm not sure it's all that useful (if we have that in some
sort of system view).

>> We already have some checkpoint info in pg_stat_bgwriter, but that's
>> just aggregated data, not very convenient for info about the current
>> checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
>> various info about the current checkpoint?
> 
> +1 to have pg_stat_progress_checkpoint view to know what's going on
> with the current checkpoint.
> 

Do you plan to add it to this patch, or should it be a separate patch?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Transparent column encryption

2021-12-07 Thread Tomas Vondra



On 12/8/21 00:26, Jacob Champion wrote:
> On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote:
>> IMO it's impossible to solve this attack within TCE, because it requires 
>> ensuring consistency at the row level, but TCE obviously works at column 
>> level only.
> 
> I was under the impression that clients already had to be modified to
> figure out how to encrypt the data? If part of that process ends up
> including enforcement of encryption for a specific column set, then the
> addition of AEAD data could hypothetically be part of that hand-
> waviness.
> 

I think "transparency" here means the client just uses the regular
prepared-statement API without having to explicitly encrypt/decrypt any
data. The problem is we can't easily tie this to other columns in the
table, because the client may not even know what values are in those
columns.

Imagine you do this

  UPDATE t SET encrypted_column = $1 WHERE another_column = $2;

but you want to ensure the encrypted value belongs to a particular row
(which may or may not be identified by the another_column value). How
would the client do that? Should it fetch the value or what?

Similarly, what if the client just does

  SELECT encrypted_column FROM t;

How would it verify the values belong to the row, without having all the
data for the row (or just the required columns)?

> Unless "transparent" means that the client completely defers to the
> server on whether to encrypt or not, and silently goes along with it if
> the server tells it not to encrypt?

I think that's probably a valid concern - a "bad DBA" could alter the
table definition to not contain the "ENCRYPTED" bits, and then peek at
the plaintext values.

But it's not clear to me how exactly the AEAD would prevent this.
Wouldn't that also be specified on the server, somehow? In which case
the DBA could just tweak that too, no?

In other words, this issue seems mostly orthogonal to the AEAD, and the
right solution would be to allow the client to define which columns have
to be encrypted (in which case altering the server definition would not
be enough).

> That would only protect against a
> _completely_ passive DBA, like someone reading unencrypted backups,
> etc. And that still has a lot of value, certainly. But it seems like
> this prototype is very close to a system where the client can reliably
> secure data even if the server isn't trustworthy, if that's a use case
> you're interested in.
> 

Right. IMHO the "passive attacker" is a perfectly fine model for use
cases that would be fine with e.g. pgcrypto if there was no risk of
leaking plaintext values to logs, system catalogs, etc.

If we can improve it to provide (at least some) protection against
active attackers, that'd be a nice bonus.

>> I believe TCE can do AEAD at the column level, which protects against 
>> attacks that flipping bits, and similar attacks. It's just a matter of 
>> how the client encrypts the data.
> 
> Right, I think authenticated encryption ciphers (without AD) will be
> important to support in practice. I think users are going to want
> *some* protection against active attacks.
> 
>> Extending it to protect the whole row seems tricky, because the client 
>> may not even know the other columns, and it's not clear to me how it'd 
>> deal with things like updates of the other columns, hint bits, dropped 
>> columns, etc.
> 
> Covering the entire row automatically probably isn't super helpful in
> practice. As you mention later:
> 
>> It's probably possible to get something like this (row-level AEAD) by 
>> encrypting enriched data, i.e. not just the card number, but {user ID, 
>> card number} or something like that, and verify that in the webapp. The 
>> problem of course is that the "user ID" is just another column in the 
>> table, and there's nothing preventing the DBA from modifying that too.
> 
> Right. That's why the client has to be able to choose AD according to
> the application. In my previous example, the victim's email address can
> be copied by the DBA, but they wouldn't be able to authenticate as that
> user and couldn't convince the client to use the plaintext on their
> behalf.
> 

Well, yeah. But I'm not sure how to make that work easily, because the
client may not have the data :-(

I was thinking about using a composite data type combining the data with
the extra bits - that'd not be all that transparent as it'd require the
client to build this manually and then also cross-check it after loading
the data. So the user would be responsible for having all the data.

But doing that automatically/transparently seems hard, because how would
you deal e.g. with SELECT queries reading data through a view or CTE?

How would you declare this, either at the client or server?

Do any other databases have this capability? How do they do it?



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 4:07 AM Justin Pryzby  wrote:
> > I'd bet squashing all of this into a single string (not really a flag)
> > will just mean people will have to parse it, etc. Keeping individual
> > boolean flags (or even separate string fields) would be better, I guess.
>
> Since the size of controldata is a concern, I suggest to add an int16 to
> populate with flags, which pg_controldata can parse for display.

+1. I will come up with a patch soon.

> Note that this other patch intends to add the timestamp and LSN of most recent
> recovery to controldata.
> https://commitfest.postgresql.org/36/3183/

Thanks. I will go through it separately.

> > We already have some checkpoint info in pg_stat_bgwriter, but that's
> > just aggregated data, not very convenient for info about the current
> > checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
> > various info about the current checkpoint?
>
> It could go into the pg_stat_checkpointer view, which would be the culmination
> of another patch (cc Melanie).
> https://commitfest.postgresql.org/36/3272/

+1 to have pg_stat_progress_checkpoint view. We have
CheckpointStatsData already. What we need is to make this structure
shared and add a little more info to represent the progress, so that
the other backends can access it. I think we can discuss this in a
separate thread to give it a fresh try rather than proposing this as a
part of another thread. I will spend some time on
pg_stat_progress_checkpoint proposal and try to come up with a
separate thread to discuss.

Regards,
Bharath Rupireddy.




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 3:48 AM Tomas Vondra
 wrote:
>
> On 12/7/21 15:36, Bharath Rupireddy wrote:
> > Hi,
> >
> > Currently one can know the kind of on-going/last checkpoint (shutdown,
> > end-of-recovery, immediate, force etc.) only via server logs, and that
> > too only when the log_checkpoints GUC is on. At times, the users/service layer
> > components would want to know the kind of checkpoint (along with other
> > checkpoint related info) to take some actions and it will be a bit
> > difficult to search through the server logs. The checkpoint info can
> > be obtained from the control file (either by pg_control_checkpoint()
> > or by pg_controldata tool) whereas checkpoint kind isn't available
> > there.
> >
> > How about we add an extra string field to the control file alongside
> > the other checkpoint info it already has? This way, the existing
> > pg_control_checkpoint() or pg_controldata tool can easily be enhanced
> > to show the checkpoint kind as well. One concern is that we don't want
> > to increase the size of pg_controldata by more than the typical block
> > size (of 8K) to avoid any torn-writes. With this change, we might add
> > at max the strings specified at [1]. Adding it to the control file has
> > an advantage of preserving the last checkpoint kind which might be
> > useful.
> >
> > Thoughts?
> >
>
> I agree it might be useful to provide information about the nature of
> the checkpoint, and perhaps even PID of the backend that triggered it
> (although that may be tricky, if the backend terminates).

Thanks. I agree to have pg_stat_progress_checkpoint and yes PID of the
triggered backend can possibly go there (we can mention in the
documentation that the backend that triggered the checkpoint can get
terminated).

> I'm not sure about adding it to control data, though. That doesn't seem
> like a very good match for something that's mostly for monitoring.

Having it in the control data file (along with the existing checkpoint
information) will be helpful to know what was the last checkpoint
information and we can use the existing pg_control_checkpoint function
or the tool to emit that info. I plan to add an int16 flag as
suggested by Justin in this thread and come up with a patch.

> We already have some checkpoint info in pg_stat_bgwriter, but that's
> just aggregated data, not very convenient for info about the current
> checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
> various info about the current checkpoint?

+1 to have pg_stat_progress_checkpoint view to know what's going on
with the current checkpoint.

Regards,
Bharath Rupireddy.




Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 4:17 AM Bossart, Nathan  wrote:
>
> On 11/18/21, 8:27 PM, "Bharath Rupireddy" 
>  wrote:
> > Here's the v4 patch with the above changes, the output looks like [1].
> > Please review it further.
>
> I agree with Tom.  I would just s/server/backend/ (as per the
> attached) and call it a day.

Thanks. v5 patch looks good to me.

Regards,
Bharath Rupireddy.




Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 2:50 AM Bossart, Nathan  wrote:
>
> On 12/7/21, 12:10 AM, "Bharath Rupireddy" 
>  wrote:
> > Here's a patch that I've come up with. Please see if this looks okay
> > and let me know if we want to take it forward so that I can add a CF
> > entry.
>
> Overall, the patch seems reasonable to me.

Thanks for reviewing this.

> +   case DB_IN_END_OF_RECOVERY_CHECKPOINT:
> +   ereport(LOG,
> +   (errmsg("database system was interrupted while in end-of-recovery checkpoint at %s",
> +   str_time(ControlFile->time))));
> +   break;
>
> I noticed that some (but not all) of the surrounding messages say
> "last known up at" the control file time.  I'm curious why you chose
> not to use that phrasing in this case.

If the state is DB_IN_END_OF_RECOVERY_CHECKPOINT, that means the db was
interrupted during the end-of-recovery checkpoint, so I used phrasing
similar to the DB_IN_CRASH_RECOVERY and DB_IN_ARCHIVE_RECOVERY cases.
I would like to keep it as-is (in the v1 patch) unless anyone has other
thoughts here? For reference, the existing messages are:
(errmsg("database system was interrupted while in recovery at %s",
(errmsg("database system was interrupted while in recovery at log time %s",

I added a CF entry here - https://commitfest.postgresql.org/36/3442/

Regards,
Bharath Rupireddy.




Re: logical decoding and replication of sequences

2021-12-07 Thread Tomas Vondra
On 12/7/21 15:38, Peter Eisentraut wrote:
> I have checked the 0001 and 0003 patches.  (I think we have dismissed
> the approach in 0002 for now.  And let's talk about 0004 later.)
> 

Right, I think that's correct.

> I have attached a few fixup patches, described further below.
> 
> # 0001
> 
> The argument "create" for fill_seq_with_data() is always true (and
> patch 0002 removes it).  So I'm not sure if it's needed.  If it is, it
> should be documented somewhere.
> 

Good point. I think it could be removed, but IIRC it'll be needed when
adding sequence decoding to the built-in replication, and that patch is
meant to be just an implementation of the API, without changing WAL.

OTOH I don't see it in the last version of that patch (in ResetSequence2
call) so maybe it's not really needed. I've kept it for now, but I'll
investigate.

> About the comment you added in nextval_internal(): It's a good
> explanation, so I would leave it in.  I also agree with the
> conclusion of the comment that the current solution is reasonable.  We
> probably don't need the same comment again in fill_seq_with_data() and
> again in do_setval().  Note that both of the latter functions already
> point to nextval_interval() for other comments, so the same can be
> relied upon here as well.
> 

True. I moved it a bit in nextval_internal() and removed the other
copies. The existing references should be enough.

> The order of the new fields sequence_cb and stream_sequence_cb is a
> bit inconsistent compared to truncate_cb and stream_truncate_cb.
> Maybe take another look to make the order more uniform.
> 

You mean in OutputPluginCallbacks? I'd actually argue it's the truncate
callbacks that are inconsistent - in the regular section truncate_cb is
right before commit_cb, while in the streaming section it's at the end.

Or what order do you think would be better? I'm fine with changing it.

> Some documentation needs to be added to the Logical Decoding chapter.
> I have attached a patch that I think catches all the places that need
> to be updated, with some details left for you to fill in. ;-) The
> ordering of the some of the documentation chunks reflects the order in
> which the callbacks appear in the header files, which might not be
> optimal; see above.
> 

Thanks. I added a bit about the callbacks being optional and what the
parameters mean (only for sequence_cb, as the stream_ callbacks
generally don't copy that bit).

> In reorderbuffer.c, you left a comment about how to access a sequence
> tuple.  There is an easier way, using Form_pg_sequence_data, which is
> how sequence.c also does it.  See attached patch.
> 

Yeah, that looks much nicer.

> # 0003
> 
> The tests added in 0003 don't pass for me.  It appears that you might
> have forgotten to update the expected files after you added some tests
> or something.  The cfbot shows the same.  See attached patch for a
> correction, but do check what your intent was.
> 

Yeah. I suspect I removed the expected results during the experimental
rework. I hadn't noticed that because some of the tests for the
built-in replication are expected to fail.


Attached is an updated version of the first two parts (infrastructure
and test_decoding), with all your fixes merged.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 999514b9f802db9b888543532355e9c36646da30 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 24 Sep 2021 00:41:33 +0200
Subject: [PATCH 1/2] Logical decoding of sequences

This adds the infrastructure for logical decoding of sequences. Support
for built-in logical replication and test_decoding is added separately.

When decoding sequences, we differentiate between sequences created in
a (running) transaction and sequences created in other (already
committed) transactions. Changes for sequences created in the same
top-level transaction are treated as "transactional" i.e. just like any
other change from that transaction (and discarded in case of a
rollback). Changes for sequences created earlier are treated as not
transactional - are processed immediately, as if performed outside any
transaction (and thus not rolled back).

This mixed behavior is necessary - sequences are non-transactional (e.g.
ROLLBACK does not undo the sequence increments). But for new sequences,
we need to handle them in a transactional way, because if we ever get
some DDL support, the sequence won't exist until the transaction gets
applied. So we need to ensure the increments don't happen until the
sequence gets created.

To differentiate which sequences are "old" and which were created in a
still-running transaction, we track sequences created in running
transactions in a hash table. Each sequence is identified by its
relfilenode, and at transaction commit we discard all sequences created
in that particular transaction. For each sequence we track the XID of
the (sub)transaction that created it, and we clean up the sequences for
each 

Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-07 Thread Imseih (AWS), Sami
I also want to +1 this effort. Exposing subtransaction usage is very
useful.

It would also be extremely beneficial to add both subtransaction usage and 
overflow counters to pg_stat_database. 

Monitoring tools that capture deltas on pg_stat_database will be able to 
generate historical analysis and usage trends of subtransactions.

On 12/7/21, 5:34 PM, "Bossart, Nathan"  wrote:

On 12/6/21, 8:19 PM, "Dilip Kumar"  wrote:
> If the subtransaction cache is overflowed in some of the transactions
> then it will affect all the concurrent queries as they need to access
> the SLRU for checking the visibility of each tuple.  But currently
> there is no way to identify whether in any backend subtransaction is
> overflowed or what is the current active subtransaction count.
> Attached patch adds subtransaction count and subtransaction overflow
> status in pg_stat_activity.  I have implemented this because of the
> recent complain about the same[1]

I'd like to give a general +1 to this effort.  Thanks for doing this!
I've actually created a function to provide this information in the
past, so I will help review.

Nathan




Re: Transparent column encryption

2021-12-07 Thread Jacob Champion
On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote:
> IMO it's impossible to solve this attack within TCE, because it requires 
> ensuring consistency at the row level, but TCE obviously works at column 
> level only.

I was under the impression that clients already had to be modified to
figure out how to encrypt the data? If part of that process ends up
including enforcement of encryption for a specific column set, then the
addition of AEAD data could hypothetically be part of that hand-
waviness.

Unless "transparent" means that the client completely defers to the
server on whether to encrypt or not, and silently goes along with it if
the server tells it not to encrypt? That would only protect against a
_completely_ passive DBA, like someone reading unencrypted backups,
etc. And that still has a lot of value, certainly. But it seems like
this prototype is very close to a system where the client can reliably
secure data even if the server isn't trustworthy, if that's a use case
you're interested in.

> I believe TCE can do AEAD at the column level, which protects against
> attacks like flipping bits, and similar attacks. It's just a matter of
> how the client encrypts the data.

Right, I think authenticated encryption ciphers (without AD) will be
important to support in practice. I think users are going to want
*some* protection against active attacks.

> Extending it to protect the whole row seems tricky, because the client 
> may not even know the other columns, and it's not clear to me how it'd 
> deal with things like updates of the other columns, hint bits, dropped 
> columns, etc.

Covering the entire row automatically probably isn't super helpful in
practice. As you mention later:

> It's probably possible to get something like this (row-level AEAD) by 
> encrypting enriched data, i.e. not just the card number, but {user ID, 
> card number} or something like that, and verify that in the webapp. The 
> problem of course is that the "user ID" is just another column in the 
> table, and there's nothing preventing the DBA from modifying that too.

Right. That's why the client has to be able to choose AD according to
the application. In my previous example, the victim's email address can
be copied by the DBA, but they wouldn't be able to authenticate as that
user and couldn't convince the client to use the plaintext on their
behalf.
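
To make that concrete, here's a minimal standalone sketch (not from the
patch, error handling omitted) of binding the ciphertext to the user's
email address as associated data with OpenSSL's AES-256-GCM; decrypting
with different associated data fails the tag check:

    #include <openssl/evp.h>

    /* The AD (e.g. the email) is authenticated but not encrypted. */
    static int
    encrypt_with_ad(const unsigned char *key,
                    const unsigned char *iv, int ivlen,
                    const unsigned char *ad, int adlen,
                    const unsigned char *plain, int plainlen,
                    unsigned char *cipher, unsigned char tag[16])
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int         len,
                    total;

        EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, NULL, NULL);
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_IVLEN, ivlen, NULL);
        EVP_EncryptInit_ex(ctx, NULL, NULL, key, iv);

        /* feed the associated data; output buffer is NULL */
        EVP_EncryptUpdate(ctx, NULL, &len, ad, adlen);

        EVP_EncryptUpdate(ctx, cipher, &len, plain, plainlen);
        total = len;
        EVP_EncryptFinal_ex(ctx, cipher + len, &len);
        total += len;

        /* 16-byte tag covers both the AD and the ciphertext */
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
        EVP_CIPHER_CTX_free(ctx);
        return total;
    }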

--Jacob


Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-07 Thread Peter Geoghegan
On Tue, Dec 7, 2021 at 1:59 PM Robert Haas  wrote:
> If we're only trying to decide whether or not to vacuum a table, we
> don't need units: the output is a Boolean.

I was imagining a world in which we preserve the
autovacuum_vacuum_scale_factor design, but interpret it creatively
(but never too creatively) -- an incremental approach seems best to
me. We can even sanity check our abstract bloat unit calculation, in
case the page-level sampling aggregates into a totally wild number of
dead tuples (based in part on the current number of not-all-visible
heap pages) -- so the abstract units are always anchored to the old
idea of dead tuples. Maybe this isn't the best approach, but at least
it addresses compatibility.

*Any* approach based on sampling relatively few random blocks (to look
for signs of bloat) is inherently prone to hugely underestimating the
extent of bloat (which is what we see in TPC-C). I am primarily
concerned about compensating for the inherent limitations that go with
that. To me it seems inappropriate to make statistical inferences
about dead tuples based on a random snapshot of random blocks (usually
only a tiny minority). It is not only possible for the picture to
change utterly -- it is routine, expected, and the whole entire point.

The entire intellectual justification for statistical sampling (that
mostly works for optimizer stats) just doesn't carry over to
autovacuum stats, for many reasons. At the same time, I don't have any
fundamentally better starting point. That's how I arrived at the idea
of probabilistic modeling based on several recent snapshots from
ANALYZE. The statistics are often rubbish, whether or not we like it,
and regardless of how we decide to count things on each page. And so
it's entirely reasonable to not limit the algorithm to concerns about
the state of things -- the actual exposure of the system to harm (from
overlooking harmful bloat) is also relevant.

> If we're trying to decide
> on an order in which to vacuum tables, then we need units. But such
> units can't be anything related to dead tuples, because vacuum can be
> needed based on XID age, or MXID age, or dead tuples. The units would
> have to be something like abstract vacuum-urgency units (if higher is
> more urgent) or abstract remaining-headroom-before-catastrophe units
> (if lower is more urgent).

I like that idea. But I wonder if they should be totally unrelated. If
we're close to the "emergency" XID threshold, and also close to the
"bloat units" threshold, then it seems reasonable to put our finger on
the scales, and do an autovacuum before either threshold is crossed.
I'm not sure how that should work, but I find the idea of interpreting
the "bloat units" creatively/probabilistically appealing.

We're not actually making things up by erring in the direction of
launching an autovacuum worker, because we don't actually know the
number of dead tuples (or whatever) anyway -- we're just recognizing
the very real role of chance and noise. That is, if the "bloat units"
threshold might well not have been crossed due to random chance
(noise, the phase of the moon), why should we defer to random chance?
If we have better information to go on, like the thing with the XID
threshold, why not prefer that? Similarly, if we see that the system
as a whole is not very busy right now, why not consider that, just a
little, if the only downside is that we'll ignore a demonstrably
noise-level signal (from the stats)?

That's the high level intuition behind making "bloat units" a
probability density function, and not just a simple expected value.
Teaching the autovacuum.c scheduler to distinguish signal from noise
could be very valuable, if it enables opportunistic batching of work,
or off-hours work. We don't have to respect noise. The devil is in the
details, of course.

-- 
Peter Geoghegan




Re: RecoveryInProgress() has critical side effects

2021-12-07 Thread Robert Haas
On Sat, Dec 4, 2021 at 7:44 PM Michael Paquier  wrote:
> My main worry here is that this changes slightly the definition of
> doPageWrites across stable branches at the end of recovery as there
> could be records generated there.  Note that GetFullPageWriteInfo()
> gets called in XLogInsert(), while Insert->fullPageWrites gets updated
> before CleanupAfterArchiveRecovery().  And it may influence
> the value of doPageWrites in the startup process.

But ... so what? All the code that uses it retries if the value that
was tentatively used turns out to be wrong.
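
For reference, the loop in XLogInsert() is roughly (paraphrasing from
memory, details may differ):

    do
    {
        XLogRecPtr  RedoRecPtr;
        bool        doPageWrites;
        XLogRecPtr  fpw_lsn;
        XLogRecData *rdt;
        int         num_fpi = 0;

        /* snapshot the current full-page-writes info */
        GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);

        rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
                                 &fpw_lsn, &num_fpi);

        /* returns InvalidXLogRecPtr if the info turned out stale */
        EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpi);
    } while (EndPos == InvalidXLogRecPtr);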

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Justin Pryzby
On Tue, Dec 07, 2021 at 11:18:37PM +0100, Tomas Vondra wrote:
> On 12/7/21 15:36, Bharath Rupireddy wrote:
> > Currently one can know the kind of on-going/last checkpoint (shutdown,
> > end-of-recovery, immediate, force etc.) only via server logs, and that
> > too only when the log_checkpoints GUC is on.

> > The checkpoint info can be obtained from the control file (either by
> > pg_control_checkpoint() or by pg_controldata tool) whereas checkpoint kind
> > isn't available there.

> > How about we add an extra string field to the control file alongside
> > the other checkpoint info it already has? This way, the existing
> > pg_control_checkpoint() or pg_controldata tool can easily be enhanced
> > to show the checkpoint kind as well. One concern is that we don't want
> > to increase the size of pg_controldata by more than the typical block
> > size (of 8K) to avoid any torn-writes. With this change, we might add
> > at max the strings specified at [1]. Adding it to the control file has
> > an advantage of preserving the last checkpoint kind which might be
> > useful.

> > [1] for checkpoint: "checkpoint shutdown end-of-recovery immediate
> > force wait wal time flush-all"
> > for restartpoint: "restartpoint shutdown end-of-recovery immediate
> > force wait wal time flush-all"
> 
> I'd bet squashing all of this into a single string (not really a flag)
> will just mean people will have to parse it, etc. Keeping individual
> boolean flags (or even separate string fields) would be better, I guess.

Since the size of controldata is a concern, I suggest to add an int16 to
populate with flags, which pg_controldata can parse for display.
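
For illustration only (the field name is made up), the existing
CHECKPOINT_* request flags from xlog.h could be stored more or less
verbatim and decoded on display:

    /* in ControlFileData, hypothetical field */
    uint16      lastCheckPointKind; /* CHECKPOINT_IS_SHUTDOWN,
                                     * CHECKPOINT_END_OF_RECOVERY,
                                     * CHECKPOINT_IMMEDIATE, ... */

    /* in pg_controldata, sketch */
    if (ControlFile->lastCheckPointKind & CHECKPOINT_IS_SHUTDOWN)
        printf(" shutdown");
    if (ControlFile->lastCheckPointKind & CHECKPOINT_END_OF_RECOVERY)
        printf(" end-of-recovery");
    if (ControlFile->lastCheckPointKind & CHECKPOINT_IMMEDIATE)
        printf(" immediate");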

Note that this other patch intends to add the timestamp and LSN of most recent
recovery to controldata.
https://commitfest.postgresql.org/36/3183/

> We already have some checkpoint info in pg_stat_bgwriter, but that's
> just aggregated data, not very convenient for info about the current
> checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
> various info about the current checkpoint?

It could go into the pg_stat_checkpointer view, which would be the culmination
of another patch (cc Melanie).
https://commitfest.postgresql.org/36/3272/

-- 
Justin




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Tomas Vondra
On 12/7/21 15:36, Bharath Rupireddy wrote:
> Hi,
> 
> Currently one can know the kind of on-going/last checkpoint (shutdown,
> end-of-recovery, immediate, force etc.) only via server logs, and that
> too only when the log_checkpoints GUC is on. At times, the users/service layer
> components would want to know the kind of checkpoint (along with other
> checkpoint related info) to take some actions and it will be a bit
> difficult to search through the server logs. The checkpoint info can
> be obtained from the control file (either by pg_control_checkpoint()
> or by pg_controldata tool) whereas checkpoint kind isn't available
> there.
> 
> How about we add an extra string field to the control file alongside
> the other checkpoint info it already has? This way, the existing
> pg_control_checkpoint() or pg_controldata tool can easily be enhanced
> to show the checkpoint kind as well. One concern is that we don't want
> to increase the size of pg_controldata by more than the typical block
> size (of 8K) to avoid any torn-writes. With this change, we might add
> at max the strings specified at [1]. Adding it to the control file has
> an advantage of preserving the last checkpoint kind which might be
> useful.
> 
> Thoughts?
> 

I agree it might be useful to provide information about the nature of
the checkpoint, and perhaps even PID of the backend that triggered it
(although that may be tricky, if the backend terminates).

I'm not sure about adding it to control data, though. That doesn't seem
like a very good match for something that's mostly for monitoring.

We already have some checkpoint info in pg_stat_bgwriter, but that's
just aggregated data, not very convenient for info about the current
checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
various info about the current checkpoint?

> [1] for checkpoint: "checkpoint shutdown end-of-recovery immediate
> force wait wal time flush-all"
> for restartpoint: "restartpoint shutdown end-of-recovery immediate
> force wait wal time flush-all"
> 

I'd bet squashing all of this into a single string (not really a flag)
will just mean people will have to parse it, etc. Keeping individual
boolean flags (or even separate string fields) would be better, I guess.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-07 Thread Robert Haas
On Tue, Dec 7, 2021 at 3:44 PM Peter Geoghegan  wrote:
> Fair enough, but even then we still ultimately have to generate a
> final number that represents how close we are to a configurable "do an
> autovacuum" threshold (such as an autovacuum_vacuum_scale_factor-based
> threshold) -- the autovacuum.c side of this (the consumer side)
> fundamentally needs the model to reduce everything to a one
> dimensional number (even though the reality is that there isn't just
> one dimension). This single number (abstract bloat units, abstract
> dead tuples, whatever) is a function of things like the count of dead
> HOT chains, perhaps the concentration of dead tuples on heap pages,
> whatever -- but it's not the same thing as any one of those things we
> count.
>
> I think that this final number needs to be denominated in abstract
> units -- we need to call those abstract units *something*. I don't
> care what that name ends up being, as long as it reflects reality.

If we're only trying to decide whether or not to vacuum a table, we
don't need units: the output is a Boolean. If we're trying to decide
on an order in which to vacuum tables, then we need units. But such
units can't be anything related to dead tuples, because vacuum can be
needed based on XID age, or MXID age, or dead tuples. The units would
have to be something like abstract vacuum-urgency units (if higher is
more urgent) or abstract remaining-headroom-before-catastrophe units
(if lower is more urgent).

Ignoring wraparound considerations for a moment, I think that we might
want to consider having multiple thresholds and vacuuming the table if
any one of them is met. For example, suppose a table qualifies for
vacuum when %-of-not-all-visible-pages > some-threshold, or
alternatively when %-of-index-tuples-thought-to-be-dead >
some-other-threshold.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Transparent column encryption

2021-12-07 Thread Tomas Vondra




On 12/7/21 19:02, Jacob Champion wrote:

On Tue, 2021-12-07 at 16:39 +0100, Peter Eisentraut wrote:

On 06.12.21 21:44, Jacob Champion wrote:

I think reusing a zero IV will potentially leak more information than
just equality, depending on the cipher in use. You may be interested in
synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
like they would match this use case exactly. (But I'm not a
cryptographer.)


I'm aware of this and plan to make use of SIV.  The current
implementation is just an example.


Sounds good.


Have you given any thought to AEAD? As a client I'd like to be able to
tie an encrypted value to other column (or external) data. For example,
AEAD could be used to prevent a DBA from copying the (encrypted) value
of my credit card column into their account's row to use it.


I don't know how that is supposed to work.  When the value is encrypted
for insertion, the client may know things like table name or column
name, so it can tie it to those.  But it doesn't know what row it will
go in, so you can't prevent the value from being copied into another
row.  You would need some permanent logical row ID for this, I think.


Sorry, my description was confusing. There's nothing preventing the DBA
from copying the value inside the database, but AEAD can make it so
that the copied value isn't useful to the DBA.

Sample case. Say I have a webapp backed by Postgres, which stores
encrypted credit card numbers. Users authenticate to the webapp which
then uses the client (which has the keys) to talk to the database.
Additionally, I assume that:

- the DBA can't access the client directly (because if they can, then
they can unencrypt the victim's info using the client's keys), and

- the DBA can't authenticate as the user/victim (because if they can,
they can just log in themselves and have the data). The webapp might
for example use federated authn with a separate provider, using an
email address as an identifier.

Now, if the client encrypts a user's credit card number using their
email address as associated data, then it doesn't matter if the DBA
copies that user's encrypted card over to their own account. The DBA
can't log in as the victim, so the client will fail to authenticate the
value because its associated data won't match.


This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
on the direction.


What kinds of attacks are you hoping to prevent (and not prevent)?


The point is to prevent admins from getting at plaintext data.  The
scenario you show is an interesting one but I think it's not meant to be
addressed by this.  If admins can alter the database to their advantage,
they could perhaps increase their account balance, create discount
codes, etc. also.


Sure, but increasing account balances and discount codes don't lead to
getting at plaintext data, right? Whereas stealing someone else's
encrypted value seems like it would be covered under your threat model,
since it lets you trick a real-world client into decrypting it for you.

Other avenues of attack might depend on how you choose to add HMAC to
the current choice of AES-CBC. My understanding of AE ciphers (with or
without associated data) is that you don't have to design that
yourself, which is nice.



IMO it's impossible to solve this attack within TCE, because it requires 
ensuring consistency at the row level, but TCE obviously works at column 
level only.


I believe TCE can do AEAD at the column level, which protects against
attacks like flipping bits, and similar attacks. It's just a matter of
how the client encrypts the data.


Extending it to protect the whole row seems tricky, because the client 
may not even know the other columns, and it's not clear to me how it'd 
deal with things like updates of the other columns, hint bits, dropped 
columns, etc.


It's probably possible to get something like this (row-level AEAD) by 
encrypting enriched data, i.e. not just the card number, but {user ID, 
card number} or something like that, and verify that in the webapp. The 
problem of course is that the "user ID" is just another column in the 
table, and there's nothing preventing the DBA from modifying that too.


So I think it's pointless to try extending this to row-level AEAD.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 12:10 AM, "Bharath Rupireddy" 
 wrote:
> Here's a patch that I've come up with. Please see if this looks okay
> and let me know if we want to take it forward so that I can add a CF
> entry.

Overall, the patch seems reasonable to me.

+   case DB_IN_END_OF_RECOVERY_CHECKPOINT:
+   ereport(LOG,
+   (errmsg("database system was interrupted while in end-of-recovery checkpoint at %s",
+   str_time(ControlFile->time))));
+   break;

I noticed that some (but not all) of the surrounding messages say
"last known up at" the control file time.  I'm curious why you chose
not to use that phrasing in this case.

Nathan



Re: Pre-allocating WAL files

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 9:35 AM, "Bossart, Nathan"  wrote:
> On 12/7/21, 12:29 AM, "Bharath Rupireddy" 
>  wrote:
>> Why can't the walwriter pre-allocate some of the WAL segments instead
>> of a new background process? Of course, it might delay the main
>> functionality of the walwriter i.e. flush and sync the WAL files, but
>> having checkpointer do the pre-allocation makes it do another extra
>> task. Here the amount of walwriter work vs checkpointer work, I'm not
>> sure which one does more work compared to the other.
>
> The argument against adding it to the WAL writer is that we want it to
> run with low latency to flush asynchronous commits.  If we added WAL
> pre-allocation to the WAL writer, there could periodically be large
> delays.

To your point on trying to avoid giving the checkpointer extra tasks
(basically what we are talking about on the other thread [0]), WAL
pre-allocation might not be of much concern because it will generally
be a small, fixed (and configurable) amount of work, and it can be
performed concurrently with the checkpoint.  Plus, WAL pre-allocation
should ordinarily be phased out as WAL segments become eligible for
recycling.  IMO it's not comparable to tasks like
CheckPointSnapBuild() that can delay checkpointing for a long time.

Nathan

[0] 
https://www.postgresql.org/message-id/flat/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com



Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-07 Thread Peter Geoghegan
On Tue, Dec 7, 2021 at 12:27 PM Robert Haas  wrote:
> Well... I mean, I think we're almost saying the same thing, then, but
> I think you're saying it more confusingly. I have no objection to
> counting the number of dead HOT chains rather than the number of dead
> tuples, because that's what affects the index contents, but there's no
> need to characterize that as "not the literal truth."

Works for me!

> Sure, but we don't *need* to be less accurate, and I don't think we
> even *benefit* from being less accurate. If we do something like count
> dead HOT chains instead of dead tuples, let's not call that a
> less-accurate count of dead tuples. Let's call it an accurate count of
> dead HOT chains.

Fair enough, but even then we still ultimately have to generate a
final number that represents how close we are to a configurable "do an
autovacuum" threshold (such as an autovacuum_vacuum_scale_factor-based
threshold) -- the autovacuum.c side of this (the consumer side)
fundamentally needs the model to reduce everything to a one
dimensional number (even though the reality is that there isn't just
one dimension). This single number (abstract bloat units, abstract
dead tuples, whatever) is a function of things like the count of dead
HOT chains, perhaps the concentration of dead tuples on heap pages,
whatever -- but it's not the same thing as any one of those things we
count.

I think that this final number needs to be denominated in abstract
units -- we need to call those abstract units *something*. I don't
care what that name ends up being, as long as it reflects reality.

-- 
Peter Geoghegan




Re: enable certain TAP tests for MSVC builds

2021-12-07 Thread Andrew Dunstan


On 12/5/21 18:32, Noah Misch wrote:
> On Sun, Dec 05, 2021 at 06:00:08PM -0500, Andrew Dunstan wrote:
>> On 12/5/21 14:47, Noah Misch wrote:
>>> On Sun, Dec 05, 2021 at 11:57:31AM -0500, Andrew Dunstan wrote:
 Certain TAP tests rely on settings that the Make files provide for them.
 However vcregress.pl doesn't provide those settings. This patch remedies
 that, and I propose to apply it shortly (when we have a fix for the SSL
 tests that I will write about separately) and backpatch it appropriately.
 --- a/src/tools/msvc/vcregress.pl
 +++ b/src/tools/msvc/vcregress.pl
 @@ -59,6 +59,12 @@ copy("$Config/autoinc/autoinc.dll",   
 "src/test/regress");
  copy("$Config/regress/regress.dll",   "src/test/regress");
  copy("$Config/dummy_seclabel/dummy_seclabel.dll", "src/test/regress");
  
 +# Settings used by TAP tests
 +$ENV{with_ssl} = $config->{openssl} ? 'openssl' : 'no';
 +$ENV{with_ldap} = $config->{ldap} ? 'yes' : 'no';
 +$ENV{with_icu} = $config->{icu} ? 'yes' : 'no';
 +$ENV{with_gssapi} = $config->{gss} ? 'yes' : 'no';
>>> That's appropriate.  There are more variables to cover:
>>>
>>> $ git grep -n ^export ':(glob)**/Makefile'
>>> src/bin/pg_basebackup/Makefile:22:export LZ4
>>> src/bin/pg_basebackup/Makefile:23:export TAR
>>> src/bin/pg_basebackup/Makefile:27:export GZIP_PROGRAM=$(GZIP)
>>> src/bin/psql/Makefile:20:export with_readline
>>> src/test/kerberos/Makefile:16:export with_gssapi with_krb_srvnam
>>> src/test/ldap/Makefile:16:export with_ldap
>>> src/test/modules/ssl_passphrase_callback/Makefile:3:export with_ssl
>>> src/test/recovery/Makefile:20:export REGRESS_SHLIB
>>> src/test/ssl/Makefile:18:export with_ssl
>>> src/test/subscription/Makefile:18:export with_icu
>> LZ4/TAR/GZIP_PROGRAM: It's not clear what these should be set to. The TAP
>> tests skip tests that use them if they are not set.
> Could add config.pl entries for those.  Preventing those skips on Windows may
> or may not be worth making config.pl readers think about them.
>
>> with_readline: we don't build with readline on Windows, period. I guess
>> we could just set it to "no".
>> with_krb_srvnam: the default is "postgres", we could just set it to that
>> I guess.
> Works for me.



All done.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-07 Thread Robert Haas
On Tue, Dec 7, 2021 at 2:13 PM Peter Geoghegan  wrote:
> For example, why should we count dead heap-only tuples from earlier in
> a HOT chain, even when we see no evidence that opportunistic HOT
> pruning can't keep up on that page? Since we actually care about the
> direction of things, not just the present state of things, we'd be
> justified in completely ignoring those dead tuples. Similarly, it
> might well make sense to give more weight to concentrations of LP_DEAD
> items on a page -- that is a signal that things are not going well *at
> the level of the page*. Not so much when you have a few LP_DEAD stubs,
> but certainly when you have dozens of them on one page, or even
> hundreds. And so ISTM that the conditions of the page should influence
> how we interpret/count that page's dead tuples, in both directions
> (interpret the page as having more dead tuples, or fewer).

Well... I mean, I think we're almost saying the same thing, then, but
I think you're saying it more confusingly. I have no objection to
counting the number of dead HOT chains rather than the number of dead
tuples, because that's what affects the index contents, but there's no
need to characterize that as "not the literal truth." There's nothing
fuzzy or untrue about it if we simply say that's what we're doing.

> Right. And as I keep saying, the truly important thing is to not
> *completely* ignore any relevant dimension of cost. I just don't want
> to ever be wildly wrong -- not even once. We can tolerate being
> somewhat less accurate all the time (not that we necessarily have to
> make a trade-off), but we cannot tolerate pathological behavior. Of
> course I include new/theoretical pathological behaviors here (not just
> the ones we know about today).

Sure, but we don't *need* to be less accurate, and I don't think we
even *benefit* from being less accurate. If we do something like count
dead HOT chains instead of dead tuples, let's not call that a
less-accurate count of dead tuples. Let's call it an accurate count of
dead HOT chains.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Dubious usage of TYPCATEGORY_STRING

2021-12-07 Thread Tom Lane
I wrote:
> Peter Eisentraut  writes:
>> Could we add explicit casts (like polcmd::text) here?  Or would it break 
>> too much?

> I assumed it'd break too much to consider doing that.  But I suppose
> that since a typcategory change would be initdb-forcing anyway, maybe
> it's not out of the question.  I'll investigate and see exactly how
> many places would need an explicit cast.

Um, I definitely gave up too easily there.  The one usage in \dp
seems to be the *only* thing that breaks in describe.c, and pg_dump
doesn't need any changes so far as check-world reveals.  So let's
just move "char" to another category, as attached.

regards, tom lane

diff --git a/contrib/citext/expected/citext.out b/contrib/citext/expected/citext.out
index ec99aaed5d..3bac0534fb 100644
--- a/contrib/citext/expected/citext.out
+++ b/contrib/citext/expected/citext.out
@@ -1089,7 +1089,12 @@ INSERT INTO caster (char)  VALUES ('f'::citext);
 INSERT INTO caster (citext)VALUES ('f'::char);
 INSERT INTO caster (chr)   VALUES ('f'::text);
 INSERT INTO caster (text)  VALUES ('f'::"char");
-INSERT INTO caster (chr)   VALUES ('f'::citext);
+INSERT INTO caster (chr)   VALUES ('f'::citext);  -- requires cast
+ERROR:  column "chr" is of type "char" but expression is of type citext
+LINE 1: INSERT INTO caster (chr)   VALUES ('f'::citext);
+   ^
+HINT:  You will need to rewrite or cast the expression.
+INSERT INTO caster (chr)   VALUES ('f'::citext::text);
 INSERT INTO caster (citext)VALUES ('f'::"char");
 INSERT INTO caster (name)  VALUES ('foo'::text);
 INSERT INTO caster (text)  VALUES ('foo'::name);
diff --git a/contrib/citext/expected/citext_1.out b/contrib/citext/expected/citext_1.out
index 75fd08b7cc..57fc863f7a 100644
--- a/contrib/citext/expected/citext_1.out
+++ b/contrib/citext/expected/citext_1.out
@@ -1089,7 +1089,12 @@ INSERT INTO caster (char)  VALUES ('f'::citext);
 INSERT INTO caster (citext)VALUES ('f'::char);
 INSERT INTO caster (chr)   VALUES ('f'::text);
 INSERT INTO caster (text)  VALUES ('f'::"char");
-INSERT INTO caster (chr)   VALUES ('f'::citext);
+INSERT INTO caster (chr)   VALUES ('f'::citext);  -- requires cast
+ERROR:  column "chr" is of type "char" but expression is of type citext
+LINE 1: INSERT INTO caster (chr)   VALUES ('f'::citext);
+   ^
+HINT:  You will need to rewrite or cast the expression.
+INSERT INTO caster (chr)   VALUES ('f'::citext::text);
 INSERT INTO caster (citext)VALUES ('f'::"char");
 INSERT INTO caster (name)  VALUES ('foo'::text);
 INSERT INTO caster (text)  VALUES ('foo'::name);
diff --git a/contrib/citext/sql/citext.sql b/contrib/citext/sql/citext.sql
index 10232f5a9f..55fb1d11a6 100644
--- a/contrib/citext/sql/citext.sql
+++ b/contrib/citext/sql/citext.sql
@@ -361,7 +361,8 @@ INSERT INTO caster (citext)VALUES ('f'::char);
 
 INSERT INTO caster (chr)   VALUES ('f'::text);
 INSERT INTO caster (text)  VALUES ('f'::"char");
-INSERT INTO caster (chr)   VALUES ('f'::citext);
+INSERT INTO caster (chr)   VALUES ('f'::citext);  -- requires cast
+INSERT INTO caster (chr)   VALUES ('f'::citext::text);
 INSERT INTO caster (citext)VALUES ('f'::"char");
 
 INSERT INTO caster (name)  VALUES ('foo'::text);
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1d11be73f..216aa4510d 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9305,6 +9305,10 @@ SCRAM-SHA-256$iteration count:
   X
   unknown type
  
+ 
+  Z
+  Internal-use types
+ 
 

   
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index ea721d963a..72d8547628 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -1142,7 +1142,7 @@ permissionsList(const char *pattern)
 		  ",\n  pg_catalog.array_to_string(ARRAY(\n"
 		  "SELECT polname\n"
 		  "|| CASE WHEN polcmd != '*' THEN\n"
-		  "   E' (' || polcmd || E'):'\n"
+		  "   E' (' || polcmd::pg_catalog.text || E'):'\n"
 		  "   ELSE E':'\n"
 		  "   END\n"
 		  "|| CASE WHEN polqual IS NOT NULL THEN\n"
@@ -1176,7 +1176,7 @@ permissionsList(const char *pattern)
 		  "   E' (RESTRICTIVE)'\n"
 		  "   ELSE '' END\n"
 		  "|| CASE WHEN polcmd != '*' THEN\n"
-		  "   E' (' || polcmd || E'):'\n"
+		  "   E' (' || polcmd::pg_catalog.text || E'):'\n"
 		  "   ELSE E':'\n"
 		  "   END\n"
 		  "|| CASE WHEN polqual IS NOT NULL THEN\n"
diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat
index 41074c994b..f3d94f3cf5 100644
--- a/src/include/catalog/pg_type.dat
+++ 

Re: MSVC SSL test failure

2021-12-07 Thread Andrew Dunstan


On 12/7/21 13:22, Tom Lane wrote:
> Alexander Lakhin  writes:
>> 07.12.2021 19:25, Tom Lane wrote:
>>> Hmm. I wonder whether using SD_BOTH behaves any differently. 
>> With shutdown(MyProcPort->sock, SD_BOTH) the test failed for me on
>> iterations 1, 2, 3, 16 (just as without shutdown() at all).
>> So shutdown with the SD_SEND flag definitely behaves much better (I've
>> seen over 200 successful iterations).
> Fun.  Well, I'll put in shutdown with SD_SEND for the moment,
> and we'll have to see whether we can improve further than that.
> It does sound like we may be running into OpenSSL bugs/oddities,
> not only kernel issues, so it may be impossible to do better
> on our side.


Yeah. My suspicion is that SD_BOTH is what closesocket() does if
shutdown() hasn't been previously called.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: pg_dump versus ancient server versions

2021-12-07 Thread Andrew Dunstan


On 12/7/21 13:59, Mark Dilger wrote:
>> On Dec 7, 2021, at 10:52 AM, Tom Lane  wrote:
>>
>> I'm not entirely following ... are you suggesting that each released minor
>> version needs to be kept buildable separately?
> No.  I'm just wondering if we want to share the product of such efforts if 
> anybody (me, for instance) volunteers to do it for some subset of minor 
> releases.  For my heap corruption checking work, I might want to be able to 
> build a small number of old minor releases that I know had corruption bugs.
>

I doubt there's going to be a whole lot of changes. You should just be
able to cherry-pick them in most cases I suspect.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-07 Thread Peter Geoghegan
On Tue, Dec 7, 2021 at 7:58 AM Robert Haas  wrote:
> I think that's a good observation. I think the current autovacuum
> algorithm works pretty well when things are normal. But in extreme
> scenarios it does not degrade gracefully.

+1 to all of the specific examples you go on to describe.

> > I also think that our problem is not so much that we don't have
> > accurate statistics about dead tuples (though we surely don't have
> > much accuracy). The main problem seems to be that there are various
> > specific, important ways in which the distribution of dead tuples may
> > matter (leading to various harmful biases). And so it seems reasonable
> > to fudge how we interpret dead tuples with the intention of capturing
> > some of that, as a medium term solution. Something that considers the
> > concentration of dead tuples in heap pages seems promising.
>
> I am less convinced about this part. It sort of sounds like you're
> saying - it doesn't really matter whether the numbers we gather are
> accurate, just that they produce the correct results.

That's not quite it. More like: I think that it would be reasonable to
define dead tuples abstractly, in a way that's more useful but
nevertheless cannot diverge too much from the current definition. We
should try to capture "directionality", with clear error bounds. This
won't represent the literal truth, but it will be true in spirit (or
much closer).

For example, why should we count dead heap-only tuples from earlier in
a HOT chain, even when we see no evidence that opportunistic HOT
pruning can't keep up on that page? Since we actually care about the
direction of things, not just the present state of things, we'd be
justified in completely ignoring those dead tuples. Similarly, it
might well make sense to give more weight to concentrations of LP_DEAD
items on a page -- that is a signal that things are not going well *at
the level of the page*. Not so much when you have a few LP_DEAD stubs,
but certainly when you have dozens of them on one page, or even
hundreds. And so ISTM that the conditions of the page should influence
how we interpret/count that page's dead tuples, in both directions
(interpret the page as having more dead tuples, or fewer).

We all know that there isn't a sensible answer to the question "if the
abstract cost units used in the optimizer say that one sequential page
access is 4x cheaper than one random page access, then what's the
difference between accessing 10 random pages sequentially in close
succession, versus accessing the same 10 pages randomly?". But at the
same time, we can say for sure that random is more expensive to *some*
degree, but certainly never by multiple orders of magnitude. The model
is imperfect, to be sure, but it is better than many alternative
models that are also technically possible. We need *some* cost model
for a cost-based optimizer, and it is better to be approximately
correct than exactly wrong. Again, the truly important thing is to not
be totally wrong about any one thing.

Another way of putting it is that I am willing to accept a more biased
definition of dead tuples, if that lowers the variance:

https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

We have some certainty about what is possible in a world in which we
use the visibility map directly, and so our final estimate should
never be wildly unreasonable -- the abstract idea of dead tuples can
still be anchored to the physical reality.

As with the optimizer and its cost units, we don't have that many
degrees of freedom when autovacuum.c launches a worker, and I don't
see that changing -- we must ultimately decide to launch or not launch
an autovacuum worker (for any given table) each time the autovacuum
launcher wakes up. So we're practically forced to come up with a model
that has some abstract idea of one kind of bloat/dead tuple. I would
say that we already have one, in fact -- it's just not a very good one
because it doesn't take account of obviously-relevant factors like
HOT. It could quite easily be less wrong.

> If the
> information we currently gather wouldn't produce the right results
> even if it were fully accurate, that to me suggests that we're
> gathering the wrong information, and we should gather something else.

I think that counting dead tuples is useful, but not quite sufficient
on its own. At the same time, we still need something that works like
a threshold -- because that's just how the autovacuum.c scheduling
works. It's a go/no-go decision, regardless of how the decision is
made.
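
For reference, the stock go/no-go test can be eyeballed from SQL against
the cumulative statistics; this is only a sketch of the current formula
(threshold + scale factor * reltuples), ignoring per-table reloptions:

SELECT s.relname,
       s.n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::float8 * c.reltuples
         AS autovacuum_trigger_point
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid;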

> For example, we could attack the useless-vacuuming problem by having
> each vacuum figure out - and store - the oldest XID that could
> potentially be worth using as a cutoff for vacuuming that table, and
> refusing to autovacuum that table again until we'd be able to use a
> cutoff >= that value. I suppose this would be the oldest of (a) any
> XID that exists in the table on a tuple that we judged recently dead,
> (b) any XID that is 

Re: pg_dump versus ancient server versions

2021-12-07 Thread Mark Dilger



> On Dec 7, 2021, at 10:52 AM, Tom Lane  wrote:
> 
> I'm not entirely following ... are you suggesting that each released minor
> version needs to be kept buildable separately?

No.  I'm just wondering if we want to share the product of such efforts if 
anybody (me, for instance) volunteers to do it for some subset of minor 
releases.  For my heap corruption checking work, I might want to be able to 
build a small number of old minor releases that I know had corruption bugs.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: pg_dump versus ancient server versions

2021-12-07 Thread Tom Lane
Mark Dilger  writes:
> Wouldn't you be able to see what changed by comparing the last released tag 
> for version X.Y against the RELX_Y_STABLE branch?  Something like `git diff 
> REL8_4_22 origin/REL8_4_STABLE > buildability.patch`?

> Having such a patch should make reproducing old corruption bugs easier, as 
> you could apply the buildability.patch to the last branch that contained the 
> bug.  If anybody did that work, would we want it committed somewhere?  
> REL8_4_19_BUILDABLE or such?  For patches that apply trivially, that might 
> not be worth keeping, but if the merge is difficult, maybe sharing with the 
> community would make sense.

I'm not entirely following ... are you suggesting that each released minor
version needs to be kept buildable separately?  That seems like a huge
amount of extra committer effort with not much added value.  If someone
comes to me and wants to investigate a bug in a branch that's already
out-of-support, and they then say they're not running the last minor
release, I'm going to tell them to come back after updating.

It is (I suspect) true that diffing the last release against branch
tip would often yield a patch that could be used to make an older
minor release buildable again.  But when that patch doesn't work
trivially, I for one am not interested in making it work.  And
especially not interested in doing so "on spec", with no certainty
that anyone would ever need it.

regards, tom lane




Re: Post-CVE Wishlist

2021-12-07 Thread Jacob Champion
On Tue, 2021-11-23 at 18:27 +, Jacob Champion wrote:
> Now that the MITM CVEs are published [1], I wanted to share my wishlist
> of things that would have made those attacks difficult/impossible to
> pull off.

Now that we're post-commitfest, here's my summary of the responses so
far:

> = Client-Side Auth Selection =

There is interest in letting libpq reject certain auth methods coming
back from the server, perhaps using a simple connection option, and
there are some prior conversations on the list to look into.

> = Implicit TLS =

Reactions to implicit TLS were mixed, from "we should not do this" to
"it might be nice to have the option, from a technical standpoint".
Both a separate-port model and a shared-port model were tentatively
proposed. The general consensus seems to be that the StartTLS-style
flow is currently sufficient from a security standpoint.

I didn't see any responses that were outright in favor, so I think my
remaining question is: are there any committers who think a prototype
would be worth the time for a motivated implementer?

Thanks for the discussion!

--Jacob


Re: pg_dump versus ancient server versions

2021-12-07 Thread Mark Dilger



> On Dec 7, 2021, at 8:33 AM, Robert Haas  wrote:
> 
> However, it wouldn't be a great idea to back-patch a
> completely arbitrary subset of our fixes into those branches, because
> then it sort of gets confusing to understand what the status of that
> branch is. I don't know that I'm terribly bothered by the idea that
> the behavior of the branch might deviate from the last official
> release, because most bug fixes are pretty minor and wouldn't really
> affect testing much, but it would be a little annoying to explain to
> users that those branches contain an arbitrary subset of newer fixes,
> and a little hard for us to understand what is and is not there.

Wouldn't you be able to see what changed by comparing the last released tag for 
version X.Y against the RELX_Y_STABLE branch?  Something like `git diff 
REL8_4_22 origin/REL8_4_STABLE > buildability.patch`?

Having such a patch should make reproducing old corruption bugs easier, as you 
could apply the buildability.patch to the last branch that contained the bug.  
If anybody did that work, would we want it committed somewhere?  
REL8_4_19_BUILDABLE or such?  For patches that apply trivially, that might not 
be worth keeping, but if the merge is difficult, maybe sharing with the 
community would make sense.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: MSVC SSL test failure

2021-12-07 Thread Tom Lane
Alexander Lakhin  writes:
> 07.12.2021 19:25, Tom Lane wrote:
>> Hmm. I wonder whether using SD_BOTH behaves any differently. 

> With shutdown(MyProcPort->sock, SD_BOTH) the test failed for me on
> iterations 1, 2, 3, 16 (just as without shutdown() at all).
> So shutdown with the SD_SEND flag definitely behaves much better (I've
> seen over 200 successful iterations).

Fun.  Well, I'll put in shutdown with SD_SEND for the moment,
and we'll have to see whether we can improve further than that.
It does sound like we may be running into OpenSSL bugs/oddities,
not only kernel issues, so it may be impossible to do better
on our side.

regards, tom lane




Re: ExecTypeSetColNames is fundamentally broken

2021-12-07 Thread Tom Lane
Robert Haas  writes:
> On Tue, Dec 7, 2021 at 12:30 PM Tom Lane  wrote:
>> If we consider that the alias renames the columns "for all purposes",
>> how is it okay for f() to select the "a" column?

> I'd say it isn't.

In a green field I'd probably agree with you, but IMO that will
break far too much existing SQL code.

It'd cause problems for us too, not only end-users.  As an example,
ruleutils.c would have to avoid attaching new column aliases to
tables that are referenced as whole-row Vars.  I'm not very sure
that that's even possible without creating insurmountable ambiguity
issues.  There are also fun issues around what happens to a stored
query after a table column rename.  Right now the query acts as
though it uses the old name as a column alias, and that introduces
no semantic problem; but that behavior would no longer be
acceptable.

So the alternatives I see are to revert what bf7ca1587 tried to do
here, or to try to make it work that way across-the-board, which
implies (a) a very much larger amount of work, and (b) breaking
important behaviors that are decades older than that commit.
It's not even entirely clear that we could get to complete
consistency if we went down that path.

regards, tom lane




Re: Dubious usage of TYPCATEGORY_STRING

2021-12-07 Thread Robert Haas
On Tue, Dec 7, 2021 at 12:19 PM Tom Lane  wrote:
> > What's wrong with that?
>
> Well, we don't allow things like
>
> regression=# select '(1,2)'::point::float8;
> ERROR:  cannot cast type point to double precision
> LINE 1: select '(1,2)'::point::float8;
>  ^
>
> It's not very clear to me why "char" should get a pass on that.
> We allow such cases when the target is text/varchar/etc, but
> the assumption is that the textual representation is sufficient
> for your purposes.  It's hard to claim that just the first
> byte is a useful textual representation.

Fair enough, I guess. I am pretty skeptical of the merits of refusing
an explicit cast. If I ran the zoo, I would probably choose to allow
all such casts and make them coerce via IO when no other pathway is
available. But I get that's not our policy.

> Worse, PG is actually treating this as an assignment-level cast,
> so we accept this:
>
> regression=# create table t(f1 "char");
> CREATE TABLE
> regression=# insert into t values ('(1,2)'::point);
> INSERT 0 1
> regression=# table t;
>  f1
> 
>  (
> (1 row)
>
> I definitely don't think that should have worked without
> any complaint.

Yes, that one's a bridge too far, even for me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Transparent column encryption

2021-12-07 Thread Jacob Champion
On Tue, 2021-12-07 at 16:39 +0100, Peter Eisentraut wrote:
> On 06.12.21 21:44, Jacob Champion wrote:
> > I think reusing a zero IV will potentially leak more information than
> > just equality, depending on the cipher in use. You may be interested in
> > synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
> > like they would match this use case exactly. (But I'm not a
> > cryptographer.)
> 
> I'm aware of this and plan to make use of SIV.  The current 
> implementation is just an example.

Sounds good.

> > Have you given any thought to AEAD? As a client I'd like to be able to
> > tie an encrypted value to other column (or external) data. For example,
> > AEAD could be used to prevent a DBA from copying the (encrypted) value
> > of my credit card column into their account's row to use it.
> 
> I don't know how that is supposed to work.  When the value is encrypted 
> for insertion, the client may know things like table name or column 
> name, so it can tie it to those.  But it doesn't know what row it will 
> go in, so you can't prevent the value from being copied into another 
> row.  You would need some permanent logical row ID for this, I think.

Sorry, my description was confusing. There's nothing preventing the DBA
from copying the value inside the database, but AEAD can make it so
that the copied value isn't useful to the DBA.

Sample case. Say I have a webapp backed by Postgres, which stores
encrypted credit card numbers. Users authenticate to the webapp which
then uses the client (which has the keys) to talk to the database.
Additionally, I assume that:

- the DBA can't access the client directly (because if they can, then
they can unencrypt the victim's info using the client's keys), and

- the DBA can't authenticate as the user/victim (because if they can,
they can just log in themselves and have the data). The webapp might
for example use federated authn with a separate provider, using an
email address as an identifier.

Now, if the client encrypts a user's credit card number using their
email address as associated data, then it doesn't matter if the DBA
copies that user's encrypted card over to their own account. The DBA
can't log in as the victim, so the client will fail to authenticate the
value because its associated data won't match.

> > > This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
> > > on the direction.
> > 
> > What kinds of attacks are you hoping to prevent (and not prevent)?
> 
> The point is to prevent admins from getting at plaintext data.  The 
> scenario you show is an interesting one but I think it's not meant to be 
> addressed by this.  If admins can alter the database to their advantage, 
> they could perhaps increase their account balance, create discount 
> codes, etc. also.

Sure, but increasing account balances and discount codes don't lead to
getting at plaintext data, right? Whereas stealing someone else's
encrypted value seems like it would be covered under your threat model,
since it lets you trick a real-world client into decrypting it for you.

Other avenues of attack might depend on how you choose to add HMAC to
the current choice of AES-CBC. My understanding of AE ciphers (with or
without associated data) is that you don't have to design that
yourself, which is nice.

--Jacob


Re: MSVC SSL test failure

2021-12-07 Thread Alexander Lakhin
Hello Tom,
07.12.2021 19:25, Tom Lane wrote:
> Hmm. I wonder whether using SD_BOTH behaves any differently. 
With shutdown(MyProcPort->sock, SD_BOTH) the test failed for me on
iterations 1, 2, 3, 16 (just as without shutdown() at all).
So shutdown with the SD_SEND flag definitely behaves much better (I've
seen over 200 successful iterations).
 
Best regards,
Alexander




Re: ExecTypeSetColNames is fundamentally broken

2021-12-07 Thread Robert Haas
On Tue, Dec 7, 2021 at 12:30 PM Tom Lane  wrote:
> If we consider that the alias renames the columns "for all purposes",
> how is it okay for f() to select the "a" column?

I'd say it isn't.

> Another way to phrase the issue is that the column names seen
> by f() are currently different from those seen by row_to_json():
>
> regression=# select row_to_json(t) from t as t(x,y);
>row_to_json
> -
>  {"x":11,"y":12}
> (1 row)
>
> and that seems hard to justify.

Yeah, I agree. The problem I have here is that, with your proposed
fix, it still won't be very consistent. True, row_to_json() and f()
will both see the unaliased column names ... but a straight select *
from t as t(x,y) will show the aliased names. That's unarguably better
than corrupting your data, but it seems "astonishing" in the POLA
sense.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Alter all tables in schema owner fix

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 2:47 AM, "Greg Nancarrow"  wrote:
> On Tue, Dec 7, 2021 at 9:01 PM Amit Kapila  wrote:
>>
>> Thanks, the patch looks mostly good to me. I have slightly modified it
>> to incorporate one of Michael's suggestions, ran pgindent, and
>> modified the commit message.
>>
>
> LGTM, except in the patch commit message I'd change "Create
> Publication" to "CREATE PUBLICATION".

LGTM, too.

Nathan



Re: pg_dump versus ancient server versions

2021-12-07 Thread Tom Lane
Robert Haas  writes:
> I guess the point about user-visible bug fixes is that, as soon as we
> start doing that, we don't really want it to be hit-or-miss. We could
> make a decision to back-patch all bug fixes or those of a certain
> severity or whatever we like back to older branches, and then those
> branches would be supported or semi-supported depending on what rule
> we adopted, and we could even continue to do releases for them if we
> so chose. However, it wouldn't be a great idea to back-patch a
> completely arbitrary subset of our fixes into those branches, because
> then it sort of gets confusing to understand what the status of that
> branch is.

Yup, and also confusing to understand whether a given new fix should
be back-patched into the out-of-support-but-keep-buildable branches.
I want to settle on a reasonably well-defined policy for that.

I'm basically suggesting that the policy should be "back-patch the
minimal fix needed so that you can still get a clean build and clean
check-world run, using thus-and-such configure options".  (The point
of the configure options limitation being to exclude moving-target
external dependencies, such as Python.)  I think that Peter's
original suggestion could be read the same way except for the
adjective "clean".  He also said that only core regression needs
to pass not check-world; but if we're trying to test things like
pg_dump compatibility, I think we want the wider scope of what to
keep working.

regards, tom lane




Re: Pre-allocating WAL files

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 12:29 AM, "Bharath Rupireddy" 
 wrote:
> Why can't the walwriter pre-allocate some of the WAL segments instead
> of a new background process? Of course, it might delay the main
> functionality of the walwriter i.e. flush and sync the WAL files, but
> having checkpointer do the pre-allocation makes it do another extra
> task. Here the amount of walwriter work vs checkpointer work, I'm not
> sure which one does more work compared to the other.

The argument against adding it to the WAL writer is that we want it to
run with low latency to flush asynchronous commits.  If we added WAL
pre-allocation to the WAL writer, there could periodically be large
delays.

> Another idea could be to let walwriter or checkpointer pre-allocate
> the WAL files whichever seems free as-of-the-moment when the WAL
> segment pre-allocation request comes. We can go further to let the
> user choose which process i.e. checkpointer or walwriter do the
> pre-allocation with a GUC maybe?

My latest patch set [0] adds WAL pre-allocation to the checkpointer.
In that patch set, WAL pre-allocation is done both outside of
checkpoints as well as during checkpoints (via
CheckPointWriteDelay()).  

Nathan

[0] 
https://www.postgresql.org/message-id/CB15BEBD-98FC-4E72-86AE-513D34014176%40amazon.com



Re: ExecTypeSetColNames is fundamentally broken

2021-12-07 Thread Tom Lane
Robert Haas  writes:
> On Mon, Dec 6, 2021 at 4:05 PM Tom Lane  wrote:
>> select f(t) from t(x,y);
>> 
>> If we adopt the "rename for all purposes" interpretation, then
>> the second SELECT must fail, because what f() is being passed is
>> no longer of type t.

> For me, the second SELECT does fail:

> rhaas=# select f(t) from t(x,y);
> ERROR:  column "x" does not exist

Ah, sorry, I fat-fingered the alias syntax.  Here's a tested example:

regression=# create table t (a int, b int);
CREATE TABLE
regression=# insert into t values(11,12);
INSERT 0 1
regression=# create function f(t) returns int as 'select $1.a' language sql;
CREATE FUNCTION
regression=# select f(t) from t as t(x,y);
 f  

 11
(1 row)

If we consider that the alias renames the columns "for all purposes",
how is it okay for f() to select the "a" column?

Another way to phrase the issue is that the column names seen
by f() are currently different from those seen by row_to_json():

regression=# select row_to_json(t) from t as t(x,y);
   row_to_json   
-
 {"x":11,"y":12}
(1 row)

and that seems hard to justify.

regards, tom lane




Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-07 Thread Bossart, Nathan
On 12/6/21, 8:19 PM, "Dilip Kumar"  wrote:
> If the subtransaction cache is overflowed in some of the transactions
> then it will affect all the concurrent queries as they need to access
> the SLRU for checking the visibility of each tuple.  But currently
> there is no way to identify whether in any backend subtransaction is
> overflowed or what is the current active subtransaction count.
> Attached patch adds subtransaction count and subtransaction overflow
> status in pg_stat_activity.  I have implemented this because of the
> recent complaint about the same[1]
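
As a sketch of how that might be consumed once exposed -- the two column
names below are hypothetical, following the proposal's wording, and do
not exist in pg_stat_activity today:

SELECT pid, backend_xid,
       subxact_count, subxact_overflowed   -- hypothetical columns
FROM pg_stat_activity
WHERE subxact_overflowed;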

I'd like to give a general +1 to this effort.  Thanks for doing this!
I've actually created a function to provide this information in the
past, so I will help review.

Nathan



Re: Dubious usage of TYPCATEGORY_STRING

2021-12-07 Thread Tom Lane
Robert Haas  writes:
> On Thu, Dec 2, 2021 at 4:22 PM Tom Lane  wrote:
>> An example of the reasons not to treat these types as being
>> general-purpose strings can be seen at [1], where the "char"
>> type has acquired some never-intended cast behaviors.  Taking
>> that to an extreme, we currently accept
>> 
>> regression=# select '(1,2)'::point::"char";
>> char
>> --
>> (
>> (1 row)

> What's wrong with that?

Well, we don't allow things like

regression=# select '(1,2)'::point::float8;
ERROR:  cannot cast type point to double precision
LINE 1: select '(1,2)'::point::float8;
 ^

It's not very clear to me why "char" should get a pass on that.
We allow such cases when the target is text/varchar/etc, but
the assumption is that the textual representation is sufficient
for your purposes.  It's hard to claim that just the first
byte is a useful textual representation.

Worse, PG is actually treating this as an assignment-level cast,
so we accept this:

regression=# create table t(f1 "char");
CREATE TABLE
regression=# insert into t values ('(1,2)'::point);
INSERT 0 1
regression=# table t;
 f1 

 (
(1 row)

I definitely don't think that should have worked without
any complaint.

regards, tom lane




Re: pg_dump versus ancient server versions

2021-12-07 Thread Robert Haas
On Mon, Dec 6, 2021 at 4:19 PM Tom Lane  wrote:
> Right.  The question that's on the table is how much is the right
> amount of maintenance.  I think that back-patching user-visible bug
> fixes, for example, is taking things too far.  What we want is to
> be able to replicate the behavior of the branch's last released
> version, using whatever build tools we are currently using.  So
> back-patching something like that is counterproductive, because
> now the behavior is not what was released.
>
> A minimal amount of maintenance would be "only back-patch fixes
> for issues that cause failure-to-build".  The next step up is "fix
> issues that cause failure-to-pass-regression-tests", and then above
> that is "fix developer-facing annoyances, such as compiler warnings
> or unwanted test output, as long as you aren't changing user-facing
> behavior".  I now think that it'd be reasonable to include this
> last group, although I'm pretty sure Peter didn't have that in mind
> in his policy sketch.

Yep, that seems reasonable to me.

I guess the point about user-visible bug fixes is that, as soon as we
start doing that, we don't really want it to be hit-or-miss. We could
make a decision to back-patch all bug fixes or those of a certain
severity or whatever we like back to older branches, and then those
branches would be supported or semi-supported depending on what rule
we adopted, and we could even continue to do releases for them if we
so chose. However, it wouldn't be a great idea to back-patch a
completely arbitrary subset of our fixes into those branches, because
then it sort of gets confusing to understand what the status of that
branch is. I don't know that I'm terribly bothered by the idea that
the behavior of the branch might deviate from the last official
release, because most bug fixes are pretty minor and wouldn't really
affect testing much, but it would be a little annoying to explain to
users that those branches contain an arbitrary subset of newer fixes,
and a little hard for us to understand what is and is not there.

That being said, suppose that a new compiler version comes out and on
that new compiler version, 'make check' crashes on the older branch
due to a missing WhateverGetDatum() call that we rectified in a later,
back-patched commit. I would consider it reasonable to back-patch that
particular bug fix into an unsupported branch to make it testable,
just like we would do for a failure-to-build issue.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MSVC SSL test failure

2021-12-07 Thread Tom Lane
Alexander Lakhin  writes:
> It seems that the test failure rate may depend on the specs/environment.

No surprise there, since the issue is almost surely timing-dependent.

> shutdown(MyProcPort->sock, SD_SEND) apparently fixes the issue, I've got
> 83 successful runs, but then iteration 84 unfortunately failed:
> t/001_ssltests.pl .. 106/110
> #   Failed test 'intermediate client certificate is missing: matches'
> #   at t/001_ssltests.pl line 608.
> #   'psql: error: connection to server at "127.0.0.1",
> port 63187 failed: could not receive data from server: Software caused
> connection abort (0x2745/10053)
> # SSL SYSCALL error: Software caused connection abort (0x2745/10053)
> # could not send startup packet: No error (0x/0)'
> # doesn't match '(?^:SSL error: tlsv1 alert unknown ca)'
> # Looks like you failed 1 test of 110.

Hmm.  I wonder whether using SD_BOTH behaves any differently.

regards, tom lane




Re: ExecTypeSetColNames is fundamentally broken

2021-12-07 Thread Robert Haas
On Mon, Dec 6, 2021 at 4:05 PM Tom Lane  wrote:
> Well, that was what I thought when I wrote bf7ca1587, but it leads
> to logical contradictions.  Consider
>
> create table t (a int, b int);
>
> create function f(t) returns ... ;
>
> select f(t) from t;
>
> select f(t) from t(x,y);
>
> If we adopt the "rename for all purposes" interpretation, then
> the second SELECT must fail, because what f() is being passed is
> no longer of type t.  If you ask me, that'll be a bigger problem
> for users than the change I'm proposing (quite independently of
> how hard it might be to implement).  It certainly will break
> a behavior that goes back much further than bf7ca1587.

For me, the second SELECT does fail:

rhaas=# select f(t) from t(x,y);
ERROR:  column "x" does not exist
LINE 1: select f(t) from t(x,y);
   ^

If it didn't fail, what would the behavior be? I suppose you could
make an argument for trying to match up the columns by position, but
if so this ought to work:

rhaas=# create table u(a int, b int);
CREATE TABLE
rhaas=# select f(u) from u;
ERROR:  function f(u) does not exist
rhaas=# select f(u::t) from u;
ERROR:  cannot cast type u to t

Matching columns by name can't work because the names don't match.
Matching columns by position does not work. So if that example
succeeds, the only real explanation is that we magically know that
we've still got a value of type t despite the user's best attempt to
decree otherwise. I know PostgreSQL sometimes ... does things like
that. I have no idea why anyone would consider it a desirable
behavior, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: AW: Assorted improvements in pg_dump

2021-12-07 Thread Tom Lane
Hans Buschmann  writes:
> I noticed that you patched master with all the improvements in pg_dump.
> Did you change your mind about backpatching patch 0005 to fix the toast size 
> matter?

I looked briefly at that and found that the patch would have to be
largely rewritten, because getTables() looks quite different in the
older branches.  I'm not really sufficiently excited about the point
to do that rewriting and re-testing.  I think that cases where the
old logic gets the scheduling badly wrong are probably rare.

regards, tom lane




Re: Dubious usage of TYPCATEGORY_STRING

2021-12-07 Thread Robert Haas
On Thu, Dec 2, 2021 at 4:22 PM Tom Lane  wrote:
> An example of the reasons not to treat these types as being
> general-purpose strings can be seen at [1], where the "char"
> type has acquired some never-intended cast behaviors.  Taking
> that to an extreme, we currently accept
>
> regression=# select '(1,2)'::point::"char";
>  char
> --
>  (
> (1 row)

What's wrong with that?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-07 Thread Robert Haas
On Mon, Dec 6, 2021 at 9:42 PM Peter Geoghegan  wrote:
> On Mon, Dec 6, 2021 at 6:11 PM Robert Haas  wrote:
> > This doesn't seem convincing. Launching autovacuum too soon surely has
> > costs that someone might not want to pay. Clearly in the degenerate
> > case where we always autovacuum every table every time an autovacuum
> > worker is launched, we have gone insane.
>
> Unfortunately we already sometimes behave insanely in exactly the way
> that you describe:
>
> https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=w...@mail.gmail.com

In the same ballpark is http://rhaas.blogspot.com/2020/02/useless-vacuuming.html

> It's hard to define a break-even point for launching an autovacuum
> worker. I think it would be more productive to come up with a design
> that at least doesn't go completely off the rails in various specific
> ways.

I think that's a good observation. I think the current autovacuum
algorithm works pretty well when things are normal. But in extreme
scenarios it does not degrade gracefully. The
vacuuming-over-and-over-without-doing-anything phenomenon is an
example of that. Another example is the user who creates 10,000
databases and we're happy to divide the 60s-autovacuum_naptime by
10,000 and try to launch a worker every 6 ms. A third example is
vacuuming the tables from pg_class in physical order on disk, so that
a table that is 1 XID past the wraparound limit can result in a long
delay vacuuming a table that is bloating quickly, or conversely a
table that is bloating very slowly but has just crawled over the
threshold for a regular vacuum gets processed before one that is
threatening an imminent wraparound shutdown. I think these are all
pathological cases that a well-informed human can easily recognize and
handle in an intelligent manner, and it doesn't seem crazy to program
those responses into the computer in some way.
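
For illustration, the wraparound side of that ordering problem is at
least easy to observe from SQL today; a rough sketch of spotting the
tables closest to forced anti-wraparound vacuuming, compared against the
stock GUC:

SELECT relname,
       age(relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;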

> I also think that our problem is not so much that we don't have
> accurate statistics about dead tuples (though we surely don't have
> much accuracy). The main problem seems to be that there are various
> specific, important ways in which the distribution of dead tuples may
> matter (leading to various harmful biases). And so it seems reasonable
> to fudge how we interpret dead tuples with the intention of capturing
> some of that, as a medium term solution. Something that considers the
> concentration of dead tuples in heap pages seems promising.

I am less convinced about this part. It sort of sounds like you're
saying - it doesn't really matter whether the numbers we gather are
accurate, just that they produce the correct results. If the
information we currently gather wouldn't produce the right results
even if it were fully accurate, that to me suggests that we're
gathering the wrong information, and we should gather something else.
For example, we could attack the useless-vacuuming problem by having
each vacuum figure out - and store - the oldest XID that could
potentially be worth using as a cutoff for vacuuming that table, and
refusing to autovacuum that table again until we'd be able to use a
cutoff >= that value. I suppose this would be the oldest of (a) any
XID that exists in the table on a tuple that we judged recently dead,
(b) any XID that is currently-running, and (c) the next XID.

I also accept that knowing the concentration of dead tuples on pages
that contain at least 1 dead tuple could be interesting. I've felt for
a while that it's a mistake to know how many dead tuples there are but
not how many pages contain them, because that affects both the amount
of I/O required to vacuum and also how much need we have to set VM
bits. I'm not sure I would have approached gathering that information
in the way that you're proposing here, but I'm not deeply against it,
either. I do think that we should try to keep it as a principle that
whatever we do gather, we should try our best to make accurate. If
that doesn't work well, then gather different stuff instead.
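
For what it's worth, a crude version of the pages-with-dead-tuples
signal can already be pulled from the visibility map with the
pg_visibility extension; just a sketch, and it counts not-all-visible
pages rather than dead tuples per se (the table name is made up):

CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT count(*) FILTER (WHERE NOT all_visible) AS not_all_visible_pages,
       count(*) AS total_pages
FROM pg_visibility_map('some_table');   -- made-up table name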

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Triage for unimplemented geometric operators

2021-12-07 Thread Anton Voloshin

On 07/12/2021 06:18, Tom Lane wrote:

So my inclination for HEAD is to implement poly_distance and nuke
the others.  I'm a bit less sure about the back branches, but maybe
just de-document all of them there.


I agree, seems to be a reasonable compromise. Removing 20+-year-old 
unused and slightly misleading code probably should outweigh the 
natural inclination to implement all of the functions promised in the 
catalog. Enhancing software by deleting the code is not an everyday 
chance and IMHO should be taken, even when it requires an extra 
catversion bump.


I am kind of split on whether it is worth it to implement poly_distance 
in back branches. Maybe there is a benefit in implementing: it should 
not cause any reasonable incompatibilities and will introduce somewhat 
better compatibility with v15+. We could even get away with not updating 
v10..12' docs on poly_distance because it's not mentioned anyway.
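
For reference, this is roughly how one of those stubs surfaces today
(any two valid polygons will do):

SELECT '((0,0),(0,1),(1,1))'::polygon <-> '((2,2),(2,3),(3,3))'::polygon;
-- fails with a "not implemented" error in branches where poly_distance
-- is still a stub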


I agree on de-documenting all of the unimplemented functions in v13 and 
v14. Branches before v13 should not require any changes (including 
documentation) because detailed table on which operators support which 
geometry primitives only came in 13, and I could not find in e.g. 12's 
documentation any references to the cases you listed previously:

> dist_lb:   <->(line,box)
> dist_bl:   <->(box,line)
> close_sl:  lseg ## line
> close_lb:  line ## box
> poly_distance: polygon <-> polygon
> path_center:   @@ path

--
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru




Re: types reliant on encodings [was Re: Dubious usage of TYPCATEGORY_STRING]

2021-12-07 Thread Peter Eisentraut

On 03.12.21 19:42, Chapman Flack wrote:

Is there any way to find out, from the catalogs or in any automatable way,
which types are implemented with a dependence on the database encoding
(or on some encoding)?


What is this needed for?  C code can internally do whatever it wants, 
and the database encoding is effectively a constant, so there is no need 
for server-side code to be very much concerned about whether types do this.


Also, "types" is perhaps the wrong subject here.  Types only contain 
input and output functions and a few more bits.  Additional functions 
operating on the type could look at the server encoding without the type 
and its core functions knowing about it.






Re: Dubious usage of TYPCATEGORY_STRING

2021-12-07 Thread Tom Lane
Peter Eisentraut  writes:
> On 02.12.21 22:22, Tom Lane wrote:
>> My first thought about fixing point 1 was to put "char" into some
>> other typcategory, but that turns out to break some of psql's
>> catalog queries, with results like:
>> 
>> regression=# \dp
>> ERROR:  operator is not unique: unknown || "char"
>> LINE 16:E' (' || polcmd || E'):'
>> ^
>> HINT:  Could not choose a best candidate operator. You might need to add 
>> explicit type casts.

> Could we add explicit casts (like polcmd::text) here?  Or would it break 
> too much?

I assumed it'd break too much to consider doing that.  But I suppose
that since a typcategory change would be initdb-forcing anyway, maybe
it's not out of the question.  I'll investigate and see exactly how
many places would need an explicit cast.

regards, tom lane




Re: Dubious usage of TYPCATEGORY_STRING

2021-12-07 Thread Peter Eisentraut

On 02.12.21 22:22, Tom Lane wrote:

My first thought about fixing point 1 was to put "char" into some
other typcategory, but that turns out to break some of psql's
catalog queries, with results like:

regression=# \dp
ERROR:  operator is not unique: unknown || "char"
LINE 16:E' (' || polcmd || E'):'
   ^
HINT:  Could not choose a best candidate operator. You might need to add 
explicit type casts.


Could we add explicit casts (like polcmd::text) here?  Or would it break 
too much?






Re: Is there any documentation on how to correctly create extensions in HA(primary-standby) setup?

2021-12-07 Thread Peter Eisentraut

On 03.12.21 15:28, Bharath Rupireddy wrote:

I'm thinking of adding the above steps into the "Additional Supplied
Modules" section documentation. Any thoughts please?

[1] - https://www.postgresql.org/docs/devel/contrib.html


The chapter about extensions is probably better: 
https://www.postgresql.org/docs/devel/extend-extensions.html





Re: Transparent column encryption

2021-12-07 Thread Peter Eisentraut

On 06.12.21 21:44, Jacob Champion wrote:

I think reusing a zero IV will potentially leak more information than
just equality, depending on the cipher in use. You may be interested in
synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
like they would match this use case exactly. (But I'm not a
cryptographer.)


I'm aware of this and plan to make use of SIV.  The current 
implementation is just an example.



Have you given any thought to AEAD? As a client I'd like to be able to
tie an encrypted value to other column (or external) data. For example,
AEAD could be used to prevent a DBA from copying the (encrypted) value
of my credit card column into their account's row to use it.


I don't know how that is supposed to work.  When the value is encrypted 
for insertion, the client may know things like table name or column 
name, so it can tie it to those.  But it doesn't know what row it will 
go in, so you can't prevent the value from being copied into another 
row.  You would need some permanent logical row ID for this, I think. 
For this scenario, the deterministic encryption mode is perhaps not the 
right one.



This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
on the direction.


What kinds of attacks are you hoping to prevent (and not prevent)?


The point is to prevent admins from getting at plaintext data.  The 
scenario you show is an interesting one but I think it's not meant to be 
addressed by this.  If admins can alter the database to their advantage, 
they could perhaps increase their account balance, create discount 
codes, etc. also.


If this is a problem, then perhaps a better approach would be to store 
parts of the data in a separate database with separate admins.





pg_dump/restore --no-tableam

2021-12-07 Thread Justin Pryzby
I first suggested this a couple years ago.
Is it desirable to implement in pg_dump and pg_restore ?
It'd be just like --tablespace.

On Tue, Jan 28, 2020 at 07:33:17AM -0600, Justin Pryzby wrote:
> I made these casual comments.  If there's any agreement on their merit, it'd 
> be
> nice to implement at least the first for v13.
> 
> In <20190818193533.gl11...@telsasoft.com>, I wrote: 
> >  . What do you think about pg_restore --no-tableam; similar to 
> >--no-tablespaces, it would allow restoring a table to a different AM:
> >PGOPTIONS='-c default_table_access_method=zedstore' pg_restore 
> > --no-tableam ./pg_dump.dat -d postgres
> >Otherwise, the dump says "SET default_table_access_method=heap", which
> >overrides any value from PGOPTIONS and precludes restoring to new AM.
> 
> That appears to be a trivial variation on no-tablespace:
> 
> /* do nothing in --no-tablespaces mode */
> if (ropt->noTablespace)
> return;
...




Re: Transparent column encryption

2021-12-07 Thread Peter Eisentraut

On 06.12.21 19:28, Robert Haas wrote:

Speaking of parameter descriptions, the trickiest part of this whole
thing appears to be how to get transparently encrypted data into the
database (as opposed to reading it out).  It is required to use
protocol-level prepared statements (i.e., extended query) for this.

Why? If the client knows the CEK, can't the client choose to send
unprepared insert or update statements with pre-encrypted blobs? That
might be a bad idea from a security perspective because the encrypted
blob might then got logged, but we sometimes log parameters, too.


The client can send something like

PQexec(conn, "INSERT INTO tbl VALUES ('ENCBLOB', 'ENCBLOB')");

and it will work.  (See the included test suite where 'ENCBLOB' is 
actually computed by pgcrypto.)  But that is not transparent encryption. 
 The client wants to send "INSERT INTO tbl VALUES ('val1', 'val2')" and 
have libpq take care of encrypting 'val1' and 'val2' before hitting the 
wire.  For that you need to use the prepared statement API so that the 
values are available separately from the statement.  And furthermore the 
client needs to know what columns the insert statements is writing to, 
so that it can get the CEK for that column.  That's what it needs the 
parameter description for.


As alluded to, workarounds exist or might be made available to do part 
of that work yourself, but that shouldn't be the normal way of using it.





Re: Non-superuser subscription owners

2021-12-07 Thread Mark Dilger



> On Dec 7, 2021, at 2:29 AM, Amit Kapila  wrote:
> 
> Okay, let me try to explain again. Following is the text from docs
> [1]: " (a) To create a subscription, the user must be a superuser. (b)
> The subscription apply process will run in the local database with the
> privileges of a superuser. (c) Privileges are only checked once at the
> start of a replication connection. They are not re-checked as each
> change record is read from the publisher, nor are they re-checked for
> each change when applied.
> 
> My understanding is that we want to improve what is written as (c)
> which I think is the same as what you mentioned later as "Fix the
> current bug wherein subscription changes are applied with superuser
> force after the subscription owner has superuser privileges revoked.".
> Am I correct till here? If so, I think what I am suggesting should fix
> this with the assumption that we still want to follow (b) at least for
> the first patch.

Ok, that's a point of disagreement.  I was trying to fix both (b) and (c) in 
the first patch.

> One possibility is that our understanding of the
> first problem is the same but you want to allow apply worker running
> even when superuser privileges are revoked provided the user with
> which it is running has appropriate privileges on the objects being
> accessed by apply worker.

Correct, that's what I'm trying to make safe.

> We will talk about other points of the roadmap you mentioned once our
> understanding for the first one matches.

I am happy to have an off-list phone call with you, if you like.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Skipping logical replication transactions on subscriber side

2021-12-07 Thread Peter Eisentraut

On 03.12.21 03:53, Amit Kapila wrote:

I don't know how difficult it would be, but allowing multiple xids might
be desirable.


Are there many cases where there could be multiple xid failures that
the user can skip? Apply worker always keeps looping at the same error
failure so the user wouldn't know of the second xid failure (if any)
till the first failure is resolved.


Yeah, nevermind, doesn't make sense.


Yeah, I also think so. BTW, what do you think of providing extra
flexibility of giving other options like 'operation', 'rel' along with
xid? I think such options could be useful for large transactions that
operate on multiple tables as it is quite possible that only a
particular operation from the entire transaction is the cause of
failure. Now, on one side, we can argue that skipping the entire
transaction is better from the consistency point of view but I think
it is already possible that we just skip a particular update/delete
(if the corresponding tuple doesn't exist on the subscriber). For the
sake of simplicity, we can just allow providing xid at this stage and
then extend it later as required but I am not very sure of that point.
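
To make the granularity being discussed concrete, the xid-only form
might look something like the sketch below; the syntax is purely
illustrative and not settled:

-- hypothetical syntax, shown only to illustrate xid-level skipping
ALTER SUBSCRIPTION mysub SKIP (xid = 741);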


Skipping transactions partially sounds dangerous, especially when 
exposed as an option to users.  Needs more careful thought.





Re: logical decoding and replication of sequences

2021-12-07 Thread Peter Eisentraut

I have checked the 0001 and 0003 patches.  (I think we have dismissed
the approach in 0002 for now.  And let's talk about 0004 later.)

I have attached a few fixup patches, described further below.

# 0001

The argument "create" for fill_seq_with_data() is always true (and
patch 0002 removes it).  So I'm not sure if it's needed.  If it is, it
should be documented somewhere.

About the comment you added in nextval_internal(): It's a good
explanation, so I would leave it in.  I also agree with the
conclusion of the comment that the current solution is reasonable.  We
probably don't need the same comment again in fill_seq_with_data() and
again in do_setval().  Note that both of the latter functions already
point to nextval_interval() for other comments, so the same can be
relied upon here as well.

The order of the new fields sequence_cb and stream_sequence_cb is a
bit inconsistent compared to truncate_cb and stream_truncate_cb.
Maybe take another look to make the order more uniform.

Some documentation needs to be added to the Logical Decoding chapter.
I have attached a patch that I think catches all the places that need
to be updated, with some details left for you to fill in. ;-) The
ordering of the some of the documentation chunks reflects the order in
which the callbacks appear in the header files, which might not be
optimal; see above.

In reorderbuffer.c, you left a comment about how to access a sequence
tuple.  There is an easier way, using Form_pg_sequence_data, which is
how sequence.c also does it.  See attached patch.

# 0003

The tests added in 0003 don't pass for me.  It appears that you might
have forgotten to update the expected files after you added some tests
or something.  The cfbot shows the same.  See attached patch for a
correction, but do check what your intent was.

As mentioned before, we probably don't need the skip-sequences option
in test_decoding.

From 2dd9a0e76b9496016e7abebebd0006ddee72d3c1 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 7 Dec 2021 14:19:42 +0100
Subject: [PATCH 1/4] Whitespace fixes

---
 contrib/test_decoding/expected/sequence.out | 6 +++---
 contrib/test_decoding/sql/sequence.sql  | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/contrib/test_decoding/expected/sequence.out 
b/contrib/test_decoding/expected/sequence.out
index 17c88990b1..ca55bcebc3 100644
--- a/contrib/test_decoding/expected/sequence.out
+++ b/contrib/test_decoding/expected/sequence.out
@@ -177,7 +177,7 @@ SELECT data FROM 
pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
 --
 (0 rows)
 
--- rollback on table creation with serial column 
+-- rollback on table creation with serial column
 BEGIN;
 CREATE TABLE test_table (a SERIAL, b INT);
 INSERT INTO test_table (b) VALUES (100);
@@ -202,7 +202,7 @@ SELECT data FROM 
pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
 --
 (0 rows)
 
--- rollback on table with serial column 
+-- rollback on table with serial column
 CREATE TABLE test_table (a SERIAL, b INT);
 BEGIN;
 INSERT INTO test_table (b) VALUES (100);
@@ -222,7 +222,7 @@ INSERT INTO test_table (b) VALUES (700);
 INSERT INTO test_table (b) VALUES (800);
 INSERT INTO test_table (b) VALUES (900);
 ROLLBACK;
--- check table and sequence values after rollback 
+-- check table and sequence values after rollback
 SELECT * from test_table_a_seq;
  last_value | log_cnt | is_called 
 +-+---
diff --git a/contrib/test_decoding/sql/sequence.sql b/contrib/test_decoding/sql/sequence.sql
index d8a34738f3..21c4b79222 100644
--- a/contrib/test_decoding/sql/sequence.sql
+++ b/contrib/test_decoding/sql/sequence.sql
@@ -48,7 +48,7 @@ CREATE TABLE test_table (a INT);
 ROLLBACK;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'skip-sequences', '0');
 
--- rollback on table creation with serial column 
+-- rollback on table creation with serial column
 BEGIN;
 CREATE TABLE test_table (a SERIAL, b INT);
 INSERT INTO test_table (b) VALUES (100);
@@ -65,7 +65,7 @@ CREATE TABLE test_table (a SERIAL, b INT);
 ROLLBACK;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'skip-sequences', '0');
 
--- rollback on table with serial column 
+-- rollback on table with serial column
 CREATE TABLE test_table (a SERIAL, b INT);
 
 BEGIN;
@@ -82,7 +82,7 @@ CREATE TABLE test_table (a SERIAL, b INT);
 INSERT INTO test_table (b) VALUES (900);
 ROLLBACK;
 
--- check table and sequence values after rollback 
+-- check table and sequence values after rollback
 SELECT * from test_table_a_seq;
 SELECT nextval('test_table_a_seq');
 
-- 
2.34.1

From c5e9eb2792bd276e493aed582627e93b4c681374 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 7 Dec 2021 15:29:34 +0100
Subject: [PATCH 2/4] Make tests pass

---
 contrib/test_decoding/expected/sequence.out | 18 --
 1 

Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Bharath Rupireddy
Hi,

Currently one can know the kind of on-going/last checkpoint (shutdown,
end-of-recovery, immediate, force, etc.) only via the server logs, and
only when the log_checkpoints GUC is on. At times, users or service-layer
components want to know the kind of checkpoint (along with other
checkpoint-related info) in order to take some action, and it is
cumbersome to search through the server logs for it. The checkpoint info
can be obtained from the control file (either via pg_control_checkpoint()
or the pg_controldata tool), but the checkpoint kind isn't available
there.

How about we add an extra string field to the control file alongside
the other checkpoint info it already has? This way, the existing
pg_control_checkpoint() function or pg_controldata tool can easily be
enhanced to show the checkpoint kind as well. One concern is that we
don't want to grow the control file beyond the typical block size (of
8K) to avoid any torn writes. With this change, we would add at most
the strings specified at [1]. Adding it to the control file has the
advantage of preserving the last checkpoint kind, which might be
useful.
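
To illustrate what this could look like from the user's side (the
checkpoint_kind column below is hypothetical and does not exist today;
the first query already works):

-- available today: checkpoint location and time, but not the kind
SELECT checkpoint_lsn, checkpoint_time FROM pg_control_checkpoint();

-- with the proposed field, something along these lines could become
-- possible (column name "checkpoint_kind" is made up for illustration):
SELECT checkpoint_lsn, checkpoint_time, checkpoint_kind
FROM pg_control_checkpoint();
-- checkpoint_kind might then read e.g. 'checkpoint immediate force wait'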

Thoughts?

[1] for checkpoint: "checkpoint shutdown end-of-recovery immediate
force wait wal time flush-all"
for restartpoint: "restartpoint shutdown end-of-recovery immediate
force wait wal time flush-all"

Regards,
Bharath Rupireddy.




Re: should we document an example to set multiple libraries in shared_preload_libraries?

2021-12-07 Thread Bharath Rupireddy
On Mon, Dec 6, 2021 at 9:20 PM Bharath Rupireddy
 wrote:
>
> On Fri, Dec 3, 2021 at 11:25 PM Bossart, Nathan  wrote:
> >
> > On 12/3/21, 6:21 AM, "Bharath Rupireddy" 
> >  wrote:
> > > +1 to add here in the "Parameter Names and Values section", but do we
> > > want to backlink every string parameter to this section? I think it
> > > needs more effort. IMO, we can just backlink for
> > > shared_preload_libraries alone. Thoughts?
> >
> > IMO this is most important for GUC_LIST_QUOTE parameters, of which
> > there are only a handful.  I don't think adding a link to every string
> > parameter is necessary.
>
> Agree.
>
> Should we specify something like below in the "Parameter Names and
> Values" section's "String:" para? Do we use generic terminology like
> 'name' and val1, val2, val3 and so on?
>
> ALTER SYSTEM SET name = val1,val2,val3;
> ALTER SYSTEM SET name = 'val1', 'val2', 'val3';
> ALTER SYSTEM SET name = '"val 1"', '"val,2"', 'val3';
>
> Another thing I observed is the difference between how the
> postgresql.conf file and ALTER SYSTEM SET command is parsed for
> GUC_LIST_QUOTE values.

Since there's a difference in the way the parameters are set in the
postgresql.conf file and with the ALTER SYSTEM SET command (as pointed
out by Tom in this thread [1]), I'm now confused. If we were to add
examples to the "Parameter Names and Values" section, which examples
should we add: the postgresql.conf ones or the ALTER SYSTEM SET ones?
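
For what it's worth, here is a sketch of the two forms I have in mind for
shared_preload_libraries (the library names are just placeholders):

# postgresql.conf: the whole list is a single quoted string
shared_preload_libraries = 'pg_stat_statements, auto_explain'

-- ALTER SYSTEM SET: the list elements are given separately, and the server
-- quotes them as needed when writing postgresql.auto.conf
ALTER SYSTEM SET shared_preload_libraries = pg_stat_statements, auto_explain;
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements', 'auto_explain';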

Thoughts?

[1] 
https://www.postgresql.org/message-id/flat/3108541.1638806477%40sss.pgh.pa.us

Regards,
Bharath Rupireddy.




Re: row filtering for logical replication

2021-12-07 Thread Ashutosh Bapat
On Tue, Dec 7, 2021 at 12:18 PM tanghy.f...@fujitsu.com
 wrote:
>
> On Friday, December 3, 2021 10:09 AM Peter Smith  
> wrote:
> >
> > On Thu, Dec 2, 2021 at 2:32 PM tanghy.f...@fujitsu.com
> >  wrote:
> > >
> > > On Thursday, December 2, 2021 5:21 AM Peter Smith
> >  wrote:
> > > >
> > > > PSA the v44* set of patches.
> > > >
> > >
> > > Thanks for the new patch. Few comments:
> > >
> > > 1. This is an example in publication doc, but in fact it's not allowed. 
> > > Should we
> > > change this example?
> > >
> > > +CREATE PUBLICATION active_departments FOR TABLE departments WHERE
> > (active IS TRUE);
> > >
> > > postgres=# CREATE PUBLICATION active_departments FOR TABLE departments
> > WHERE (active IS TRUE);
> > > ERROR:  invalid publication WHERE expression for relation "departments"
> > > HINT:  only simple expressions using columns, constants and immutable 
> > > system
> > functions are allowed
> > >
> >
> > Thanks for finding this. Actually, the documentation looks correct to
> > me. The problem was the validation walker of patch 0002 was being
> > overly restrictive. It needed to also allow a BooleanTest node.
> >
> > Now it works (locally) for me. For example.
> >
> > test_pub=# create table departments(depno int primary key, active boolean);
> > CREATE TABLE
> > test_pub=# create publication pdept for table departments where
> > (active is true) with (publish="insert");
> > CREATE PUBLICATION
> > test_pub=# create publication pdept2 for table departments where
> > (active is false) with (publish="insert");
> > CREATE PUBLICATION
> >
> > This fix will be available in v45*.
> >
>
> Thanks for looking into it.
>
> I have another problem with your patch. The document says:
>
> ... If the subscription has several publications in
> +   which the same table has been published with different filters, those
> +   expressions get OR'ed together so that rows satisfying any of the 
> expressions
> +   will be replicated. Notice this means if one of the publications has no 
> filter
> +   at all then all other filters become redundant.
>
> Then, what if one of the publications is specified as 'FOR ALL TABLES' or 'FOR
> ALL TABLES IN SCHEMA'.
>
> For example:
> create table tbl (a int primary key);"
> create publication p1 for table tbl where (a > 10);
> create publication p2 for all tables;
> create subscription sub connection 'dbname=postgres port=5432' publication 
> p1, p2;

Thanks for the example. I was wondering about this case myself.

>
> I think for "FOR ALL TABLE" publication(p2 in my case), table tbl should be
> treated as no filter, and table tbl should have no filter in subscription 
> sub. Thoughts?
>
> But for now, the filter(a > 10) works both when copying initial data and 
> later changes.
>
> To fix it, I think we can check if the table is published in a 'FOR ALL 
> TABLES'
> publication or published as part of schema in function 
> pgoutput_row_filter_init
> (which was introduced in v44-0003 patch), also we need to make some changes in
> tablesync.c.

In order to check "FOR ALL TABLES", we might need to fetch the
publication metadata. Instead of that, can we add a "TRUE" filter on all
the tables which are part of a FOR ALL TABLES publication?
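
To spell out the idea with the example above (illustration only, not tested):

-- p2 is FOR ALL TABLES; for row-filter purposes it would be treated as if
-- every table it publishes carried an always-true filter, roughly:
--     CREATE PUBLICATION p2 FOR TABLE tbl WHERE (true);
-- so the filters for tbl on subscription sub combine as (a > 10) OR (true),
-- which reduces to no filtering at all, matching the "no filter wins" rule.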

-- 
Best Wishes,
Ashutosh Bapat




Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display

2021-12-07 Thread Bharath Rupireddy
On Wed, Nov 17, 2021 at 8:01 AM Michael Paquier  wrote:
>
> On Tue, Nov 16, 2021 at 01:26:49PM -0300, Alvaro Herrera wrote:
> > My opinion is that adding these things willy-nilly is not a solution to
> > any actual problem.  Adding a few additional log lines that are
> > low-volume at DEBUG1 might be useful, but below that (DEBUG2 etc) it's
> > not good for anything other than specific development, IMO.  At least
> > this particular one for streaming replication I think we should not
> > include.
>
> Looking at v2, I think that this leaves the additions of the DEBUG1
> entries in SendBaseBackup() and WalRcvWaitForStartPosition(), then.
> The one in pgarch.c does not provide any additional information as the
> segment to-be-archived should be part of the command.

Thank you all for the inputs. Here's the patch that I've come up with.

Upon thinking further, having at least the messages at LOG level [1]
would be helpful for knowing what's happening with the system while it
is in recovery. Although these messages at LOG level might fill up the
server logs, a good log consumption and rotation mechanism (which I'm
sure every major Postgres vendor has) should be sufficient to allay
that concern.

These LOG messages would help us know how long a restore command takes
to fetch a WAL file, which WAL file the server is currently recovering,
and where it is recovering from. Customers often ask questions like:
when will my server come up? How long will the recovery of my server
take?

As a developer or admin, one can monitor these logs and do a bunch of things:
1) see how many WAL files are left to be recovered, by looking at the
WAL files in the archive location, in the pg_wal directory, or on the
primary;
2) give an approximate estimate of when the server will come up, or how
much more time recovery will take: from these LOG messages one can
derive the number of WAL files the server recovers per minute and
combine that with the count of left-over WAL files from (1), as in the
example below.
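
For example (purely illustrative numbers): if the server log shows about
120 "restored WAL segment ... from archive" messages per minute and 600
segments are still waiting in the archive, the remaining recovery time is
on the order of 600 / 120 = 5 minutes.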

[1]
ereport(LOG,
        (errmsg("waiting for WAL segment \"%s\" from archive",
                xlogfname)));

ereport(LOG,
        (errmsg("restored WAL segment \"%s\" from archive",
                xlogfname)));

ereport(LOG,
        (errmsg("recovering WAL segment \"%s\" from source \"%s\"",
                xlogfname, srcname)));

Regards,
Bharath Rupireddy.


v3-0001-add-log-messages-for-recovery.patch
Description: Binary data


Re: pg_get_publication_tables() output duplicate relid

2021-12-07 Thread Amit Langote
On Tue, Dec 7, 2021 at 3:20 PM Amit Kapila  wrote:
> On Tue, Dec 7, 2021 at 10:52 AM Amit Langote  wrote:
> >  So if a partition is
> > explicitly present in a publication with a filter defined on it, I
> > suppose following are the possible scenarios:
> >
> > * publish_via_partition_root=true
> >
> > 1. If the root table is not present in the publication (perhaps
> > unlikely), use the partition's filter because there's nothing else to
> > consider.
> >
> > 2. If the root table is present in the publication (very likely), use
> > the root's filter, ignoring/overriding the partition's own if any.
> >
> > * publish_via_partition_root=false
> >
> > 1. If the root table is not present in the publication, use the
> > partition's filter because there's nothing else to consider.
> >
> > 2. If the root table is present in the publication, use the
> > partition's own filter if present, else root's.
>
> I have not tested/checked each of these scenarios individually but I
> think it is something like, if publish_via_partition_root is false
> then we will always try to use partitions row filter and if it is not
> there then we don't use any filter. Similarly, if
> publish_via_partition_root is true, then we always try to use a
> partitioned table row filter and if it is not there (either because
> the partitioned table is not part of the publication or because there
> is no filter defined for it) then we don't use any filter.

Okay, thanks for the explanation.


--
Amit Langote
EDB: http://www.enterprisedb.com




Re: Data is copied twice when specifying both child and parent table in publication

2021-12-07 Thread vignesh C
On Fri, Dec 3, 2021 at 11:24 AM houzj.f...@fujitsu.com
 wrote:
>
> On Thursday, December 2, 2021 4:54 PM houzj.f...@fujitsu.com 
>  wrote:
> > On Thursday, December 2, 2021 12:50 PM Amit Kapila
> >  wrote:
> > > On Thu, Dec 2, 2021 at 9:41 AM Greg Nancarrow 
> > > wrote:
> > > >
> > > > On Thu, Dec 2, 2021 at 1:48 PM Greg Nancarrow 
> > > wrote:
> > > > >
> > > > > If you updated my original description to say "(instead of just
> > > > > the individual partitions)", it would imply the same I think.
> > > > > But I don't mind if you want to explicitly state both cases to make 
> > > > > it clear.
> > > > >
> > > >
> > > > For example, something like:
> > > >
> > > > For publications of partitioned tables with
> > > > publish_via_partition_root set to true, only the partitioned table
> > > > (and not its partitions) is included in the view, whereas if
> > > > publish_via_partition_root is set to false, only the individual 
> > > > partitions are
> > included in the view.
> > > >
> > >
> > > Yeah, that sounds good to me.
> >
> > It looks good to me as well.
> > Attach the patches for (HEAD~13) which merge the suggested doc change. I
> > prepared the code patch and test patch separately to make it easier for
> > committer to confirm.
>
> It seems we might not need to backpatch the doc change, so
> attach another version which remove the doc changes from backpatch patches.

Thanks for the patches. The patch applies and the tests pass in HEAD
and the back branches. One minor suggestion:
1) Shall we change:
+  
+   For publications of partitioned tables with
+   publish_via_partition_root set to
+   true, only the partitioned table (and not its partitions)
+   is included in the view, whereas if
+   publish_via_partition_root is set to
+   false, only the individual partitions are included in the
+   view.
+  
To:
+  
+   For publications of partitioned tables with
+   publish_via_partition_root set to
+   true, only the partitioned table (and not its partitions)
+   is included in the view, whereas if
+   publish_via_partition_root is set to
+   false, only the individual partitions (and not the
+   partitioned table) are included in the
+   view.
+  

2) Any particular reason why the code and tests are back-patched but
not the documentation changes?
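
As an aside, a quick way to see the behavior that the proposed wording in
(1) describes (table and publication names are made up, untested here):

CREATE TABLE measurement (logdate date) PARTITION BY RANGE (logdate);
CREATE TABLE measurement_y2021 PARTITION OF measurement
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');
CREATE PUBLICATION pub_root FOR TABLE measurement
    WITH (publish_via_partition_root = true);
CREATE PUBLICATION pub_leaf FOR TABLE measurement
    WITH (publish_via_partition_root = false);
SELECT pubname, tablename FROM pg_publication_tables ORDER BY pubname;
-- pub_root should list only "measurement", while pub_leaf should list
-- only "measurement_y2021".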

Regards,
Vignesh




Re: Skipping logical replication transactions on subscriber side

2021-12-07 Thread Masahiko Sawada
On Mon, Dec 6, 2021 at 2:17 PM Amit Kapila  wrote:
>
> On Fri, Dec 3, 2021 at 12:12 PM Masahiko Sawada  wrote:
> >
> > On Fri, Dec 3, 2021 at 11:53 AM Amit Kapila  wrote:
> > >
> > > >  But this syntax gives you flexibility, so we can also
> > > > start with a simple implementation.
> > > >
> > >
> > > Yeah, I also think so. BTW, what do you think of providing extra
> > > flexibility of giving other options like 'operation', 'rel' along with
> > > xid? I think such options could be useful for large transactions that
> > > operate on multiple tables as it is quite possible that only a
> > > particular operation from the entire transaction is the cause of
> > > failure. Now, on one side, we can argue that skipping the entire
> > > transaction is better from the consistency point of view but I think
> > > it is already possible that we just skip a particular update/delete
> > > (if the corresponding tuple doesn't exist on the subscriber). For the
> > > sake of simplicity, we can just allow providing xid at this stage and
> > > then extend it later as required but I am not very sure of that point.
> >
> > +1
> >
> > Skipping a whole transaction by specifying xid would be a good start.
> >
>
> Okay, that sounds reasonable, so let's do that for now.

I'll submit the patch tomorrow.

While updating the patch, I realized that skipping a transaction that
is prepared on the publisher will be a bit tricky:

First of all, since skip_xid is in the pg_subscription catalog, we need
to do a catalog update in a transaction and commit it in order to clear
the setting. I think we need to set the origin LSN and timestamp of the
transaction being skipped on the transaction that does the catalog
update. That is, while skipping the (non-prepared) transaction, we skip
all data-modification changes coming from the publisher, do the catalog
update, and commit that transaction. If we instead did the catalog
update in the next transaction after skipping the whole transaction,
skip_xid could be left behind if the server crashed between them. Also,
we cannot set the origin LSN and timestamp on an empty transaction.

In the prepared-transaction case, I think that when handling a prepare
message, we need to commit the transaction that updates the catalog,
instead of preparing it. Then, at commit-prepared and rollback-prepared
time, we skip the operation, since the prepared transaction does not
exist on the subscriber. Handling of rollback prepared already behaves
this way: it first checks whether we have prepared the transaction and
skips it if we haven't. So I think we need to do the same for the
commit prepared case. This in turn requires protocol changes so that
the subscriber can get the prepare LSN and prepare time when handling
commit prepared.

So I'm writing a separate patch to add the prepare LSN and timestamp to
the commit_prepared message, which will be a building block for skipping
prepared transactions. Actually, I think it's beneficial even today: we
can skip preparing the transaction if it's an empty transaction.
Although the comment below says it's not a common case, I think that it
could happen quite often in some cases:

* XXX, We can optimize such that at commit prepared time, we first check
* whether we have prepared the transaction or not but that doesn't seem
* worthwhile because such cases shouldn't be common.
*/

For example, if the publisher serves multiple subscriptions, and many
prepared transactions modify a particular table that only one of those
subscriptions subscribes to, many empty prepared transactions get
replicated to the other subscribers.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Alter all tables in schema owner fix

2021-12-07 Thread Greg Nancarrow
On Tue, Dec 7, 2021 at 9:01 PM Amit Kapila  wrote:
>
> Thanks, the patch looks mostly good to me. I have slightly modified it
> to incorporate one of Michael's suggestions, ran pgindent, and
> modified the commit message.
>

LGTM, except in the patch commit message I'd change "Create
Publication" to "CREATE PUBLICATION".


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Non-superuser subscription owners

2021-12-07 Thread Amit Kapila
On Mon, Dec 6, 2021 at 9:26 PM Mark Dilger  wrote:
>
> > On Dec 6, 2021, at 2:19 AM, Amit Kapila  wrote:
> >
> >>> If we want to maintain the property that subscriptions can only be
> >>> owned by superuser
>
> We don't want to maintain such a property, or at least, that's not what I 
> want.  I don't think that's what Jeff wants, either.
>
> To clarify, I'm not entirely sure how to interpret the verb "maintain" in 
> your question, since before the patch the property does not exist, and after 
> the patch, it continues to not exist.  We could *add* such a property, of 
> course, though this patch does not attempt any such thing.
>

Okay, let me try to explain again. Following is the text from docs
[1]: " (a) To create a subscription, the user must be a superuser. (b)
The subscription apply process will run in the local database with the
privileges of a superuser. (c) Privileges are only checked once at the
start of a replication connection. They are not re-checked as each
change record is read from the publisher, nor are they re-checked for
each change when applied."

My understanding is that we want to improve what is written as (c),
which I think is the same as what you mentioned later as "Fix the
current bug wherein subscription changes are applied with superuser
force after the subscription owner has superuser privileges revoked."
Am I correct so far? If so, I think what I am suggesting should fix
this, with the assumption that we still want to follow (b), at least
for the first patch. One possibility is that our understanding of the
first problem is the same, but you want to allow the apply worker to
keep running even when superuser privileges are revoked, provided the
user it is running as has appropriate privileges on the objects being
accessed by the apply worker.

We will talk about other points of the roadmap you mentioned once our
understanding for the first one matches.

[1] - https://www.postgresql.org/docs/devel/logical-replication-security.html

-- 
With Regards,
Amit Kapila.




Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-07 Thread Dilip Kumar
On Tue, Dec 7, 2021 at 10:29 AM Nikolay Samokhvalov
 wrote:
>
> On Mon, Dec 6, 2021 at 8:16 PM Dilip Kumar  wrote:
>>
>> If the subtransaction cache is overflowed in some of the transactions
>> then it will affect all the concurrent queries as they need to access
>> the SLRU for checking the visibility of each tuple.  But currently
>> there is no way to identify whether in any backend subtransaction is
>> overflowed or what is the current active subtransaction count.
>
>
> I think it's a good idea – had the same need when recently researching 
> various issues with subtransactions [1], needed to patch Postgres in 
> benchmarking environments. To be fair, there is a way to understand that the 
> overflowed state is reached for PG 13+ – on standbys, observe reads in 
> Subtrans in pg_stat_slru. But of course, it's an indirect way.

Yeah right.
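
(For reference, that indirect check might look like the query below on a
v13+ standby; "Subtrans" is the name pg_stat_slru uses for this cache.)

SELECT name, blks_zeroed, blks_hit, blks_read
FROM pg_stat_slru
WHERE name = 'Subtrans';
-- steadily growing blks_read here suggests that backends are having to
-- look up subtransaction status on disk, i.e. some transaction has
-- overflowed its subxid cache.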

> I see that the patch adds two new columns to pg_stat_activity: subxact_count 
> and subxact_overflowed. This should be helpful to have. Additionally, 
> exposing the lastOverflowedXid value would be also good for troubleshooting 
> of subtransaction edge and corner cases – a bug recently fixed in all current 
> versions [2] was really tricky to troubleshoot in production because this 
> value is not visible to DBAs.

Yeah, we can show this too. Although we need to take ProcArrayLock in
shared mode to read it, that will only be done on the user's request,
so it should not be an issue IMHO.

I will post an updated patch soon, addressing the comments given by
Zhihong Yu and Justin.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-07 Thread Dilip Kumar
On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby  wrote:

Thanks for the review. I will work on these comments.

> > +  
> > +   subxact_count xid
> > +  
> > +  
> > +   The current backend's active subtransactions count.
>
> subtransaction (no s)
>
> > +   Set to true if current backend's subtransaction cache is overflowed.
>
> Say "has overflowed"
>
> > + if (local_beentry->subxact_count > 0)
> > + {
> > + values[30] = local_beentry->subxact_count;
> > + values[31] = local_beentry->subxact_overflowed;
> > + }
> > + else
> > + {
> > + nulls[30] = true;
> > + nulls[31] = true;
> > + }
>
> Why is the subxact count set to NULL instead of zero ?

> You added this to pg_stat_activity, which already has a lot of fields.
> We talked a few months ago about not adding more fields that weren't commonly
> used.
> https://www.postgresql.org/message-id/flat/20210426191811.sp3o77doinphyjhu%40alap3.anarazel.de#d96d0a116f0344301eead2676ea65b2e
>
> Since I think this field is usually not interesting to most users of
> pg_stat_activity, maybe this should instead be implemented as a function like
> pg_backend_get_subxact_status(pid).
>
> People who want to could use it like:
> SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) sub;

Yeah, this is a valid point, I will change this accordingly.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Alter all tables in schema owner fix

2021-12-07 Thread Amit Kapila
On Tue, Dec 7, 2021 at 8:21 AM vignesh C  wrote:
>
> Thanks for your comments, I have made the changes. Additionally I have
> renamed IsSchemaPublication to is_schema_publication for keeping the
> naming similar around the code. The attached v3 patch has the changes
> for the same.
>

Thanks, the patch looks mostly good to me. I have slightly modified it
to incorporate one of Michael's suggestions, ran pgindent, and
modified the commit message.

I am planning to push the attached tomorrow unless there are further
comments. Michael, do let me know if you have any questions or
objections about this?

-- 
With Regards,
Amit Kapila.


v4-0001-Fix-changing-the-ownership-of-ALL-TABLES-IN-SCHEM.patch
Description: Binary data

