Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
At Wed, 8 Dec 2021 11:47:30 +0530, Bharath Rupireddy wrote in
> On Wed, Dec 8, 2021 at 10:59 AM Bossart, Nathan wrote:
> > >> Another option we might want to consider is to just skip updating the state entirely for end-of-recovery checkpoints. The state would instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION. I don't know if it's crucial to have a dedicated control file state for end-of-recovery checkpoints.

FWIW I find it simple but sufficient, since I regarded the end-of-recovery checkpoint as a part of recovery. In that case the only strange thing here is that the state transition passes through the DB_SHUTDOWN(ING/ED) states. On the other hand, when a server is going to shut down, the state stays at DB_IN_PRODUCTION while there are lingering clients, even if the shutdown procedure has already started and no new clients can connect to the server. There's no reason we need to be so particular about states for recovery-end.

> > > Please note that end-of-recovery can take a while in production systems (we have observed such things working with our customers) and anything can happen during that period of time. The end-of-recovery checkpoint is not something that gets finished momentarily. Therefore, having a separate db state in the control file is useful.
> >
> > Is there some useful distinction between the states for users? ISTM that users will be waiting either way, and I don't know that an extra control file state will help all that much. The main reason I bring up this option is that the list of states is pretty short and appears to be intended to indicate the high-level status of the server. Most of the states are over 20 years old, and the newest one is over 10 years old, so I don't think new states can be added willy-nilly.
>
> Firstly, updating the control file with "DB_SHUTDOWNING" and "DB_SHUTDOWNED" for the end-of-recovery checkpoint is wrong. I don't think having DB_IN_CRASH_RECOVERY for the end-of-recovery checkpoint is a great idea. We have a checkpoint (which most of the time takes a while) in between the states DB_IN_CRASH_RECOVERY and DB_IN_PRODUCTION. The state DB_IN_END_OF_RECOVERY_CHECKPOINT added by the v1 patch at [1] (in this thread) helps users to understand and clearly distinguish what state the db is in.
>
> IMHO, the age of the code doesn't stop us adding/fixing/improving the code.
>
> > Of course, I could be off-base and others might agree that this new state would be nice to have.
>
> Let's see what others have to say about this.

I see it as a bit too complex for the advantage. When the end-of-recovery checkpoint takes so long, that state is shown in the server log, which operators would look into before the control file.

> [1] - https://www.postgresql.org/message-id/CALj2ACVn5M8xgQ3RD%3D6rSTbbXRBdBWZ%3DTTOBOY_5%2BedMCkWjHA%40mail.gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
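To make the state-transition question above concrete, here is a small, self-contained C sketch (illustrative only — the enum and helper are simplified stand-ins, not the actual pg_control.h/xlog.c code) of the distinction being debated: a shutdown checkpoint legitimately advertises DB_SHUTDOWNING, while an end-of-recovery checkpoint, treated as part of recovery, would leave the recovery state untouched:

```c
#include <assert.h>
#include <stdbool.h>

/* Control-file states, mirroring pg_control.h in name only (values illustrative). */
typedef enum DBState
{
	DB_SHUTDOWNED,
	DB_SHUTDOWNING,
	DB_IN_CRASH_RECOVERY,
	DB_IN_ARCHIVE_RECOVERY,
	DB_IN_PRODUCTION
} DBState;

/*
 * Hypothetical helper: which state should the control file show while a
 * checkpoint runs?  A shutdown checkpoint really is "shutting down", but an
 * end-of-recovery checkpoint is part of recovery, so reusing DB_SHUTDOWNING
 * for it (as the thread points out) is misleading; here the current recovery
 * state is simply kept.
 */
static DBState
state_during_checkpoint(bool shutdown_checkpoint, DBState current)
{
	if (shutdown_checkpoint)
		return DB_SHUTDOWNING;
	/* end-of-recovery checkpoint: stay in the recovery state */
	return current;
}
```

This models the "skip updating the state entirely" option; the alternative in the v1 patch would instead return a dedicated DB_IN_END_OF_RECOVERY_CHECKPOINT value here.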
Re: Make pg_waldump report replication origin ID, LSN, and timestamp.
On Mon, Dec 06, 2021 at 11:24:09PM +0900, Masahiko Sawada wrote:
> On Mon, Dec 6, 2021 at 5:09 PM Michael Paquier wrote:
>> Shouldn't you check for parsed.origin_lsn instead? The replication origin is stored there as far as I read EndPrepare().
>
> Also, looking at PrepareRedoAdd(), we check the replication origin id. So I think that it'd be better to check origin_id for consistency.

Okay, this consistency would make sense, then. Perhaps some comments should be added to tell that?

>> Commit records check after XACT_XINFO_HAS_ORIGIN, but xact_desc_abort() may include this information for ROLLBACK PREPARED transactions so we could use the same logic as xact_desc_commit() for the abort case, no?
>
> Good catch! I'll submit an updated patch.

Thanks!

--
Michael
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
On Wed, Dec 08, 2021 at 10:47:57AM +0530, Bharath Rupireddy wrote:
> Thanks for taking a look at the patch. How about the attached v4?
>
> I added a CF entry - https://commitfest.postgresql.org/36/3443/

+	else if (source == XLOG_FROM_STREAM)
+		ereport(LOG,
+				(errmsg("recovering WAL segment \"%s\" received from primary",
+						xlogfname)));

This is incorrect when using a cascading standby. And perhaps this could use a switch/case?

While quickly testing, I got reminded that the LOG for a segment retrieved from the local pg_wal would generate some noise when running for example the bootstrap process. Is this one really interesting compared to the two others?

--
Michael
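To illustrate the switch/case suggestion, here is a hypothetical sketch — the enum mirrors xlog.c's XLogSource in name only, and the message strings are placeholders (note the streamed case avoids hardcoding "from primary", which is wrong for a cascading standby):

```c
#include <assert.h>
#include <string.h>

/* WAL sources, mirroring xlog.c's XLogSource in name (illustrative subset). */
typedef enum XLogSource
{
	XLOG_FROM_ARCHIVE,
	XLOG_FROM_PG_WAL,
	XLOG_FROM_STREAM
} XLogSource;

/*
 * Hypothetical rewrite of the ereport() wording as a switch/case: one
 * message per source, and the streamed case names the generic upstream
 * rather than "primary" so it stays correct on a cascading standby.
 */
static const char *
recovery_source_message(XLogSource source)
{
	switch (source)
	{
		case XLOG_FROM_ARCHIVE:
			return "restored WAL segment from archive";
		case XLOG_FROM_PG_WAL:
			return "recovering WAL segment from pg_wal";
		case XLOG_FROM_STREAM:
			return "recovering WAL segment received from upstream WAL sender";
	}
	return "unknown WAL source";
}
```

The XLOG_FROM_PG_WAL branch could also be dropped entirely, per the concern above about bootstrap-time noise.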
Re: Skipping logical replication transactions on subscriber side
On Wed, Dec 8, 2021 at 3:50 PM Amit Kapila wrote:
> On Wed, Dec 8, 2021 at 11:48 AM Masahiko Sawada wrote:
> > On Wed, Dec 8, 2021 at 2:16 PM Amit Kapila wrote:
> > > On Tue, Dec 7, 2021 at 5:06 PM Masahiko Sawada wrote:
> > > > On Mon, Dec 6, 2021 at 2:17 PM Amit Kapila wrote:
> > > >
> > > > I'll submit the patch tomorrow.
> > > >
> > > > While updating the patch, I realized that skipping a transaction that is prepared on the publisher will be a bit tricky;
> > > >
> > > > First of all, since skip-xid is in the pg_subscription catalog, we need to do a catalog update in a transaction and commit it to disable it. I think we need to set the origin-lsn and timestamp of the transaction being skipped to the transaction that does the catalog update. That is, during skipping the (not prepared) transaction, we skip all data-modification changes coming from the publisher, do a catalog update, and commit the transaction. If we do the catalog update in the next transaction after skipping the whole transaction, skip_xid could be left in case of a server crash between them.
> > >
> > > But if we haven't updated origin_lsn/timestamp before the crash, won't it request the same transaction again from the publisher? If so, it will again be able to skip it because skip_xid is still not updated.
> >
> > Yes. I mean that if we update origin_lsn and origin_timestamp when committing the skipped transaction and then update the catalog in the next transaction, it doesn't work in case of a crash. But it's not possible in the first place since the first transaction is empty and we cannot set origin_lsn and origin_timestamp to it.
> >
> > > > Also, we cannot set origin-lsn and timestamp to an empty transaction.
> > >
> > > But won't we update the catalog for skip_xid in that case?
> >
> > Yes. Probably my explanation was not clear. Even if we skip all changes of the transaction, the transaction doesn't become empty since we update the catalog.
> >
> > > Do we see any advantage of updating the skip_xid in the same transaction vs. doing it in a separate transaction? If not then probably we can choose either of those ways and add some comments to indicate the possibility of doing it another way.
> >
> > I think that since the skipped transaction is always empty there is always one transaction. What we need to consider is when we update origin_lsn and origin_timestamp. In non-prepared transaction cases, the only option is when updating the catalog.
>
> Your last sentence is not completely clear to me but it seems you agree that we can use one transaction instead of two to skip the changes, perform a catalog update, and update origin_lsn/timestamp.

Yes.

> > > > In prepared transaction cases, I think that when handling a prepare message, we need to commit the transaction to update the catalog, instead of preparing it. And at the commit prepared and rollback prepared time, we skip it since there is not the prepared transaction on the subscriber.
> > >
> > > Can't we think of just allowing prepare in this case and updating the skip_xid only at commit time? I see that in this case, we would be doing prepare for a transaction that has no changes but as such cases won't be common, isn't that acceptable?
> >
> > In this case, we will end up committing both the prepared (empty) transaction and the transaction that updates the catalog, right?
>
> Can't we do this catalog update before committing the prepared transaction? If so, both in prepared and non-prepared cases, our implementation could be the same and we have a reason to accomplish the catalog update in the same transaction for which we skipped the changes.

But in case of a crash between these two transactions, given that skip_xid is already cleared, how do we know the prepared transaction that was supposed to be skipped?

> > If so, since these are separate transactions it can be a problem in case of a crash between these two commits.
> >
> > > > Currently, handling rollback prepared already behaves so; it first checks whether we have prepared the transaction or not and skips it if we haven't. So I think we need to do that also for the commit prepared case. With that, this requires protocol changes so that the subscriber can get prepare-lsn and prepare-time when handling commit prepared.
> > > >
> > > > So I'm writing a separate patch to add prepare-lsn and timestamp to the commit_prepared message, which will be a building block for skipping prepared transactions. Actually, I think it's beneficial even today; we can skip preparing the transaction if it's an empty transaction. Although the comment says it's not a common case,
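The single-transaction idea discussed above — skip the data changes, clear skip_xid via the catalog update, and advance the origin LSN, all in one atomic commit — can be modeled with a toy simulation. All names and structures below are hypothetical stand-ins, not the patch's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the subscriber-side state discussed above (names hypothetical). */
typedef struct SubState
{
	uint32_t	skip_xid;			/* transaction to skip; 0 = none */
	uint64_t	origin_lsn;			/* last applied origin LSN */
	int			applied_changes;	/* count of applied data changes */
} SubState;

/*
 * Apply one remote transaction.  When it matches skip_xid, the data changes
 * are discarded, but the catalog update (clearing skip_xid) and the origin
 * advance still happen within that same transaction, so a crash can never
 * leave skip_xid cleared without the origin having moved past the skipped
 * transaction — the crash-window problem raised in the thread.
 */
static void
apply_remote_xact(SubState *s, uint32_t xid, uint64_t commit_lsn, int nchanges)
{
	bool		skip = (s->skip_xid != 0 && s->skip_xid == xid);

	if (!skip)
		s->applied_changes += nchanges;	/* normal apply path */
	else
		s->skip_xid = 0;				/* "catalog update" inside the same xact */

	s->origin_lsn = commit_lsn;			/* origin advances either way, atomically */
}
```

The two-transaction alternative would split the last two assignments across separate commits, reintroducing exactly the crash window the thread is worried about.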
Re: Make message at end-of-recovery less scary.
At Tue, 09 Nov 2021 16:27:51 +0900 (JST), Kyotaro Horiguchi wrote in
> This is the updated version.
>
> - emode_for_corrupt_record() now uses currentSource instead of readSource.
>
> - If zero record length is faced, make sure the whole header is zeroed before deciding it as the end-of-WAL.
>
> - Do not overwrite the existing message when a missing contrecord is detected. The message added here is seen in the TAP test log 026_overwrite_contrecord_standby.log

d2ddfa681db27a138acb63c8defa8cc6fa588922 removed the global variables ReadRecPtr and EndRecPtr. This is a rebased version that reads the LSNs directly from xlogreader instead of the removed global variables.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From cc521692a9f98fabde07e248b63f1222f8406de1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v7] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message like "invalid record length" that suggests something serious is happening. Actually, if recovery meets a record with length = 0, that usually means it has finished applying all available WAL records. Make this message less scary, as "reached end of WAL". Instead, raise the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c       | 89 ++++++++++++++++++++-----
 src/backend/access/transam/xlogreader.c | 42 ++++++++++++
 src/backend/replication/walreceiver.c   |  3 +-
 src/include/access/xlogreader.h         |  1 +
 4 files changed, 112 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d894af310a..fa435faec4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4469,6 +4469,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;

 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -4484,6 +4485,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * NULL ReadRecPtr means we could not read a record at
+				 * beginning. In that case EndRecPtr is storing the LSN of the
+				 * record we tried to read.
+				 */
+				ErrRecPtr =
+					xlogreader->ReadRecPtr ?
+					xlogreader->ReadRecPtr : xlogreader->EndRecPtr;
 			}

 			if (readFile >= 0)
@@ -4493,12 +4506,11 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}

 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we get here for other than end-of-wal, emit the error message
+			 * right now. Otherwise the message if any is shown as a part of
+			 * the end-of-WAL message below.
 			 */
-			if (errormsg)
+			if (!xlogreader->EndOfWAL && errormsg)
 				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
@@ -4530,11 +4542,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4547,11 +4560,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency. So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4599,12 +4618,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}

-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * If we haven't emit an error message, we have safely reached the
+			 *
Re: Skipping logical replication transactions on subscriber side
On Wed, Dec 8, 2021 at 11:48 AM Masahiko Sawada wrote:
> On Wed, Dec 8, 2021 at 2:16 PM Amit Kapila wrote:
> > On Tue, Dec 7, 2021 at 5:06 PM Masahiko Sawada wrote:
> > > On Mon, Dec 6, 2021 at 2:17 PM Amit Kapila wrote:
> > >
> > > I'll submit the patch tomorrow.
> > >
> > > While updating the patch, I realized that skipping a transaction that is prepared on the publisher will be a bit tricky;
> > >
> > > First of all, since skip-xid is in the pg_subscription catalog, we need to do a catalog update in a transaction and commit it to disable it. I think we need to set the origin-lsn and timestamp of the transaction being skipped to the transaction that does the catalog update. That is, during skipping the (not prepared) transaction, we skip all data-modification changes coming from the publisher, do a catalog update, and commit the transaction. If we do the catalog update in the next transaction after skipping the whole transaction, skip_xid could be left in case of a server crash between them.
> >
> > But if we haven't updated origin_lsn/timestamp before the crash, won't it request the same transaction again from the publisher? If so, it will again be able to skip it because skip_xid is still not updated.
>
> Yes. I mean that if we update origin_lsn and origin_timestamp when committing the skipped transaction and then update the catalog in the next transaction, it doesn't work in case of a crash. But it's not possible in the first place since the first transaction is empty and we cannot set origin_lsn and origin_timestamp to it.
>
> > > Also, we cannot set origin-lsn and timestamp to an empty transaction.
> >
> > But won't we update the catalog for skip_xid in that case?
>
> Yes. Probably my explanation was not clear. Even if we skip all changes of the transaction, the transaction doesn't become empty since we update the catalog.
>
> > Do we see any advantage of updating the skip_xid in the same transaction vs. doing it in a separate transaction? If not then probably we can choose either of those ways and add some comments to indicate the possibility of doing it another way.
>
> I think that since the skipped transaction is always empty there is always one transaction. What we need to consider is when we update origin_lsn and origin_timestamp. In non-prepared transaction cases, the only option is when updating the catalog.

Your last sentence is not completely clear to me but it seems you agree that we can use one transaction instead of two to skip the changes, perform a catalog update, and update origin_lsn/timestamp.

> > > In prepared transaction cases, I think that when handling a prepare message, we need to commit the transaction to update the catalog, instead of preparing it. And at the commit prepared and rollback prepared time, we skip it since there is not the prepared transaction on the subscriber.
> >
> > Can't we think of just allowing prepare in this case and updating the skip_xid only at commit time? I see that in this case, we would be doing prepare for a transaction that has no changes but as such cases won't be common, isn't that acceptable?
>
> In this case, we will end up committing both the prepared (empty) transaction and the transaction that updates the catalog, right?

Can't we do this catalog update before committing the prepared transaction? If so, both in prepared and non-prepared cases, our implementation could be the same and we have a reason to accomplish the catalog update in the same transaction for which we skipped the changes.

> If so, since these are separate transactions it can be a problem in case of a crash between these two commits.
>
> > > Currently, handling rollback prepared already behaves so; it first checks whether we have prepared the transaction or not and skips it if we haven't. So I think we need to do that also for the commit prepared case. With that, this requires protocol changes so that the subscriber can get prepare-lsn and prepare-time when handling commit prepared.
> > >
> > > So I'm writing a separate patch to add prepare-lsn and timestamp to the commit_prepared message, which will be a building block for skipping prepared transactions. Actually, I think it's beneficial even today; we can skip preparing the transaction if it's an empty transaction. Although the comment says it's not a common case, I think that it could happen quite often in some cases:
> > >
> > > * XXX, We can optimize such that at commit prepared time, we first check
> > > * whether we have prepared the transaction or not but that doesn't seem
> > > * worthwhile because such cases shouldn't be common.
> > > */
> > >
> > > For example, if the publisher has multiple subscriptions and there are many prepared
Re: row filtering for logical replication
On Mon, Dec 6, 2021 at 6:04 PM Euler Taveira wrote:
>
> On Mon, Dec 6, 2021, at 3:35 AM, Dilip Kumar wrote:
>
> On Mon, Dec 6, 2021 at 6:49 AM Euler Taveira wrote:
> >
> > On Fri, Dec 3, 2021, at 8:12 PM, Euler Taveira wrote:
> >
> > PS> I will update the commit message in the next version. I barely changed the documentation to reflect the current behavior. I probably missed some changes but I will fix them in the next version.
> >
> > I realized that I forgot to mention a few things about the UPDATE behavior. Regardless of 0003, we need to define which tuple will be used to evaluate the row filter for UPDATEs. We already discussed it circa [1]. This current version chooses the *new* tuple. Is it the best choice?
>
> But with 0003, we are using both tuples for evaluating the row filter, so instead of fixing 0001, why don't we just merge 0003 with 0001? I mean eventually, 0003 is doing what is the agreed behavior, i.e. if just OLD matches the filter then convert the UPDATE to DELETE, OTOH if only NEW matches the filter then convert the UPDATE to INSERT. Do you think that even if we merge 0001 and 0003 there is still an open issue regarding which row to select for the filter?
>
> Maybe I was not clear. IIUC we are still discussing 0003 and I would like to propose a different default based on the conclusion I came up with. If we merge 0003, that's fine; this change will be useless. If we don't, or it is optional, it still has its merit.
>
> Do we want to pay the overhead of evaluating both tuples for UPDATEs? I'm still processing whether it is worth it. If you think that in general the row filter contains the primary key and it is rare to change it, it will waste cycles evaluating the same expression twice. It seems this behavior could be controlled by a parameter.

I think the first thing we should do in this regard is to evaluate the performance for both cases (when we apply a filter to both tuples vs. to one of the tuples). In case the performance difference is unacceptable, I think it would be better to still compare both tuples as the default to avoid data inconsistency issues and have an option to allow comparing one of the tuples.

--
With Regards,
Amit Kapila.
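The agreed UPDATE transformation described above (evaluate the row filter against both tuples; convert to DELETE when only the old tuple matches and to INSERT when only the new one matches) reduces to a small decision table. A hedged sketch with hypothetical names, not the patch's actual code:

```c
#include <assert.h>
#include <stdbool.h>

/* Possible publish actions for an UPDATE, per the behavior discussed above. */
typedef enum PublishAction
{
	PUBLISH_SKIP,		/* neither tuple matches: publish nothing */
	PUBLISH_INSERT,		/* old doesn't match, new does: the row enters the set */
	PUBLISH_UPDATE,		/* both match: an ordinary UPDATE */
	PUBLISH_DELETE		/* old matches, new doesn't: the row leaves the set */
} PublishAction;

/*
 * Decision table for an UPDATE when the row filter has been evaluated
 * against both the old and new tuple (the merged 0001+0003 behavior).
 */
static PublishAction
decide_update_action(bool old_matches, bool new_matches)
{
	if (old_matches && new_matches)
		return PUBLISH_UPDATE;
	if (old_matches)
		return PUBLISH_DELETE;
	if (new_matches)
		return PUBLISH_INSERT;
	return PUBLISH_SKIP;
}
```

Evaluating only the new tuple collapses this table to its right column, which is exactly where the data-inconsistency concern in the thread comes from: a row moving out of the filtered set would silently stop being replicated instead of being deleted downstream.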
Re: GUC flags
On Mon, Dec 06, 2021 at 07:36:55AM -0600, Justin Pryzby wrote:
> The script checks that guc.c and the sample config are consistent.
>
> I think your understanding of INTENTIONALLY_NOT_INCLUDED is not right. That's a list of stuff it "avoids reporting" as a suspected error, not an additional list of stuff to check. INTENTIONALLY_NOT_INCLUDED is a list of stuff like NOT_IN_SAMPLE, which is better done by parsing /NOT_IN_SAMPLE/.

Indeed. I got that wrong, thanks for clarifying.

> I saw that Tom updated it within the last 12 months, which I took to mean that it was still being maintained. But I'm okay with removing it.

Yes, I saw that as of bf8a662. With 42 incorrect reports, I see even more reason to remove it. Before doing anything, let's wait for and gather some opinions. I am adding Bruce (as the original author) and Tom in CC as they are the ones who have updated this script the most in the last ~15 years.

--
Michael
Re: Skipping logical replication transactions on subscriber side
On Wed, Dec 8, 2021 at 2:16 PM Amit Kapila wrote:
> On Tue, Dec 7, 2021 at 5:06 PM Masahiko Sawada wrote:
> > On Mon, Dec 6, 2021 at 2:17 PM Amit Kapila wrote:
> >
> > I'll submit the patch tomorrow.
> >
> > While updating the patch, I realized that skipping a transaction that is prepared on the publisher will be a bit tricky;
> >
> > First of all, since skip-xid is in the pg_subscription catalog, we need to do a catalog update in a transaction and commit it to disable it. I think we need to set the origin-lsn and timestamp of the transaction being skipped to the transaction that does the catalog update. That is, during skipping the (not prepared) transaction, we skip all data-modification changes coming from the publisher, do a catalog update, and commit the transaction. If we do the catalog update in the next transaction after skipping the whole transaction, skip_xid could be left in case of a server crash between them.
>
> But if we haven't updated origin_lsn/timestamp before the crash, won't it request the same transaction again from the publisher? If so, it will again be able to skip it because skip_xid is still not updated.

Yes. I mean that if we update origin_lsn and origin_timestamp when committing the skipped transaction and then update the catalog in the next transaction, it doesn't work in case of a crash. But it's not possible in the first place since the first transaction is empty and we cannot set origin_lsn and origin_timestamp to it.

> > Also, we cannot set origin-lsn and timestamp to an empty transaction.
>
> But won't we update the catalog for skip_xid in that case?

Yes. Probably my explanation was not clear. Even if we skip all changes of the transaction, the transaction doesn't become empty since we update the catalog.

> Do we see any advantage of updating the skip_xid in the same transaction vs. doing it in a separate transaction? If not then probably we can choose either of those ways and add some comments to indicate the possibility of doing it another way.

I think that since the skipped transaction is always empty there is always one transaction. What we need to consider is when we update origin_lsn and origin_timestamp. In non-prepared transaction cases, the only option is when updating the catalog.

> > In prepared transaction cases, I think that when handling a prepare message, we need to commit the transaction to update the catalog, instead of preparing it. And at the commit prepared and rollback prepared time, we skip it since there is not the prepared transaction on the subscriber.
>
> Can't we think of just allowing prepare in this case and updating the skip_xid only at commit time? I see that in this case, we would be doing prepare for a transaction that has no changes but as such cases won't be common, isn't that acceptable?

In this case, we will end up committing both the prepared (empty) transaction and the transaction that updates the catalog, right? If so, since these are separate transactions it can be a problem in case of a crash between these two commits.

> > Currently, handling rollback prepared already behaves so; it first checks whether we have prepared the transaction or not and skips it if we haven't. So I think we need to do that also for the commit prepared case. With that, this requires protocol changes so that the subscriber can get prepare-lsn and prepare-time when handling commit prepared.
> >
> > So I'm writing a separate patch to add prepare-lsn and timestamp to the commit_prepared message, which will be a building block for skipping prepared transactions. Actually, I think it's beneficial even today; we can skip preparing the transaction if it's an empty transaction. Although the comment says it's not a common case, I think that it could happen quite often in some cases:
> >
> > * XXX, We can optimize such that at commit prepared time, we first check
> > * whether we have prepared the transaction or not but that doesn't seem
> > * worthwhile because such cases shouldn't be common.
> > */
> >
> > For example, if the publisher has multiple subscriptions and there are many prepared transactions that modify the particular table subscribed by one publisher, many empty transactions are replicated to other subscribers.
>
> I think this is not clear to me. Why would one have multiple subscriptions for the same publication? I thought it is possible when say some publisher doesn't publish any data of a prepared transaction, say because the corresponding action is not published or something like that. I don't deny that someday we want to optimize this case but it might be better if we don't need to do it along with this patch.

I imagined that the publisher has two publications (say pub-A and pub-B) that publish a different set of relations in the database, and there are two subscribers that are
Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
On Wed, Dec 8, 2021 at 10:59 AM Bossart, Nathan wrote:
> >> Another option we might want to consider is to just skip updating the state entirely for end-of-recovery checkpoints. The state would instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION. I don't know if it's crucial to have a dedicated control file state for end-of-recovery checkpoints.
> >
> > Please note that end-of-recovery can take a while in production systems (we have observed such things working with our customers) and anything can happen during that period of time. The end-of-recovery checkpoint is not something that gets finished momentarily. Therefore, having a separate db state in the control file is useful.
>
> Is there some useful distinction between the states for users? ISTM that users will be waiting either way, and I don't know that an extra control file state will help all that much. The main reason I bring up this option is that the list of states is pretty short and appears to be intended to indicate the high-level status of the server. Most of the states are over 20 years old, and the newest one is over 10 years old, so I don't think new states can be added willy-nilly.

Firstly, updating the control file with "DB_SHUTDOWNING" and "DB_SHUTDOWNED" for the end-of-recovery checkpoint is wrong. I don't think having DB_IN_CRASH_RECOVERY for the end-of-recovery checkpoint is a great idea either. We have a checkpoint (which most of the time takes a while) in between the states DB_IN_CRASH_RECOVERY and DB_IN_PRODUCTION. The state DB_IN_END_OF_RECOVERY_CHECKPOINT added by the v1 patch at [1] (in this thread) helps users to understand and clearly distinguish what state the db is in.

IMHO, the age of the code doesn't stop us adding/fixing/improving the code.

> Of course, I could be off-base and others might agree that this new state would be nice to have.

Let's see what others have to say about this.

[1] - https://www.postgresql.org/message-id/CALj2ACVn5M8xgQ3RD%3D6rSTbbXRBdBWZ%3DTTOBOY_5%2BedMCkWjHA%40mail.gmail.com

Regards,
Bharath Rupireddy.
Re: Data is copied twice when specifying both child and parent table in publication
On Wed, Dec 8, 2021 at 11:30 AM vignesh C wrote:
>
> On Wed, Dec 8, 2021 at 11:11 AM Amit Kapila wrote:
> >
> > On Tue, Dec 7, 2021 at 5:53 PM vignesh C wrote:
> > >
> > > On Fri, Dec 3, 2021 at 11:24 AM houzj.f...@fujitsu.com wrote:
> > >
> > > 2) Any particular reason why the code and tests are backbranched but not the document changes?
> >
> > I am not sure whether we need the doc change or not as this is not a new feature and even if we need it as an improvement to docs, shall we consider backpatching it? I felt that code changes are required to fix a known issue so the case of backpatching it is clear.
>
> Thanks for the clarification, I got your point. I'm fine either way regarding the documentation change. The rest of the patch looks good to me.

Okay, I have also verified the code and test changes for all branches. I'll wait for a day to see if anybody else has any comments and then commit this.

--
With Regards,
Amit Kapila.
Re: row filtering for logical replication
On Thu, Dec 2, 2021 at 2:59 PM houzj.f...@fujitsu.com wrote: > ... > Attach the v44-0005 top-up patch. > This version addressed all the comments received so far, > mainly including the following changes: > 1) rename rfcol_valid_for_replica to rfcol_valid > 2) Remove the struct PublicationInfo and add the rfcol_valid flag directly in > relation > 3) report the invalid column number in the error message. > 4) Rename some function to match the usage. > 5) Fix some typos and add some code comments. > 6) Fix a miss in testcase. Below are my review comments for the most recent v44-0005 (top-up) patch: == 1. src/backend/executor/execReplication.c + invalid_rfcol = RelationGetInvalRowFilterCol(rel); + + /* + * It is only safe to execute UPDATE/DELETE when all columns of the row + * filters from publications which the relation is in are part of the + * REPLICA IDENTITY. + */ + if (invalid_rfcol != InvalidAttrNumber) + { It seemed confusing that when the invalid_rfcol is NOT invalid at all then it is InvalidAttrNumber, so perhaps this code would be easier to read if instead the condition was written just as: --- if (invalid_rfcol) { ... } --- 2. invalid_rfcol var name This variable name is used in a few places but I thought it was too closely named with the "rfcol_valid" variable even though it has a completely different meaning. IMO "invalid_rfcol" might be better named "invalid_rfcolnum" or something like that to reinforce that it is an AttributeNumber. 3. src/backend/utils/cache/relcache.c - function comment + * If not all the row filter columns are part of REPLICA IDENTITY, return the + * invalid column number, InvalidAttrNumber otherwise. + */ Minor rewording: "InvalidAttrNumber otherwise." --> "otherwise InvalidAttrNumber." 4. src/backend/utils/cache/relcache.c - function name +AttrNumber +RelationGetInvalRowFilterCol(Relation relation) IMO nothing was gained by saving 2 chars of the name. "RelationGetInvalRowFilterCol" --> "RelationGetInvalidRowFilterCol" 5. 
src/backend/utils/cache/relcache.c

+/* For invalid_rowfilter_column_walker. */
+typedef struct {
+    AttrNumber invalid_rfcol;
+    Bitmapset  *bms_replident;
+} rf_context;
+

The members should be commented.

6. src/include/utils/rel.h

 /*
+ * true if the columns of row filters from all the publications the
+ * relation is in are part of replica identity.
+ */
+ bool rd_rfcol_valid;

I felt the member comment is not quite telling the full story. e.g. IIUC this member is also true when the pubaction is something other than update/delete - but that case doesn't even do replica identity checking at all. There might not even be any replica identity.

7. src/test/regress/sql/publication.sql

 CREATE PUBLICATION testpub6 FOR TABLE rf_tbl_abcd_pk WHERE (a > 99);
+-- fail - "a" is in PK but it is not part of REPLICA IDENTITY INDEX
+update rf_tbl_abcd_pk set a = 1;
+DROP PUBLICATION testpub6;
 -- ok - "c" is not in PK but it is part of REPLICA IDENTITY INDEX
-SET client_min_messages = 'ERROR';
 CREATE PUBLICATION testpub6 FOR TABLE rf_tbl_abcd_pk WHERE (c > 99);
 DROP PUBLICATION testpub6;
-RESET client_min_messages;
--- fail - "a" is not in REPLICA IDENTITY INDEX
 CREATE PUBLICATION testpub6 FOR TABLE rf_tbl_abcd_nopk WHERE (a > 99);
+-- fail - "a" is not in REPLICA IDENTITY INDEX
+update rf_tbl_abcd_nopk set a = 1;

The "update" DML should be uppercase "UPDATE" for consistency with the surrounding tests.

--
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Data is copied twice when specifying both child and parent table in publication
On Wed, Dec 8, 2021 at 11:11 AM Amit Kapila wrote: > > On Tue, Dec 7, 2021 at 5:53 PM vignesh C wrote: > > > > On Fri, Dec 3, 2021 at 11:24 AM houzj.f...@fujitsu.com > > wrote: > > > > > > > 2) Any particular reason why the code and tests are backbranched but > > not the document changes? > > > > I am not sure whether we need the doc change or not as this is not a > new feature and even if we need it as an improvement to docs, shall we > consider backpatching it? I felt that code changes are required to fix > a known issue so the case of backpatching it is clear. Thanks for the clarification, I got your point. I'm fine either way regarding the documentation change. The rest of the patch looks good to me. Regards, Vignesh
Re: Data is copied twice when specifying both child and parent table in publication
On Tue, Dec 7, 2021 at 5:53 PM vignesh C wrote: > > On Fri, Dec 3, 2021 at 11:24 AM houzj.f...@fujitsu.com > wrote: > > > > 2) Any particular reason why the code and tests are backbranched but > not the document changes? > I am not sure whether we need the doc change or not as this is not a new feature and even if we need it as an improvement to docs, shall we consider backpatching it? I felt that code changes are required to fix a known issue so the case of backpatching it is clear. -- With Regards, Amit Kapila.
RE: [PATCH]Comment improvement in publication.sql
On Wednesday, December 8, 2021 1:49 PM, vignesh C wrote:
> The patch no longer applies, could you post a rebased patch.

Thanks for your kind reminder. Attached is a rebased patch. Some changes in the v4 patch have already been fixed by 5a28324, so I deleted those changes.

Regards,
Tang

v5-0001-Fix-comments-in-publication.sql.patch
Description: v5-0001-Fix-comments-in-publication.sql.patch
Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
On 12/7/21, 8:42 PM, "Bharath Rupireddy" wrote: > On Wed, Dec 8, 2021 at 9:49 AM Bossart, Nathan wrote: >> I think that's alright. The only other small suggestion I have would >> be to say "during end-of-recovery checkpoint" instead of "while in >> end-of-recovery checkpoint." > > "while in" is being used by DB_IN_CRASH_RECOVERY and > DB_IN_ARCHIVE_RECOVERY messages. I don't think it's a good idea to > deviate from that and use "during". Fair enough. I don't have a strong opinion about this. >> Another option we might want to consider is to just skip updating the >> state entirely for end-of-recovery checkpoints. The state would >> instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION. I >> don't know if it's crucial to have a dedicated control file state for >> end-of-recovery checkpoints. > > Please note that end-of-recovery can take a while in production > systems (we have observed such things working with our customers) and > anything can happen during that period of time. The end-of-recovery > checkpoint is not something that gets finished momentarily. Therefore, > having a separate db state in the control file is useful. Is there some useful distinction between the states for users? ISTM that users will be waiting either way, and I don't know that an extra control file state will help all that much. The main reason I bring up this option is that the list of states is pretty short and appears to be intended to indicate the high-level status of the server. Most of the states are over 20 years old, and the newest one is over 10 years old, so I don't think new states can be added willy-nilly. Of course, I could be off-base and others might agree that this new state would be nice to have. Nathan
Re: row filtering for logical replication
On Tue, Dec 7, 2021 at 6:31 PM Ashutosh Bapat wrote: > > On Tue, Dec 7, 2021 at 12:18 PM tanghy.f...@fujitsu.com > wrote: > > > > I have another problem with your patch. The document says: > > > > ... If the subscription has several publications in > > + which the same table has been published with different filters, those > > + expressions get OR'ed together so that rows satisfying any of the > > expressions > > + will be replicated. Notice this means if one of the publications has no > > filter > > + at all then all other filters become redundant. > > > > Then, what if one of the publications is specified as 'FOR ALL TABLES' or > > 'FOR > > ALL TABLES IN SCHEMA'. > > > > For example: > > create table tbl (a int primary key);" > > create publication p1 for table tbl where (a > 10); > > create publication p2 for all tables; > > create subscription sub connection 'dbname=postgres port=5432' publication > > p1, p2; > > Thanks for the example. I was wondering about this case myself. > I think we should handle this case. > > > > I think for "FOR ALL TABLE" publication(p2 in my case), table tbl should be > > treated as no filter, and table tbl should have no filter in subscription > > sub. Thoughts? > > > > But for now, the filter(a > 10) works both when copying initial data and > > later changes. > > > > To fix it, I think we can check if the table is published in a 'FOR ALL > > TABLES' > > publication or published as part of schema in function > > pgoutput_row_filter_init > > (which was introduced in v44-0003 patch), also we need to make some changes > > in > > tablesync.c. > > In order to check "FOR ALL_TABLES", we might need to fetch publication > metadata. > Do we really need to perform a separate fetch for this? In get_rel_sync_entry(), we already have this information, can't we someway stash that in the corresponding RelationSyncEntry so that same can be used later for row filtering. 
> Instead of that can we add a "TRUE" filter on all the tables > which are part of FOR ALL TABLES publication? > How? We won't have an entry for such tables in pg_publication_rel where we store row_filter information. -- With Regards, Amit Kapila.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
On Wed, Dec 8, 2021 at 10:05 AM Michael Paquier wrote: > > On Tue, Dec 07, 2021 at 06:28:24PM +0530, Bharath Rupireddy wrote: > > Upon thinking further, having at least the messages at LOG level [1] > > would be helpful to know what's happening with the system while in > > recovery. Although these messages at LOG level seem to be filling up > > the server logs, having a good log consumption and rotation mechanism > > (I'm sure every major postgres vendor would have one) would be > > sufficient to allay that concern. > > + ereport(LOG, > + (errmsg("recovering WAL segment \"%s\" from source \"%s\"", > + xlogfname, srcname))); > Isn't this approach an issue for translations? It seems to me that > terms like "stream" or "archive" had better be translated properly, > meaning that this needs a fully-constructed sentence. Thanks for taking a look at the patch. How about the attached v4? I added a CF entry - https://commitfest.postgresql.org/36/3443/ Regards, Bharath Rupireddy. v4-0001-add-log-messages-for-recovery.patch Description: Binary data
Re: Skipping logical replication transactions on subscriber side
On Tue, Dec 7, 2021 at 5:06 PM Masahiko Sawada wrote: > > On Mon, Dec 6, 2021 at 2:17 PM Amit Kapila wrote: > > I'll submit the patch tomorrow. > > While updating the patch, I realized that skipping a transaction that > is prepared on the publisher will be tricky a bit; > > First of all, since skip-xid is in pg_subscription catalog, we need to > do a catalog update in a transaction and commit it to disable it. I > think we need to set origin-lsn and timestamp of the transaction being > skipped to the transaction that does the catalog update. That is, > during skipping the (not prepared) transaction, we skip all > data-modification changes coming from the publisher, do a catalog > update, and commit the transaction. If we do the catalog update in the > next transaction after skipping the whole transaction, skip_xid could > be left in case of a server crash between them. > But if we haven't updated origin_lsn/timestamp before the crash, won't it request the same transaction again from the publisher? If so, it will be again able to skip it because skip_xid is still not updated. > Also, we cannot set > origin-lsn and timestamp to an empty transaction. > But won't we update the catalog for skip_xid in that case? Do we see any advantage of updating the skip_xid in the same transaction vs. doing it in a separate transaction? If not then probably we can choose either of those ways and add some comments to indicate the possibility of doing it another way. > In prepared transaction cases, I think that when handling a prepare > message, we need to commit the transaction to update the catalog, > instead of preparing it. And at the commit prepared and rollback > prepared time, we skip it since there is not the prepared transaction > on the subscriber. > Can't we think of just allowing prepare in this case and updating the skip_xid only at commit time? 
I see that in this case, we would be doing prepare for a transaction that has no changes but as such cases won't be common, isn't that acceptable? > Currently, handling rollback prepared already > behaves so; it first checks whether we have prepared the transaction > or not and skip it if haven’t. So I think we need to do that also for > commit prepared case. With that, this requires protocol changes so > that the subscriber can get prepare-lsn and prepare-time when handling > commit prepared. > > So I’m writing a separate patch to add prepare-lsn and timestamp to > commit_prepared message, which will be a building block for skipping > prepared transactions. Actually, I think it’s beneficial even today; > we can skip preparing the transaction if it’s an empty transaction. > Although the comment it’s not a common case, I think that it could > happen quite often in some cases: > > * XXX, We can optimize such that at commit prepared time, we first check > * whether we have prepared the transaction or not but that doesn't seem > * worthwhile because such cases shouldn't be common. > */ > > For example, if the publisher has multiple subscriptions and there are > many prepared transactions that modify the particular table subscribed > by one publisher, many empty transactions are replicated to other > subscribers. > I think this is not clear to me. Why would one have multiple subscriptions for the same publication? I thought it is possible when say some publisher doesn't publish any data of prepared transaction say because the corresponding action is not published or something like that. I don't deny that someday we want to optimize this case but it might be better if we don't need to do it along with this patch. -- With Regards, Amit Kapila.
Re: [PATCH]Comment improvement in publication.sql
On Sun, Aug 8, 2021 at 4:26 PM tanghy.f...@fujitsu.com wrote: > > On Sunday, August 8, 2021 6:34 PM, vignesh C wrote > >Thanks for the updated patch, your changes look good to me. You might > >want to include the commit message in the patch, that will be useful. > > Thanks for your kindly suggestion. > Updated patch attached. Added the patch commit message with no new fix. The patch no longer applies, could you post a rebased patch. Regards, Vignesh
Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
On Wed, Dec 8, 2021 at 9:49 AM Bossart, Nathan wrote: > > On 12/7/21, 5:21 PM, "Bharath Rupireddy" > wrote: > > On Wed, Dec 8, 2021 at 2:50 AM Bossart, Nathan wrote: > >> I noticed that some (but not all) of the surrounding messages say > >> "last known up at" the control file time. I'm curious why you chose > >> not to use that phrasing in this case. > > > > If state is DB_IN_END_OF_RECOVERY_CHECKPOINT that means the db was > > interrupted while in end-of-recovery checkpoint, so I used the > > phrasing similar to DB_IN_CRASH_RECOVERY and DB_IN_ARCHIVE_RECOVERY > > cases. I would like to keep it as-is (in the v1 patch) unless anyone > > has other thoughts here? > > (errmsg("database system was interrupted while in recovery at %s", > > (errmsg("database system was interrupted while in recovery at log time %s", > > I think that's alright. The only other small suggestion I have would > be to say "during end-of-recovery checkpoint" instead of "while in > end-of-recovery checkpoint." "while in" is being used by DB_IN_CRASH_RECOVERY and DB_IN_ARCHIVE_RECOVERY messages. I don't think it's a good idea to deviate from that and use "during". > Another option we might want to consider is to just skip updating the > state entirely for end-of-recovery checkpoints. The state would > instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION. I > don't know if it's crucial to have a dedicated control file state for > end-of-recovery checkpoints. Please note that end-of-recovery can take a while in production systems (we have observed such things working with our customers) and anything can happen during that period of time. The end-of-recovery checkpoint is not something that gets finished momentarily. Therefore, having a separate db state in the control file is useful. Regards, Bharath Rupireddy.
Re: parallel vacuum comments
On Wed, Dec 8, 2021 at 12:22 PM houzj.f...@fujitsu.com wrote:
>
> On Tuesday, December 7, 2021 1:42 PM Masahiko Sawada wrote:
> > I've attached an updated patch. I've removed 0003 patch that added
> > regression tests as per discussion. Regarding the terminology like "bulkdel"
> > and "cleanup" you pointed out, I've done that in 0002 patch while moving the
> > code to vacuumparallel.c. In that file, we can consistently use the terms
> > "bulkdel" and "cleanup" instead of "vacuum" and "cleanup".
>
> Hi,
>
> Thanks for updating the patch.
> I noticed few minor things.

Thank you for the comments!

> 0001
> 1)
>
>  * Skip processing indexes that are unsafe for workers (these are
> -* processed in do_serial_processing_for_unsafe_indexes() by leader)
> +* processed in parallel_vacuum_process_unsafe_indexes() by leader)
>
> It might be clearer to mention that the index to be skipped are unsafe OR not
> worthwhile.

Agreed. Will add the comments.

> 2)
> + /* Set index vacuum status and mark as parallel safe or not */
> + for (int i = 0; i < pvc->nindexes; i++)
> + {
> ...
> + pindstats->parallel_workers_can_process =
> +     parallel_vacuum_index_is_parallel_safe(vacrel,
> +                                            vacrel->indrels[i],
> +                                            vacuum);
>
> For the comments above the loop, maybe better to mention we are marking whether
> worker can process the index (not only safe/unsafe).

Right. Will fix.

> 0002
> 3)
>
> + /*
> +  * Skip indexes that are unsuitable target for parallel index vacuum
> +  */
> + if (parallel_vacuum_should_skip_index(indrel))
> +     continue;
> +
>
> It seems we can use will_parallel_vacuum[] here instead of invoking the
> function again.

Oops, I missed updating it in the 0002 patch. Will fix.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
On Tue, Dec 07, 2021 at 06:28:24PM +0530, Bharath Rupireddy wrote: > Upon thinking further, having at least the messages at LOG level [1] > would be helpful to know what's happening with the system while in > recovery. Although these messages at LOG level seem to be filling up > the server logs, having a good log consumption and rotation mechanism > (I'm sure every major postgres vendor would have one) would be > sufficient to allay that concern. + ereport(LOG, + (errmsg("recovering WAL segment \"%s\" from source \"%s\"", + xlogfname, srcname))); Isn't this approach an issue for translations? It seems to me that terms like "stream" or "archive" had better be translated properly, meaning that this needs a fully-constructed sentence. -- Michael signature.asc Description: PGP signature
Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
On 12/7/21, 5:21 PM, "Bharath Rupireddy" wrote: > On Wed, Dec 8, 2021 at 2:50 AM Bossart, Nathan wrote: >> I noticed that some (but not all) of the surrounding messages say >> "last known up at" the control file time. I'm curious why you chose >> not to use that phrasing in this case. > > If state is DB_IN_END_OF_RECOVERY_CHECKPOINT that means the db was > interrupted while in end-of-recovery checkpoint, so I used the > phrasing similar to DB_IN_CRASH_RECOVERY and DB_IN_ARCHIVE_RECOVERY > cases. I would like to keep it as-is (in the v1 patch) unless anyone > has other thoughts here? > (errmsg("database system was interrupted while in recovery at %s", > (errmsg("database system was interrupted while in recovery at log time %s", I think that's alright. The only other small suggestion I have would be to say "during end-of-recovery checkpoint" instead of "while in end-of-recovery checkpoint." Another option we might want to consider is to just skip updating the state entirely for end-of-recovery checkpoints. The state would instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION. I don't know if it's crucial to have a dedicated control file state for end-of-recovery checkpoints. Nathan
Re: parse_subscription_options - suggested improvements
On Wed, Dec 8, 2021 at 2:51 PM Michael Paquier wrote: > > On Tue, Dec 07, 2021 at 08:12:59AM +1100, Peter Smith wrote: > > On Tue, Dec 7, 2021 at 6:07 AM Bossart, Nathan wrote: > >> Attached a v14 with the initializations added back. > > > > LGTM. > > All the code paths previously covered still are, so applied this one. > Thanks! Thanks for pushing! -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?
On 12/7/21, 5:21 PM, "Bharath Rupireddy" wrote: > On Wed, Dec 8, 2021 at 4:17 AM Bossart, Nathan wrote: >> I agree with Tom. I would just s/server/backend/ (as per the >> attached) and call it a day. > > Thanks. v5 patch looks good to me. I've marked the commitfest entry as ready-for-committer. Nathan
Re: parse_subscription_options - suggested improvements
On Tue, Dec 07, 2021 at 08:12:59AM +1100, Peter Smith wrote: > On Tue, Dec 7, 2021 at 6:07 AM Bossart, Nathan wrote: >> Attached a v14 with the initializations added back. > > LGTM. All the code paths previously covered still are, so applied this one. Thanks! -- Michael signature.asc Description: PGP signature
Re: 回复:Re: Is it worth pushing conditions to sublink/subplan?
Hi,

On 12/7/21 10:44, 曾文旌(义从) wrote:
> Hi Hackers
>
> For my previous proposal, I developed a prototype and passed
> regression testing. It works similarly to subquery qual pushdown.
> We know that sublinks expand at the beginning of each level of the
> query. At this stage, the query's conditions and equivalence classes
> are not yet processed. But after generate_base_implied_equalities the
> conditions are processed, which is why a qual can be pushed down to a
> subquery but not to a sublink.
>
> My POC implementation chose to delay the sublink expansion in the
> SELECT clause (targetList) and WHERE clause. Specifically, it is
> delayed until after generate_base_implied_equalities. Thus, the
> equivalent conditions already established in the upper-level query can
> easily be obtained in the sublink expansion process (make_subplan).
>
> For example, if the upper-level query has a.id = 10 and the sublink
> query has a.id = b.id, then we get b.id = 10 and push it down to the
> sublink query. If b is a partitioned table and is partitioned by id,
> then a large number of unrelated subpartitions are pruned out. This
> saves a significant amount of planner and SQL execution time,
> especially if the partitioned table has a large number of
> subpartitions, which is exactly my target case.
>
> Currently, there are two SQL failures in the regression tests, because
> the expansion order of sublinks was changed; this does not affect the
> execution results of the SQL.
>
> Look forward to your suggestions on this proposal.

I took a quick look, and while I don't see / can't think of any problems
with delaying it until after generating implied equalities, there seem to
be a number of gaps.

1) Are there any regression tests exercising this modified behavior?
Maybe there are, but if the only changes are due to a change in the order
of targetlist entries, that doesn't seem like clear proof. It'd be good
to add a couple of tests exercising both the positive and negative cases
(i.e. when we can and can't push down a qual).
2) Apparently, contrib/postgres_fdw does crash like this:

#3  0x0077b412 in adjust_appendrel_attrs_mutator (node=0x13f7ea0, context=0x7fffc3351b30) at appendinfo.c:470
470             Assert(!IsA(node, SubLink));
(gdb) p node
$1 = (Node *) 0x13f7ea0
(gdb) p *node
$2 = {type = T_SubLink}

Backtrace attached.

3) Various parts of the patch really need at least some comments, like:

- try_push_outer_qual_to_sublink_query really needs some docs
- new stuff at the end of initsplan.c

4) In generate_base_implied_equalities, shouldn't this

    if (ec->ec_processed)
        ;

really be?

    if (ec->ec_processed)
        continue;

5) I'm not sure why we need the new ec_processed flag.

6) So we now have a lazy_process_sublink callback? Does that mean we
expand sublinks in two places - sometimes lazily, sometimes not?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Core was generated by `postgres: user contrib_regression [local] EXPLAIN '.
Program terminated with signal SIGABRT, Aborted.
#0  0x79a07cfd79e5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.32-10.fc33.x86_64 libgcc-10.3.1-1.fc33.x86_64
(gdb) bt
#0  0x79a07cfd79e5 in raise () from /lib64/libc.so.6
#1  0x79a07cfc08a4 in abort () from /lib64/libc.so.6
#2  0x0094f22a in ExceptionalCondition (conditionName=conditionName@entry=0xab44ca "!IsA(node, SubLink)", errorType=errorType@entry=0x9a2017 "FailedAssertion", fileName=fileName@entry=0xab55cf "appendinfo.c", lineNumber=lineNumber@entry=470) at assert.c:69
#3  0x0077b412 in adjust_appendrel_attrs_mutator (node=0x13f7ea0, context=0x7fffc3351b30) at appendinfo.c:470
#4  0x0071f8f9 in expression_tree_mutator (node=0x13f7f90, mutator=0x77ae20 , context=0x7fffc3351b30) at nodeFuncs.c:3240
#5  0x0077b025 in adjust_appendrel_attrs_mutator (node=0x13f7f90, context=0x7fffc3351b30) at appendinfo.c:390
#6  0x00720066 in expression_tree_mutator (node=0x13eca78, mutator=0x77ae20 , context=0x7fffc3351b30) at nodeFuncs.c:3109
#7  0x0077b512 in adjust_appendrel_attrs (root=root@entry=0x13be5b0, node=, nappinfos=nappinfos@entry=1, appinfos=appinfos@entry=0x7fffc3351bd0) at appendinfo.c:210
#8  0x0073b88c in set_append_rel_size (rte=, rti=4, rel=0xf19538, root=0x13be5b0) at allpaths.c:1056
#9  set_rel_size (root=0x13be5b0, rel=0xf19538, rti=4, rte=) at allpaths.c:386
#10 0x0073e250 in set_base_rel_sizes (root=) at allpaths.c:326
#11 make_one_rel (root=root@entry=0x13be5b0, joinlist=joinlist@entry=0x13edf10) at allpaths.c:188
#12 0x00763fde in query_planner (root=root@entry=0x13be5b0, qp_callback=qp_callback@entry=0x764eb0 , qp_extra=qp_extra@entry=0x7fffc3351d90, lps_callback=) at planmain.c:286
#13 0x007694fa in grouping_planner (root=, tuple_fraction=) at
RE: parallel vacuum comments
On Tuesday, December 7, 2021 1:42 PM Masahiko Sawada wrote:
> I've attached an updated patch. I've removed 0003 patch that added
> regression tests as per discussion. Regarding the terminology like "bulkdel"
> and "cleanup" you pointed out, I've done that in 0002 patch while moving the
> code to vacuumparallel.c. In that file, we can consistently use the terms
> "bulkdel" and "cleanup" instead of "vacuum" and "cleanup".

Hi,

Thanks for updating the patch. I noticed a few minor things.

0001
1)

  * Skip processing indexes that are unsafe for workers (these are
- * processed in do_serial_processing_for_unsafe_indexes() by leader)
+ * processed in parallel_vacuum_process_unsafe_indexes() by leader)

It might be clearer to mention that the indexes to be skipped are unsafe OR not worthwhile.

2)

+ /* Set index vacuum status and mark as parallel safe or not */
+ for (int i = 0; i < pvc->nindexes; i++)
+ {
...
+ pindstats->parallel_workers_can_process =
+     parallel_vacuum_index_is_parallel_safe(vacrel,
+                                            vacrel->indrels[i],
+                                            vacuum);

For the comments above the loop, maybe it is better to mention that we are marking whether a worker can process the index (not only safe/unsafe).

0002
3)

+ /*
+  * Skip indexes that are unsuitable target for parallel index vacuum
+  */
+ if (parallel_vacuum_should_skip_index(indrel))
+     continue;
+

It seems we can use will_parallel_vacuum[] here instead of invoking the function again.

Best regards,
Hou zj
Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?
On 12/8/21 02:54, Bharath Rupireddy wrote: > On Wed, Dec 8, 2021 at 4:07 AM Justin Pryzby wrote: >>> I'd bet squashing all of this into a single string (not really a flag) >>> will just mean people will have to parse it, etc. Keeping individual >>> boolean flags (or even separate string fields) would be better, I guess. >> >> Since the size of controldata is a concern, I suggest to add an int16 to >> populate with flags, which pg_controldata can parse for display. > > +1. I will come up with a patch soon. > >> Note that this other patch intends to add the timestamp and LSN of most >> recent >> recovery to controldata. >> https://commitfest.postgresql.org/36/3183/ > > Thanks. I will go through it separately. > >>> We already have some checkpoint info in pg_stat_bgwriter, but that's >>> just aggregated data, not very convenient for info about the current >>> checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing >>> various info about the current checkpoint? >> >> It could go into the pg_stat_checkpointer view, which would be the >> culmination >> of another patch (cc Melanie). >> https://commitfest.postgresql.org/36/3272/ > I don't think the pg_stat_checkpointer would be a good match - that's going to be an aggregated view of all past checkpoints, not a good source info about the currently running one. > +1 to have pg_stat_progress_checkpoint view. We have > CheckpointStatsData already. What we need is to make this structure > shared and add a little more info to represent the progress, so that > the other backends can access it. I think we can discuss this in a > separate thread to give it a fresh try rather than proposing this as a > part of another thread. I will spend some time on > pg_stat_progress_checkpoint proposal and try to come up with a > separate thread to discuss. 
> +1 to discuss it as part of this patch I'm not sure whether the view should look at CheckpointStatsData, or do the same thing as the other pg_stat_progress_* views - send the data to stat collector, and read it from there. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?
On Wed, Dec 8, 2021 at 7:34 AM Tomas Vondra wrote: > >> I agree it might be useful to provide information about the nature of > >> the checkpoint, and perhaps even PID of the backend that triggered it > >> (although that may be tricky, if the backend terminates). > > > > Thanks. I agree to have pg_stat_progress_checkpoint and yes PID of the > > triggered backend can possibly go there (we can mention in the > > documentation that the backend that triggered the checkpoint can get > > terminated). > > > > My concern is someone might run something that requires a checkpoint, so > we start it and put the PID into the catalog. And then the person aborts > the command and starts doing something else. But that does not abort the > checkpoint, but the backend now runs something that doesn't requite > checkpoint, which is rather confusing. > > >> I'm not sure about adding it to control data, though. That doesn't seem > >> like a very good match for something that's mostly for monitoring. > > > > Having it in the control data file (along with the existing checkpoint > > information) will be helpful to know what was the last checkpoint > > information and we can use the existing pg_control_checkpoint function > > or the tool to emit that info. I plan to add an int16 flag as > > suggested by Justin in this thread and come up with a patch. > > > > OK, although I'm not sure it's all that useful (if we have that in some > sort of system view). If the server is down, the control file will help. Since we already have the other checkpoint info in the control file, it's much more useful and sensible to have this extra piece of missing information (checkpoint kind) there. > >> We already have some checkpoint info in pg_stat_bgwriter, but that's > >> just aggregated data, not very convenient for info about the current > >> checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing > >> various info about the current checkpoint? 
> > > > +1 to have pg_stat_progress_checkpoint view to know what's going on > > with the current checkpoint. > > > > Do you plan to add it to this patch, or should it be a separate patch? No, I will put some more thoughts around it and start a new thread. Regards, Bharath Rupireddy.
Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?
On 12/8/21 02:54, Bharath Rupireddy wrote: > On Wed, Dec 8, 2021 at 3:48 AM Tomas Vondra > wrote: >> >> On 12/7/21 15:36, Bharath Rupireddy wrote: >>> Hi, >>> >>> Currently one can know the kind of on-going/last checkpoint (shutdown, >>> end-of-recovery, immediate, force etc.) only via server logs that to >>> when log_checkpoints GUC is on. At times, the users/service layer >>> components would want to know the kind of checkpoint (along with other >>> checkpoint related info) to take some actions and it will be a bit >>> difficult to search through the server logs. The checkpoint info can >>> be obtained from the control file (either by pg_control_checkpoint() >>> or by pg_controldata tool) whereas checkpoint kind isn't available >>> there. >>> >>> How about we add an extra string field to the control file alongside >>> the other checkpoint info it already has? This way, the existing >>> pg_control_checkpoint() or pg_controldata tool can easily be enhanced >>> to show the checkpoint kind as well. One concern is that we don't want >>> to increase the size of pg_controldata by more than the typical block >>> size (of 8K) to avoid any torn-writes. With this change, we might add >>> at max the strings specified at [1]. Adding it to the control file has >>> an advantage of preserving the last checkpoint kind which might be >>> useful. >>> >>> Thoughts? >>> >> >> I agree it might be useful to provide information about the nature of >> the checkpoint, and perhaps even PID of the backend that triggered it >> (although that may be tricky, if the backend terminates). > > Thanks. I agree to have pg_stat_progress_checkpoint and yes PID of the > triggered backend can possibly go there (we can mention in the > documentation that the backend that triggered the checkpoint can get > terminated). > My concern is someone might run something that requires a checkpoint, so we start it and put the PID into the catalog. 
And then the person aborts the command and starts doing something else. That does not abort the checkpoint, however, so the backend is now running something that doesn't require a checkpoint, which is rather confusing. >> I'm not sure about adding it to control data, though. That doesn't seem >> like a very good match for something that's mostly for monitoring. > > Having it in the control data file (along with the existing checkpoint > information) will be helpful to know what was the last checkpoint > information and we can use the existing pg_control_checkpoint function > or the tool to emit that info. I plan to add an int16 flag as > suggested by Justin in this thread and come up with a patch. > OK, although I'm not sure it's all that useful (if we have that in some sort of system view). >> We already have some checkpoint info in pg_stat_bgwriter, but that's >> just aggregated data, not very convenient for info about the current >> checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing >> various info about the current checkpoint? > > +1 to have pg_stat_progress_checkpoint view to know what's going on > with the current checkpoint. > Do you plan to add it to this patch, or should it be a separate patch? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Transparent column encryption
On 12/8/21 00:26, Jacob Champion wrote: > On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote: >> IMO it's impossible to solve this attack within TCE, because it requires >> ensuring consistency at the row level, but TCE obviously works at column >> level only. > > I was under the impression that clients already had to be modified to > figure out how to encrypt the data? If part of that process ends up > including enforcement of encryption for a specific column set, then the > addition of AEAD data could hypothetically be part of that hand- > waviness. > I think "transparency" here means the client just uses the regular prepared-statement API without having to explicitly encrypt/decrypt any data. The problem is we can't easily tie this to other columns in the table, because the client may not even know what values are in those columns. Imagine you do this UPDATE t SET encrypted_column = $1 WHERE another_column = $2; but you want to ensure the encrypted value belongs to a particular row (which may or may not be identified by the another_column value). How would the client do that? Should it fetch the value or what? Similarly, what if the client just does SELECT encrypted_column FROM t; How would it verify the values belong to the row, without having all the data for the row (or just the required columns)? > Unless "transparent" means that the client completely defers to the > server on whether to encrypt or not, and silently goes along with it if > the server tells it not to encrypt? I think that's probably a valid concern - a "bad DBA" could alter the table definition to not contain the "ENCRYPTED" bits, and then peek at the plaintext values. But it's not clear to me how exactly would the AEAD prevent this? Wouldn't that be also specified on the server, somehow? In which case the DBA could just tweak that too, no? 
In other words, this issue seems mostly orthogonal to the AEAD, and the right solution would be to allow the client to define which columns have to be encrypted (in which case altering the server definition would not be enough). > That would only protect against a > _completely_ passive DBA, like someone reading unencrypted backups, > etc. And that still has a lot of value, certainly. But it seems like > this prototype is very close to a system where the client can reliably > secure data even if the server isn't trustworthy, if that's a use case > you're interested in. > Right. IMHO the "passive attacker" is a perfectly fine model for use cases that would be fine with e.g. pgcrypto if there was no risk of leaking plaintext values to logs, system catalogs, etc. If we can improve it to provide (at least some) protection against active attackers, that'd be a nice bonus. >> I believe TCE can do AEAD at the column level, which protects against >> attacks that flipping bits, and similar attacks. It's just a matter of >> how the client encrypts the data. > > Right, I think authenticated encryption ciphers (without AD) will be > important to support in practice. I think users are going to want > *some* protection against active attacks. > >> Extending it to protect the whole row seems tricky, because the client >> may not even know the other columns, and it's not clear to me how it'd >> deal with things like updates of the other columns, hint bits, dropped >> columns, etc. > > Covering the entire row automatically probably isn't super helpful in > practice. As you mention later: > >> It's probably possible to get something like this (row-level AEAD) by >> encrypting enriched data, i.e. not just the card number, but {user ID, >> card number} or something like that, and verify that in the webapp. The >> problem of course is that the "user ID" is just another column in the >> table, and there's nothing preventing the DBA from modifying that too. > > Right. 
That's why the client has to be able to choose AD according to > the application. In my previous example, the victim's email address can > be copied by the DBA, but they wouldn't be able to authenticate as that > user and couldn't convince the client to use the plaintext on their > behalf. > Well, yeah. But I'm not sure how to make that work easily, because the client may not have the data :-( I was thinking about using a composite data type combining the data with the extra bits - that'd not be all that transparent as it'd require the client to build this manually and then also cross-check it after loading the data. So the user would be responsible for having all the data. But doing that automatically/transparently seems hard, because how would you deal e.g. with SELECT queries reading data through a view or CTE? How would you declare this, either at the client or server? Do any other databases have this capability? How do they do it? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?
On Wed, Dec 8, 2021 at 4:07 AM Justin Pryzby wrote: > > I'd bet squashing all of this into a single string (not really a flag) > > will just mean people will have to parse it, etc. Keeping individual > > boolean flags (or even separate string fields) would be better, I guess. > > Since the size of controldata is a concern, I suggest to add an int16 to > populate with flags, which pg_controldata can parse for display. +1. I will come up with a patch soon. > Note that this other patch intends to add the timestamp and LSN of most recent > recovery to controldata. > https://commitfest.postgresql.org/36/3183/ Thanks. I will go through it separately. > > We already have some checkpoint info in pg_stat_bgwriter, but that's > > just aggregated data, not very convenient for info about the current > > checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing > > various info about the current checkpoint? > > It could go into the pg_stat_checkpointer view, which would be the culmination > of another patch (cc Melanie). > https://commitfest.postgresql.org/36/3272/ +1 to have pg_stat_progress_checkpoint view. We have CheckpointStatsData already. What we need is to make this structure shared and add a little more info to represent the progress, so that the other backends can access it. I think we can discuss this in a separate thread to give it a fresh try rather than proposing this as a part of another thread. I will spend some time on pg_stat_progress_checkpoint proposal and try to come up with a separate thread to discuss. Regards, Bharath Rupireddy.
Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?
On Wed, Dec 8, 2021 at 3:48 AM Tomas Vondra wrote: > > On 12/7/21 15:36, Bharath Rupireddy wrote: > > Hi, > > > > Currently one can know the kind of on-going/last checkpoint (shutdown, > > end-of-recovery, immediate, force etc.) only via server logs that to > > when log_checkpoints GUC is on. At times, the users/service layer > > components would want to know the kind of checkpoint (along with other > > checkpoint related info) to take some actions and it will be a bit > > difficult to search through the server logs. The checkpoint info can > > be obtained from the control file (either by pg_control_checkpoint() > > or by pg_controldata tool) whereas checkpoint kind isn't available > > there. > > > > How about we add an extra string field to the control file alongside > > the other checkpoint info it already has? This way, the existing > > pg_control_checkpoint() or pg_controldata tool can easily be enhanced > > to show the checkpoint kind as well. One concern is that we don't want > > to increase the size of pg_controldata by more than the typical block > > size (of 8K) to avoid any torn-writes. With this change, we might add > > at max the strings specified at [1]. Adding it to the control file has > > an advantage of preserving the last checkpoint kind which might be > > useful. > > > > Thoughts? > > > > I agree it might be useful to provide information about the nature of > the checkpoint, and perhaps even PID of the backend that triggered it > (although that may be tricky, if the backend terminates). Thanks. I agree to have pg_stat_progress_checkpoint and yes PID of the triggered backend can possibly go there (we can mention in the documentation that the backend that triggered the checkpoint can get terminated). > I'm not sure about adding it to control data, though. That doesn't seem > like a very good match for something that's mostly for monitoring. 
Having it in the control data file (along with the existing checkpoint information) will be helpful for knowing the details of the last checkpoint, and we can use the existing pg_control_checkpoint function or the pg_controldata tool to emit that info. I plan to add an int16 flags field as suggested by Justin in this thread and come up with a patch. > We already have some checkpoint info in pg_stat_bgwriter, but that's > just aggregated data, not very convenient for info about the current > checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing > various info about the current checkpoint? +1 to have pg_stat_progress_checkpoint view to know what's going on with the current checkpoint. Regards, Bharath Rupireddy.
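To make the int16-flags idea concrete, here is a rough sketch (in Python, purely illustrative) of how pg_controldata might decode such a packed field for display. The bit values are modeled on the CHECKPOINT_* flags in xlog.h, but treat the exact values and the rendering as assumptions, not the actual on-disk encoding being proposed:

```python
# Illustrative decoding of a packed "checkpoint kind" flags value, in the
# spirit of the CHECKPOINT_* flag bits in xlog.h. Bit assignments here are
# assumptions for demonstration, not the authoritative encoding.

CHECKPOINT_FLAGS = {
    0x0001: "shutdown",
    0x0002: "end-of-recovery",
    0x0004: "immediate",
    0x0008: "force",
    0x0010: "flush-all",
    0x0020: "wait",
    0x0040: "wal",   # cause: WAL volume (CHECKPOINT_CAUSE_XLOG)
    0x0080: "time",  # cause: timeout (CHECKPOINT_CAUSE_TIME)
}

def describe_checkpoint(kind: int) -> str:
    """Render a packed flags value the way pg_controldata might display it."""
    names = [name for bit, name in sorted(CHECKPOINT_FLAGS.items()) if kind & bit]
    return " ".join(names) if names else "unknown"

print(describe_checkpoint(0x0001 | 0x0004))  # shutdown immediate
```

Keeping the field numeric in the control file and doing the string rendering in pg_controldata / pg_control_checkpoint() avoids both the size concern and the string-parsing concern Tomas raised.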
Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?
On Wed, Dec 8, 2021 at 4:17 AM Bossart, Nathan wrote: > > On 11/18/21, 8:27 PM, "Bharath Rupireddy" > wrote: > > Here's the v4 patch with the above changes, the output looks like [1]. > > Please review it further. > > I agree with Tom. I would just s/server/backend/ (as per the > attached) and call it a day. Thanks. v5 patch looks good to me. Regards, Bharath Rupireddy.
Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
On Wed, Dec 8, 2021 at 2:50 AM Bossart, Nathan wrote: > > On 12/7/21, 12:10 AM, "Bharath Rupireddy" > wrote: > > Here's a patch that I've come up with. Please see if this looks okay > > and let me know if we want to take it forward so that I can add a CF > > entry. > > Overall, the patch seems reasonable to me. Thanks for reviewing this. > + case DB_IN_END_OF_RECOVERY_CHECKPOINT: > + ereport(LOG, > + (errmsg("database system was > interrupted while in end-of-recovery checkpoint at %s", > + > str_time(ControlFile->time)))); > + break; > > I noticed that some (but not all) of the surrounding messages say > "last known up at" the control file time. I'm curious why you chose > not to use that phrasing in this case. If state is DB_IN_END_OF_RECOVERY_CHECKPOINT that means the db was interrupted while in end-of-recovery checkpoint, so I used phrasing similar to the DB_IN_CRASH_RECOVERY and DB_IN_ARCHIVE_RECOVERY cases. I would like to keep it as-is (in the v1 patch) unless anyone has other thoughts here? (errmsg("database system was interrupted while in recovery at %s", (errmsg("database system was interrupted while in recovery at log time %s", I added a CF entry here - https://commitfest.postgresql.org/36/3442/ Regards, Bharath Rupireddy.
Re: logical decoding and replication of sequences
On 12/7/21 15:38, Peter Eisentraut wrote: > I have checked the 0001 and 0003 patches. (I think we have dismissed > the approach in 0002 for now. And let's talk about 0004 later.) > Right, I think that's correct. > I have attached a few fixup patches, described further below. > > # 0001 > > The argument "create" for fill_seq_with_data() is always true (and > patch 0002 removes it). So I'm not sure if it's needed. If it is, it > should be documented somewhere. > Good point. I think it could be removed, but IIRC it'll be needed when adding sequence decoding to the built-in replication, and that patch is meant to be just an implementation of the API, without changing WAL. OTOH I don't see it in the last version of that patch (in ResetSequence2 call) so maybe it's not really needed. I've kept it for now, but I'll investigate. > About the comment you added in nextval_internal(): It's a good > explanation, so I would leave it in. I also agree with the > conclusion of the comment that the current solution is reasonable. We > probably don't need the same comment again in fill_seq_with_data() and > again in do_setval(). Note that both of the latter functions already > point to nextval_interval() for other comments, so the same can be > relied upon here as well. > True. I moved it a bit in nextval_internal() and removed the other copies. The existing references should be enough. > The order of the new fields sequence_cb and stream_sequence_cb is a > bit inconsistent compared to truncate_cb and stream_truncate_cb. > Maybe take another look to make the order more uniform. > You mean in OutputPluginCallbacks? I'd actually argue it's the truncate callbacks that are inconsistent - in the regular section truncate_cb is right before commit_cb, while in the streaming section it's at the end. Or what order do you think would be better? I'm fine with changing it. > Some documentation needs to be added to the Logical Decoding chapter. 
> I have attached a patch that I think catches all the places that need > to be updated, with some details left for you to fill in. ;-) The > ordering of the some of the documentation chunks reflects the order in > which the callbacks appear in the header files, which might not be > optimal; see above. > Thanks. I added a bit about the callbacks being optional and what the parameters mean (only for sequence_cb, as the stream_ callbacks generally don't copy that bit). > In reorderbuffer.c, you left a comment about how to access a sequence > tuple. There is an easier way, using Form_pg_sequence_data, which is > how sequence.c also does it. See attached patch. > Yeah, that looks much nicer. > # 0003 > > The tests added in 0003 don't pass for me. It appears that you might > have forgotten to update the expected files after you added some tests > or something. The cfbot shows the same. See attached patch for a > correction, but do check what your intent was. > Yeah. I suspect I removed the expected results due to the experimental rework. I haven't noticed that because some of the tests for the built-in replication are expected to fail. Attached is an updated version of the first two parts (infrastructure and test_decoding), with all your fixes merged. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom 999514b9f802db9b888543532355e9c36646da30 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Fri, 24 Sep 2021 00:41:33 +0200 Subject: [PATCH 1/2] Logical decoding of sequences This adds the infrastructure for logical decoding of sequences. Support for built-in logical replication and test_decoding is added separately. When decoding sequences, we differentiate between a sequences created in a (running) transaction, and sequences created in other (already committed) transactions. Changes for sequences created in the same top-level transaction are treated as "transactional" i.e. 
just like any other change from that transaction (and discarded in case of a rollback). Changes for sequences created earlier are treated as not transactional - they are processed immediately, as if performed outside any transaction (and thus not rolled back). This mixed behavior is necessary - sequences are non-transactional (e.g. ROLLBACK does not undo the sequence increments). But for new sequences, we need to handle them in a transactional way, because if we ever get some DDL support, the sequence won't exist until the transaction gets applied. So we need to ensure the increments don't happen until the sequence gets created. To differentiate which sequences are "old" and which were created in a still-running transaction, we track sequences created in running transactions in a hash table. Each sequence is identified by its relfilenode, and at transaction commit we discard all sequences created in that particular transaction. For each sequence we track the XID of the (sub)transaction that created it, and we clean up the sequences for each
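The bookkeeping described in the commit message above can be modeled minimally as follows (an illustrative Python sketch, not the actual reorderbuffer code; all names are invented, and the real implementation must also handle subtransactions and aborts):

```python
# Toy model of the mechanism described above: sequences created by
# still-running transactions are tracked by relfilenode, so changes to
# them are treated transactionally; changes to pre-existing sequences
# are applied immediately (non-transactionally).

created_in_xact: dict[int, int] = {}  # relfilenode -> creating (sub)xid

def record_sequence_create(relfilenode: int, xid: int) -> None:
    """Called when decoding a sequence creation in transaction `xid`."""
    created_in_xact[relfilenode] = xid

def change_is_transactional(relfilenode: int) -> bool:
    """A sequence change is transactional iff the sequence was created
    by a transaction that is still in progress."""
    return relfilenode in created_in_xact

def on_transaction_end(xid: int) -> None:
    """At commit/abort, discard tracking for sequences that transaction created."""
    for rnode in [r for r, x in created_in_xact.items() if x == xid]:
        del created_in_xact[rnode]

record_sequence_create(16394, 730)   # hypothetical relfilenode / xid
assert change_is_transactional(16394)
on_transaction_end(730)
assert not change_is_transactional(16394)
```

After the creating transaction ends, further increments of that sequence are seen as non-transactional, matching the "mixed behavior" the patch describes.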
Re: Add sub-transaction overflow status in pg_stat_activity
I also want to +1 this effort. Exposing subtransaction usage is very useful. It would also be extremely beneficial to add both subtransaction usage and overflow counters to pg_stat_database. Monitoring tools that capture deltas on pg_stat_database will be able to generate historical analysis and usage trends of subtransactions. On 12/7/21, 5:34 PM, "Bossart, Nathan" wrote: On 12/6/21, 8:19 PM, "Dilip Kumar" wrote: > If the subtransaction cache is overflowed in some of the transactions > then it will affect all the concurrent queries as they need to access > the SLRU for checking the visibility of each tuple. But currently > there is no way to identify whether in any backend subtransaction is > overflowed or what is the current active subtransaction count. > Attached patch adds subtransaction count and subtransaction overflow > status in pg_stat_activity. I have implemented this because of the > recent complain about the same[1] I'd like to give a general +1 to this effort. Thanks for doing this! I've actually created a function to provide this information in the past, so I will help review. Nathan
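For readers unfamiliar with the overflow mechanism Dilip describes, here is a toy model (Python, illustrative only) of the per-backend subxid cache. PGPROC_MAX_CACHED_SUBXIDS is 64 in current sources, but treat the structure and names here as assumptions rather than the actual PGPROC layout:

```python
# Toy model of the per-backend subtransaction cache whose count and
# overflow flag the proposed pg_stat_activity columns would expose.
# Once the cache overflows, visibility checks by *other* backends must
# fall back to the pg_subtrans SLRU, which is the performance concern.

PGPROC_MAX_CACHED_SUBXIDS = 64  # value in current sources; illustrative here

class BackendSubxidState:
    def __init__(self) -> None:
        self.subxid_count = 0
        self.overflowed = False

    def begin_subtransaction(self) -> None:
        if self.subxid_count >= PGPROC_MAX_CACHED_SUBXIDS:
            self.overflowed = True  # cache full: mark overflow, stop caching
        else:
            self.subxid_count += 1

state = BackendSubxidState()
for _ in range(70):  # e.g. a loop issuing 70 savepoints in one transaction
    state.begin_subtransaction()
print(state.subxid_count, state.overflowed)  # 64 True
```

Exposing exactly these two numbers per backend is what makes the overflow observable without digging through server internals.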
Re: Transparent column encryption
On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote: > IMO it's impossible to solve this attack within TCE, because it requires > ensuring consistency at the row level, but TCE obviously works at column > level only. I was under the impression that clients already had to be modified to figure out how to encrypt the data? If part of that process ends up including enforcement of encryption for a specific column set, then the addition of AEAD data could hypothetically be part of that hand- waviness. Unless "transparent" means that the client completely defers to the server on whether to encrypt or not, and silently goes along with it if the server tells it not to encrypt? That would only protect against a _completely_ passive DBA, like someone reading unencrypted backups, etc. And that still has a lot of value, certainly. But it seems like this prototype is very close to a system where the client can reliably secure data even if the server isn't trustworthy, if that's a use case you're interested in. > I believe TCE can do AEAD at the column level, which protects against > attacks that flipping bits, and similar attacks. It's just a matter of > how the client encrypts the data. Right, I think authenticated encryption ciphers (without AD) will be important to support in practice. I think users are going to want *some* protection against active attacks. > Extending it to protect the whole row seems tricky, because the client > may not even know the other columns, and it's not clear to me how it'd > deal with things like updates of the other columns, hint bits, dropped > columns, etc. Covering the entire row automatically probably isn't super helpful in practice. As you mention later: > It's probably possible to get something like this (row-level AEAD) by > encrypting enriched data, i.e. not just the card number, but {user ID, > card number} or something like that, and verify that in the webapp. 
The > problem of course is that the "user ID" is just another column in the > table, and there's nothing preventing the DBA from modifying that too. Right. That's why the client has to be able to choose AD according to the application. In my previous example, the victim's email address can be copied by the DBA, but they wouldn't be able to authenticate as that user and couldn't convince the client to use the plaintext on their behalf. --Jacob
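Jacob's associated-data argument can be illustrated with a toy encrypt-then-MAC sketch (Python stdlib only). This is not real encryption and not the proposed TCE design; a real client would use a proper AEAD cipher such as AES-GCM or AES-SIV. The only point demonstrated is why a ciphertext copied under different associated data fails to authenticate:

```python
import hashlib
import hmac
import os

# Toy illustration of associated data (AD): a MAC is computed over
# ciphertext || AD, so a value copied to a different account (different AD)
# fails verification. "Encryption" is a stand-in identity function here.

KEY = os.urandom(32)  # client-held key; the server never sees it

def seal(plaintext: bytes, associated_data: bytes) -> tuple[bytes, bytes]:
    ciphertext = plaintext  # stand-in; imagine real encryption here
    tag = hmac.new(KEY, ciphertext + associated_data, hashlib.sha256).digest()
    return ciphertext, tag

def unseal(ciphertext: bytes, tag: bytes, associated_data: bytes) -> bytes:
    expected = hmac.new(KEY, ciphertext + associated_data, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    return ciphertext  # stand-in; imagine real decryption here

ct, tag = seal(b"4242-4242-4242-4242", b"victim@example.com")
# Decrypting on behalf of the victim succeeds...
assert unseal(ct, tag, b"victim@example.com") == b"4242-4242-4242-4242"
# ...but the same ciphertext copied into the DBA's row fails, because the
# client supplies the DBA's identity as AD.
try:
    unseal(ct, tag, b"dba@example.com")
except ValueError:
    print("copied value rejected")
```

This matches Jacob's scenario: the DBA can copy the bytes inside the database, but cannot get a real client to decrypt them under their own identity.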
Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?
On Tue, Dec 7, 2021 at 1:59 PM Robert Haas wrote: > If we're only trying to decide whether or not to vacuum a table, we > don't need units: the output is a Boolean. I was imagining a world in which we preserve the autovacuum_vacuum_scale_factor design, but interpret it creatively (but never too creatively) -- an incremental approach seems best to me. We can even sanity check our abstract bloat unit calculation, in case the page-level sampling aggregates into a totally wild number of dead tuples (based in part on the current number of not-all-visible heap pages) -- so the abstract units are always anchored to the old idea of dead tuples. Maybe this isn't the best approach, but at least it addresses compatibility. *Any* approach based on sampling relatively few random blocks (to look for signs of bloat) is inherently prone to hugely underestimating the extent of bloat (which is what we see in TPC-C). I am primarily concerned about compensating for the inherent limitations that go with that. To me it seems inappropriate to make statistical inferences about dead tuples based on a random snapshot of random blocks (usually only a tiny minority). It is not only possible for the picture to change utterly -- it is routine, expected, and the whole entire point. The entire intellectual justification for statistical sampling (that mostly works for optimizer stats) just doesn't carry over to autovacuum stats, for many reasons. At the same time, I don't have any fundamentally better starting point. That's how I arrived at the idea of probabilistic modeling based on several recent snapshots from ANALYZE. The statistics are often rubbish, whether or not we like it, and regardless of how we decide to count things on each page. And so it's entirely reasonable to not limit the algorithm to concerns about the state of things -- the actual exposure of the system to harm (from overlooking harmful bloat) is also relevant. 
> If we're trying to decide > on an order in which to vacuum tables, then we need units. But such > units can't be anything related to dead tuples, because vacuum can be > needed based on XID age, or MXID age, or dead tuples. The units would > have to be something like abstract vacuum-urgency units (if higher is > more urgent) or abstract remaining-headroom-beform-catastrophe units > (if lower is more urgent). I like that idea. But I wonder if they should be totally unrelated. If we're close to the "emergency" XID threshold, and also close to the "bloat units" threshold, then it seems reasonable to put our finger on the scales, and do an autovacuum before either threshold is crossed. I'm not sure how that should work, but I find the idea of interpreting the "bloat units" creatively/probabilistically appealing. We're not actually making things up by erring in the direction of launching an autovacuum worker, because we don't actually know the number of dead tuples (or whatever) anyway -- we're just recognizing the very real role of chance and noise. That is, if the "bloat units" threshold might well not have been crossed due to random chance (noise, the phase of the moon), why should we defer to random chance? If we have better information to go on, like the thing with the XID threshold, why not prefer that? Similarly, if we see that the system as a whole is not very busy right now, why not consider that, just a little, if the only downside is that we'll ignore a demonstrably noise-level signal (from the stats)? That's the high level intuition behind making "bloat units" a probability density function, and not just a simple expected value. Teaching the autovacuum.c scheduler to distinguish signal from noise could be very valuable, if it enables opportunistic batching of work, or off-hours work. We don't have to respect noise. The devil is in the details, of course. -- Peter Geoghegan
Re: RecoveryInProgress() has critical side effects
On Sat, Dec 4, 2021 at 7:44 PM Michael Paquier wrote: > My main worry here is that this changes slightly the definition of > doPageWrites across stable branches at the end of recovery as there > could be records generated there. Note that GetFullPageWriteInfo() > gets called in XLogInsert(), while Insert->fullPageWrites gets updated > before CleanupAfterArchiveRecovery(). And it may influence > the value of doPageWrites in the startup process. But ... so what? All the code that uses it retries if the value that was tentatively used turns out to be wrong. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?
On Tue, Dec 07, 2021 at 11:18:37PM +0100, Tomas Vondra wrote: > On 12/7/21 15:36, Bharath Rupireddy wrote: > > Currently one can know the kind of on-going/last checkpoint (shutdown, > > end-of-recovery, immediate, force etc.) only via server logs that to > > when log_checkpoints GUC is on. > > The checkpoint info can be obtained from the control file (either by > > pg_control_checkpoint() or by pg_controldata tool) whereas checkpoint kind > > isn't available there. > > How about we add an extra string field to the control file alongside > > the other checkpoint info it already has? This way, the existing > > pg_control_checkpoint() or pg_controldata tool can easily be enhanced > > to show the checkpoint kind as well. One concern is that we don't want > > to increase the size of pg_controldata by more than the typical block > > size (of 8K) to avoid any torn-writes. With this change, we might add > > at max the strings specified at [1]. Adding it to the control file has > > an advantage of preserving the last checkpoint kind which might be > > useful. > > [1] for checkpoint: "checkpoint shutdown end-of-recovery immediate > > force wait wal time flush-all" > > for restartpoint: "restartpoint shutdown end-of-recovery immediate > > force wait wal time flush-all" > > I'd bet squashing all of this into a single string (not really a flag) > will just mean people will have to parse it, etc. Keeping individual > boolean flags (or even separate string fields) would be better, I guess. Since the size of controldata is a concern, I suggest to add an int16 to populate with flags, which pg_controldata can parse for display. Note that this other patch intends to add the timestamp and LSN of most recent recovery to controldata. https://commitfest.postgresql.org/36/3183/ > We already have some checkpoint info in pg_stat_bgwriter, but that's > just aggregated data, not very convenient for info about the current > checkpoint. 
So maybe we should add pg_stat_progress_checkpoint, showing > various info about the current checkpoint? It could go into the pg_stat_checkpointer view, which would be the culmination of another patch (cc Melanie). https://commitfest.postgresql.org/36/3272/ -- Justin
Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?
On 12/7/21 15:36, Bharath Rupireddy wrote: > Hi, > > Currently one can know the kind of on-going/last checkpoint (shutdown, > end-of-recovery, immediate, force etc.) only via server logs, and only > when the log_checkpoints GUC is on. At times, the users/service layer > components would want to know the kind of checkpoint (along with other > checkpoint related info) to take some actions and it will be a bit > difficult to search through the server logs. The checkpoint info can > be obtained from the control file (either by pg_control_checkpoint() > or by pg_controldata tool) whereas checkpoint kind isn't available > there. > > How about we add an extra string field to the control file alongside > the other checkpoint info it already has? This way, the existing > pg_control_checkpoint() or pg_controldata tool can easily be enhanced > to show the checkpoint kind as well. One concern is that we don't want > to increase the size of pg_controldata by more than the typical block > size (of 8K) to avoid any torn-writes. With this change, we might add > at max the strings specified at [1]. Adding it to the control file has > an advantage of preserving the last checkpoint kind which might be > useful. > > Thoughts? > I agree it might be useful to provide information about the nature of the checkpoint, and perhaps even PID of the backend that triggered it (although that may be tricky, if the backend terminates). I'm not sure about adding it to control data, though. That doesn't seem like a very good match for something that's mostly for monitoring. We already have some checkpoint info in pg_stat_bgwriter, but that's just aggregated data, not very convenient for info about the current checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing various info about the current checkpoint?
> [1] for checkpoint: "checkpoint shutdown end-of-recovery immediate > force wait wal time flush-all" > for restartpoint: "restartpoint shutdown end-of-recovery immediate > force wait wal time flush-all" > I'd bet squashing all of this into a single string (not really a flag) will just mean people will have to parse it, etc. Keeping individual boolean flags (or even separate string fields) would be better, I guess. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?
On Tue, Dec 7, 2021 at 3:44 PM Peter Geoghegan wrote: > Fair enough, but even then we still ultimately have to generate a > final number that represents how close we are to a configurable "do an > autovacuum" threshold (such as an autovacuum_vacuum_scale_factor-based > threshold) -- the autovacuum.c side of this (the consumer side) > fundamentally needs the model to reduce everything to a one > dimensional number (even though the reality is that there isn't just > one dimension). This single number (abstract bloat units, abstract > dead tuples, whatever) is a function of things like the count of dead > HOT chains, perhaps the concentration of dead tuples on heap pages, > whatever -- but it's not the same thing as any one of those things we > count. > > I think that this final number needs to be denominated in abstract > units -- we need to call those abstract units *something*. I don't > care what that name ends up being, as long as it reflects reality. If we're only trying to decide whether or not to vacuum a table, we don't need units: the output is a Boolean. If we're trying to decide on an order in which to vacuum tables, then we need units. But such units can't be anything related to dead tuples, because vacuum can be needed based on XID age, or MXID age, or dead tuples. The units would have to be something like abstract vacuum-urgency units (if higher is more urgent) or abstract remaining-headroom-before-catastrophe units (if lower is more urgent). Ignoring wraparound considerations for a moment, I think that we might want to consider having multiple thresholds and vacuuming the table if any one of them is met. For example, suppose a table qualifies for vacuum when %-of-not-all-visible-pages > some-threshold, or alternatively when %-of-index-tuples-thought-to-be-dead > some-other-threshold. -- Robert Haas EDB: http://www.enterprisedb.com
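Robert's "any one of several thresholds" idea might look roughly like this (an illustrative Python sketch; only the first formula mirrors the existing autovacuum_vacuum_threshold / autovacuum_vacuum_scale_factor behavior, while the other thresholds and all default values are invented for the example):

```python
# Sketch of a "vacuum if ANY threshold is met" policy. The `classic` test
# is the documented autovacuum formula:
#   dead_tuples > autovacuum_vacuum_threshold
#                 + autovacuum_vacuum_scale_factor * reltuples
# The visibility and index-bloat thresholds are hypothetical additions.

def needs_autovacuum(dead_tuples, reltuples,
                     pct_not_all_visible, pct_dead_index_tuples,
                     vac_threshold=50, scale_factor=0.2,
                     navp_threshold=0.4, dead_index_threshold=0.1):
    classic = dead_tuples > vac_threshold + scale_factor * reltuples
    visibility = pct_not_all_visible > navp_threshold
    index_bloat = pct_dead_index_tuples > dead_index_threshold
    return classic or visibility or index_bloat

# A table can qualify on visibility alone, even when sampling has counted
# very few dead tuples (the TPC-C underestimation case Peter describes):
print(needs_autovacuum(10, 1_000_000, 0.5, 0.0))  # True
```

The point of the OR structure is that each signal compensates for a blind spot in the others, rather than trying to fold everything into one abstract number up front.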
Re: Transparent column encryption
On 12/7/21 19:02, Jacob Champion wrote: On Tue, 2021-12-07 at 16:39 +0100, Peter Eisentraut wrote: On 06.12.21 21:44, Jacob Champion wrote: I think reusing a zero IV will potentially leak more information than just equality, depending on the cipher in use. You may be interested in synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem like they would match this use case exactly. (But I'm not a cryptographer.) I'm aware of this and plan to make use of SIV. The current implementation is just an example. Sounds good. Have you given any thought to AEAD? As a client I'd like to be able to tie an encrypted value to other column (or external) data. For example, AEAD could be used to prevent a DBA from copying the (encrypted) value of my credit card column into their account's row to use it. I don't know how that is supposed to work. When the value is encrypted for insertion, the client may know things like table name or column name, so it can tie it to those. But it doesn't know what row it will go in, so you can't prevent the value from being copied into another row. You would need some permanent logical row ID for this, I think. Sorry, my description was confusing. There's nothing preventing the DBA from copying the value inside the database, but AEAD can make it so that the copied value isn't useful to the DBA. Sample case. Say I have a webapp backed by Postgres, which stores encrypted credit card numbers. Users authenticate to the webapp which then uses the client (which has the keys) to talk to the database. Additionally, I assume that: - the DBA can't access the client directly (because if they can, then they can unencrypt the victim's info using the client's keys), and - the DBA can't authenticate as the user/victim (because if they can, they can just log in themselves and have the data). The webapp might for example use federated authn with a separate provider, using an email address as an identifier. 
Now, if the client encrypts a user's credit card number using their email address as associated data, then it doesn't matter if the DBA copies that user's encrypted card over to their own account. The DBA can't log in as the victim, so the client will fail to authenticate the value because its associated data won't match. This is not targeting PostgreSQL 15. But I'd appreciate some feedback on the direction. What kinds of attacks are you hoping to prevent (and not prevent)? The point is to prevent admins from getting at plaintext data. The scenario you show is an interesting one but I think it's not meant to be addressed by this. If admins can alter the database to their advantage, they could perhaps increase their account balance, create discount codes, etc. also. Sure, but increasing account balances and discount codes don't lead to getting at plaintext data, right? Whereas stealing someone else's encrypted value seems like it would be covered under your threat model, since it lets you trick a real-world client into decrypting it for you. Other avenues of attack might depend on how you choose to add HMAC to the current choice of AES-CBC. My understanding of AE ciphers (with or without associated data) is that you don't have to design that yourself, which is nice. IMO it's impossible to solve this attack within TCE, because it requires ensuring consistency at the row level, but TCE obviously works at column level only. I believe TCE can do AEAD at the column level, which protects against attacks that flip bits, and similar attacks. It's just a matter of how the client encrypts the data. Extending it to protect the whole row seems tricky, because the client may not even know the other columns, and it's not clear to me how it'd deal with things like updates of the other columns, hint bits, dropped columns, etc. It's probably possible to get something like this (row-level AEAD) by encrypting enriched data, i.e. 
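The associated-data mechanics described above can be illustrated with a toy encrypt-then-MAC construction in plain Python. This is emphatically not a real cipher: the keystream is just hashed counters, the same key is reused for encryption and MAC (a production design derives separate keys, or uses a standard AEAD mode such as AES-GCM or SIV). The point it demonstrates is only the AEAD contract: if the DBA copies the ciphertext to a row whose associated data (the email address) differs, authentication fails before any plaintext is produced.

```python
import hashlib
import hmac


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy CTR-style keystream built from SHA-256 -- NOT a real cipher.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def aead_encrypt(key: bytes, nonce: bytes, plaintext: bytes,
                 associated_data: bytes) -> tuple[bytes, bytes]:
    # Encrypt, then MAC over (associated data || nonce || ciphertext).
    ct = bytes(p ^ k for p, k in zip(plaintext,
                                     _keystream(key, nonce, len(plaintext))))
    tag = hmac.new(key, associated_data + nonce + ct, hashlib.sha256).digest()
    return ct, tag


def aead_decrypt(key: bytes, nonce: bytes, ct: bytes, tag: bytes,
                 associated_data: bytes) -> bytes:
    # Verify the tag first; wrong associated data means no plaintext.
    expected = hmac.new(key, associated_data + nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed: wrong key or associated data")
    return bytes(c ^ k for c, k in zip(ct, _keystream(key, nonce, len(ct))))
```

In the webapp scenario: the victim's card is encrypted with associated_data=b"victim@example.com"; a ciphertext copied into the DBA's row fails to decrypt when the client supplies the DBA's own email address as associated data.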
not just the card number, but {user ID, card number} or something like that, and verify that in the webapp. The problem of course is that the "user ID" is just another column in the table, and there's nothing preventing the DBA from modifying that too. So I think it's pointless to try extending this to row-level AEAD. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
On 12/7/21, 12:10 AM, "Bharath Rupireddy" wrote: > Here's a patch that I've come up with. Please see if this looks okay > and let me know if we want to take it forward so that I can add a CF > entry. Overall, the patch seems reasonable to me. + case DB_IN_END_OF_RECOVERY_CHECKPOINT: + ereport(LOG, + (errmsg("database system was interrupted while in end-of-recovery checkpoint at %s", + str_time(ControlFile->time)))); + break; I noticed that some (but not all) of the surrounding messages say "last known up at" the control file time. I'm curious why you chose not to use that phrasing in this case. Nathan
Re: Pre-allocating WAL files
On 12/7/21, 9:35 AM, "Bossart, Nathan" wrote: > On 12/7/21, 12:29 AM, "Bharath Rupireddy" > wrote: >> Why can't the walwriter pre-allocate some of the WAL segments instead >> of a new background process? Of course, it might delay the main >> functionality of the walwriter i.e. flush and sync the WAL files, but >> having checkpointer do the pre-allocation makes it do another extra >> task. Here the amount of walwriter work vs checkpointer work, I'm not >> sure which one does more work compared to the other. > > The argument against adding it to the WAL writer is that we want it to > run with low latency to flush asynchronous commits. If we added WAL > pre-allocation to the WAL writer, there could periodically be large > delays. To your point on trying to avoid giving the checkpointer extra tasks (basically what we are talking about on the other thread [0]), WAL pre-allocation might not be of much concern because it will generally be a small, fixed (and configurable) amount of work, and it can be performed concurrently with the checkpoint. Plus, WAL pre-allocation should ordinarily be phased out as WAL segments become eligible for recycling. IMO it's not comparable to tasks like CheckPointSnapBuild() that can delay checkpointing for a long time. Nathan [0] https://www.postgresql.org/message-id/flat/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com
Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?
On Tue, Dec 7, 2021 at 12:27 PM Robert Haas wrote: > Well... I mean, I think we're almost saying the same thing, then, but > I think you're saying it more confusingly. I have no objection to > counting the number of dead HOT chains rather than the number of dead > tuples, because that's what affects the index contents, but there's no > need to characterize that as "not the literal truth." Works for me! > Sure, but we don't *need* to be less accurate, and I don't think we > even *benefit* from being less accurate. If we do something like count > dead HOT chains instead of dead tuples, let's not call that a > less-accurate count of dead tuples. Let's call it an accurate count of > dead HOT chains. Fair enough, but even then we still ultimately have to generate a final number that represents how close we are to a configurable "do an autovacuum" threshold (such as an autovacuum_vacuum_scale_factor-based threshold) -- the autovacuum.c side of this (the consumer side) fundamentally needs the model to reduce everything to a one dimensional number (even though the reality is that there isn't just one dimension). This single number (abstract bloat units, abstract dead tuples, whatever) is a function of things like the count of dead HOT chains, perhaps the concentration of dead tuples on heap pages, whatever -- but it's not the same thing as any one of those things we count. I think that this final number needs to be denominated in abstract units -- we need to call those abstract units *something*. I don't care what that name ends up being, as long as it reflects reality. -- Peter Geoghegan
Re: enable certain TAP tests for MSVC builds
On 12/5/21 18:32, Noah Misch wrote: > On Sun, Dec 05, 2021 at 06:00:08PM -0500, Andrew Dunstan wrote: >> On 12/5/21 14:47, Noah Misch wrote: >>> On Sun, Dec 05, 2021 at 11:57:31AM -0500, Andrew Dunstan wrote: Certain TAP tests rely on settings that the Make files provide for them. However vcregress.pl doesn't provide those settings. This patch remedies that, and I propose to apply it shortly (when we have a fix for the SSL tests that I will write about separately) and backpatch it appropriately. --- a/src/tools/msvc/vcregress.pl +++ b/src/tools/msvc/vcregress.pl @@ -59,6 +59,12 @@ copy("$Config/autoinc/autoinc.dll", "src/test/regress"); copy("$Config/regress/regress.dll", "src/test/regress"); copy("$Config/dummy_seclabel/dummy_seclabel.dll", "src/test/regress"); +# Settings used by TAP tests +$ENV{with_ssl} = $config->{openssl} ? 'openssl' : 'no'; +$ENV{with_ldap} = $config->{ldap} ? 'yes' : 'no'; +$ENV{with_icu} = $config->{icu} ? 'yes' : 'no'; +$ENV{with_gssapi} = $config->{gss} ? 'yes' : 'no'; >>> That's appropriate. There are more variables to cover: >>> >>> $ git grep -n ^export ':(glob)**/Makefile' >>> src/bin/pg_basebackup/Makefile:22:export LZ4 >>> src/bin/pg_basebackup/Makefile:23:export TAR >>> src/bin/pg_basebackup/Makefile:27:export GZIP_PROGRAM=$(GZIP) >>> src/bin/psql/Makefile:20:export with_readline >>> src/test/kerberos/Makefile:16:export with_gssapi with_krb_srvnam >>> src/test/ldap/Makefile:16:export with_ldap >>> src/test/modules/ssl_passphrase_callback/Makefile:3:export with_ssl >>> src/test/recovery/Makefile:20:export REGRESS_SHLIB >>> src/test/ssl/Makefile:18:export with_ssl >>> src/test/subscription/Makefile:18:export with_icu >> LZ4/TAR/GZIP_PROGRAM: It's not clear what these should be set to. The TAP >> tests skip tests that use them if they are not set. > Could add config.pl entries for those. Preventing those skips on Windows may > or may not be worth making config.pl readers think about them. 
> >> with_readline: we don't build with readline on Windows, period. I guess >> we could just set it to "no". >> with_krb_srvnam: the default is "postgres", we could just set it to that >> I guess. > Works for me. All done. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?
On Tue, Dec 7, 2021 at 2:13 PM Peter Geoghegan wrote: > For example, why should we count dead heap-only tuples from earlier in > a HOT chain, even when we see no evidence that opportunistic HOT > pruning can't keep up on that page? Since we actually care about the > direction of things, not just the present state of things, we'd be > justified in completely ignoring those dead tuples. Similarly, it > might well make sense to give more weight to concentrations of LP_DEAD > items on a page -- that is a signal that things are not going well *at > the level of the page*. Not so much when you have a few LP_DEAD stubs, > but certainly when you have dozens of them on one page, or even > hundreds. And so ISTM that the conditions of the page should influence > how we interpret/count that page's dead tuples, in both directions > (interpret the page as having more dead tuples, or fewer). Well... I mean, I think we're almost saying the same thing, then, but I think you're saying it more confusingly. I have no objection to counting the number of dead HOT chains rather than the number of dead tuples, because that's what affects the index contents, but there's no need to characterize that as "not the literal truth." There's nothing fuzzy or untrue about it if we simply say that's what we're doing. > Right. And as I keep saying, the truly important thing is to not > *completely* ignore any relevant dimension of cost. I just don't want > to ever be wildly wrong -- not even once. We can tolerate being > somewhat less accurate all the time (not that we necessarily have to > make a trade-off), but we cannot tolerate pathological behavior. Of > course I include new/theoretical pathological behaviors here (not just > the ones we know about today). Sure, but we don't *need* to be less accurate, and I don't think we even *benefit* from being less accurate. If we do something like count dead HOT chains instead of dead tuples, let's not call that a less-accurate count of dead tuples. 
Let's call it an accurate count of dead HOT chains. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Dubious usage of TYPCATEGORY_STRING
I wrote: > Peter Eisentraut writes: >> Could we add explicit casts (like polcmd::text) here? Or would it break >> too much? > I assumed it'd break too much to consider doing that. But I suppose > that since a typcategory change would be initdb-forcing anyway, maybe > it's not out of the question. I'll investigate and see exactly how > many places would need an explicit cast. Um, I definitely gave up too easily there. The one usage in \dp seems to be the *only* thing that breaks in describe.c, and pg_dump doesn't need any changes so far as check-world reveals. So let's just move "char" to another category, as attached. regards, tom lane diff --git a/contrib/citext/expected/citext.out b/contrib/citext/expected/citext.out index ec99aaed5d..3bac0534fb 100644 --- a/contrib/citext/expected/citext.out +++ b/contrib/citext/expected/citext.out @@ -1089,7 +1089,12 @@ INSERT INTO caster (char) VALUES ('f'::citext); INSERT INTO caster (citext)VALUES ('f'::char); INSERT INTO caster (chr) VALUES ('f'::text); INSERT INTO caster (text) VALUES ('f'::"char"); -INSERT INTO caster (chr) VALUES ('f'::citext); +INSERT INTO caster (chr) VALUES ('f'::citext); -- requires cast +ERROR: column "chr" is of type "char" but expression is of type citext +LINE 1: INSERT INTO caster (chr) VALUES ('f'::citext); + ^ +HINT: You will need to rewrite or cast the expression. 
+INSERT INTO caster (chr) VALUES ('f'::citext::text); INSERT INTO caster (citext)VALUES ('f'::"char"); INSERT INTO caster (name) VALUES ('foo'::text); INSERT INTO caster (text) VALUES ('foo'::name); diff --git a/contrib/citext/expected/citext_1.out b/contrib/citext/expected/citext_1.out index 75fd08b7cc..57fc863f7a 100644 --- a/contrib/citext/expected/citext_1.out +++ b/contrib/citext/expected/citext_1.out @@ -1089,7 +1089,12 @@ INSERT INTO caster (char) VALUES ('f'::citext); INSERT INTO caster (citext)VALUES ('f'::char); INSERT INTO caster (chr) VALUES ('f'::text); INSERT INTO caster (text) VALUES ('f'::"char"); -INSERT INTO caster (chr) VALUES ('f'::citext); +INSERT INTO caster (chr) VALUES ('f'::citext); -- requires cast +ERROR: column "chr" is of type "char" but expression is of type citext +LINE 1: INSERT INTO caster (chr) VALUES ('f'::citext); + ^ +HINT: You will need to rewrite or cast the expression. +INSERT INTO caster (chr) VALUES ('f'::citext::text); INSERT INTO caster (citext)VALUES ('f'::"char"); INSERT INTO caster (name) VALUES ('foo'::text); INSERT INTO caster (text) VALUES ('foo'::name); diff --git a/contrib/citext/sql/citext.sql b/contrib/citext/sql/citext.sql index 10232f5a9f..55fb1d11a6 100644 --- a/contrib/citext/sql/citext.sql +++ b/contrib/citext/sql/citext.sql @@ -361,7 +361,8 @@ INSERT INTO caster (citext)VALUES ('f'::char); INSERT INTO caster (chr) VALUES ('f'::text); INSERT INTO caster (text) VALUES ('f'::"char"); -INSERT INTO caster (chr) VALUES ('f'::citext); +INSERT INTO caster (chr) VALUES ('f'::citext); -- requires cast +INSERT INTO caster (chr) VALUES ('f'::citext::text); INSERT INTO caster (citext)VALUES ('f'::"char"); INSERT INTO caster (name) VALUES ('foo'::text); diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index c1d11be73f..216aa4510d 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9305,6 +9305,10 @@ SCRAM-SHA-256$iteration count: X unknown type + + Z + Internal-use types + 
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index ea721d963a..72d8547628 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -1142,7 +1142,7 @@ permissionsList(const char *pattern) ",\n pg_catalog.array_to_string(ARRAY(\n" "SELECT polname\n" "|| CASE WHEN polcmd != '*' THEN\n" - " E' (' || polcmd || E'):'\n" + " E' (' || polcmd::pg_catalog.text || E'):'\n" " ELSE E':'\n" " END\n" "|| CASE WHEN polqual IS NOT NULL THEN\n" @@ -1176,7 +1176,7 @@ permissionsList(const char *pattern) " E' (RESTRICTIVE)'\n" " ELSE '' END\n" "|| CASE WHEN polcmd != '*' THEN\n" - " E' (' || polcmd || E'):'\n" + " E' (' || polcmd::pg_catalog.text || E'):'\n" " ELSE E':'\n" " END\n" "|| CASE WHEN polqual IS NOT NULL THEN\n" diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat index 41074c994b..f3d94f3cf5 100644 --- a/src/include/catalog/pg_type.dat +++
Re: MSVC SSL test failure
On 12/7/21 13:22, Tom Lane wrote: > Alexander Lakhin writes: >> 07.12.2021 19:25, Tom Lane wrote: >>> Hmm. I wonder whether using SD_BOTH behaves any differently. >> With shutdown(MyProcPort->sock, SD_BOTH) the test failed for me on >> iterations 1, 2, 3, 16 (just as without shutdown() at all). >> So shutdown with the SD_SEND flag definitely behaves much better (I've >> seen over 200 successful iterations). > Fun. Well, I'll put in shutdown with SD_SEND for the moment, > and we'll have to see whether we can improve further than that. > It does sound like we may be running into OpenSSL bugs/oddities, > not only kernel issues, so it may be impossible to do better > on our side. Yeah. My suspicion is that SD_BOTH is what closesocket() does if shutdown() hasn't been previously called. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: pg_dump versus ancient server versions
On 12/7/21 13:59, Mark Dilger wrote: >> On Dec 7, 2021, at 10:52 AM, Tom Lane wrote: >> >> I'm not entirely following ... are you suggesting that each released minor >> version needs to be kept buildable separately? > No. I'm just wondering if we want to share the product of such efforts if > anybody (me, for instance) volunteers to do it for some subset of minor > releases. For my heap corruption checking work, I might want to be able to > build a small number of old minor releases that I know had corruption bugs. > I doubt there's going to be a whole lot of changes. You should just be able to cherry-pick them in most cases I suspect. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?
On Tue, Dec 7, 2021 at 7:58 AM Robert Haas wrote: > I think that's a good observation. I think the current autovacuum > algorithm works pretty well when things are normal. But in extreme > scenarios it does not degrade gracefully. +1 to all of the specific examples you go on to describe. > > I also think that our problem is not so much that we don't have > > accurate statistics about dead tuples (though we surely don't have > > much accuracy). The main problem seems to be that there are various > > specific, important ways in which the distribution of dead tuples may > > matter (leading to various harmful biases). And so it seems reasonable > > to fudge how we interpret dead tuples with the intention of capturing > > some of that, as a medium term solution. Something that considers the > > concentration of dead tuples in heap pages seems promising. > > I am less convinced about this part. It sort of sounds like you're > saying - it doesn't really matter whether the numbers we gather are > accurate, just that they produce the correct results. That's not quite it. More like: I think that it would be reasonable to define dead tuples abstractly, in a way that's more useful but nevertheless cannot diverge too much from the current definition. We should try to capture "directionality", with clear error bounds. This won't represent the literal truth, but it will be true in spirit (or much closer). For example, why should we count dead heap-only tuples from earlier in a HOT chain, even when we see no evidence that opportunistic HOT pruning can't keep up on that page? Since we actually care about the direction of things, not just the present state of things, we'd be justified in completely ignoring those dead tuples. Similarly, it might well make sense to give more weight to concentrations of LP_DEAD items on a page -- that is a signal that things are not going well *at the level of the page*. 
Not so much when you have a few LP_DEAD stubs, but certainly when you have dozens of them on one page, or even hundreds. And so ISTM that the conditions of the page should influence how we interpret/count that page's dead tuples, in both directions (interpret the page as having more dead tuples, or fewer). We all know that there isn't a sensible answer to the question "if the abstract cost units used in the optimizer say that one sequential page access is 4x cheaper than one random page access, then what's the difference between accessing 10 random pages sequentially in close succession, versus accessing the same 10 pages randomly?". But at the same time, we can say for sure that random is more expensive to *some* degree, but certainly never by multiple orders of magnitude. The model is imperfect, to be sure, but it is better than many alternative models that are also technically possible. We need *some* cost model for a cost-based optimizer, and it is better to be approximately correct than exactly wrong. Again, the truly important thing is to not be totally wrong about any one thing. Another way of putting it is that I am willing to accept a more biased definition of dead tuples, if that lowers the variance: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff We have some certainty about what is possible in a world in which we use the visibility map directly, and so our final estimate should never be wildly unreasonable -- the abstract idea of dead tuples can still be anchored to the physical reality. As with the optimizer and its cost units, we don't have that many degrees of freedom when autovacuum.c launches a worker, and I don't see that changing -- we must ultimately decide to launch or not launch an autovacuum worker (for any given table) each time the autovacuum launcher wakes up. So we're practically forced to come up with a model that has some abstract idea of one kind of bloat/dead tuple. 
I would say that we already have one, in fact -- it's just not a very good one because it doesn't take account of obviously-relevant factors like HOT. It could quite easily be less wrong. > If the > information we currently gather wouldn't produce the right results > even if it were fully accurate, that to me suggests that we're > gathering the wrong information, and we should gather something else. I think that counting dead tuples is useful, but not quite sufficient on its own. At the same time, we still need something that works like a threshold -- because that's just how the autovacuum.c scheduling works. It's a go/no-go decision, regardless of how the decision is made. > For example, we could attack the useless-vacuuming problem by having > each vacuum figure out - and store - the oldest XID that could > potentially be worth using as a cutoff for vacuuming that table, and > refusing to autovacuum that table again until we'd be able to use a > cutoff >= that value. I suppose this would be the oldest of (a) any > XID that exists in the table on a tuple that we judged recently dead, > (b) any XID that is
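As a rough illustration of the "give more weight to concentrations of LP_DEAD items" idea from earlier in this thread, a scoring function might count a handful of stub items for very little but grow super-linearly once a page has dozens of them. The exact curve below is invented for the sketch; only the qualitative shape (a few stubs are ignorable churn, dense pages signal that pruning is not keeping up on that page) comes from the discussion:

```python
def page_dead_weight(n_lp_dead: int) -> float:
    """Weight one page's LP_DEAD items super-linearly.

    A handful of stub items is normal opportunistic-pruning churn;
    dozens or hundreds on a single page are a page-level problem.
    The constants here are arbitrary illustration values.
    """
    if n_lp_dead <= 5:
        return 0.25 * n_lp_dead                     # mostly ignorable
    return n_lp_dead * (1.0 + n_lp_dead / 50.0)     # faster than linear


def table_bloat_score(pages_lp_dead_counts: list[int]) -> float:
    # Reduce everything to one abstract number per table, since the
    # autovacuum-style go/no-go decision ultimately needs exactly that.
    return sum(page_dead_weight(n) for n in pages_lp_dead_counts)
```

With this shape, a table with 100 pages each holding 2 stubs scores far lower than a table where 2 pages hold 100 LP_DEAD items apiece, even though both have 200 dead items in total.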
Re: pg_dump versus ancient server versions
> On Dec 7, 2021, at 10:52 AM, Tom Lane wrote: > > I'm not entirely following ... are you suggesting that each released minor > version needs to be kept buildable separately? No. I'm just wondering if we want to share the product of such efforts if anybody (me, for instance) volunteers to do it for some subset of minor releases. For my heap corruption checking work, I might want to be able to build a small number of old minor releases that I know had corruption bugs. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_dump versus ancient server versions
Mark Dilger writes: > Wouldn't you be able to see what changed by comparing the last released tag > for version X.Y against the RELX_Y_STABLE branch? Something like `git diff > REL8_4_22 origin/REL8_4_STABLE > buildability.patch`? > Having such a patch should make reproducing old corruption bugs easier, as > you could apply the buildability.patch to the last branch that contained the > bug. If anybody did that work, would we want it committed somewhere? > REL8_4_19_BUILDABLE or such? For patches that apply trivially, that might > not be worth keeping, but if the merge is difficult, maybe sharing with the > community would make sense. I'm not entirely following ... are you suggesting that each released minor version needs to be kept buildable separately? That seems like a huge amount of extra committer effort with not much added value. If someone comes to me and wants to investigate a bug in a branch that's already out-of-support, and they then say they're not running the last minor release, I'm going to tell them to come back after updating. It is (I suspect) true that diffing the last release against branch tip would often yield a patch that could be used to make an older minor release buildable again. But when that patch doesn't work trivially, I for one am not interested in making it work. And especially not interested in doing so "on spec", with no certainty that anyone would ever need it. regards, tom lane
Re: Post-CVE Wishlist
On Tue, 2021-11-23 at 18:27 +, Jacob Champion wrote: > Now that the MITM CVEs are published [1], I wanted to share my wishlist > of things that would have made those attacks difficult/impossible to > pull off. Now that we're post-commitfest, here's my summary of the responses so far: > = Client-Side Auth Selection = There is interest in letting libpq reject certain auth methods coming back from the server, perhaps using a simple connection option, and there are some prior conversations on the list to look into. > = Implicit TLS = Reactions to implicit TLS were mixed, from "we should not do this" to "it might be nice to have the option, from a technical standpoint". Both a separate-port model and a shared-port model were tentatively proposed. The general consensus seems to be that the StartTLS-style flow is currently sufficient from a security standpoint. I didn't see any responses that were outright in favor, so I think my remaining question is: are there any committers who think a prototype would be worth the time for a motivated implementer? Thanks for the discussion! --Jacob
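The client-side auth selection idea summarized above could be as simple as checking the server's first authentication request against a comma-separated allowlist from a connection option before answering it. A hypothetical sketch in plain Python -- the option format and method names are illustrative, not an existing libpq interface:

```python
from typing import Optional


def check_auth_method(requested: str, allow_option: Optional[str]) -> None:
    """Raise if the server asks for an auth method the user disallowed.

    `allow_option` mimics a hypothetical comma-separated connection
    option, e.g. "scram-sha-256,cert". None means allow anything,
    which is today's behavior.
    """
    if allow_option is None:
        return
    allowed = {m.strip() for m in allow_option.split(",") if m.strip()}
    if requested not in allowed:
        raise ConnectionError(
            f"server requested auth method {requested!r}, "
            f"but only {sorted(allowed)} are permitted")
```

The key property for the MITM scenario is that the check happens before the client sends anything sensitive (such as a cleartext password) in response to the server's request.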
Re: pg_dump versus ancient server versions
> On Dec 7, 2021, at 8:33 AM, Robert Haas wrote: > > However, it wouldn't be a great idea to back-patch a > completely arbitrary subset of our fixes into those branches, because > then it sort of gets confusing to understand what the status of that > branch is. I don't know that I'm terribly bothered by the idea that > the behavior of the branch might deviate from the last official > release, because most bug fixes are pretty minor and wouldn't really > affect testing much, but it would be a little annoying to explain to > users that those branches contain an arbitrary subset of newer fixes, > and a little hard for us to understand what is and is not there. Wouldn't you be able to see what changed by comparing the last released tag for version X.Y against the RELX_Y_STABLE branch? Something like `git diff REL8_4_22 origin/REL8_4_STABLE > buildability.patch`? Having such a patch should make reproducing old corruption bugs easier, as you could apply the buildability.patch to the last branch that contained the bug. If anybody did that work, would we want it committed somewhere? REL8_4_19_BUILDABLE or such? For patches that apply trivially, that might not be worth keeping, but if the merge is difficult, maybe sharing with the community would make sense. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: MSVC SSL test failure
Alexander Lakhin writes: > 07.12.2021 19:25, Tom Lane wrote: >> Hmm. I wonder whether using SD_BOTH behaves any differently. > With shutdown(MyProcPort->sock, SD_BOTH) the test failed for me on > iterations 1, 2, 3, 16 (just as without shutdown() at all). > So shutdown with the SD_SEND flag definitely behaves much better (I've > seen over 200 successful iterations). Fun. Well, I'll put in shutdown with SD_SEND for the moment, and we'll have to see whether we can improve further than that. It does sound like we may be running into OpenSSL bugs/oddities, not only kernel issues, so it may be impossible to do better on our side. regards, tom lane
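For what it's worth, the SD_SEND shutdown being settled on here is Windows' name for a TCP half-close; the POSIX spelling is SHUT_WR. A small self-contained demonstration (plain Python over a local socket pair, not the backend's actual code) shows why shutting down only the send direction helps: the peer sees end-of-file, yet data the peer still wants to send -- such as SSL's close_notify -- can still be received by the closing side:

```python
import socket

# A connected local pair stands in for the backend/client connection.
a, b = socket.socketpair()

# "a" half-closes its send direction (SD_SEND on Windows, SHUT_WR on POSIX).
a.shutdown(socket.SHUT_WR)

# "b" immediately sees end-of-file on its receive direction...
assert b.recv(1024) == b""

# ...but "b"'s send direction is still open, and "a" can still receive,
# so nothing in flight from the peer is lost.
b.sendall(b"close_notify")
data = a.recv(1024)

a.close()
b.close()
```

Shutting down both directions (SD_BOTH / SHUT_RDWR) would close "a"'s receive side too, which matches the observation that SD_BOTH behaves like not calling shutdown() at all before closesocket().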
Re: ExecTypeSetColNames is fundamentally broken
Robert Haas writes: > On Tue, Dec 7, 2021 at 12:30 PM Tom Lane wrote: >> If we consider that the alias renames the columns "for all purposes", >> how is it okay for f() to select the "a" column? > I'd say it isn't. In a green field I'd probably agree with you, but IMO that will break far too much existing SQL code. It'd cause problems for us too, not only end-users. As an example, ruleutils.c would have to avoid attaching new column aliases to tables that are referenced as whole-row Vars. I'm not very sure that that's even possible without creating insurmountable ambiguity issues. There are also fun issues around what happens to a stored query after a table column rename. Right now the query acts as though it uses the old name as a column alias, and that introduces no semantic problem; but that behavior would no longer be acceptable. So the alternatives I see are to revert what bf7ca1587 tried to do here, or to try to make it work that way across-the-board, which implies (a) a very much larger amount of work, and (b) breaking important behaviors that are decades older than that commit. It's not even entirely clear that we could get to complete consistency if we went down that path. regards, tom lane
Re: Dubious usage of TYPCATEGORY_STRING
On Tue, Dec 7, 2021 at 12:19 PM Tom Lane wrote: > > What's wrong with that? > > Well, we don't allow things like > > regression=# select '(1,2)'::point::float8; > ERROR: cannot cast type point to double precision > LINE 1: select '(1,2)'::point::float8; > ^ > > It's not very clear to me why "char" should get a pass on that. > We allow such cases when the target is text/varchar/etc, but > the assumption is that the textual representation is sufficient > for your purposes. It's hard to claim that just the first > byte is a useful textual representation. Fair enough, I guess. I am pretty skeptical of the merits of refusing an explicit cast. If I ran the zoo, I would probably choose to allow all such casts and make them coerce via IO when no other pathway is available. But I get that's not our policy. > Worse, PG is actually treating this as an assignment-level cast, > so we accept this: > > regression=# create table t(f1 "char"); > CREATE TABLE > regression=# insert into t values ('(1,2)'::point); > INSERT 0 1 > regression=# table t; > f1 > > ( > (1 row) > > I definitely don't think that should have worked without > any complaint. Yes, that one's a bridge too far, even for me. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Transparent column encryption
On Tue, 2021-12-07 at 16:39 +0100, Peter Eisentraut wrote: > On 06.12.21 21:44, Jacob Champion wrote: > > I think reusing a zero IV will potentially leak more information than > > just equality, depending on the cipher in use. You may be interested in > > synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem > > like they would match this use case exactly. (But I'm not a > > cryptographer.) > > I'm aware of this and plan to make use of SIV. The current > implementation is just an example. Sounds good. > > Have you given any thought to AEAD? As a client I'd like to be able to > > tie an encrypted value to other column (or external) data. For example, > > AEAD could be used to prevent a DBA from copying the (encrypted) value > > of my credit card column into their account's row to use it. > > I don't know how that is supposed to work. When the value is encrypted > for insertion, the client may know things like table name or column > name, so it can tie it to those. But it doesn't know what row it will > go in, so you can't prevent the value from being copied into another > row. You would need some permanent logical row ID for this, I think. Sorry, my description was confusing. There's nothing preventing the DBA from copying the value inside the database, but AEAD can make it so that the copied value isn't useful to the DBA. Sample case. Say I have a webapp backed by Postgres, which stores encrypted credit card numbers. Users authenticate to the webapp which then uses the client (which has the keys) to talk to the database. Additionally, I assume that: - the DBA can't access the client directly (because if they can, then they can unencrypt the victim's info using the client's keys), and - the DBA can't authenticate as the user/victim (because if they can, they can just log in themselves and have the data). The webapp might for example use federated authn with a separate provider, using an email address as an identifier. 
Now, if the client encrypts a user's credit card number using their email address as associated data, then it doesn't matter if the DBA copies that user's encrypted card over to their own account. The DBA can't log in as the victim, so the client will fail to authenticate the value because its associated data won't match.

>>> This is not targeting PostgreSQL 15. But I'd appreciate some feedback
>>> on the direction.
>>
>> What kinds of attacks are you hoping to prevent (and not prevent)?
>
> The point is to prevent admins from getting at plaintext data. The
> scenario you show is an interesting one but I think it's not meant to be
> addressed by this. If admins can alter the database to their advantage,
> they could perhaps increase their account balance, create discount
> codes, etc. also.

Sure, but increasing account balances and discount codes don't lead to getting at plaintext data, right? Whereas stealing someone else's encrypted value seems like it would be covered under your threat model, since it lets you trick a real-world client into decrypting it for you.

Other avenues of attack might depend on how you choose to add HMAC to the current choice of AES-CBC. My understanding of AE ciphers (with or without associated data) is that you don't have to design that yourself, which is nice.

--Jacob
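The binding Jacob describes can be illustrated with a toy encrypt-then-MAC construction in plain Python. This is only a sketch of the AEAD idea, not the SIV/AEAD mode under discussion for the patch, and the SHA-256-based keystream is an illustration, not a vetted cipher: the authentication tag covers both the ciphertext and the associated data (the user's email), so a blob copied next to a different identity fails verification.

```python
import hashlib
import hmac
import os

def _keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Toy CTR-style keystream derived from SHA-256; illustration only.
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def seal(key: bytes, plaintext: bytes, associated_data: bytes) -> bytes:
    nonce = os.urandom(16)
    ct = bytes(p ^ k for p, k in zip(plaintext, _keystream(key, nonce, len(plaintext))))
    # The MAC binds the ciphertext to the associated data (e.g. the user's email).
    tag = hmac.new(key, nonce + ct + associated_data, hashlib.sha256).digest()
    return nonce + ct + tag

def unseal(key: bytes, blob: bytes, associated_data: bytes) -> bytes:
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    expect = hmac.new(key, nonce + ct + associated_data, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expect):
        raise ValueError("authentication failed: wrong associated data or tampering")
    return bytes(c ^ k for c, k in zip(ct, _keystream(key, nonce, len(ct))))

key = os.urandom(32)
blob = seal(key, b"4111-1111-1111-1111", b"victim@example.com")

# Decrypting under the victim's identity works ...
assert unseal(key, blob, b"victim@example.com") == b"4111-1111-1111-1111"
# ... but the same blob copied to the DBA's row is rejected by the client.
try:
    unseal(key, blob, b"dba@example.com")
    raise AssertionError("should not decrypt under a different identity")
except ValueError:
    pass
```

Real AEAD modes (AES-GCM, AES-GCM-SIV, etc.) give the same property without hand-rolling the MAC, which is the point made at the end of the message.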
Re: MSVC SSL test failure
Hello Tom,

07.12.2021 19:25, Tom Lane wrote:
> Hmm. I wonder whether using SD_BOTH behaves any differently.

With shutdown(MyProcPort->sock, SD_BOTH) the test failed for me on iterations 1, 2, 3, 16 (just as without shutdown() at all). So shutdown with the SD_SEND flag definitely behaves much better (I've seen over 200 successful iterations).

Best regards, Alexander
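Alexander's observation matches TCP half-close semantics: shutdown with SD_SEND (SHUT_WR on POSIX) sends a FIN but leaves the receive side open, so the peer can still drain any data in flight before seeing a clean EOF. A small POSIX sketch of the same idea (a local socket pair, not the Windows/SSL code under discussion):

```python
import socket

a, b = socket.socketpair()
a.sendall(b"alert data")
a.shutdown(socket.SHUT_WR)  # like shutdown(sock, SD_SEND): FIN sent, receive side stays open

# The peer can still read everything that was in flight ...
assert b.recv(1024) == b"alert data"
# ... and then sees a clean end-of-stream instead of an abort.
assert b.recv(1024) == b""

# The half-closed side can still receive a reply.
b.sendall(b"ack")
assert a.recv(1024) == b"ack"

a.close()
b.close()
```

SD_BOTH (SHUT_RDWR) closes both directions, which is consistent with it not helping in the test above.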
Re: ExecTypeSetColNames is fundamentally broken
On Tue, Dec 7, 2021 at 12:30 PM Tom Lane wrote:
> If we consider that the alias renames the columns "for all purposes",
> how is it okay for f() to select the "a" column?

I'd say it isn't.

> Another way to phrase the issue is that the column names seen
> by f() are currently different from those seen by row_to_json():
>
> regression=# select row_to_json(t) from t as t(x,y);
>    row_to_json
> -----------------
>  {"x":11,"y":12}
> (1 row)
>
> and that seems hard to justify.

Yeah, I agree. The problem I have here is that, with your proposed fix, it still won't be very consistent. True, row_to_json() and f() will both see the unaliased column names ... but a straight select * from t as t(x,y) will show the aliased names. That's unarguably better than corrupting your data, but it seems "astonishing" in the POLA sense.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: Alter all tables in schema owner fix
On 12/7/21, 2:47 AM, "Greg Nancarrow" wrote: > On Tue, Dec 7, 2021 at 9:01 PM Amit Kapila wrote: >> >> Thanks, the patch looks mostly good to me. I have slightly modified it >> to incorporate one of Michael's suggestions, ran pgindent, and >> modified the commit message. >> > > LGTM, except in the patch commit message I'd change "Create > Publication" to "CREATE PUBLICATION". LGTM, too. Nathan
Re: pg_dump versus ancient server versions
Robert Haas writes: > I guess the point about user-visible bug fixes is that, as soon as we > start doing that, we don't really want it to be hit-or-miss. We could > make a decision to back-patch all bug fixes or those of a certain > severity or whatever we like back to older branches, and then those > branches would be supported or semi-supported depending on what rule > we adopted, and we could even continue to do releases for them if we > so chose. However, it wouldn't be a great idea to back-patch a > completely arbitrary subset of our fixes into those branches, because > then it sort of gets confusing to understand what the status of that > branch is. Yup, and also confusing to understand whether a given new fix should be back-patched into the out-of-support-but-keep-buildable branches. I want to settle on a reasonably well-defined policy for that. I'm basically suggesting that the policy should be "back-patch the minimal fix needed so that you can still get a clean build and clean check-world run, using thus-and-such configure options". (The point of the configure options limitation being to exclude moving-target external dependencies, such as Python.) I think that Peter's original suggestion could be read the same way except for the adjective "clean". He also said that only core regression needs to pass not check-world; but if we're trying to test things like pg_dump compatibility, I think we want the wider scope of what to keep working. regards, tom lane
Re: Pre-allocating WAL files
On 12/7/21, 12:29 AM, "Bharath Rupireddy" wrote:
> Why can't the walwriter pre-allocate some of the WAL segments instead
> of a new background process? Of course, it might delay the main
> functionality of the walwriter i.e. flush and sync the WAL files, but
> having checkpointer do the pre-allocation makes it do another extra
> task. Here the amount of walwriter work vs checkpointer work, I'm not
> sure which one does more work compared to the other.

The argument against adding it to the WAL writer is that we want it to run with low latency to flush asynchronous commits. If we added WAL pre-allocation to the WAL writer, there could periodically be large delays.

> Another idea could be to let walwriter or checkpointer pre-allocate
> the WAL files whichever seems free as-of-the-moment when the WAL
> segment pre-allocation request comes. We can go further to let the
> user choose which process i.e. checkpointer or walwriter do the
> pre-allocation with a GUC maybe?

My latest patch set [0] adds WAL pre-allocation to the checkpointer. In that patch set, WAL pre-allocation is done both outside of checkpoints as well as during checkpoints (via CheckPointWriteDelay()).

Nathan

[0] https://www.postgresql.org/message-id/CB15BEBD-98FC-4E72-86AE-513D34014176%40amazon.com
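Whichever process does it, the pre-allocation step itself is conceptually simple: create zero-filled segment files ahead of need, durably, and publish them atomically. A rough sketch under stated assumptions (hypothetical function names, a tiny segment size for illustration; real WAL segments default to 16MB and the real code recycles old segments too):

```python
import os
import tempfile

SEGMENT_SIZE = 8192  # illustration only; wal_segment_size defaults to 16MB

def preallocate_segments(wal_dir: str, names: list) -> int:
    """Create zero-filled segment files ahead of need; skip ones that exist."""
    created = 0
    for name in names:
        path = os.path.join(wal_dir, name)
        if os.path.exists(path):
            continue
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(b"\0" * SEGMENT_SIZE)
            f.flush()
            os.fsync(f.fileno())  # durably zero-filled before it becomes visible
        os.rename(tmp, path)      # atomic publish, as the real code must do
        created += 1
    return created

d = tempfile.mkdtemp()
assert preallocate_segments(d, ["000000010000000000000001", "000000010000000000000002"]) == 2
assert preallocate_segments(d, ["000000010000000000000001"]) == 0  # already present
```

The latency argument in the thread is about who pays the fsync cost of that loop: the checkpointer can absorb it, while the WAL writer's flush loop should not.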
Re: ExecTypeSetColNames is fundamentally broken
Robert Haas writes:
> On Mon, Dec 6, 2021 at 4:05 PM Tom Lane wrote:
>> select f(t) from t(x,y);
>>
>> If we adopt the "rename for all purposes" interpretation, then
>> the second SELECT must fail, because what f() is being passed is
>> no longer of type t.

> For me, the second SELECT does fail:
> rhaas=# select f(t) from t(x,y);
> ERROR: column "x" does not exist

Ah, sorry, I fat-fingered the alias syntax. Here's a tested example:

regression=# create table t (a int, b int);
CREATE TABLE
regression=# insert into t values(11,12);
INSERT 0 1
regression=# create function f(t) returns int as 'select $1.a' language sql;
CREATE FUNCTION
regression=# select f(t) from t as t(x,y);
 f
----
 11
(1 row)

If we consider that the alias renames the columns "for all purposes", how is it okay for f() to select the "a" column?

Another way to phrase the issue is that the column names seen by f() are currently different from those seen by row_to_json():

regression=# select row_to_json(t) from t as t(x,y);
   row_to_json
-----------------
 {"x":11,"y":12}
(1 row)

and that seems hard to justify.

regards, tom lane
Re: Add sub-transaction overflow status in pg_stat_activity
On 12/6/21, 8:19 PM, "Dilip Kumar" wrote:
> If the subtransaction cache is overflowed in some of the transactions
> then it will affect all the concurrent queries as they need to access
> the SLRU for checking the visibility of each tuple. But currently
> there is no way to identify whether in any backend subtransaction is
> overflowed or what is the current active subtransaction count.
> Attached patch adds subtransaction count and subtransaction overflow
> status in pg_stat_activity. I have implemented this because of the
> recent complaint about the same [1]

I'd like to give a general +1 to this effort. Thanks for doing this! I've actually created a function to provide this information in the past, so I will help review.

Nathan
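For context, the "overflow" condition is just the fixed-size per-backend subxid cache exceeding its capacity (PGPROC_MAX_CACHED_SUBXIDS, 64 in current sources); once that happens, visibility checks must fall back to pg_subtrans (the SLRU). A toy model of what the proposed columns would report — the column names here are illustrative, not necessarily the patch's:

```python
PGPROC_MAX_CACHED_SUBXIDS = 64  # fixed-size per-backend subxid cache in PostgreSQL

class Backend:
    def __init__(self):
        self.subxids = []
        self.overflowed = False  # once set, readers must consult the SLRU instead

    def register_subxact(self, xid):
        if len(self.subxids) < PGPROC_MAX_CACHED_SUBXIDS:
            self.subxids.append(xid)
        else:
            self.overflowed = True  # cache full: stop caching, remember the overflow

    def stat_row(self):
        # Roughly what pg_stat_activity would expose per the proposal.
        return {"subxact_count": len(self.subxids),
                "subxact_overflowed": self.overflowed}

b = Backend()
for xid in range(100, 170):  # 70 subtransactions > 64-entry cache
    b.register_subxact(xid)
assert b.stat_row() == {"subxact_count": 64, "subxact_overflowed": True}
```

The monitoring value is exactly this: spotting which backend tripped the overflow flag, since that one backend degrades visibility checks for all concurrent queries.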
Re: Dubious usage of TYPCATEGORY_STRING
Robert Haas writes:
> On Thu, Dec 2, 2021 at 4:22 PM Tom Lane wrote:
>> An example of the reasons not to treat these types as being
>> general-purpose strings can be seen at [1], where the "char"
>> type has acquired some never-intended cast behaviors. Taking
>> that to an extreme, we currently accept
>>
>> regression=# select '(1,2)'::point::"char";
>>  char
>> ------
>>  (
>> (1 row)

> What's wrong with that?

Well, we don't allow things like

regression=# select '(1,2)'::point::float8;
ERROR: cannot cast type point to double precision
LINE 1: select '(1,2)'::point::float8;
               ^

It's not very clear to me why "char" should get a pass on that. We allow such cases when the target is text/varchar/etc, but the assumption is that the textual representation is sufficient for your purposes. It's hard to claim that just the first byte is a useful textual representation.

Worse, PG is actually treating this as an assignment-level cast, so we accept this:

regression=# create table t(f1 "char");
CREATE TABLE
regression=# insert into t values ('(1,2)'::point);
INSERT 0 1
regression=# table t;
 f1
----
 (
(1 row)

I definitely don't think that should have worked without any complaint.

regards, tom lane
Re: pg_dump versus ancient server versions
On Mon, Dec 6, 2021 at 4:19 PM Tom Lane wrote: > Right. The question that's on the table is how much is the right > amount of maintenance. I think that back-patching user-visible bug > fixes, for example, is taking things too far. What we want is to > be able to replicate the behavior of the branch's last released > version, using whatever build tools we are currently using. So > back-patching something like that is counterproductive, because > now the behavior is not what was released. > > A minimal amount of maintenance would be "only back-patch fixes > for issues that cause failure-to-build". The next step up is "fix > issues that cause failure-to-pass-regression-tests", and then above > that is "fix developer-facing annoyances, such as compiler warnings > or unwanted test output, as long as you aren't changing user-facing > behavior". I now think that it'd be reasonable to include this > last group, although I'm pretty sure Peter didn't have that in mind > in his policy sketch. Yep, that seems reasonable to me. I guess the point about user-visible bug fixes is that, as soon as we start doing that, we don't really want it to be hit-or-miss. We could make a decision to back-patch all bug fixes or those of a certain severity or whatever we like back to older branches, and then those branches would be supported or semi-supported depending on what rule we adopted, and we could even continue to do releases for them if we so chose. However, it wouldn't be a great idea to back-patch a completely arbitrary subset of our fixes into those branches, because then it sort of gets confusing to understand what the status of that branch is. 
I don't know that I'm terribly bothered by the idea that the behavior of the branch might deviate from the last official release, because most bug fixes are pretty minor and wouldn't really affect testing much, but it would be a little annoying to explain to users that those branches contain an arbitrary subset of newer fixes, and a little hard for us to understand what is and is not there. That being said, suppose that a new compiler version comes out and on that new compiler version, 'make check' crashes on the older branch due to a missing WhateverGetDatum() call that we rectified in a later, back-patched commit. I would consider it reasonable to back-patch that particular bug fix into an unsupported branch to make it testable, just like we would do for a failure-to-build issue. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MSVC SSL test failure
Alexander Lakhin writes:
> It seems that the test failure rate may depend on the specs/environment.

No surprise there, since the issue is almost surely timing-dependent.

> shutdown(MyProcPort->sock, SD_SEND) apparently fixes the issue, I've got
> 83 successful runs, but then iteration 84 unfortunately failed:
> t/001_ssltests.pl .. 106/110
> # Failed test 'intermediate client certificate is missing: matches'
> # at t/001_ssltests.pl line 608.
> # 'psql: error: connection to server at "127.0.0.1",
> # port 63187 failed: could not receive data from server: Software caused
> # connection abort (0x2745/10053)
> # SSL SYSCALL error: Software caused connection abort (0x2745/10053)
> # could not send startup packet: No error (0x/0)'
> # doesn't match '(?^:SSL error: tlsv1 alert unknown ca)'
> # Looks like you failed 1 test of 110.

Hmm. I wonder whether using SD_BOTH behaves any differently.

regards, tom lane
Re: ExecTypeSetColNames is fundamentally broken
On Mon, Dec 6, 2021 at 4:05 PM Tom Lane wrote:
> Well, that was what I thought when I wrote bf7ca1587, but it leads
> to logical contradictions. Consider
>
> create table t (a int, b int);
>
> create function f(t) returns ... ;
>
> select f(t) from t;
>
> select f(t) from t(x,y);
>
> If we adopt the "rename for all purposes" interpretation, then
> the second SELECT must fail, because what f() is being passed is
> no longer of type t. If you ask me, that'll be a bigger problem
> for users than the change I'm proposing (quite independently of
> how hard it might be to implement). It certainly will break
> a behavior that goes back much further than bf7ca1587.

For me, the second SELECT does fail:

rhaas=# select f(t) from t(x,y);
ERROR: column "x" does not exist
LINE 1: select f(t) from t(x,y);
                         ^

If it didn't fail, what would the behavior be? I suppose you could make an argument for trying to match up the columns by position, but if so this ought to work:

rhaas=# create table u(a int, b int);
CREATE TABLE
rhaas=# select f(u) from u;
ERROR: function f(u) does not exist
rhaas=# select f(u::t) from u;
ERROR: cannot cast type u to t

Matching columns by name can't work because the names don't match. Matching columns by position does not work. So if that example succeeds, the only real explanation is that we magically know that we've still got a value of type t despite the user's best attempt to decree otherwise. I know PostgreSQL sometimes ... does things like that. I have no idea why anyone would consider it a desirable behavior, though.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: AW: Assorted improvements in pg_dump
Hans Buschmann writes: > I noticed that you patched master with all the improvements in pg_dump. > Did you change your mind about backpatching patch 0005 to fix the toast size > matter? I looked briefly at that and found that the patch would have to be largely rewritten, because getTables() looks quite different in the older branches. I'm not really sufficiently excited about the point to do that rewriting and re-testing. I think that cases where the old logic gets the scheduling badly wrong are probably rare. regards, tom lane
Re: Dubious usage of TYPCATEGORY_STRING
On Thu, Dec 2, 2021 at 4:22 PM Tom Lane wrote:
> An example of the reasons not to treat these types as being
> general-purpose strings can be seen at [1], where the "char"
> type has acquired some never-intended cast behaviors. Taking
> that to an extreme, we currently accept
>
> regression=# select '(1,2)'::point::"char";
>  char
> ------
>  (
> (1 row)

What's wrong with that?

-- Robert Haas EDB: http://www.enterprisedb.com
Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?
On Mon, Dec 6, 2021 at 9:42 PM Peter Geoghegan wrote:
> On Mon, Dec 6, 2021 at 6:11 PM Robert Haas wrote:
>> This doesn't seem convincing. Launching autovacuum too soon surely has
>> costs that someone might not want to pay. Clearly in the degenerate
>> case where we always autovacuum every table every time an autovacuum
>> worker is launched, we have gone insane.
>
> Unfortunately we already sometimes behave insanely in exactly the way
> that you describe:
>
> https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=w...@mail.gmail.com

In the same ballpark is http://rhaas.blogspot.com/2020/02/useless-vacuuming.html

> It's hard to define a break-even point for launching an autovacuum
> worker. I think it would be more productive to come up with a design
> that at least doesn't go completely off the rails in various specific
> ways.

I think that's a good observation. I think the current autovacuum algorithm works pretty well when things are normal. But in extreme scenarios it does not degrade gracefully. The vacuuming-over-and-over-without-doing-anything phenomenon is an example of that. Another example is the user who creates 10,000 databases and we're happy to divide the 60s autovacuum_naptime by 10,000 and try to launch a worker every 6 ms. A third example is vacuuming the tables from pg_class in physical order on disk, so that a table that is 1 XID past the wraparound limit can result in a long delay vacuuming a table that is bloating quickly, or conversely a table that is bloating very slowly but has just crawled over the threshold for a regular vacuum gets processed before one that is threatening an imminent wraparound shutdown. I think these are all pathological cases that a well-informed human can easily recognize and handle in an intelligent manner, and it doesn't seem crazy to program those responses into the computer in some way.
> I also think that our problem is not so much that we don't have > accurate statistics about dead tuples (though we surely don't have > much accuracy). The main problem seems to be that there are various > specific, important ways in which the distribution of dead tuples may > matter (leading to various harmful biases). And so it seems reasonable > to fudge how we interpret dead tuples with the intention of capturing > some of that, as a medium term solution. Something that considers the > concentration of dead tuples in heap pages seems promising. I am less convinced about this part. It sort of sounds like you're saying - it doesn't really matter whether the numbers we gather are accurate, just that they produce the correct results. If the information we currently gather wouldn't produce the right results even if it were fully accurate, that to me suggests that we're gathering the wrong information, and we should gather something else. For example, we could attack the useless-vacuuming problem by having each vacuum figure out - and store - the oldest XID that could potentially be worth using as a cutoff for vacuuming that table, and refusing to autovacuum that table again until we'd be able to use a cutoff >= that value. I suppose this would be the oldest of (a) any XID that exists in the table on a tuple that we judged recently dead, (b) any XID that is currently-running, and (c) the next XID. I also accept that knowing the concentration of dead tuples on pages that contain at least 1 dead tuple could be interesting. I've felt for a while that it's a mistake to know how many dead tuples there are but not how many pages contain them, because that affects both the amount of I/O required to vacuum and also how much need we have to set VM bits. I'm not sure I would have approached gathering that information in the way that you're proposing here, but I'm not deeply against it, either. 
I do think that we should try to keep it as a principle that whatever we do gather, we should try our best to make accurate. If that doesn't work well, then gather different stuff instead. -- Robert Haas EDB: http://www.enterprisedb.com
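Robert's anti-useless-vacuuming idea — store the oldest XID that could have been worth using as a cutoff, and skip autovacuum until a strictly newer cutoff is achievable — can be sketched with plain integers. This ignores the circular 32-bit XID arithmetic a real implementation must use, and the function names are made up for illustration:

```python
def oldest_worth_using(recently_dead_xids, running_xids, next_xid):
    # The oldest XID that could potentially be worth using as a cutoff:
    # oldest of (a) XIDs on tuples judged recently dead,
    # (b) currently running XIDs, and (c) the next XID.
    return min([*recently_dead_xids, *running_xids, next_xid])

def vacuum_could_help(stored_cutoff, running_xids, next_xid):
    # A new vacuum's usable cutoff is bounded by the oldest running XID;
    # don't autovacuum again until that bound has advanced past the
    # value stored by the previous vacuum.
    usable = min([*running_xids, next_xid])
    return usable > stored_cutoff

# Recently dead tuples at xids 800/900, with a long-running transaction 500:
stored = oldest_worth_using([800, 900], [500], 1000)
assert stored == 500                                # pinned by the old transaction
assert not vacuum_could_help(stored, [500], 1000)   # xid 500 still running: useless
assert vacuum_could_help(stored, [950], 1200)       # it finished: worth trying again
```

The point of storing the value per table is that the launcher can test `vacuum_could_help` cheaply, instead of paying for a full heap pass that removes nothing.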
Re: Triage for unimplemented geometric operators
On 07/12/2021 06:18, Tom Lane wrote:
> So my inclination for HEAD is to implement poly_distance and nuke the
> others. I'm a bit less sure about the back branches, but maybe just
> de-document all of them there.

I agree, seems to be a reasonable compromise. Removing 20-plus-year-old unused and slightly misleading code should probably outweigh the natural inclination to implement all of the functions promised in the catalog. Enhancing software by deleting code is not an everyday chance and IMHO should be taken, even when it requires an extra catversion bump.

I am kind of split on whether it is worth implementing poly_distance in the back branches. Maybe there is a benefit: it should not cause any reasonable incompatibilities and will introduce somewhat better compatibility with v15+. We could even get away with not updating v10..12's docs on poly_distance because it's not mentioned there anyway.

I agree on de-documenting all of the unimplemented functions in v13 and v14. Branches before v13 should not require any changes (including documentation) because the detailed table of which operators support which geometry primitives only came in v13, and I could not find in e.g. v12's documentation any references to the cases you listed previously:

> dist_lb: <->(line,box)
> dist_bl: <->(box,line)
> close_sl: lseg ## line
> close_lb: line ## box
> poly_distance: polygon <-> polygon
> path_center: @@ path

-- Anton Voloshin Postgres Professional, The Russian Postgres Company https://postgrespro.ru
Re: types reliant on encodings [was Re: Dubious usage of TYPCATEGORY_STRING]
On 03.12.21 19:42, Chapman Flack wrote:
> Is there any way to find out, from the catalogs or in any automatable
> way, which types are implemented with a dependence on the database
> encoding (or on some encoding)?

What is this needed for? C code can internally do whatever it wants, and the database encoding is effectively a constant, so there is no need for server-side code to be very much concerned about whether types do this.

Also, "types" is perhaps the wrong subject here. Types only contain input and output functions and a few more bits. Additional functions operating on the type could look at the server encoding without the type and its core functions knowing about it.
Re: Dubious usage of TYPCATEGORY_STRING
Peter Eisentraut writes:
> On 02.12.21 22:22, Tom Lane wrote:
>> My first thought about fixing point 1 was to put "char" into some
>> other typcategory, but that turns out to break some of psql's
>> catalog queries, with results like:
>>
>> regression=# \dp
>> ERROR: operator is not unique: unknown || "char"
>> LINE 16: E' (' || polcmd || E'):'
>>               ^
>> HINT: Could not choose a best candidate operator. You might need to add
>> explicit type casts.

> Could we add explicit casts (like polcmd::text) here? Or would it break
> too much?

I assumed it'd break too much to consider doing that. But I suppose that since a typcategory change would be initdb-forcing anyway, maybe it's not out of the question. I'll investigate and see exactly how many places would need an explicit cast.

regards, tom lane
Re: Dubious usage of TYPCATEGORY_STRING
On 02.12.21 22:22, Tom Lane wrote:
> My first thought about fixing point 1 was to put "char" into some
> other typcategory, but that turns out to break some of psql's
> catalog queries, with results like:
>
> regression=# \dp
> ERROR: operator is not unique: unknown || "char"
> LINE 16: E' (' || polcmd || E'):'
>               ^
> HINT: Could not choose a best candidate operator. You might need to add
> explicit type casts.

Could we add explicit casts (like polcmd::text) here? Or would it break too much?
Re: Is there any documentation on how to correctly create extensions in HA(primary-standby) setup?
On 03.12.21 15:28, Bharath Rupireddy wrote:
> I'm thinking of adding the above steps into the "Additional Supplied
> Modules" section documentation. Any thoughts please?
>
> [1] - https://www.postgresql.org/docs/devel/contrib.html

The chapter about extensions is probably better: https://www.postgresql.org/docs/devel/extend-extensions.html
Re: Transparent column encryption
On 06.12.21 21:44, Jacob Champion wrote:
> I think reusing a zero IV will potentially leak more information than
> just equality, depending on the cipher in use. You may be interested in
> synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
> like they would match this use case exactly. (But I'm not a
> cryptographer.)

I'm aware of this and plan to make use of SIV. The current implementation is just an example.

> Have you given any thought to AEAD? As a client I'd like to be able to
> tie an encrypted value to other column (or external) data. For example,
> AEAD could be used to prevent a DBA from copying the (encrypted) value
> of my credit card column into their account's row to use it.

I don't know how that is supposed to work. When the value is encrypted for insertion, the client may know things like table name or column name, so it can tie it to those. But it doesn't know what row it will go in, so you can't prevent the value from being copied into another row. You would need some permanent logical row ID for this, I think.

For this scenario, the deterministic encryption mode is perhaps not the right one.

>> This is not targeting PostgreSQL 15. But I'd appreciate some feedback
>> on the direction.
>
> What kinds of attacks are you hoping to prevent (and not prevent)?

The point is to prevent admins from getting at plaintext data. The scenario you show is an interesting one but I think it's not meant to be addressed by this. If admins can alter the database to their advantage, they could perhaps increase their account balance, create discount codes, etc. also. If this is a problem, then perhaps a better approach would be to store parts of the data in a separate database with separate admins.
pg_dump/restore --no-tableam
I first suggested this a couple years ago. Is it desirable to implement in pg_dump and pg_restore? It'd be just like --no-tablespaces.

On Tue, Jan 28, 2020 at 07:33:17AM -0600, Justin Pryzby wrote:
> I made these casual comments. If there's any agreement on their merit, it'd be
> nice to implement at least the first for v13.
>
> In <20190818193533.gl11...@telsasoft.com>, I wrote:
> > . What do you think about pg_restore --no-tableam; similar to
> >   --no-tablespaces, it would allow restoring a table to a different AM:
> >   PGOPTIONS='-c default_table_access_method=zedstore' pg_restore --no-tableam ./pg_dump.dat -d postgres
> >   Otherwise, the dump says "SET default_table_access_method=heap", which
> >   overrides any value from PGOPTIONS and precludes restoring to new AM.
>
> That appears to be a trivial variation on no-tablespace:
>
>     /* do nothing in --no-tablespaces mode */
>     if (ropt->noTablespace)
>         return;

...
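What the proposed option does is mechanically the same as --no-tablespaces: suppress emitting the SET so that the restore-time default (e.g. from PGOPTIONS) wins. A rough model of the filtering in Python — not pg_restore's actual code, just the observable effect:

```python
def filter_dump(lines, no_tableam=False):
    # Drop "SET default_table_access_method ..." statements when requested,
    # analogous to how --no-tablespaces suppresses "SET default_tablespace".
    out = []
    for line in lines:
        if no_tableam and line.startswith("SET default_table_access_method"):
            continue
        out.append(line)
    return out

dump = [
    "SET default_table_access_method = heap;",
    "CREATE TABLE t (a int);",
]
assert filter_dump(dump, no_tableam=True) == ["CREATE TABLE t (a int);"]
assert filter_dump(dump) == dump  # default behavior unchanged
```

With the SET gone, `PGOPTIONS='-c default_table_access_method=zedstore'` (or any other AM) takes effect at restore time, which is the whole point of the request.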
Re: Transparent column encryption
On 06.12.21 19:28, Robert Haas wrote:
>> Speaking of parameter descriptions, the trickiest part of this whole
>> thing appears to be how to get transparently encrypted data into the
>> database (as opposed to reading it out). It is required to use
>> protocol-level prepared statements (i.e., extended query) for this.
>
> Why? If the client knows the CEK, can't the client choose to send
> unprepared insert or update statements with pre-encrypted blobs? That
> might be a bad idea from a security perspective because the encrypted
> blob might then get logged, but we sometimes log parameters, too.

The client can send something like

    PQexec(conn, "INSERT INTO tbl VALUES ('ENCBLOB', 'ENCBLOB')");

and it will work. (See the included test suite where 'ENCBLOB' is actually computed by pgcrypto.) But that is not transparent encryption. The client wants to send "INSERT INTO tbl VALUES ('val1', 'val2')" and have libpq take care of encrypting 'val1' and 'val2' before hitting the wire. For that you need to use the prepared statement API so that the values are available separately from the statement. And furthermore the client needs to know what columns the insert statement is writing to, so that it can get the CEK for that column. That's what it needs the parameter description for.

As alluded to, workarounds exist or might be made available to do part of that work yourself, but that shouldn't be the normal way of using it.
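The flow Peter describes — prepare the statement, read the parameter description to learn which column each parameter feeds (and hence which CEK applies), then encrypt only those values before Bind — might be modeled like this. All names are hypothetical and the "encryption" is a toy stand-in; a real client gets the parameter description from the server's extended-query Describe response:

```python
import hashlib

# Hypothetical per-column encryption keys (CEKs); only "card" is encrypted.
COLUMN_KEYS = {("tbl", "card"): b"cek-for-card"}

def describe_parameters(statement):
    # Stand-in for the Describe response: which column each parameter feeds.
    # Hard-coded here; in reality this comes back over the protocol.
    assert statement == "INSERT INTO tbl (email, card) VALUES ($1, $2)"
    return [("tbl", "email"), ("tbl", "card")]

def toy_encrypt(key, value):
    # Placeholder for the real cipher; just demonstrates the substitution.
    return "enc:" + hashlib.sha256(key + value.encode()).hexdigest()[:12]

def bind_params(statement, params):
    # Encrypt exactly the parameters that target encrypted columns.
    cols = describe_parameters(statement)
    return [toy_encrypt(COLUMN_KEYS[c], v) if c in COLUMN_KEYS else v
            for c, v in zip(cols, params)]

wire = bind_params("INSERT INTO tbl (email, card) VALUES ($1, $2)",
                   ["alice@example.com", "4111-1111-1111-1111"])
assert wire[0] == "alice@example.com"  # plaintext column passes through
assert wire[1].startswith("enc:")      # encrypted column never hits the wire in clear
```

This is why plain PQexec with values inlined into the SQL text cannot be transparent: the values are not available separately from the statement, so there is no per-parameter step at which to apply the CEK.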
Re: Non-superuser subscription owners
> On Dec 7, 2021, at 2:29 AM, Amit Kapila wrote: > > Okay, let me try to explain again. Following is the text from docs > [1]: " (a) To create a subscription, the user must be a superuser. (b) > The subscription apply process will run in the local database with the > privileges of a superuser. (c) Privileges are only checked once at the > start of a replication connection. They are not re-checked as each > change record is read from the publisher, nor are they re-checked for > each change when applied. > > My understanding is that we want to improve what is written as (c) > which I think is the same as what you mentioned later as "Fix the > current bug wherein subscription changes are applied with superuser > force after the subscription owner has superuser privileges revoked.". > Am I correct till here? If so, I think what I am suggesting should fix > this with the assumption that we still want to follow (b) at least for > the first patch. Ok, that's a point of disagreement. I was trying to fix both (b) and (c) in the first patch. > One possibility is that our understanding of the > first problem is the same but you want to allow apply worker running > even when superuser privileges are revoked provided the user with > which it is running has appropriate privileges on the objects being > accessed by apply worker. Correct, that's what I'm trying to make safe. > We will talk about other points of the roadmap you mentioned once our > understanding for the first one matches. I am happy to have an off-list phone call with you, if you like. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Skipping logical replication transactions on subscriber side
On 03.12.21 03:53, Amit Kapila wrote:
>> I don't know how difficult it would be, but allowing multiple xids
>> might be desirable.
>
> Are there many cases where there could be multiple xid failures that
> the user can skip? Apply worker always keeps looping at the same error
> failure so the user wouldn't know of the second xid failure (if any)
> till the first failure is resolved.

Yeah, nevermind, doesn't make sense.

> Yeah, I also think so. BTW, what do you think of providing extra
> flexibility of giving other options like 'operation', 'rel' along with
> xid? I think such options could be useful for large transactions that
> operate on multiple tables as it is quite possible that only a
> particular operation from the entire transaction is the cause of
> failure. Now, on one side, we can argue that skipping the entire
> transaction is better from the consistency point of view but I think
> it is already possible that we just skip a particular update/delete
> (if the corresponding tuple doesn't exist on the subscriber). For the
> sake of simplicity, we can just allow providing xid at this stage and
> then extend it later as required but I am not very sure of that point.

Skipping transactions partially sounds dangerous, especially when exposed as an option to users. Needs more careful thought.
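The xid-only version under discussion is easy to reason about precisely because it is all-or-nothing: the apply worker discards every change belonging to the named transaction and nothing else. A toy model of that (made-up data shapes; the real feature operates on the logical replication change stream and then clears the setting):

```python
def apply_changes(changes, skip_xid=None):
    """Partition a stream of (xid, op) changes, skipping one whole transaction."""
    applied, skipped = [], []
    for xid, op in changes:
        if xid == skip_xid:
            skipped.append((xid, op))  # the whole transaction is skipped, never part of it
        else:
            applied.append((xid, op))
    return applied, skipped

stream = [(701, "INSERT a"), (702, "UPDATE b"), (702, "DELETE c"), (703, "INSERT d")]
applied, skipped = apply_changes(stream, skip_xid=702)
assert applied == [(701, "INSERT a"), (703, "INSERT d")]
assert skipped == [(702, "UPDATE b"), (702, "DELETE c")]
```

A per-operation or per-relation skip would instead drop only some changes of transaction 702, leaving the subscriber with a state no single publisher transaction ever produced, which is the consistency worry raised above.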
Re: logical decoding and replication of sequences
I have checked the 0001 and 0003 patches. (I think we have dismissed the approach in 0002 for now. And let's talk about 0004 later.) I have attached a few fixup patches, described further below.

# 0001

The argument "create" for fill_seq_with_data() is always true (and patch 0002 removes it). So I'm not sure if it's needed. If it is, it should be documented somewhere.

About the comment you added in nextval_internal(): It's a good explanation, so I would leave it in. I also agree with the conclusion of the comment that the current solution is reasonable. We probably don't need the same comment again in fill_seq_with_data() and again in do_setval(). Note that both of the latter functions already point to nextval_internal() for other comments, so the same can be relied upon here as well.

The order of the new fields sequence_cb and stream_sequence_cb is a bit inconsistent compared to truncate_cb and stream_truncate_cb. Maybe take another look to make the order more uniform.

Some documentation needs to be added to the Logical Decoding chapter. I have attached a patch that I think catches all the places that need to be updated, with some details left for you to fill in. ;-) The ordering of some of the documentation chunks reflects the order in which the callbacks appear in the header files, which might not be optimal; see above.

In reorderbuffer.c, you left a comment about how to access a sequence tuple. There is an easier way, using Form_pg_sequence_data, which is how sequence.c also does it. See attached patch.

# 0003

The tests added in 0003 don't pass for me. It appears that you might have forgotten to update the expected files after you added some tests or something. The cfbot shows the same. See attached patch for a correction, but do check what your intent was.
As mentioned before, we probably don't need the skip-sequences option in test_decoding.

From 2dd9a0e76b9496016e7abebebd0006ddee72d3c1 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Tue, 7 Dec 2021 14:19:42 +0100
Subject: [PATCH 1/4] Whitespace fixes

---
 contrib/test_decoding/expected/sequence.out | 6 +++---
 contrib/test_decoding/sql/sequence.sql      | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/contrib/test_decoding/expected/sequence.out b/contrib/test_decoding/expected/sequence.out
index 17c88990b1..ca55bcebc3 100644
--- a/contrib/test_decoding/expected/sequence.out
+++ b/contrib/test_decoding/expected/sequence.out
@@ -177,7 +177,7 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
 --
 (0 rows)
 
--- rollback on table creation with serial column 
+-- rollback on table creation with serial column
 BEGIN;
 CREATE TABLE test_table (a SERIAL, b INT);
 INSERT INTO test_table (b) VALUES (100);
@@ -202,7 +202,7 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
 --
 (0 rows)
 
--- rollback on table with serial column 
+-- rollback on table with serial column
 CREATE TABLE test_table (a SERIAL, b INT);
 BEGIN;
 INSERT INTO test_table (b) VALUES (100);
@@ -222,7 +222,7 @@ INSERT INTO test_table (b) VALUES (700);
 INSERT INTO test_table (b) VALUES (800);
 INSERT INTO test_table (b) VALUES (900);
 ROLLBACK;
--- check table and sequence values after rollback 
+-- check table and sequence values after rollback
 SELECT * from test_table_a_seq;
  last_value | log_cnt | is_called
 +-+---
diff --git a/contrib/test_decoding/sql/sequence.sql b/contrib/test_decoding/sql/sequence.sql
index d8a34738f3..21c4b79222 100644
--- a/contrib/test_decoding/sql/sequence.sql
+++ b/contrib/test_decoding/sql/sequence.sql
@@ -48,7 +48,7 @@ CREATE TABLE test_table (a INT);
 ROLLBACK;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'skip-sequences', '0');
--- rollback on table creation with serial column 
+-- rollback on table creation with serial column
 BEGIN;
 CREATE TABLE test_table (a SERIAL, b INT);
 INSERT INTO test_table (b) VALUES (100);
@@ -65,7 +65,7 @@ CREATE TABLE test_table (a SERIAL, b INT);
 ROLLBACK;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'skip-sequences', '0');
--- rollback on table with serial column 
+-- rollback on table with serial column
 CREATE TABLE test_table (a SERIAL, b INT);
 BEGIN;
@@ -82,7 +82,7 @@ CREATE TABLE test_table (a SERIAL, b INT);
 INSERT INTO test_table (b) VALUES (900);
 ROLLBACK;
--- check table and sequence values after rollback 
+-- check table and sequence values after rollback
 SELECT * from test_table_a_seq;
 SELECT nextval('test_table_a_seq');
-- 
2.34.1

From c5e9eb2792bd276e493aed582627e93b4c681374 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Tue, 7 Dec 2021 15:29:34 +0100
Subject: [PATCH 2/4] Make tests pass

---
 contrib/test_decoding/expected/sequence.out | 18 --
 1
Is there a way (except from server logs) to know the kind of on-going/last checkpoint?
Hi,

Currently one can know the kind of on-going/last checkpoint (shutdown, end-of-recovery, immediate, force, etc.) only via the server logs, and only when the log_checkpoints GUC is on. At times, users or service-layer components want to know the kind of checkpoint (along with other checkpoint-related info) in order to take some action, and it is cumbersome to search through the server logs for it. The checkpoint info can be obtained from the control file (either by pg_control_checkpoint() or by the pg_controldata tool), whereas the checkpoint kind isn't available there.

How about we add an extra string field to the control file alongside the other checkpoint info it already has? This way, the existing pg_control_checkpoint() function and pg_controldata tool can easily be enhanced to show the checkpoint kind as well. One concern is that we don't want to increase the size of the control file by more than the typical block size (of 8K) to avoid any torn writes. With this change, we would add at most the strings specified at [1]. Adding this to the control file has the advantage of preserving the kind of the last checkpoint, which might be useful.

Thoughts?

[1] for checkpoint: "checkpoint shutdown end-of-recovery immediate force wait wal time flush-all"
for restartpoint: "restartpoint shutdown end-of-recovery immediate force wait wal time flush-all"

Regards, Bharath Rupireddy.
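To make the idea concrete, here is a rough sketch (in Python, purely illustrative; the real change would live in the C checkpoint code) of how the proposed kind string could be assembled from the checkpoint request flags. The flag names mirror the CHECKPOINT_* bits in PostgreSQL's xlog.h, but the bit values below are assumptions for illustration, not copied from the source tree:

```python
# Illustrative mapping from checkpoint request flags to the words in the
# proposed kind string at [1]. Names mirror the CHECKPOINT_* flags in
# xlog.h; the numeric bit values here are made up for this sketch.
CKPT_FLAGS = [
    (1 << 0, "shutdown"),         # CHECKPOINT_IS_SHUTDOWN
    (1 << 1, "end-of-recovery"),  # CHECKPOINT_END_OF_RECOVERY
    (1 << 2, "immediate"),        # CHECKPOINT_IMMEDIATE
    (1 << 3, "force"),            # CHECKPOINT_FORCE
    (1 << 4, "wait"),             # CHECKPOINT_WAIT
    (1 << 5, "wal"),              # CHECKPOINT_CAUSE_XLOG
    (1 << 6, "time"),             # CHECKPOINT_CAUSE_TIME
    (1 << 7, "flush-all"),        # CHECKPOINT_FLUSH_ALL
]

def checkpoint_kind(flags, restartpoint=False):
    """Render a flags bitmask as the human-readable kind string from [1]."""
    words = ["restartpoint" if restartpoint else "checkpoint"]
    words += [name for bit, name in CKPT_FLAGS if flags & bit]
    return " ".join(words)
```

For example, a shutdown checkpoint requested with immediate priority would render as "checkpoint shutdown immediate", which is the kind of string the new control file field (and pg_controldata output) would carry.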
Re: should we document an example to set multiple libraries in shared_preload_libraries?
On Mon, Dec 6, 2021 at 9:20 PM Bharath Rupireddy wrote:
>
> On Fri, Dec 3, 2021 at 11:25 PM Bossart, Nathan wrote:
> >
> > On 12/3/21, 6:21 AM, "Bharath Rupireddy" wrote:
> > > +1 to add here in the "Parameter Names and Values" section, but do we
> > > want to backlink every string parameter to this section? I think it
> > > needs more effort. IMO, we can just backlink for
> > > shared_preload_libraries alone. Thoughts?
> >
> > IMO this is most important for GUC_LIST_QUOTE parameters, of which
> > there are only a handful. I don't think adding a link to every string
> > parameter is necessary.
>
> Agree.
>
> Should we specify something like below in the "Parameter Names and
> Values" section's "String:" para? Do we use generic terminology like
> 'name' and val1, val2, val3 and so on?
>
> ALTER SYSTEM SET name = val1,val2,val3;
> ALTER SYSTEM SET name = 'val1', 'val2', 'val3';
> ALTER SYSTEM SET name = '"val 1"', '"val,2"', 'val3';
>
> Another thing I observed is the difference between how the
> postgresql.conf file and the ALTER SYSTEM SET command are parsed for
> GUC_LIST_QUOTE values.

Since there's a difference in the way the params are set in the postgresql.conf file and with the ALTER SYSTEM SET command (as pointed out by Tom in this thread [1]), I'm now confused. If we were to add examples to the "Parameter Names and Values" section, which examples should we add: the postgresql.conf ones or the ALTER SYSTEM SET ones? Thoughts?

[1] https://www.postgresql.org/message-id/flat/3108541.1638806477%40sss.pgh.pa.us

Regards, Bharath Rupireddy.
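For what it's worth, the two quoting layers that make GUC_LIST_QUOTE values confusing can be sketched as follows. This is an illustrative approximation, not the actual server logic: the helper names are mine, and the rule for when the inner double quotes are required is a simplification of what the real GUC list quoting does:

```python
def quote_guc_list_item(item):
    """Build the SQL literal for one element of a GUC_LIST_QUOTE list,
    as used in: ALTER SYSTEM SET name = <literal>, <literal>, ...

    Two quoting layers are involved: inner double quotes protect an
    element containing spaces, commas, or double quotes from the GUC
    list parser; outer single quotes make it an SQL string literal.
    (Simplified sketch; the server's actual rules cover more cases.)"""
    if any(c in item for c in ' ,"'):
        item = '"' + item.replace('"', '""') + '"'   # GUC-level quoting
    return "'" + item.replace("'", "''") + "'"       # SQL-level quoting

def alter_system_set(name, items):
    """Assemble a hypothetical ALTER SYSTEM SET statement for a list GUC."""
    return ("ALTER SYSTEM SET %s = " % name
            + ", ".join(quote_guc_list_item(i) for i in items) + ";")
```

Running this on the values from the third example above reproduces the doubly quoted form `'"val 1"', '"val,2"', 'val3'`, which is exactly the part users tend to get wrong.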
Re: row filtering for logical replication
On Tue, Dec 7, 2021 at 12:18 PM tanghy.f...@fujitsu.com wrote: > > On Friday, December 3, 2021 10:09 AM Peter Smith > wrote: > > > > On Thu, Dec 2, 2021 at 2:32 PM tanghy.f...@fujitsu.com > > wrote: > > > > > > On Thursday, December 2, 2021 5:21 AM Peter Smith > > wrote: > > > > > > > > PSA the v44* set of patches. > > > > > > > > > > Thanks for the new patch. Few comments: > > > > > > 1. This is an example in publication doc, but in fact it's not allowed. > > > Should we > > > change this example? > > > > > > +CREATE PUBLICATION active_departments FOR TABLE departments WHERE > > (active IS TRUE); > > > > > > postgres=# CREATE PUBLICATION active_departments FOR TABLE departments > > WHERE (active IS TRUE); > > > ERROR: invalid publication WHERE expression for relation "departments" > > > HINT: only simple expressions using columns, constants and immutable > > > system > > functions are allowed > > > > > > > Thanks for finding this. Actually, the documentation looks correct to > > me. The problem was the validation walker of patch 0002 was being > > overly restrictive. It needed to also allow a BooleanTest node. > > > > Now it works (locally) for me. For example. > > > > test_pub=# create table departments(depno int primary key, active boolean); > > CREATE TABLE > > test_pub=# create publication pdept for table departments where > > (active is true) with (publish="insert"); > > CREATE PUBLICATION > > test_pub=# create publication pdept2 for table departments where > > (active is false) with (publish="insert"); > > CREATE PUBLICATION > > > > This fix will be available in v45*. > > > > Thanks for looking into it. > > I have another problem with your patch. The document says: > > ... If the subscription has several publications in > + which the same table has been published with different filters, those > + expressions get OR'ed together so that rows satisfying any of the > expressions > + will be replicated. 
Notice this means if one of the publications has no filter at all then all other filters become redundant.

> Then, what if one of the publications is specified as 'FOR ALL TABLES' or
> 'FOR ALL TABLES IN SCHEMA'?
>
> For example:
> create table tbl (a int primary key);
> create publication p1 for table tbl where (a > 10);
> create publication p2 for all tables;
> create subscription sub connection 'dbname=postgres port=5432' publication p1, p2;

Thanks for the example. I was wondering about this case myself.

> I think for a "FOR ALL TABLES" publication (p2 in my case), table tbl should be
> treated as having no filter, and table tbl should have no filter in subscription
> sub. Thoughts?
>
> But for now, the filter (a > 10) works both when copying initial data and for
> later changes.
>
> To fix it, I think we can check if the table is published in a 'FOR ALL TABLES'
> publication or published as part of a schema in the function
> pgoutput_row_filter_init (which was introduced in the v44-0003 patch); also we
> need to make some changes in tablesync.c.

In order to check "FOR ALL TABLES", we might need to fetch publication metadata. Instead of that, can we add a "TRUE" filter on all the tables which are part of a FOR ALL TABLES publication?

--
Best Wishes, Ashutosh Bapat
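The OR semantics quoted from the docs, together with the proposed "TRUE" filter for FOR ALL TABLES publications, can be sketched like this (pure illustration, not PostgreSQL code; the predicate functions stand in for the publications' WHERE expressions):

```python
# Sketch of the row-filter semantics discussed above: when several
# publications publish the same table with different WHERE filters, a row
# is replicated if ANY filter accepts it. A publication with no filter --
# which is how a FOR ALL TABLES publication would be treated under the
# "TRUE filter" suggestion -- contributes an always-true predicate,
# making every other filter redundant.
def row_is_replicated(row, filters):
    """filters: one predicate per publication; None means 'no filter'."""
    preds = [(lambda _: True) if f is None else f for f in filters]
    return any(p(row) for p in preds)

# Publications from the example: p1 (WHERE a > 10) and p2 (FOR ALL TABLES,
# hence no filter):
p1 = lambda row: row["a"] > 10
p2 = None
```

With both publications in the subscription, the row (a = 5) is replicated despite failing p1's filter, which is the behavior the example above argues the current patch gets wrong during initial copy and apply.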
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
On Wed, Nov 17, 2021 at 8:01 AM Michael Paquier wrote: > > On Tue, Nov 16, 2021 at 01:26:49PM -0300, Alvaro Herrera wrote: > > My opinion is that adding these things willy-nilly is not a solution to > > any actual problem. Adding a few additional log lines that are > > low-volume at DEBUG1 might be useful, but below that (DEBUG2 etc) it's > > not good for anything other than specific development, IMO. At least > > this particular one for streaming replication I think we should not > > include. > > Looking at v2, I think that this leaves the additions of the DEBUG1 > entries in SendBaseBackup() and WalRcvWaitForStartPosition(), then. > The one in pgarch.c does not provide any additional information as the > segment to-be-archived should be part of the command. Thank you all for the inputs. Here's the patch that I've come up with. Upon thinking further, having at least the messages at LOG level [1] would be helpful to know what's happening with the system while in recovery. Although these messages at LOG level seem to be filling up the server logs, having a good log consumption and rotation mechanism (I'm sure every major postgres vendor would have one) would be sufficient to allay that concern. These LOG messages would help us know how much time a restore command takes to fetch the WAL file and what is the current WAL file the server is recovering and where is it recovering from. The customer often asks questions like: when will my server come up? how much time does the recovery of my server take? 
As a developer or admin, one can monitor these logs and do a bunch of things:

1) See how many WAL files are left to be recovered by looking at the WAL files in the archive location, the pg_wal directory, or on the primary.

2) Provide an approximate estimate of when the server will come up, or how much longer recovery will take: from the previous LOG messages one can compute the number of WAL files the server recovered per minute, and combine that with the count of left-over WAL files calculated in (1).

[1]
ereport(LOG, (errmsg("waiting for WAL segment \"%s\" from archive", xlogfname)));
ereport(LOG, (errmsg("restored WAL segment \"%s\" from archive", xlogfname)));
ereport(LOG, (errmsg("recovering WAL segment \"%s\" from source \"%s\"", xlogfname, srcname)));

Regards, Bharath Rupireddy.

v3-0001-add-log-messages-for-recovery.patch Description: Binary data
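As a sketch of point (2), here is how one might estimate the remaining recovery time from the proposed "restored WAL segment" LOG messages. The log-line format assumed by the regex (a timestamp prefix as produced by a typical log_line_prefix) and the function name are my own illustration; adjust for your configuration:

```python
import re
from datetime import datetime

# Rough sketch: estimate remaining recovery time by measuring the rate at
# which "restored WAL segment ... from archive" LOG lines appear. The
# timestamp-prefix format is an assumption based on a common
# log_line_prefix setting.
RESTORED = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*restored WAL segment "([0-9A-F]+)"')

def estimate_remaining(log_lines, segments_left):
    """Return estimated seconds of recovery left, or None if unknown."""
    times = []
    for line in log_lines:
        m = RESTORED.match(line)
        if m:
            times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    if len(times) < 2:
        return None
    elapsed = (times[-1] - times[0]).total_seconds()
    rate = (len(times) - 1) / elapsed          # segments per second
    return segments_left / rate

logs = [
    '2021-12-08 10:00:00 UTC LOG:  restored WAL segment "000000010000000000000010" from archive',
    '2021-12-08 10:00:30 UTC LOG:  restored WAL segment "000000010000000000000011" from archive',
    '2021-12-08 10:01:00 UTC LOG:  restored WAL segment "000000010000000000000012" from archive',
]
```

With one segment restored every 30 seconds and, say, 10 segments left (counted as in point (1)), this predicts about 300 more seconds of recovery.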
Re: pg_get_publication_tables() output duplicate relid
On Tue, Dec 7, 2021 at 3:20 PM Amit Kapila wrote: > On Tue, Dec 7, 2021 at 10:52 AM Amit Langote wrote: > > So if a partition is > > explicitly present in a publication with a filter defined on it, I > > suppose following are the possible scenarios: > > > > * publish_via_partition_root=true > > > > 1. If the root table is not present in the publication (perhaps > > unlikely), use the partition's filter because there's nothing else to > > consider. > > > > 2. If the root table is present in the publication (very likely), use > > the root's filter, ignoring/overriding the partition's own if any. > > > > * publish_via_partition_root=false > > > > 1. If the root table is not present in the publication, use the > > partition's filter because there's nothing else to consider. > > > > 2. If the root table is present in the publication, use the > > partition's own filter if present, else root's. > > I have not tested/checked each of these scenarios individually but I > think it is something like, if publish_via_partition_root is false > then we will always try to use partitions row filter and if it is not > there then we don't use any filter. Similarly, if > publish_via_partition_root is true, then we always try to use a > partitioned table row filter and if it is not there (either because > the partitioned table is not part of the publication or because there > is no filter defined for it) then we don't use any filter. Okay, thanks for the explanation. -- Amit Langote EDB: http://www.enterprisedb.com
Re: Data is copied twice when specifying both child and parent table in publication
On Fri, Dec 3, 2021 at 11:24 AM houzj.f...@fujitsu.com wrote:
>
> On Thursday, December 2, 2021 4:54 PM houzj.f...@fujitsu.com wrote:
> > On Thursday, December 2, 2021 12:50 PM Amit Kapila wrote:
> > > On Thu, Dec 2, 2021 at 9:41 AM Greg Nancarrow wrote:
> > > >
> > > > On Thu, Dec 2, 2021 at 1:48 PM Greg Nancarrow wrote:
> > > > >
> > > > > If you updated my original description to say "(instead of just
> > > > > the individual partitions)", it would imply the same I think.
> > > > > But I don't mind if you want to explicitly state both cases to make
> > > > > it clear.
> > > >
> > > > For example, something like:
> > > >
> > > > For publications of partitioned tables with
> > > > publish_via_partition_root set to true, only the partitioned table
> > > > (and not its partitions) is included in the view, whereas if
> > > > publish_via_partition_root is set to false, only the individual
> > > > partitions are included in the view.
> > >
> > > Yeah, that sounds good to me.
> >
> > It looks good to me as well. Attach the patches for (HEAD~13) which merge
> > the suggested doc change. I prepared the code patch and test patch
> > separately to make it easier for the committer to confirm.
>
> It seems we might not need to backpatch the doc change, so attach another
> version which removes the doc changes from the backpatch patches.

Thanks for the patches; the patch applies and the tests pass on head and the back branches. One minor suggestion:

1) Shall we change:

+
+   For publications of partitioned tables with
+   publish_via_partition_root set to
+   true, only the partitioned table (and not its partitions)
+   is included in the view, whereas if
+   publish_via_partition_root is set to
+   false, only the individual partitions are included in the
+   view.

to:

+
+   For publications of partitioned tables with
+   publish_via_partition_root set to
+   true, only the partitioned table (and not its partitions)
+   is included in the view, whereas if
+   publish_via_partition_root is set to
+   false, only the individual partitions (and not the
+   partitioned table) are included in the
+   view.
+

2) Any particular reason why the code and tests are back-patched but not the document changes?

Regards, Vignesh
Re: Skipping logical replication transactions on subscriber side
On Mon, Dec 6, 2021 at 2:17 PM Amit Kapila wrote:
>
> On Fri, Dec 3, 2021 at 12:12 PM Masahiko Sawada wrote:
> >
> > On Fri, Dec 3, 2021 at 11:53 AM Amit Kapila wrote:
> > >
> > > > But this syntax gives you flexibility, so we can also
> > > > start with a simple implementation.
> > >
> > > Yeah, I also think so. BTW, what do you think of providing extra
> > > flexibility of giving other options like 'operation', 'rel' along with
> > > xid? I think such options could be useful for large transactions that
> > > operate on multiple tables as it is quite possible that only a
> > > particular operation from the entire transaction is the cause of
> > > failure. Now, on one side, we can argue that skipping the entire
> > > transaction is better from the consistency point of view but I think
> > > it is already possible that we just skip a particular update/delete
> > > (if the corresponding tuple doesn't exist on the subscriber). For the
> > > sake of simplicity, we can just allow providing xid at this stage and
> > > then extend it later as required but I am not very sure of that point.
> >
> > +1
> >
> > Skipping a whole transaction by specifying xid would be a good start.
>
> Okay, that sounds reasonable, so let's do that for now.

I'll submit the patch tomorrow. While updating the patch, I realized that skipping a transaction that is prepared on the publisher will be a bit tricky. First of all, since skip-xid is in the pg_subscription catalog, we need to do a catalog update in a transaction and commit it to disable it. I think we need to set the origin-lsn and timestamp of the transaction being skipped on the transaction that does the catalog update. That is, while skipping the (not prepared) transaction, we skip all data-modification changes coming from the publisher, do the catalog update, and commit the transaction.
If we do the catalog update in the next transaction after skipping the whole transaction, skip_xid could be left behind in case of a server crash between them. Also, we cannot set origin-lsn and timestamp on an empty transaction.

In prepared-transaction cases, I think that when handling a prepare message, we need to commit the transaction to update the catalog, instead of preparing it. And at commit prepared and rollback prepared time, we skip it, since there is no prepared transaction on the subscriber. Currently, handling rollback prepared already behaves this way; it first checks whether we have prepared the transaction or not and skips it if we haven't. So I think we need to do that also for the commit prepared case. With that, this requires protocol changes so that the subscriber can get the prepare-lsn and prepare-time when handling commit prepared. So I'm writing a separate patch to add prepare-lsn and timestamp to the commit_prepared message, which will be a building block for skipping prepared transactions.

Actually, I think it's beneficial even today; we can skip preparing the transaction if it's an empty transaction. Although the comment below says it's not a common case, I think it could happen quite often in some cases:

 * XXX, We can optimize such that at commit prepared time, we first check
 * whether we have prepared the transaction or not but that doesn't seem
 * worthwhile because such cases shouldn't be common.
 */

For example, if the publisher has multiple subscriptions and there are many prepared transactions that modify a particular table subscribed by only one of them, many empty transactions are replicated to the other subscribers.

Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: Alter all tables in schema owner fix
On Tue, Dec 7, 2021 at 9:01 PM Amit Kapila wrote: > > Thanks, the patch looks mostly good to me. I have slightly modified it > to incorporate one of Michael's suggestions, ran pgindent, and > modified the commit message. > LGTM, except in the patch commit message I'd change "Create Publication" to "CREATE PUBLICATION". Regards, Greg Nancarrow Fujitsu Australia
Re: Non-superuser subscription owners
On Mon, Dec 6, 2021 at 9:26 PM Mark Dilger wrote: > > > On Dec 6, 2021, at 2:19 AM, Amit Kapila wrote: > > > >>> If we want to maintain the property that subscriptions can only be > >>> owned by superuser > > We don't want to maintain such a property, or at least, that's not what I > want. I don't think that's what Jeff wants, either. > > To clarify, I'm not entirely sure how to interpret the verb "maintain" in > your question, since before the patch the property does not exist, and after > the patch, it continues to not exist. We could *add* such a property, of > course, though this patch does not attempt any such thing. > Okay, let me try to explain again. Following is the text from docs [1]: " (a) To create a subscription, the user must be a superuser. (b) The subscription apply process will run in the local database with the privileges of a superuser. (c) Privileges are only checked once at the start of a replication connection. They are not re-checked as each change record is read from the publisher, nor are they re-checked for each change when applied. My understanding is that we want to improve what is written as (c) which I think is the same as what you mentioned later as "Fix the current bug wherein subscription changes are applied with superuser force after the subscription owner has superuser privileges revoked.". Am I correct till here? If so, I think what I am suggesting should fix this with the assumption that we still want to follow (b) at least for the first patch. One possibility is that our understanding of the first problem is the same but you want to allow apply worker running even when superuser privileges are revoked provided the user with which it is running has appropriate privileges on the objects being accessed by apply worker. We will talk about other points of the roadmap you mentioned once our understanding for the first one matches. 
[1] - https://www.postgresql.org/docs/devel/logical-replication-security.html -- With Regards, Amit Kapila.
Re: Add sub-transaction overflow status in pg_stat_activity
On Tue, Dec 7, 2021 at 10:29 AM Nikolay Samokhvalov wrote:
>
> On Mon, Dec 6, 2021 at 8:16 PM Dilip Kumar wrote:
>>
>> If the subtransaction cache is overflowed in some of the transactions
>> then it will affect all the concurrent queries as they need to access
>> the SLRU for checking the visibility of each tuple. But currently
>> there is no way to identify whether in any backend subtransaction is
>> overflowed or what is the current active subtransaction count.
>
> I think it's a good idea – had the same need when recently researching
> various issues with subtransactions [1], needed to patch Postgres in
> benchmarking environments. To be fair, there is a way to understand that the
> overflowed state is reached for PG 13+ – on standbys, observe reads in
> Subtrans in pg_stat_slru. But of course, it's an indirect way.

Yeah, right.

> I see that the patch adds two new columns to pg_stat_activity: subxact_count
> and subxact_overflowed. This should be helpful to have. Additionally,
> exposing the lastOverflowedXid value would be also good for troubleshooting
> of subtransaction edge and corner cases – a bug recently fixed in all current
> versions [2] was really tricky to troubleshoot in production because this
> value is not visible to DBAs.

Yeah, we can show this too, although we need to take ProcArrayLock in shared mode to read it; but that will only be done on user request, so it should not be an issue IMHO. I will post the updated patch soon, along with the comments given by Zhihong Yu and Justin.

--
Regards, Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Add sub-transaction overflow status in pg_stat_activity
On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby wrote:

Thanks for the review. I will work on these comments.

> > + subxact_count xid
> > +
> > + The current backend's active subtransactions count.
>
> subtransaction (no s)
>
> > + Set to true if current backend's subtransaction cache is overflowed.
>
> Say "has overflowed"
>
> > + if (local_beentry->subxact_count > 0)
> > + {
> > + values[30] = local_beentry->subxact_count;
> > + values[31] = local_beentry->subxact_overflowed;
> > + }
> > + else
> > + {
> > + nulls[30] = true;
> > + nulls[31] = true;
> > + }
>
> Why is the subxact count set to NULL instead of zero?
> You added this to pg_stat_activity, which already has a lot of fields.
> We talked a few months ago about not adding more fields that weren't commonly used.
> https://www.postgresql.org/message-id/flat/20210426191811.sp3o77doinphyjhu%40alap3.anarazel.de#d96d0a116f0344301eead2676ea65b2e
>
> Since I think this field is usually not interesting to most users of
> pg_stat_activity, maybe this should instead be implemented as a function like
> pg_backend_get_subxact_status(pid).
>
> People who want to could use it like:
> SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) sub;

Yeah, this is a valid point, I will change this accordingly.

--
Regards, Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Alter all tables in schema owner fix
On Tue, Dec 7, 2021 at 8:21 AM vignesh C wrote: > > Thanks for your comments, I have made the changes. Additionally I have > renamed IsSchemaPublication to is_schema_publication for keeping the > naming similar around the code. The attached v3 patch has the changes > for the same. > Thanks, the patch looks mostly good to me. I have slightly modified it to incorporate one of Michael's suggestions, ran pgindent, and modified the commit message. I am planning to push the attached tomorrow unless there are further comments. Michael, do let me know if you have any questions or objections about this? -- With Regards, Amit Kapila. v4-0001-Fix-changing-the-ownership-of-ALL-TABLES-IN-SCHEM.patch Description: Binary data