Re: "out of relcache_callback_list slots" after multiple calls to pg_logical_slot_get_binary_changes
Good catch! At Sun, 19 Feb 2023 02:40:31 +, "shiy.f...@fujitsu.com" wrote in > init_rel_sync_cache()), but they are not unregistered when it shuts down. So, > after multiple calls to the function, MAX_RELCACHE_CALLBACKS is exceeded. This > is mentioned in the following comment. > > /* >* We can get here if the plugin was used in SQL interface as the >* RelSchemaSyncCache is destroyed when the decoding finishes, but there >* is no way to unregister the relcache invalidation callback. >*/ > if (RelationSyncCache == NULL) > return; > > Could we fix it by adding two new functions to unregister the relcache callback and > syscache callback? I tried to do so in the attached patch. I'm pretty sure that every time an output plugin is initialized in a process, it installs the same set of syscache/relcache callbacks. Do you think we could simply stop duplicate registration of those callbacks by using a static boolean? It would be far simpler. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Add WAL read stats to pg_stat_wal
At Thu, 16 Feb 2023 11:11:38 -0800, Andres Freund wrote in > I wonder if we should keep the checkpointer around for longer. If we have > checkpointer signal postmaster after it wrote the shutdown checkpoint, > postmaster could signal walsenders to shut down, and checkpointer could do > some final work, like writing out the stats. > I suspect this could be useful for other things as well. It's awkward that we > don't have a place to put "just before shutting down" type tasks. And > checkpointer seems well suited for that. I totally agree that it will be useful, but I'm not quite sure how checkpointer would be able to let postmaster know about that state without requiring access to shared memory. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Refactor calculations to use instr_time
At Fri, 17 Feb 2023 13:53:36 +0300, Nazir Bilal Yavuz wrote in > Thanks for the review. I updated the patch. WalUsageAccumDiff(, , ); - PendingWalStats.wal_records = diff.wal_records; - PendingWalStats.wal_fpi = diff.wal_fpi; - PendingWalStats.wal_bytes = diff.wal_bytes; ... + WALSTAT_ACC(wal_records, diff); + WALSTAT_ACC(wal_fpi, diff); + WALSTAT_ACC(wal_bytes, diff); + WALSTAT_ACC(wal_buffers_full, PendingWalStats); The lifetime of the variable "diff" seems to be longer now. Wouldn't it be clearer if we renamed it to something more meaningful, like wal_usage_diff, WalUsageDiff or PendingWalUsage? Along those same lines, it occurs to me that the new struct should be named PgStat_PendingWalStats, instead of ..Usage. That change makes the name of the type and the variable consistent. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Thu, 16 Feb 2023 06:20:23 +, "Hayato Kuroda (Fujitsu)" wrote in > Dear Horiguchi-san, > > Thank you for responding! Before modifying patches, I want to confirm > something > you said. > > > As Amit-K mentioned, we may need to change the name of the option in > > this version, since the delay mechanism in this version causes a delay > > in sending from publisher than delaying apply on the subscriber side. > > Right, will be changed. > > > I'm not sure why output plugin is involved in the delay mechanism. It > > appears to me that it would be simpler if the delay occurred in > > reorder buffer or logical decoder instead. > > I'm planning to change, but.. Yeah, I don't think we've made up our minds about which way to go yet, so it's a bit too early to work on that. > > Perhaps what I understand > > correctly is that we could delay right before only sending commit > > records in this case. If we delay at publisher end, all changes will > > be sent at once if !streaming, and otherwise, all changes in a > > transaction will be spooled at subscriber end. In any case, apply > > worker won't be holding an active transaction unnecessarily. > > What about parallel case? Latest patch does not reject the combination of > parallel > streaming mode and delay. If delay is done at commit and subscriber uses an > parallel > apply worker, it may acquire lock for a long time. I didn't look too closely, but my guess is that transactions are conveyed in spool files in parallel mode, with each file storing a complete transaction. > > Of > > course we need add the mechanism to process keep-alive and status > > report messages. > > Could you share the good way to handle keep-alive and status messages if you > have? > If we changed to the decoding layer, it is strange to call walsender function > directly. I'm sorry, but I don't have a concrete idea at the moment. When I read through the last patch, I missed that WalSndDelay is actually a subset of WalSndLoop. 
Although it can handle keep-alives correctly, I'm not sure we can accept that structure.. > > Those setups work fine when no > > apply-delay involved, but they won't work with the patches we're > > talking about because the subscriber won't respond to the keep-alive > > packets from the peer. So when I wrote "practically works" in the > > last mail, this is what I meant. > > I'm not sure around the part. I think in the latest patch, subscriber can > respond > to the keepalive packets from the peer. Also, publisher can respond to the > peer. > Could you please tell me if you know a case that publisher or subscriber > cannot > respond to the opposite side? Note that if we apply the publisher-side patch, > we > don't have to apply subscriber-side patch. Sorry about that again, I missed that part in the last patch as mentioned earlier.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Wed, 15 Feb 2023 11:29:18 +, "Hayato Kuroda (Fujitsu)" wrote in > Dear Andres and other hackers, > > > OTOH, if we want to implement the functionality on publisher-side, > > I think we need to first consider the interface. > > We can think of two options (a) Have it as a subscription parameter as the > > patch > > has now and > > then pass it as an option to the publisher which it will use to delay; > > (b) Have it defined on publisher-side, say via GUC or some other way. > > The basic idea could be that while processing commit record (in > > DecodeCommit), > > we can somehow check the value of delay and then use it there to delay > > sending > > the xact. > > Also, during delay, we need to somehow send the keepalive and process > > replies, > > probably via a new callback or by some existing callback. > > We also need to handle in-progress and 2PC xacts in a similar way. > > For the former, probably we would need to apply the delay before sending > > the first > > stream. > > Could you please share what you feel on this direction as well ? > > I implemented a patch that the delaying is done on the publisher side. In > this patch, > approach (a) was chosen, in which min_apply_delay is specified as a > subscription > parameter, and then apply worker passes it to the publisher as an output > plugin option. As Amit-K mentioned, we may need to change the name of the option in this version, since the delay mechanism in this version causes a delay in sending from the publisher rather than delaying apply on the subscriber side. I'm not sure why the output plugin is involved in the delay mechanism. It appears to me that it would be simpler if the delay occurred in the reorder buffer or logical decoder instead. If I understand correctly, we could delay right before sending only the commit records in this case. If we delay at the publisher end, all changes will be sent at once if !streaming, and otherwise, all changes in a transaction will be spooled at the subscriber end. 
In any case, apply worker won't be holding an active transaction unnecessarily. Of course we need to add a mechanism to process keep-alive and status report messages. > During the delay, the walsender periodically checks and processes replies > from the > apply worker and sends keepalive messages if needed. Therefore, the ability > to handle > keepalives is not loosed. My understanding is that keep-alives are a different mechanism with a different objective from status reports. Even if the subscriber doesn't send spontaneous or extra status reports at all, the connection can be checked and maintained by keep-alive packets. It is possible to set up an asymmetric configuration where only the walsender sends keep-alives, but none are sent from the peer. Those setups work fine when no apply-delay is involved, but they won't work with the patches we're talking about because the subscriber won't respond to the keep-alive packets from the peer. So when I wrote "practically works" in the last mail, this is what I meant. Thus if someone plans to enable apply_delay for logical replication, that person should be aware of some additional subtle restrictions that are required compared to non-delayed setups. > To delay the transaction in the output plugin layer, the new > LogicalOutputPlugin > API was added. For now, I choose the output plugin layer but can consider to > do > it from the core if there is a better way. > > Could you please share your opinion? > > Note: thanks for Osumi-san to help implementing. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
At Tue, 14 Feb 2023 22:35:01 -0800, Maciek Sakrejda wrote in > On Tue, Feb 14, 2023 at 11:08 AM Andres Freund wrote: > > One thing I started to wonder about since is whether we should remove the > > io_ > > prefix from io_object, io_context. The prefixes make sense on the C level, > > but > > it's not clear to me that that's also the case on the table level. > > Yeah, +1. It's hard to argue that there would be any confusion, > considering `io_` is in the name of the view. We usually add such prefixes to the columns of system views and catalogs, but it seems that's not the case for the stats views. Thus +1 from me, too. > (Unless, I suppose, some other, non-I/O, "some_object" or > "some_context" column were to be introduced to this view in the > future. But that doesn't seem likely?) I don't think that can happen. As for cross-view ambiguity, that is already present. Many columns in the stats views share names with some other views. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Exit walsender before confirming remote flush in logical replication
At Mon, 13 Feb 2023 17:13:43 -0800, Andres Freund wrote in > On 2023-02-14 10:05:40 +0900, Kyotaro Horiguchi wrote: > > What do you think about the need for explicitly specifying the > > default? I'm fine with specifying the default using a single word, > > such as WAIT_FOR_REMOTE_FLUSH. > > We obviously shouldn't force the option to be present. Why would we want to > break existing clients unnecessarily? Without it the behaviour should be > unchanged from today's. I didn't suggest making the option mandatory. I just suggested providing a way to specify the default value explicitly, like in the recent commit 746915c686. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
KeepLogSeg needs some fixes on behavior
This is a thread derived from [1], discussing some subtle behaviors of KeepLogSeg. 1: https://www.postgresql.org/message-id/20230213194131.hgzs6ropcvhda...@awork3.anarazel.de At Mon, 13 Feb 2023 11:41:31 -0800, Andres Freund wrote > Hi, > > On 2023-02-13 15:45:49 +0900, Kyotaro Horiguchi wrote: > > This seems to have a thin connection to the issue, but. > > I was worried that the changes could lead us to removing WAL without > max_slot_wal_keep_size set. > > > > > It seems decidedly not great to not log at least a debug1 (but probably it > > > should be LOG) message when KeepLogSeg() decides to limit based on > > > max_slot_wal_keep_size. > > > > It's easy to do that, but that log is highly accompanied by a LOG line > > "terminating process %d to release replication slot \"%s\"". I don't > > mind adding it if it is a DEBUGx. > > My problem with that is that we might *NOT* see those log messages for some > reason, but that that's impossible to debug as-is. And even if we see them, > it's not that easy to figure out by how much we were over > max_slot_wal_keep_size, because we always report it in the context of a > specific slot. Since 551aa6b7b9, InvalidatePossiblyObsoleteSlot() emits the following detail message in that case for both "terminating" and "invalidating" messages. errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.", LSN_FORMAT_ARGS(restart_lsn), (unsigned long long) (oldestLSN - restart_lsn)) Where oldestLSN is the cutoff LSN computed by KeepLogSeg(). > Removing WAL that's still needed is a *very* severe operation. Emitting an > additional line in case it happens isn't a problem. Totally agreed about the severity. The message above doesn't explicitly say the source of the cutoff LSN, but the only possible source is max_slot_wal_keep_size. I think that DEBUG1 is appropriate for the message from KeepLogSeg(), especially given how often we see it. 
> > > It feels wrong to subtract max_slot_wal_keep_size from recptr - that's > > > the end > > > of the checkpoint record. Given that we, leaving max_slot_wal_keep_size > > > aside, > > > only actually remove WAL if older than the segment that RedoRecPtr (the > > > logical start of the checkpoint) is in. If the checkpoint is large we'll > > > end > > > up removing replication slots even though they potentially would only have > > > retained one additional WAL segment. > > > > I think that it is a controversial part, but that variable is defined > > the similar way to wal_keep_size. And I think that all max_wal_size, > > wal_keep_size and max_slot_wal_keep_size being defined with the same > > base LSN makes things simpler for users (also for developers). > > Regardless of checkpoint length, if slots get frequently invalidated, > > the setting should be considered to be too small for the system > > requirements. > > I think it's bad that we define wal_keep_size, max_slot_wal_keep_size that > way. I don't think bringing max_wal_size into this is useful, as it influences > different things. In my faint memory, when wal_keep_segments was switched to wal_keep_size, the first-cut patch translated the latter to the former by rounding up, but that was rejected and we ended up with the formula we have now. Speaking of max_slot_wal_keep_size, I think it depends on how we interpret the variable. If we see it as the minimum amount to ensure, then we should round it up. But if we see it as the maximum amount that can't be exceeded, then we would round it down like we do now. However, I also think that the "max" prefix could imply something about the upper limit. 
> > > > I think max_slot_wal_keep_size is, like max_wal_size for checkpoints, > > a safeguard for slots not to fill-up WAL directory. Thus they both are > > rounded down. If you have 1GB WAL directory and set wal_segment_size > > to 4192MB, I don't see it a sane setup. But if segment size is smaller > > than one hundredth of max_wal_size, that difference won't matter for > > anyone. But anyway, it's a pain in the a.. that the both of them (and > > wal_keep_size) don't work in a immediate manner, though.. > > It doesn't matter a lot for 16MB segments, but with 1GB segments it's a > different story. > > To me the way it's done now is a bug, one that can in extr
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
At Tue, 14 Feb 2023 16:55:25 -0800, Andres Freund wrote in > Hi, > > On 2023-02-14 18:00:00 +0530, Bharath Rupireddy wrote: > > Done, PSA v2 patch. > > This feels way too complicated to me. How about something more like the > attached? I like this one, but the parameters offset and size are in a different order from pwrite(fd, buf, count, offset). I prefer the arrangement suggested by Bharath. And isn't it better to use Min(remaining_size, BLCKSZ) instead of a bare if statement? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
At Tue, 14 Feb 2023 15:43:26 +0100, "Drouvot, Bertrand" wrote in > Oh right, my bad (the issue has been introduced in V2). > Fixed in V4. Great! > > I thought that we might be able to return entry_ref->pending since the > > callers don't call pfree on the returned pointer, but it is not great > > that we don't inform the callers if the returned memory can be safely > > pfreed or not. > > Thus what I have in mind is the following. > > > >>if (!entry_ref) > >> + { > >>entry_ref = > >> pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, > >>InvalidOid, rel_id); > >> + if (!entry_ref) > >> + return NULL; > >> + } > > LGTM, done that way in V4. That part looks good to me, thanks! I was going through v4 and it seems to me that the comment for find_tabstat_entry may not be quite right. > * find any existing PgStat_TableStatus entry for rel > * > * Find any existing PgStat_TableStatus entry for rel_id in the current > * database. If not found, try finding from shared tables. > * > * If no entry found, return NULL, don't create a new one The comment assumes that the function directly returns an entry from shared memory, but now it copies the entry's contents into palloc'ed memory and stores the sums of some counters for the current transaction in it. Do you think we should update the comment to reflect this change? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
Thanks for the new version. At Mon, 13 Feb 2023 09:58:52 +0100, "Drouvot, Bertrand" wrote in > Hi, > > On 2/13/23 8:40 AM, Kyotaro Horiguchi wrote: > > At Mon, 13 Feb 2023 08:09:50 +0100, "Drouvot, Bertrand" > > wrote in > >>> I think this is useful beyond being able to generate those functions > >>> with > >>> macros. The fact that we had to deal with transactional code in > >>> pgstatfuncs.c > >>> meant that a lot of the relevant itnernals had to be exposed "outside" > >>> pgstat, > >>> which seems architecturally not great. > >>> > >> Right, good point. > > Agreed. > > > >> Removing the pfrees in V2 attached. > > Ah, that sound good. > > if (!entry_ref) > > + { > > entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, > > InvalidOid, rel_id); > > + return tablestatus; > > + } > > We should return something if the call returns a non-null value? > > What we do is: if entry_ref is NULL then we return NULL (so that the > caller returns 0). > > If entry_ref is not NULL then we return a copy of entry_ref->pending > (with or without subtrans). Isn't it ignoring the second call to pgstat_fetch_pending_entry? What the code did is: if entry_ref is NULL for MyDatabaseId then we *retry* fetching a global (not database-wide) entry. If a global entry is found, return it (more precisely, entry_ref->pending) to the caller. The current patch returns NULL when a global entry is found. I thought that we might be able to return entry_ref->pending since the callers don't call pfree on the returned pointer, but it is not great that we don't inform the callers if the returned memory can be safely pfreed or not. Thus what I have in mind is the following. > if (!entry_ref) > + { > entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, > InvalidOid, rel_id); > + if (!entry_ref) > + return NULL; > + } > > So, since we want to hide the internal from pgstatfuncs, the > > additional flag should be gone. 
> > I think there is pros and cons for both but I don't have a strong > opinion about that. > > So also proposing V3 attached without the flag in case this is the > preferred approach. That part looks good to me. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Mon, 13 Feb 2023 15:51:25 +0530, Amit Kapila wrote in > I think we can introduce a new variable as last_feedback_time in the > LogicalRepWorker structure and probably for the last_received, we can > last_lsn in MyLogicalRepWorker as that seems to be updated correctly. > I think it would be good to avoid global variables. MyLogicalRepWorker is a global variable:p, but it is far better than a bare one. By the way, we are trying to send the status messages regularly, but as Andres pointed out, the worker neither reads nor replies to keepalive messages from the publisher while delaying. That is not possible as long as we choke the stream at the subscriber end. It doesn't seem to be a practical problem, but IMHO I think he's right in terms of adherence to the wire protocol, which was also one of my own initial concerns. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl
At Mon, 13 Feb 2023 12:18:07 +0900, Michael Paquier wrote in > On Mon, Feb 13, 2023 at 11:27:58AM +0900, Kyotaro Horiguchi wrote: > > I think currently the output by --describe-config can be used only for > > consulting while editing a (possiblly broken) config file. Thus I > > think it's no use showing GIC_DISALLOW_IN_FILE items there unless we > > use help_config() for an on-session use. > > > > On the other hand, don't we need to remove the condition > > GUC_NOT_IN_SAMPLE from displayStruct? I think that help_config() > > should show a value if it is marked as !GUC_DISALLOW_IN_FILE even if > > it is GUC_NOT_IN_SAMPLE. I'm not sure whether there's any variable > > that are marked that way, though. > > As in marked with GUC_NOT_IN_SAMPLE but not GUC_DISALLOW_IN_FILE? > There are quite a lot, developer GUCs being one (say > ignore_invalid_pages). We don't want to list them in the sample file > so as common users don't play with them, still they make sense if > listed in a file. Ah, right. I think I faintly had them in my mind. > If you add a check meaning that GUC_DISALLOW_IN_FILE implies > GUC_NOT_IN_SAMPLE, where one change would need to be applied to > config_file as all the other GUC_DISALLOW_IN_FILE GUCs already do > that, you could remove GUC_DISALLOW_IN_FILE. However, > GUC_NOT_IN_SAMPLE should be around to not expose options, we don't > want common users to know too much about. Okay, I thought that "postgres --help-config" was a sort of developer option, but your explanation above makes sense. > The question about how much people rely on --describe-config these > days is a good one, so perhaps there could be an argument in removing Yeah, that's the reason I thought it was a developer option... > GUC_NOT_IN_SAMPLE from the set. TBH, I would be really surprised that > anybody able to use a developer option writes an configuration file in > an incorrect format and needs to use this option, though :) Hmm. 
I didn't directly link GUC_NOT_IN_SAMPLE to being a developer option. But on second thought, it seems that it is. So, the current code looks good to me now. Thanks for the explanation. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Exit walsender before confirming remote flush in logical replication
At Mon, 13 Feb 2023 08:27:01 +0530, Amit Kapila wrote in > On Mon, Feb 13, 2023 at 7:26 AM Kyotaro Horiguchi > wrote: > > > > IMHO I vaguely don't like that we lose a means to specify the default > > behavior here. And I'm not sure we definitely don't need other than > > flush and immedaite for both physical and logical replication. > > > > If we can think of any use case that requires its extension then it > makes sense to make it a non-boolean option but otherwise, let's keep > things simple by having a boolean option. What do you think about the need for explicitly specifying the default? I'm fine with specifying the default using a single word, such as WAIT_FOR_REMOTE_FLUSH. > > If it's > > not the case, I don't object to make it a Boolean. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
At Mon, 13 Feb 2023 10:15:03 +0530, Bharath Rupireddy wrote in > On Sun, Feb 12, 2023 at 11:01 PM Andres Freund wrote: > > > > On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote: > > > /* Prepare to write out a lot of copies of our zero buffer at once. > > > */ > > > for (i = 0; i < lengthof(iov); ++i) > > > { > > > - iov[i].iov_base = zbuffer.data; > > > + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, > > > )->data); > > > iov[i].iov_len = zbuffer_sz; > > > } > > > > Another thing: I think we should either avoid iterating over all the IOVs if > > we don't need them, or, even better, initialize the array as a constant, > > once. FWIW, I tried to use the "{[start .. end] = {}}" trick (GNU extension? [*1]) for constant array initialization, but individual members don't accept assigning a const value, thus I did deconstify as follows. > static const struct iovec iov[PG_IOV_MAX] = > {[0 ... PG_IOV_MAX - 1] = >{ >.iov_base = (void *), >.iov_len = BLCKSZ >} > }; I didn't check the actual mapping, but when I tried an assignment "iov[0].iov_base = NULL", it failed as "assignment of member ‘iov_base’ in read-only object", so is it successfully placed in a read-only segment? Later code assigns iov[0].iov_len, thus we need to provide a separate non-const iov variable, or can we use pwrite instead there? (I didn't find pg_pwrite_with_retry(), though) > How about like the attached patch? It makes the iovec static variable > and points the zero buffer only once/for the first time to iovec. This > avoids for-loop on every call. As for the patch itself, it seems to forget to reset iov[0].iov_len after writing a partial block. regards. *1: https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Designated-Inits.html -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
At Mon, 13 Feb 2023 08:09:50 +0100, "Drouvot, Bertrand" wrote in >> I think this is useful beyond being able to generate those functions >> with >> macros. The fact that we had to deal with transactional code in >> pgstatfuncs.c >> meant that a lot of the relevant itnernals had to be exposed "outside" >> pgstat, >> which seems architecturally not great. >> > Right, good point. Agreed. > Removing the pfrees in V2 attached. Ah, that sounds good. if (!entry_ref) + { entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, InvalidOid, rel_id); + return tablestatus; + } We should return something if the call returns a non-null value? So, since we want to hide the internal from pgstatfuncs, the additional flag should be gone. If the additional cost doesn't bother anyone, I don't mind removing the flag. The patch will get far simpler by that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl
At Fri, 10 Feb 2023 16:43:15 +0900, Michael Paquier wrote in > On Wed, Feb 08, 2023 at 09:42:13PM -0500, Tom Lane wrote: > > Hm. On the one hand, if it is in fact not in postgresql.conf.sample, > > then that flag should be set for sure. OTOH I see that that flag > > isn't purely documentation: help_config.c thinks it should hide > > GUCs that are marked that way. Do we really want that behavior? > > Not sure. I can see an argument that you might want --describe-config > > to tell you that, but there are a lot of other GUC_NOT_IN_SAMPLE > > GUCs that maybe do indeed deserve to be left out. > > I am not sure to follow. help_config() won't show something that's > either marked NO_SHOW_ALL, NOT_IN_SAMPLE or DISALLOW_IN_FILE, hence > config_file does not show up already in what postgres > --describe-config prints, because it has DISALLOW_IN_FILE, so adding > NOT_IN_SAMPLE changes nothing. I think currently the output by --describe-config can be used only for consulting while editing a (possibly broken) config file. Thus I think it's no use showing GUC_DISALLOW_IN_FILE items there unless we use help_config() for on-session use. On the other hand, don't we need to remove the condition GUC_NOT_IN_SAMPLE from displayStruct? I think that help_config() should show a value if it is marked as !GUC_DISALLOW_IN_FILE even if it is GUC_NOT_IN_SAMPLE. I'm not sure whether there's any variable marked that way, though. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Exit walsender before confirming remote flush in logical replication
At Fri, 10 Feb 2023 12:40:43 +, "Hayato Kuroda (Fujitsu)" wrote in > Dear Amit, > > > Can't we have this option just as a bool (like shutdown_immediate)? > > Why do we want to keep multiple modes? > > Of course we can use boolean instead, but current style is motivated by the > post[1]. > This allows to add another option in future, whereas I do not have idea now. > > I want to ask other reviewers which one is better... > > [1]: > https://www.postgresql.org/message-id/20230208.112717.1140830361804418505.horikyota.ntt%40gmail.com IMHO I vaguely don't like that we lose a means to specify the default behavior here. And I'm not sure we definitely don't need other than flush and immedaite for both physical and logical replication. If it's not the case, I don't object to make it a Boolean. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Fri, 10 Feb 2023 10:34:49 +0530, Amit Kapila wrote in > On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila wrote: > > > > On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi > > wrote: > > > We have suffered this kind of feedback silence many times. Thus I > > > don't want to rely on luck here. I had in mind of exposing last_send > > > itself or providing interval-calclation function to the logic. > > > > I think we have last_send time in send_feedback(), so we can expose it > > if we want but how would that solve the problem you are worried about? The walreceiver can avoid a too-long sleep by knowing when to wake up for the next feedback. > I have an idea to use last_send time to avoid walsenders being > timeout. Instead of directly using wal_receiver_status_interval as a > minimum interval to send the feedback, we should check if it is > greater than last_send time then we should send the feedback after > (wal_receiver_status_interval - last_send). I think they can probably > be different only on the very first time. Any better ideas? If it means MyLogicalRepWorker->last_send_time, it is not the last time the walreceiver sent feedback but the last time the wal*sender* sent data. So I'm not sure that works. We could use the variable that way, but AFAIU in turn when so many changes have been spooled that control doesn't return to LogicalRepApplyLoop for longer than wal_r_s_interval, maybe_apply_delay() starts calling send_feedback() at every call after the first feedback timing. Even in that case, send_feedback() won't actually send one until the next feedback timing; I don't think that behavior is great. The only packets the walreceiver sends back are the feedback packets, and currently only send_feedback knows the last feedback time. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_usleep for multisecond delays
At Thu, 9 Feb 2023 14:51:14 -0800, Nathan Bossart wrote in > On Thu, Feb 09, 2023 at 01:30:27PM -0800, Andres Freund wrote: > > So I'm not sure it's the right direction to make pg_usleep() easier to > > use... > Hm... We might be able to use WaitLatch() for some of these. And I think we are actually going to reduce or eliminate the use of pg_sleep(). regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
At Thu, 9 Feb 2023 11:38:18 +0100, "Drouvot, Bertrand" wrote in > Hi hackers, > > Please find attached a patch proposal for $SUBJECT. > > The idea has been raised in [1] by Andres: it would allow to simplify > even more the work done to > generate pg_stat_get_xact*() functions with Macros. > > Indeed, with the reconciliation done in find_tabstat_entry() then all > the pg_stat_get_xact*() functions > (making use of find_tabstat_entry()) now "look the same" (should they > take into account live subtransactions or not). > > Looking forward to your feedback, I like that direction. Shouldn't we rename PgStat_FunctionCounts to PgStat_FuncStatus, unifying the neighboring functions? Why does find_tabstat_entry() copy the whole pending data and perform subtransaction summarization? The summarization is needed by only a few callers, but now that cost is imposed on all callers, along with additional palloc()/pfree() calls. That doesn't seem reasonable. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Is psSocketPoll doing the right thing?
> Subject: is p*s*Socket.. Oops... At Thu, 9 Feb 2023 17:32:08 +0900, Fujii Masao wrote in > > > On 2023/02/09 11:50, Kyotaro Horiguchi wrote: > > Hello. > > While looking a patch, I found that pqSocketPoll passes through the > > result from poll(2) to the caller and throws away revents. If I > > understand it correctly, poll() *doesn't* return -1 nor errno by the > > reason it has set POLLERR, POLLHUP, POLLNVAL, and POLLRDHUP for some > > of the target sockets, and returns 0 unless poll() itself failed to > > work. > > As far as I understand correctly, poll() returns >0 if "revents" > has either of those bits, not 0 nor -1. Right, that matches my understanding. If any of the sockets is in any of those states, pqSocketPoll returns a positive value, which makes pqSocketCheck return 1. Finally, pqRead/WriteReady return "ready" even though the connection socket is in an error state. Actually, that behavior does no harm, since the subsequent actual read/write will "properly" fail. However, once we use this function simply to check that the socket is sound without doing an actual read/write, that behavior becomes harmful because of the false answer. > You're thinking that pqSocketPoll() should check "revents" and > return -1 if either of those bits is set? In short, yes. pqSocketPoll() should somehow inform callers about that state. Fortunately, pqSocketPoll is a private function, so we can refactor it to do that properly. If no one objects to that change, I'll work on that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
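As a sketch of what "inform callers about that state" could look like, the revents classification below reports error-ish bits as -1 instead of folding them into a positive "ready" result. This is illustrative only, not the actual libpq refactoring; the function name is made up:

```c
#include <poll.h>

/*
 * Illustrative classification of a pollfd's revents: error/closed
 * conditions (POLLERR, POLLHUP, POLLNVAL) are reported as -1 rather
 * than being counted as "ready", which is the behavior the thread
 * complains about in the current pass-through of poll()'s result.
 */
int
classify_revents(short revents)
{
    if (revents & (POLLERR | POLLHUP | POLLNVAL))
        return -1;              /* socket is in an error/closed state */
    if (revents & (POLLIN | POLLOUT))
        return 1;               /* genuinely readable and/or writable */
    return 0;                   /* nothing of interest happened */
}
```

Note how a socket that is both readable and in error is still classified as -1, which is exactly the case a health check cares about.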
Re: Time delayed LR (WAS Re: logical replication restrictions)
Mmm. Part of the previous mail has gone missing for some uncertain reason and was replaced by mysterious blank lines... At Fri, 10 Feb 2023 09:57:22 +0900 (JST), Kyotaro Horiguchi wrote in > At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila > wrote in > > On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi > > wrote: > > > > > > At Wed, 8 Feb 2023 09:03:03 +, "Hayato Kuroda (Fujitsu)" > > > wrote in > > > > Thank you for reviewing! PSA new version. > > > > > > + if (statusinterval_ms > 0 && diffms > statusinterval_ms) > > > > > > The next expected feedback time is measured from the last status > > > report. Thus, it seems to me this may suppress feedbacks from being > > > sent for an unexpectedly long time especially when min_apply_delay is > > > shorter than wal_r_s_interval. > > > > > > > I think the minimum time before we send any feedback during the wait > > is wal_r_s_interval. Now, I think if there is no transaction for a > > long time before we get a new transaction, there should be keep-alive > > messages in between which would allow us to send feedback at regular > > intervals (wal_receiver_status_interval). So, I think we should be > > Right. > > > able to send feedback in less than 2 * wal_receiver_status_interval > > unless wal_sender/receiver timeout is very large and there is a very > > low volume of transactions. Now, we can try to send the feedback > > We have suffered this kind of feedback silence many times. Thus I > don't want to rely on luck here. I had in mind of exposing last_send > itself or providing interval-calclation function to the logic. > > > before we start waiting or maybe after every > > wal_receiver_status_interval / 2 but I think that will lead to more > > spurious feedback messages than we get the benefit from them. Agreed. I think we don't want to do that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila wrote in > On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi > wrote: > > > > At Wed, 8 Feb 2023 09:03:03 +, "Hayato Kuroda (Fujitsu)" > > wrote in > > > Thank you for reviewing! PSA new version. > > > > + if (statusinterval_ms > 0 && diffms > statusinterval_ms) > > > > The next expected feedback time is measured from the last status > > report. Thus, it seems to me this may suppress feedbacks from being > > sent for an unexpectedly long time especially when min_apply_delay is > > shorter than wal_r_s_interval. > > > > I think the minimum time before we send any feedback during the wait > is wal_r_s_interval. Now, I think if there is no transaction for a > long time before we get a new transaction, there should be keep-alive > messages in between which would allow us to send feedback at regular > intervals (wal_receiver_status_interval). So, I think we should be Right. > able to send feedback in less than 2 * wal_receiver_status_interval > unless wal_sender/receiver timeout is very large and there is a very > low volume of transactions. Now, we can try to send the feedback We have suffered this kind of feedback silence many times. Thus I don't want to rely on luck here. I had in mind exposing last_send itself or providing an interval-calculation function to the logic. > before we start waiting or maybe after every > wal_receiver_status_interval / 2 but I think that will lead to more > spurious feedback messages than we get the benefit from them. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Thu, 9 Feb 2023 13:26:19 +0530, Amit Kapila wrote in amit.kapila16> On Thu, Feb 9, 2023 at 12:17 AM Peter Smith wrote: > > I understand in theory, your code is more efficient, but in practice, > > I think the overhead of a single variable assignment every loop > > iteration (which is doing WaitLatch anyway) is of insignificant > > concern, whereas having one assignment is simpler than having two IMO. > > > > Yeah, that sounds better to me as well. FWIW, I'm on board with this. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Wed, 8 Feb 2023 09:03:03 +, "Hayato Kuroda (Fujitsu)" wrote in > Thank you for reviewing! PSA new version. + if (statusinterval_ms > 0 && diffms > statusinterval_ms) The next expected feedback time is measured from the last status report. Thus, it seems to me this may suppress feedback from being sent for an unexpectedly long time, especially when min_apply_delay is shorter than wal_r_s_interval. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Is psSocketPoll doing the right thing?
Hello. While looking at a patch, I found that pqSocketPoll passes the result from poll(2) through to the caller and throws away revents. If I understand it correctly, poll() *doesn't* return -1 or set errno just because it has set POLLERR, POLLHUP, POLLNVAL, or POLLRDHUP for some of the target sockets; it returns a non-negative value unless poll() itself failed to work. This doesn't seem to be the intended behavior, since the function sets POLLERR in pollfd.events (although that bit is ignored by poll() anyway). Is the above diagnosis correct? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [Proposal] Add foreign-server health checks infrastructure
At Fri, 27 Jan 2023 06:57:01 +, "Hayato Kuroda (Fujitsu)" wrote in > I found cfbot failure, PSA fixed version. + Unlike , this function checks socket + health. This check is performed by polling the socket. This function is + currently available only on systems that support the non-standard + POLLRDHUP extension to the poll system I find it quite confusing that we have pqSocketCheck and PQconnCheck, which do almost the same thing. Since pqSocketCheck is a static function, we can modify the function as we like. I still don't understand why we need pqconnCheck_internal separate from pqSocketPoll(), and PQconnCheck from pqSocketCheck. https://www.postgresql.org/message-id/flat/TYAPR01MB58665BF23D38EDF10028DE2AF5299%40TYAPR01MB5866.jpnprd01.prod.outlook.com#47d21431bf9fa94f763c824f6e81fa54 > IIUC, pqSocketCheck () calls pqSocketPoll(), > and in the pqSocketPoll() we poll()'d the POLLIN or POLLOUT event. > But according to [1], we must wait POLLRDHUP event, > so we cannot reuse it straightforward. Yeah, I didn't suggest using the function as-is. Couldn't we extend the function by letting it accept end_time == 0 && !forRead && !forWrite, without causing side effects? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
At Tue, 7 Feb 2023 22:38:14 -0800, Andres Freund wrote in > Hi, > > I did another read through the series. I do have some minor changes, but > they're minor. I think this is ready for commit. I plan to start pushing > tomorrow. > > The changes I made are: > - the tablespace test changes didn't quite work in isolation / needed a bit of > polishing > - moved the tablespace changes to later in the series > - split the tests out of the commit adding the view into its own commit > - minor code formatting things (e.g. didn't like nested for()s without {}) > On 2023-01-25 16:56:17 +0900, Kyotaro Horiguchi wrote: > > At Tue, 24 Jan 2023 14:35:12 -0800, Andres Freund > > wrote in > > > > + write_chunk_s(fpout, ); > > > > + if (!read_chunk_s(fpin, >io.stats)) > > > > > > > > The names of the functions hardly make sense alone to me. How about > > > > write_struct()/read_struct()? (I personally prefer to use > > > > write_chunk() directly..) > > > > > > That's not related to this patch - there's several existing callers for > > > it. And write_struct wouldn't be better imo, because it's not just for > > > structs. > > > > Hmm. Then what the "_s" stands for? > > Size. It's a macro that just forwards to read_chunk()/write_chunk(). I know what the macros do. But, I'm fine with the names as they are there since before this patch. Sorry for the noise. > > > > > +Number of read operations in units of > > > > > op_bytes. > > > > > > > > I may be the only one who see the name as umbiguous between "total > > > > number of handled bytes" and "bytes hadled at an operation". Can't it > > > > be op_blocksize or just block_size? > > > > > > > > + b.io_object, > > > > + b.io_context, > > > > > > No, block wouldn't be helpful - we'd like to use this for something that > > > isn't > > > uniform blocks. > > > > What does the field show in that case? The mean of operation size? Or > > one row per opration size? If the former, the name looks somewhat > > wrong. 
> > If the latter, block_size seems making sense. > > 1, so that it's clear that the rest are in bytes. Thanks. Okay, I guess the documentation will be changed as necessary. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Exit walsender before confirming remote flush in logical replication
I agree with the direction, and thanks for the patch. At Tue, 7 Feb 2023 17:08:54 +, "Hayato Kuroda (Fujitsu)" wrote in > > I noticed that previous ones are rejected by cfbot, even if they passed on > > my > > environment... > > PSA fixed version. > > While analyzing more, I found the further bug that forgets initialization. > PSA new version that could be passed automated tests on my github repository. > Sorry for noise. 0002: This patch doesn't seem to offer a means to change the default walsender behavior. We need a subscription option named like "walsender_exit_mode" to do that. +ConsumeWalsenderOptions(List *options, WalSndData *data) I wonder if it is the right design to put options for different things into a single list. I would rather embed the walsender option in the syntax than add this function. K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline opt_shutdown_mode K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR opt_shutdown_mode plugin_options where opt_shutdown_mode would be like "SHUTDOWN_MODE immediate". == If we go with the current design, I think it is better to share the option list rule between the logical and physical START_REPLICATION commands. I'm not sure I like the option syntax "exit_before_confirming=". I imagine that other options may come in the future. Thus, how about "walsender_shutdown_mode=", where the mode is one of "wait_flush" (default) and "immediate"? +typedef struct +{ + bool exit_before_confirming; +} WalSndData; "Data" doesn't seem to describe the variable. Why not WalSndOptions?
- !equal(newsub->publications, MySubscription->publications)) + !equal(newsub->publications, MySubscription->publications) || + (newsub->minapplydelay > 0 && MySubscription->minapplydelay == 0) || + (newsub->minapplydelay == 0 && MySubscription->minapplydelay > 0)) I slightly prefer the following expression (Others may disagree:p): ((newsub->minapplydelay == 0) != (MySubscription->minapplydelay == 0)) And I think we need a comment for the term. For example, /* minapplydelay affects START_REPLICATION option exit_before_confirming */ + * Reads all entrly of the list and consume if needed. s/entrly/entries/ ? ... regards. -- Kyotaro Horiguchi NTT Open Source Software Center
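The two spellings of the minapplydelay condition compared above are equivalent for non-negative delay values; a standalone sketch with hypothetical names makes that easy to check:

```c
#include <stdbool.h>
#include <stdint.h>

/* The spelled-out condition from the patch hunk quoted above:
 * delay transitions from enabled to disabled or vice versa. */
bool
delay_flip_long(int32_t new_delay, int32_t old_delay)
{
    return (new_delay > 0 && old_delay == 0) ||
           (new_delay == 0 && old_delay > 0);
}

/* The shorter expression suggested in the review comment. */
bool
delay_flip_short(int32_t new_delay, int32_t old_delay)
{
    return (new_delay == 0) != (old_delay == 0);
}
```

Both detect exactly a change between "delay disabled" (zero) and "delay enabled" (positive), which is what determines whether the START_REPLICATION exit behavior needs to change.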
Re: Make mesage at end-of-recovery less scary.
Thanks! At Fri, 3 Feb 2023 15:16:02 +0100, Alvaro Herrera wrote in > So this patch is now failing because it applies new tests to > 011_crash_recovery.pl, which was removed recently. Can you please move > them elsewhere? I don't find an appropriate file to move to. In the end I created a new file with the name 034_recovery.pl. I added a test for standbys, too. (which is the first objective of this patch.) > I think the comment for ValidXLogRecordLength should explain what the > return value is. Agreed. /* * Validate record length of an XLOG record header. * * This is substantially a part of ValidXLogRecordHeader. But XLogReadRecord * needs this separate from the function in case of a partial record header. + * + * Returns true if the xl_tot_len header field has a seemingly valid value, + * which means the caller can proceed reading to the following part of the + * record. */ static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, I added a similar description to ValidXLogRecordHeader. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From c58ca4d5db52c75dec9882d158d5724e12617005 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 30 Nov 2022 11:51:46 +0900 Subject: [PATCH v24] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. To make sure that the detection is correct, this patch checks if all trailing bytes in the same page are zeroed in that case. 
--- src/backend/access/transam/xlogreader.c | 144 +- src/backend/access/transam/xlogrecovery.c | 94 ++ src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 +- src/include/access/xlogreader.h | 1 + src/test/recovery/t/034_recovery.pl | 135 6 files changed, 335 insertions(+), 59 deletions(-) create mode 100644 src/test/recovery/t/034_recovery.pl diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index aa6c929477..8cb2d55333 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -48,6 +48,8 @@ static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen); static void XLogReaderInvalReadState(XLogReaderState *state); static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking); +static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record); static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess); static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record, @@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking) /* reset error state */ state->errormsg_buf[0] = '\0'; decoded = NULL; + state->EndOfWAL = false; state->abortedRecPtr = InvalidXLogRecPtr; state->missingContrecPtr = InvalidXLogRecPtr; @@ -640,25 +644,21 @@ restart: Assert(pageHeaderSize <= readOff); /* - * Read the record length. + * Validate the record header. * - * NB: Even though we use an XLogRecord pointer here, the whole record - * header might not fit on this page. 
xl_tot_len is the first field of the - * struct, so it must be on this page (the records are MAXALIGNed), but we - * cannot access any other fields until we've verified that we got the - * whole header. - */ - record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); - total_len = record->xl_tot_len; - - /* - * If the whole record header is on this page, validate it immediately. - * Otherwise do just a basic sanity check on xl_tot_len, and validate the - * rest of the header after reading it from the next page. The xl_tot_len + * Even though we use an XLogRecord pointer here, the whole record header + * might not fit on this page. If the whole record header is on this page, + * validate it immediately. Even otherwise xl_tot_len must be on this page + * (it is the first field of MAXALIGNed records), but we still cannot + * access any further fields until we've verified that we got the whole + * header, so do just a basic sanity check on record length, and validate + * the rest of the header after reading it from the next page. The length * check
Re: Where is the logig to create a table file?
At Fri, 3 Feb 2023 13:44:46 +0400, Pavel Borisov wrote in > Hi, Jack > > On Fri, 3 Feb 2023 at 13:19, jack...@gmail.com wrote: > > > > When I use 'create table t(a int);'; suppose that this table t's oid is > > 1200, > > then postgres will create a file named 1200 in the $PGDATA/base, So where > > is the logic code in the internal? > > > heapam_relation_set_new_filenode()->RelationCreateStorage() Or if you are searching for the logic to determine the file name, see GetNewRelFileNumber(). regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: In-placre persistance change of a relation
Thank you for the comment! At Fri, 3 Feb 2023 08:42:52 +0100, Heikki Linnakangas wrote in > I want to call out this part of this patch: > > > Also this allows for the cleanup of files left behind in the crash of > > the transaction that created it. > > This is interesting to a lot wider audience than ALTER TABLE SET > LOGGED/UNLOGGED. It also adds most of the complexity, with the new > marker files. Can you please split the first patch into two: > > 1. Cleanup of newly created relations on crash > > 2. ALTER TABLE SET LOGGED/UNLOGGED changes > > Then we can review the first part independently. Ah, indeed. I'll do that. > Regarding the first part, I'm not sure the marker files are the best > approach to implement it. You need to create an extra file for every > relation, just to delete it at commit. It feels a bit silly, but maybe Agreed. (But I didn't come up with a better idea.) > it's OK in practice. The undo log patch set solved this problem with > the undo log, but it looks like that patch set isn't going > anywhere. Maybe invent a very lightweight version of the undo log for > this? I hadn't thought along that line. Yes, indeed, the marker files are a kind of undo log. Anyway, I'll split the current patch into two parts as suggested. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
Thanks! At Mon, 6 Feb 2023 13:10:01 +, "Takamichi Osumi (Fujitsu)" wrote in > The attached patch v29 has included your changes. catalogs.sgml + + The minimum delay (ms) for applying changes. + I think we don't use unit symbols that way. Namely, I think we would write it as "The minimum delay for applying changes in milliseconds" alter_subscription.sgml are slot_name, synchronous_commit, binary, streaming, - disable_on_error, and - origin. + disable_on_error, + origin, and + min_apply_delay. By the way, is there any rule for the order among the words? They don't seem to be in alphabetical order, nor in the same order as on the create_subscription page. (It seems to be the order of the SUBOPT_* symbols, but I'm not sure that's a good idea.) subscriptioncmds.c + if (opts.streaming == LOGICALREP_STREAM_PARALLEL && + !IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay > 0) .. + if (opts.min_apply_delay > 0 && + !IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream == LOGICALREP_STREAM_PARALLEL) Shouldn't we wrap the lines? worker.c + if (wal_receiver_status_interval > 0 && + diffms > wal_receiver_status_interval * 1000L) + { + WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + wal_receiver_status_interval * 1000L, + WAIT_EVENT_RECOVERY_APPLY_DELAY); + send_feedback(last_received, true, false, true); + } + else + WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + diffms, + WAIT_EVENT_RECOVERY_APPLY_DELAY); send_feedback always handles the case where wal_receiver_status_interval == 0. Thus we can simply wait for min(wal_receiver_status_interval, diffms) and then call send_feedback() unconditionally. -start_apply(XLogRecPtr origin_startpos) +start_apply(void) -LogicalRepApplyLoop(XLogRecPtr last_received) +LogicalRepApplyLoop(void) Does this patch require this change? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
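The suggested simplification amounts to computing a single wait time instead of branching into two WaitLatch call sites. A minimal sketch, with a hypothetical helper name (the real code would pass the result to WaitLatch and then call send_feedback unconditionally):

```c
#include <stdint.h>

/*
 * Sketch: wait for the smaller of the status interval and the
 * remaining apply delay.  A status interval of 0 disables status
 * messages, so only the remaining delay bounds the wait in that case.
 */
int64_t
feedback_wait_ms(int64_t status_interval_ms, int64_t diffms)
{
    if (status_interval_ms > 0 && status_interval_ms < diffms)
        return status_interval_ms;
    return diffms;
}
```

With this shape, the decision of whether anything actually needs to be sent stays inside send_feedback(), which already handles the interval == 0 case.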
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Tue, 7 Feb 2023 09:10:01 +0530, Amit Kapila wrote in > On Tue, Feb 7, 2023 at 6:03 AM Peter Smith wrote: > > 5b. > > Since there are no translator considerations here why not write the > > second error like: > > > > errmsg("%d ms is outside the valid range for parameter > > \"min_apply_delay\" (%d .. %d)", > > result, 0, PG_INT32_MAX)) > > > > I see that existing usage in the code matches what the patch had > before this comment. See below and similar usages in the code. > if (start <= 0) > ereport(ERROR, > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > errmsg("invalid value for parameter \"%s\": %d", > "start", start))); The same errmsg text occurs many times in the tree. On the other hand, the message Peter pointed out occurs only once. I suppose Peter considered this aspect. # "%d%s%s is outside the valid range for parameter \"%s\" (%d .. %d)" # also appears just once As for me, it seems a good practice to do that regardless of the number of duplicates, to (semi)mechanically avoid duplicates. (That said, I believe I myself would write it as Peter suggests for a first cut :p) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Exit walsender before confirming remote flush in logical replication
At Wed, 1 Feb 2023 14:58:14 +0530, Amit Kapila wrote in > On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada wrote: > > > > Otherwise, we will end up terminating > > the WAL stream without the done message. Which will lead to an error > > message "ERROR: could not receive data from WAL stream: server closed > > the connection unexpectedly" on the subscriber even at a clean > > shutdown. > > > > But will that be a problem? As per docs of shutdown [1] ( “Smart” mode > disallows new connections, then waits for all existing clients to > disconnect. If the server is in hot standby, recovery and streaming > replication will be terminated once all clients have disconnected.), > there is no such guarantee. I see that it is required for the > switchover in physical replication to ensure that all the WAL is sent > and replicated but we don't need that for logical replication. +1. Since the publisher is not aware of the apply delay (with this patch), the publisher can in fact be gone before sending EOS in that case. The error message correctly describes that situation. > > In a case where pq_is_send_pending() doesn't become false > > for a long time, (e.g., the network socket buffer got full due to the > > apply worker waiting on a lock), I think users should unblock it by > > themselves. Or it might be practically better to shutdown the > > subscriber first in the logical replication case, unlike the physical > > replication case. > > > > Yeah, will users like such a dependency? And what will they gain by doing so? If PostgreSQL required that kind of special care about shutdown order to keep replication consistent in the face of trouble, that wouldn't be acceptable. In this respect, the current time-delayed logical replication can be seen as a kind of intentional, continuous, large network lag. And I think consistency is guaranteed even in such cases.
On the other hand, I don't think most people care about the exact progress when facing such trouble, as long as replication consistency is maintained. IMHO that is also true for the time-delayed logical replication case. > [1] - https://www.postgresql.org/docs/devel/app-pg-ctl.html regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Wed, 1 Feb 2023 08:38:11 +0530, Amit Kapila wrote in > On Wed, Feb 1, 2023 at 8:13 AM Kyotaro Horiguchi > wrote: > > > > At Tue, 31 Jan 2023 15:12:14 +0530, Amit Kapila > > wrote in > > > So, shall we check if the result of parse_int is in the range 0 and > > > PG_INT32_MAX to ameliorate this concern? > > > > Yeah, it is exactly what I wanted to suggest. > > > > > If this works then we need to > > > probably change the return value of defGetMinApplyDelay() to int32. > > > > I didn't thought doing that, int can store all values in the valid > > range (I'm assuming we implicitly assume int >= int32 in bit width) > > and it is the natural integer in C. Either will do for me but I > > slightly prefer to use int there. > > > > I think it would be clear to use int32 because the parameter where we > store the return value is also int32. I'm fine with that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Tue, 31 Jan 2023 15:12:14 +0530, Amit Kapila wrote in > On Tue, Jan 31, 2023 at 1:40 PM Kyotaro Horiguchi > wrote: > > > > Hi, Kuroda-san, Thanks for the detailed study. > > > > At Tue, 31 Jan 2023 07:06:40 +, "Hayato Kuroda (Fujitsu)" > > wrote in > > > Therefore, I think we can say that modern platforms that are supported by > > > PostgreSQL define int as 32-bit. > > > It satisfies the condition sizeof(int) <= sizeof(int32), so we can keep > > > to use INT_MAX. > > > > Yeah, I know that that's practically correct. Just I wanted to make > > clear is whether we (always) assume int == int32. I don't want to do > > that just because that works. Even though we cannot be perfect, in > > this particular case the destination space is explicitly made as > > int32. > > > > So, shall we check if the result of parse_int is in the range 0 and > PG_INT32_MAX to ameliorate this concern? Yeah, that is exactly what I wanted to suggest. > If this works then we need to > probably change the return value of defGetMinApplyDelay() to int32. I hadn't thought of doing that; int can store all values in the valid range (I'm assuming we implicitly take int to be at least as wide as int32), and it is the natural integer type in C. Either will do for me, but I slightly prefer to use int there. As a result, I'd like to propose the following change. diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index 489eae85ee..9de2745623 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -2293,16 +2293,16 @@ defGetMinApplyDelay(DefElem *def) hintmsg ? errhint("%s", _(hintmsg)) : 0)); /* -* Check lower bound. parse_int() has already been confirmed that result -* is less than or equal to INT_MAX. +* Check both boundaries. Although parse_int() checked the result against +* INT_MAX, this value is to be stored in a catalog column of int32.
*/ - if (result < 0) + if (result < 0 || result > PG_INT32_MAX) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("%d ms is outside the valid range for parameter \"%s\" (%d .. %d)", result, "min_apply_delay", - 0, INT_MAX))); + 0, PG_INT32_MAX))); return result; } regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
Hi, Kuroda-san, Thanks for the detailed study. At Tue, 31 Jan 2023 07:06:40 +, "Hayato Kuroda (Fujitsu)" wrote in > Therefore, I think we can say that modern platforms that are supported by > PostgreSQL define int as 32-bit. > It satisfies the condition sizeof(int) <= sizeof(int32), so we can keep to > use INT_MAX. Yeah, I know that's practically correct. What I wanted to make clear is whether we (always) assume int == int32. I don't want to do that just because it happens to work. Even though we cannot be perfect, in this particular case the destination space is explicitly declared as int32. It's a similar discussion to the recent commit 3b4ac33254: we chose to use the "correct" symbols, refusing to rely on an implicit assumption about the actual values. (In that sense, taking "int32 is no wider than int" as a premise is itself a compromise, but the code would get uselessly complex without that assumption :p) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Mon, 30 Jan 2023 14:24:31 +0530, Amit Kapila wrote in > On Mon, Jan 30, 2023 at 12:38 PM Kyotaro Horiguchi > wrote: > > > > At Mon, 30 Jan 2023 11:56:33 +0530, Amit Kapila > > wrote in > > > #define TimestampTzPlusMilliseconds(tz,ms) ((tz) + ((ms) * (int64) 1000)) > > > > Which can lead to overflow, which is practically harmless. > > > > But here tz is always TimestampTz (which is int64), so do, we need to worry? Sorry, I was working on the assumption that int was int64 here. > > > the function defGetMinApplyDelay() sufficient to ensure that the > > > 'delay' value stored in the catalog will always be lesser than > > > INT_MAX? > > > > I'm concerned about cases where INT_MAX is wider than int32. If we > > don't assume such cases, I'm fine with INT_MAX there. > > > > I am not aware of such cases. Anyway, if any such case is discovered > then we need to change the checks in defGetMinApplyDelay(), right? If > so, then I think it is better to keep it as it is unless we know that > this could be an issue on some platform. I'm not sure. I think int is generally understood as not being tied to any particular size. min_apply_delay is tightly bound to a catalog column of int32, thus I thought (PG_)INT32_MAX is the right limit. So, as I expressed before, if we assume sizeof(int) <= sizeof(int32), I'm fine with using INT_MAX there. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Mon, 30 Jan 2023 11:56:33 +0530, Amit Kapila wrote in > On Mon, Jan 30, 2023 at 9:43 AM Kyotaro Horiguchi > wrote: > > > > At Mon, 30 Jan 2023 08:51:05 +0530, Amit Kapila > > wrote in > > > On Mon, Jan 30, 2023 at 8:32 AM Kyotaro Horiguchi > > > wrote: > > > > > > > > At Sat, 28 Jan 2023 04:28:29 +, "Takamichi Osumi (Fujitsu)" > > > > wrote in > > > > > On Friday, January 27, 2023 8:00 PM Amit Kapila > > > > > wrote: > > > > > > So, you have changed min_apply_delay from int64 to int32, but you > > > > > > haven't > > > > > > mentioned the reason for the same? We use 'int' for the similar > > > > > > parameter > > > > > > recovery_min_apply_delay, so, ideally, it makes sense but still > > > > > > better to tell your > > > > > > reason explicitly. > > > > > Yes. It's because I thought I need to make this feature consistent > > > > > with the recovery_min_apply_delay. > > > > > This feature handles the range same as the recovery_min_apply delay > > > > > from 0 to INT_MAX now > > > > > so should be adjusted to match it. > > > > > > > > INT_MAX can stick out of int32 on some platforms. (I'm not sure where > > > > that actually happens, though.) We can use PG_INT32_MAX instead. > > > > > > > > > > But in other integer GUCs including recovery_min_apply_delay, we use > > > INT_MAX, so not sure if it is a good idea to do something different > > > here. > > > > The GUC is not stored in a catalog, but.. oh... it is multiplied by > > 1000. > > Which part of the patch you are referring to here? Isn't the check in Where recovery_min_apply_delay is used. It is allowed to be set up to INT_MAX but it is used as: > delayUntil = TimestampTzPlusMilliseconds(xtime, > recovery_min_apply_delay); Where the macro is defined as: > #define TimestampTzPlusMilliseconds(tz,ms) ((tz) + ((ms) * (int64) 1000)) Which can lead to overflow, which is practically harmless. 
> the function defGetMinApplyDelay() sufficient to ensure that the > 'delay' value stored in the catalog will always be lesser than > INT_MAX? I'm concerned about cases where INT_MAX is wider than int32. If we don't assume such cases, I'm fine with INT_MAX there. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Mon, 30 Jan 2023 08:51:05 +0530, Amit Kapila wrote in > On Mon, Jan 30, 2023 at 8:32 AM Kyotaro Horiguchi > wrote: > > > > At Sat, 28 Jan 2023 04:28:29 +, "Takamichi Osumi (Fujitsu)" > > wrote in > > > On Friday, January 27, 2023 8:00 PM Amit Kapila > > > wrote: > > > > So, you have changed min_apply_delay from int64 to int32, but you > > > > haven't > > > > mentioned the reason for the same? We use 'int' for the similar > > > > parameter > > > > recovery_min_apply_delay, so, ideally, it makes sense but still better > > > > to tell your > > > > reason explicitly. > > > Yes. It's because I thought I need to make this feature consistent with > > > the recovery_min_apply_delay. > > > This feature handles the range same as the recovery_min_apply delay from > > > 0 to INT_MAX now > > > so should be adjusted to match it. > > > > INT_MAX can stick out of int32 on some platforms. (I'm not sure where > > that actually happens, though.) We can use PG_INT32_MAX instead. > > > > But in other integer GUCs including recovery_min_apply_delay, we use > INT_MAX, so not sure if it is a good idea to do something different > here. The GUC is not stored in a catalog, but.. oh... it is multiplied by 1000. So if it is larger than (INT_MAX / 1000), it overflows... If we officially accept that behavior (which I don't think is great, even if only for impractical values), I don't object further. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Sat, 28 Jan 2023 04:28:29 +, "Takamichi Osumi (Fujitsu)" wrote in > On Friday, January 27, 2023 8:00 PM Amit Kapila > wrote: > > So, you have changed min_apply_delay from int64 to int32, but you haven't > > mentioned the reason for the same? We use 'int' for the similar parameter > > recovery_min_apply_delay, so, ideally, it makes sense but still better to > > tell your > > reason explicitly. > Yes. It's because I thought I need to make this feature consistent with the > recovery_min_apply_delay. > This feature handles the range same as the recovery_min_apply delay from 0 to > INT_MAX now > so should be adjusted to match it. INT_MAX can stick out of int32 on some platforms. (I'm not sure where that actually happens, though.) We can use PG_INT32_MAX instead. IMHO, we don't use int as a catalog column type, and I agree that int32 is sufficient since I don't think a delay of more than 49 days is practical. On the other hand, maybe I wouldn't want to use int32 for intermediate calculations. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Wed, 25 Jan 2023 12:30:19 +0530, Amit Kapila wrote in > On Wed, Jan 25, 2023 at 11:57 AM Kyotaro Horiguchi > wrote: > > > > At Tue, 24 Jan 2023 12:19:04 +, "Takamichi Osumi (Fujitsu)" > > wrote in > > > Attached the patch v20 that has incorporated all comments so far. > > > ... > > > > > > + in which case no additional wait is necessary. If the system > > clocks > > + on publisher and subscriber are not synchronized, this may lead > > to > > + apply changes earlier than expected, but this is not a major > > issue > > + because this parameter is typically much larger than the time > > + deviations between servers. Note that if this parameter is set > > to a > > > > This doesn't seem to fit our documentation. It is not our business > > whether a certain amount of deviation is critical or not. How about > > something like the following? > > > > But we have a similar description for 'recovery_min_apply_delay' [1]. > See "...If the system clocks on primary and standby are not > synchronized, this may lead to recovery applying records earlier than > expected; but that is not a major issue because useful settings of > this parameter are much larger than typical time deviations between > servers." Mmmm. I thought that we might be able to gather the descriptions (including other common descriptions, if any), but I didn't find an appropriate place.. Okay. I agree with the current description. Thanks for the kind explanation. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
At Tue, 24 Jan 2023 14:35:12 -0800, Andres Freund wrote in > > 0002: > > > > + Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType)); > > > > This is relatively complex checking. We already assert-out increments > > of invalid counters. Thus this is checking if some unrelated code > > clobbered them, which we do only when consistency is critical. Is > > there any need to do that here? I saw another occurrence of the same > > assertion. > > I found it useful to find problems. Okay. > > + no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || > > + bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || > > + bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP; > > > > I'm not sure I like omitting the parentheses for such a long Boolean > > expression on the right side. > > What parens would help? I thought about the following. no_temp_rel = (bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP); > > + write_chunk_s(fpout, ); > > + if (!read_chunk_s(fpin, >io.stats)) > > > > The names of the functions hardly make sense to me on their own. How about > > write_struct()/read_struct()? (I personally prefer to use > > write_chunk() directly..) > > That's not related to this patch - there's several existing callers for > it. And write_struct wouldn't be better imo, because it's not just for > structs. Hmm. Then what does the "_s" stand for? > > + PgStat_BktypeIO > > > > This patch abbreviates "backend" as "bk" but "be" is used in many > > places. I think that naming should follow the predecessors. > > The precedence aren't consistent unfortunately :) Uuuum. Okay, I just like "be" there! Anyway, I don't push that strongly. > > > +Number of read operations in units of > > > op_bytes. > > > > I may be the only one who sees the name as ambiguous between "total > > number of handled bytes" and "bytes handled per operation". 
Can't it > > be op_blocksize or just block_size? > > > > + b.io_object, > > + b.io_context, > > No, block wouldn't be helpful - we'd like to use this for something that isn't > uniform blocks. What does the field show in that case? The mean operation size? Or one row per operation size? If the former, the name looks somewhat wrong. If the latter, block_size seems to make sense. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: wake up logical workers after ALTER SUBSCRIPTION
At Tue, 24 Jan 2023 10:42:17 -0800, Nathan Bossart wrote in > On Tue, Jan 24, 2023 at 01:13:55PM -0500, Tom Lane wrote: > > Either that comment needs to be rewritten or we need to invent some > > more macros. > > Here is a first attempt at a patch. I scanned through all the existing > uses of InvalidDsaPointer and DSM_HANDLE_INVALID and didn't notice anything > else that needed adjusting. There seem to be two cases of DSA_HANDLE_INVALID in dsa_get_handle and dsa_attach_in_place, one of which is in an Assert(), though. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Tue, 24 Jan 2023 12:19:04 +, "Takamichi Osumi (Fujitsu)" wrote in > Attached the patch v20 that has incorporated all comments so far. Thanks! I looked through the documentation part. + + subminapplydelay int8 + + + Total time spent delaying the application of changes, in milliseconds. + I was confused because it reads as if this column shows the summarized actual waiting time caused by min_apply_delay. IIUC actually it shows the min_apply_delay setting for the subscription. Thus shouldn't it be something like this? "The minimum amount of time to delay applying changes, in milliseconds" And it might be better to mention the corresponding subscription parameter. + error. If wal_receiver_status_interval is set to + zero, the apply worker doesn't send any feedback messages during the + min_apply_delay period. It took me a bit longer to understand what this sentence means. I'd like to suggest something like the following. "Since no status-update messages are sent while delaying, note that wal_receiver_status_interval is the only source of keepalive messages during that period." + + A logical replication subscription can delay the application of changes by + specifying the min_apply_delay subscription parameter. + See for details. + I'm not sure "logical replication subscription" is a common term. Doesn't just "subscription" mean the same, especially in that context? (Note that 31.2 starts with "A subscription is the downstream.."). + Any delay occurs only on WAL records for transaction begins after all + initial table synchronization has finished. The delay is calculated There are no "transaction begin" WAL records. Maybe it is "logical replication transaction begin message". The timestamp is of "commit time". (I took "transaction begins" as a noun, but that might be wrong..) +may reduce the actual wait time. It is also possible that the overhead +already exceeds the requested min_apply_delay value, +in which case no additional wait is necessary. 
If the system clocks I'm not sure it is right to say "necessary" here. IMHO it might better be "in which case no delay is applied". + in which case no additional wait is necessary. If the system clocks + on publisher and subscriber are not synchronized, this may lead to + apply changes earlier than expected, but this is not a major issue + because this parameter is typically much larger than the time + deviations between servers. Note that if this parameter is set to a This doesn't seem to fit our documentation. It is not our business whether a certain amount of deviation is critical or not. How about something like the following? "Note that the delay is measured between the timestamp assigned by the publisher and the system clock on the subscriber. You need to manage the system clocks to be in sync so that the delay works properly." +Delaying the replication can mean there is a much longer time +between making a change on the publisher, and that change being +committed on the subscriber. This can impact the performance of +synchronous replication. See +parameter. Do we need the "can" in "Delaying the replication can mean"? If we want to say that, it might be "Delaying the replication means there can be a much longer..."? + + Create a subscription to a remote server that replicates tables in + the mypub publication and starts replicating immediately + on commit. Pre-existing data is not copied. The application of changes is + delayed by 4 hours. + +CREATE SUBSCRIPTION mysub + CONNECTION 'host=192.0.2.4 port=5432 user=foo dbname=foodb' +PUBLICATION mypub + WITH (copy_data = false, min_apply_delay = '4h'); + I'm not sure we need this additional example. We already have two examples, one of which differs from the above only in the actual values for the PUBLICATION and WITH clauses. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
Sorry for bothering you with this. At Tue, 24 Jan 2023 10:12:40 +, "Hayato Kuroda (Fujitsu)" wrote in > > > Couldn't we maintain an additional static variable "last_applied" > > > along with last_received? > > > > > > > It won't be easy to maintain the meaning of last_applied because there > > are cases where we don't apply the change directly. For example, in > > case of streaming xacts, we will just keep writing it to the file, > > now, say, due to some reason, we have to send the feedback, then it > > will not allow you to update the latest write locations. This would > > then become different then what we are doing without the patch. > > Another point to think about is that we also need to keep the variable > > updated for keep-alive ('k') messages even though we don't apply > > anything in that case. Still, other cases to consider are where we > > have mix of streaming and non-streaming transactions. > > I have tried to implement that, but it might be difficult because of a corner > case related with the initial data sync. > > First of all, I have made last_applied to update when > > * transactions are committed, prepared, or aborted > * apply worker receives keepalive message. Yeah, I vaguely thought that it would be enough for the update to happen just before the existing send_feedback() calls. But it turned out to introduce another inconsistency.. > I thought during the initial data sync, we must not update the last applied > triggered by keepalive messages, so following lines were added just after > updating last_received. > > ``` > + if (last_applied < end_lsn && > AllTablesyncsReady()) > + last_applied = > end_lsn; > ``` Maybe the name "last_applied" confused you. As I mentioned in another message, the variable points to the remote LSN of the last "processed" 'w'/'k' message. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Tue, 24 Jan 2023 14:22:19 +0530, Amit Kapila wrote in > On Tue, Jan 24, 2023 at 12:44 PM Peter Smith wrote: > > > > On Tue, Jan 24, 2023 at 5:58 PM Amit Kapila wrote: > > > > > > On Tue, Jan 24, 2023 at 8:15 AM Kyotaro Horiguchi > > > wrote: > > > > > > > > > Attached the updated patch v19. > > > > > > > > + maybe_delay_apply(TransactionId xid, TimestampTz finish_ts) > > > > > > > > I look this spelling strange. How about maybe_apply_delay()? > > > > > > > > > > +1. > > > > It depends on how you read it. I read it like this: > > > > maybe_delay_apply === means "maybe delay [the] apply" > > (which is exactly what the function does) > > > > versus > > > > maybe_apply_delay === means "maybe [the] apply [needs a] delay" > > (which is also correct, but it seemed a more awkward way to say it IMO) > > > > This matches more with GUC and all other usages of variables in the > patch. So, I still prefer the second one. I read it as "maybe apply [the] delay [to something suggested by the context]". If we go the first way, I will name it as "maybe_delay_apply_change" or something that has an extra word. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
In short, I'd like to propose renaming the parameter in_delayed_apply of send_feedback to "has_unprocessed_change". At Tue, 24 Jan 2023 12:27:58 +0530, Amit Kapila wrote in > > send_feedback(): > > +* If the subscriber side apply is delayed (because of time-delayed > > +* replication) then do not tell the publisher that the received > > latest > > +* LSN is already applied and flushed, otherwise, it leads to the > > +* publisher side making a wrong assumption of logical replication > > +* progress. Instead, we just send a feedback message to avoid a > > publisher > > +* timeout during the delay. > > */ > > - if (!have_pending_txes) > > + if (!have_pending_txes && !in_delayed_apply) > > flushpos = writepos = recvpos; > > > > Honestly I don't like this wart. The reason for this is the function > > assumes recvpos = applypos but we actually call it while holding > > unapplied changes, that is, applypos < recvpos. > > > > Couldn't we maintain an additional static variable "last_applied" > > along with last_received? > > > > It won't be easy to maintain the meaning of last_applied because there > are cases where we don't apply the change directly. For example, in > case of streaming xacts, we will just keep writing it to the file, > now, say, due to some reason, we have to send the feedback, then it > will not allow you to update the latest write locations. This would > then become different then what we are doing without the patch. > Another point to think about is that we also need to keep the variable > updated for keep-alive ('k') messages even though we don't apply > anything in that case. Still, other cases to consider are where we > have mix of streaming and non-streaming transactions. Yeah. Even though I named it as "last_applied", its objective is to have get_flush_position returning the correct have_pending_txes without a hint from callers, that is, "let g_f_position know if store_flush_position has been called with the last received data". 
Anyway I tried that but didn't find a clean and simple way. However, while working on it, I realized what in the code had confused me. +static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply, + bool in_delayed_apply); The name "in_delayed_apply" doesn't give me an idea of what the function should do with it. If it is named "has_unprocessed_change", I think it makes sense that send_feedback should think there may be an outstanding transaction that is not known to the function. So, my conclusion here is I'd like to propose changing the parameter name to "has_unapplied_change". > > In this case the condition cited above > > would be as follows and in_delayed_apply will become unnecessary. > > > > + if (!have_pending_txes && last_received == last_applied) > > > > The function is a static function and always called with a variable > > last_received that has the same scope with the function, as the first Sorry for the noise, I misread it. Maybe I took the "function-scoped" variable as file-scoped.. Thus that discussion is invalid. > > parameter. Thus we can remove the first parameter and let the > > function directly look at both variables instead. > > > > I think this is true without this patch, so why that has not been > followed in the first place? One comment, I see in this regard is as > below: > > /* It's legal to not pass a recvpos */ > if (recvpos < last_recvpos) > recvpos = last_recvpos; Sorry. I don't understand this. It is just part of the ratchet mechanism for the last received LSN to report. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Tue, 24 Jan 2023 11:45:36 +0530, Amit Kapila wrote in > Personally, I would prefer the above LOGs to be in reverse order as it > doesn't make much sense to me to first say that we are skipping > changes and then say the transaction is delayed. What do you think? In the first place, I misunderstood maybe_start_skipping_changes(), which doesn't actually skip changes. So... sorry for the noise. For the record, I agree that the current order is right. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Tue, 24 Jan 2023 11:28:58 +0530, Amit Kapila wrote in > On Tue, Jan 24, 2023 at 6:17 AM Kyotaro Horiguchi > wrote: > > > > IMHO "foo > bar" is not an "option". I think we say "foo and bar are > > mutually exclusive options" but I don't think we say "foo = x and bar = y > > are.. options". I wrote a comment as "this should be more like > > human-speaking" and Euler seems to have the same feeling about another > > error message. > > > > Concretely I would spell this as "min_apply_delay cannot be enabled > > when parallel streaming mode is enabled" or something. > > > > We can change it but the current message seems to be in line with some > nearby messages like "slot_name = NONE and enabled = true are mutually > exclusive options". So, isn't it better to keep this as one in sync > with existing messages? Ooo. subscriptioncmds.c is full of such messages. Okay, I agree that it is better to leave it as is. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
At Tue, 24 Jan 2023 17:22:03 +0900 (JST), Kyotaro Horiguchi wrote in > +pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op) > +{ > + Assert(io_object < IOOBJECT_NUM_TYPES); > + Assert(io_context < IOCONTEXT_NUM_TYPES); > + Assert(io_op < IOOP_NUM_TYPES); > + Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, > io_op)); > > Is there any reason for not checking the value ranges at the > bottom-most functions? They can lead to out-of-bounds access so I To make sure, the "They" means "out-of-range io_object/context/op values".. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Hello. At Thu, 19 Jan 2023 21:15:34 -0500, Melanie Plageman wrote in > Oh dear-- an extra FlushBuffer() snuck in there somehow. > Removed it in attached v51. > Also, I fixed an issue in my tablespace.sql updates I only looked at 0002 and 0004. (Sorry for the random order of the comments..) 0002: + Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType)); This is relatively complex checking. We already assert-out increments of invalid counters. Thus this is checking if some unrelated code clobbered them, which we do only when consistency is critical. Is there any need to do that here? I saw another occurrence of the same assertion. -/* Reset some shared cluster-wide counters */ +/* + * Reset some shared cluster-wide counters + * + * When adding a new reset target, ideally the name should match that in + * pgstat_kind_infos, if relevant. + */ I'm not sure the addition is useful.. +pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op) +{ + Assert(io_object < IOOBJECT_NUM_TYPES); + Assert(io_context < IOCONTEXT_NUM_TYPES); + Assert(io_op < IOOP_NUM_TYPES); + Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op)); Is there any reason for not checking the value ranges in the bottom-most functions? They can lead to out-of-bounds access, so I don't think we need to continue execution for such invalid values. + no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER || + bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER || + bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP; I'm not sure I like omitting the parentheses for such a long Boolean expression on the right side. + write_chunk_s(fpout, ); + if (!read_chunk_s(fpin, >io.stats)) The names of the functions hardly make sense to me on their own. How about write_struct()/read_struct()? (I personally prefer to use write_chunk() directly..) + PgStat_BktypeIO This patch abbreviates "backend" as "bk" but "be" is used in many places. 
I think that naming should follow the predecessors. 0004: system_views.sql: +FROM pg_stat_get_io() b; What does the "b" stand for? (Backend? Then "s" or "i" seems more straightforward.) + nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op); + + if (nulls[col_idx]) + continue; + + values[col_idx] = + Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]); This is a bit hard to read since it requires following the condition flow. The following is simpler and I think closer to our standard. if (pgstat_tracks_io_op()) values[col_idx] = Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]); else nulls[col_idx] = true; > +Number of read operations in units of op_bytes. I may be the only one who sees the name as ambiguous between "total number of handled bytes" and "bytes handled per operation". Can't it be op_blocksize or just block_size? + b.io_object, + b.io_context, It's unclear to me why only these two columns are prefixed with "io". Don't "object_type" and just "context" work instead? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
Sorry, I forgot to write one comment. At Tue, 24 Jan 2023 11:45:35 +0900 (JST), Kyotaro Horiguchi wrote in + /* Should we delay the current transaction? */ + if (finish_ts) + maybe_delay_apply(xid, finish_ts); + if (!am_parallel_apply_worker()) maybe_start_skipping_changes(lsn); It may not give actual advantages, but isn't it better that delay happens after skipping? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
> Attached the updated patch v19. + maybe_delay_apply(TransactionId xid, TimestampTz finish_ts) This spelling looks strange to me. How about maybe_apply_delay()? send_feedback(): +* If the subscriber side apply is delayed (because of time-delayed +* replication) then do not tell the publisher that the received latest +* LSN is already applied and flushed, otherwise, it leads to the +* publisher side making a wrong assumption of logical replication +* progress. Instead, we just send a feedback message to avoid a publisher +* timeout during the delay. */ - if (!have_pending_txes) + if (!have_pending_txes && !in_delayed_apply) flushpos = writepos = recvpos; Honestly I don't like this wart. The reason for this is that the function assumes recvpos = applypos, but we actually call it while holding unapplied changes, that is, applypos < recvpos. Couldn't we maintain an additional static variable "last_applied" along with last_received? In this case the condition cited above would be as follows and in_delayed_apply will become unnecessary. + if (!have_pending_txes && last_received == last_applied) The function is a static function and always called with a variable last_received, which has the same scope as the function, as the first parameter. Thus we can remove the first parameter and let the function directly look at both variables instead. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Fix GetWALAvailability function code comments for WALAVAIL_REMOVED return value
At Thu, 19 Jan 2023 18:43:52 -0500, Tom Lane wrote in > sirisha chamarthi writes: > > On Wed, Oct 19, 2022 at 7:59 PM Kyotaro Horiguchi > > wrote: > >> In short, the proposed fix alone seems fine to me. If we want to show > >> further details, I would add a bit as follows. > >> > >> | * * WALAVAIL_REMOVED means it has been removed. A replication stream on > >> | * a slot with this LSN cannot continue. Note that the affected > >> | * processes have been terminated by checkpointer, too. > > > Thanks for your comments! Attached the patch with your suggestions. > > Pushed with a bit of additional wordsmithing. I thought "have been" Thanks! > was a bit too strong of an assertion considering that this function > does not pay any attention to the actual state of any processes, > so I made it say "should have been". I think you're correct here. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Mon, 23 Jan 2023 17:36:13 +0530, Amit Kapila wrote in > On Sun, Jan 22, 2023 at 6:12 PM Takamichi Osumi (Fujitsu) > wrote: > > > > > > Attached the updated patch v19. > Few comments: > 2. > + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) && > + opts->min_apply_delay > 0 && opts->streaming == LOGICALREP_STREAM_PARALLEL) > + ereport(ERROR, > + errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("%s and %s are mutually exclusive options", > +"min_apply_delay > 0", "streaming = parallel")); > } > > I think here we should add a comment for the translator as we are > doing in some other nearby cases. IMHO "foo > bar" is not an "option". I think we say "foo and bar are mutually exclusive options" but I don't think we say "foo = x and bar = y are.. options". I wrote a comment as "this should be more like human-speaking" and Euler seems to have the same feeling about another error message. Concretely I would spell this as "min_apply_delay cannot be enabled when parallel streaming mode is enabled" or something. And the opposite-direction message nearby would be "parallel streaming mode cannot be enabled when min_apply_delay is enabled." regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Perform streaming logical transactions by background workers and parallel apply
At Tue, 10 Jan 2023 12:01:43 +0530, Amit Kapila wrote in > On Tue, Jan 10, 2023 at 11:16 AM Kyotaro Horiguchi > wrote: > > Although I don't see a technical difference between the two, all the > > other occurances including the just above (except test_shm_mq) use > > "could not". A faint memory in my non-durable memory tells me that we > > have a policy that we use "can/could not" than "unable". > > > > Right, it is mentioned in docs [1] (see section "Tricky Words to Avoid"). Thanks for confirmation. > Can you please start a new thread and post these changes as we are > proposing to change existing message as well? All right. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Lazy allocation of pages required for verifying FPI consistency
At Thu, 12 Jan 2023 15:02:25 +0530, Bharath Rupireddy wrote in > On the contrary, PGAlignedBlock is being used elsewhere in the code; I noticed it and had the same feeling, but thought that those uses don't justify doing the same in other places. > some of them are hot paths. verifyBackupPageConsistency() is not > something that gets called always i.e. WAL consistency checks are done > conditionally - when either one enables wal_consistency_checking for > the rmgr or the WAL record is flagged with > XLR_CHECK_CONSISTENCY (core doesn't do, it's an external module, if > any, do that). Right. So we could allocate them at the first use as below, but... > I really don't see much of a problem in allocating them statically and > pushing closer to where they're being used. If this really concerns, > at the least, the dynamic allocation needs to be pushed to > verifyBackupPageConsistency() IMO with if (first_time) { allocate two > blocks with palloc} and use them. This at least saves some memory on > the heap for most of the servers out there. Yeah, we could do that. But as I mentioned before, that happens only in the startup process, so it can be said that it's not worth bothering with. On the other hand, I don't think it's great to waste 16kB * max_backends of memory, especially when the waste is clearly recognized and easily avoidable. I guess that is more or less the reason for the code. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Exit walsender before confirming remote flush in logical replication
At Fri, 13 Jan 2023 16:41:08 +0530, Amit Kapila wrote in > Okay, but what happens in the case of physical replication when > synchronous_commit = remote_apply? In that case, won't it ensure that > apply has also happened? If so, then shouldn't the time delay feature > also cause a similar problem for physical replication as well? As written in another mail, WalSndDone doesn't honor synchronous_commit. In other words, AFAICS walsender finishes without waiting for remote_apply. The unapplied records will be applied at the next startup. I haven't confirmed that behavior myself, though.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Exit walsender before confirming remote flush in logical replication
At Wed, 28 Dec 2022 09:15:41 +, "Hayato Kuroda (Fujitsu)" wrote in > > Another thing we can investigate here why do we need to ensure that > > there is no pending send before shutdown. > > I have not done yet about it, will continue next year. > It seems that walsenders have been sending all data before shutting down > since ea5516, > e0b581 and 754baa. > There were many threads related with streaming replication, so I could not pin > the specific message that written in the commit message of ea5516. > > I have also checked some wiki pages [1][2], but I could not find any design > about it. > > [1]: https://wiki.postgresql.org/wiki/Streaming_Replication > [2]: https://wiki.postgresql.org/wiki/Synchronous_Replication_9/2010_Proposal If I'm grasping the discussion here correctly, from my memory, it is because physical replication needs all records that have been written on the primary to also be written on the standby for switchover to succeed. It is annoying that a normal shutdown occasionally leads to switchover failure. Thus WalSndDone explicitly waits for remote flush/write regardless of the setting of synchronous_commit. Thus apply delay doesn't affect shutdown (AFAICS), and that is sufficient since all the records will be applied at the next startup. In logical replication, apply precedes write and flush, so we have no indication of whether a record has been "replicated" to the standby other than the apply LSN. On the other hand, logical replication has no business with switchover, so that assurance is useless. Thus I think we can (practically) ignore apply_lsn at shutdown. It seems subtly irregular, though. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Lazy allocation of pages required for verifying FPI consistency
At Mon, 9 Jan 2023 20:00:00 +0530, Bharath Rupireddy wrote in > I propose to statically allocate these two pages using PGAlignedBlock > structure lazily in verifyBackupPageConsistency() to not waste dynamic > memory worth 2*BLCKSZ bytes. I'm attaching a small patch herewith. > > Thoughts? IMHO, it's a bit scary to me to push the execution stack down by that large a size. I tend to prefer the (current) possible memory waste, which affects only the startup process, over digging that deep into the stack. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
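To make the stack-depth concern above concrete, here is a minimal, self-contained sketch — not the actual verifyBackupPageConsistency() code; PGAlignedBlock is re-declared here in simplified form (the real one lives in src/include/c.h) — contrasting automatic and static storage for the two scratch pages:

```c
#include <string.h>

#define BLCKSZ 8192

/* Simplified stand-in for PostgreSQL's PGAlignedBlock union. */
typedef union PGAlignedBlock
{
	char		data[BLCKSZ];
	double		force_align_d;
	long long	force_align_i64;
} PGAlignedBlock;

/* Variant A: automatic storage -- every call pushes 2*BLCKSZ (16 kB with
 * the default block size) onto the stack. */
int
verify_page_on_stack(const char *primary, const char *replay)
{
	PGAlignedBlock primary_image;
	PGAlignedBlock replay_image;

	memcpy(primary_image.data, primary, BLCKSZ);
	memcpy(replay_image.data, replay, BLCKSZ);
	return memcmp(primary_image.data, replay_image.data, BLCKSZ) == 0;
}

/* Variant B: static storage -- the 16 kB lives in BSS for the whole
 * process lifetime, but the stack stays shallow. */
int
verify_page_static(const char *primary, const char *replay)
{
	static PGAlignedBlock primary_image;
	static PGAlignedBlock replay_image;

	memcpy(primary_image.data, primary, BLCKSZ);
	memcpy(replay_image.data, replay, BLCKSZ);
	return memcmp(primary_image.data, replay_image.data, BLCKSZ) == 0;
}
```

The automatic variant is the cost the message above hesitates over; the static variant trades it for 16 kB of process-lifetime memory, which is the current (and, per the message, preferred) arrangement when only the startup process ever calls it.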
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Wed, 11 Jan 2023 12:46:24 +, "Hayato Kuroda (Fujitsu)" wrote in > them. Which version is better? Some comments from a quick look, different from the above. + CONNECTION 'host=192.168.1.50 port=5432 user=foo dbname=foodb' I understand that we (not PG people, but IT people) are supposed to use in documents a certain set of special addresses that are guaranteed not to be routed in the field. > TEST-NET-1 : 192.0.2.0/24 > TEST-NET-2 : 198.51.100.0/24 > TEST-NET-3 : 203.0.113.0/24 (I found 192.83.123.89 in the postgres_fdw doc, but it'd be another issue..) + if (strspn(tmp, "-0123456789 ") == strlen(tmp)) Do we really need to spend another memory block just to detect apparent non-digits here? + errmsg(INT64_FORMAT " ms is outside the valid range for parameter \"%s\"", We don't use INT64_FORMAT in translatable message strings. Cast and then use %lld instead. This message looks unfriendly as it doesn't suggest the valid range, and it shows the input value in a different unit from the one used in the input. I think we can spell it as "\"%s\" is outside the valid range for subscription parameter \"%s\" (0 .. in milliseconds)" + int64 min_apply_delay; .. + if (ms < 0 || ms > INT_MAX) Why is the variable wider than required? + errmsg("%s and %s are mutually exclusive options", + "min_apply_delay > 0", "streaming = parallel")); Mmm. Couldn't we refuse 0 as min_apply_delay? + sub->minapplydelay > 0) ... + if (opts.min_apply_delay > 0 && Is there any reason for the differentiation? + errmsg("cannot set %s for subscription with %s", + "streaming = parallel", "min_apply_delay > 0")); I think this should read more like natural speech. Say, "cannot enable min_apply_delay for subscription in parallel streaming mode" or something.. The same applies to the nearby message. +static void maybe_delay_apply(TimestampTz ts); apply_spooled_messages(FileSet *stream_fileset, TransactionId xid, - XLogRecPtr lsn) + XLogRecPtr lsn, TimestampTz ts) "ts" looks too generic. Couldn't it be more specific? 
We need an explanation of the parameter in the function comment. + if (!am_parallel_apply_worker()) + { + Assert(ts > 0); + maybe_delay_apply(ts); It seems better to me for the if condition and the assertion to be checked inside maybe_delay_apply(). regards. -- Kyotaro Horiguchi NTT Open Source Software Center
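Regarding the strspn() comment in the review above: the scan can be done in place on the original string, without palloc'ing a scratch copy (the `tmp` in the patch) and without the separate strlen() pass. A hypothetical helper — the name is invented for illustration, this is not code from the patch:

```c
#include <string.h>

/*
 * Check whether a value consists only of an optional sign, digits and
 * spaces.  strspn() returns the length of the leading run of accepted
 * characters, so s[strspn(s, accept)] lands on the first rejected byte,
 * which is '\0' exactly when the whole string is acceptable.  No copy
 * of the string and no strlen() pass are needed.
 */
int
looks_like_integer(const char *s)
{
	return s[strspn(s, "-0123456789 ")] == '\0';
}
```

The same one-liner works directly on the caller's string, which is the point of the "another memory block" question.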
Re: mprove tab completion for ALTER EXTENSION ADD/DROP
At Mon, 2 Jan 2023 13:19:50 +0530, vignesh C wrote in > On Mon, 5 Dec 2022 at 06:53, Michael Paquier wrote: > > > > The DROP could be matched with the objects that are actually part of > > the so-said extension? > > The modified v2 version has the changes to handle the same. Sorry for > the delay as I was working on another project. It suggests the *kinds* of objects that are part of the extension, but lists the objects of each kind regardless of dependency. I read Michael as suggesting (and I agree) that we restrict the listed objects (not the kinds) to those actually part of the extension. (And not restrict the object kinds.) I'm not sure it is useful to restrict object kinds anyway, since the operator already knows what to drop. If you still want to do that, the use of completion_dont_quote looks ugly, since the function (requote_identifier) assumes an identifier as input. I didn't look closer, but maybe we need another way to do that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [PATCH] Add overlaps geometric operators that ignore point overlaps
Hello. At Sun, 1 Jan 2023 01:13:24 +0530, Ankit Kumar Pandey wrote in > This is patch for todo item: Add overlaps geometric operators that > ignore point overlaps > > Issue: > > SELECT circle '((0,0), 1)' && circle '((2,0),1) returns True > > Expectation: In above case, both figures touch other but do not > overlap (i.e. touching != overlap). Hence, it should return false. This may be slightly off from the common definition in other geometric processing systems, but it is the established behavior of PostgreSQL, which presumably already has users. About the behavior itself, since it seems to me that the words "touch" and "overlap" have no rigorous mathematical definitions, the answer depends on the definition. The following discussion may be mere word play.. If circle ((0,0),1) means a circumference, i.e. a set of points described as "x^2 + y^2 = 1" (or it may be a disc containing the area inside (<=) here) and "overlap" means "share at least a point", the two circles are overlapping. This seems to be our current standpoint and what is expressed in the doc. If it meant the area exclusively inside the outline (i.e. x^2 + y^2 < 1), the two circles could be said to be touching but not overlapping. Or, if circle is defined as "(<)= 1" but "overlap" meant "share at least an area", they could be said to be not overlapping but touching? (I'm not sure about the border between a point and an area here, and the distinction would be connected with the annoying EPSILON..) The same discussion holds for boxes or other shapes. > Now, as per as discussion > (https://www.postgresql.org/message-id/20100322175532.GG26428%40fetter.org) > and corresponding change in docs, > https://www.postgresql.org/docs/15/functions-geometry.html, it > mentions > > `Do these objects overlap? (One point in common makes this true.) > `. Does this means current behavior is correct? Or do we still need > the proposed change (if so, with proper updates in docs)? 
> > If current behavior is correct, this todo item might need some update > (unless I missed anything) otherwise any suggestion is welcomed. I read the todo description as we may want *another set* of operators to do that, not to change the current behavior of the existing operators. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Perform streaming logical transactions by background workers and parallel apply
Hello. At Mon, 9 Jan 2023 14:21:03 +0530, Amit Kapila wrote in > Pushed the first (0001) patch. It added the following error message. + seg = dsm_attach(handle); + if (!seg) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), +errmsg("unable to map dynamic shared memory segment"))); On the other hand we already have the following one in parallel.c (another in pg_prewarm) seg = dsm_attach(DatumGetUInt32(main_arg)); if (seg == NULL) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("could not map dynamic shared memory segment"))); Although I don't see a technical difference between the two, all the other occurrences including the one just above (except test_shm_mq) use "could not". My non-durable memory faintly tells me that we have a policy of using "can/could not" rather than "unable". (Mmm. I do find "unable" in StartBackgroundWorker and sepgsql_client_auth.) Shouldn't we use the latter wording rather than the former? If so, it seems to me that test_shm_mq also needs the same amendment, to avoid repeating the mistake in the future. 
= index 2e5914d5d9..a2d7474ed4 100644 --- a/src/backend/replication/logical/applyparallelworker.c +++ b/src/backend/replication/logical/applyparallelworker.c @@ -891,7 +891,7 @@ ParallelApplyWorkerMain(Datum main_arg) if (!seg) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), -errmsg("unable to map dynamic shared memory segment"))); +errmsg("could not map dynamic shared memory segment"))); toc = shm_toc_attach(PG_LOGICAL_APPLY_SHM_MAGIC, dsm_segment_address(seg)); if (!toc) diff --git a/src/test/modules/test_shm_mq/worker.c b/src/test/modules/test_shm_mq/worker.c index 8807727337..005b56023b 100644 --- a/src/test/modules/test_shm_mq/worker.c +++ b/src/test/modules/test_shm_mq/worker.c @@ -81,7 +81,7 @@ test_shm_mq_main(Datum main_arg) if (seg == NULL) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), -errmsg("unable to map dynamic shared memory segment"))); +errmsg("could not map dynamic shared memory segment"))); toc = shm_toc_attach(PG_TEST_SHM_MQ_MAGIC, dsm_segment_address(seg)); if (toc == NULL) ereport(ERROR, = regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: ARRNELEMS Out-of-bounds possible errors
At Fri, 23 Dec 2022 17:37:55 +0900 (JST), Kyotaro Horiguchi wrote in > At Thu, 22 Dec 2022 12:35:58 -0300, Ranier Vilela wrote > in > > Hi. > > > > Per Coverity. > > > > The commit ccff2d2 > > <https://github.com/postgres/postgres/commit/ccff2d20ed9622815df2a7deffce8a7b14830965>, > > changed the behavior function ArrayGetNItems, > > with the introduction of the function ArrayGetNItemsSafe. > > > > Now ArrayGetNItems may return -1, according to the comment. > > " instead of throwing an exception. -1 is returned after an error." > > If I'm reading the code correctly, it's the definition of > ArrayGetNItems*Safe*. ArrayGetNItems() calls that function with a NULL > escontext and the NULL turns ereturn() into ereport(). > That doesn't seem to be changed by the commit. No.. It seems to me that the commit didn't change its behavior in that regard. > Of course teaching Coverity not to issue the false warnings would be > another actual issue that we should do, maybe. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: ARRNELEMS Out-of-bounds possible errors
At Thu, 22 Dec 2022 12:35:58 -0300, Ranier Vilela wrote in > Hi. > > Per Coverity. > > The commit ccff2d2 > <https://github.com/postgres/postgres/commit/ccff2d20ed9622815df2a7deffce8a7b14830965>, > changed the behavior function ArrayGetNItems, > with the introduction of the function ArrayGetNItemsSafe. > > Now ArrayGetNItems may return -1, according to the comment. > " instead of throwing an exception. -1 is returned after an error." If I'm reading the code correctly, that comment describes ArrayGetNItems*Safe*. ArrayGetNItems() calls that function with a NULL escontext, and the NULL turns ereturn() into ereport(). That doesn't seem to have been changed by the commit. Of course, teaching Coverity not to issue these false warnings would be another task worth doing, maybe. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
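The NULL-escontext convention described above can be sketched in miniature. This is not the real ArrayGetNItems() code — the struct, the limit, and the error handling below are simplified, invented stand-ins — but it shows why the non-Safe entry point can never actually return -1, which is why the Coverity warning is a false positive:

```c
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for PostgreSQL's ErrorSaveContext soft-error
 * protocol; the name MAX_ITEMS and the limit are made up. */
typedef struct ErrorSaveContext
{
	int			error_occurred;
} ErrorSaveContext;

#define MAX_ITEMS 1000000

/*
 * "Safe" variant: with a non-NULL escontext a bad input is recorded
 * there and -1 is returned; with a NULL escontext the error is raised
 * hard (ereport() in PostgreSQL, exit() in this sketch), so control
 * never reaches the "return -1" path.
 */
int
get_nitems_safe(int ndim, const int *dims, ErrorSaveContext *escontext)
{
	long long	n = 1;
	int			i;

	for (i = 0; i < ndim; i++)
	{
		n *= dims[i];
		if (dims[i] < 0 || n > MAX_ITEMS)
		{
			if (escontext != NULL)
			{
				escontext->error_occurred = 1;
				return -1;		/* soft error: caller must check */
			}
			fprintf(stderr, "array size exceeds the maximum allowed\n");
			exit(1);			/* hard error: no return */
		}
	}
	return (int) n;
}

/* Non-Safe wrapper: passing NULL means errors throw, so callers of this
 * entry point can never observe -1. */
int
get_nitems(int ndim, const int *dims)
{
	return get_nitems_safe(ndim, dims, NULL);
}
```

A static analyzer that only sees the `return -1` inside the shared body, without tracking that the wrapper always passes NULL, will report an impossible out-of-bounds path — which matches the reading in the message above.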
Re: Force streaming every change in logical decoding
At Thu, 22 Dec 2022 14:53:34 +0530, Amit Kapila wrote in > For this, I like your proposal for "buffered" as an explicit default value. Good to hear. > Okay, how about modifying it to: "When set to > immediate, stream each change if > streaming option (see optional parameters set by > CREATE SUBSCRIPTION) is enabled, otherwise, serialize each change. Looks fine. Thanks! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Exit walsender before confirming remote flush in logical replication
At Thu, 22 Dec 2022 17:29:34 +0530, Ashutosh Bapat wrote in > On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu) > wrote: > > In case of logical replication, however, we cannot support the use-case that > > switches the role publisher <-> subscriber. Suppose same case as above, > > additional .. > > Therefore, I think that we can ignore the condition for shutting down the > > walsender in logical replication. ... > > This change may be useful for time-delayed logical replication. The > > walsender > > waits the shutdown until all changes are applied on subscriber, even if it > > is > > delayed. This causes that publisher cannot be stopped if large delay-time is > > specified. > > I think the current behaviour is an artifact of using the same WAL > sender code for both logical and physical replication. Yeah, I don't think we do that for the sake of switchover. On the other hand, I think the behavior was intentionally carried over since it is considered sensible on its own. And I'm afraid that many people already rely on that behavior. > I agree with you that the logical WAL sender need not wait for all the > WAL to be replayed downstream. Thus I feel it might be a bit outrageous to get rid of that behavior altogether because a new feature stumbles on it. I'm fine with doing that only in the apply_delay case, though. However, I have another concern: we would be introducing a second exception for XLogSendLogical in the common path. I doubt that anyone wants to use synchronous logical replication with apply_delay, since the sender transaction is inevitably affected back by that delay. So how about the logrep worker sending a kind of crafted feedback before entering an apply_delay, reporting commit_data.end_lsn as the flush position? A little tweak is needed in send_feedback(), but it seems to work.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Force streaming every change in logical decoding
At Thu, 22 Dec 2022 16:59:30 +0900, Masahiko Sawada wrote in > On Thu, Dec 22, 2022 at 4:18 PM Hayato Kuroda (Fujitsu) > wrote: > > > > Dear Amit, > > > > Thank you for updating the patch. I have also checked the patch > > and basically it has worked well. Almost all things I found were modified > > by v4. > > > > One comment: while setting logical_decoding_mode to wrong value, > > I got unfriendly ERROR message. > > > > ``` > > postgres=# SET logical_decoding_mode = 1; > > ERROR: invalid value for parameter "logical_decoding_mode": "1" > > HINT: Available values: , immediate > > ``` > > > > Here all acceptable enum should be output as HINT, but we could not see the > > empty string. > > Should we modify config_enum_get_options() for treating empty string, maybe > > like (empty)? > > Good point. I think the hint message can say "The only allowed value > is \"immediate\" as recovery_target does. Or considering the name of > logical_decoding_mode, we might want to have a non-empty string, say > 'normal' as Kuroda-san proposed, as the default value. Oh. I missed this, and agree to have the explicit default option. I slightly prefer "buffered" but "normal" also works fine for me. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures
At Thu, 22 Dec 2022 10:09:23 +0530, Bharath Rupireddy wrote in > On Thu, Dec 22, 2022 at 7:57 AM Michael Paquier wrote: > > > > On Wed, Dec 21, 2022 at 10:22:02PM +0100, Magnus Hagander wrote: > > > Basically, we take one thing and turn it into 3. That very naturally rings > > > with "split" to me. > > > > > > Parse might work as well, certainly better than dissect. I'd still prefer > > > split though. > > > > Honestly, I don't have any counter-arguments, so I am fine to switch > > the name as you are suggesting. And pg_split_walfile_name() it is? > > +1. FWIW, a simple patch is here > https://www.postgresql.org/message-id/CALj2ACXdZ7WGRD-_jPPeZugvWLN%2Bgxo3QtV-eZPRicUwjesM%3Dg%40mail.gmail.com. By the way, the function is documented as follows. > Extracts the file sequence number and timeline ID from a WAL file name. I didn't find a definition of the term "file sequence number" in the doc. Instead I find "segment number" (a bit doubtful, though..). In the first place, readers can hardly associate "file sequence number" with "segno" by appearance, I think. (Yeah, we can identify it since the other parameter is identifiable on its own.) Why don't we spell out the parameter simply as "segment number"? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Force streaming every change in logical decoding
At Thu, 22 Dec 2022 12:35:46 +0530, Amit Kapila wrote in > I have addressed these comments in the attached. Additionally, I have > modified the docs and commit messages to make those clear. I think > instead of adding new tests with this patch, it may be better to > change some of the existing tests related to streaming to use this > parameter as that will clearly show one of the purposes of this patch. I'm late to this, but I'm worried that we might someday regret the lack of an explicit default. Do we really not need one? +Allows streaming or serializing changes immediately in logical decoding. +The allowed values of logical_decoding_mode are the +empty string and immediate. When set to +immediate, stream each change if +streaming option is enabled, otherwise, serialize +each change. When set to an empty string, which is the default, +decoding will stream or serialize changes when +logical_decoding_work_mem is reached. With (really) fresh eyes, it took me a while to understand what the "streaming" option is. Couldn't we augment the description a bit? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Inconsistency in reporting checkpointer stats
At Wed, 21 Dec 2022 17:14:12 +0530, Nitin Jadhav wrote in > [1]: > 2022-12-21 10:52:25.931 UTC [63530] LOG: checkpoint complete: wrote > 4670 buffers (28.5%), wrote 3 slru buffers (0.0%); 0 WAL file(s) > added, 0 removed, 4 recycled; write=0.045 s, sync=0.161 s, total=0.244 > s; sync files=25, longest=0.146 s, average=0.007 s; distance=66130 kB, > estimate=66130 kB; lsn=0/5557C78, redo lsn=0/5557C40 > > Thanks & Regards, > Nitin Jadhav > > On Tue, Dec 20, 2022 at 11:08 PM Andres Freund wrote: > > > > On 2022-12-20 08:18:36 -0500, Robert Haas wrote: > > > I think that the SLRU information is potentially useful, but mixing it > > > with the information about regular buffers just seems confusing. > > > > +1 > > > > At least for now, it'd be different if/when we manage to move SLRUs to > > the main buffer pool. It sounds reasonable to exclude SLRU writes from buffer writes. But I'm not sure it's useful to count SLRU writes separately, since they are below the noise level of buffer writes in regular cases and the value doesn't lead to tuning. However, I'm not strongly opposed to adding it either. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Inconsistency in reporting checkpointer stats
At Mon, 19 Dec 2022 18:05:38 +0530, Bharath Rupireddy wrote in > On Fri, Dec 16, 2022 at 2:14 PM Kyotaro Horiguchi > wrote: > > > > In the first place I don't like that we count the same things twice. > > Couldn't we count the number only by any one of them? > > > > If we remove CheckPointerStats.ckpt_bufs_written, CreateCheckPoint can > > get the final number as the difference between the start-end values of > > *the shared stats*. As long as a checkpoint runs on a single process, > > trace info in BufferSync will work fine. Assuming single process > > checkpointing there must be no problem to do that. (Anyway the current > > shared stats update for checkpointer is assuming single-process). > > What if someone resets checkpointer shared stats with > pg_stat_reset_shared()? In such a case, the checkpoint complete > message will not have the stats, no? I don't know. I don't believe the stats system is expected to follow such a strict resetting policy. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: (non) translatable string splicing
At Mon, 19 Dec 2022 16:14:04 -0500, Robert Haas wrote in > On Fri, Dec 16, 2022 at 8:25 AM Justin Pryzby wrote: > > Due to incomplete translation, that allows some pretty fancy output, > > like: > > | You must identify the directory where the residen los binarios del > > clúster antiguo. > > I can't see how that could be mejor. :-) It was quite annoying but not untranslatable. But the "the" before "residen" looks badly misplaced :p It should be part of the inner text ("residen los.."). regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: (non) translatable string splicing
At Mon, 19 Dec 2022 13:20:55 +0900, Michael Paquier wrote in > On Fri, Dec 16, 2022 at 07:24:52AM -0600, Justin Pryzby wrote: > > Due to incomplete translation, that allows some pretty fancy output, > > like: > > | You must identify the directory where the residen los binarios del > > clúster antiguo. > > > > That commit also does this a couple times: > > > > +_(" which is an index on \"%s.%s\""), For this specific case I didn't find it difficult, since it is compatible with "This is blah" in that context. > Ugh. Perhaps we could just simplify the wordings as of "index on > blah", "index on OID %u", "TOAST table for blah" and "TOAST table for > OID %u" with newlines after each item? I'm fine with just removing " which ", but I don't understand the part about the extra newlines. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Inconsistency in reporting checkpointer stats
At Wed, 14 Dec 2022 16:54:53 +0530, Bharath Rupireddy wrote in > Indeed PendingCheckpointerStats.buf_written_checkpoints needs to count > buffer writes in SlruInternalWritePage(). However, does it need to be > done immediately there? The stats will not be visible to the users > until the next pgstat_report_checkpointer(). Incrementing > buf_written_checkpoints in BufferSync() makes sense as the > pgstat_report_checkpointer() gets called in there via > CheckpointWriteDelay() and it becomes visible to the user immediately. > Have you checked if pgstat_report_checkpointer() gets called while the > checkpoint calls SlruInternalWritePage()? If not, then you can just > assign ckpt_bufs_written to buf_written_checkpoints in > LogCheckpointEnd() like its other friends > checkpoint_write_time and checkpoint_sync_time. If I'm understanding Bharath correctly, that results in double counting in BufferSync. If we want to keep the realtime-reporting nature of BufferSync, BufferSync should give up incrementing CheckPointerStats' counter. Such a separation seems rather clumsy and quite bug-prone. In the first place, I don't like that we count the same things twice. Couldn't we count the number in just one of them? If we remove CheckPointerStats.ckpt_bufs_written, CreateCheckPoint can get the final number as the difference between the start and end values of *the shared stats*. As long as a checkpoint runs on a single process, the trace info in BufferSync will work fine. Assuming single-process checkpointing, there must be no problem doing that. (Anyway, the current shared stats update for the checkpointer assumes a single process.) Otherwise, in exchange for giving up the realtime nature, we can count the number only in CheckPointerStats.ckpt_bufs_written. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
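The count-once idea above — drop the duplicate counter and derive the per-checkpoint figure as an end-minus-start difference over the shared stats — can be sketched as follows. All names here are invented for illustration (the real counters live in PgStat_CheckpointerStats and CheckpointStats), and the sketch relies on the single-process-checkpointer assumption stated in the message:

```c
/* Stand-in for the shared checkpointer stats: one monotonically
 * increasing counter, incremented exactly once per buffer write. */
typedef struct SharedCheckpointerStats
{
	long		buf_written_checkpoints;
} SharedCheckpointerStats;

SharedCheckpointerStats shared_stats;

/* BufferSync-like path: count the write in the single shared counter,
 * so the value is visible to stats readers in "real time". */
void
count_buffer_write(void)
{
	shared_stats.buf_written_checkpoints++;
}

/* CreateCheckPoint-like path: snapshot the counter at start, write the
 * buffers, and report the difference at the end -- no second counter. */
long
run_checkpoint(int nbuffers_to_write)
{
	long		start = shared_stats.buf_written_checkpoints;
	int			i;

	for (i = 0; i < nbuffers_to_write; i++)
		count_buffer_write();

	/* what the "checkpoint complete" log line would report */
	return shared_stats.buf_written_checkpoints - start;
}
```

With a single checkpointer process nothing else advances the counter between the two reads, so the difference equals the number of writes this checkpoint performed — which is what makes the duplicate ckpt_bufs_written counter removable.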
Re: plpgsq_plugin's stmt_end() is not called when an error is caught
At Thu, 15 Dec 2022 09:03:12 +0100, Pavel Stehule wrote in > I found some solution based by using fmgr hook > > https://github.com/okbob/plpgsql_check/commit/9a17e97354a48913de5219048ee3be6f8460bae9 Oh! Thanks for the pointer, will look into that. By the way, it seems to me that the tool is using RegisterResourceReleaseCallback to reset the function nest level. But since there's a case where that mechanism doesn't work, I suspect that the callback can be missed in some cases of error return, which would be a bug if true.. # I haven't confirmed that behavior myself, though. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: plpgsq_plugin's stmt_end() is not called when an error is caught
At Thu, 15 Dec 2022 08:41:21 +0100, Pavel Stehule wrote in > čt 15. 12. 2022 v 8:25 odesílatel Masahiko Sawada > napsal: > > Is this a bug in plpgsql? > > > > I think it is by design. There is not any callback that is called after an > exception. > > It is true, so some callbacks on statement error and function's error can > be nice. It can help me to implement profilers, or tracers more simply and > more robustly. > > But I am not sure about performance impacts. This is on a critical path. I haven't searched for them, but I guess the end-side callbacks of all begin/end-type callback pairs go uncalled on exception. An additional PG_TRY level wouldn't be acceptable for performance reasons. What we (pg_hint_plan people) want is some means of knowing that the top-level function has exited, in order to reset the function nest level. That would be simpler than calling the end callback at every nest level. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Thu, 15 Dec 2022 10:29:17 +0530, Amit Kapila wrote in > On Thu, Dec 15, 2022 at 10:11 AM Kyotaro Horiguchi > wrote: > > > > At Thu, 15 Dec 2022 09:18:55 +0530, Amit Kapila > > wrote in > > > On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi > > > wrote: > > > subscriber was busy enough that it doesn't need to add an additional > > > delay before applying a particular transaction(s) but adding a delay > > > to such a transaction on the publisher will actually make it take much > > > longer to reflect than expected. We probably need to name this > > > > Isn't the name min_apply_delay implying the same behavior? Even though > > the delay time will be a bit prolonged. > > > > Sorry, I don't understand what you intend to say in this point. In > above, I mean that the currently proposed patch won't have such a > problem but if we apply delay on publisher the problem can happen. Are you saying that the sender-side delay lets the whole transaction (if it has not been streamed out) stay on the sender side? If so... yeah, I agree that that is undesirable. > > > parameter as min_send_delay if we want to do what you are saying and > > > then I don't know if it serves the actual need and also it will be > > > different from what we do in physical standby. > > > > In the first place phisical and logical replication works differently > > and the mechanism to delaying "apply" differs even in the current > > state in terms of logrep delay choking stream. > > > > I think the first preference is to make it work in a similar way (as > much as possible) to how this parameter works in physical standby and > if that is not at all possible then we may consider other approaches. I understood that. 
However, I still think choking the stream on the receiver side alone is kind of ugly, since it breaks the protocol assumption that in-band maintenance packets are processed in an on-time manner on the peer under normal operation (even if subject to some delays for natural reasons). In this regard, I'm inclined to be in favor of Kuroda-san's proposal.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Thu, 15 Dec 2022 09:18:55 +0530, Amit Kapila wrote in > On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi > wrote: > > > > At Wed, 14 Dec 2022 16:30:28 +0530, Amit Kapila > > wrote in > > > On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu) > > > wrote: > > > > One idea to avoid that is to send the min_apply_delay subscriber option > > > > to publisher > > > > and compare them, but it may be not sufficient. Because XXX_timout GUC > > > > parameters > > > > could be modified later. > > > > > > > > > > How about restarting the apply worker if min_apply_delay changes? Will > > > that be sufficient? > > > > Mmm. If publisher knows that value, isn't it albe to delay *sending* > > data in the first place? This will resolve many known issues including > > walsender's un-terminatability, possible buffer-full and status packet > > exchanging. > > > > Yeah, but won't it change the meaning of this parameter? Say the It changes internally, but not on its face. The difference is only in where the choking point exists. If "..._apply_delay" is to work literally, we should go the way Kuroda-san proposed, namely: "the apply worker has received the data, but will delay applying it". If we named the current behavior technically correctly, it would be "min_receive_delay" or "min_choking_interval". > subscriber was busy enough that it doesn't need to add an additional > delay before applying a particular transaction(s) but adding a delay > to such a transaction on the publisher will actually make it take much > longer to reflect than expected. We probably need to name this Doesn't the name min_apply_delay imply the same behavior, even though the delay time will be a bit prolonged? > parameter as min_send_delay if we want to do what you are saying and > then I don't know if it serves the actual need and also it will be > different from what we do in physical standby. 
In the first place, physical and logical replication work differently, and the mechanism for delaying "apply" differs even in the current state, in that the logrep delay chokes the stream. I guess they cannot differ in terms of normal operation, but I'm not sure. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Thu, 15 Dec 2022 09:23:12 +0530, Amit Kapila wrote in > On Thu, Dec 15, 2022 at 7:16 AM Kyotaro Horiguchi > wrote: > > Allowing walsender to finish ignoring replication status > > wouldn't be great. > > > > Yes, that would be ideal. But do you know why that is a must? I believe a graceful shutdown (fast and smart) of a replication set is expected to leave it in sync. Of course, we can change the policy to allow walsender to stop before confirming that all WAL has been applied. However, walsender has no idea whether the peer is intentionally delaying or not. > > One idea is to let logical workers send delaying > > status. > > > > How can that help? If the logical worker notifies "I'm intentionally pausing replication for now, so if you want to shut down, please go ahead and ignore me", the publisher can legitimately run a (kind of) dirty shutdown. # It looks a bit too much, though.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Wed, 14 Dec 2022 16:30:28 +0530, Amit Kapila wrote in > On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu) > wrote: > > One idea to avoid that is to send the min_apply_delay subscriber option to > > publisher > > and compare them, but it may be not sufficient. Because XXX_timout GUC > > parameters > > could be modified later. > > > > How about restarting the apply worker if min_apply_delay changes? Will > that be sufficient? Mmm. If the publisher knows that value, isn't it able to delay *sending* the data in the first place? This would resolve many known issues, including walsender's unterminatability, possible buffer-full conditions and status packet exchanging. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Wed, 14 Dec 2022 10:46:17 +, "Hayato Kuroda (Fujitsu)" wrote in > I have implemented and tested that workers wake up per wal_receiver_timeout/2 > and send keepalive. Basically it works well, but I found two problems. > Do you have any good suggestions about them? > > 1) > > With this PoC at present, workers calculate sending intervals based on its > wal_receiver_timeout, and it is suppressed when the parameter is set to zero. > > This means that there is a possibility that walsender is timeout when > wal_sender_timeout > in publisher and wal_receiver_timeout in subscriber is different. > Supposing that wal_sender_timeout is 2min, wal_receiver_tiemout is 5min, It seems to me wal_receiver_status_interval is better suited for this use. It's enough for us to document that "wal_receiver_status_interval should be shorter than wal_sender_timeout/2, especially when a logical replication connection uses min_apply_delay. Otherwise you will suffer repeated termination of walsender". > and min_apply_delay is 10min. The worker on subscriber will wake up per > 2.5min and > send keepalives, but walsender exits before the message arrives to publisher. > > One idea to avoid that is to send the min_apply_delay subscriber option to > publisher > and compare them, but it may be not sufficient. Because XXX_timout GUC > parameters > could be modified later. # Anyway, I don't think such an asymmetric setup is preferable. > 2) > > The issue reported by Vignesh-san[1] has still remained. I have already > analyzed that [2], > the root cause is that flushed WAL is not updated and sent to the publisher. > Even > if workers send keepalive messages to pub during the delay, the flushed > position > cannot be modified. I didn't look closely, but my guess at the cause is that walsender doesn't die until all WAL has been sent, while the logical delay chokes the replication stream. Allowing walsender to finish while ignoring the replication status wouldn't be great. One idea is to let logical workers send a delaying status. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center
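The timeout relationship discussed above can be sketched as follows. This is a hypothetical illustration, not actual PostgreSQL code; `choose_keepalive_interval()` is an invented helper. The idea is simply that a keepalive must reach the publisher before wal_sender_timeout expires, with half the timeout as the customary safety margin.

```c
/*
 * Hypothetical sketch: pick a keepalive interval (in seconds) that keeps
 * the publisher's walsender alive regardless of subscriber-side settings.
 * A value of 0 means the corresponding setting is disabled.
 */
static int
choose_keepalive_interval(int wal_receiver_status_interval,
                          int wal_sender_timeout)
{
    int interval = wal_receiver_status_interval;

    /* keepalives must arrive before wal_sender_timeout; half is the margin */
    if (wal_sender_timeout > 0 &&
        (interval == 0 || interval > wal_sender_timeout / 2))
        interval = wal_sender_timeout / 2;

    return interval;
}
```

With the asymmetric setup in the quoted mail (wal_sender_timeout = 2min on the publisher, a 5min status interval on the subscriber), this yields 60 seconds, short enough to keep the publisher's walsender alive.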
Re: pg_upgrade: Make testing different transfer modes easier
At Wed, 14 Dec 2022 10:40:45 +0100, Daniel Gustafsson wrote in > > On 14 Dec 2022, at 08:04, Peter Eisentraut > > wrote: > > > > On 07.12.22 17:33, Peter Eisentraut wrote: > >> I think if we want to make this configurable on the fly, an environment > >> variable would be much easier, like > >> my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy'; > > > > Here is an updated patch set that incorporates this idea. We already have PG_TEST_EXTRA. Shouldn't we use it here as well? > I would prefer a small note about it in src/bin/pg_upgrade/TESTING to document > it outside of the code, but otherwise LGTM. > > + $mode, > '--check' > ], > > ... > > - '-p', $oldnode->port, '-P', $newnode->port > + '-p', $oldnode->port, '-P', $newnode->port, > + $mode, > ], > > Minor nitpick, but while in there should we take the opportunity to add a > trailing comma on the other two array declarations which now end with > --check? > It's good Perl practice and will make the code consistent. Otherwise looks good to me. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Shortening the Scope of Critical Section in Creation of New MultiXacts
Hello. In short, the code as-is looks correct. At Wed, 14 Dec 2022 00:14:34 +, "Bagga, Rishu" wrote in >* Critical section from here until caller has written the data into the >* just-reserved SLRU space; we don't want to error out with a partly "The data" here means the whole of this multixact, which includes the members. We shouldn't exit the critical section at least until the very end of RecordNewMultiXact(). > This makes sense, as we need the multixact state and multixact offset > data to be consistent, but once we write data into the offsets, we can > end the critical section. Currently we wait until the members data is > also written before we end the critical section. Why do you think that the members are not a part of a multixact? A multixact is not complete without them. In addition to the above, we cannot simply move the END_CRIT_SECTION() there, since RecordNewMultiXact() has another caller that doesn't start a critical section. By the way, I haven't tested this myself, but.. > This passes regression tests If you ran the same tests with an assertion-enabled build, you would likely get an assertion failure. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Tue, 13 Dec 2022 17:05:35 +0530, Amit Kapila wrote in > On Tue, Dec 13, 2022 at 7:35 AM Kyotaro Horiguchi > wrote: > > > > At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila > > wrote in > Yeah, I think ideally it will timeout but if we have a solution like > during delay, we keep sending ping messages time-to-time, it should > work fine. However, that needs to be verified. Do you see any reasons > why that won't work? Ah. By "I'm not sure" I meant "I have no clear idea of whether it works". I looked into it a bit further. ProcessPendingWrites() eventually waits for the streaming socket to become writable, so no critical problem is found here. That being said, it might be better if ProcessPendingWrites() refrained from sending consecutive keepalives while waiting; a 30s ping timeout and a 1h delay may result in 120 successive pings. It might not be a big deal, though. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
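The suppression idea mentioned above could look something like this. A hypothetical sketch, not walsender code; `keepalive_due()` is an invented helper that rate-limits keepalives to at most one per interval.

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of suppressing redundant keepalives while a long
 * apply delay blocks the stream: send one only if at least min_interval
 * seconds have passed since the previous one.  Times are in seconds
 * since an arbitrary epoch.
 */
static long last_keepalive = -1;

static bool
keepalive_due(long now, int min_interval)
{
    if (last_keepalive >= 0 && now - last_keepalive < min_interval)
        return false;            /* suppress: one was sent recently */
    last_keepalive = now;        /* remember when we last sent */
    return true;
}
```

With a 30s ping timeout and a 1h delay, an unsuppressed sender would emit 3600 / 30 = 120 pings; the check above collapses any burst inside min_interval into a single one.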
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Tue, 13 Dec 2022 02:28:49 +, "Takamichi Osumi (Fujitsu)" wrote in > On Wednesday, December 7, 2022 12:00 PM Kyotaro Horiguchi > wrote: > > We couldn't reproduce this failure, nor > could we find the same type of failure on the cfbot among past failures. > It seems no subtests ran in your environment. Very sorry for that. The test script turned out to be a left-over file in a git-reset'ed working tree. Please forget about it. FWIW, the latest patch passed make-world for me on Rocky8/x86_64. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila wrote in > On Mon, Dec 12, 2022 at 1:04 PM Hayato Kuroda (Fujitsu) > wrote: > > once and apply later. Our basic design is as follows: > > > > * All transactions are serialized to files if min_apply_delay is set to > > non-zero. > > * After receiving the commit message and spending time, workers read and > > apply spooled messages > > > > I think this may be more work than required because in some cases > doing I/O just to delay xacts will later lead to more work. Can't we > send some ping to walsender to communicate that walreceiver is alive? > We already seem to be sending a ping in LogicalRepApplyLoop if we > haven't heard anything from the server for more than > wal_receiver_timeout / 2. Now, it is possible that the walsender is > terminated due to some other reason and we need to see if we can > detect that or if it will only be detected once the walreceiver's > delay time is over. FWIW, I thought the same thing as Amit. What we should do here is have logrep workers notify walsender that they are alive and that the communication in between is fine, and maybe report the worker's status. Spontaneous send_feedback() calls while delaying will be sufficient for this purpose. We might need to suppress extra forced feedbacks instead. In contrast, the worker doesn't need to bother knowing whether the peer is alive until it receives the next data, though we might need to adjust the wait_time in LogicalRepApplyLoop(). But I'm not sure what will happen when walsender is blocked for a long time by a full buffer. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
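The spontaneous-feedback idea above can be sketched as follows. This is a hypothetical illustration, not the apply worker's actual code; `send_feedback()` here is a counting stub standing in for the real function, and `delay_with_feedback()` is an invented helper.

```c
/*
 * Hypothetical sketch: while the apply worker waits out min_apply_delay,
 * it wakes up periodically and sends feedback so walsender knows the
 * connection is alive.  All times are in seconds.
 */
static int feedback_count = 0;

static void
send_feedback(void)
{
    feedback_count++;            /* the real function reports LSN positions */
}

static void
delay_with_feedback(int delay_secs, int wakeup_secs)
{
    int elapsed = 0;

    while (elapsed < delay_secs)
    {
        int sleep_for = delay_secs - elapsed;

        if (sleep_for > wakeup_secs)
            sleep_for = wakeup_secs;
        /* the real worker would pg_usleep() here */
        elapsed += sleep_for;
        send_feedback();         /* spontaneous feedback during the delay */
    }
}
```

A 60s delay with a 10s wakeup interval produces six feedback messages, each keeping the walsender's timeout from firing.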
Re: Time delayed LR (WAS Re: logical replication restrictions)
Hello. At Mon, 12 Dec 2022 07:42:30 +, "Takamichi Osumi (Fujitsu)" wrote in > On Monday, December 12, 2022 2:54 PM Kyotaro Horiguchi > wrote: > > I asked about unexpected walsender termination caused by this feature, but I > > don't think I received an answer for it, and the behavior still exists. .. > Thank you so much for your report! > Yes. Currently, how to deal with the timeout issue is under discussion. > Some analysis of the root cause is also there. > > Kindly have a look at [1]. > > > [1] - > https://www.postgresql.org/message-id/TYAPR01MB58669394A67F2340B82E42D1F5E29%40TYAPR01MB5866.jpnprd01.prod.outlook.com Oops. Thank you for the pointer. I will visit there. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
Hello. I asked about unexpected walsender termination caused by this feature, but I don't think I received an answer for it, and the behavior still exists. Specifically, if the servers have the following settings, walsender terminates due to replication timeout. After that, the connection is restored once the LR delay elapses. Although it can be said to be working in that sense, the error happens repeatedly at roughly min_apply_delay intervals and is hard to distinguish from network troubles. I'm not sure whether you are deliberately okay with this, but I don't think behavior that causes replication timeouts is acceptable. > wal_sender_timeout = 10s; > wal_receiver_timeout = 10s; > > create subscription ... with (min_apply_delay='60s'); This setup is somewhat artificial, but timeout=60s and delay=5min is not an uncommon setup, and that also causes this "issue". subscriber: > 2022-12-12 14:17:18.139 JST LOG: terminating walsender process due to > replication timeout > 2022-12-12 14:18:11.076 JST LOG: starting logical decoding for slot "s1" ... regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Improve WALRead() to suck data directly from WAL buffers when possible
Sorry for the confusion. At Mon, 12 Dec 2022 12:06:36 +0900 (JST), Kyotaro Horiguchi wrote in > At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi > wrote in > > This patch copies the bleeding-edge WAL page without recording the > > (next) insertion point or checking whether all in-progress insertions > > behind the target LSN have finished. Thus the copied page may have > > holes. That being said, the sequential-reading nature and the fact > > that WAL buffers are zero-initialized may make it work for recovery, > > but I don't think this also works for replication. > > Mmm. I'm a bit dim. Recovery doesn't read concurrently-written > records. Please forget about recovery. No... Logical walsenders do that. So, please forget about this... > > I remember that one of the advantages of reading the in-memory WAL > > records is that it allows walsender to pre-send unwritten > > records. So perhaps we should manage how far the buffer is filled with > > valid content (or how far we can pre-send) in this feature. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Improve WALRead() to suck data directly from WAL buffers when possible
At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi wrote in > This patch copies the bleeding-edge WAL page without recording the > (next) insertion point or checking whether all in-progress insertions > behind the target LSN have finished. Thus the copied page may have > holes. That being said, the sequential-reading nature and the fact > that WAL buffers are zero-initialized may make it work for recovery, > but I don't think this also works for replication. Mmm. I'm a bit dim. Recovery doesn't read concurrently-written records. Please forget about recovery. > I remember that one of the advantages of reading the in-memory WAL > records is that it allows walsender to pre-send unwritten > records. So perhaps we should manage how far the buffer is filled with > valid content (or how far we can pre-send) in this feature. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Improve WALRead() to suck data directly from WAL buffers when possible
At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy wrote in > The patch introduces concurrent readers for the WAL buffers; so far > there are only concurrent writers. In the patch, WALRead() takes just > one lock (WALBufMappingLock) in shared mode to enable concurrent > readers and does minimal things - checks if the requested WAL page is > present in WAL buffers, if so, copies the page and releases the lock. > I think taking just WALBufMappingLock is enough here as the concurrent > writers depend on it to initialize and replace a page in WAL buffers. > > I'll add this to the next commitfest. > > Thoughts? This adds copying of a whole page (at least) for every WAL *record* read, fighting all WAL writers by taking WALBufMappingLock on a very busy page while copying. I'm a bit doubtful that it results in an overall improvement. It seems to me that almost all pread()s here hit the OS page cache, so it is unclear to me that copying a whole WAL page (then copying the target record again) wins over a pread() call that copies only the record to read. Do you have actual numbers for how frequently WAL reads go to disk, or for the performance gain or real I/O reduction this patch offers? This patch copies the bleeding-edge WAL page without recording the (next) insertion point or checking whether all in-progress insertions behind the target LSN have finished. Thus the copied page may have holes. That being said, the sequential-reading nature and the fact that WAL buffers are zero-initialized may make it work for recovery, but I don't think this also works for replication. I remember that one of the advantages of reading the in-memory WAL records is that it allows walsender to pre-send unwritten records. So perhaps we should manage how far the buffer is filled with valid content (or how far we can pre-send) in this feature. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
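A minimal sketch of the buffer-hit check being discussed, assuming (as upstream PostgreSQL does with XLogCtl->xlblocks) that each WAL buffer slot records the end LSN of the page it holds. The names and layout here are simplified stand-ins, not the patch's actual code.

```c
#include <stdint.h>
#include <stdbool.h>

#define WAL_PAGE_SIZE 8192
#define N_WAL_BUFFERS 16

typedef uint64_t XLogRecPtr;

/* stand-in for XLogCtl->xlblocks: end LSN of the page held in each slot */
static XLogRecPtr buffer_page_lsn[N_WAL_BUFFERS] = {
    [1] = 2 * WAL_PAGE_SIZE,     /* slot 1 holds the page starting at 8192 */
};

static bool
wal_page_in_buffers(XLogRecPtr target_page_start)
{
    int slot = (int) ((target_page_start / WAL_PAGE_SIZE) % N_WAL_BUFFERS);

    /*
     * In the patch under discussion this check (and the subsequent copy)
     * happens while holding WALBufMappingLock in shared mode, so a writer
     * cannot replace the page underneath the reader.
     */
    return buffer_page_lsn[slot] == target_page_start + WAL_PAGE_SIZE;
}
```

The miss case is a page that has already been recycled: the slot index wraps around, but the stored end LSN no longer matches, so the reader falls back to pread().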
Re: Question regarding "Make archiver process an auxiliary process. commit"
At Wed, 7 Dec 2022 11:28:23 +0530, Sravan Kumar wrote in > On Tue, Dec 6, 2022 at 5:24 PM Bharath Rupireddy < > bharath.rupireddyforpostg...@gmail.com> wrote: > > > On Tue, Dec 6, 2022 at 4:57 PM Sravan Kumar > > wrote: > > > > > > Thank you for the feedback. > > > > > > I'll be glad to help with the fix. Here's the patch for review. > > > > Thanks. +1 for fixing this. > >> I would like to quote recent discussions on reducing the useless > >> wakeups or increasing the sleep/hibernation times in various processes > >> to reduce power usage [1] [2] [3] [4] [5]. With that in context, > >> does the archiver need to wake up every 60 sec at all when its latch > >> gets set (PgArchWakeup()) whenever the server switches to a new WAL > >> file? What happens if we get rid of PGARCH_AUTOWAKE_INTERVAL and rely > >> on its latch being set? If required, we can spread PgArchWakeup() to > >> more places, no? > > > > > I like the idea of not having to wake up intermittently, and probably we > should start a discussion about it. > > I see the following comment in PgArchWakeup(). > > /* > * We don't acquire ProcArrayLock here. It's actually fine because > * procLatch isn't ever freed, so we just can potentially set the wrong > * process' (or no process') latch. Even in that case the archiver will > * be relaunched shortly and will start archiving. > */ > > While I do not fully understand the comment, it gives me the impression that > the SetLatch() done here is counting on the timeout to wake up the archiver > in some cases where the latch is wrongly set. It is talking about the first iteration of the archiver process, not the periodic wakeups. So I think it is doable, if we consider how to handle incomplete archiving iterations. > The proposed idea is a behaviour change, while this thread intends to clean > up some code that's > a result of the mentioned commit. So probably the proposed idea can be > discussed in a separate thread. Agreed. 
-- Kyotaro Horiguchi NTT Open Source Software Center
Re: Question regarding "Make archiver process an auxiliary process. commit"
At Tue, 6 Dec 2022 17:23:50 +0530, Bharath Rupireddy wrote in > Thanks. +1 for fixing this. > > I would like to quote recent discussions on reducing the useless > wakeups or increasing the sleep/hibernation times in various processes > to reduce power usage [1] [2] [3] [4] [5]. With that in context, > does the archiver need to wake up every 60 sec at all when its latch > gets set (PgArchWakeup()) whenever the server switches to a new WAL > file? What happens if we get rid of PGARCH_AUTOWAKE_INTERVAL and rely > on its latch being set? If required, we can spread PgArchWakeup() to > more places, no? I thought so at first, but archiving may be interrupted for various reasons (disk full is, I think, the most common one), so intentional wakeups alone aren't sufficient. > Before even answering the above questions, I think we need to see if > there're any cases where a process can miss SetLatch() calls (I don't > have an answer for that). I read a recent mail from Thomas saying something like "should we consider the case where latch sets are missed?". The latch is triggered by SIGURG or SetEvent(). I'm not sure, but I believe the former is now reliable and the latter was born reliable. > Or do we want to stick to what the below comment says? > > /* > * There shouldn't be anything for the archiver to do except to wait for a > * signal ... however, the archiver exists to protect our data, so she > * wakes up occasionally to allow herself to be proactive. > */ So I think this comment is still valid. If we want to eliminate useless wakeups, the archiver needs to remember whether the last iteration fully completed or not. But it seems a race condition is involved. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
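The trade-off described above — rely on the latch, but keep a timed wakeup as a safety net after an incomplete cycle — could look something like this sketch. The helper is invented for illustration; the real archiver loop in pgarch.c is structured differently.

```c
#include <stdbool.h>

#define PGARCH_AUTOWAKE_INTERVAL 60  /* seconds, matching pgarch.c */

/*
 * Hypothetical policy for the next WaitLatch() timeout: if the previous
 * archiving cycle finished cleanly, sleep indefinitely (-1) and rely on
 * the latch being set; after a failed cycle (e.g. disk full), wake up on
 * a timer to retry on our own.
 */
static int
archiver_wait_timeout(bool last_cycle_complete)
{
    return last_cycle_complete ? -1 : PGARCH_AUTOWAKE_INTERVAL;
}
```

As the mail notes, tracking whether the last iteration completed introduces a possible race with concurrent wakeups, which is why the unconditional periodic wakeup remains the conservative choice.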
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Tue, 6 Dec 2022 11:08:43 -0800, Andres Freund wrote in > Hi, > > The tests fail on cfbot: > https://cirrus-ci.com/task/4533866329800704 > > They only seem to fail on 32bit linux. > > https://api.cirrus-ci.com/v1/artifact/task/4533866329800704/testrun/build-32/testrun/subscription/032_apply_delay/log/regress_log_032_apply_delay > [06:27:10.628](0.138s) ok 2 - check if the new rows were applied to subscriber > timed out waiting for match: (?^:logical replication apply delay) at > /tmp/cirrus-ci-build/src/test/subscription/t/032_apply_delay.pl line 124. It also fails for me on 64-bit Linux (Rocky 8.7). > t/032_apply_delay.pl ... Dubious, test returned 29 (wstat 7424, > 0x1d00) > No subtests run .. > t/032_apply_delay.pl (Wstat: 7424 Tests: 0 Failed: 0) > Non-zero exit status: 29 > Parse errors: No plan found in TAP output regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Transaction timeout
At Mon, 5 Dec 2022 17:10:50 -0800, Andres Freund wrote in > I'm most concerned about the overhead when the timeouts are *not* > enabled. And this adds a branch to start_xact_command() and a function > call for get_timeout_active(TRANSACTION_TIMEOUT) in that case. On its > own, that's not a whole lot, but it does add up. There's 10+ function > calls for timeout and ps_display purposes for every single statement. That path seems to exist just for robustness. I inserted "Assert(0)" just before the disable_timeout(), but make check-world didn't fail [1]. Couldn't we get rid of that path and add an assertion instead? I'm not sure about other timeouts yet, though. On the disabling side, we cannot rely on StatementTimeout. [1] # 032_apply_delay.pl fails for me, so I don't know whether any of the later # tests fail. > But it's definitely also worth optimizing the timeout enabled paths. And > you're right, it looks like there's a fair bit of optimization > potential. Thanks. I'll work on that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
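One way to reduce the per-statement cost Andres describes would be to cache the "timeout active" answer when the GUC changes, so the hot path reads a plain boolean instead of calling into the timeout machinery. A hypothetical sketch with invented names, not the actual patch:

```c
#include <stdbool.h>

/*
 * Hypothetical cache: updated only when the GUC changes, so the hot path
 * in start_xact_command() avoids a get_timeout_active()-style call per
 * statement.
 */
static int  transaction_timeout = 0;        /* milliseconds; 0 = disabled */
static bool transaction_timeout_active = false;

static void
assign_transaction_timeout(int newval)      /* simplified GUC assign hook */
{
    transaction_timeout = newval;
    transaction_timeout_active = (newval > 0);
}

static bool
need_transaction_timeout(void)              /* cheap per-statement check */
{
    return transaction_timeout_active;
}
```

The cost is one extra boolean to keep in sync with the GUC, which is exactly the kind of invariant an assertion (rather than a defensive disable path) could guard.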