Re: "out of relcache_callback_list slots" after multiple calls to pg_logical_slot_get_binary_changes

2023-02-20 Thread Kyotaro Horiguchi
Good catch!

At Sun, 19 Feb 2023 02:40:31 +, "shiy.f...@fujitsu.com" 
 wrote in 
> init_rel_sync_cache()), but they are not unregistered when it shuts down. So,
> after multiple calls to the function, MAX_RELCACHE_CALLBACKS is exceeded. This
> is mentioned in the following comment.
> 
>   /*
>* We can get here if the plugin was used in SQL interface as the
>* RelSchemaSyncCache is destroyed when the decoding finishes, but there
>* is no way to unregister the relcache invalidation callback.
>*/
>   if (RelationSyncCache == NULL)
>   return;
> 
> Could we fix it by adding two new function to unregister relcache callback and
> syscache callback? I tried to do so in the attached patch.

I'm pretty sure that every time an output plugin is initialized in a
process, it installs the same set of syscache/relcache callbacks.  Do
you think we could simply prevent duplicate registration of those
callbacks by using a static boolean?  It would be far simpler.
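
Just to illustrate what I mean, a rough sketch (the function and
callback names follow pgoutput from memory and are untested):

    static void
    init_rel_sync_cache(MemoryContext cachectx)
    {
        static bool callbacks_registered = false;

        /* (hash table creation as today, omitted) */

        /* Register invalidation callbacks only once per process lifetime. */
        if (!callbacks_registered)
        {
            CacheRegisterRelcacheCallback(rel_sync_cache_relation_cb,
                                          (Datum) 0);
            CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
                                          rel_sync_cache_publication_cb,
                                          (Datum) 0);
            callbacks_registered = true;
        }
    }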

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add WAL read stats to pg_stat_wal

2023-02-19 Thread Kyotaro Horiguchi
At Thu, 16 Feb 2023 11:11:38 -0800, Andres Freund  wrote in 
> I wonder if we should keep the checkpointer around for longer. If we have
> checkpointer signal postmaster after it wrote the shutdown checkpoint,
> postmaster could signal walsenders to shut down, and checkpointer could do
> some final work, like writing out the stats.
> I suspect this could be useful for other things as well. It's awkward that we
> don't have a place to put "just before shutting down" type tasks. And
> checkpointer seems well suited for that.

I totally agree that it will be useful, but I'm not quite sure how
checkpointer would be able to let postmaster know about that state
without requiring access to shared memory.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Refactor calculations to use instr_time

2023-02-19 Thread Kyotaro Horiguchi
At Fri, 17 Feb 2023 13:53:36 +0300, Nazir Bilal Yavuz  
wrote in 
> Thanks for the review. I updated the patch.


WalUsageAccumDiff(, , );
-   PendingWalStats.wal_records = diff.wal_records;
-   PendingWalStats.wal_fpi = diff.wal_fpi;
-   PendingWalStats.wal_bytes = diff.wal_bytes;
...
+   WALSTAT_ACC(wal_records, diff);
+   WALSTAT_ACC(wal_fpi, diff);
+   WALSTAT_ACC(wal_bytes, diff);
+   WALSTAT_ACC(wal_buffers_full, PendingWalStats);


The lifetime of the variable "diff" seems to be longer now. Wouldn't
it be clearer if we renamed it to something more meaningful, like
wal_usage_diff, WalUsageDiff or PendingWalUsage?  Along those same
lines, it occurs to me that the new struct should be named
PgStat_PendingWalStats, instead of ..Usage. That change makes the name
of the type and the variable consistent.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-16 Thread Kyotaro Horiguchi
At Thu, 16 Feb 2023 06:20:23 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> Dear Horiguchi-san,
> 
> Thank you for responding! Before modifying patches, I want to confirm 
> something
> you said.
> 
> > As Amit-K mentioned, we may need to change the name of the option in
> > this version, since the delay mechanism in this version causes a delay
> > in sending from the publisher rather than delaying apply on the subscriber side.
> 
> Right, will be changed.
> 
> > I'm not sure why the output plugin is involved in the delay mechanism. It
> > appears to me that it would be simpler if the delay occurred in the
> > reorder buffer or logical decoder instead.
> 
> I'm planning to change, but..

Yeah, I don't think we've made up our minds about which way to go yet,
so it's a bit too early to work on that.

> > If I understand correctly, we could delay right before sending only
> > commit records in this case. If we delay at the publisher end, all
> > changes will be sent at once if !streaming, and otherwise, all changes
> > in a transaction will be spooled at the subscriber end. In any case,
> > the apply worker won't be holding an active transaction unnecessarily.
> 
> What about parallel case? Latest patch does not reject the combination of 
> parallel
> streaming mode and delay. If delay is done at commit and subscriber uses an 
> parallel
> apply worker, it may acquire lock for a long time.

I didn't look too closely, but my guess is that transactions are
conveyed in spool files in parallel mode, with each file storing a
complete transaction.

> > Of
> > course we need to add a mechanism to process keep-alive and status
> > report messages.
> 
> Could you share the good way to handle keep-alive and status messages if you 
> have?
> If we changed to the decoding layer, it is strange to call walsender function
> directly.

I'm sorry, but I don't have a concrete idea at the moment. When I read
through the last patch, I missed that WalSndDelay is actually a subset
of WalSndLoop. Although it can handle keep-alives correctly, I'm not
sure we can accept that structure..

> > Those setups work fine when no
> > apply-delay is involved, but they won't work with the patches we're
> > talking about because the subscriber won't respond to the keep-alive
> > packets from the peer.  So when I wrote "practically works" in the
> > last mail, this is what I meant.
> 
> I'm not sure around the part. I think in the latest patch, subscriber can 
> respond
> to the keepalive packets from the peer. Also, publisher can respond to the 
> peer.
> Could you please tell me if you know a case that publisher or subscriber 
> cannot
> respond to the opposite side? Note that if we apply the publisher-side patch, 
> we
> don't have to apply subscriber-side patch.

Sorry about that again, I missed that part in the last patch as
mentioned earlier..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-15 Thread Kyotaro Horiguchi
At Wed, 15 Feb 2023 11:29:18 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> Dear Andres and other hackers,
> 
> > OTOH, if we want to implement the functionality on publisher-side,
> > I think we need to first consider the interface.
> > We can think of two options (a) Have it as a subscription parameter as the 
> > patch
> > has now and
> > then pass it as an option to the publisher which it will use to delay;
> > (b) Have it defined on publisher-side, say via GUC or some other way.
> > The basic idea could be that while processing commit record (in 
> > DecodeCommit),
> > we can somehow check the value of delay and then use it there to delay 
> > sending
> > the xact.
> > Also, during delay, we need to somehow send the keepalive and process 
> > replies,
> > probably via a new callback or by some existing callback.
> > We also need to handle in-progress and 2PC xacts in a similar way.
> > For the former, probably we would need to apply the delay before sending 
> > the first
> > stream.
> > Could you please share what you feel on this direction as well ?
> 
> I implemented a patch that the delaying is done on the publisher side. In 
> this patch,
> approach (a) was chosen, in which min_apply_delay is specified as a 
> subscription
> parameter, and then apply worker passes it to the publisher as an output 
> plugin option.

As Amit-K mentioned, we may need to change the name of the option in
this version, since the delay mechanism in this version causes a delay
in sending from the publisher rather than delaying apply on the
subscriber side.

I'm not sure why the output plugin is involved in the delay mechanism.
It appears to me that it would be simpler if the delay occurred in the
reorder buffer or logical decoder instead. If I understand correctly,
we could delay right before sending only commit records in this case.
If we delay at the publisher end, all changes will be sent at once if
!streaming, and otherwise, all changes in a transaction will be spooled
at the subscriber end. In any case, the apply worker won't be holding
an active transaction unnecessarily.  Of course we need to add a
mechanism to process keep-alive and status report messages.

> During the delay, the walsender periodically checks and processes replies 
> from the
> apply worker and sends keepalive messages if needed. Therefore, the ability 
> to handle
> keepalives is not loosed.

My understanding is that keep-alive is a different mechanism with a
different objective from status reports. Even if the subscriber doesn't
send spontaneous or extra status reports at all, the connection can be
checked and maintained by keep-alive packets. It is possible to set up
an asymmetric configuration where only the walsender sends keep-alives,
but none are sent from the peer. Those setups work fine when no
apply-delay is involved, but they won't work with the patches we're
talking about because the subscriber won't respond to the keep-alive
packets from the peer.  So when I wrote "practically works" in the
last mail, this is what I meant.

Thus if someone plans to enable apply_delay for logical replication,
that person should be aware of some additional subtle restrictions
compared to non-delayed setups.

> To delay the transaction in the output plugin layer, the new 
> LogicalOutputPlugin
> API was added. For now, I choose the output plugin layer but can consider to 
> do
> it from the core if there is a better way.
> 
> Could you please share your opinion?
> 
> Note: thanks for Osumi-san to help implementing.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-02-14 Thread Kyotaro Horiguchi
At Tue, 14 Feb 2023 22:35:01 -0800, Maciek Sakrejda  
wrote in 
> On Tue, Feb 14, 2023 at 11:08 AM Andres Freund  wrote:
> > One thing I started to wonder about since is whether we should remove the 
> > io_
> > prefix from io_object, io_context. The prefixes make sense on the C level, 
> > but
> > it's not clear to me that that's also the case on the table level.
> 
> Yeah, +1. It's hard to argue that there would be any confusion,
> considering `io_` is in the name of the view.

We usually add such prefixes to the columns of system views and
catalogs, but it seems that's not the case for the stats views. Thus
+1 from me, too.

> (Unless, I suppose, some other, non-I/O, "some_object" or
> "some_context" column were to be introduced to this view in the
> future. But that doesn't seem likely?)

I don't think that can happen. As for cross-view ambiguity, that is
already present: many columns in the stats views share the same names
with some other views.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Exit walsender before confirming remote flush in logical replication

2023-02-14 Thread Kyotaro Horiguchi
At Mon, 13 Feb 2023 17:13:43 -0800, Andres Freund  wrote in 
> On 2023-02-14 10:05:40 +0900, Kyotaro Horiguchi wrote:
> > What do you think about the need for explicitly specifying the
> > default?  I'm fine with specifying the default using a single word,
> > such as WAIT_FOR_REMOTE_FLUSH.
> 
> We obviously shouldn't force the option to be present. Why would we want to
> break existing clients unnecessarily?  Without it the behaviour should be
> unchanged from today's.

I didn't suggest making the option mandatory. I just suggested
providing a way to specify the default value explicitly, like in the
recent commit 746915c686.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




KeepLogSeg needs some fixes on behavior

2023-02-14 Thread Kyotaro Horiguchi
This thread is derived from [1]; it discusses some subtle behaviors of
KeepLogSeg.

1: 
https://www.postgresql.org/message-id/20230213194131.hgzs6ropcvhda...@awork3.anarazel.de

At Mon, 13 Feb 2023 11:41:31 -0800, Andres Freund  wrote
> Hi,
> 
> On 2023-02-13 15:45:49 +0900, Kyotaro Horiguchi wrote:
> > This seems to have a thin connection to the issue, but.
> 
> I was worried that the changes could lead us to removing WAL without
> max_slot_wal_keep_size set.
> 
> 
> > > It seems decidedly not great to not log at least a debug1 (but probably it
> > > should be LOG) message when KeepLogSeg() decides to limit based on
> > > max_slot_wal_keep_size.
> > 
> > It's easy to do that, but that log is highly accompanied by a LOG line
> > "terminating process %d to release replication slot \"%s\"". I don't
> > mind adding it if it is a DEBUGx.
> 
> My problem with that is that we might *NOT* see those log messages for some
> reason, but that that's impossible to debug as-is. And even if we see them,
> it's not that easy to figure out by how much we were over
> max_slot_wal_keep_size, because we always report it in the context of a
> specific slot.

Since 551aa6b7b9, InvalidatePossiblyObsoleteSlot() emits the following
detail message in that case for both "terminating" and "invalidating"
messages.

errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
  LSN_FORMAT_ARGS(restart_lsn),
  (unsigned long long) (oldestLSN - restart_lsn))

Where oldestLSN is the cutoff LSN by KeepLogSeg().

> Removing WAL that's still needed is a *very* severe operation. Emitting an
> additional line in case it happens isn't a problem.

Totally agreed about the severity. The message above doesn't
explicitly say the source of the cutoff LSN but the only possible
source is max_slot_wal_keep_size. I think that DEBUG1 is appropriate
for the message from KeepLogSeg(), especially given how often we see
it.

> > > It feels wrong to subtract max_slot_wal_keep_size from recptr - that's 
> > > the end
> > > of the checkpoint record. Given that we, leaving max_slot_wal_keep_size 
> > > aside,
> > > only actually remove WAL if older than the segment that RedoRecPtr (the
> > > logical start of the checkpoint) is in. If the checkpoint is large we'll 
> > > end
> > > up removing replication slots even though they potentially would only have
> > > retained one additional WAL segment.
> > 
> > I think that it is a controversial part, but that variable is defined
> > the similar way to wal_keep_size.  And I think that all max_wal_size,
> > wal_keep_size and max_slot_wal_keep_size being defined with the same
> > base LSN makes things simpler for users (also for developers).
> > Regardless of checkpoint length, if slots get frequently invalidated,
> > the setting should be considered to be too small for the system
> > requirements.
> 
> I think it's bad that we define wal_keep_size, max_slot_wal_keep_size that
> way. I don't think bringing max_wal_size into this is useful, as it influences
> different things.

If I remember correctly, when wal_keep_segments was switched to
wal_keep_size, the first-cut patch translated the latter to the former
by rounding up, but that was rejected and we ended up with the formula
we have now.

Speaking of max_slot_wal_keep_size, I think it depends on how we
interpret the variable. If we see it as the minimum amount to ensure,
then we should round it up. But if we see it as the maximum amount
that can't be exceeded, then we would round it down like we do
now. However, I also think that the "max" prefix could imply something
about the upper limit.
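
To put rough numbers on the difference (my own example, not from the
thread): with wal_segment_size = 1GB and max_slot_wal_keep_size =
1900MB, rounding down keeps only one segment (1GB) before slots can be
invalidated, while rounding up would keep two segments (2GB).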

> > > Isn't it problematic to use ConvertToXSegs() to implement
> > > max_slot_wal_keep_size, given that it rounds *down*? Particularly for a 
> > > large
> > > wal_segment_size that'd afaict lead to being much more aggressive 
> > > invalidating
> > > slots.
> > 
> > I think max_slot_wal_keep_size is, like max_wal_size for checkpoints,
> > a safeguard for slots not to fill-up WAL directory. Thus they both are
> > rounded down. If you have 1GB WAL directory and set wal_segment_size
> > to 4192MB, I don't see it a sane setup. But if segment size is smaller
> > than one hundredth of max_wal_size, that difference won't matter for
> > anyone.  But anyway, it's a pain in the a.. that the both of them (and
> > wal_keep_size) don't work in a immediate manner, though..
> 
> It doesn't matter a lot for 16MB segments, but with 1GB segments it's a
> different story.
> 
> To me the way it's done now is a bug, one that can in extr

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-14 Thread Kyotaro Horiguchi
At Tue, 14 Feb 2023 16:55:25 -0800, Andres Freund  wrote in 
> Hi,
> 
> On 2023-02-14 18:00:00 +0530, Bharath Rupireddy wrote:
> > Done, PSA v2 patch.
> 
> This feels way too complicated to me.  How about something more like the
> attached?

I like this one, but the parameters offset and size are in a different
order from pwrite(fd, buf, count, offset).  I prefer the arrangement
suggested by Bharath. And isn't it better to use Min(remaining_size,
BLCKSZ) instead of a bare if statement?
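
For illustration, roughly what I mean (remaining_size here is just a
placeholder for the leftover byte count in that loop):

    iov[0].iov_len = Min(remaining_size, BLCKSZ);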

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-02-14 Thread Kyotaro Horiguchi
At Tue, 14 Feb 2023 15:43:26 +0100, "Drouvot, Bertrand" 
 wrote in 
> Oh right, my bad (the issue has been introduced in V2).
> Fixed in V4.

Great!

> > I thought that we might be able to return entry_ref->pending since the
> > callers don't call pfree on the returned pointer, but it is not great
> > that we don't inform the callers if the returned memory can be safely
> > pfreed or not.
> > Thus what I have in mind is the following.
> > 
> >>if (!entry_ref)
> >> +  {
> >>entry_ref = 
> >> pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION,
> >>InvalidOid, rel_id);
> >> +  if (!entry_ref)
> >> + return NULL;
> >> +  }
> 
> LGTM, done that way in V4.

That part looks good to me, thanks!

I was going through v4 and it seems to me that the comment for
find_tabstat_entry may not be quite right.

> * find any existing PgStat_TableStatus entry for rel
> *
> * Find any existing PgStat_TableStatus entry for rel_id in the current
> * database. If not found, try finding from shared tables.
> *
> * If no entry found, return NULL, don't create a new one

The comment assumes that the function directly returns an entry from
shared memory, but now it copies the entry's contents into palloc'ed
memory and adds the sums of some counters for the current transaction
into it. Do you think we should update the comment to reflect this
change?
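
Something along these lines, perhaps (just a rough draft):

 * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
 *
 * Find any existing PgStat_TableStatus entry for rel_id in the current
 * database. If not found, try finding from shared tables. The returned
 * entry is a palloc'ed copy whose counters also include the pending
 * amounts of live subtransactions.
 *
 * If no entry found, return NULL, don't create a new one.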

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-02-13 Thread Kyotaro Horiguchi
Thanks for the new version.

At Mon, 13 Feb 2023 09:58:52 +0100, "Drouvot, Bertrand" 
 wrote in 
> Hi,
> 
> On 2/13/23 8:40 AM, Kyotaro Horiguchi wrote:
> > At Mon, 13 Feb 2023 08:09:50 +0100, "Drouvot, Bertrand"
> >  wrote in
> >>> I think this is useful beyond being able to generate those functions
> >>> with
> >>> macros. The fact that we had to deal with transactional code in
> >>> pgstatfuncs.c
> >>> meant that a lot of the relevant itnernals had to be exposed "outside"
> >>> pgstat,
> >>> which seems architecturally not great.
> >>>
> >> Right, good point.
> > Agreed.
> > 
> >> Removing the pfrees in V2 attached.
> > Ah, that sound good.
> > if (!entry_ref)
> > +   {
> > entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION,
> > InvalidOid, rel_id);
> > +   return tablestatus;
> > +   }
> > We should return something if the call returns a non-null value?
> 
> What we do is: if entry_ref is NULL then we return NULL (so that the
> caller returns 0).
> 
> If entry_ref is not NULL then we return a copy of entry_ref->pending
> (with or without subtrans).

Isn't it ignoring the second call to pgstat_fetch_pending_entry?

What the code did is: if entry_ref is NULL for MyDatabaseId then we
*retry* fetching a global (not database-specific) entry. If a global
entry is found, return it (more precisely, entry_ref->pending) to the
caller.

The current patch returns NULL when a global entry is found.

I thought that we might be able to return entry_ref->pending since the
callers don't call pfree on the returned pointer, but it is not great
that we don't inform the callers if the returned memory can be safely
pfreed or not.

Thus what I have in mind is the following.

>   if (!entry_ref)
> + {
>   entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION,
>   InvalidOid, rel_id);
> + if (!entry_ref)
> + return NULL;
> + }



> > So, since we want to hide the internals from pgstatfuncs, the
> > additional flag should be gone. 
> 
> I think there is pros and cons for both but I don't have a strong
> opinion about that.
> 
> So also proposing V3 attached without the flag in case this is the
> preferred approach.

That part looks good to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-13 Thread Kyotaro Horiguchi
At Mon, 13 Feb 2023 15:51:25 +0530, Amit Kapila  wrote 
in 
> I think we can introduce a new variable as last_feedback_time in the
> LogicalRepWorker structure and probably for the last_received, we can
> use last_lsn in MyLogicalRepWorker as that seems to be updated correctly.
> I think it would be good to avoid global variables.

MyLogicalRepWorker is a global variable:p, but it is far better than a
bare one.

By the way, we are trying to send the status messages regularly, but
as Andres pointed out, the worker neither reads nor replies to keepalive
messages from the publisher while delaying. That is not possible as
long as we choke the stream at the subscriber end. It doesn't seem to
be a practical problem, but IMHO I think he's right in terms of
adherence to the wire protocol, which was also one of my own initial
concerns.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl

2023-02-13 Thread Kyotaro Horiguchi
At Mon, 13 Feb 2023 12:18:07 +0900, Michael Paquier  wrote 
in 
> On Mon, Feb 13, 2023 at 11:27:58AM +0900, Kyotaro Horiguchi wrote:
> > I think currently the output by --describe-config can be used only for
> > consulting while editing a (possibly broken) config file.  Thus I
> > think it's no use showing GUC_DISALLOW_IN_FILE items there unless we
> > use help_config() for an on-session use.
> > 
> > On the other hand, don't we need to remove the condition
> > GUC_NOT_IN_SAMPLE from displayStruct? I think that help_config()
> > should show a value if it is marked as !GUC_DISALLOW_IN_FILE even if
> > it is GUC_NOT_IN_SAMPLE.  I'm not sure whether there's any variable
> > that is marked that way, though.
> 
> As in marked with GUC_NOT_IN_SAMPLE but not GUC_DISALLOW_IN_FILE?
> There are quite a lot, developer GUCs being one (say
> ignore_invalid_pages).  We don't want to list them in the sample file
> so as common users don't play with them, still they make sense if
> listed in a file.

Ah, right.  I think I faintly had them in my mind.

> If you add a check meaning that GUC_DISALLOW_IN_FILE implies
> GUC_NOT_IN_SAMPLE, where one change would need to be applied to
> config_file as all the other GUC_DISALLOW_IN_FILE GUCs already do
> that, you could remove GUC_DISALLOW_IN_FILE.  However,
> GUC_NOT_IN_SAMPLE should be around to not expose options, we don't
> want common users to know too much about.

Okay, I thought that "postgres --help-config" was a sort of developer
option, but your explanation above makes sense.

> The question about how much people rely on --describe-config these
> days is a good one, so perhaps there could be an argument in removing

Yeah, that's the reason I thought it was a developer option...

> GUC_NOT_IN_SAMPLE from the set.  TBH, I would be really surprised that
> anybody able to use a developer option writes an configuration file in
> an incorrect format and needs to use this option, though :)

Hmm.  I didn't directly link GUC_NOT_IN_SAMPLE to being a developer
option. But on second thought, it seems that it is. So, the current
code looks good to me now. Thanks for the explanation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Exit walsender before confirming remote flush in logical replication

2023-02-13 Thread Kyotaro Horiguchi
At Mon, 13 Feb 2023 08:27:01 +0530, Amit Kapila  wrote 
in 
> On Mon, Feb 13, 2023 at 7:26 AM Kyotaro Horiguchi
>  wrote:
> >
> > IMHO I vaguely don't like that we lose a means to specify the default
> > behavior here. And I'm not sure we definitely don't need anything other
> > than flush and immediate for both physical and logical replication.
> >
> 
> If we can think of any use case that requires its extension then it
> makes sense to make it a non-boolean option but otherwise, let's keep
> things simple by having a boolean option.

What do you think about the need for explicitly specifying the
default?  I'm fine with specifying the default using a single word,
such as WAIT_FOR_REMOTE_FLUSH.

> > If it's
> > not the case, I don't object to making it a Boolean.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-13 Thread Kyotaro Horiguchi
At Mon, 13 Feb 2023 10:15:03 +0530, Bharath Rupireddy 
 wrote in 
> On Sun, Feb 12, 2023 at 11:01 PM Andres Freund  wrote:
> >
> > On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote:
> > >   /* Prepare to write out a lot of copies of our zero buffer at once. 
> > > */
> > >   for (i = 0; i < lengthof(iov); ++i)
> > >   {
> > > - iov[i].iov_base = zbuffer.data;
> > > + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, 
> > > )->data);
> > >   iov[i].iov_len = zbuffer_sz;
> > >   }
> >
> > Another thing: I think we should either avoid iterating over all the IOVs if
> > we don't need them, or, even better, initialize the array as a constant, 
> > once.

FWIW, I tried to use the "{[start .. end] = {}}" trick (a GNU extension?
[*1]) for constant array initialization, but individual members wouldn't
accept having a const value assigned, so I deconstified as follows.

>   static const struct iovec   iov[PG_IOV_MAX] =
>   {[0 ... PG_IOV_MAX - 1] =
>{
>.iov_base = (void *),
>.iov_len = BLCKSZ
>}
>   };

I didn't check the actual mapping, but when I tried the assignment
"iov[0].iov_base = NULL", it failed with "assignment of member
‘iov_base’ in read-only object", so it seems to be successfully placed
in a read-only segment?

Later code assigns iov[0].iov_len, so we need to provide a separate
non-const iov variable, or can we use pwrite instead there?  (I didn't
find pg_pwrite_with_retry(), though.)

> How about like the attached patch? It makes the iovec static variable
> and points the zero buffer only once/for the first time to iovec. This
> avoids for-loop on every call.

As for the patch itself, it seems to forget to reset iov[0].iov_len
after writing a partial block.
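
To illustrate what I mean (the variable names are only my guesses at
the patch's context, not the exact code):

    /* shrink the first vector to cover the trailing partial block ... */
    iov[0].iov_len = remaining_size;
    written = pg_pwritev_with_retry(fd, iov, 1, offset);

    /* ... and restore it afterwards, since the static iov is reused */
    iov[0].iov_len = zbuffer_sz;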


regards.


*1: https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Designated-Inits.html

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-02-12 Thread Kyotaro Horiguchi
At Mon, 13 Feb 2023 08:09:50 +0100, "Drouvot, Bertrand" 
 wrote in 
>> I think this is useful beyond being able to generate those functions
>> with
>> macros. The fact that we had to deal with transactional code in
>> pgstatfuncs.c
>> meant that a lot of the relevant itnernals had to be exposed "outside"
>> pgstat,
>> which seems architecturally not great.
>> 
> Right, good point.

Agreed.

> Removing the pfrees in V2 attached.

Ah, that sound good.

if (!entry_ref)
+   {
entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, 
InvalidOid, rel_id);
+   return tablestatus;
+   }

We should return something if the call returns a non-null value?

So, since we want to hide the internals from pgstatfuncs, the
additional flag should be gone. If the additional cost doesn't bother
anyone, I don't mind removing the flag.  The patch will get far
simpler that way.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl

2023-02-12 Thread Kyotaro Horiguchi
At Fri, 10 Feb 2023 16:43:15 +0900, Michael Paquier  wrote 
in 
> On Wed, Feb 08, 2023 at 09:42:13PM -0500, Tom Lane wrote:
> > Hm.  On the one hand, if it is in fact not in postgresql.conf.sample,
> > then that flag should be set for sure.  OTOH I see that that flag
> > isn't purely documentation: help_config.c thinks it should hide
> > GUCs that are marked that way.  Do we really want that behavior?
> > Not sure.  I can see an argument that you might want --describe-config
> > to tell you that, but there are a lot of other GUC_NOT_IN_SAMPLE
> > GUCs that maybe do indeed deserve to be left out.
> 
> I am not sure to follow.  help_config() won't show something that's
> either marked NO_SHOW_ALL, NOT_IN_SAMPLE or DISALLOW_IN_FILE, hence
> config_file does not show up already in what postgres
> --describe-config prints, because it has DISALLOW_IN_FILE, so adding
> NOT_IN_SAMPLE changes nothing.

I think currently the output by --describe-config can be used only for
consulting while editing a (possibly broken) config file.  Thus I
think it's no use showing GUC_DISALLOW_IN_FILE items there unless we
use help_config() for an on-session use.

On the other hand, don't we need to remove the condition
GUC_NOT_IN_SAMPLE from displayStruct? I think that help_config()
should show a value if it is marked as !GUC_DISALLOW_IN_FILE even if
it is GUC_NOT_IN_SAMPLE.  I'm not sure whether there's any variable
that is marked that way, though.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Exit walsender before confirming remote flush in logical replication

2023-02-12 Thread Kyotaro Horiguchi
At Fri, 10 Feb 2023 12:40:43 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> Dear Amit,
> 
> > Can't we have this option just as a bool (like shutdown_immediate)?
> > Why do we want to keep multiple modes?
> 
> Of course we can use boolean instead, but current style is motivated by the 
> post[1].
> This allows to add another option in future, whereas I do not have idea now.
> 
> I want to ask other reviewers which one is better...
> 
> [1]: 
> https://www.postgresql.org/message-id/20230208.112717.1140830361804418505.horikyota.ntt%40gmail.com

IMHO I vaguely don't like that we lose a means to specify the default
behavior here. And I'm not sure we definitely don't need anything other
than flush and immediate for both physical and logical replication. If
it's not the case, I don't object to making it a Boolean.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-12 Thread Kyotaro Horiguchi
At Fri, 10 Feb 2023 10:34:49 +0530, Amit Kapila  wrote 
in 
> On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila  wrote:
> >
> > On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi
> >  wrote:
> > > We have suffered this kind of feedback silence many times. Thus I
> > > don't want to rely on luck here. I had in mind exposing last_send
> > > itself or providing an interval-calculation function to the logic.
> >
> > I think we have last_send time in send_feedback(), so we can expose it
> > if we want but how would that solve the problem you are worried about?

The walreceiver can avoid a too-long sleep by knowing when to wake up
for the next feedback.

> I have an idea to use last_send time to avoid walsenders being
> timeout. Instead of directly using wal_receiver_status_interval as a
> minimum interval to send the feedback, we should check if it is
> greater than last_send time then we should send the feedback after
> (wal_receiver_status_interval - last_send). I think they can probably
> be different only on the very first time. Any better ideas?

If it means MyLogicalRepWorker->last_send_time, that is not the last
time the walreceiver sent feedback but the last time the wal*sender*
sent data. So I'm not sure that works.

We could use the variable that way, but AFAIU, when so many changes
have been spooled that control doesn't return to LogicalRepApplyLoop
for longer than wal_r_s_interval, maybe_apply_delay() starts calling
send_feedback() on every call after the first feedback timing.  Even in
that case, send_feedback() won't actually send anything until the next
feedback timing; I don't think that behavior is great.

The only packets the walreceiver sends back are feedback packets, and
currently only send_feedback() knows the last feedback time.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_usleep for multisecond delays

2023-02-09 Thread Kyotaro Horiguchi
At Thu, 9 Feb 2023 14:51:14 -0800, Nathan Bossart  
wrote in 
> On Thu, Feb 09, 2023 at 01:30:27PM -0800, Andres Freund wrote:
> > So I'm not sure it's the right direction to make pg_usleep() easier to 
> > use...
> Hm...  We might be able to use WaitLatch() for some of these.

And I think we are actually going to reduce or eliminate the use of
pg_sleep().

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-02-09 Thread Kyotaro Horiguchi
At Thu, 9 Feb 2023 11:38:18 +0100, "Drouvot, Bertrand" 
 wrote in 
> Hi hackers,
> 
> Please find attached a patch proposal for $SUBJECT.
> 
> The idea has been raised in [1] by Andres: it would allow to simplify
> even more the work done to
> generate pg_stat_get_xact*() functions with Macros.
> 
> Indeed, with the reconciliation done in find_tabstat_entry() then all
> the pg_stat_get_xact*() functions
> (making use of find_tabstat_entry()) now "look the same" (should they
> take into account live subtransactions or not).
> 
> Looking forward to your feedback,

I like that direction.

Shouldn't we rename PgStat_FunctionCounts to PgStat_FuncStatus, to
unify it with the neighboring functions?

Why does find_tabstat_entry() copy the whole pending data and perform
subtransaction summarization? The summarization is needed by only a few
callers, but now that cost is imposed on all callers, along with
additional palloc()/pfree() calls. That doesn't seem reasonable.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Is psSocketPoll doing the right thing?

2023-02-09 Thread Kyotaro Horiguchi
> Subject: is p*s*Socket..

Oops...

At Thu, 9 Feb 2023 17:32:08 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2023/02/09 11:50, Kyotaro Horiguchi wrote:
> > Hello.
> > While looking a patch, I found that pqSocketPoll passes through the
> > result from poll(2) to the caller and throws away revents. If I
> > understand it correctly, poll() *doesn't* return -1 nor errno by the
> > reason it has set POLLERR, POLLHUP, POLLNVAL, and POLLRDHUP for some
> > of the target sockets, and returns 0 unless poll() itself failed to
> > work.
> 
> As far as I understand correctly, poll() returns >0 if "revents"
> has either of those bits, not 0 nor -1.

Right, that matches my understanding.

If any of the sockets is in any of those states, pqSocketPoll returns a
positive value, which makes pqSocketCheck return 1. Finally
pqRead/WriteReady return "ready" even though the connection socket is
in an error state.  Actually that behavior does no harm since the
subsequent actual read/write will "properly" fail. However, once we
use this function simply to check that the socket is sound, without
doing an actual read/write, that behavior becomes harmful because of
the false answer.

> You're thinking that pqSocketPoll() should check "revents" and
> return -1 if either of those bits is set?

In short, yes.

pqSocketPoll() should somehow inform callers about that
state. Fortunately pqSocketPoll is a private function, so we can
refactor it so that it does that properly.
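
A rough sketch of the kind of check I have in mind (untested, and the
exact way of reporting the failure is only an illustration):

    ret = poll(&input_fd, 1, timeout_ms);

    /* report a broken socket to the caller instead of "ready" */
    if (ret > 0 && (input_fd.revents & (POLLERR | POLLHUP | POLLNVAL)))
        return -1;

    return ret;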

If no one objects to that change, I'll work on it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-09 Thread Kyotaro Horiguchi
Mmm. A part of the previous mail has gone missing for some unknown
reason and been replaced by mysterious blank lines...

At Fri, 10 Feb 2023 09:57:22 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila  
> wrote in 
> > On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Wed, 8 Feb 2023 09:03:03 +, "Hayato Kuroda (Fujitsu)" 
> > >  wrote in
> > > > Thank you for reviewing! PSA new version.
> > >
> > > +   if (statusinterval_ms > 0 && diffms > statusinterval_ms)
> > >
> > > The next expected feedback time is measured from the last status
> > > report.  Thus, it seems to me this may suppress feedbacks from being
> > > sent for an unexpectedly long time especially when min_apply_delay is
> > > shorter than wal_r_s_interval.
> > >
> > 
> > I think the minimum time before we send any feedback during the wait
> > is wal_r_s_interval. Now, I think if there is no transaction for a
> > long time before we get a new transaction, there should be keep-alive
> > messages in between which would allow us to send feedback at regular
> > intervals (wal_receiver_status_interval). So, I think we should be
> 
> Right.
> 
> > able to send feedback in less than 2 * wal_receiver_status_interval
> > unless wal_sender/receiver timeout is very large and there is a very
> > low volume of transactions. Now, we can try to send the feedback
> 
> We have suffered this kind of feedback silence many times. Thus I
> don't want to rely on luck here. I had in mind exposing last_send
> itself or providing an interval-calculation function to the logic.
> 
> > before we start waiting or maybe after every
> > wal_receiver_status_interval / 2 but I think that will lead to more
> > spurious feedback messages than we get the benefit from them.

Agreed. I think we don't want to do that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-09 Thread Kyotaro Horiguchi
At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila  wrote 
in 
> On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Wed, 8 Feb 2023 09:03:03 +, "Hayato Kuroda (Fujitsu)" 
> >  wrote in
> > > Thank you for reviewing! PSA new version.
> >
> > +   if (statusinterval_ms > 0 && diffms > statusinterval_ms)
> >
> > The next expected feedback time is measured from the last status
> > report.  Thus, it seems to me this may suppress feedbacks from being
> > sent for an unexpectedly long time especially when min_apply_delay is
> > shorter than wal_r_s_interval.
> >
> 
> I think the minimum time before we send any feedback during the wait
> is wal_r_s_interval. Now, I think if there is no transaction for a
> long time before we get a new transaction, there should be keep-alive
> messages in between which would allow us to send feedback at regular
> intervals (wal_receiver_status_interval). So, I think we should be

Right.

> able to send feedback in less than 2 * wal_receiver_status_interval
> unless wal_sender/receiver timeout is very large and there is a very
> low volume of transactions. Now, we can try to send the feedback

We have suffered this kind of feedback silence many times. Thus I
don't want to rely on luck here. I had in mind exposing last_send
itself or providing an interval-calculation function to the logic.

> before we start waiting or maybe after every
> wal_receiver_status_interval / 2 but I think that will lead to more
> spurious feedback messages than we get the benefit from them.



regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-09 Thread Kyotaro Horiguchi
At Thu, 9 Feb 2023 13:26:19 +0530, Amit Kapila  wrote 
in 
> On Thu, Feb 9, 2023 at 12:17 AM Peter Smith 
 wrote:
> > I understand in theory, your code is more efficient, but in practice,
> > I think the overhead of a single variable assignment every loop
> > iteration (which is doing WaitLatch anyway) is of insignificant
> > concern, whereas having one assignment is simpler than having two IMO.
> >
> 
> Yeah, that sounds better to me as well.

FWIW, I'm on board with this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-08 Thread Kyotaro Horiguchi
At Wed, 8 Feb 2023 09:03:03 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> Thank you for reviewing! PSA new version.

+   if (statusinterval_ms > 0 && diffms > statusinterval_ms)

The next expected feedback time is measured from the last status
report.  Thus, it seems to me this may suppress feedbacks from being
sent for an unexpectedly long time especially when min_apply_delay is
shorter than wal_r_s_interval.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Is psSocketPoll doing the right thing?

2023-02-08 Thread Kyotaro Horiguchi
Hello.

While looking at a patch, I found that pqSocketPoll passes the result
from poll(2) through to the caller and throws away revents. If I
understand it correctly, poll() *doesn't* return -1 or set errno just
because it has set POLLERR, POLLHUP, POLLNVAL, or POLLRDHUP for some
of the target sockets; it returns 0 unless poll() itself failed to
work.

It doesn't seem to be the intended behavior, since the function sets
POLLERR in pollfd.events (though that bit is ignored by poll() anyway).

Is the above diagnosis correct?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Proposal] Add foreign-server health checks infrastructure

2023-02-08 Thread Kyotaro Horiguchi
At Fri, 27 Jan 2023 06:57:01 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> I found cfbot failure, PSA fixed version.

+   Unlike , this function checks socket
+   health. This check is performed by polling the socket. This function is
+   currently available only on systems that support the non-standard
+   POLLRDHUP extension to the poll system

I find it quite confusing that we have pqSocketCheck and PQconnCheck,
which do almost the same thing.. Since pqSocketCheck is a static
function, we can modify it as we like.

I still don't understand why we need pqconnCheck_internal separate
from pqSocketPoll(), and PQconnCheck from pqSocketCheck.

https://www.postgresql.org/message-id/flat/TYAPR01MB58665BF23D38EDF10028DE2AF5299%40TYAPR01MB5866.jpnprd01.prod.outlook.com#47d21431bf9fa94f763c824f6e81fa54
> IIUC, pqSocketCheck () calls pqSocketPoll(),
> and in the pqSocketPoll() we poll()'d the POLLIN or POLLOUT event.
> But according to [1], we must wait POLLRDHUP event,
> so we cannot reuse it straightforward.

Yeah, I didn't suggest using the function as-is. Couldn't we extend
the function by letting it accept end_time = 0 && !forRead &&
!forWrite, without causing side effects?
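
Roughly what I have in mind (untested; the POLLRDHUP part applies only
where that non-standard extension is available):

    input_fd.fd = sock;
    input_fd.events = POLLERR;
    input_fd.revents = 0;

    if (forRead)
        input_fd.events |= POLLIN;
    if (forWrite)
        input_fd.events |= POLLOUT;
#ifdef POLLRDHUP
    /* pure health check: end_time == 0 && !forRead && !forWrite */
    if (!forRead && !forWrite)
        input_fd.events |= POLLRDHUP;
#endif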

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-02-07 Thread Kyotaro Horiguchi
At Tue, 7 Feb 2023 22:38:14 -0800, Andres Freund  wrote in 
> Hi,
> 
> I did another read through the series. I do have some minor changes, but
> they're minor. I think this is ready for commit. I plan to start pushing
> tomorrow.
> 
> The changes I made are:
> - the tablespace test changes didn't quite work in isolation / needed a bit of
>   polishing
> - moved the tablespace changes to later in the series
> - split the tests out of the commit adding the view into its own commit
> - minor code formatting things (e.g. didn't like nested for()s without {})


> On 2023-01-25 16:56:17 +0900, Kyotaro Horiguchi wrote:
> > At Tue, 24 Jan 2023 14:35:12 -0800, Andres Freund  
> > wrote in
> > > > +   write_chunk_s(fpout, );
> > > > +   if (!read_chunk_s(fpin, >io.stats))
> > > >
> > > > The names of the functions hardly make sense alone to me. How about
> > > > write_struct()/read_struct()?  (I personally prefer to use
> > > > write_chunk() directly..)
> > >
> > > That's not related to this patch - there's several existing callers for
> > > it. And write_struct wouldn't be better imo, because it's not just for
> > > structs.
> >
> > Hmm.  Then what the "_s" stands for?
> 
> Size. It's a macro that just forwards to read_chunk()/write_chunk().

I know what the macros do.  But I'm fine with the names as they are,
since they have been there since before this patch. Sorry for the noise.

> > > > > +Number of read operations in units of 
> > > > > op_bytes.
> > > >
> > > > I may be the only one who see the name as umbiguous between "total
> > > > number of handled bytes" and "bytes hadled at an operation". Can't it
> > > > be op_blocksize or just block_size?
> > > >
> > > > +   b.io_object,
> > > > +   b.io_context,
> > >
> > > No, block wouldn't be helpful - we'd like to use this for something that 
> > > isn't
> > > uniform blocks.
> >
> > What does the field show in that case?  The mean of operation size? Or
> > one row per opration size?  If the former, the name looks somewhat
> > wrong. If the latter, block_size seems making sense.
> 
> 1, so that it's clear that the rest are in bytes.

Thanks. Okay, I guess the documentation will be changed as necessary.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Exit walsender before confirming remote flush in logical replication

2023-02-07 Thread Kyotaro Horiguchi
I agree with the direction, and thanks for the patch.

At Tue, 7 Feb 2023 17:08:54 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> > I noticed that previous ones are rejected by cfbot, even if they passed on 
> > my
> > environment...
> > PSA fixed version.
> 
> While analyzing more, I found the further bug that forgets initialization.
> PSA new version that could be passed automated tests on my github repository.
> Sorry for noise.

0002:

This patch doesn't seem to offer a means to change the default
walsender behavior.  We need a subscription option named something like
"walsender_exit_mode" to do that.


+ConsumeWalsenderOptions(List *options, WalSndData *data)

I wonder if it is the right design to put options for different things
into a single list. I would rather embed the walsender option in the
syntax than require this function.

K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline opt_shutdown_mode

K_START_REPLICATION K_SLOTIDENT K_LOGICAL RECPTR opt_shutdown_mode 
plugin_options

where opt_shutdown_mode would be like "SHUTDOWN_MODE immediate".


==
If we go with the current design, I think it is better to share the
option list rule between the logical and physical START_REPLICATION
commands.

I'm not sure I like the option syntax
"exit_before_confirming=". I imagine that other options may
come in the future. Thus, how about "walsender_shutdown_mode=",
where the mode is one of "wait_flush" (default) and "immediate"?


+typedef struct
+{
+   boolexit_before_confirming;
+} WalSndData;

"Data" doesn't seem to describe what the variable holds. Why not WalSndOptions?


-   !equal(newsub->publications, MySubscription->publications))
+   !equal(newsub->publications, MySubscription->publications) ||
+   (newsub->minapplydelay > 0 && MySubscription->minapplydelay == 
0) ||
+   (newsub->minapplydelay == 0 && MySubscription->minapplydelay > 
0))

 I slightly prefer the following expression (Others may disagree:p):

  ((newsub->minapplydelay == 0) != (MySubscription->minapplydelay == 0))

 And I think we need a comment for the term. For example,

  /* minapplydelay affects START_REPLICATION option exit_before_confirming */


+ * Reads all entrly of the list and consume if needed.
s/entrly/entries/ ?
...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Make mesage at end-of-recovery less scary.

2023-02-06 Thread Kyotaro Horiguchi
Thanks!

At Fri, 3 Feb 2023 15:16:02 +0100, Alvaro Herrera  
wrote in 
> So this patch is now failing because it applies new tests to
> 011_crash_recovery.pl, which was removed recently.  Can you please move
> them elsewhere?

I couldn't find an appropriate file to move them to. In the end I
created a new file named 034_recovery.pl.  I added a test for standbys,
too (which is the first objective of this patch).

> I think the comment for ValidXLogRecordLength should explain what the
> return value is.

Agreed.


/*
  * Validate record length of an XLOG record header.
  *
  * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
  * needs this separate from the function in case of a partial record header.
+ *
+ * Returns true if the xl_tot_len header field has a seemingly valid value,
+ * which means the caller can proceed reading to the following part of the
+ * record.
  */
 static bool
 ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,

I added a similar description to ValidXLogRecordHeader.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From c58ca4d5db52c75dec9882d158d5724e12617005 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 30 Nov 2022 11:51:46 +0900
Subject: [PATCH v24] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.  To make
sure that the detection is correct, this patch checks if all trailing
bytes in the same page are zeroed in that case.
---
 src/backend/access/transam/xlogreader.c   | 144 +-
 src/backend/access/transam/xlogrecovery.c |  94 ++
 src/backend/replication/walreceiver.c |   7 +-
 src/bin/pg_waldump/pg_waldump.c   |  13 +-
 src/include/access/xlogreader.h   |   1 +
 src/test/recovery/t/034_recovery.pl   | 135 
 6 files changed, 335 insertions(+), 59 deletions(-)
 create mode 100644 src/test/recovery/t/034_recovery.pl

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index aa6c929477..8cb2d55333 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 			 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
   XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -640,25 +644,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Even though we use an XLogRecord pointer here, the whole record header
+	 * might not fit on this page.  If the whole record header is on this page,
+	 * validate it immediately.  Even otherwise xl_tot_len must be on this page
+	 * (it is the first field of MAXALIGNed records), but we still cannot
+	 * access any further fields until we've verified that we got the whole
+	 * header, so do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check 

Re: Where is the logig to create a table file?

2023-02-06 Thread Kyotaro Horiguchi
At Fri, 3 Feb 2023 13:44:46 +0400, Pavel Borisov  wrote 
in 
> Hi, Jack
> 
> On Fri, 3 Feb 2023 at 13:19, jack...@gmail.com  wrote:
> >
> > When I use 'create table t(a int);'; suppose that this table t's oid is 
> > 1200,
> > then postgres will create a file named 1200 in the $PGDATA/base, So where
> > is the logic code in the internal?
> >
> heapam_relation_set_new_filenode()->RelationCreateStorage()

Or, if you are searching for the logic that determines the file name,
see GetNewRelFileNumber().

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2023-02-06 Thread Kyotaro Horiguchi
Thank you for the comment!

At Fri, 3 Feb 2023 08:42:52 +0100, Heikki Linnakangas  wrote 
in 
> I want to call out this part of this patch:
> 
> > Also this allows for the cleanup of files left behind in the crash of
> > the transaction that created it.
> 
> This is interesting to a lot wider audience than ALTER TABLE SET
> LOGGED/UNLOGGED. It also adds most of the complexity, with the new
> marker files. Can you please split the first patch into two:
> 
> 1. Cleanup of newly created relations on crash
> 
> 2. ALTER TABLE SET LOGGED/UNLOGGED changes
> 
> Then we can review the first part independently.

Ah, indeed.  I'll do that.

> Regarding the first part, I'm not sure the marker files are the best
> approach to implement it. You need to create an extra file for every
> relation, just to delete it at commit. It feels a bit silly, but maybe

Agreed. (But I didn't come up with a better idea..)

> it's OK in practice. The undo log patch set solved this problem with
> the undo log, but it looks like that patch set isn't going
> anywhere. Maybe invent a very lightweight version of the undo log for
> this?

I hadn't thought along that line. Yes, indeed the marker files are a
kind of undo log.

Anyway, I'll split the current patch into two parts as suggested.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-06 Thread Kyotaro Horiguchi
Thanks!

At Mon, 6 Feb 2023 13:10:01 +, "Takamichi Osumi (Fujitsu)" 
 wrote in 
> The attached patch v29 has included your changes.

catalogs.sgml

+  
+   The minimum delay (ms) for applying changes.
+  

I think we don't use unit symbols that way. Namely I think we would
write it as "The minimum delay for applying changes in milliseconds"


alter_subscription.sgml

   are slot_name,
   synchronous_commit,
   binary, streaming,
-  disable_on_error, and
-  origin.
+  disable_on_error,
+  origin, and
+  min_apply_delay.
  

By the way, is there any rule for the order of the words? They don't
seem to be in alphabetical order nor in the same order as on the
create_subscription page.  (It seems like they are in the order of the
SUBOPT_* symbols, but I'm not sure that's a good idea..)


subscriptioncmds.c

+   if (opts.streaming == 
LOGICALREP_STREAM_PARALLEL &&
+   !IsSet(opts.specified_opts, 
SUBOPT_MIN_APPLY_DELAY) && sub->minapplydelay > 0)
..
+   if (opts.min_apply_delay > 0 &&
+   !IsSet(opts.specified_opts, 
SUBOPT_STREAMING) && sub->stream == LOGICALREP_STREAM_PARALLEL)

Shouldn't we wrap these lines?
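
For example, the first quoted condition could be wrapped as:

    if (opts.streaming == LOGICALREP_STREAM_PARALLEL &&
        !IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) &&
        sub->minapplydelay > 0)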


worker.c

+   if (wal_receiver_status_interval > 0 &&
+   diffms > wal_receiver_status_interval * 1000L)
+   {
+   WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | 
WL_EXIT_ON_PM_DEATH,
+ wal_receiver_status_interval * 1000L,
+ WAIT_EVENT_RECOVERY_APPLY_DELAY);
+   send_feedback(last_received, true, false, true);
+   }
+   else
+   WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | 
WL_EXIT_ON_PM_DEATH,
+ diffms,
+ WAIT_EVENT_RECOVERY_APPLY_DELAY);

send_feedback always handles the case where
wal_receiver_status_interval == 0, so we can simply wait for
min(wal_receiver_status_interval, diffms) and then call send_feedback()
unconditionally.
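
Roughly like this, reusing the names from the quoted hunk (untested):

    long    waitms = diffms;

    if (wal_receiver_status_interval > 0)
        waitms = Min(waitms, wal_receiver_status_interval * 1000L);

    WaitLatch(MyLatch,
              WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
              waitms,
              WAIT_EVENT_RECOVERY_APPLY_DELAY);
    send_feedback(last_received, true, false, true);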


-start_apply(XLogRecPtr origin_startpos)
+start_apply(void)

-LogicalRepApplyLoop(XLogRecPtr last_received)
+LogicalRepApplyLoop(void)

Does this patch require this change?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-06 Thread Kyotaro Horiguchi
At Tue, 7 Feb 2023 09:10:01 +0530, Amit Kapila  wrote 
in 
> On Tue, Feb 7, 2023 at 6:03 AM Peter Smith  wrote:
> > 5b.
> > Since there are no translator considerations here why not write the
> > second error like:
> >
> > errmsg("%d ms is outside the valid range for parameter
> > \"min_apply_delay\" (%d .. %d)",
> > result, 0, PG_INT32_MAX))
> >
> 
> I see that existing usage in the code matches what the patch had
> before this comment. See below and similar usages in the code.
> if (start <= 0)
> ereport(ERROR,
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> errmsg("invalid value for parameter \"%s\": %d",
> "start", start)));

The same errmsg text occurs many times in the tree. On the other hand
the message pointed out appears only once.  I suppose Peter considered
this aspect.

# "%d%s%s is outside the valid range for parameter \"%s\" (%d .. %d)"
# also appears just once

As for me, it seems like good practice to do that regardless of the
number of existing duplicates, to (semi)mechanically avoid adding more.

(But I believe I would do as Peter suggests by myself for the first
cut, though:p)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Exit walsender before confirming remote flush in logical replication

2023-02-01 Thread Kyotaro Horiguchi
At Wed, 1 Feb 2023 14:58:14 +0530, Amit Kapila  wrote 
in 
> On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada  wrote:
> >
> > Otherwise, we will end up terminating
> > the WAL stream without the done message. Which will lead to an error
> > message "ERROR:  could not receive data from WAL stream: server closed
> > the connection unexpectedly" on the subscriber even at a clean
> > shutdown.
> >
> 
> But will that be a problem? As per docs of shutdown [1] ( “Smart” mode
> disallows new connections, then waits for all existing clients to
> disconnect. If the server is in hot standby, recovery and streaming
> replication will be terminated once all clients have disconnected.),
> there is no such guarantee. I see that it is required for the
> switchover in physical replication to ensure that all the WAL is sent
> and replicated but we don't need that for logical replication.

+1

Since the publisher is not aware of the apply delay (with this patch), the
publisher is in fact gone before sending EOS in that case. The error
message correctly describes that situation.

> > In a case where pq_is_send_pending() doesn't become false
> > for a long time, (e.g., the network socket buffer got full due to the
> > apply worker waiting on a lock), I think users should unblock it by
> > themselves. Or it might be practically better to shutdown the
> > subscriber first in the logical replication case, unlike the physical
> > replication case.
> >
> 
> Yeah, will users like such a dependency? And what will they gain by doing so?

If PostgreSQL required that kind of special care at shutdown, in the
face of trouble, just to keep replication consistent, that would not be
acceptable. In this respect, the current time-delayed logical replication
can be seen as a kind of intentional, continuous, large network lag.
And I think consistency is guaranteed even in such cases.

On the other hand, I don't think many people care about the exact
progress when facing such troubles, as long as replication consistency
is maintained. IMHO that is also true for the time-delayed logical
replication case.

> [1] - https://www.postgresql.org/docs/devel/app-pg-ctl.html

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-01 Thread Kyotaro Horiguchi
At Wed, 1 Feb 2023 08:38:11 +0530, Amit Kapila  wrote 
in 
> On Wed, Feb 1, 2023 at 8:13 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Tue, 31 Jan 2023 15:12:14 +0530, Amit Kapila  
> > wrote in
> > > So, shall we check if the result of parse_int is in the range 0 and
> > > PG_INT32_MAX to ameliorate this concern?
> >
> > Yeah, it is exactly what I wanted to suggest.
> >
> > > If this works then we need to
> > > probably change the return value of defGetMinApplyDelay() to int32.
> >
> > I didn't thought doing that, int can store all values in the valid
> > range (I'm assuming we implicitly assume int >= int32 in bit width)
> > and it is the natural integer in C.  Either will do for me but I
> > slightly prefer to use int there.
> >
> 
> I think it would be clear to use int32 because the parameter where we
> store the return value is also int32.

I'm fine with that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-31 Thread Kyotaro Horiguchi
At Tue, 31 Jan 2023 15:12:14 +0530, Amit Kapila  wrote 
in 
> On Tue, Jan 31, 2023 at 1:40 PM Kyotaro Horiguchi
>  wrote:
> >
> > Hi, Kuroda-san, Thanks for the detailed study.
> >
> > At Tue, 31 Jan 2023 07:06:40 +, "Hayato Kuroda (Fujitsu)" 
> >  wrote in
> > > Therefore, I think we can say that modern platforms that are supported by 
> > > PostgreSQL define int as 32-bit.
> > > It satisfies the condition sizeof(int) <= sizeof(int32), so we can keep 
> > > to use INT_MAX.
> >
> > Yeah, I know that that's practically correct.  Just I wanted to make
> > clear is whether we (always) assume int == int32. I don't want to do
> > that just because that works. Even though we cannot be perfect, in
> > this particular case the destination space is explicitly made as
> > int32.
> >
> 
> So, shall we check if the result of parse_int is in the range 0 and
> PG_INT32_MAX to ameliorate this concern?

Yeah, it is exactly what I wanted to suggest.

> If this works then we need to
> probably change the return value of defGetMinApplyDelay() to int32.

I hadn't thought of doing that; int can store all values in the valid
range (I'm assuming we implicitly assume int >= int32 in bit width)
and it is the natural integer type in C.  Either will do for me but I
slightly prefer to use int there.

As a result, I'd like to propose the following change.


diff --git a/src/backend/commands/subscriptioncmds.c 
b/src/backend/commands/subscriptioncmds.c
index 489eae85ee..9de2745623 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -2293,16 +2293,16 @@ defGetMinApplyDelay(DefElem *def)
 hintmsg ? errhint("%s", _(hintmsg)) : 0));
 
/*
-* Check lower bound. parse_int() has already been confirmed that result
-* is less than or equal to INT_MAX.
+* Check the both boundary. Although parse_int() checked the result 
against
+* INT_MAX, this value is to be stored in a catalog column of int32.
 */
-   if (result < 0)
+   if (result < 0 || result > PG_INT32_MAX)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 errmsg("%d ms is outside the valid range for 
parameter \"%s\" (%d .. %d)",
result,
"min_apply_delay",
-   0, INT_MAX)));
+   0, PG_INT32_MAX)));
 
return result;
 }


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-31 Thread Kyotaro Horiguchi
Hi, Kuroda-san, Thanks for the detailed study.

At Tue, 31 Jan 2023 07:06:40 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> Therefore, I think we can say that modern platforms that are supported by 
> PostgreSQL define int as 32-bit.
> It satisfies the condition sizeof(int) <= sizeof(int32), so we can keep to 
> use INT_MAX.

Yeah, I know that that's practically correct.  What I wanted to make
clear is whether we (always) assume int == int32. I don't want to do
that just because it happens to work. Even though we cannot be perfect,
in this particular case the destination is explicitly declared as
int32.

It's a similar discussion to the one for the recent commit 3b4ac33254. We
chose to use the "correct" symbols, refusing to rely on an implicit
assumption about the actual values.  (In that sense, it is a compromise
to take int32 being no wider than int as a premise, but the code would
get uselessly complex without that assumption:p)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-30 Thread Kyotaro Horiguchi
At Mon, 30 Jan 2023 14:24:31 +0530, Amit Kapila  wrote 
in 
> On Mon, Jan 30, 2023 at 12:38 PM Kyotaro Horiguchi
>  wrote:
> >
> > At Mon, 30 Jan 2023 11:56:33 +0530, Amit Kapila  
> > wrote in
> > > #define TimestampTzPlusMilliseconds(tz,ms) ((tz) + ((ms) * (int64) 1000))
> >
> > Which can lead to overflow, which is practically harmless.
> >
> 
> But here tz is always TimestampTz (which is int64), so do, we need to worry?

Sorry, I was assuming that int is int64 here.

> > > the function defGetMinApplyDelay() sufficient to ensure that the
> > > 'delay' value stored in the catalog will always be lesser than
> > > INT_MAX?
> >
> > I'm concerned about cases where INT_MAX is wider than int32.  If we
> > don't assume such cases, I'm fine with INT_MAX there.
> >
> 
> I am not aware of such cases. Anyway, if any such case is discovered
> then we need to change the checks in defGetMinApplyDelay(), right? If
> so, then I think it is better to keep it as it is unless we know that
> this could be an issue on some platform.

I'm not sure. I think int is generally taken to be an integer type of
no particular fixed size.  min_apply_delay is tightly bound to a catalog
column of int32, thus I thought that (PG_)INT32_MAX is the right limit.
So, as I expressed before, if we assume sizeof(int) <= sizeof(int32),
I'm fine with using INT_MAX there.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-29 Thread Kyotaro Horiguchi
At Mon, 30 Jan 2023 11:56:33 +0530, Amit Kapila  wrote 
in 
> On Mon, Jan 30, 2023 at 9:43 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Mon, 30 Jan 2023 08:51:05 +0530, Amit Kapila  
> > wrote in
> > > On Mon, Jan 30, 2023 at 8:32 AM Kyotaro Horiguchi
> > >  wrote:
> > > >
> > > > At Sat, 28 Jan 2023 04:28:29 +, "Takamichi Osumi (Fujitsu)" 
> > > >  wrote in
> > > > > On Friday, January 27, 2023 8:00 PM Amit Kapila 
> > > > >  wrote:
> > > > > > So, you have changed min_apply_delay from int64 to int32, but you 
> > > > > > haven't
> > > > > > mentioned the reason for the same? We use 'int' for the similar 
> > > > > > parameter
> > > > > > recovery_min_apply_delay, so, ideally, it makes sense but still 
> > > > > > better to tell your
> > > > > > reason explicitly.
> > > > > Yes. It's because I thought I need to make this feature consistent 
> > > > > with the recovery_min_apply_delay.
> > > > > This feature handles the range same as the recovery_min_apply delay 
> > > > > from 0 to INT_MAX now
> > > > > so should be adjusted to match it.
> > > >
> > > > INT_MAX can stick out of int32 on some platforms. (I'm not sure where
> > > > that actually happens, though.) We can use PG_INT32_MAX instead.
> > > >
> > >
> > > But in other integer GUCs including recovery_min_apply_delay, we use
> > > INT_MAX, so not sure if it is a good idea to do something different
> > > here.
> >
> > The GUC is not stored in a catalog, but.. oh... it is multiplied by
> > 1000.
> 
> Which part of the patch you are referring to here? Isn't the check in

Where recovery_min_apply_delay is used. It is allowed to be set up to
INT_MAX but it is used as:

>   delayUntil = TimestampTzPlusMilliseconds(xtime, 
> recovery_min_apply_delay);

Where the macro is defined as:

> #define TimestampTzPlusMilliseconds(tz,ms) ((tz) + ((ms) * (int64) 1000))

Which can lead to overflow, which is practically harmless.

> the function defGetMinApplyDelay() sufficient to ensure that the
> 'delay' value stored in the catalog will always be lesser than
> INT_MAX?

I'm concerned about cases where int (and hence INT_MAX) is wider than
int32.  If we don't assume such cases exist, I'm fine with INT_MAX there.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-29 Thread Kyotaro Horiguchi
At Mon, 30 Jan 2023 08:51:05 +0530, Amit Kapila  wrote 
in 
> On Mon, Jan 30, 2023 at 8:32 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Sat, 28 Jan 2023 04:28:29 +, "Takamichi Osumi (Fujitsu)" 
> >  wrote in
> > > On Friday, January 27, 2023 8:00 PM Amit Kapila  
> > > wrote:
> > > > So, you have changed min_apply_delay from int64 to int32, but you 
> > > > haven't
> > > > mentioned the reason for the same? We use 'int' for the similar 
> > > > parameter
> > > > recovery_min_apply_delay, so, ideally, it makes sense but still better 
> > > > to tell your
> > > > reason explicitly.
> > > Yes. It's because I thought I need to make this feature consistent with 
> > > the recovery_min_apply_delay.
> > > This feature handles the range same as the recovery_min_apply delay from 
> > > 0 to INT_MAX now
> > > so should be adjusted to match it.
> >
> > INT_MAX can stick out of int32 on some platforms. (I'm not sure where
> > that actually happens, though.) We can use PG_INT32_MAX instead.
> >
> 
> But in other integer GUCs including recovery_min_apply_delay, we use
> INT_MAX, so not sure if it is a good idea to do something different
> here.

The GUC is not stored in a catalog, but.. oh... it is multiplied by
1000. So if it is larger than (INT_MAX / 1000), it overflows...  If we
officially accept that behavior (which I don't think is great), even if
only for impractical values, I don't object further.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-29 Thread Kyotaro Horiguchi
At Sat, 28 Jan 2023 04:28:29 +, "Takamichi Osumi (Fujitsu)" 
 wrote in 
> On Friday, January 27, 2023 8:00 PM Amit Kapila  
> wrote:
> > So, you have changed min_apply_delay from int64 to int32, but you haven't
> > mentioned the reason for the same? We use 'int' for the similar parameter
> > recovery_min_apply_delay, so, ideally, it makes sense but still better to 
> > tell your
> > reason explicitly.
> Yes. It's because I thought I need to make this feature consistent with the 
> recovery_min_apply_delay.
> This feature handles the range same as the recovery_min_apply delay from 0 to 
> INT_MAX now
> so should be adjusted to match it.

INT_MAX can stick out of int32 on some platforms. (I'm not sure where
that actually happens, though.) We can use PG_INT32_MAX instead.

IMHO, we don't use int as a catalog column type, and I agree that
int32 is sufficient since I don't think a delay of more than 49 days is
practical. On the other hand, maybe I wouldn't want to use int32 for
intermediate calculations.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-25 Thread Kyotaro Horiguchi
At Wed, 25 Jan 2023 12:30:19 +0530, Amit Kapila  wrote 
in 
> On Wed, Jan 25, 2023 at 11:57 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Tue, 24 Jan 2023 12:19:04 +, "Takamichi Osumi (Fujitsu)" 
> >  wrote in
> > > Attached the patch v20 that has incorporated all comments so far.
> >
> ...
> >
> >
> > +  in which case no additional wait is necessary. If the system 
> > clocks
> > +  on publisher and subscriber are not synchronized, this may lead 
> > to
> > +  apply changes earlier than expected, but this is not a major 
> > issue
> > +  because this parameter is typically much larger than the time
> > +  deviations between servers. Note that if this parameter is set 
> > to a
> >
> > This doesn't seem to fit our documentation. It is not our business
> > whether a certain amount deviation is critical or not. How about
> > somethig like the following?
> >
> 
> But we have a similar description for 'recovery_min_apply_delay' [1].
> See "...If the system clocks on primary and standby are not
> synchronized, this may lead to recovery applying records earlier than
> expected; but that is not a major issue because useful settings of
> this parameter are much larger than typical time deviations between
> servers."

Mmmm. I thought that we might be able to gather the descriptions
(including other common descriptions, if any) in one place, but I didn't
find an appropriate place..

Okay. I agree with the current description. Thanks for the kind
explanation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-01-24 Thread Kyotaro Horiguchi
At Tue, 24 Jan 2023 14:35:12 -0800, Andres Freund  wrote in 
> > 0002:
> > 
> > +   Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));
> > 
> > This is relatively complex checking. We already asserts-out increments
> > of invalid counters. Thus this is checking if some unrelated codes
> > clobbered them, which we do only when consistency is critical. Is
> > there any needs to do that here?  I saw another occurance of the same
> > assertion.
> 
> I found it useful to find problems.

Okay.

> > +   no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
> > +   bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
> > +   bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;
> > 
> > I'm not sure I like to omit parentheses for such a long Boolean
> > expression on the right side.
> 
> What parens would help?

I thought about the following.

no_temp_rel =
   (bktype == B_AUTOVAC_LAUNCHER ||
bktype == B_BG_WRITER ||
bktype == B_CHECKPOINTER ||
bktype == B_AUTOVAC_WORKER ||
bktype == B_STANDALONE_BACKEND ||
bktype == B_STARTUP);


> > +   write_chunk_s(fpout, );
> > +   if (!read_chunk_s(fpin, >io.stats))
> > 
> > The names of the functions hardly make sense alone to me. How about
> > write_struct()/read_struct()?  (I personally prefer to use
> > write_chunk() directly..)
> 
> That's not related to this patch - there's several existing callers for
> it. And write_struct wouldn't be better imo, because it's not just for
> structs.

Hmm.  Then what does the "_s" stand for?

> > + PgStat_BktypeIO
> > 
> > This patch abbreviates "backend" as "bk" but "be" is used in many
> > places. I think that naming should follow the predecessors.
> 
> The precedence aren't consistent unfortunately :)

Uuuum.  Okay, it's just that I like "be" there!  Anyway, I don't
strongly push for that.

> > > +Number of read operations in units of 
> > > op_bytes.
> > 
> > I may be the only one who see the name as umbiguous between "total
> > number of handled bytes" and "bytes hadled at an operation". Can't it
> > be op_blocksize or just block_size?
> > 
> > +   b.io_object,
> > +   b.io_context,
> 
> No, block wouldn't be helpful - we'd like to use this for something that isn't
> uniform blocks.

What does the field show in that case?  The mean operation size? Or
one row per operation size?  If the former, the name looks somewhat
wrong. If the latter, block_size seems to make sense.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: wake up logical workers after ALTER SUBSCRIPTION

2023-01-24 Thread Kyotaro Horiguchi
At Tue, 24 Jan 2023 10:42:17 -0800, Nathan Bossart  
wrote in 
> On Tue, Jan 24, 2023 at 01:13:55PM -0500, Tom Lane wrote:
> > Either that comment needs to be rewritten or we need to invent some
> > more macros.
> 
> Here is a first attempt at a patch.  I scanned through all the existing
> uses of InvalidDsaPointer and DSM_HANDLE_INVALID and didn't notice anything
> else that needed adjusting.

There seem to be two occurrences of DSA_HANDLE_INVALID, though, in
dsa_get_handle and dsa_attach_in_place; one of them is in an Assert().

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-24 Thread Kyotaro Horiguchi
At Tue, 24 Jan 2023 12:19:04 +, "Takamichi Osumi (Fujitsu)" 
 wrote in 
> Attached the patch v20 that has incorporated all comments so far.

Thanks! I looked through the documentation part.


+  
+   subminapplydelay int8
+  
+  
+   Total time spent delaying the application of changes, in milliseconds.
+  

I was confused because it reads as if this column shows the accumulated
actual waiting time caused by min_apply_delay.  IIUC it actually shows
the min_apply_delay setting for the subscription. Thus shouldn't it be
something like this?

"The minimum amount of time to delay applying changes, in milliseconds"

And it might be better to mention the corresponding subscription parameter.


+   error. If wal_receiver_status_interval is set to
+   zero, the apply worker doesn't send any feedback messages during the
+   min_apply_delay period.

It took me a while to understand what this sentence means.  I'd
like to suggest something like the following.

"Since no status-update messages are sent while delaying, note that
wal_receiver_status_interval is the only source of keepalive messages
during that period."

+  
+   A logical replication subscription can delay the application of changes by
+   specifying the min_apply_delay subscription parameter.
+   See  for details.
+  

I'm not sure "logical replication subscription" is a common term.
Doesn't just "subscription" mean the same, especially in that context?
(Note that 31.2 starts with "A subscription is the downstream..").


+  Any delay occurs only on WAL records for transaction begins after all
+  initial table synchronization has finished. The delay is calculated

There is no "transaction begin" WAL records.  Maybe it is "logical
replication transaction begin message". The timestamp is of "commit
time".  (I took "transaction begins" as a noun, but that might be
wrong..)


+may reduce the actual wait time. It is also possible that the overhead
+already exceeds the requested min_apply_delay value,
+in which case no additional wait is necessary. If the system clocks

I'm not sure it is right to say "necessary" here.  IMHO it might be
better to say "in which case no delay is applied".


+  in which case no additional wait is necessary. If the system clocks
+  on publisher and subscriber are not synchronized, this may lead to
+  apply changes earlier than expected, but this is not a major issue
+  because this parameter is typically much larger than the time
+  deviations between servers. Note that if this parameter is set to a

This doesn't seem to fit our documentation. It is not our business
whether a certain amount of deviation is critical or not. How about
something like the following?

"Note that the delay is measured between the timestamp assigned by
publisher and the system clock on subscriber.  You need to manage the
system clocks to be in sync so that the delay works properly."

+Delaying the replication can mean there is a much longer time
+between making a change on the publisher, and that change being
+committed on the subscriber. This can impact the performance of
+synchronous replication. See 
+parameter.

Do we need the "can" in "Delaying the replication can mean"?  If we
want to say, it might be "Delaying the replication means there can be
a much longer..."?


+  
+   Create a subscription to a remote server that replicates tables in
+   the mypub publication and starts replicating immediately
+   on commit. Pre-existing data is not copied. The application of changes is
+   delayed by 4 hours.
+
+CREATE SUBSCRIPTION mysub
+ CONNECTION 'host=192.0.2.4 port=5432 user=foo dbname=foodb'
+PUBLICATION mypub
+   WITH (copy_data = false, min_apply_delay = '4h');
+

I'm not sure we need this additional example. We already have two
examples, one of which differs from the above only in the actual values
for the PUBLICATION and WITH clauses.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-24 Thread Kyotaro Horiguchi
Sorry for bothering you with this.

At Tue, 24 Jan 2023 10:12:40 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> > > Couldn't we maintain an additional static variable "last_applied"
> > > along with last_received?
> > >
> > 
> > It won't be easy to maintain the meaning of last_applied because there
> > are cases where we don't apply the change directly. For example, in
> > case of streaming xacts, we will just keep writing it to the file,
> > now, say, due to some reason, we have to send the feedback, then it
> > will not allow you to update the latest write locations. This would
> > then become different then what we are doing without the patch.
> > Another point to think about is that we also need to keep the variable
> > updated for keep-alive ('k') messages even though we don't apply
> > anything in that case. Still, other cases to consider are where we
> > have mix of streaming and non-streaming transactions.
> 
> I have tried to implement that, but it might be difficult because of a corner
> case related with the initial data sync.
> 
> First of all, I have made last_applied to update when
> 
> * transactions are committed, prepared, or aborted
> * apply worker receives keepalive message.

Yeah, I vaguely thought that it would be enough for the update to happen
just before the existing send_feedback() calls. But it turned out to
introduce another bit of unprincipled behavior..

> I thought during the initial data sync, we must not update the last applied
> triggered by keepalive messages, so following lines were added just after
> updating last_received.
> 
> ```
> +   if (last_applied < end_lsn && 
> AllTablesyncsReady())
> +   last_applied = 
> end_lsn;
> ```

Maybe the name "last_applied" confused you. As I mentioned in
another message, the variable points to the remote LSN of the last
"processed" 'w'/'k' message.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-24 Thread Kyotaro Horiguchi
At Tue, 24 Jan 2023 14:22:19 +0530, Amit Kapila  wrote 
in 
> On Tue, Jan 24, 2023 at 12:44 PM Peter Smith  wrote:
> >
> > On Tue, Jan 24, 2023 at 5:58 PM Amit Kapila  wrote:
> > >
> > > On Tue, Jan 24, 2023 at 8:15 AM Kyotaro Horiguchi
> > >  wrote:
> > > >
> > > > > Attached the updated patch v19.
> > > >
> > > > + maybe_delay_apply(TransactionId xid, TimestampTz finish_ts)
> > > >
> > > > I look this spelling strange.  How about maybe_apply_delay()?
> > > >
> > >
> > > +1.
> >
> > It depends on how you read it. I read it like this:
> >
> > maybe_delay_apply === means "maybe delay [the] apply"
> > (which is exactly what the function does)
> >
> > versus
> >
> > maybe_apply_delay === means "maybe [the] apply [needs a] delay"
> > (which is also correct, but it seemed a more awkward way to say it IMO)
> >
> 
> This matches more with GUC and all other usages of variables in the
> patch. So, I still prefer the second one.

I read it as "maybe apply [the] delay [to something suggested by the
context]". If we go the first way, I will name it as
"maybe_delay_apply_change" or something that has an extra word.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-24 Thread Kyotaro Horiguchi
In short, I'd like to propose renaming the parameter in_delayed_apply
of send_feedback to "has_unprocessed_change".

At Tue, 24 Jan 2023 12:27:58 +0530, Amit Kapila  wrote 
in 
> > send_feedback():
> > +* If the subscriber side apply is delayed (because of time-delayed
> > +* replication) then do not tell the publisher that the received 
> > latest
> > +* LSN is already applied and flushed, otherwise, it leads to the
> > +* publisher side making a wrong assumption of logical replication
> > +* progress. Instead, we just send a feedback message to avoid a 
> > publisher
> > +* timeout during the delay.
> >  */
> > -   if (!have_pending_txes)
> > +   if (!have_pending_txes && !in_delayed_apply)
> > flushpos = writepos = recvpos;
> >
> > Honestly I don't like this wart. The reason for this is the function
> > assumes recvpos = applypos but we actually call it while holding
> > unapplied changes, that is, applypos < recvpos.
> >
> > Couldn't we maintain an additional static variable "last_applied"
> > along with last_received?
> >
> 
> It won't be easy to maintain the meaning of last_applied because there
> are cases where we don't apply the change directly. For example, in
> case of streaming xacts, we will just keep writing it to the file,
> now, say, due to some reason, we have to send the feedback, then it
> will not allow you to update the latest write locations. This would
> then become different then what we are doing without the patch.
> Another point to think about is that we also need to keep the variable
> updated for keep-alive ('k') messages even though we don't apply
> anything in that case. Still, other cases to consider are where we
> have mix of streaming and non-streaming transactions.

Yeah.  Even though I named it "last_applied", its objective is to
have get_flush_position return the correct have_pending_txes
without a hint from callers, that is, "let g_f_position know whether
store_flush_position has been called with the last received data".

Anyway I tried that but didn't find a clean and simple way. However,
while working on it, I realized what in the code confused me.

+static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply,
+ bool in_delayed_apply);

The name "in_delayed_apply" doesn't donsn't give me an idea of what
the function should do for it. If it is named "has_unprocessed_change",
I think it makes sense that send_feedback should think there may be an
outstanding transaction that is not known to the function.


So, my conclusion here is I'd like to propose changing the parameter
name to "has_unapplied_change".


> >  In this case the condition cited above
> > would be as follows and in_delayed_apply will become unnecessary.
> >
> > +   if (!have_pending_txes && last_received == last_applied)
> >
> > The function is a static function and always called with a variable
> > last_received that has the same scope with the function, as the first

Sorry for the noise, I misread it. Maybe I took the "function-scoped"
variable as file-scoped.. Thus the discussion above does not hold.

> > parameter. Thus we can remove the first parameter then let the
> > function directly look at the both two varaibles instead.
> >
> 
> I think this is true without this patch, so why that has not been
> followed in the first place? One comment, I see in this regard is as
> below:
> 
> /* It's legal to not pass a recvpos */
> if (recvpos < last_recvpos)
> recvpos = last_recvpos;

Sorry. I don't understand this. It is just part of the ratchet
mechanism for reporting the last received LSN.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-24 Thread Kyotaro Horiguchi
At Tue, 24 Jan 2023 11:45:36 +0530, Amit Kapila  wrote 
in 
> Personally, I would prefer the above LOGs to be in reverse order as it
> doesn't make much sense to me to first say that we are skipping
> changes and then say the transaction is delayed. What do you think?

In the first place, I misunderstood maybe_start_skipping_changes(),
which doesn't actually skip changes. So... sorry for the noise.

For the record, I agree that the current order is right.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-24 Thread Kyotaro Horiguchi
At Tue, 24 Jan 2023 11:28:58 +0530, Amit Kapila  wrote 
in 
> On Tue, Jan 24, 2023 at 6:17 AM Kyotaro Horiguchi
>  wrote:
> >
> > IMHO "foo > bar" is not an "option".  I think we say "foo and bar are
> > mutually exclusive options" but I think don't say "foo = x and bar = y
> > are.. options".  I wrote a comment as "this should be more like
> > human-speaking" and Euler seems having the same feeling for another
> > error message.
> >
> > Concretely I would spell this as "min_apply_delay cannot be enabled
> > when parallel streaming mode is enabled" or something.
> >
> 
> We can change it but the current message seems to be in line with some
> nearby messages like "slot_name = NONE and enabled = true are mutually
> exclusive options". So, isn't it better to keep this as one in sync
> with existing messages?

Ooo. subscriptioncmds.c is full of such messages. Okay I agree that it
is better to leave it as is..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-01-24 Thread Kyotaro Horiguchi
At Tue, 24 Jan 2023 17:22:03 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> +pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
> +{
> + Assert(io_object < IOOBJECT_NUM_TYPES);
> + Assert(io_context < IOCONTEXT_NUM_TYPES);
> + Assert(io_op < IOOP_NUM_TYPES);
> + Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, 
> io_op));
> 
> Is there any reason for not checking the value ranges at the
> bottom-most functions?  They can lead to out-of-bounds access so I

To be clear, "They" means "out-of-range io_object/context/op
values"..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-01-24 Thread Kyotaro Horiguchi
Hello.

At Thu, 19 Jan 2023 21:15:34 -0500, Melanie Plageman 
 wrote in 
> Oh dear-- an extra FlushBuffer() snuck in there somehow.
> Removed it in attached v51.
> Also, I fixed an issue in my tablespace.sql updates

I only looked at 0002 and 0004.
(Sorry for the random order of the comments..)

0002:

+   Assert(pgstat_bktype_io_stats_valid(bktype_shstats, MyBackendType));

This is a relatively complex check. We already assert out increments
of invalid counters. Thus this is checking whether some unrelated code
clobbered them, which we do only when consistency is critical. Is
there any need to do that here?  I saw another occurrence of the same
assertion.


-/* Reset some shared cluster-wide counters */
+/*
+ * Reset some shared cluster-wide counters
+ *
+ * When adding a new reset target, ideally the name should match that in
+ * pgstat_kind_infos, if relevant.
+ */

I'm not sure the addition is useful..


+pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
+{
+   Assert(io_object < IOOBJECT_NUM_TYPES);
+   Assert(io_context < IOCONTEXT_NUM_TYPES);
+   Assert(io_op < IOOP_NUM_TYPES);
+   Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, 
io_op));

Is there any reason for not checking the value ranges at the
bottom-most functions?  They can lead to out-of-bounds access so I
don't think we need to continue execution for such invalid values.

+   no_temp_rel = bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
+   bktype == B_CHECKPOINTER || bktype == B_AUTOVAC_WORKER ||
+   bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP;

I'm not sure I like to omit parentheses for such a long Boolean
expression on the right side.


+   write_chunk_s(fpout, );
+   if (!read_chunk_s(fpin, >io.stats))

The names of the functions hardly make sense alone to me. How about
write_struct()/read_struct()?  (I personally prefer to use
write_chunk() directly..)


+ PgStat_BktypeIO

This patch abbreviates "backend" as "bk" but "be" is used in many
places. I think that naming should follow the predecessors.


0004:

system_views.sql:

+FROM pg_stat_get_io() b;

What does the "b" stand for? (Backend? then "s" or "i" seems
straight-forward.)


+   nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, 
io_context, io_op);
+
+   if (nulls[col_idx])
+   continue;
+
+   values[col_idx] =
+   
Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);

This is a bit hard to read since it requires following the condition
flow. The following is simpler and I think closer to our usual style.

if (pgstat_tracks_io_op(...))
    values[col_idx] =
        Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
else
    nulls[col_idx] = true;


> +Number of read operations in units of op_bytes.

I may be the only one who sees the name as ambiguous between "total
number of handled bytes" and "bytes handled per operation". Can't it
be op_blocksize or just block_size?

+   b.io_object,
+   b.io_context,

It's unclear to me why only these two columns are prefixed with
"io". Wouldn't "object_type" and just "context" work instead?


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-23 Thread Kyotaro Horiguchi
Sorry, I forgot to write one comment.

At Tue, 24 Jan 2023 11:45:35 +0900 (JST), Kyotaro Horiguchi 
 wrote in 

+   /* Should we delay the current transaction? */
+   if (finish_ts)
+   maybe_delay_apply(xid, finish_ts);
+
if (!am_parallel_apply_worker())
maybe_start_skipping_changes(lsn);

It may not give any practical advantage, but isn't it better for the
delay to happen after the skipping?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-23 Thread Kyotaro Horiguchi
> Attached the updated patch v19.

+ maybe_delay_apply(TransactionId xid, TimestampTz finish_ts)

This spelling looks strange to me.  How about maybe_apply_delay()?


send_feedback():
+* If the subscriber side apply is delayed (because of time-delayed
+* replication) then do not tell the publisher that the received latest
+* LSN is already applied and flushed, otherwise, it leads to the
+* publisher side making a wrong assumption of logical replication
+* progress. Instead, we just send a feedback message to avoid a 
publisher
+* timeout during the delay.
 */
-   if (!have_pending_txes)
+   if (!have_pending_txes && !in_delayed_apply)
flushpos = writepos = recvpos;

Honestly I don't like this wart. The reason is that the function
assumes recvpos = applypos, but we actually call it while holding
unapplied changes, that is, applypos < recvpos.

Couldn't we maintain an additional static variable "last_applied"
along with last_received?  In that case the condition cited above
would become as follows, and in_delayed_apply would become unnecessary.

+   if (!have_pending_txes && last_received == last_applied)

The function is a static function and is always called with the variable
last_received, which has the same scope as the function, as the first
parameter. Thus we can remove the first parameter and let the
function look at both variables directly instead.
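
To illustrate the last_applied idea, a minimal hypothetical sketch (not
actual patch code; names are assumptions):

	/* file-local in worker.c, alongside last_received */
	static XLogRecPtr last_applied = InvalidXLogRecPtr;

	/* updated once a received change has actually been applied */
	last_applied = last_received;

	/* then, in send_feedback() */
	if (!have_pending_txes && last_received == last_applied)
		flushpos = writepos = recvpos;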

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Fix GetWALAvailability function code comments for WALAVAIL_REMOVED return value

2023-01-23 Thread Kyotaro Horiguchi
At Thu, 19 Jan 2023 18:43:52 -0500, Tom Lane  wrote in 
> sirisha chamarthi  writes:
> > On Wed, Oct 19, 2022 at 7:59 PM Kyotaro Horiguchi 
> > wrote:
> >> In short, the proposed fix alone seems fine to me. If we want to show
> >> further details, I would add a bit as follows.
> >> 
> >> | * * WALAVAIL_REMOVED means it has been removed. A replication stream on
> >> | *   a slot with this LSN cannot continue.  Note that the affected
> >> | *   processes have been terminated by checkpointer, too.
> 
> > Thanks for your comments! Attached the patch with your suggestions.
> 
> Pushed with a bit of additional wordsmithing.  I thought "have been"

Thanks!

> was a bit too strong of an assertion considering that this function
> does not pay any attention to the actual state of any processes,
> so I made it say "should have been".

I think you're correct here.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-23 Thread Kyotaro Horiguchi
At Mon, 23 Jan 2023 17:36:13 +0530, Amit Kapila  wrote 
in 
> On Sun, Jan 22, 2023 at 6:12 PM Takamichi Osumi (Fujitsu)
>  wrote:
> >
> >
> > Attached the updated patch v19.
> Few comments:
> 2.
> + if (IsSet(supported_opts, SUBOPT_MIN_APPLY_DELAY) &&
> + opts->min_apply_delay > 0 && opts->streaming == LOGICALREP_STREAM_PARALLEL)
> + ereport(ERROR,
> + errcode(ERRCODE_SYNTAX_ERROR),
> + errmsg("%s and %s are mutually exclusive options",
> +"min_apply_delay > 0", "streaming = parallel"));
>  }
> 
> I think here we should add a comment for the translator as we are
> doing in some other nearby cases.

IMHO "foo > bar" is not an "option".  I think we say "foo and bar are
mutually exclusive options" but I think don't say "foo = x and bar = y
are.. options".  I wrote a comment as "this should be more like
human-speaking" and Euler seems having the same feeling for another
error message.

Concretely I would spell this as "min_apply_delay cannot be enabled
when parallel streaming mode is enabled" or something. And the
opposite-direction message nearby would be "parallel streaming mode
cannot be enabled when min_apply_delay is enabled."

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-15 Thread Kyotaro Horiguchi
At Tue, 10 Jan 2023 12:01:43 +0530, Amit Kapila  wrote 
in 
> On Tue, Jan 10, 2023 at 11:16 AM Kyotaro Horiguchi
>  wrote:
> > Although I don't see a technical difference between the two, all the
> > other occurances including the just above (except test_shm_mq) use
> > "could not". A faint memory in my non-durable memory tells me that we
> > have a policy that we use "can/could not" than "unable".
> >
> 
> Right, it is mentioned in docs [1] (see section "Tricky Words to Avoid").

Thanks for confirmation.

> Can you please start a new thread and post these changes as we are
> proposing to change existing message as well?

All right.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Lazy allocation of pages required for verifying FPI consistency

2023-01-15 Thread Kyotaro Horiguchi
At Thu, 12 Jan 2023 15:02:25 +0530, Bharath Rupireddy 
 wrote in 
> On the contrary, PGAlignedBlock is being used elsewhere in the code;

I noticed it and had the same feeling, and thought that those cases don't
justify doing the same in other places.

> some of them are hot paths. verifyBackupPageConsistency() is not
> something that gets called always i.e. WAL consistency checks are done
> conditionally - when either one enables wal_consistency_checking for
> the rmgr or the WAL record is flagged with
> XLR_CHECK_CONSISTENCY (core doesn't do, it's an external module, if
> any, do that).

Right. So we could allocate them at the first use as below, but...

> I really don't see much of a problem in allocating them statically and
> pushing closer to where they're being used. If this really concerns,
> at the least, the dynamic allocation needs to be pushed to
> verifyBackupPageConsistency() IMO with if (first_time) { allocate two
> blocks with palloc} and use them. This at least saves some memory on
> the heap for most of the servers out there.

Yeah, we could do that. But as I mentioned before, that happens only
in the startup process, so it could be said that it's not worth bothering
about.  On the other hand, I don't think it's great to waste 16kB *
max_backends of memory, especially when the waste is clearly recognized
and easily avoidable.

I guess that is more or less the reason for the current code.
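
Just to illustrate what I mean by first-use allocation, a hypothetical
sketch (names are assumptions, not taken from the actual patch):

	/* inside verifyBackupPageConsistency(), instead of static arrays */
	static char *primary_image = NULL;
	static char *replay_image = NULL;

	if (primary_image == NULL)
	{
		/* allocated once, and only if consistency checking is ever reached */
		primary_image = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
		replay_image = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
	}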

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Exit walsender before confirming remote flush in logical replication

2023-01-15 Thread Kyotaro Horiguchi
At Fri, 13 Jan 2023 16:41:08 +0530, Amit Kapila  wrote 
in 
> Okay, but what happens in the case of physical replication when
> synchronous_commit = remote_apply? In that case, won't it ensure that
> apply has also happened? If so, then shouldn't the time delay feature
> also cause a similar problem for physical replication as well?

As written in another mail, WalSndDone doesn't honor
synchronous_commit. In other words, AFAICS the walsender finishes without
waiting for remote_apply.  The unapplied records will be applied at the
next startup.

I haven't confirmed that behavior myself, though..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Exit walsender before confirming remote flush in logical replication

2023-01-15 Thread Kyotaro Horiguchi
At Wed, 28 Dec 2022 09:15:41 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> > Another thing we can investigate here why do we need to ensure that
> > there is no pending send before shutdown.
> 
> I have not done yet about it, will continue next year.
> It seems that walsenders have been sending all data before shutting down 
> since ea5516,
> e0b581 and 754baa. 
> There were many threads related with streaming replication, so I could not pin
> the specific message that written in the commit message of ea5516.
> 
> I have also checked some wiki pages [1][2], but I could not find any design 
> about it.
> 
> [1]: https://wiki.postgresql.org/wiki/Streaming_Replication
> [2]: https://wiki.postgresql.org/wiki/Synchronous_Replication_9/2010_Proposal

If I'm grasping the discussion here correctly, as far as I remember it is
because physical replication needs all records that have been written on
the primary to also be written on the standby for switchover to succeed.
It is annoying that a normal shutdown occasionally leads to switchover
failure. Thus WalSndDone explicitly waits for the remote flush/write
regardless of the setting of synchronous_commit. As a result the apply
delay doesn't affect shutdown (AFAICS), and that is sufficient since all
the records will be applied at the next startup.

In logical replication, apply precedes write and flush, so we have no
indication of whether a record has been "replicated" to the standby other
than the apply LSN. On the other hand, logical replication has no business
with switchover, so that assurance is useless there. Thus I think
we can (practically) ignore apply_lsn at shutdown. It seems subtly
irregular, though.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Lazy allocation of pages required for verifying FPI consistency

2023-01-12 Thread Kyotaro Horiguchi
At Mon, 9 Jan 2023 20:00:00 +0530, Bharath Rupireddy 
 wrote in 
> I propose to statically allocate these two pages using PGAlignedBlock
> structure lazily in verifyBackupPageConsistency() to not waste dynamic
> memory worth 2*BLCKSZ bytes. I'm attaching a small patch herewith.
> 
> Thoughts?

IMHO, it's a bit scary to me to push the execution stack down by
that large a size. I would rather accept the (current) possible memory
waste, which affects only the startup process, than dig that deep into
the stack.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-11 Thread Kyotaro Horiguchi
At Wed, 11 Jan 2023 12:46:24 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> them. Which version is better?


Some comments from a quick look, apart from the above.


+ CONNECTION 'host=192.168.1.50 port=5432 user=foo dbname=foodb'

I understand that we (not just PG people, but IT people in general) are
supposed to use, in documentation, a certain set of special addresses that
are guaranteed not to be routed in the field.

> TEST-NET-1 : 192.0.2.0/24
> TEST-NET-2 : 198.51.100.0/24
> TEST-NET-3 : 203.0.113.0/24

(I found 192.83.123.89 in the postgres_fdw doc, but it'd be another issue..)


+   if (strspn(tmp, "-0123456789 ") == strlen(tmp))

Do we need to bother spending another memory block here just to detect
apparent non-digits?


+   errmsg(INT64_FORMAT " ms is 
outside the valid range for parameter \"%s\"",

We don't use INT64_FORMAT in translatable message strings. Cast the value
and use %lld instead.

This message looks unfriendly as it doesn't suggest the valid range,
and it shows the input value in a different unit from the one used in the
input. I think we can spell it as "\"%s\" is outside the valid range
for subscription parameter \"%s\" (0 ..  in milliseconds)"

+   int64   min_apply_delay;
..
+   if (ms < 0 || ms > INT_MAX)

Why is the variable wider than required?


+   errmsg("%s and %s are mutually 
exclusive options",
+  "min_apply_delay > 0", 
"streaming = parallel"));

Mmm. Couldn't we refuse 0 as min_apply_delay?


+   sub->minapplydelay > 0)
...
+   if (opts.min_apply_delay > 0 &&

Is there any reason for the differentiation?


+   errmsg("cannot 
set %s for subscription with %s",
+  
"streaming = parallel", "min_apply_delay > 0"));

I think that this should be more like human-speaking. Say, "cannot
enable min_apply_delay for a subscription in parallel streaming mode" or
something..  The same applies to the nearby message.



+static void maybe_delay_apply(TimestampTz ts);

  apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
-  XLogRecPtr lsn)
+  XLogRecPtr lsn, TimestampTz ts)

"ts" looks too generic. Couldn't it be more specific?
We need a explanation for the parameter in the function comment.


+   if (!am_parallel_apply_worker())
+   {
+   Assert(ts > 0);
+   maybe_delay_apply(ts);

It seems better to me that the if condition and the assertion be checked
inside maybe_delay_apply().


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: mprove tab completion for ALTER EXTENSION ADD/DROP

2023-01-10 Thread Kyotaro Horiguchi
At Mon, 2 Jan 2023 13:19:50 +0530, vignesh C  wrote in 
> On Mon, 5 Dec 2022 at 06:53, Michael Paquier  wrote:
> >
> > The DROP could be matched with the objects that are actually part of
> > the so-said extension?
> 
> The modified v2 version has the changes to handle the same. Sorry for
> the delay as I was working on another project.

It suggests the *kinds* of objects that are part of the extension, but
lists the objects of that kind regardless of dependency.  I read
Michael as suggesting (and I agree) that we restrict the objects (not the
kinds) to those actually part of the extension, and not restrict the
object kinds.

However, I'm not sure it is useful to restrict the object kinds, since the
operator already knows what to drop. If you still want to do that, the
use of completion_dont_quote looks ugly, since the function
(requote_identifier) assumes an identifier as input.  I haven't
looked closer, but maybe we need another way to do that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] Add overlaps geometric operators that ignore point overlaps

2023-01-10 Thread Kyotaro Horiguchi
Hello.

At Sun, 1 Jan 2023 01:13:24 +0530, Ankit Kumar Pandey  
wrote in 
> This is patch for todo item: Add overlaps geometric operators that
> ignore point overlaps
> 
> Issue:
> 
> SELECT circle '((0,0), 1)' && circle '((2,0),1) returns True
> 
> Expectation: In above case, both figures touch other but do not
> overlap (i.e. touching != overlap). Hence, it should return false.

This may be slightly off from the common definition in other geometry
processing systems, but it is the established behavior of PostgreSQL,
which presumably already has users relying on it.

About the behavior itself: since it seems to me that the words "touch"
and "overlap" have no rigorous mathematical definitions, it depends on
the definition. The following discussion would be mere word play..

If circle ((0,0),1) means a circumference, i.e. a set of points
described as "x^2 + y^2 = 1" (or it may be a disc containing the area
inside (<=) here) and "overlap" means "share at least a point", the
two circles are overlapping. This seems to be our current standpoint
and what is expressed in the doc.

If it meant the area strictly inside the outline (i.e. x^2 + y^2 <
1), the two circles could be said to be touching but not overlapping.  Or,
if a circle is defined as "(<)= 1" but "overlap" meant "share at least
an area", they could be said to be not overlapping but touching? (I'm not
sure about the border between a point and an area here, and the
distinction would be connected with the annoying EPSILON..)  The same
discussion holds for boxes and other shapes.

> Now, as per as discussion
> (https://www.postgresql.org/message-id/20100322175532.GG26428%40fetter.org)
> and corresponding change in docs,
> https://www.postgresql.org/docs/15/functions-geometry.html, it
> mentions
> 
> `Do these objects overlap? (One point in common makes this true.)
> `. Does this means current behavior is correct? Or do we still need
> the proposed change (if so, with proper updates in docs)?
> 
> If current behavior is correct, this todo item might need some update
> (unless I missed anything) otherwise any suggestion is welcomed.

I read the todo description as saying we may want *another set* of
operators to do that, not to change the current behavior of the existing
operators.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-09 Thread Kyotaro Horiguchi
Hello.

At Mon, 9 Jan 2023 14:21:03 +0530, Amit Kapila  wrote 
in 
> Pushed the first (0001) patch.

It added the following error message.

+   seg = dsm_attach(handle);
+   if (!seg)
+   ereport(ERROR,
+   
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+errmsg("unable to map dynamic shared memory 
segment")));

On the other hand we already have the following one in parallel.c
(another in pg_prewarm)

seg = dsm_attach(DatumGetUInt32(main_arg));
if (seg == NULL)
ereport(ERROR,

(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 errmsg("could not map dynamic shared memory 
segment")));

Although I don't see a technical difference between the two, all the
other occurrences, including the one just above (except test_shm_mq), use
"could not". A faint trace in my non-durable memory tells me that we
have a policy of using "can/could not" rather than "unable".

(Mmm. I do find some in StartBackgroundWorker and sepgsql_client_auth.)

Shouldn't we use the latter rather than the former?  If so, it seems
to me that test_shm_mq also needs the same amendment, to avoid the same
mistake in the future.

=
index 2e5914d5d9..a2d7474ed4 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -891,7 +891,7 @@ ParallelApplyWorkerMain(Datum main_arg)
if (!seg)
ereport(ERROR,

(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-errmsg("unable to map dynamic shared memory 
segment")));
+errmsg("could not map dynamic shared memory 
segment")));
 
toc = shm_toc_attach(PG_LOGICAL_APPLY_SHM_MAGIC, 
dsm_segment_address(seg));
if (!toc)
diff --git a/src/test/modules/test_shm_mq/worker.c 
b/src/test/modules/test_shm_mq/worker.c
index 8807727337..005b56023b 100644
--- a/src/test/modules/test_shm_mq/worker.c
+++ b/src/test/modules/test_shm_mq/worker.c
@@ -81,7 +81,7 @@ test_shm_mq_main(Datum main_arg)
if (seg == NULL)
ereport(ERROR,

(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-errmsg("unable to map dynamic shared memory 
segment")));
+errmsg("could not map dynamic shared memory 
segment")));
toc = shm_toc_attach(PG_TEST_SHM_MQ_MAGIC, dsm_segment_address(seg));
    if (toc == NULL)
ereport(ERROR,
=

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: ARRNELEMS Out-of-bounds possible errors

2022-12-23 Thread Kyotaro Horiguchi
At Fri, 23 Dec 2022 17:37:55 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Thu, 22 Dec 2022 12:35:58 -0300, Ranier Vilela  wrote 
> in 
> > Hi.
> > 
> > Per Coverity.
> > 
> > The commit ccff2d2
> > <https://github.com/postgres/postgres/commit/ccff2d20ed9622815df2a7deffce8a7b14830965>,
> > changed the behavior function ArrayGetNItems,
> > with the introduction of the function ArrayGetNItemsSafe.
> > 
> > Now ArrayGetNItems may return -1, according to the comment.
> > " instead of throwing an exception. -1 is returned after an error."
> 
> If I'm reading the code correctly, it's the definition of
> ArrayGetNItems*Safe*. ArrayGetNItems() calls that function with a NULL
> escontext and the NULL turns ereturn() into ereport().

> That doesn't seem to be changed by the commit.

No.. It seems to me that the commit didn't change its behavior in that
regard.

> Of course teaching Coverity not to issue the false warnings would be
> another actual issue that we should do, maybe.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: ARRNELEMS Out-of-bounds possible errors

2022-12-23 Thread Kyotaro Horiguchi
At Thu, 22 Dec 2022 12:35:58 -0300, Ranier Vilela  wrote 
in 
> Hi.
> 
> Per Coverity.
> 
> The commit ccff2d2
> <https://github.com/postgres/postgres/commit/ccff2d20ed9622815df2a7deffce8a7b14830965>,
> changed the behavior function ArrayGetNItems,
> with the introduction of the function ArrayGetNItemsSafe.
> 
> Now ArrayGetNItems may return -1, according to the comment.
> " instead of throwing an exception. -1 is returned after an error."

If I'm reading the code correctly, it's the definition of
ArrayGetNItems*Safe*. ArrayGetNItems() calls that function with a NULL
escontext and the NULL turns ereturn() into ereport(). That doesn't
seem to be changed by the commit.
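
Roughly, the relationship is (from memory, simplified):

	int
	ArrayGetNItems(int ndim, const int *dims)
	{
		/*
		 * A NULL escontext makes ereturn() inside the Safe variant act
		 * like ereport(ERROR), so -1 never actually reaches callers of
		 * ArrayGetNItems().
		 */
		return ArrayGetNItemsSafe(ndim, dims, NULL);
	}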

Of course, teaching Coverity not to issue these false warnings would be
another thing that we should perhaps actually do.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Force streaming every change in logical decoding

2022-12-22 Thread Kyotaro Horiguchi
At Thu, 22 Dec 2022 14:53:34 +0530, Amit Kapila  wrote 
in 
> For this, I like your proposal for "buffered" as an explicit default value.

Good to hear.

> Okay, how about modifying it to: "When set to
> immediate, stream each change if
> streaming option (see optional parameters set by
> CREATE SUBSCRIPTION) is enabled, otherwise, serialize each change.

Looks fine. Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Exit walsender before confirming remote flush in logical replication

2022-12-22 Thread Kyotaro Horiguchi
At Thu, 22 Dec 2022 17:29:34 +0530, Ashutosh Bapat 
 wrote in 
> On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
>  wrote:
> > In case of logical replication, however, we cannot support the use-case that
> > switches the role publisher <-> subscriber. Suppose same case as above, 
> > additional
..
> > Therefore, I think that we can ignore the condition for shutting down the
> > walsender in logical replication.
...
> > This change may be useful for time-delayed logical replication. The 
> > walsender
> > waits the shutdown until all changes are applied on subscriber, even if it 
> > is
> > delayed. This causes that publisher cannot be stopped if large delay-time is
> > specified.
> 
> I think the current behaviour is an artifact of using the same WAL
> sender code for both logical and physical replication.

Yeah, I don't think we do that for the sake of switchover. On the
other hand, I think the behavior was intentionally carried over since it
is considered sensible on its own. And I'm afraid that many people
already rely on that behavior.

> I agree with you that the logical WAL sender need not wait for all the
> WAL to be replayed downstream.

Thus I feel that it might be a bit outrageous to get rid of that
behavior altogether because a new feature stumbles on it.  I'm
fine doing that only in the apply_delay case, though.  However, I have
another concern that we would be introducing a second exception for
XLogSendLogical in the common path.

I doubt that anyone wants to use synchronous logical replication with
apply_delay, since the sender transaction is inevitably affected
by that delay.

Thus, how about the logrep worker sending a kind of crafted feedback
before entering an apply_delay, one that reports commit_data.end_lsn as
flushpos?  A little tweak is needed in send_feedback(), but it seems to
work.
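
To sketch the idea (hypothetical names; send_feedback_with_flushpos() does
not exist, it only stands for the tweak mentioned above):

=
/*
 * Sketch: report this commit's end_lsn as both written and flushed right
 * before waiting out min_apply_delay, so a shutting-down walsender can
 * regard the subscriber as caught up.  The existing send_feedback()
 * computes flushpos from what was really applied, hence the hypothetical
 * extra entry point below.
 */
static void
apply_delay_with_feedback(LogicalRepCommitData *commit_data,
                          TimestampTz delay_until)
{
    /* pretend we are caught up to this commit before going to sleep */
    send_feedback_with_flushpos(commit_data->end_lsn,  /* recvpos */
                                commit_data->end_lsn,  /* flushpos, hypothetical */
                                true /* force */ );

    /* then wait until the configured min_apply_delay has elapsed */
    while (GetCurrentTimestamp() < delay_until)
        pg_usleep(10000L);      /* simplified; real code would use WaitLatch() */
}
=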

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Force streaming every change in logical decoding

2022-12-22 Thread Kyotaro Horiguchi
At Thu, 22 Dec 2022 16:59:30 +0900, Masahiko Sawada  
wrote in 
> On Thu, Dec 22, 2022 at 4:18 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Amit,
> >
> > Thank you for updating the patch. I have also checked the patch
> > and basically it has worked well. Almost all things I found were modified
> > by v4.
> >
> > One comment: while setting logical_decoding_mode to wrong value,
> > I got unfriendly ERROR message.
> >
> > ```
> > postgres=# SET logical_decoding_mode = 1;
> > ERROR:  invalid value for parameter "logical_decoding_mode": "1"
> > HINT:  Available values: , immediate
> > ```
> >
> > Here all acceptable enum should be output as HINT, but we could not see the 
> > empty string.
> > Should we modify config_enum_get_options() for treating empty string, maybe
> > like (empty)?
> 
> Good point. I think the hint message can say "The only allowed value
> is \"immediate\" as recovery_target does. Or considering the name of
> logical_decoding_mode, we might want to have a non-empty string, say
> 'normal' as Kuroda-san proposed, as the default value.

Oh. I missed this, and I agree with having an explicit default option.
I slightly prefer "buffered", but "normal" also works fine for me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-12-22 Thread Kyotaro Horiguchi
At Thu, 22 Dec 2022 10:09:23 +0530, Bharath Rupireddy 
 wrote in 
> On Thu, Dec 22, 2022 at 7:57 AM Michael Paquier  wrote:
> >
> > On Wed, Dec 21, 2022 at 10:22:02PM +0100, Magnus Hagander wrote:
> > > Basically, we take one thing and turn it into 3. That very naturally rings
> > > with "split" to me.
> > >
> > > Parse might work as well, certainly better than dissect. I'd still prefer
> > > split though.
> >
> > Honestly, I don't have any counter-arguments, so I am fine to switch
> > the name as you are suggesting.  And pg_split_walfile_name() it is?
> 
> +1. FWIW, a simple patch is here
> https://www.postgresql.org/message-id/CALj2ACXdZ7WGRD-_jPPeZugvWLN%2Bgxo3QtV-eZPRicUwjesM%3Dg%40mail.gmail.com.

By the way, the function is documented as follows.

>  Extracts the file sequence number and timeline ID from a WAL file name.

I didn't find a definition of the term "file sequence number" in
the doc. Instead I find "segment number" (a bit doubtful, though..).

In the first place, "file sequence number" and "segno" can hardly be
associated by appearance by readers, I think.  (Yeah, we can identify
it since the other parameter is identifiable alone.) Why don't we
spell out the parameter simply as "segment number"?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Force streaming every change in logical decoding

2022-12-21 Thread Kyotaro Horiguchi
At Thu, 22 Dec 2022 12:35:46 +0530, Amit Kapila  wrote 
in 
> I have addressed these comments in the attached. Additionally, I have
> modified the docs and commit messages to make those clear. I think
> instead of adding new tests with this patch, it may be better to
> change some of the existing tests related to streaming to use this
> parameter as that will clearly show one of the purposes of this patch.

Being late, but I'm worried that we might someday regret the lack
of an explicit default. Do we really not need it?

+Allows streaming or serializing changes immediately in logical 
decoding.
+The allowed values of logical_decoding_mode are the
+empty string and immediate. When set to
+immediate, stream each change if
+streaming option is enabled, otherwise, serialize
+each change.  When set to an empty string, which is the default,
+decoding will stream or serialize changes when
+logical_decoding_work_mem is reached.

With (really) fresh eyes, it took me quite a while to understand what
the "streaming" option is. Couldn't we augment the description a
bit?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Inconsistency in reporting checkpointer stats

2022-12-21 Thread Kyotaro Horiguchi
At Wed, 21 Dec 2022 17:14:12 +0530, Nitin Jadhav 
 wrote in 
> [1]:
> 2022-12-21 10:52:25.931 UTC [63530] LOG:  checkpoint complete: wrote
> 4670 buffers (28.5%), wrote 3 slru buffers (0.0%); 0 WAL file(s)
> added, 0 removed, 4 recycled; write=0.045 s, sync=0.161 s, total=0.244
> s; sync files=25, longest=0.146 s, average=0.007 s; distance=66130 kB,
> estimate=66130 kB; lsn=0/5557C78, redo lsn=0/5557C40
> 
> Thanks & Regards,
> Nitin Jadhav
> 
> On Tue, Dec 20, 2022 at 11:08 PM Andres Freund  wrote:
> >
> > On 2022-12-20 08:18:36 -0500, Robert Haas wrote:
> > > I think that the SLRU information is potentially useful, but mixing it
> > > with the information about regular buffers just seems confusing.
> >
> > +1
> >
> > At least for now, it'd be different if/when we manage to move SLRUs to
> > the main buffer pool.

It sounds reasonable to exclude SLRU writes from buffer writes. But I'm
not sure it's useful to count SLRU writes separately, since they are below
the noise level of buffer writes in regular cases and the value
doesn't lead to tuning. However, I'm not strongly opposed to adding it
either.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Inconsistency in reporting checkpointer stats

2022-12-21 Thread Kyotaro Horiguchi
At Mon, 19 Dec 2022 18:05:38 +0530, Bharath Rupireddy 
 wrote in 
> On Fri, Dec 16, 2022 at 2:14 PM Kyotaro Horiguchi
>  wrote:
> >
> > In the first place I don't like that we count the same things twice.
> > Couldn't we count the number only by any one of them?
> >
> > If we remove CheckPointerStats.ckpt_bufs_written, CreateCheckPoint can
> > get the final number as the difference between the start-end values of
> > *the shared stats*. As long as a checkpoint runs on a single process,
> > trace info in BufferSync will work fine.  Assuming single process
> > checkpointing there must be no problem to do that. (Anyway the current
> > shared stats update for checkpointer is assuming single-process).
> 
> What if someone resets checkpointer shared stats with
> pg_stat_reset_shared()? In such a case, the checkpoint complete
> message will not have the stats, no?

I don't know. I don't believe the stats system follows such a
strict resetting policy anyway.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: (non) translatable string splicing

2022-12-20 Thread Kyotaro Horiguchi
At Mon, 19 Dec 2022 16:14:04 -0500, Robert Haas  wrote 
in 
> On Fri, Dec 16, 2022 at 8:25 AM Justin Pryzby  wrote:
> > Due to incomplete translation, that allows some pretty fancy output,
> > like:
> > | You must identify the directory where the residen los binarios del 
> > clúster antiguo.
> 
> I can't see how that could be mejor. :-)

It was quite annoying but not untranslatable. But the "the" before
"residen" looks badly misplaced :p  It should be part of the
inner text ("residen los..").

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: (non) translatable string splicing

2022-12-20 Thread Kyotaro Horiguchi
At Mon, 19 Dec 2022 13:20:55 +0900, Michael Paquier  wrote 
in 
> On Fri, Dec 16, 2022 at 07:24:52AM -0600, Justin Pryzby wrote:
> > Due to incomplete translation, that allows some pretty fancy output,
> > like:
> > | You must identify the directory where the residen los binarios del 
> > clúster antiguo.
> > 
> > That commit also does this a couple times:
> > 
> > +_(" which is an index on \"%s.%s\""),

For this specific case I didn't find it difficult, since it is
compatible with "This is blah" in that context.

> Ugh.  Perhaps we could just simplify the wordings as of "index on
> blah", "index on OID %u", "TOAST table for blah" and "TOAST table for
> OID %u" with newlines after each item?

I'm fine with just removing " which ", but I don't understand the
point of the extra newlines.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Inconsistency in reporting checkpointer stats

2022-12-16 Thread Kyotaro Horiguchi
At Wed, 14 Dec 2022 16:54:53 +0530, Bharath Rupireddy 
 wrote in 
> Indeed PendingCheckpointerStats.buf_written_checkpoints needs to count
> buffer writes in SlruInternalWritePage(). However, does it need to be
> done immediately there? The stats will not be visible to the users
> until the next pgstat_report_checkpointer(). Incrementing
> buf_written_checkpoints in BufferSync() makes sense as the
> pgstat_report_checkpointer() gets called in there via
> CheckpointWriteDelay() and it becomes visible to the user immediately.
> Have you checked if pgstat_report_checkpointer() gets called while the
> checkpoint calls SlruInternalWritePage()? If not, then you can just
> assign ckpt_bufs_written to buf_written_checkpoints in
> LogCheckpointEnd() like its other friends
> checkpoint_write_time and checkpoint_sync_time.

If I'm reading Bharath correctly, it results in double counting of
BufferSync's writes. If we want to keep the realtime-reporting nature of
BufferSync, BufferSync should give up incrementing CheckPointerStats'
counter.  Such a separation seems kind of stupid and quite
bug-prone.

In the first place, I don't like that we count the same thing twice.
Couldn't we count the number in only one of them?

If we remove CheckPointerStats.ckpt_bufs_written, CreateCheckPoint can
get the final number as the difference between the start and end values
of *the shared stats*. As long as a checkpoint runs in a single process,
the trace info in BufferSync will work fine.  Assuming single-process
checkpointing, there is no problem doing that. (Anyway, the current
shared-stats update for the checkpointer assumes a single process.)

Otherwise, in exchange for giving up the realtime nature, we can
count the number only with CheckPointerStats.ckpt_bufs_written.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: plpgsq_plugin's stmt_end() is not called when an error is caught

2022-12-15 Thread Kyotaro Horiguchi
At Thu, 15 Dec 2022 09:03:12 +0100, Pavel Stehule  
wrote in 
> I found some solution based by using fmgr hook
> 
> https://github.com/okbob/plpgsql_check/commit/9a17e97354a48913de5219048ee3be6f8460bae9

Oh! Thanks for the pointer, will look into that.

By the way, it seems to me that the tool is using
RegisterResourceReleaseCallback to reset the function nest level. But
since there's a case where the mechanism doesn't work, I suspect that
the callback can be missed in some error-return cases, which would be
a bug if true.

# I haven't confirmed that behavior by myself, though.
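
FWIW, a minimal sketch of that approach (hypothetical names, not the
plpgsql_check code): reset a module-level nest counter from a resource
release callback on top-level abort.

=
#include "postgres.h"
#include "fmgr.h"
#include "utils/resowner.h"

PG_MODULE_MAGIC;

static int  func_nest_level = 0;

/* called during ResourceOwnerRelease(); reset stale nesting on abort */
static void
nest_level_release_callback(ResourceReleasePhase phase,
                            bool isCommit, bool isTopLevel, void *arg)
{
    if (!isCommit && isTopLevel && phase == RESOURCE_RELEASE_AFTER_LOCKS)
        func_nest_level = 0;
}

void
_PG_init(void)
{
    RegisterResourceReleaseCallback(nest_level_release_callback, NULL);
}
=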

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: plpgsq_plugin's stmt_end() is not called when an error is caught

2022-12-14 Thread Kyotaro Horiguchi
At Thu, 15 Dec 2022 08:41:21 +0100, Pavel Stehule  
wrote in 
> čt 15. 12. 2022 v 8:25 odesílatel Masahiko Sawada 
> napsal:
> > Is this a bug in plpgsql?
> >
> 
> I think it is by design.  There is not any callback that is called after an
> exception.
> 
> It is true, so some callbacks on statement error and function's error can
> be nice. It can help me to implement profilers, or tracers more simply and
> more robustly.
> 
> But I am not sure about performance impacts. This is on a critical path.

I haven't searched for it, but I guess the end-side callbacks of all
begin-end type callbacks are not called on exception. An additional
PG_TRY level wouldn't be acceptable for performance reasons.

What we (pg_hint_plan people) want is some means to know that the
top-level function has exited, so that we can reset the function nest
level. That would be simpler than calling an end callback at every nest
level.
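
For the record, the fmgr hook Pavel pointed at looks like it can cover
exactly this, since FHET_ABORT fires when a hooked function exits via
error. A rough, untested sketch (a real module would chain the previous
hooks and filter by function):

=
#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

static int  func_nest_level = 0;

static bool
my_needs_fmgr_hook(Oid fn_oid)
{
    return true;                /* hook everything; filter by oid in reality */
}

static void
my_fmgr_hook(FmgrHookEventType event, FmgrInfo *flinfo, Datum *arg)
{
    switch (event)
    {
        case FHET_START:
            func_nest_level++;
            break;
        case FHET_END:
        case FHET_ABORT:        /* also unwound on error exit */
            func_nest_level--;
            break;
    }
}

void
_PG_init(void)
{
    needs_fmgr_hook = my_needs_fmgr_hook;
    fmgr_hook = my_fmgr_hook;
}
=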

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Kyotaro Horiguchi
At Thu, 15 Dec 2022 10:29:17 +0530, Amit Kapila  wrote 
in 
> On Thu, Dec 15, 2022 at 10:11 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Thu, 15 Dec 2022 09:18:55 +0530, Amit Kapila  
> > wrote in
> > > On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi
> > >  wrote:
> > > subscriber was busy enough that it doesn't need to add an additional
> > > delay before applying a particular transaction(s) but adding a delay
> > > to such a transaction on the publisher will actually make it take much
> > > longer to reflect than expected. We probably need to name this
> >
> > Isn't the name min_apply_delay implying the same behavior? Even though
> > the delay time will be a bit prolonged.
> >
> 
> Sorry, I don't understand what you intend to say in this point. In
> above, I mean that the currently proposed patch won't have such a
> problem but if we apply delay on publisher the problem can happen.

Are you saying that the sender-side delay lets the whole transaction
(if it has not been streamed out) stay on the sender side?  If so...
yeah, I agree that that is undesirable.

> > > parameter as min_send_delay if we want to do what you are saying and
> > > then I don't know if it serves the actual need and also it will be
> > > different from what we do in physical standby.
> >
> > In the first place phisical and logical replication works differently
> > and the mechanism to delaying "apply" differs even in the current
> > state in terms of logrep delay choking stream.
> >
> 
> I think the first preference is to make it work in a similar way (as
> much as possible) to how this parameter works in physical standby and
> if that is not at all possible then we may consider other approaches.

I understood that. However, I still think choking the stream on the
receiver side alone is kind of ugly, since it breaks the protocol
assumption that in-band maintenance packets are processed in
a timely manner on the peer under normal operation (even though
some delays can occur for natural reasons).  In this regard, I'm
inclined to be in favor of Kuroda-san's proposal.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Kyotaro Horiguchi
At Thu, 15 Dec 2022 09:18:55 +0530, Amit Kapila  wrote 
in 
> On Thu, Dec 15, 2022 at 7:22 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Wed, 14 Dec 2022 16:30:28 +0530, Amit Kapila  
> > wrote in
> > > On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu)
> > >  wrote:
> > > > One idea to avoid that is to send the min_apply_delay subscriber option 
> > > > to publisher
> > > > and compare them, but it may be not sufficient. Because XXX_timout GUC 
> > > > parameters
> > > > could be modified later.
> > > >
> > >
> > > How about restarting the apply worker if min_apply_delay changes? Will
> > > that be sufficient?
> >
> > Mmm. If publisher knows that value, isn't it albe to delay *sending*
> > data in the first place? This will resolve many known issues including
> > walsender's un-terminatability, possible buffer-full and status packet
> > exchanging.
> >
> 
> Yeah, but won't it change the meaning of this parameter? Say the

It changes internally, but not on its face.  The difference is
only in where the choking point exists. If ".._apply_delay" should
work literally, we should go the way Kuroda-san proposed, namely
"the apply worker has received the data, but will delay applying it".  If
we named the current behavior technically correctly, it would be
"min_receive_delay" or "min_choking_interval".

> subscriber was busy enough that it doesn't need to add an additional
> delay before applying a particular transaction(s) but adding a delay
> to such a transaction on the publisher will actually make it take much
> longer to reflect than expected. We probably need to name this

Doesn't the name min_apply_delay imply the same behavior, even though
the delay time will be a bit prolonged?

> parameter as min_send_delay if we want to do what you are saying and
> then I don't know if it serves the actual need and also it will be
> different from what we do in physical standby.

In the first place, physical and logical replication work differently,
and the mechanism for delaying "apply" already differs in the current
state in that the logrep delay chokes the stream.

I guess they cannot be different in terms of normal operation. But I'm
not sure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Kyotaro Horiguchi
At Thu, 15 Dec 2022 09:23:12 +0530, Amit Kapila  wrote 
in 
> On Thu, Dec 15, 2022 at 7:16 AM Kyotaro Horiguchi
>  wrote:
> > Allowing walsender to finish ignoring replication status
> > wouldn't be great.
> >
> 
> Yes, that would be ideal. But do you know why that is a must?

I believe a graceful shutdown (fast and smart) of a replication set is
expected to leave both ends in sync.  Of course we can change the policy
to allow walsender to stop before confirming that all WAL has been
applied. However, walsender has no idea whether the peer is
intentionally delaying or not.

> >  One idea is to let logical workers send delaying
> > status.
> >
> 
> How can that help?

If the logical worker notifies "I'm intentionally pausing replication
for now, so if you want to shut down, please go ahead and ignore me",
the publisher can legitimately run a (kind of) dirty shutdown.

# It looks a bit too much, though..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Kyotaro Horiguchi
At Wed, 14 Dec 2022 16:30:28 +0530, Amit Kapila  wrote 
in 
> On Wed, Dec 14, 2022 at 4:16 PM Hayato Kuroda (Fujitsu)
>  wrote:
> > One idea to avoid that is to send the min_apply_delay subscriber option to 
> > publisher
> > and compare them, but it may be not sufficient. Because XXX_timout GUC 
> > parameters
> > could be modified later.
> >
> 
> How about restarting the apply worker if min_apply_delay changes? Will
> that be sufficient?

Mmm. If the publisher knows that value, isn't it able to delay *sending*
the data in the first place? That would resolve many known issues,
including walsender's un-terminatability, possible buffer-full
situations, and status packet exchanging.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-14 Thread Kyotaro Horiguchi
At Wed, 14 Dec 2022 10:46:17 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> I have implemented and tested that workers wake up per wal_receiver_timeout/2
> and send keepalive. Basically it works well, but I found two problems.
> Do you have any good suggestions about them?
> 
> 1)
> 
> With this PoC at present, workers calculate sending intervals based on its
> wal_receiver_timeout, and it is suppressed when the parameter is set to zero.
> 
> This means that there is a possibility that walsender is timeout when 
> wal_sender_timeout
> in publisher and wal_receiver_timeout in subscriber is different.
> Supposing that wal_sender_timeout is 2min, wal_receiver_tiemout is 5min,

It seems to me wal_receiver_status_interval is better suited for this
use. It's enough for us to document that "wal_receiver_status_interval
should be shorter than wal_sender_timeout/2, especially when a logical
replication connection is using min_apply_delay. Otherwise you will
suffer repeated terminations of walsender".

> and min_apply_delay is 10min. The worker on subscriber will wake up per 
> 2.5min and
> send keepalives, but walsender exits before the message arrives to publisher.
> 
> One idea to avoid that is to send the min_apply_delay subscriber option to 
> publisher
> and compare them, but it may be not sufficient. Because XXX_timout GUC 
> parameters
> could be modified later.

# Anyway, I don't think such an asymmetric setup is preferable.

> 2)
> 
> The issue reported by Vignesh-san[1] has still remained. I have already 
> analyzed that [2],
> the root cause is that flushed WAL is not updated and sent to the publisher. 
> Even
> if workers send keepalive messages to pub during the delay, the flushed 
> position
> cannot be modified.

I didn't look closely, but I guess the cause is that walsender doesn't
die until all WAL has been sent, while the logical delay chokes the
replication stream. Allowing walsender to finish while ignoring the
replication status wouldn't be great.  One idea is to let logical
workers send a "delaying" status.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_upgrade: Make testing different transfer modes easier

2022-12-14 Thread Kyotaro Horiguchi
At Wed, 14 Dec 2022 10:40:45 +0100, Daniel Gustafsson  wrote 
in 
> > On 14 Dec 2022, at 08:04, Peter Eisentraut 
> >  wrote:
> > 
> > On 07.12.22 17:33, Peter Eisentraut wrote:
> >> I think if we want to make this configurable on the fly, and environment 
> >> variable would be much easier, like
> >> my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
> > 
> > Here is an updated patch set that incorporates this idea.

We already have PG_TEST_EXTRA. Shouldn't we use it here as well?

> I would prefer a small note about it in src/bin/pg_upgrade/TESTING to document
> it outside of the code, but otherwise LGTM.
> 
> + $mode,
>   '--check'
>   ],
> 
> ...
> 
> - '-p', $oldnode->port, '-P', $newnode->port
> + '-p', $oldnode->port, '-P', $newnode->port,
> + $mode,
>   ],
> 
> Minor nitpick, but while in there should we take the opportunity to add a
> trailing comma on the other two array declarations which now ends with 
> --check?
> It's good Perl practice and will make the code consistent.

Otherwise looks good to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Shortening the Scope of Critical Section in Creation of New MultiXacts

2022-12-13 Thread Kyotaro Horiguchi
Hello.

In short, the code as-is looks correct.

At Wed, 14 Dec 2022 00:14:34 +, "Bagga, Rishu"  wrote 
in 
>* Critical section from here until caller has written the data into the
>* just-reserved SLRU space; we don't want to error out with a partly

"the data" here means the whole this multi transaction, which includes
members. We shouldn't exit the critical section at least until the
very end of RecordNewMultiXact().

> This makes sense, as we need the multixact state and multixact offset 
> data to be consistent, but once we write data into the offsets, we can 
> end the critical section. Currently we wait until the members data is 
> also written before we end the critical section. 

Why do you think that the members are not part of a
multitransaction?  A multitransaction is not complete without them.

In addition to the above, we cannot simply move the END_CRIT_SECTION()
there, since RecordNewMultiXact() has another caller that doesn't start
a critical section.
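
To make the point concrete, a condensed sketch of
MultiXactIdCreateFromMembers() (XLOG insertion and error handling omitted):

=
static MultiXactId
create_multixact_sketch(int nmembers, MultiXactMember *members)
{
    MultiXactId multi;
    MultiXactOffset offset;

    /* GetNewMultiXactId() enters the critical section internally */
    multi = GetNewMultiXactId(nmembers, &offset);

    /* (the XLOG record for the new multixact is inserted here) */

    /* writes the offsets entry *and* every member entry */
    RecordNewMultiXact(multi, offset, nmembers, members);

    /* only now is the multixact complete, so only now can we leave */
    END_CRIT_SECTION();

    return multi;
}
=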

By the way, I didn't test this myself, but..

> This passes regression tests

If you do the same with an assert-enabled build, you will probably get
an assertion failure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-13 Thread Kyotaro Horiguchi
At Tue, 13 Dec 2022 17:05:35 +0530, Amit Kapila  wrote 
in 
> On Tue, Dec 13, 2022 at 7:35 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila  
> > wrote in
> Yeah, I think ideally it will timeout but if we have a solution like
> during delay, we keep sending ping messages time-to-time, it should
> work fine. However, that needs to be verified. Do you see any reasons
> why that won't work?

Ah. By "I'm not sure" I meant "I have no clear idea whether".

I looked into it a bit further. In the end, ProcessPendingWrites() waits
for the streaming socket to become writable, so no critical problem is
found here. That being said, it might be better if ProcessPendingWrites()
refrained from sending consecutive keepalives while waiting; a 30s ping
timeout and a 1h delay may result in 120 successive pings. It might not
be a big deal, but..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-12 Thread Kyotaro Horiguchi
At Tue, 13 Dec 2022 02:28:49 +, "Takamichi Osumi (Fujitsu)" 
 wrote in 
> On Wednesday, December 7, 2022 12:00 PM Kyotaro Horiguchi 
>  wrote:
>
> We couldn't reproduce this failure and
> find the same type of failure on the cfbot from the past failures.
> It seems no subtests run in your environment.

Very sorry for that. The test script turned out to be a left-over file
in a git-reset'ed working tree. Please forget about it.

FWIW, the latest patch passed make-world for me on Rocky8/x86_64.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-12 Thread Kyotaro Horiguchi
At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila  wrote 
in 
> On Mon, Dec 12, 2022 at 1:04 PM Hayato Kuroda (Fujitsu)
>  wrote:
> > once and apply later. Our basic design is as follows:
> >
> > * All transactions areserialized to files if min_apply_delay is set to 
> > non-zero.
> > * After receiving the commit message and spending time, workers reads and
> >   applies spooled messages
> >
> 
> I think this may be more work than required because in some cases
> doing I/O just to delay xacts will later lead to more work. Can't we
> send some ping to walsender to communicate that walreceiver is alive?
> We already seem to be sending a ping in LogicalRepApplyLoop if we
> haven't heard anything from the server for more than
> wal_receiver_timeout / 2. Now, it is possible that the walsender is
> terminated due to some other reason and we need to see if we can
> detect that or if it will only be detected once the walreceiver's
> delay time is over.

FWIW, I thought the same thing as Amit.

What we should do here is have the logrep worker notify walsender that
it's alive and that the communication in between is fine, and maybe the
worker's status. Spontaneous send_feedback() calls while delaying would
be sufficient for this purpose. We might need to suppress extra forced
feedbacks instead. In contrast, the worker doesn't need to bother
checking whether the peer is alive until it receives the next data. But
we might need to adjust the wait_time in LogicalRepApplyLoop().

But I'm not sure what will happen when walsender is blocked by a full
buffer for a long time.
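
A sketch of what such spontaneous feedback during the delay could look
like (remaining_ms is a made-up local; last_received and send_feedback()
are the existing ones in worker.c, and the wait event is borrowed from the
physical-standby delay just for illustration):

=
/* wait out min_apply_delay while keeping the walsender fed */
while (remaining_ms > 0)
{
    long        sleep_ms = Min(remaining_ms,
                               wal_receiver_status_interval * 1000L / 2);

    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     sleep_ms, WAIT_EVENT_RECOVERY_APPLY_DELAY);
    ResetLatch(MyLatch);

    /* spontaneous status report so wal_sender_timeout doesn't fire */
    send_feedback(last_received, true, false);

    remaining_ms -= sleep_ms;
}
=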

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-12 Thread Kyotaro Horiguchi
Hello.

At Mon, 12 Dec 2022 07:42:30 +, "Takamichi Osumi (Fujitsu)" 
 wrote in 
> On Monday, December 12, 2022 2:54 PM Kyotaro Horiguchi 
>  wrote:
> > I asked about unexpected walsender termination caused by this feature but I
> > think I didn't received an answer for it and the behavior is still exists.
..
> Thank you so much for your report!
> Yes. Currently, how to deal with the timeout issue is under discussion.
> Some analysis about the root cause are also there.
> 
> Kindly have a look at [1].
> 
> 
> [1] - 
> https://www.postgresql.org/message-id/TYAPR01MB58669394A67F2340B82E42D1F5E29%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Oops. Thank you for the pointer. Will visit there.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-11 Thread Kyotaro Horiguchi
Hello.

I asked about unexpected walsender termination caused by this feature,
but I think I didn't receive an answer for it, and the behavior still
exists.

Specifically, if the servers have the following settings, walsender
terminates due to replication timeout. After that, the connection is
restored once the LR delay elapses. Although it can be said to be
working in that sense, the error happens repeatedly at roughly
min_apply_delay intervals and is hard to distinguish from network
trouble.  I'm not sure whether you are deliberately okay with it, but
I don't think behavior that causes replication timeouts is acceptable.

> wal_sender_timeout = 10s;
> wal_receiver_temeout = 10s;
> 
> create subscription ... with (min_apply_delay='60s');

This is somewhat artificial, but timeout=60s and delay=5m is not an
uncommon setup, and that also causes this "issue".

subscriber:
> 2022-12-12 14:17:18.139 JST LOG:  terminating walsender process due to 
> replication timeout
> 2022-12-12 14:18:11.076 JST LOG:  starting logical decoding for slot "s1"
...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2022-12-11 Thread Kyotaro Horiguchi
Sorry for the confusion.

At Mon, 12 Dec 2022 12:06:36 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > This patch copies the bleeding edge WAL page without recording the
> > (next) insertion point nor checking whether all in-progress insertion
> > behind the target LSN have finished. Thus the copied page may have
> > holes.  That being said, the sequential-reading nature and the fact
> > that WAL buffers are zero-initialized may make it work for recovery,
> > but I don't think this also works for replication.
> 
> Mmm. I'm a bit dim. Recovery doesn't read concurrently-written
> records. Please forget about recovery.

NO... Logical walsenders do that. So, please forget about this...

> > I remember that the one of the advantage of reading the on-memory WAL
> > records is that that allows walsender to presend the unwritten
> > records. So perhaps we should manage how far the buffer is filled with
> > valid content (or how far we can presend) in this feature.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2022-12-11 Thread Kyotaro Horiguchi
At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> This patch copies the bleeding edge WAL page without recording the
> (next) insertion point nor checking whether all in-progress insertion
> behind the target LSN have finished. Thus the copied page may have
> holes.  That being said, the sequential-reading nature and the fact
> that WAL buffers are zero-initialized may make it work for recovery,
> but I don't think this also works for replication.

Mmm. I'm a bit dim. Recovery doesn't read concurrently-written
records. Please forget about recovery.

> I remember that the one of the advantage of reading the on-memory WAL
> records is that that allows walsender to presend the unwritten
> records. So perhaps we should manage how far the buffer is filled with
> valid content (or how far we can presend) in this feature.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2022-12-11 Thread Kyotaro Horiguchi
At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy 
 wrote in 
> The patch introduces concurrent readers for the WAL buffers, so far
> only there are concurrent writers. In the patch, WALRead() takes just
> one lock (WALBufMappingLock) in shared mode to enable concurrent
> readers and does minimal things - checks if the requested WAL page is
> present in WAL buffers, if so, copies the page and releases the lock.
> I think taking just WALBufMappingLock is enough here as the concurrent
> writers depend on it to initialize and replace a page in WAL buffers.
> 
> I'll add this to the next commitfest.
> 
> Thoughts?

This adds copying of the whole page (at least) at every WAL *record*
read, fighting all WAL writers by taking WALBufMappingLock on a very
busy page during the copy. I'm a bit doubtful that it results in an
overall improvement. It seems to me almost all pread()s here are
served from the OS file cache, so it is unclear to me that copying a
whole WAL page (then copying the target record again) wins over a
pread() call that copies only the record to be read. Do you have actual
numbers for how frequently WAL reads go to disk, or for the performance
gain or real I/O reduction this patch offers?
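
For readers following along, roughly the shape of the buffer-hit path
being discussed (a sketch written from memory, not the actual patch):

=
/*
 * Sketch: try to serve one page of WAL from wal_buffers.  Returns false
 * if the page has already been evicted, in which case the caller falls
 * back to pread() on the segment file.
 */
static bool
read_page_from_wal_buffers(XLogRecPtr pageptr, char *dest)
{
    int         idx = XLogRecPtrToBufIdx(pageptr);
    bool        hit = false;

    LWLockAcquire(WALBufMappingLock, LW_SHARED);

    /* the slot still maps this page only if xlblocks points just past it */
    if (XLogCtl->xlblocks[idx] == pageptr + XLOG_BLCKSZ)
    {
        memcpy(dest, XLogCtl->pages + idx * (Size) XLOG_BLCKSZ, XLOG_BLCKSZ);
        hit = true;
    }

    LWLockRelease(WALBufMappingLock);

    return hit;
}
=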

This patch copies the bleeding-edge WAL page without recording the
(next) insertion point or checking whether all in-progress insertions
behind the target LSN have finished. Thus the copied page may have
holes.  That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

I remember that one of the advantages of reading the in-memory WAL
records is that it allows walsender to pre-send records that are not
yet written out. So perhaps this feature should manage how far the
buffer is filled with valid content (or how far we can pre-send).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Question regarding "Make archiver process an auxiliary process. commit"

2022-12-06 Thread Kyotaro Horiguchi
At Wed, 7 Dec 2022 11:28:23 +0530, Sravan Kumar  wrote 
in 
> On Tue, Dec 6, 2022 at 5:24 PM Bharath Rupireddy <
> bharath.rupireddyforpostg...@gmail.com> wrote:
> 
> > On Tue, Dec 6, 2022 at 4:57 PM Sravan Kumar 
> > wrote:
> > >
> > > Thank you for the feedback.
> > >
> > > I'll be glad to help with the fix. Here's the patch for review.
> >
> > Thanks. +1 for fixing this.
> >> I would like to quote recent discussions on reducing the useless
> >> wakeups or increasing the sleep/hibernation times in various processes
> >> to reduce the power savings [1] [2] [3] [4] [5]. With that in context,
> >> does the archiver need to wake up every 60 sec at all when its latch
> >> gets set (PgArchWakeup()) whenever the server switches to a new WAL
> >> file? What happens if we get rid of PGARCH_AUTOWAKE_INTERVAL and rely
> >> on its latch being set? If required, we can spread PgArchWakeup() to
> >> more places, no?
> >
> >
> I like the idea of not having to wake up intermittently and probably we
> should start a discussion about it.
> 
> I see the following comment in PgArchWakeup().
> 
>  /*
> * We don't acquire ProcArrayLock here.  It's actually fine because
> * procLatch isn't ever freed, so we just can potentially set the wrong
> * process' (or no process') latch.  Even in that case the archiver will
> * be relaunched shortly and will start archiving.
> */
> 
> While I do not fully understand the comment, it gives me an impression that
> the SetLatch() done here is counting on the timeout to wake up the archiver
> in some cases where the latch is wrongly set.

It is talking about the first iteration of the archiver process, not
the periodic wakeups.  So I think it is doable if we consider how to
handle incomplete archiving iterations.

> The proposed idea is a behaviour change while this thread intends to clean
> up some code that's
> a result of the mentioned commit. So probably the proposed idea can be
> discussed as a separate thread.

Agreed.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Question regarding "Make archiver process an auxiliary process. commit"

2022-12-06 Thread Kyotaro Horiguchi
At Tue, 6 Dec 2022 17:23:50 +0530, Bharath Rupireddy 
 wrote in 
> Thanks. +1 for fixing this.
> 
> I would like to quote recent discussions on reducing the useless
> wakeups or increasing the sleep/hibernation times in various processes
> to reduce the power savings [1] [2] [3] [4] [5]. With that in context,
> does the archiver need to wake up every 60 sec at all when its latch
> gets set (PgArchWakeup()) whenever the server switches to a new WAL
> file? What happens if we get rid of PGARCH_AUTOWAKE_INTERVAL and rely
> on its latch being set? If required, we can spread PgArchWakeup() to
> more places, no?

I thought so at first, but archiving may be interrupted for various
reasons (disk full, I think, being the most common one). So intentional
wakeups alone aren't sufficient.

> Before even answering the above questions, I think we need to see if
> there're any cases where a process can miss SetLatch() calls (I don't
> have an answer for that).

I read a recent mail from Thomas that says something like "should we
consider the case where latch sets are missed?".  A latch is triggered
by SIGURG or SetEvent().  I'm not sure, but I believe the former is now
reliable and the latter has always been reliable.

> Or do we want to stick to what the below comment says?
> 
> /*
>  * There shouldn't be anything for the archiver to do except to wait for a
>  * signal ... however, the archiver exists to protect our data, so she
>  * wakes up occasionally to allow herself to be proactive.
>  */

So I think this is still valid.  If we want to eliminate useless
wakeups, the archiver needs to remember whether the last iteration was
fully done or not. But it seems a race condition is involved there.
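
For context, the wait pattern in question, condensed from pgarch.c's main
loop (simplified; shutdown and signal handling omitted):

=
for (;;)
{
    ResetLatch(MyLatch);

    /* archive whatever is ready; may stop early on repeated failures */
    pgarch_ArchiverCopyLoop();

    /*
     * Sleep until signalled, but no longer than PGARCH_AUTOWAKE_INTERVAL,
     * so an earlier failure (e.g. disk full) is retried even without a
     * fresh wakeup.
     */
    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     PGARCH_AUTOWAKE_INTERVAL * 1000L,
                     WAIT_EVENT_ARCHIVER_MAIN);
}
=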

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-06 Thread Kyotaro Horiguchi
At Tue, 6 Dec 2022 11:08:43 -0800, Andres Freund  wrote in 
> Hi,
> 
> The tests fail on cfbot:
> https://cirrus-ci.com/task/4533866329800704
> 
> They only seem to fail on 32bit linux.
> 
> https://api.cirrus-ci.com/v1/artifact/task/4533866329800704/testrun/build-32/testrun/subscription/032_apply_delay/log/regress_log_032_apply_delay
> [06:27:10.628](0.138s) ok 2 - check if the new rows were applied to subscriber
> timed out waiting for match: (?^:logical replication apply delay) at 
> /tmp/cirrus-ci-build/src/test/subscription/t/032_apply_delay.pl line 124.

It fails for me on 64bit Linux.. (Rocky 8.7)

> t/032_apply_delay.pl ... Dubious, test returned 29 (wstat 7424, 
> 0x1d00)
> No subtests run
..
> t/032_apply_delay.pl (Wstat: 7424 Tests: 0 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: No plan found in TAP output

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Transaction timeout

2022-12-06 Thread Kyotaro Horiguchi
At Mon, 5 Dec 2022 17:10:50 -0800, Andres Freund  wrote in 
> I'm most concerned about the overhead when the timeouts are *not*
> enabled. And this adds a branch to start_xact_command() and a function
> call for get_timeout_active(TRANSACTION_TIMEOUT) in that case. On its
> own, that's not a whole lot, but it does add up. There's 10+ function
> calls for timeout and ps_display purposes for every single statement.

That path seems to exist just for robustness. I inserted
"Assert(0)" just before the disable_timeout(), but make check-world
didn't fail [1]. Couldn't we get rid of that path, adding an assertion
instead?  I'm not sure about the other timeouts yet, though.
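
A sketch of that suggestion (TRANSACTION_TIMEOUT and the surrounding code
come from the proposed patch, so the exact shape below is an assumption):

=
static void
start_xact_command(void)
{
    if (!xact_started)
    {
        StartTransactionCommand();

        /*
         * Previously (in the patch): if (get_timeout_active(TRANSACTION_TIMEOUT))
         * disable_timeout(TRANSACTION_TIMEOUT, false);  The path should be
         * unreachable here, so assert it instead of paying for the check.
         */
        Assert(!get_timeout_active(TRANSACTION_TIMEOUT));

        xact_started = true;
    }

    /* ... statement_timeout handling etc. unchanged ... */
}
=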

On the disabling side, we cannot rely on StatementTimeout.

[1]
# 032_apply_delay.pl fails for me, so I don't know whether any of the
# later tests fail.

> But it's definitely also worth optimizing the timeout enabled paths. And
> you're right, it looks like there's a fair bit of optimization
> potential.

Thanks. I'll work on that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



