RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-06-12 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> At PGCon and other places we have discussed time-delayed logical
> replication, but now we have understood that there is no easy way.
> The following is our analysis.

At this point, I do not plan to develop the PoC any further unless a better idea
or infrastructure comes along.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-06-12 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

At PGCon and other places we have discussed time-delayed logical replication,
but now we have understood that there is no easy way. The following is our
analysis.

# Abstract

To implement time-delayed logical replication in a more proper way,
the worker must serialize all the received messages into permanent files.
But PostgreSQL does not have good infrastructure for this purpose, so a huge
engineering effort is needed.

## Review: problems with the without-file approach

In the without-file approach, the apply worker process sleeps while delaying
the application.
This approach was chosen in earlier versions like [1], but it has problems,
which were shared by Sawada-san [2]. They can lead to a PANIC error due to the
disk becoming full.

 A) WALs cannot be recycled on the publisher because they are not flushed on
 the subscriber.
 B) Moreover, vacuum cannot remove dead tuples on the publisher.
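As a minimal sketch of the without-file approach (illustrative only; the
function name and second-granularity are assumptions, not code from the PoC),
the apply worker merely computes how much longer it must sleep before applying
a transaction, from the publisher-side commit time and min_apply_delay:

```c
#include <time.h>

/* How long (in seconds) the worker must still sleep before applying a
 * transaction whose publisher-side commit time is commit_time, given
 * the configured min_apply_delay. */
static long remaining_delay(time_t commit_time, long min_apply_delay,
                            time_t now)
{
    time_t wake_at = commit_time + min_apply_delay;

    return (now >= wake_at) ? 0 : (long) (wake_at - now);
}
```

While sleeping, the worker neither applies changes nor confirms the flush
position back to the publisher, which is exactly what causes problems A) and B).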

## Alternative approach: serializing messages to files

To prevent these issues, the worker should serialize all incoming messages
to a permanent file, like what the physical walreceiver does.
Here, messages are first written into files at the beginning of a transaction
and then flushed at its end.
This approach can solve problems A) and B), but it still has many
considerations and difficulties.

### How to separate messages into files?

There are two possibilities for dividing messages into files, but neither of 
them is ideal.

1. Create a file per received transaction.

In this case, a file is removed after the delay period has elapsed and the
transaction has been applied.
This is the simplest approach, but the number of files can bloat.

2. Use one large file or segmented files (like WAL).

This can reduce the number of files, but we must consider further things:

 A) Purge - We must purge applied transactions, but we do not have a good way
 to remove one transaction from the large file.

 B) 2PC - It is likely that the segment which contains the actual transaction
 differs from the segment which contains its COMMIT PREPARED.
 Hence the worker must check all the segments to find the actual messages.

 C) Streamed in-progress transactions - chunks of a transaction are spread
 across several segments. Hence the worker must check all the segments to find
 the chunk messages, same as above.
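A simplified model of the purge condition in option 2 (field and function names
are assumptions, not from the PoC): a segment may be unlinked only once every
transaction written into it has been applied, which is why the prepared and
streamed transactions of points B) and C) pin segments for a long time.

```c
#include <stdbool.h>

#define MAX_XACTS 16

typedef struct Segment
{
    int  nxacts;                /* transactions written into this segment */
    bool applied[MAX_XACTS];    /* per-transaction "applied" flag */
} Segment;

/* A segment can be removed only when all its transactions are applied. */
static bool segment_removable(const Segment *seg)
{
    for (int i = 0; i < seg->nxacts; i++)
        if (!seg->applied[i])
            return false;       /* one unapplied xact pins the whole file */
    return true;
}
```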

### Handling the case where the file exceeds the size limit

Regardless of which option is chosen above, there is a possibility that the
file size exceeds the file system's limit, because the publisher can send
transactions of any length.
PostgreSQL provides a mechanism for working with such large files - the
BufFile data structure - but it cannot be used as-is, for several reasons:

 A) It only supports buffered I/O. A read or write of the low-level File
 occurs only when the buffer is filled or emptied, so we cannot control when
 data is persisted.

 B) It can be used only for temporary purposes. Internally, BufFile creates
 physical files under the $PGDATA/base/pgsql_tmp directories, and files in
 that subdirectory are removed when the postmaster restarts.

 C) It does not have a mechanism for restoring information after a restart.
 BufFile contains virtual positions such as file index and offset, but these
 fields are stored in a memory structure, so the BufFile forgets the ordering
 of its files and its initial/final position after a restart.

 D) It cannot remove part of a virtual file. Even if a large file is separated
 into multiple physical files and all transactions in one physical file are
 already applied, BufFile cannot remove only that part.
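To illustrate point A), a permanent spill file would need the caller, not the
buffering layer, to decide when data reaches stable storage. A sketch under
that assumption (spill_write/spill_commit are invented names, not a proposed
API): records are buffered while a transaction streams in, and only commit
forces them to disk.

```c
#include <stdio.h>
#include <unistd.h>

/* Append one change record; length-prefixed so a reader can re-find
 * record boundaries after a restart. */
static int spill_write(FILE *f, const void *rec, size_t len)
{
    if (fwrite(&len, sizeof(len), 1, f) != 1)
        return -1;
    if (fwrite(rec, 1, len, f) != len)
        return -1;
    return 0;
}

/* At commit, push the stdio buffer to the kernel and force it to
 * stable storage -- the persistence point is under our control. */
static int spill_commit(FILE *f)
{
    if (fflush(f) != 0)
        return -1;
    return fsync(fileno(f));
}
```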

[1]: 
https://www.postgresql.org/message-id/f026292b-c9ee-472e-beaa-d32c5c3a2ced%40www.fastmail.com
[2]: 
https://www.postgresql.org/message-id/CAD21AoAeG2+RsUYD9+mEwr8-rrt8R1bqpe56T2D=euo-qs-...@mail.gmail.com

Acknowledgement:

Amit, Peter, Sawada-san
Thank you for discussing with me off-list.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-22 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thank you for giving suggestions.

> > Dear hackers,
> >
> > I rebased and refined my PoC. Followings are the changes:
> >
> 
> 1. Is my understanding correct that this patch creates the delay files
> for each transaction? If so, did you consider other approaches such as
> using one file to avoid creating many files?

I have been analyzing the approach which uses only one file per subscription,
per your suggestion. Currently I'm not sure whether it is a good approach or
not, so could you please give me feedback?

TL;DR: rotating segment files like WAL may be usable, but there are several
issues.

# Assumptions

* Streamed txns are also serialized to the same permanent file, in the order
received.
* No additional sorting is considered.

# Considerations

As a premise, applied txns must be removed from the files; otherwise the disk
eventually becomes full, which leads to a PANIC.

## Naive approach - serialize all the changes to one large file

If workers naively continue to append received changes, it may be difficult to
purge applied txns because there seems to be no good way to truncate the
first part of a file. I could not find related functions in fd.h.
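A quick demonstration of the problem (a standalone sketch, not server code):
ftruncate() can only drop the *tail* of a file, while purging applied txns
would require dropping the *head*, for which there is no portable call. The
function below writes an "applied" part followed by a "pending" part, truncates
to the length of the first part, and returns what survived.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static const char *truncate_keeps_head(char *out, size_t outlen)
{
    FILE *f = tmpfile();

    fputs("OLD-APPLIED|new-pending", f);
    fflush(f);
    ftruncate(fileno(f), 11);           /* 11 = strlen("OLD-APPLIED") */
    rewind(f);

    size_t n = fread(out, 1, outlen - 1, f);
    out[n] = '\0';
    fclose(f);
    return out;     /* the part we wanted to purge is what remains */
}
```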

## Alternative approach - separate the file into segments

An alternative approach I came up with is to divide the file into segments
- like WAL - and remove a segment once all txns written to it have been
applied. It may work well in the non-streaming, 1PC case, but may not in
other cases.
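The rotation itself can be sketched as follows (the 16 MB segment size is an
assumption for illustration): when a record would overflow the current
segment, writing moves on to the next one, and a whole segment can later be
unlinked once all its txns are applied.

```c
#define SEG_SIZE (16L * 1024 * 1024)    /* assumed segment size */

/* Returns the segment number the record of reclen bytes lands in,
 * updating *written (bytes used in the current segment). */
static unsigned advance_segment(unsigned segno, long *written, long reclen)
{
    if (*written + reclen > SEG_SIZE)
    {
        segno++;            /* current segment is full: rotate */
        *written = 0;
    }
    *written += reclen;
    return segno;
}
```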

### Regarding the PREPARE transactions

At that time, it is likely that the segment which contains the actual txn
differs from the segment which contains its COMMIT PREPARED. Hence the worker
must check all the remaining segments to find the actual messages. Isn't
that inefficient? There is another approach in which workers apply the PREPARE
immediately and spill only the COMMIT PREPARED to file, but in this case the
worker would keep holding the acquired locks until the delay is done.

### Regarding the streamed transactions

As for the streaming case, chunks of txns are spread across several segments.
Hence the worker must check all the remaining segments to find the chunk
messages, same as above. Isn't that inefficient too?

Additionally, segments which contain prepared or streamed transactions cannot
be removed, so in such cases many files may be generated and retained.

Anyway, it may be difficult to accept streaming in-progress transactions while
delaying the application. IIUC the motivation of streaming is to reduce the lag
between nodes, which is the opposite of this feature. So it might be okay to
disallow the combination; I'm not sure.

### Regarding the publisher - the timing to send schema may be fuzzy

Another issue is that the timing when the publisher sends schema information
cannot be determined on the publisher itself. As discussed on -hackers, the
publisher must send schema information once per segment file, but segmentation
is controlled on the subscriber side.
I think the walsender cannot recognize segment switches and therefore cannot
know when to send the schema.

That's it. I'm very happy to get any ideas.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-11 Thread Amit Kapila
On Fri, May 12, 2023 at 10:33 AM Amit Kapila  wrote:
>
> On Fri, May 12, 2023 at 7:38 AM Masahiko Sawada  wrote:
> >
> > On Thu, May 11, 2023 at 2:04 PM Amit Kapila  wrote:
> > >
> > > On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu)
> > >  wrote:
> > > >
> > > > Dear hackers,
> > > >
> > > > I rebased and refined my PoC. Followings are the changes:
> > > >
> > >
> > > 1. Is my understanding correct that this patch creates the delay files
> > > for each transaction? If so, did you consider other approaches such as
> > > using one file to avoid creating many files?
> > > 2. For streaming transactions, first the changes are written in the
> > > temp file and then moved to the delay file. It seems like there is a
> > > double work. Is it possible to unify it such that when min_apply_delay
> > > is specified, we just use the delay file without sacrificing the
> > > advantages like stream sub-abort can truncate the changes?
> > > 3. Ideally, there shouldn't be a performance impact of this feature on
> > > regular transactions because the delay file is created only when
> > > min_apply_delay is active but better to do some testing of the same.
> > >
> >
> > In addition to the points Amit raised, if the 'required_schema' option
> > is specified in START_REPLICATION, the publisher sends schema
> > information for every change. I think it leads to significant
> > overhead. Did you consider alternative approaches such as sending the
> > schema information for every transaction or the subscriber requests
> > the publisher to send it?
> >
>
> Why do we need this new flag? I can't see any comments in the related
> code which explain its need.
>

So as per the email [1], this would be required after the subscriber
restart. I guess we ideally need it once per delay file (considering
that we have one file for all delayed xacts). In the worst case, we
can have it per transaction as suggested by Sawada-San.

[1] - 
https://www.postgresql.org/message-id/TYAPR01MB5866568A5C1E71338328B20CF5629%40TYAPR01MB5866.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-11 Thread Amit Kapila
On Fri, May 12, 2023 at 7:38 AM Masahiko Sawada  wrote:
>
> On Thu, May 11, 2023 at 2:04 PM Amit Kapila  wrote:
> >
> > On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > > Dear hackers,
> > >
> > > I rebased and refined my PoC. Followings are the changes:
> > >
> >
> > 1. Is my understanding correct that this patch creates the delay files
> > for each transaction? If so, did you consider other approaches such as
> > using one file to avoid creating many files?
> > 2. For streaming transactions, first the changes are written in the
> > temp file and then moved to the delay file. It seems like there is a
> > double work. Is it possible to unify it such that when min_apply_delay
> > is specified, we just use the delay file without sacrificing the
> > advantages like stream sub-abort can truncate the changes?
> > 3. Ideally, there shouldn't be a performance impact of this feature on
> > regular transactions because the delay file is created only when
> > min_apply_delay is active but better to do some testing of the same.
> >
>
> In addition to the points Amit raised, if the 'required_schema' option
> is specified in START_REPLICATION, the publisher sends schema
> information for every change. I think it leads to significant
> overhead. Did you consider alternative approaches such as sending the
> schema information for every transaction or the subscriber requests
> the publisher to send it?
>

Why do we need this new flag? I can't see any comments in the related
code which explain its need.

> > Overall, I think such an approach can address comments by Sawada-San
> > [1] but not sure if Sawada-San or others have any better ideas to
> > achieve this feature. It would be good to see what others think of
> > this approach.
> >
>
> I agree with this approach.
>
> When it comes to the idea of writing logical changes to permanent
> files, I think it would also be a good idea (and perhaps could be a
> building block of this feature) that we write streamed changes to a
> permanent file so that the apply worker can retry to apply them
> without retrieving the same changes again from the publisher.
>

I think we anyway won't be able to send confirmation till we write or
process the commit. If it gets interrupted anytime in between we need
to get all the changes again. I think using Fileset with temp files
has quite a few advantages for streaming as are noted in the header
comments of worker.c. We can investigate to replace that with
permanent files but I don't see that the advantages outweigh the
change. Also, after parallel apply, I am expecting, most users would
prefer that mode for large transactions, so making changes in the
serialized path doesn't seem like a good idea to me.

Having said that, I also thought that it would be a good idea if both
streaming and time-delayed can use the same code path in some way
w.r.t writing to files but couldn't come up with any good idea without
more downsides. I see that Kuroda-San has tried to keep the code path
isolated for this feature but still see that one can question the
implementation approach.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-11 Thread Masahiko Sawada
On Fri, May 12, 2023 at 12:48 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > > Overall, I think such an approach can address comments by Sawada-San
> > > [1] but not sure if Sawada-San or others have any better ideas to
> > > achieve this feature. It would be good to see what others think of
> > > this approach.
> > >
> >
> > I agree with this approach.
> >
> > When it comes to the idea of writing logical changes to permanent
> > files, I think it would also be a good idea (and perhaps could be a
> > building block of this feature) that we write streamed changes to a
> > permanent file so that the apply worker can retry to apply them
> > without retrieving the same changes again from the publisher.
>
> I'm very relieved to hear that.
> One question: did you mean that serializing changes into permanent files
> can be extended to the non-delay case as well? I think I will first handle
> delayed replication, and then we can consider that later.

What I was thinking of is that we implement non-delay cases (only for
streamed transactions) and then extend it to delay cases (i.e. adding
non-streamed transaction support and the delay mechanism). It might be
helpful if this patch becomes large and this approach can enable us to
reduce the complexity or divide the patch. That being said, I've not
considered this approach enough yet and it's just an idea. Extending
this feature to non-delay cases later also makes sense to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-11 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Sawada-san,

Thank you for replying!

> On Thu, May 11, 2023 at 2:04 PM Amit Kapila  wrote:
> >
> > On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > > Dear hackers,
> > >
> > > I rebased and refined my PoC. Followings are the changes:
> > >
> >
> > 1. Is my understanding correct that this patch creates the delay files
> > for each transaction? If so, did you consider other approaches such as
> > using one file to avoid creating many files?
> > 2. For streaming transactions, first the changes are written in the
> > temp file and then moved to the delay file. It seems like there is a
> > double work. Is it possible to unify it such that when min_apply_delay
> > is specified, we just use the delay file without sacrificing the
> > advantages like stream sub-abort can truncate the changes?
> > 3. Ideally, there shouldn't be a performance impact of this feature on
> > regular transactions because the delay file is created only when
> > min_apply_delay is active but better to do some testing of the same.
> >
> 
> In addition to the points Amit raised, if the 'required_schema' option
> is specified in START_REPLICATION, the publisher sends schema
> information for every change. I think it leads to significant
> overhead. Did you consider alternative approaches such as sending the
> schema information for every transaction or the subscriber requests
> the publisher to send it?

Thanks for giving your opinions. Except for suggestion 2, I had not considered
them before.
I will analyze them and share my opinion later.
As for 2, I chose that style in order to simplify the source code, but I'm now
planning to follow the suggestions.

> > Overall, I think such an approach can address comments by Sawada-San
> > [1] but not sure if Sawada-San or others have any better ideas to
> > achieve this feature. It would be good to see what others think of
> > this approach.
> >
> 
> I agree with this approach.
> 
> When it comes to the idea of writing logical changes to permanent
> files, I think it would also be a good idea (and perhaps could be a
> building block of this feature) that we write streamed changes to a
> permanent file so that the apply worker can retry to apply them
> without retrieving the same changes again from the publisher.

I'm very relieved to hear that.
One question: did you mean that serializing changes into permanent files
can be extended to the non-delay case as well? I think I will first handle
delayed replication, and then we can consider that later.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-11 Thread Masahiko Sawada
On Thu, May 11, 2023 at 2:04 PM Amit Kapila  wrote:
>
> On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear hackers,
> >
> > I rebased and refined my PoC. Followings are the changes:
> >
>
> 1. Is my understanding correct that this patch creates the delay files
> for each transaction? If so, did you consider other approaches such as
> using one file to avoid creating many files?
> 2. For streaming transactions, first the changes are written in the
> temp file and then moved to the delay file. It seems like there is a
> double work. Is it possible to unify it such that when min_apply_delay
> is specified, we just use the delay file without sacrificing the
> advantages like stream sub-abort can truncate the changes?
> 3. Ideally, there shouldn't be a performance impact of this feature on
> regular transactions because the delay file is created only when
> min_apply_delay is active but better to do some testing of the same.
>

In addition to the points Amit raised, if the 'required_schema' option
is specified in START_REPLICATION, the publisher sends schema
information for every change. I think it leads to significant
overhead. Did you consider alternative approaches such as sending the
schema information for every transaction or the subscriber requests
the publisher to send it?

> Overall, I think such an approach can address comments by Sawada-San
> [1] but not sure if Sawada-San or others have any better ideas to
> achieve this feature. It would be good to see what others think of
> this approach.
>

I agree with this approach.

When it comes to the idea of writing logical changes to permanent
files, I think it would also be a good idea (and perhaps could be a
building block of this feature) that we write streamed changes to a
permanent file so that the apply worker can retry to apply them
without retrieving the same changes again from the publisher.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-10 Thread Amit Kapila
On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear hackers,
>
> I rebased and refined my PoC. Followings are the changes:
>

1. Is my understanding correct that this patch creates the delay files
for each transaction? If so, did you consider other approaches such as
using one file to avoid creating many files?
2. For streaming transactions, first the changes are written in the
temp file and then moved to the delay file. It seems like there is a
double work. Is it possible to unify it such that when min_apply_delay
is specified, we just use the delay file without sacrificing the
advantages like stream sub-abort can truncate the changes?
3. Ideally, there shouldn't be a performance impact of this feature on
regular transactions because the delay file is created only when
min_apply_delay is active but better to do some testing of the same.

Overall, I think such an approach can address comments by Sawada-San
[1] but not sure if Sawada-San or others have any better ideas to
achieve this feature. It would be good to see what others think of
this approach.

[1] - 
https://www.postgresql.org/message-id/CAD21AoAeG2%2BRsUYD9%2BmEwr8-rrt8R1bqpe56T2D%3DeuO-Qs-GAg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-10 Thread Hayato Kuroda (Fujitsu)
Dear Amit-san, Bharath,

Thank you for giving your opinion!

> Some of the other solutions like MySQL also have this feature. See
> [2], you can also read the other use cases in that article. It seems
> pglogical has this feature and there is a customer demand for the same
> [3]

Additionally, Db2 [1] seems to have a similar feature. If we extend the scope
to DBaaS offerings, RDS for MySQL [2] and TencentDB [3] have it as well. These
may indicate the need for delayed replication.

[1]: 
https://www.ibm.com/docs/en/db2/11.5?topic=parameters-hadr-replay-delay-hadr-replay-delay
[2]: 
https://aws.amazon.com/jp/blogs/database/recover-from-a-disaster-with-delayed-replication-in-amazon-rds-for-mysql/
[3]: https://www.tencentcloud.com/document/product/236/41085

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-10 Thread Amit Kapila
On Wed, May 10, 2023 at 5:35 PM Bharath Rupireddy
 wrote:
>
> On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear hackers,
> >
> > I rebased and refined my PoC. Followings are the changes:
>
> Thanks.
>
> Apologies for being late here. Please bear with me if I'm repeating
> any of the discussed points.
>
> I'm mainly trying to understand the production level use-case behind
> this feature, and for that matter, recovery_min_apply_delay. AFAIK,
> people try to keep the replication lag as minimum as possible i.e.
> near zero to avoid the extreme problems on production servers - wal
> file growth, blocked vacuum, crash and downtime.
>
> The proposed feature commit message and existing docs about
> recovery_min_apply_delay justify the reason as 'offering opportunities
> to correct data loss errors'. If someone wants to enable
> recovery_min_apply_delay/min_apply_delay on production servers, I'm
> guessing their values will be in hours, not in minutes; for the simple
> reason that when a data loss occurs, people/infrastructure monitoring
> postgres need to know it first and need time to respond with
> corrective actions to recover data loss. When these parameters are
> set, the primary server mustn't be generating too much WAL to avoid
> eventual crash/downtime. Who would really want to be so defensive
> against somebody who may or may not accidentally cause data loss and
> enable these features on production servers (especially when these can
> take down the primary server) and live happily with the induced
> replication lag?
>
> AFAIK, PITR is what people use for recovering from data loss errors in
> production.
>

I think PITR is not a preferred way to achieve this because it can be
quite time-consuming. See how Gitlab[1] uses delayed replication in
PostgreSQL. This is one of the use cases I came across but I am sure
there will be others as well, otherwise, we would not have introduced
this feature in the first place.

Some of the other solutions like MySQL also have this feature. See [2]; you
can also read about other use cases in that article. It seems pglogical has
this feature, and there is customer demand for the same [3].

> IMO, before we even go implement the apply delay feature for logical
> replication, it's worth to understand if induced replication lags have
> any production level significance.
>

I think the main thing here is to come up with the right design to
implement this feature. In the last release, we found some blocking
problems with the proposed patch at that time but Kuroda-San came up
with a new patch with a different design based on the discussion here.
I haven't looked at it yet though.


[1] - 
https://about.gitlab.com/blog/2019/02/13/delayed-replication-for-disaster-recovery-with-postgresql/
[2] - https://dev.mysql.com/doc/refman/8.0/en/replication-delayed.html
[3] - 
https://www.postgresql.org/message-id/73b06a32-56ab-4056-86ff-e307f3c316f1%40www.fastmail.com

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-10 Thread Bharath Rupireddy
On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear hackers,
>
> I rebased and refined my PoC. Followings are the changes:

Thanks.

Apologies for being late here. Please bear with me if I'm repeating
any of the discussed points.

I'm mainly trying to understand the production-level use case behind
this feature, and for that matter, recovery_min_apply_delay. AFAIK,
people try to keep the replication lag as small as possible, i.e.
near zero, to avoid extreme problems on production servers - WAL
file growth, blocked vacuum, crashes and downtime.

The proposed feature commit message and existing docs about
recovery_min_apply_delay justify the reason as 'offering opportunities
to correct data loss errors'. If someone wants to enable
recovery_min_apply_delay/min_apply_delay on production servers, I'm
guessing their values will be in hours, not in minutes; for the simple
reason that when a data loss occurs, people/infrastructure monitoring
postgres need to know it first and need time to respond with
corrective actions to recover data loss. When these parameters are
set, the primary server mustn't be generating too much WAL to avoid
eventual crash/downtime. Who would really want to be so defensive
against somebody who may or may not accidentally cause data loss and
enable these features on production servers (especially when these can
take down the primary server) and live happily with the induced
replication lag?

AFAIK, PITR is what people use for recovering from data loss errors in
production.

IMO, before we even go implement the apply delay feature for logical
replication, it's worth understanding whether induced replication lag has
any production-level significance. We can also debate whether providing
apply-delay hooks, with simple out-of-the-box extensions, is any better
than the core providing these features.

Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-10 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

Based on the discussion, in which Sawada-san pointed out [1] that the current
approach to logical time-delay prevents recycling WALs, I'm planning to close
the CF entry for now.
This thread or a forked one will be registered again after deciding on an
alternative approach. Thank you very much for the time you took to join our
discussions earlier.

I think that to solve the issue, logical changes must be flushed on the
subscriber once, and workers apply the changes after the specified time has
passed. The straightforward approach for this is to follow physical
replication - introduce a walreceiver process on the subscriber. We must
research more, but at least there are some benefits:

* The publisher can be shut down even if the apply worker is stuck. Getting
  stuck is more likely here than in physical replication, so this may improve
  robustness. For more detail, please see another thread [2].
* In the case of synchronous_commit = 'remote_write', the publisher can COMMIT
  faster. This is because the walreceiver will flush changes immediately and
  reply soon. Even if time-delay is enabled, the wait time will not increase.
* It may be used as infrastructure for parallel apply of non-streaming
  transactions. The basic design is similar - one process receives changes and
  others apply them.

I searched old discussions [3] and wiki pages, and found that the initial
prototype had a logical walreceiver, but in a later version [4] the apply
worker received changes directly. I could not find the reason for the
decision, but I suspect the following reasons. Could you please tell me the
correct background?

* Performance bottlenecks. If the walreceiver flushes changes and the worker
  applies them, fsync() is called for every reception.
* Complexity. In this design the walreceiver and apply worker must share the
  progress of flush/apply. For crash recovery, more consideration is needed.
  The related discussion can be found in [5].
* Extensibility. In-core logical replication should be a sample for external
  projects. The apply worker is just a background worker that can be launched
  from an extension, so it can be easily understood. If it depended deeply on
  the walreceiver, other projects could not follow it.

[1]: 
https://www.postgresql.org/message-id/CAD21AoAeG2%2BRsUYD9%2BmEwr8-rrt8R1bqpe56T2D%3DeuO-Qs-GAg%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/flat/TYAPR01MB586668E50FC2447AD7F92491F5E89%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[3]: 
https://www.postgresql.org/message-id/201206131327.24092.andres%402ndquadrant.com
[4]: 
https://www.postgresql.org/message-id/37e19ad5-f667-2fe2-b95b-bba69c5b6...@2ndquadrant.com
[5]: 
https://www.postgresql.org/message-id/1339586927-13156-12-git-send-email-andres%402ndquadrant.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-09 Thread Amit Kapila
On Thu, Mar 9, 2023 at 2:56 PM Kyotaro Horiguchi
 wrote:
>
> At Thu, 9 Mar 2023 11:00:46 +0530, Amit Kapila  
> wrote in
> > On Wed, Mar 8, 2023 at 9:20 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Mar 8, 2023 at 3:30 AM Nathan Bossart  
> > > wrote:
> > > >
> > >
> > > >  IMO the current set of
> > > > trade-offs (e.g., unavoidable bloat and WAL buildup) would make this
> > > > feature virtually unusable for a lot of workloads, so it's probably 
> > > > worth
> > > > exploring an alternative approach.
> > >
> > > It might require more engineering effort for alternative approaches
> > > such as one I proposed but the feature could become better from the
> > > user perspective. I also think it would be worth exploring it if we've
> > > not yet.
> > >
> >
> > Fair enough. I think as of now most people think that we should
> > consider alternative approaches for this feature. The two ideas at a
>
> If we can notify subscriber of the transaction start time, will that
> solve the current problem?
>

I don't think that will solve the current problem because the problem
is related to confirming back the flush LSN (commit LSN) to the
publisher which we do only after we commit the delayed transaction.
Due to this, we are not able to advance WAL(restart_lsn)/XMIN on the
publisher which causes an accumulation of WAL and does not allow the
vacuum to remove deleted rows. Do you have something else in mind
which makes you think that it can solve the problem?
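The constraint can be modeled very roughly as follows (a simplified sketch,
not actual server code; the two-subscription array is purely illustrative):
the publisher may recycle WAL only below the oldest confirmed flush LSN across
its subscriptions, so a subscriber that withholds confirmation while delaying
pins that floor in place.

```c
typedef unsigned long long LSN;

/* WAL strictly below the returned LSN may be removed; everything at or
 * above it must be kept for the slowest (e.g. delayed) subscription. */
static LSN wal_recyclable_upto(const LSN *confirmed_flush, int n)
{
    LSN oldest = confirmed_flush[0];

    for (int i = 1; i < n; i++)
        if (confirmed_flush[i] < oldest)
            oldest = confirmed_flush[i];
    return oldest;
}
```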

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-09 Thread Kyotaro Horiguchi
At Thu, 9 Mar 2023 11:00:46 +0530, Amit Kapila  wrote 
in 
> On Wed, Mar 8, 2023 at 9:20 AM Masahiko Sawada  wrote:
> >
> > On Wed, Mar 8, 2023 at 3:30 AM Nathan Bossart  
> > wrote:
> > >
> >
> > >  IMO the current set of
> > > trade-offs (e.g., unavoidable bloat and WAL buildup) would make this
> > > feature virtually unusable for a lot of workloads, so it's probably worth
> > > exploring an alternative approach.
> >
> > It might require more engineering effort for alternative approaches
> > such as one I proposed but the feature could become better from the
> > user perspective. I also think it would be worth exploring it if we've
> > not yet.
> >
> 
> Fair enough. I think as of now most people think that we should
> consider alternative approaches for this feature. The two ideas at a

If we can notify subscriber of the transaction start time, will that
solve the current problem?  If not, or if it is not possible, +1 to
look for other solutions.

> high level are that the apply worker itself first flushes the decoded
> WAL (maybe only when time-delay is configured) or have a separate
> walreceiver process as we have for standby. I think we need to analyze
> the pros and cons of each of those approaches and see if they would be
> useful even for other things on the apply side.

My understanding of the requirements here is that the publisher should
not hold changes, the subscriber should not hold data reads, and all
transactions including two-phase ones should be applied at once upon
committing.  Both sides need to respond to the requests from the other
side.  We expect apply-delay of several hours or more. My thoughts
considering the requirements are as follows:

If we expect delays of several hours or more, I don't think it's
feasible to stack received changes in the process memory. So, if
apply-delay is in effect, I think it would be better to process
transactions through files regardless of process configuration.

I'm not sure whether we should have a separate process for protocol
processing. On one hand, it would simplify the protocol processing
part, but on the other hand, changes would always have to be applied
through files.  If we plan to integrate the paths with and without
apply-delay by the file-passing method, this might work. If we want to
maintain the current approach when not applying apply-delay, I think
we would have to implement it in a single process, but I feel the
protocol processing part could become complicated.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-08 Thread Amit Kapila
On Wed, Mar 8, 2023 at 9:20 AM Masahiko Sawada  wrote:
>
> On Wed, Mar 8, 2023 at 3:30 AM Nathan Bossart  
> wrote:
> >
>
> >  IMO the current set of
> > trade-offs (e.g., unavoidable bloat and WAL buildup) would make this
> > feature virtually unusable for a lot of workloads, so it's probably worth
> > exploring an alternative approach.
>
> It might require more engineering effort for alternative approaches
> such as one I proposed but the feature could become better from the
> user perspective. I also think it would be worth exploring it if we've
> not yet.
>

Fair enough. I think as of now most people think that we should
consider alternative approaches for this feature. The two ideas at a
high level are that the apply worker itself first flushes the decoded
WAL (maybe only when time-delay is configured) or have a separate
walreceiver process as we have for standby. I think we need to analyze
the pros and cons of each of those approaches and see if they would be
useful even for other things on the apply side.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-07 Thread Masahiko Sawada
On Wed, Mar 8, 2023 at 3:30 AM Nathan Bossart  wrote:
>
> On Mon, Mar 06, 2023 at 07:27:59PM +0300, Önder Kalacı wrote:
> > On the other hand, we already have a similar problem with
> > recovery_min_apply_delay combined with hot_standby_feedback [1].
> > So, that probably is an acceptable trade-off for the pgsql-hackers.
> > If you use this feature, you should be even more careful.
>
> Yes, but it's possible to turn off hot_standby_feedback so that you don't
> incur bloat on the primary.  And you don't need to store hours or days of
> WAL on the primary.

Right. This side effect belongs to the combination of
recovery_min_apply_delay and hot_standby_feedback/replication slot.
recovery_min_apply_delay itself can be used even without this side
effect if we accept other trade-offs. When it comes to this
time-delayed logical replication feature, there is no choice to avoid
the side effect for users who want to use this feature.

> I'm very late to this thread, but IIUC you cannot
> avoid blocking VACUUM with the proposed feature.

Right.

>  IMO the current set of
> trade-offs (e.g., unavoidable bloat and WAL buildup) would make this
> feature virtually unusable for a lot of workloads, so it's probably worth
> exploring an alternative approach.

It might require more engineering effort for alternative approaches
such as one I proposed but the feature could become better from the
user perspective. I also think it would be worth exploring it if we've
not yet.

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-07 Thread Nathan Bossart
On Mon, Mar 06, 2023 at 07:27:59PM +0300, Önder Kalacı wrote:
> On the other hand, we already have a similar problem with
> recovery_min_apply_delay combined with hot_standby_feedback [1].
> So, that probably is an acceptable trade-off for the pgsql-hackers.
> If you use this feature, you should be even more careful.

Yes, but it's possible to turn off hot_standby_feedback so that you don't
incur bloat on the primary.  And you don't need to store hours or days of
WAL on the primary.  I'm very late to this thread, but IIUC you cannot
avoid blocking VACUUM with the proposed feature.  IMO the current set of
trade-offs (e.g., unavoidable bloat and WAL buildup) would make this
feature virtually unusable for a lot of workloads, so it's probably worth
exploring an alternative approach.  In any case, we probably shouldn't rush
this into v16 in its current form.
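For comparison, the physical-replication behaviour referred to above can be configured as follows (a sketch of a delayed standby's settings, not part of the proposed patch):

```
# postgresql.conf on a delayed physical standby:
recovery_min_apply_delay = '4h'   # delay applying WAL by four hours
hot_standby_feedback = off        # standby does not hold back VACUUM on the primary
```

With feedback off, the primary can still vacuum and recycle WAL; the proposed logical-replication delay has no equivalent escape hatch, because the flush LSN itself is held back until the delayed transaction commits.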

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-06 Thread Önder Kalacı
Hi all,

Thanks for working on this.


> I imagine that a typical use case would be to set min_send_delay to
> several hours to days. I'm concerned that it could not be an
> acceptable trade-off for many users that the system cannot collect any
> garbage during that.
>

I'm not too worried about the WAL recycling; that mostly looks like
a documentation issue to me, and it is not a problem that many PG users
are unfamiliar with. Also, even if creating/altering a subscription is one
day relaxed to be allowed for a regular user, one option could be to require
this setting to be changed by a superuser. That would alleviate my concern
regarding WAL recycling: a superuser should be able to monitor the system
and adjust the settings/hardware accordingly.

However, VACUUM being blocked by replication with a configuration
change on the subscription sounds more concerning to me. Blocking
VACUUM for hours could quickly escalate to performance problems.

On the other hand, we already have a similar problem with
recovery_min_apply_delay combined with hot_standby_feedback [1].
So, that probably is an acceptable trade-off for the pgsql-hackers.
If you use this feature, you should be even more careful.


> I think we can have the apply process write the decoded changes
> somewhere on the disk (as not temporary files) and return the flush
> LSN so that the apply worker can apply them later and the publisher
> can advance slot's LSN. The feature would be more complex but from the
> user perspective it would be better.
>

Yes, this might probably be one of the ideal solutions to the problem at
hand. But my current guess is that it'd be a non-trivial change with
different concurrency/failure scenarios. So, I'm not sure if that is going
to be a realistic patch to pursue.


Thanks,
Onder KALACI



[1] PostgreSQL: Documentation: 15: 20.6. Replication



Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-03 Thread Masahiko Sawada
On Thu, Mar 2, 2023 at 1:07 PM Amit Kapila  wrote:
>
> On Thu, Mar 2, 2023 at 7:38 AM Masahiko Sawada  wrote:
> >
> > On Wed, Mar 1, 2023 at 6:21 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > > >
> > > > Apart from a bad-use case example I mentioned, in general, piling up
> > > > WAL files due to the replication slot has many bad effects on the
> > > > system. I'm concerned that the side effect of this feature (at least
> > > > of the current design) is too huge compared to the benefit, and afraid
> > > > that users might end up using this feature without understanding the
> > > > side effect well. It might be okay if we thoroughly document it but
> > > > I'm not sure.
> > >
> > > One approach is to change max_slot_wal_keep_size forcibly when
> > > min_send_delay is set. But that may lead to disabling the slot, because
> > > WAL needed by the time-delayed replication may also be removed. We
> > > cannot pick just the right value because it depends heavily on
> > > min_send_delay and the workload.
> > >
> > > How about throwing a WARNING when min_send_delay > 0 but
> > > max_slot_wal_keep_size < 0? Unlike previous versions, the subscription
> > > parameter min_send_delay will be sent to the publisher, so we can
> > > compare min_send_delay and max_slot_wal_keep_size when the publisher
> > > receives the parameter.
> >
> > Since max_slot_wal_keep_size can be changed by reloading the config
> > file, each walsender warns it also at that time?
> >
>
> I think Kuroda-San wants to emit a WARNING at the time of CREATE
> SUBSCRIPTION. But it won't be possible to emit a WARNING at the time
> of ALTER SUBSCRIPTION. Also, as you say if the user later changes the
> value of max_slot_wal_keep_size, then even if we issue LOG/WARNING in
> walsender, it may go unnoticed. If we really want to give WARNING for
> this then we can probably give it as soon as user has set non-default
> value of min_send_delay to indicate that this can lead to retaining
> WAL on the publisher and they should consider setting
> max_slot_wal_keep_size.
>
> Having said that, I think users can always tune max_slot_wal_keep_size
> and min_send_delay (as none of these requires restart) if they see any
> indication of unexpected WAL size growth. There could be multiple ways
> to check it but I think one can refer wal_status in
> pg_replication_slots, the extended value can be an indicator of this.
>
> > Not sure it's
> > helpful. I think it's a legitimate use case to set min_send_delay > 0
> > and max_slot_wal_keep_size = -1, and users might not even notice the
> > WARNING message.
> >
>
> I think it would be better to tell about this in the docs along with
> the 'min_send_delay' description. The key point is whether this would
> be an acceptable trade-off for users who want to use this feature. I
> think it can harm only if users use this without understanding the
> corresponding trade-off. As we kept the default to no delay, it is
> expected from users using this have an understanding of the trade-off.

I imagine that a typical use case would be to set min_send_delay to
several hours to days. I'm concerned that it could not be an
acceptable trade-off for many users that the system cannot collect any
garbage during that.

I think we can have the apply process write the decoded changes
somewhere on the disk (as not temporary files) and return the flush
LSN so that the apply worker can apply them later and the publisher
can advance slot's LSN. The feature would be more complex but from the
user perspective it would be better.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-02 Thread Hayato Kuroda (Fujitsu)
> Yeah, min_send_delay and max_slot_wal_keep_size should be easily tunable
> because the appropriate value depends on the environment and workload.
> However, pg_replication_slots.wal_status cannot show the exact amount of
> WAL, so it may not be suitable for tuning. I think users can compare
> pg_replication_slots.restart_lsn (or pg_stat_replication.sent_lsn) with
> pg_current_wal_lsn() to calculate the amount of WAL being delayed, like
> 
> ```
> postgres=# select pg_current_wal_lsn() - pg_replication_slots.restart_lsn as
> delayed from pg_replication_slots;
>   delayed
> ------------
>  1689153760
> (1 row)
> ```
> 
> > I think it would be better to tell about this in the docs along with
> > the 'min_send_delay' description. The key point is whether this would
> > be an acceptable trade-off for users who want to use this feature. I
> > think it can harm only if users use this without understanding the
> > corresponding trade-off. As we kept the default to no delay, it is
> > expected from users using this have an understanding of the trade-off.
> 
> Yes, the trade-off should be emphasized.

Based on the understanding, I added them to the doc in new version patch.
Please see [1].

[1]: 
https://www.postgresql.org/message-id/flat/TYAPR01MB586606CF3B585B6F8BE13A9CF5B29%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-02 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> Fair point but I think the current comment should explain why we are
> doing something different here. How about extending the existing
> comments to something like: "If we've requested to shut down, exit the
> process. This is unlike handling at other places where we allow
> complete WAL to be sent before shutdown because we don't want the
> delayed transactions to be applied downstream. This will allow one to
> use the data from downstream in case of some unwanted operations on
> the current node."

Thank you for suggestion. I think it is better, so changed.
Please see new patch at [1]

[1]: 
https://www.postgresql.org/message-id/flat/TYAPR01MB586606CF3B585B6F8BE13A9CF5B29%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-02 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thank you for reviewing! New version can be available at [1].

> 1) Currently we have added the delay during the decode of commit,
> while decoding the commit walsender process will stop decoding any
> further transaction until delay is completed. There might be a
> possibility that a lot of transactions will happen in parallel and
> there will be a lot of transactions to be decoded after the delay is
> completed.
> Will it be possible to decode the WAL if any WAL is generated instead
> of staying idle in the meantime, I'm not sure if this is feasible just
> throwing my thought to see if it might be possible.
> --- a/src/backend/replication/logical/decode.c
> +++ b/src/backend/replication/logical/decode.c
> @@ -676,6 +676,15 @@ DecodeCommit(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf,
> 
> buf->origptr, buf->endptr);
> }
> 
> +   /*
> +* Delay sending the changes if required. For streaming transactions,
> +* this means a delay in sending the last stream but that is OK
> +* because on the downstream the changes will be applied only after
> +* receiving the last stream.
> +*/
> +   if (ctx->min_send_delay > 0 && ctx->delay_send)
> +   ctx->delay_send(ctx, xid, commit_time);
> +

I see your point, but I think that extension can be done in a future version
if needed, because we would have to change several parts and introduce some
complexity.

If we have decoded changes but do not want to send them yet, we must store
them in memory and skip sending. To do that we must add a new data structure,
add another path in DecodeCommit() and DecodePrepare() so as not to send the
changes, and change WalSndLoop() and other functions to send the pending
changes later. Even these may not be sufficient.

For now I think the above is not needed; we can do it later if the overhead
of decoding turns out to be quite large and we must handle it very
efficiently.

> 2) Generally single line comments are not terminated by ".", The
> comment "/* Sleep until appropriate time. */" can be changed
> appropriately:
> +
> +   /* Sleep until appropriate time. */
> +   timeout_sleeptime_ms = WalSndComputeSleeptime(now);
> +
> +   elog(DEBUG2, "time-delayed replication for txid %u,
> delay_time = %d ms, remaining wait time: %ld ms",
> +xid, (int) ctx->min_send_delay,
> remaining_wait_time_ms);
> +
> +   /* Sleep until we get reply from worker or we time out */
> +   WalSndWait(WL_SOCKET_READABLE,

Right, removed.

> 3) In some places we mention as min_send_delay and in some places we
> mention it as time-delayed replication, we can keep the comment
> consistent by using the similar wordings.
> +-- fail - specifying streaming = parallel with time-delayed replication is 
> not
> +-- supported
> +CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
> false, streaming = parallel, min_send_delay = 123);
> 
> +-- fail - alter subscription with streaming = parallel should fail when
> +-- time-delayed replication is set
> +ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);
> 
> +-- fail - alter subscription with min_send_delay should fail when
> +-- streaming = parallel is set

"time-delayed replication" was removed.

> 4) Since the value is stored in ms, we need not add ms again as the
> default value is in ms:
> @@ -4686,6 +4694,9 @@ dumpSubscription(Archive *fout, const
> SubscriptionInfo *subinfo)
> if (strcmp(subinfo->subsynccommit, "off") != 0)
> appendPQExpBuffer(query, ", synchronous_commit = %s",
> fmtId(subinfo->subsynccommit));
> 
> +   if (subinfo->subminsenddelay > 0)
> +   appendPQExpBuffer(query, ", min_send_delay = '%d ms'",
> subinfo->subminsenddelay);

Right, fixed.

> 5) we can use the new error reporting style:
> 5.a) brackets around errcode can be removed
> +   ereport(ERROR,
> +
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +errmsg("invalid value for parameter
> \"%s\": \"%s\"",
> +   "min_send_delay",
> input_string),
> +hintmsg ? errhint("%s", _(hintmsg)) : 0));
> 
> 5.b) Similarly here too;
> +   if (result < 0 || result > PG_INT32_MAX)
> +   ereport(ERROR,
> +
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +errmsg("%d ms is outside the valid
> range for parameter \"%s\" (%d .. %d)",
> +   result,
> +   "min_send_delay",
> +   0, PG_INT32_MAX)));
> 
> 5.c) Similarly here too;
> +   delay_val = strtoul(strVal(defel->arg), , 10);
> +   if (errno != 0 || *endptr != '\0')
> +   ereport(ERROR,
> +
> 

RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-02 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thank you for reviewing! PSA new version.

> 1.
> Nitpick. The new text is jagged-looking. It should wrap at ~80 chars.

Addressed.

> 
> 2.
> 2. Another reason is for that parallel streaming, the transaction will be 
> opened
> immediately by the parallel apply worker. Therefore, if the walsender
> is delayed in sending the final record of the transaction, the
> parallel apply worker must wait to receive it with an open
> transaction. This would result in the locks acquired during the
> transaction not being released until the min_send_delay has elapsed.
> 
> ~
> 
> The text already said there are "two reasons", and already this is
> numbered as reason 2. So it doesn't need to keep saying "Another
> reason" here.
> 
> "Another reason is for that parallel streaming" --> "For parallel 
> streaming..."

Changed.

> ==
> src/backend/replication/walsender.c
> 
> 3. WalSndDelay
> 
> + /* die if timeout was reached */
> + WalSndCheckTimeOut();
> 
> Other nearby comments start uppercase, so this should too.

I just copied it from another part where the comments are lowercase, but fixed.

> ==
> src/include/replication/walreceiver.h
> 
> 4. WalRcvStreamOptions
> 
> @@ -187,6 +187,7 @@ typedef struct
>   * prepare time */
>   char*origin; /* Only publish data originating from the
>   * specified origin */
> + int32 min_send_delay; /* The minimum send delay */
>   } logical;
>   } proto;
>  } WalRcvStreamOptions;
> 
> ~
> 
> Should that comment mention the units are "(ms)"

Added.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Attachment: v10-0001-Time-delayed-logical-replication-on-publisher-si.patch


RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-01 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Sawada-san,

> I think Kuroda-San wants to emit a WARNING at the time of CREATE
> SUBSCRIPTION. But it won't be possible to emit a WARNING at the time
> of ALTER SUBSCRIPTION. Also, as you say if the user later changes the
> value of max_slot_wal_keep_size, then even if we issue LOG/WARNING in
> walsender, it may go unnoticed. If we really want to give WARNING for
> this then we can probably give it as soon as user has set non-default
> value of min_send_delay to indicate that this can lead to retaining
> WAL on the publisher and they should consider setting
> max_slot_wal_keep_size.

Yeah, my motivation was to emit a WARNING at CREATE SUBSCRIPTION, but I had
not noticed that the approach does not cover ALTER SUBSCRIPTION.

> Having said that, I think users can always tune max_slot_wal_keep_size
> and min_send_delay (as none of these requires restart) if they see any
> indication of unexpected WAL size growth. There could be multiple ways
> to check it but I think one can refer wal_status in
> pg_replication_slots, the extended value can be an indicator of this.

Yeah, min_send_delay and max_slot_wal_keep_size should be easily tunable
because the appropriate value depends on the environment and workload.
However, pg_replication_slots.wal_status cannot show the exact amount of WAL,
so it may not be suitable for tuning. I think users can compare
pg_replication_slots.restart_lsn (or pg_stat_replication.sent_lsn) with
pg_current_wal_lsn() to calculate the amount of WAL being delayed, like

```
postgres=# select pg_current_wal_lsn() - pg_replication_slots.restart_lsn as 
delayed from pg_replication_slots;
  delayed
------------
 1689153760
(1 row)
```

> I think it would be better to tell about this in the docs along with
> the 'min_send_delay' description. The key point is whether this would
> be an acceptable trade-off for users who want to use this feature. I
> think it can harm only if users use this without understanding the
> corresponding trade-off. As we kept the default to no delay, it is
> expected from users using this have an understanding of the trade-off.

Yes, the trade-off should be emphasized.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-01 Thread Amit Kapila
On Thu, Mar 2, 2023 at 7:38 AM Masahiko Sawada  wrote:
>
> On Wed, Mar 1, 2023 at 6:21 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > >
> > > Apart from a bad-use case example I mentioned, in general, piling up
> > > WAL files due to the replication slot has many bad effects on the
> > > system. I'm concerned that the side effect of this feature (at least
> > > of the current design) is too huge compared to the benefit, and afraid
> > > that users might end up using this feature without understanding the
> > > side effect well. It might be okay if we thoroughly document it but
> > > I'm not sure.
> >
> > One approach is to change max_slot_wal_keep_size forcibly when
> > min_send_delay is set. But that may lead to disabling the slot, because
> > WAL needed by the time-delayed replication may also be removed. We cannot
> > pick just the right value because it depends heavily on min_send_delay
> > and the workload.
> >
> > How about throwing a WARNING when min_send_delay > 0 but
> > max_slot_wal_keep_size < 0? Unlike previous versions, the subscription
> > parameter min_send_delay will be sent to the publisher, so we can compare
> > min_send_delay and max_slot_wal_keep_size when the publisher receives the
> > parameter.
>
> Since max_slot_wal_keep_size can be changed by reloading the config
> file, each walsender warns it also at that time?
>

I think Kuroda-San wants to emit a WARNING at the time of CREATE
SUBSCRIPTION. But it won't be possible to emit a WARNING at the time
of ALTER SUBSCRIPTION. Also, as you say if the user later changes the
value of max_slot_wal_keep_size, then even if we issue LOG/WARNING in
walsender, it may go unnoticed. If we really want to give WARNING for
this then we can probably give it as soon as user has set non-default
value of min_send_delay to indicate that this can lead to retaining
WAL on the publisher and they should consider setting
max_slot_wal_keep_size.

Having said that, I think users can always tune max_slot_wal_keep_size
and min_send_delay (as none of these requires restart) if they see any
indication of unexpected WAL size growth. There could be multiple ways
to check it but I think one can refer wal_status in
pg_replication_slots, the extended value can be an indicator of this.
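As a concrete illustration of the tuning mentioned above (the subscription name is hypothetical, and min_send_delay is the parameter proposed by the patch, not an existing option):

```
-- On the publisher: check the slot's WAL status; 'extended' or 'unreserved'
-- indicates WAL is being retained beyond the configured limits.
SELECT slot_name, wal_status FROM pg_replication_slots;

-- Both settings can be changed without a restart:
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';  -- on the publisher
SELECT pg_reload_conf();

-- On the subscriber (parameter from the proposed patch):
ALTER SUBSCRIPTION mysub SET (min_send_delay = '1h');
```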

> Not sure it's
> helpful. I think it's a legitimate use case to set min_send_delay > 0
> and max_slot_wal_keep_size = -1, and users might not even notice the
> WARNING message.
>

I think it would be better to tell about this in the docs along with
the 'min_send_delay' description. The key point is whether this would
be an acceptable trade-off for users who want to use this feature. I
think it can harm only if users use this without understanding the
corresponding trade-off. As we kept the default to no delay, it is
expected from users using this have an understanding of the trade-off.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-01 Thread Masahiko Sawada
On Wed, Mar 1, 2023 at 6:21 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san,
>
> Thank you for giving your consideration!
>
> > >  We have documented at least one such case
> > > already where during Drop Subscription, if the network is not
> > > reachable then also, a similar problem can happen and users need to be
> > > careful about it [1].
> >
> > Apart from a bad-use case example I mentioned, in general, piling up
> > WAL files due to the replication slot has many bad effects on the
> > system. I'm concerned that the side effect of this feature (at least
> > of the current design) is too huge compared to the benefit, and afraid
> > that users might end up using this feature without understanding the
> > side effect well. It might be okay if we thoroughly document it but
> > I'm not sure.
>
> One approach is to change max_slot_wal_keep_size forcibly when
> min_send_delay is set. But that may lead to disabling the slot, because WAL
> needed by the time-delayed replication may also be removed. We cannot pick
> just the right value because it depends heavily on min_send_delay and the
> workload.
>
> How about throwing a WARNING when min_send_delay > 0 but
> max_slot_wal_keep_size < 0? Unlike previous versions, the subscription
> parameter min_send_delay will be sent to the publisher, so we can compare
> min_send_delay and max_slot_wal_keep_size when the publisher receives the
> parameter.

Since max_slot_wal_keep_size can be changed by reloading the config
file, each walsender warns it also at that time? Not sure it's
helpful. I think it's a legitimate use case to set min_send_delay > 0
and max_slot_wal_keep_size = -1, and users might not even notice the
WARNING message.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-01 Thread vignesh C
On Tue, 28 Feb 2023 at 21:21, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,
>
> > Few comments:
>
> Thank you for reviewing! PSA new version.

Thanks for the updated patch, few comments:
1) Currently we have added the delay during the decode of commit,
while decoding the commit walsender process will stop decoding any
further transaction until delay is completed. There might be a
possibility that a lot of transactions will happen in parallel and
there will be a lot of transactions to be decoded after the delay is
completed.
Will it be possible to decode the WAL if any WAL is generated instead
of staying idle in the meantime, I'm not sure if this is feasible just
throwing my thought to see if it might be possible.
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -676,6 +676,15 @@ DecodeCommit(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf,

buf->origptr, buf->endptr);
}

+   /*
+* Delay sending the changes if required. For streaming transactions,
+* this means a delay in sending the last stream but that is OK
+* because on the downstream the changes will be applied only after
+* receiving the last stream.
+*/
+   if (ctx->min_send_delay > 0 && ctx->delay_send)
+   ctx->delay_send(ctx, xid, commit_time);
+

2) Generally single line comments are not terminated by ".", The
comment "/* Sleep until appropriate time. */" can be changed
appropriately:
+
+   /* Sleep until appropriate time. */
+   timeout_sleeptime_ms = WalSndComputeSleeptime(now);
+
+   elog(DEBUG2, "time-delayed replication for txid %u,
delay_time = %d ms, remaining wait time: %ld ms",
+xid, (int) ctx->min_send_delay,
remaining_wait_time_ms);
+
+   /* Sleep until we get reply from worker or we time out */
+   WalSndWait(WL_SOCKET_READABLE,

3) In some places we mention as min_send_delay and in some places we
mention it as time-delayed replication, we can keep the comment
consistent by using the similar wordings.
+-- fail - specifying streaming = parallel with time-delayed replication is not
+-- supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION
'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
false, streaming = parallel, min_send_delay = 123);

+-- fail - alter subscription with streaming = parallel should fail when
+-- time-delayed replication is set
+ALTER SUBSCRIPTION regress_testsub SET (streaming = parallel);

+-- fail - alter subscription with min_send_delay should fail when
+-- streaming = parallel is set

4) Since the value is stored in ms, we need not add ms again as the
default value is in ms:
@@ -4686,6 +4694,9 @@ dumpSubscription(Archive *fout, const
SubscriptionInfo *subinfo)
if (strcmp(subinfo->subsynccommit, "off") != 0)
appendPQExpBuffer(query, ", synchronous_commit = %s",
fmtId(subinfo->subsynccommit));

+   if (subinfo->subminsenddelay > 0)
+   appendPQExpBuffer(query, ", min_send_delay = '%d ms'",
subinfo->subminsenddelay);
+

5) we can use the new error reporting style:
5.a) brackets around errcode can be removed
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("invalid value for parameter
\"%s\": \"%s\"",
+   "min_send_delay", input_string),
+hintmsg ? errhint("%s", _(hintmsg)) : 0));

5.b) Similarly here too;
+   if (result < 0 || result > PG_INT32_MAX)
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("%d ms is outside the valid
range for parameter \"%s\" (%d .. %d)",
+   result,
+   "min_send_delay",
+   0, PG_INT32_MAX)));

5.c) Similarly here too;
+   delay_val = strtoul(strVal(defel->arg), , 10);
+   if (errno != 0 || *endptr != '\0')
+   ereport(ERROR,
+
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("invalid
min_send_delay")));


5.d) Similarly here too;
+   if (delay_val > PG_INT32_MAX)
+   ereport(ERROR,
+
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+
errmsg("min_send_delay \"%s\" out of range",
+
strVal(defel->arg;


6) This can be changed to a single line comment:
+   /*
+* Parse given string as parameter which has millisecond unit
+*/
+   if (!parse_int(input_string, , GUC_UNIT_MS, ))
+   ereport(ERROR,

7) In the expect we have specifically mention "for non-streaming
transaction", is the behavior different for streaming transaction, if
not we can 

RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-03-01 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

Thank you for giving your consideration!

> >  We have documented at least one such case
> > already where during Drop Subscription, if the network is not
> > reachable then also, a similar problem can happen and users need to be
> > careful about it [1].
> 
> Apart from a bad-use case example I mentioned, in general, piling up
> WAL files due to the replication slot has many bad effects on the
> system. I'm concerned that the side effect of this feature (at least
> of the current design) is too huge compared to the benefit, and afraid
> that users might end up using this feature without understanding the
> side effect well. It might be okay if we thoroughly document it but
> I'm not sure.

One approach is to forcibly change max_slot_wal_keep_size when min_send_delay
is set. But that may end up invalidating the slot, because WALs needed by the
time-delayed replication may also be removed. We cannot choose the right value
ourselves because it depends heavily on min_send_delay and the workload.

How about throwing a WARNING when min_send_delay > 0 but
max_slot_wal_keep_size < 0? Unlike in previous versions, the subscription
parameter min_send_delay is now sent to the publisher, so we can compare
min_send_delay and max_slot_wal_keep_size when the publisher receives the
parameter.

Of course we could reject such a setup with ereport(ERROR), but that may leave
behind an abandoned replication slot, because the parameter is sent at
START_REPLICATION and the slot has already been created by then.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Amit Kapila
On Wed, Mar 1, 2023 at 10:57 AM Masahiko Sawada  wrote:
>
> On Wed, Mar 1, 2023 at 1:55 PM Amit Kapila  wrote:
> >
>
> > Won't a malicious user be able to block the
> > replication in other ways as well and let the publisher stall (or
> > crash the publisher) even without setting min_send_delay? Basically,
> > one needs to either disable the subscription or create a
> > constraint-violating row in the table to make that happen. If the
> > system is exposed for arbitrarily allowing the creation of a
> > subscription then a malicious user can create a subscription similar
> > to one existing subscription and block the replication due to
> > constraint violations. I don't think it would be so easy to bypass the
> > current system that a malicious user will be allowed to create/alter
> > subscriptions arbitrarily.
>
> Right. But a difference is that with min_send_delay, it's just to
> create a subscription.
>

But, currently, only superusers are allowed to create subscriptions. Even if
we change that and allow it based on some pre-defined role, it won't be
possible to create a subscription arbitrarily. So, I am not sure whether any
malicious user can easily bypass it as you are envisioning.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Masahiko Sawada
On Wed, Mar 1, 2023 at 1:55 PM Amit Kapila  wrote:
>
> On Wed, Mar 1, 2023 at 8:18 AM Masahiko Sawada  wrote:
> >
> > On Wed, Mar 1, 2023 at 12:51 AM Hayato Kuroda (Fujitsu)
> >  wrote:
> >
> > Thinking of side effects of this feature (no matter where we delay
> > applying the changes), on the publisher, vacuum cannot collect garbage
> > and WAL cannot be recycled. Is that okay in the first place? The point
> > is that the subscription setting affects the publisher. That is,
> > min_send_delay is specified on the subscriber but the symptoms that
> > could ultimately lead to a server crash appear on the publisher, which
> > sounds dangerous to me.
> >
> > Imagine a service or system where there is a publication server
> > and it's somewhat exposed so that a user (or a subsystem) can
> > arbitrarily create a subscriber to replicate a subset of the data. A malicious
> > user can have the publisher crash by creating a subscription with,
> > say, min_send_delay = 20d. max_slot_wal_keep_size helps this situation
> > but it's -1 by default.
> >
>
> By publisher crash, do you mean due to the disk full situation, it can
> lead the publisher to stop/panic?

Exactly.

> Won't a malicious user be able to block the
> replication in other ways as well and let the publisher stall (or
> crash the publisher) even without setting min_send_delay? Basically,
> one needs to either disable the subscription or create a
> constraint-violating row in the table to make that happen. If the
> system is exposed for arbitrarily allowing the creation of a
> subscription then a malicious user can create a subscription similar
> to one existing subscription and block the replication due to
> constraint violations. I don't think it would be so easy to bypass the
> current system that a malicious user will be allowed to create/alter
> subscriptions arbitrarily.

Right. But a difference is that with min_send_delay, it's just to
create a subscription.

> Similarly, if there is a network issue
> (unreachable or slow), one will see similar symptoms. I think
> retention of data and WAL on the publisher does rely on acknowledgment from
> subscribers, and a delay in that for any reason can lead to the
> symptoms you describe above.

I think that piling up WAL files due to a slow network is a different
story since it's a problem not only on the subscriber side.

>  We have documented at least one such case
> already where during Drop Subscription, if the network is not
> reachable then also, a similar problem can happen and users need to be
> careful about it [1].

Apart from a bad-use case example I mentioned, in general, piling up
WAL files due to the replication slot has many bad effects on the
system. I'm concerned that the side effect of this feature (at least
of the current design) is too huge compared to the benefit, and afraid
that users might end up using this feature without understanding the
side effect well. It might be okay if we thoroughly document it but
I'm not sure.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Amit Kapila
On Tue, Feb 28, 2023 at 9:21 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > 1.
> > + /*
> > + * If we've requested to shut down, exit the process.
> > + *
> > + * Note that WalSndDone() cannot be used here because the delaying
> > + * changes will be sent in the function.
> > + */
> > + if (got_STOPPING)
> > + {
> > + QueryCompletion qc;
> > +
> > + /* Inform the standby that XLOG streaming is done */
> > + SetQueryCompletion(&qc, CMDTAG_COPY, 0);
> > + EndCommand(&qc, DestRemote, false);
> > + pq_flush();
> >
> > Do we really need to do anything except for breaking the loop and let
> > the exit handling happen in the main loop when 'got_STOPPING' is set?
> > AFAICS, this is what we are doing in some other places (see
> > WalSndWaitForWal). Won't that work? It seems that will help us send
> > all the pending WAL.
>
> If we exit the loop after got_STOPPING is set, as you said, the walsender will
> send the delayed changes and then exit. The behavior is the same as when
> WalSndDone() is called. But I think it is not suitable for the motivation of
> the feature. If users notice a mistaken operation like TRUNCATE, they must
> shut down the publisher once and then recover from a backup or the old
> subscriber. If the walsender sends all pending changes, the mistaken
> operations will also be propagated to the subscriber and the data cannot be
> protected. So I currently want to keep this style.
> FYI - In case of physical replication, received WALs are not applied when the
> secondary is shut down.
>

Fair point but I think the current comment should explain why we are
doing something different here. How about extending the existing
comments to something like: "If we've requested to shut down, exit the
process. This is unlike handling at other places where we allow
complete WAL to be sent before shutdown because we don't want the
delayed transactions to be applied downstream. This will allow one to
use the data from downstream in case of some unwanted operations on
the current node."

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Amit Kapila
On Wed, Mar 1, 2023 at 8:18 AM Masahiko Sawada  wrote:
>
> On Wed, Mar 1, 2023 at 12:51 AM Hayato Kuroda (Fujitsu)
>  wrote:
>
> Thinking of side effects of this feature (no matter where we delay
> applying the changes), on the publisher, vacuum cannot collect garbage
> and WAL cannot be recycled. Is that okay in the first place? The point
> is that the subscription setting affects the publisher. That is,
> min_send_delay is specified on the subscriber but the symptoms that
> could ultimately lead to a server crash appear on the publisher, which
> sounds dangerous to me.
>
> Imagine a service or system where there is a publication server
> and it's somewhat exposed so that a user (or a subsystem) can
> arbitrarily create a subscriber to replicate a subset of the data. A malicious
> user can have the publisher crash by creating a subscription with,
> say, min_send_delay = 20d. max_slot_wal_keep_size helps this situation
> but it's -1 by default.
>

By publisher crash, do you mean due to the disk full situation, it can
lead the publisher to stop/panic? Won't a malicious user be able to block the
replication in other ways as well and let the publisher stall (or
crash the publisher) even without setting min_send_delay? Basically,
one needs to either disable the subscription or create a
constraint-violating row in the table to make that happen. If the
system is exposed for arbitrarily allowing the creation of a
subscription then a malicious user can create a subscription similar
to one existing subscription and block the replication due to
constraint violations. I don't think it would be so easy to bypass the
current system that a malicious user will be allowed to create/alter
subscriptions arbitrarily. Similarly, if there is a network issue
(unreachable or slow), one will see similar symptoms. I think
retention of data and WAL on the publisher does rely on acknowledgment from
subscribers, and a delay in that for any reason can lead to the
symptoms you describe above. We have documented at least one such case
already where during Drop Subscription, if the network is not
reachable then also, a similar problem can happen and users need to be
careful about it [1].

[1] - 
https://www.postgresql.org/docs/devel/logical-replication-subscription.html

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Amit Kapila
On Wed, Mar 1, 2023 at 8:06 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 28 Feb 2023 08:35:11 +0530, Amit Kapila  
> wrote in
> > On Tue, Feb 28, 2023 at 8:14 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Mon, 27 Feb 2023 14:56:19 +0530, Amit Kapila  
> > > wrote in
> > > > The one difference w.r.t recovery_min_apply_delay is that here we will
> > > > hold locks for the duration of the delay which didn't seem to be a
> > > > good idea. This will also probably lead to more bloat as we will keep
> > > > transactions open for a long time. Doing it before DecodePrepare won't
> > >
> > > I don't have a concrete picture but could we tell reorder buffer to
> > > retain a PREPAREd transaction until a COMMIT PREPARED comes?
> > >
> >
> > Yeah, we could do that and that is what is the behavior unless the
> > user enables 2PC via 'two_phase' subscription option. But, I don't see
> > the need to unnecessarily delay the prepare till the commit if a user
> > has specified 'two_phase' option. It is quite possible that COMMIT
> > PREPARED happens at a much later time frame than the amount of delay
> > the user is expecting.
>
> It looks like the user should decide between potential long locks and
> extra delays, and this choice ought to be documented.
>

Sure, we can do that. However, I think the way this feature works is
that we keep standby/subscriber behind the primary/publisher by a
certain time period and if there is any unwanted transaction (say
Delete * .. without where clause), we can recover it from the receiver
side. So, it may not matter much even if we wait at PREPARE to avoid
long locks instead of documenting it.

> > >  If
> > > delaying non-prepared transactions until COMMIT is adequate, then the
> > > same thing seems to work for prepared transactions.
> > >
> > > > have such problems. This is the reason that we decide to perform a
> > > > delay at the start of the transaction instead at commit/prepare in the
> > > > subscriber-side approach.
> > >
> > > It seems that there are no technical obstacles to do that on the
> > > publisher side. The only observable difference would be that
> > > relatively large prepared transactions may experience noticeable
> > > additional delays.  IMHO I don't think it's a good practice
> > > protocol-wise to intentionally choke a stream at the receiving end
> > > when it has not been flow-controlled on the transmitting end.
> > >
> >
> > But in this proposal, we are not choking/delaying anything on the receiving 
> > end.
>
> I didn't say that to the latest patch.  I interpreted the quote of
> your description as saying that the subscriber-side solution is
> effective in solving the long-lock problems, so I replied that that
> can be solved with the publisher-side solution and the subscriber-side
> solution could cause some unwanted behavior.
>
> Do you think we have decided to go with the publisher-side solution?
> I'm fine if so.
>

I am fine too unless we discover any major challenges with
publisher-side implementation.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Masahiko Sawada
On Wed, Mar 1, 2023 at 12:51 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,
>
> > Few comments:
>
> Thank you for reviewing! PSA new version.
> Note that the starting point of the delay for 2PC was not changed;
> I think it is still under discussion.
>
> > 1.
> > + /*
> > + * If we've requested to shut down, exit the process.
> > + *
> > + * Note that WalSndDone() cannot be used here because the delaying
> > + * changes will be sent in the function.
> > + */
> > + if (got_STOPPING)
> > + {
> > + QueryCompletion qc;
> > +
> > + /* Inform the standby that XLOG streaming is done */
> > + SetQueryCompletion(&qc, CMDTAG_COPY, 0);
> > + EndCommand(&qc, DestRemote, false);
> > + pq_flush();
> >
> > Do we really need to do anything except for breaking the loop and let
> > the exit handling happen in the main loop when 'got_STOPPING' is set?
> > AFAICS, this is what we are doing in some other places (see
> > WalSndWaitForWal). Won't that work? It seems that will help us send
> > all the pending WAL.
>
> If we exit the loop after got_STOPPING is set, as you said, the walsender will
> send the delayed changes and then exit. The behavior is the same as when
> WalSndDone() is called. But I think it is not suitable for the motivation of
> the feature. If users notice a mistaken operation like TRUNCATE, they must
> shut down the publisher once and then recover from a backup or the old
> subscriber. If the walsender sends all pending changes, the mistaken
> operations will also be propagated to the subscriber and the data cannot be
> protected. So I currently want to keep this style.
> FYI - In case of physical replication, received WALs are not applied when the
> secondary is shut down.
>
> > 2.
> > + /* Try to flush pending output to the client */
> > + if (pq_flush_if_writable() != 0)
> > + WalSndShutdown();
> >
> > Is there a reason to try flushing here?
>
> IIUC if pq_flush_if_writable() returns non-zero (EOF), it means that there is
> trouble and the walsender fails to send messages to the subscriber.
>
> In Linux, the stack trace from pq_flush_if_writable() finally reaches the
> send() system call. And according to the man page [1], it fails on some
> unexpected state or when the connection is closed.
>
> Based on the above, I think the returned value should be confirmed.
>
> > Apart from the above, I have made a few changes in the comments in the
> > attached diff patch. If you agree with those then please include them
> > in the next version.
>
> Thanks! I checked and I thought all of them should be included.
>
> Moreover, I used grammar checker and slightly reworded the commit message.

Thinking of side effects of this feature (no matter where we delay
applying the changes), on the publisher, vacuum cannot collect garbage
and WAL cannot be recycled. Is that okay in the first place? The point
is that the subscription setting affects the publisher. That is,
min_send_delay is specified on the subscriber but the symptoms that
could ultimately lead to a server crash appear on the publisher, which
sounds dangerous to me.

Imagine a service or system where there is a publication server
and it's somewhat exposed so that a user (or a subsystem) can
arbitrarily create a subscriber to replicate a subset of the data. A malicious
user can have the publisher crash by creating a subscription with,
say, min_send_delay = 20d. max_slot_wal_keep_size helps this situation
but it's -1 by default.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Kyotaro Horiguchi
At Tue, 28 Feb 2023 08:35:11 +0530, Amit Kapila  wrote 
in 
> On Tue, Feb 28, 2023 at 8:14 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Mon, 27 Feb 2023 14:56:19 +0530, Amit Kapila  
> > wrote in
> > > The one difference w.r.t recovery_min_apply_delay is that here we will
> > > hold locks for the duration of the delay which didn't seem to be a
> > > good idea. This will also probably lead to more bloat as we will keep
> > > transactions open for a long time. Doing it before DecodePrepare won't
> >
> > I don't have a concrete picture but could we tell reorder buffer to
> > retain a PREPAREd transaction until a COMMIT PREPARED comes?
> >
> 
> Yeah, we could do that and that is what is the behavior unless the
> user enables 2PC via 'two_phase' subscription option. But, I don't see
> the need to unnecessarily delay the prepare till the commit if a user
> has specified 'two_phase' option. It is quite possible that COMMIT
> PREPARED happens at a much later time frame than the amount of delay
> the user is expecting.

It looks like the user should decide between potential long locks and
extra delays, and this choice ought to be documented.

> >  If
> > delaying non-prepared transactions until COMMIT is adequate, then the
> > same thing seems to work for prepared transactions.
> >
> > > have such problems. This is the reason that we decide to perform a
> > > delay at the start of the transaction instead at commit/prepare in the
> > > subscriber-side approach.
> >
> > It seems that there are no technical obstacles to do that on the
> > publisher side. The only observable difference would be that
> > relatively large prepared transactions may experience noticeable
> > additional delays.  IMHO I don't think it's a good practice
> > protocol-wise to intentionally choke a stream at the receiving end
> > when it has not been flow-controlled on the transmitting end.
> >
> 
> But in this proposal, we are not choking/delaying anything on the receiving 
> end.

I didn't say that to the latest patch.  I interpreted the quote of
your description as saying that the subscriber-side solution is
effective in solving the long-lock problems, so I replied that that
can be solved with the publisher-side solution and the subscriber-side
solution could cause some unwanted behavior.

Do you think we have decided to go with the publisher-side solution?
I'm fine if so.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Peter Smith
Here are some review comments for v9-0001, but these are only very trivial.

==
Commit Message

1.
Nitpick. The new text is jagged-looking. It should wrap at ~80 chars.

~~~

2.
2. Another reason is for that parallel streaming, the transaction will be opened
immediately by the parallel apply worker. Therefore, if the walsender
is delayed in sending the final record of the transaction, the
parallel apply worker must wait to receive it with an open
transaction. This would result in the locks acquired during the
transaction not being released until the min_send_delay has elapsed.

~

The text already said there are "two reasons", and already this is
numbered as reason 2. So it doesn't need to keep saying "Another
reason" here.

"Another reason is for that parallel streaming" --> "For parallel streaming..."

==
src/backend/replication/walsender.c

3. WalSndDelay

+ /* die if timeout was reached */
+ WalSndCheckTimeOut();

Other nearby comments start uppercase, so this should too.

==
src/include/replication/walreceiver.h

4. WalRcvStreamOptions

@@ -187,6 +187,7 @@ typedef struct
  * prepare time */
  char*origin; /* Only publish data originating from the
  * specified origin */
+ int32 min_send_delay; /* The minimum send delay */
  } logical;
  } proto;
 } WalRcvStreamOptions;

~

Should that comment mention the units are "(ms)"

--
Kind Regards,
Peter Smith.
Fujitsu Australia




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> Few comments:

Thank you for reviewing! PSA new version.
Note that the starting point of the delay for 2PC was not changed;
I think it is still under discussion.

> 1.
> + /*
> + * If we've requested to shut down, exit the process.
> + *
> + * Note that WalSndDone() cannot be used here because the delaying
> + * changes will be sent in the function.
> + */
> + if (got_STOPPING)
> + {
> + QueryCompletion qc;
> +
> + /* Inform the standby that XLOG streaming is done */
> + SetQueryCompletion(&qc, CMDTAG_COPY, 0);
> + EndCommand(&qc, DestRemote, false);
> + pq_flush();
> 
> Do we really need to do anything except for breaking the loop and let
> the exit handling happen in the main loop when 'got_STOPPING' is set?
> AFAICS, this is what we are doing in some other places (see
> WalSndWaitForWal). Won't that work? It seems that will help us send
> all the pending WAL.

If we exit the loop after got_STOPPING is set, as you said, the walsender will
send the delayed changes and then exit. The behavior is the same as when
WalSndDone() is called. But I think it is not suitable for the motivation of
the feature. If users notice a mistaken operation like TRUNCATE, they must
shut down the publisher once and then recover from a backup or the old
subscriber. If the walsender sends all pending changes, the mistaken
operations will also be propagated to the subscriber and the data cannot be
protected. So I currently want to keep this style.
FYI - In case of physical replication, received WALs are not applied when the
secondary is shut down.

> 2.
> + /* Try to flush pending output to the client */
> + if (pq_flush_if_writable() != 0)
> + WalSndShutdown();
> 
> Is there a reason to try flushing here?

IIUC if pq_flush_if_writable() returns non-zero (EOF), it means that there is
trouble and the walsender fails to send messages to the subscriber.

In Linux, the stack trace from pq_flush_if_writable() finally reaches the
send() system call. And according to the man page [1], it fails on some
unexpected state or when the connection is closed.

Based on the above, I think the returned value should be confirmed.

> Apart from the above, I have made a few changes in the comments in the
> attached diff patch. If you agree with those then please include them
> in the next version.

Thanks! I checked and I thought all of them should be included.

Moreover, I used grammar checker and slightly reworded the commit message.

[1]: https://man7.org/linux/man-pages/man3/send.3p.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v9-0001-Time-delayed-logical-replication-on-publisher-sid.patch
Description:  v9-0001-Time-delayed-logical-replication-on-publisher-sid.patch


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-28 Thread Amit Kapila
On Mon, Feb 27, 2023 at 2:21 PM Hayato Kuroda (Fujitsu)
 wrote:
>

Few comments:
1.
+ /*
+ * If we've requested to shut down, exit the process.
+ *
+ * Note that WalSndDone() cannot be used here because the delaying
+ * changes will be sent in the function.
+ */
+ if (got_STOPPING)
+ {
+ QueryCompletion qc;
+
+ /* Inform the standby that XLOG streaming is done */
+ SetQueryCompletion(&qc, CMDTAG_COPY, 0);
+ EndCommand(&qc, DestRemote, false);
+ pq_flush();

Do we really need to do anything except for breaking the loop and let
the exit handling happen in the main loop when 'got_STOPPING' is set?
AFAICS, this is what we are doing in some other places (see
WalSndWaitForWal). Won't that work? It seems that will help us send
all the pending WAL.

2.
+ /* Try to flush pending output to the client */
+ if (pq_flush_if_writable() != 0)
+ WalSndShutdown();

Is there a reason to try flushing here?

Apart from the above, I have made a few changes in the comments in the
attached diff patch. If you agree with those then please include them
in the next version.

-- 
With Regards,
Amit Kapila.


changes_amit_1.patch
Description: Binary data


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-27 Thread Amit Kapila
On Tue, Feb 28, 2023 at 8:14 AM Kyotaro Horiguchi
 wrote:
>
> At Mon, 27 Feb 2023 14:56:19 +0530, Amit Kapila  
> wrote in
> > The one difference w.r.t recovery_min_apply_delay is that here we will
> > hold locks for the duration of the delay which didn't seem to be a
> > good idea. This will also probably lead to more bloat as we will keep
> > transactions open for a long time. Doing it before DecodePrepare won't
>
> I don't have a concrete picture but could we tell reorder buffer to
> retain a PREPAREd transaction until a COMMIT PREPARED comes?
>

Yeah, we could do that and that is what is the behavior unless the
user enables 2PC via 'two_phase' subscription option. But, I don't see
the need to unnecessarily delay the prepare till the commit if a user
has specified 'two_phase' option. It is quite possible that COMMIT
PREPARED happens at a much later time frame than the amount of delay
the user is expecting.

>  If
> delaying non-prepared transactions until COMMIT is adequate, then the
> same thing seems to work for prepared transactions.
>
> > have such problems. This is the reason that we decide to perform a
> > delay at the start of the transaction instead at commit/prepare in the
> > subscriber-side approach.
>
> It seems that there are no technical obstacles to do that on the
> publisher side. The only observable difference would be that
> relatively large prepared transactions may experience noticeable
> additional delays.  IMHO I don't think it's a good practice
> protocol-wise to intentionally choke a stream at the receiving end
> when it has not been flow-controlled on the transmitting end.
>

But in this proposal, we are not choking/delaying anything on the receiving end.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-27 Thread Kyotaro Horiguchi
At Mon, 27 Feb 2023 14:56:19 +0530, Amit Kapila  wrote 
in 
> The one difference w.r.t recovery_min_apply_delay is that here we will
> hold locks for the duration of the delay which didn't seem to be a
> good idea. This will also probably lead to more bloat as we will keep
> transactions open for a long time. Doing it before DecodePrepare won't

I don't have a concrete picture but could we tell reorder buffer to
retain a PREPAREd transaction until a COMMIT PREPARED comes?  If
delaying non-prepared transactions until COMMIT is adequate, then the
same thing seems to work for prepared transactions.

> have such problems. This is the reason that we decide to perform a
> delay at the start of the transaction instead at commit/prepare in the
> subscriber-side approach.

It seems that there are no technical obstacles to do that on the
publisher side. The only observable difference would be that
relatively large prepared transactions may experience noticeable
additional delays.  IMHO I don't think it's a good practice
protocol-wise to intentionally choke a stream at the receiving end
when it has not been flow-controlled on the transmitting end.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-27 Thread Amit Kapila
On Mon, Feb 27, 2023 at 1:50 PM Masahiko Sawada  wrote:
>
> On Mon, Feb 27, 2023 at 3:34 PM Amit Kapila  wrote:
> >
> > > ---
> > > Why do we not to delay sending COMMIT PREPARED messages?
> > >
> >
> > I think we need to either add delay for prepare or commit prepared as
> > otherwise, it will lead to delaying the xact more than required.
>
> Agreed.
>
> > The
> > patch seems to add a delay before sending a PREPARE as that is the
> > time when the subscriber will apply the changes.
>
> Considering the purpose of this feature mentioned in the commit
> message "particularly to fix errors that might cause data loss",
> delaying sending PREPARE would really help that situation? For
> example, even after (mistakenly) executing PREPARE for a transaction
> executing DELETE without WHERE clause on the publisher the user still
> can rollback the transaction. They don't lose data on both nodes yet.
> After executing (and replicating) COMMIT PREPARED for that
> transaction, they lose the data on both nodes. IIUC the time-delayed
> logical replication should help this situation by delaying sending
> COMMIT PREPARED so that, for example, the user can stop logical
> replication before COMMIT PREPARED message arrives to the subscriber.
> So I think we should delay sending COMMIT PREPARED (and COMMIT)
> instead of PREPARE. This would help users to correct data loss errors,
> and would be more consistent with what recovery_min_apply_delay does.
>

The one difference w.r.t recovery_min_apply_delay is that here we will
hold locks for the duration of the delay which didn't seem to be a
good idea. This will also probably lead to more bloat as we will keep
transactions open for a long time. Doing it before DecodePrepare won't
have such problems. This is the reason that we decide to perform a
delay at the start of the transaction instead at commit/prepare in the
subscriber-side approach.

-- 
With Regards,
Amit Kapila.




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-27 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san, Amit,

Thank you for reviewing!

> + *
> + * LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM is the minimum
> protocol version
> + * with support for delaying to send transactions. Introduced in PG16.
>   */
>  #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
>  #define LOGICALREP_PROTO_VERSION_NUM 1
>  #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
>  #define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
>  #define LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM 4
> -#define LOGICALREP_PROTO_MAX_VERSION_NUM
> LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM
> +#define LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM 4
> +#define LOGICALREP_PROTO_MAX_VERSION_NUM
> LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM
> 
> What is the usecase of the old macro,
> LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, after adding
> LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM ? I think if we go this
> way, we will end up adding macros every time when adding a new option,
> which seems not a good idea. I'm really not sure we need to change the
> protocol version or the macro. Commit
> 366283961ac0ed6d8901c6090f3fd02fce0a introduced the 'origin'
> subscription parameter that is also sent to the publisher, but we
> didn't touch the protocol version at all.

I removed the protocol number.

I checked the previous discussion [1]. According to it, the protocol version must
be modified when a new message is added or existing messages are changed.
This patch intentionally makes walsenders delay sending data, and no
extra information is added to any message. Therefore I think a bump is not needed.

> ---
> Why do we not to delay sending COMMIT PREPARED messages?

This was motivated by the comment [2], but I preferred your opinion [3].
Now the COMMIT PREPARED message is delayed instead of the PREPARE message.

> ---
> +/*
> + * If we've requested to shut down, exit the process.
> + *
> + * Note that WalSndDone() cannot be used here because
> the delaying
> + * changes will be sent in the function.
> + */
> +if (got_STOPPING)
> +WalSndShutdown();
> 
> Since the walsender exits without sending the done message at a server
> shutdown, we get the following log message on the subscriber:
> 
> ERROR:  could not receive data from WAL stream: server closed the
> connection unexpectedly
> 
> I think that since the walsender is just waiting for sending data, it
> can send the done message if the socket is writable.

You are right. I was confused with the previous implementation, in which workers
could not accept any messages. I made walsenders send the end-command message directly.
Is this what you expected?

> ---
> +delayUntil = TimestampTzPlusMilliseconds(delay_start,
> ctx->min_send_delay);
> +remaining_wait_time_ms =
> TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
> +
> (snip)
> +
> +/* Sleep until appropriate time. */
> +timeout_sleeptime_ms =
> WalSndComputeSleeptime(GetCurrentTimestamp());
> 
> I think it's better to call GetCurrentTimestamp() only once.

Right, fixed.

> ---
> +# This test is successful only if at least the configured delay has elapsed.
> +ok( time() - $publisher_insert_time >= $delay,
> +"subscriber applies WAL only after replication delay for
> non-streaming transaction"
> +);
> 
> The subscriber doesn't actually apply WAL records, but logically
> replicated changes. How about "subscriber applies changes only
> after..."?

I grepped the other tests and could not find the same usage of the word "WAL",
so I fixed it as you suggested.

In the next version I will run the commit messages through a grammar checker
such as ChatGPT...

[1]: 
https://www.postgresql.org/message-id/CAA4eK1LjOm6-OHggYVH35dQ_v40jOXrJW0GFy3kuwTd2J48%3DUg%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CAA4eK1K4uPbudrNdH%2B%3D_vN-Hpe9wYh%3D3vBS5Ww9dHn-LOWMV0g%40mail.gmail.com
[3]: 
https://www.postgresql.org/message-id/CAD21AoA0mPq_m6USfAC8DAkvFfwjqGvGq++Uv=avryyotvq...@mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v8-0001-Time-delayed-logical-replication-on-publisher-sid.patch
Description:  v8-0001-Time-delayed-logical-replication-on-publisher-sid.patch


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-27 Thread Masahiko Sawada
On Mon, Feb 27, 2023 at 3:34 PM Amit Kapila  wrote:
>
> On Mon, Feb 27, 2023 at 11:11 AM Masahiko Sawada  
> wrote:
> >
> > On Thu, Feb 23, 2023 at 9:10 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > > Thank you for reviewing! PSA new version.
> > >
> > >
> >
> > Thank you for updating the patch. Here are some comments on v7 patch:
> >
> > + *
> > + * LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM is the minimum protocol 
> > version
> > + * with support for delaying to send transactions. Introduced in PG16.
> >   */
> >  #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> >  #define LOGICALREP_PROTO_VERSION_NUM 1
> >  #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> >  #define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
> >  #define LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM 4
> > -#define LOGICALREP_PROTO_MAX_VERSION_NUM
> > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM
> > +#define LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM 4
> > +#define LOGICALREP_PROTO_MAX_VERSION_NUM
> > LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM
> >
> > What is the usecase of the old macro,
> > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, after adding
> > LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM ? I think if we go this
> > way, we will end up adding macros every time when adding a new option,
> > which seems not a good idea. I'm really not sure we need to change the
> > protocol version or the macro. Commit
> > 366283961ac0ed6d8901c6090f3fd02fce0a introduced the 'origin'
> > subscription parameter that is also sent to the publisher, but we
> > didn't touch the protocol version at all.
> >
>
> Right, I also don't see a reason to do anything for this. We have
> previously bumped the protocol version when we send extra/additional
> information from walsender but here that is not the requirement, so
> this change doesn't seem to be required.
>
> > ---
> > Why do we not to delay sending COMMIT PREPARED messages?
> >
>
> I think we need to either add delay for prepare or commit prepared as
> otherwise, it will lead to delaying the xact more than required.

Agreed.

> The
> patch seems to add a delay before sending a PREPARE as that is the
> time when the subscriber will apply the changes.

Considering the purpose of this feature mentioned in the commit
message, "particularly to fix errors that might cause data loss",
would delaying the sending of PREPARE really help that situation? For
example, even after (mistakenly) executing PREPARE for a transaction
that ran DELETE without a WHERE clause on the publisher, the user can
still roll back the transaction; no data has been lost on either node yet.
Only after executing (and replicating) COMMIT PREPARED for that
transaction is the data lost on both nodes. IIUC, time-delayed
logical replication should help this situation by delaying the sending of
COMMIT PREPARED so that, for example, the user can stop logical
replication before the COMMIT PREPARED message arrives at the subscriber.
So I think we should delay sending COMMIT PREPARED (and COMMIT)
instead of PREPARE. This would help users correct data-loss errors,
and would be more consistent with what recovery_min_apply_delay does.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-26 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> I was trying to think if there is any better way to implement the
> newly added callback (WalSndDelay()) but couldn't find any. For
> example, one idea I tried to evaluate is whether we can merge it with
> the existing callback WalSndUpdateProgress() or maybe extract the part
> other than progress tracking from that function into a new callback
> and then try to reuse it here as well. Though there is some common
> functionality like checking for timeout and processing replies still
> they are different enough that they seem to need separate callbacks.
> The prime purpose of a callback for the patch being discussed here is
> to delay the xact before sending the commit/prepare whereas the
> existing callback (WalSndUpdateProgress()) or what we are discussing
> at [1] allows sending the keepalive message in some special cases
> where there is no communication between walsender and walreceiver.
> Now, the WalSndDelay() also tries to check for timeout and send
> keepalive if necessary but there is also dependency on the delay
> parameter, so don't think it is a good idea of trying to combine those
> functionalities into one API.
> 
> Thoughts?
> 
> [1] -
> https://www.postgresql.org/message-id/20230210210423.r26ndnfmuifie4f6%40
> awork3.anarazel.de

Thank you for confirming. My understanding is that we should keep the current
design, and I agree with your analysis.

In the current callback, and in the modified version in [1], sending keepalives is
done via ProcessPendingWrites(). It is called by many functions and should not be
changed, e.g. by adding an end_time parameter just for our purpose. Moreover, the
name is not suitable here, because time-delayed logical replication does not wait
until the send buffer becomes empty.

If we restructured WalSndUpdateProgress() and the mechanisms around it, the code
would become messy. As Amit said, in one path the lag would be tracked and the
walsender would wait until the buffer is empty, while in the other path the lag
calculation would be skipped and the walsender would wait until the given period
has elapsed. Such a function would be painful to read later.

I think callbacks that have different purposes should not be mixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-26 Thread Amit Kapila
On Mon, Feb 27, 2023 at 11:11 AM Masahiko Sawada  wrote:
>
> On Thu, Feb 23, 2023 at 9:10 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Thank you for reviewing! PSA new version.
> >
> >
>
> Thank you for updating the patch. Here are some comments on v7 patch:
>
> + *
> + * LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM is the minimum protocol 
> version
> + * with support for delaying to send transactions. Introduced in PG16.
>   */
>  #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
>  #define LOGICALREP_PROTO_VERSION_NUM 1
>  #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
>  #define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
>  #define LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM 4
> -#define LOGICALREP_PROTO_MAX_VERSION_NUM
> LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM
> +#define LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM 4
> +#define LOGICALREP_PROTO_MAX_VERSION_NUM
> LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM
>
> What is the usecase of the old macro,
> LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, after adding
> LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM ? I think if we go this
> way, we will end up adding macros every time when adding a new option,
> which seems not a good idea. I'm really not sure we need to change the
> protocol version or the macro. Commit
> 366283961ac0ed6d8901c6090f3fd02fce0a introduced the 'origin'
> subscription parameter that is also sent to the publisher, but we
> didn't touch the protocol version at all.
>

Right, I also don't see a reason to do anything for this. We have
previously bumped the protocol version when we send extra/additional
information from walsender but here that is not the requirement, so
this change doesn't seem to be required.

> ---
> Why do we not to delay sending COMMIT PREPARED messages?
>

I think we need to either add delay for prepare or commit prepared as
otherwise, it will lead to delaying the xact more than required. The
patch seems to add a delay before sending a PREPARE as that is the
time when the subscriber will apply the changes.
-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-26 Thread Amit Kapila
On Thu, Feb 23, 2023 at 5:40 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Thank you for reviewing! PSA new version.
>

I was trying to think if there is any better way to implement the
newly added callback (WalSndDelay()) but couldn't find any. For
example, one idea I tried to evaluate is whether we can merge it with
the existing callback WalSndUpdateProgress() or maybe extract the part
other than progress tracking from that function into a new callback
and then try to reuse it here as well. Though there is some common
functionality like checking for timeout and processing replies still
they are different enough that they seem to need separate callbacks.
The prime purpose of a callback for the patch being discussed here is
to delay the xact before sending the commit/prepare whereas the
existing callback (WalSndUpdateProgress()) or what we are discussing
at [1] allows sending the keepalive message in some special cases
where there is no communication between walsender and walreceiver.
Now, the WalSndDelay() also tries to check for timeout and send
keepalive if necessary but there is also dependency on the delay
parameter, so don't think it is a good idea of trying to combine those
functionalities into one API.

Thoughts?

[1] - 
https://www.postgresql.org/message-id/20230210210423.r26ndnfmuifie4f6%40awork3.anarazel.de

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-26 Thread Masahiko Sawada
On Thu, Feb 23, 2023 at 9:10 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Shi,
>
> Thank you for reviewing! PSA new version.
>
> > + elog(DEBUG2, "time-delayed replication for txid %u, delay_time
> > = %d ms, remaining wait time: %ld ms",
> > +  ctx->write_xid, (int) ctx->min_send_delay,
> > +  remaining_wait_time_ms);
> >
> > I tried and saw that the xid here looks wrong, what it got is the xid of the
> > previous transaction. It seems `ctx->write_xid` has not been updated and we
> > can't use it.
> >
>
> Good catch. There are several approaches to fix that, I choose the simplest 
> way.
> TransactionId was added as an argument of functions.
>

Thank you for updating the patch. Here are some comments on v7 patch:

+ *
+ * LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM is the minimum protocol version
+ * with support for delaying to send transactions. Introduced in PG16.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
 #define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
 #define LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM 4
-#define LOGICALREP_PROTO_MAX_VERSION_NUM
LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM
+#define LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM 4
+#define LOGICALREP_PROTO_MAX_VERSION_NUM
LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM

What is the usecase of the old macro,
LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, after adding
LOGICALREP_PROTO_MIN_SEND_DELAY_VERSION_NUM ? I think if we go this
way, we will end up adding macros every time when adding a new option,
which seems not a good idea. I'm really not sure we need to change the
protocol version or the macro. Commit
366283961ac0ed6d8901c6090f3fd02fce0a introduced the 'origin'
subscription parameter that is also sent to the publisher, but we
didn't touch the protocol version at all.

---
Why do we not to delay sending COMMIT PREPARED messages?

---
+/*
+ * If we've requested to shut down, exit the process.
+ *
+ * Note that WalSndDone() cannot be used here because
the delaying
+ * changes will be sent in the function.
+ */
+if (got_STOPPING)
+WalSndShutdown();

Since the walsender exits without sending the done message at a server
shutdown, we get the following log message on the subscriber:

ERROR:  could not receive data from WAL stream: server closed the
connection unexpectedly

I think that since the walsender is just waiting for sending data, it
can send the done message if the socket is writable.

---
+delayUntil = TimestampTzPlusMilliseconds(delay_start,
ctx->min_send_delay);
+remaining_wait_time_ms =
TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
+
(snip)
+
+/* Sleep until appropriate time. */
+timeout_sleeptime_ms =
WalSndComputeSleeptime(GetCurrentTimestamp());

I think it's better to call GetCurrentTimestamp() only once.
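The suggestion can be sketched outside PostgreSQL. Below is a minimal stand-in
(the names mirror the patch but are not real PostgreSQL APIs, and timestamps
are plain millisecond counts): the caller reads the clock once and derives the
remaining wait from that single value.

```c
#include <assert.h>

/*
 * Sketch of the delay computation with a single clock read. "now" is
 * obtained once by the caller (e.g. one GetCurrentTimestamp() call) and
 * reused for both the remaining-wait and the sleep-time computations.
 * All values are in milliseconds; names are illustrative only.
 */
static long
compute_remaining_wait_ms(long long now, long long delay_start,
                          int min_send_delay)
{
    long long delay_until = delay_start + min_send_delay;
    long long remaining = delay_until - now;

    /* Clamp at zero once the delay period has fully elapsed. */
    return remaining > 0 ? (long) remaining : 0;
}
```

Deriving both values from the same `now` avoids the small skew (and the extra
clock read) that two separate GetCurrentTimestamp() calls would introduce.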

---
+# This test is successful only if at least the configured delay has elapsed.
+ok( time() - $publisher_insert_time >= $delay,
+"subscriber applies WAL only after replication delay for
non-streaming transaction"
+);

The subscriber doesn't actually apply WAL records, but logically
replicated changes. How about "subscriber applies changes only
after..."?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-23 Thread Hayato Kuroda (Fujitsu)
Dear Shi,

Thank you for reviewing! PSA new version.

> + elog(DEBUG2, "time-delayed replication for txid %u, delay_time
> = %d ms, remaining wait time: %ld ms",
> +  ctx->write_xid, (int) ctx->min_send_delay,
> +  remaining_wait_time_ms);
> 
> I tried and saw that the xid here looks wrong, what it got is the xid of the
> previous transaction. It seems `ctx->write_xid` has not been updated and we
> can't use it.
>

Good catch. There are several approaches to fix that, I choose the simplest way.
TransactionId was added as an argument of functions.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v7-0001-Time-delayed-logical-replication-on-publisher-sid.patch
Description:  v7-0001-Time-delayed-logical-replication-on-publisher-sid.patch


RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-23 Thread shiy.f...@fujitsu.com
On Wed, Feb 22, 2023 9:48 PM Kuroda, Hayato/黒田 隼人  
wrote:
> 
> Thank you for reviewing! PSA new version.
> 

Thanks for your patch. Here is a comment.

+   elog(DEBUG2, "time-delayed replication for txid %u, delay_time 
= %d ms, remaining wait time: %ld ms",
+ctx->write_xid, (int) ctx->min_send_delay,
+remaining_wait_time_ms);

I tried and saw that the xid here looks wrong, what it got is the xid of the
previous transaction. It seems `ctx->write_xid` has not been updated and we
can't use it.

Regards,
Shi Yu


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-22 Thread Peter Smith
Patch v6 LGTM.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-22 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thank you for reviewing! PSA new version.

> I think it would be better to say: "The minimum delay, in
> milliseconds, by the publisher before sending all the changes". If you
> agree then similar change is required in below comment as well:
> + /*
> + * The minimum delay, in milliseconds, by the publisher before sending
> + * COMMIT/PREPARE record.
> + */
> + int32 min_send_delay;

OK, both of them were fixed.

> > > Should the validation be also checking/asserting no negative numbers,
> > > or actually should the min_send_delay be defined as a uint32 in the
> > > first place?
> >
> > I think you are right because min_apply_delay does not have related code.
> > we must consider additional possibility that user may send
> START_REPLICATION
> > by hand and it has minus value.
> > Fixed.
> >
> 
> Your reasoning for adding the additional check seems good to me but I
> don't see it in the patch. The check as I see is as below:
> + if (delay_val > PG_INT32_MAX)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("min_send_delay \"%s\" out of range",
> + strVal(defel->arg))));
> 
> Am, I missing something, and the new check is at some other place?

For extracting the value from the string, strtoul() is used.
This is an important point.

```
delay_val = strtoul(strVal(defel->arg), &endptr, 10);
```

If the user specifies min_send_delay as '-1', strtoul() accepts the leading minus
sign and negates the result in unsigned arithmetic, so the value is interpreted
as PG_UINT64_MAX. After that, such a strange value is rejected by the part you
copied. I have tested the case and it is correctly rejected.

```
postgres=#  START_REPLICATION SLOT "sub" LOGICAL 0/0 (min_send_delay '-1');
ERROR:  min_send_delay "-1" out of range
CONTEXT:  slot "sub", output plugin "pgoutput", in the startup callback
```
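The wraparound behavior can be reproduced outside PostgreSQL. This is a hedged
stand-in (PG_INT32_MAX is hard-coded here, and parse_min_send_delay is a
hypothetical helper, not the patch's actual code):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define PG_INT32_MAX 2147483647 /* stand-in for PostgreSQL's definition */

/*
 * Returns 0 and stores the parsed value on success, -1 on rejection.
 * strtoul() accepts a leading '-' and negates the result in unsigned
 * arithmetic, so "-1" parses without error as ULONG_MAX and is caught
 * by the range check rather than by the errno/endptr check.
 */
static int
parse_min_send_delay(const char *str, unsigned long *delay_val)
{
    char *endptr;
    unsigned long parsed;

    errno = 0;
    parsed = strtoul(str, &endptr, 10);
    if (errno != 0 || *endptr != '\0')
        return -1;              /* not a valid number */
    if (parsed > PG_INT32_MAX)
        return -1;              /* out of range, including wrapped "-1" */

    *delay_val = parsed;
    return 0;
}
```

So a single upper-bound check is enough to reject negative input as well,
without declaring the field as a signed type.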

> +  has been finished. However, there is a possibility that the table
> +  status written in <link linkend="catalog-pg-subscription-rel"><structname>pg_subscription_rel</structname></link>
> +  will be delayed in getting to "ready" state, and also two-phase
> +  (if specified) will be delayed in getting to "enabled".
> + 
> 
> There appears to be a special value <0x00> after "ready". I think that
> is added by mistake or probably you have used some editor which has
> added this value. Can we slightly reword this to: "However, there is a
> possibility that the table status updated in  linkend="catalog-pg-subscription-rel">pg_subscription_rel ctname>
> could be delayed in getting to the "ready" state, and also two-phase
> (if specified) could be delayed in getting to "enabled"."?

Oh, my Visual Studio Code did not detect the strange character.
And reworded accordingly.

Additionally, I modified the commit message to describe more clearly the reason
why we do not allow the combination of min_send_delay and streaming = parallel.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v6-0001-Time-delayed-logical-replication-on-publisher-sid.patch
Description:  v6-0001-Time-delayed-logical-replication-on-publisher-sid.patch


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-22 Thread Amit Kapila
On Tue, Feb 21, 2023 at 1:28 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > doc/src/sgml/catalogs.sgml
> >
> > 4.
> > + 
> > +  
> > +   subminsenddelay int4
> > +  
> > +  
> > +   The minimum delay, in milliseconds, by the publisher to send changes
> > +  
> > + 
> >
> > "by the publisher to send changes" --> "by the publisher before sending 
> > changes"
>
> As Amit said[1], there is a possibility to delay after sending delay. So I 
> changed to
> "before sending COMMIT record". How do you think?
>

I think it would be better to say: "The minimum delay, in
milliseconds, by the publisher before sending all the changes". If you
agree then similar change is required in below comment as well:
+ /*
+ * The minimum delay, in milliseconds, by the publisher before sending
+ * COMMIT/PREPARE record.
+ */
+ int32 min_send_delay;
+

>
> > src/backend/replication/pgoutput/pgoutput.c
> >
> > 11.
> > + errno = 0;
> + parsed = strtoul(strVal(defel->arg), &endptr, 10);
> > + if (errno != 0 || *endptr != '\0')
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("invalid min_send_delay")));
> > +
> > + if (parsed > PG_INT32_MAX)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("min_send_delay \"%s\" out of range",
> + strVal(defel->arg))));
> >
> > Should the validation be also checking/asserting no negative numbers,
> > or actually should the min_send_delay be defined as a uint32 in the
> > first place?
>
> I think you are right because min_apply_delay does not have related code.
> we must consider additional possibility that user may send START_REPLICATION
> by hand and it has minus value.
> Fixed.
>

Your reasoning for adding the additional check seems good to me but I
don't see it in the patch. The check as I see is as below:
+ if (delay_val > PG_INT32_MAX)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("min_send_delay \"%s\" out of range",
+ strVal(defel->arg))));

Am, I missing something, and the new check is at some other place?

+  has been finished. However, there is a possibility that the table
+  status written in pg_subscription_rel
+  will be delayed in getting to "ready" state, and also two-phase
+  (if specified) will be delayed in getting to "enabled".
+ 

There appears to be a special value <0x00> after "ready". I think that
is added by mistake or probably you have used some editor which has
added this value. Can we slightly reword this to: "However, there is a
possibility that the table status updated in pg_subscription_rel
could be delayed in getting to the "ready" state, and also two-phase
(if specified) could be delayed in getting to "enabled"."?

-- 
With Regards,
Amit Kapila.




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-21 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thank you for reviewing! PSA new version.

> 1.
> The other possibility was to apply the delay at the end of the parallel apply
> transaction but that would cause issues related to resource bloat and
> locks being
> held for a long time.
> 
> ~
> 
> The reply [1] for review comment #2 says that this was "slightly
> reworded", but AFAICT nothing is changed here.

Oh, my git operation might have gone wrong and the change disappeared.
Sorry for the inconvenience; I have reworded it again.

> 2.
> Eariler versions were written by Euler Taveira, Takamichi Osumi, and
> Kuroda Hayato
> 
> Typo: "Eariler"

Fixed.

> ==
> doc/src/sgml/ref/create_subscription.sgml
> 
> 3.
> + 
> +  By default, the publisher sends changes as soon as possible. This
> +  parameter allows the user to delay changes by given time period. If
> +  the value is specified without units, it is taken as milliseconds.
> +  The default is zero (no delay). See <xref linkend="config-setting-names-values"/>
> +  for details on the available valid time units.
> + 
> 
> "by given time period" --> "by the given time period"

Fixed.

> src/backend/replication/pgoutput/pgoutput.c
> 
> 4. parse_output_parameters
> 
> + else if (strcmp(defel->defname, "min_send_delay") == 0)
> + {
> + unsigned long parsed;
> + char*endptr;
> 
> I think 'parsed' is a fairly meaningless variable name. How about
> calling this variable something useful like 'delay_val' or
> 'min_send_delay_value', or something like those? Yes, I recognize that
> you copied this from some existing code fragment, but IMO that doesn't
> make it good.

OK, changed to 'delay_val'.

> 
> ==
> src/backend/replication/walsender.c
> 
> 5.
> + /* Sleep until we get reply from worker or we time out */
> + WalSndWait(WL_SOCKET_READABLE,
> +Min(timeout_sleeptime_ms, remaining_wait_time_ms),
> +WAIT_EVENT_WALSENDER_SEND_DELAY);
> 
> In my previous review [2] comment #14, I questioned if this comment
> was correct. It looks like that was accidentally missed.

Sorry, I missed that. But I think this comment does not have to be changed.

The important point here is that WalSndWait() is used, not WaitLatch().
According to the comment atop WalSndWait(), the function waits until one of the
following events occurs:

- the socket becomes readable or writable
- a timeout occurs

The logical walsender process is always connected to the worker, so the socket
becomes readable when the apply worker sends a feedback message.
That's why I wrote "Sleep until we get reply from worker or we time out".
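The same "wake on a readable socket or on timeout" semantics can be sketched
with poll(). This is a simplified stand-in only: the real WalSndWait() is built
on PostgreSQL's WaitEventSet machinery and also handles latches and postmaster
death, none of which appears here.

```c
#include <poll.h>
#include <unistd.h>
#include <assert.h>

/*
 * Wait until fd becomes readable (e.g. a feedback message arrived) or
 * timeout_ms elapses. Returns 1 if readable, 0 on timeout.
 */
static int
wait_readable_or_timeout(int fd, int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd = fd;
    pfd.events = POLLIN;

    return poll(&pfd, 1, timeout_ms) > 0 ? 1 : 0;
}
```

With this shape, a walsender-style loop wakes up as soon as the peer sends
something, instead of sleeping for the full delay unconditionally.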


> src/include/replication/logical.h
> 
> 6.
> + /*
> + * The minimum delay, in milliseconds, by the publisher before sending
> + * COMMIT/PREPARE record
> + */
> + int32 min_send_delay;
> 
> The comment is missing a period.

Right, added.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v5-0001-Time-delayed-logical-replication-on-publisher-sid.patch
Description:  v5-0001-Time-delayed-logical-replication-on-publisher-sid.patch


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-21 Thread Peter Smith
Here are some very minor review comments for the patch v4-0001

==
Commit Message

1.
The other possibility was to apply the delay at the end of the parallel apply
transaction but that would cause issues related to resource bloat and
locks being
held for a long time.

~

The reply [1] for review comment #2 says that this was "slightly
reworded", but AFAICT nothing is changed here.

~~~

2.
Eariler versions were written by Euler Taveira, Takamichi Osumi, and
Kuroda Hayato

Typo: "Eariler"

==
doc/src/sgml/ref/create_subscription.sgml

3.
+ 
+  By default, the publisher sends changes as soon as possible. This
+  parameter allows the user to delay changes by given time period. If
+  the value is specified without units, it is taken as milliseconds.
+  The default is zero (no delay). See <xref linkend="config-setting-names-values"/>
+  for details on the available valid time units.
+ 

"by given time period" --> "by the given time period"

==
src/backend/replication/pgoutput/pgoutput.c

4. parse_output_parameters

+ else if (strcmp(defel->defname, "min_send_delay") == 0)
+ {
+ unsigned long parsed;
+ char*endptr;

I think 'parsed' is a fairly meaningless variable name. How about
calling this variable something useful like 'delay_val' or
'min_send_delay_value', or something like those? Yes, I recognize that
you copied this from some existing code fragment, but IMO that doesn't
make it good.

==
src/backend/replication/walsender.c

5.
+ /* Sleep until we get reply from worker or we time out */
+ WalSndWait(WL_SOCKET_READABLE,
+Min(timeout_sleeptime_ms, remaining_wait_time_ms),
+WAIT_EVENT_WALSENDER_SEND_DELAY);

In my previous review [2] comment #14, I questioned if this comment
was correct. It looks like that was accidentally missed.

==
src/include/replication/logical.h

6.
+ /*
+ * The minimum delay, in milliseconds, by the publisher before sending
+ * COMMIT/PREPARE record
+ */
+ int32 min_send_delay;

The comment is missing a period.


--
[1] Kuroda-san replied to my review v3-0001.
https://www.postgresql.org/message-id/TYAPR01MB5866C6BCA4D9386D9C486033F5A59%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[2] My previous review v3-0001.
https://www.postgresql.org/message-id/CAHut%2BPu6Y%2BBkYKg6MYGi2zGnx6imeK4QzxBVhpQoPMeDr1npnQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-21 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thank you for commenting!

> > 8.
> > + 
> > +  The delay is effective only when the initial table 
> > synchronization
> > +  has been finished and the publisher decides to send a particular
> > +  transaction downstream. The delay does not take into account the
> > +  overhead of time spent in transferring the transaction, which
> means
> > +  that the arrival time at the subscriber may be delayed more than
> the
> > +  given time.
> > + 
> >
> > I'm not sure about this mention about only "effective only when the
> > initial table synchronization has been finished"... Now that the delay
> > is pub-side I don't know that it is true anymore.
> >
> 
> This will still be true because we don't wait during the initial copy
> (sync). The delay happens only when the replication starts.

Maybe this depends on the definitions of the initial copy and the sync phase.
I checked and added descriptions in [1].


> > 11.
> > + errno = 0;
> > + parsed = strtoul(strVal(defel->arg), &endptr, 10);
> > + if (errno != 0 || *endptr != '\0')
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("invalid min_send_delay")));
> > +
> > + if (parsed > PG_INT32_MAX)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("min_send_delay \"%s\" out of range",
> > + strVal(defel->arg))));
> >
> > Should the validation be also checking/asserting no negative numbers,
> > or actually should the min_send_delay be defined as a uint32 in the
> > first place?
> >
> 
> I don't see the need to change the datatype of min_send_delay as
> compared to what we have min_apply_delay.

I think it is OK to change "long" to "unsigned long", because we use strtoul()
for reading and should reject negative values.
Of course we could modify them further, but I want to keep consistency with the
proto_version part.

[1]: 
https://www.postgresql.org/message-id/tyapr01mb5866c6bca4d9386d9c486033f5...@tyapr01mb5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-20 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

> 1.
> +-- fail - utilizing streaming = parallel with time-delayed
> replication is not supported
> +CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
> false, streaming = parallel, min_send_delay = 123);
> 
> "utilizing" --> "specifying"

Fixed.

> 2.
> +-- success -- min_send_delay value without unit is take as milliseconds
> +CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexit' PUBLICATION testpub WITH (connect =
> false, min_send_delay = 123);
> +\dRs+
> 
> "without unit is take as" --> "without units is taken as"

Fixed.

> 3.
> +-- success -- min_send_delay value with unit is converted into ms and
> stored as an integer
> +ALTER SUBSCRIPTION regress_testsub SET (min_send_delay = '1 d');
> +\dRs+
> 
> 
> "with unit is converted into ms" --> "with units other than ms is
> converted to ms"

Fixed.

> 4. Missing tests?
> 
> Why have the previous ALTER SUBSCRIPTION tests been removed? AFAIK,
> currently, there are no regression tests for error messages like:
> 
> test_sub=# ALTER SUBSCRIPTION sub1 SET (min_send_delay = 123);
> ERROR:  cannot set min_send_delay for subscription in parallel streaming mode

These tests were dropped while changing the basic design.
I have added them back.

> src/test/subscription/t/001_rep_changes.pl
> 
> 5.
> +# This test is successful if and only if the LSN has been applied with at 
> least
> +# the configured apply delay.
> +ok( time() - $publisher_insert_time >= $delay,
> + "subscriber applies WAL only after replication delay for
> non-streaming transaction"
> +);
> 
> It's not strictly an "apply delay". Maybe this comment only needs to
> say like below:
> 
> SUGGESTION
> # This test is successful only if at least the configured delay has elapsed.

Changed.

New patch is available on [1].

[1]: 
https://www.postgresql.org/message-id/TYAPR01MB5866C6BCA4D9386D9C486033F5A59%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-20 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thank you for reviewing! PSA new version. 

> 1.
> If the subscription sets min_send_delay parameter, an apply worker passes the
> value to the publisher as an output plugin option. And then, the walsender 
> will
> delay the transaction sending for given milliseconds.
> 
> ~
> 
> 1a.
> "an apply worker" --> "the apply worker (via walrcv_startstreaming)".
> 
> ~
> 
> 1b.
> "And then, the walsender" --> "The walsender"

Fixed.

> 2.
> The combination of parallel streaming mode and min_send_delay is not allowed.
> This is because in parallel streaming mode, we start applying the transaction
> stream as soon as the first change arrives without knowing the transaction's
> prepare/commit time. Always waiting for the full 'min_send_delay' period might
> include unnecessary delay.
> 
> ~
> 
> Is there another reason not to support this?
> 
> Even if streaming + min_send_delay incurs some extra delay, is that a
> reason to reject outright the combination? What difference will the
> potential of a few extra seconds overhead make when min_send_delay is
> more likely to be far greater (e.g. minutes or hours)?

Another case I came up with is when streaming transactions arrive continuously.
If there are many transactions to be streamed, the walsender must delay sending
each of them for the given period. It means that the arrival of transactions at
the subscriber may be delayed by approximately min_send_delay x # of
transactions.

> 3.
> The other possibility was to apply the delay at the end of the parallel apply
> transaction but that would cause issues related to resource bloat and
> locks being
> held for a long time.
> 
> ~
> 
> Is this explanation still relevant now you are doing pub-side delays?

Slightly reworded. I think the problem may occur if we delay sending the COMMIT
record of parallel applied transactions.

> doc/src/sgml/catalogs.sgml
> 
> 4.
> + 
> +  
> +   subminsenddelay int4
> +  
> +  
> +   The minimum delay, in milliseconds, by the publisher to send changes
> +  
> + 
> 
> "by the publisher to send changes" --> "by the publisher before sending 
> changes"

As Amit said [1], some changes may already be sent before the delay is applied,
so I changed it to "before sending COMMIT record". What do you think?

> doc/src/sgml/logical-replication.sgml
> 
> 5.
> +  
> +   A publication can delay sending changes to the subscription by specifying
> +   the min_send_delay subscription parameter. See
> +for details.
> +  
> 
> ~
> 
> This description seemed backwards because IIUC the PUBLICATION has
> nothing to do with the delay really, the walsender is told what to do
> by the SUBSCRIPTION. Anyway, this paragraph is in the "Subscriber"
> section, so mentioning publications was a bit confusing.
> 
> SUGGESTION
> A subscription can delay the receipt of changes by specifying the
> min_send_delay subscription parameter. See ...

Changed.

> doc/src/sgml/monitoring.sgml
> 
> 6.
> + 
> +  WalSenderSendDelay
> +  Waiting while sending changes for time-delayed logical
> replication
> +  in the WAL sender process.
> + 
> 
> Should this say "Waiting before sending changes", instead of "Waiting
> while sending changes"?

Per the discussion [1], I did not change this.

> doc/src/sgml/ref/create_subscription.sgml
> 
> 7.
> + 
> +  By default, the publisher sends changes as soon as possible. This
> +  parameter allows the user to delay the publisher to send changes by
> +  given time period. If the value is specified without units, it is
> +  taken as milliseconds. The default is zero (no delay). See
> +   for details on the
> +  available valid time units.
> + 
> 
> "to delay the publisher to send changes" --> "to delay changes"

Fixed.

> 8.
> + 
> +  The delay is effective only when the initial table synchronization
> +  has been finished and the publisher decides to send a particular
> +  transaction downstream. The delay does not take into account the
> +  overhead of time spent in transferring the transaction, which means
> +  that the arrival time at the subscriber may be delayed more than 
> the
> +  given time.
> + 
> 
> I'm not sure about this mention about only "effective only when the
> initial table synchronization has been finished"... Now that the delay
> is pub-side I don't know that it is true anymore. The tablesync worker
> will try to synchronize with the apply worker. IIUC during this
> "synchronization" phase the apply worker might be getting delayed by
> its own walsender, so therefore the tablesync might also be delayed
> (due to syncing with the apply worker) won't it?

I tested and checked the code. First of all, the tablesync worker requests to
send WALs without min_send_delay, so changes will be sent and applied with no
delay. In this sense, table synchronization is not affected by the feature.
While 
Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-20 Thread Amit Kapila
On Tue, Feb 21, 2023 at 3:31 AM Peter Smith wrote:
>
>
> 2.
> The combination of parallel streaming mode and min_send_delay is not allowed.
> This is because in parallel streaming mode, we start applying the transaction
> stream as soon as the first change arrives without knowing the transaction's
> prepare/commit time. Always waiting for the full 'min_send_delay' period might
> include unnecessary delay.
>
> ~
>
> Is there another reason not to support this?
>
> Even if streaming + min_send_delay incurs some extra delay, is that a
> reason to reject outright the combination? What difference will the
> potential of a few extra seconds overhead make when min_send_delay is
> more likely to be far greater (e.g. minutes or hours)?
>

I think the point is that we don't know the commit time at the start
of streaming, and the transaction can even be quite long, in which case
adding the delay is not expected.

>
> ==
> doc/src/sgml/catalogs.sgml
>
> 4.
> + 
> +  
> +   subminsenddelay int4
> +  
> +  
> +   The minimum delay, in milliseconds, by the publisher to send changes
> +  
> + 
>
> "by the publisher to send changes" --> "by the publisher before sending 
> changes"
>

For the streaming (=on) case, we may end up sending changes before we
start to apply delay.

> ==
> doc/src/sgml/monitoring.sgml
>
> 6.
> + 
> +  WalSenderSendDelay
> +  Waiting while sending changes for time-delayed logical 
> replication
> +  in the WAL sender process.
> + 
>
> Should this say "Waiting before sending changes", instead of "Waiting
> while sending changes"?
>

In the streaming (non-parallel) case, we may have sent some changes
before the wait, as we wait only at commit/prepare time. The downstream
won't apply such changes till commit. So, this description makes sense
and matches similar nearby descriptions.

>
> 8.
> + 
> +  The delay is effective only when the initial table synchronization
> +  has been finished and the publisher decides to send a particular
> +  transaction downstream. The delay does not take into account the
> +  overhead of time spent in transferring the transaction, which means
> +  that the arrival time at the subscriber may be delayed more than 
> the
> +  given time.
> + 
>
> I'm not sure about this mention about only "effective only when the
> initial table synchronization has been finished"... Now that the delay
> is pub-side I don't know that it is true anymore.
>

This will still be true because we don't wait during the initial copy
(sync). The delay happens only when the replication starts.

> ==
> src/backend/commands/subscriptioncmds.c
> ==
> src/backend/replication/pgoutput/pgoutput.c
>
> 11.
> + errno = 0;
> + parsed = strtoul(strVal(defel->arg), &endptr, 10);
> + if (errno != 0 || *endptr != '\0')
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("invalid min_send_delay")));
> +
> + if (parsed > PG_INT32_MAX)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("min_send_delay \"%s\" out of range",
> + strVal(defel->arg))));
>
> Should the validation be also checking/asserting no negative numbers,
> or actually should the min_send_delay be defined as a uint32 in the
> first place?
>

I don't see the need to change the datatype of min_send_delay as
compared to what we have min_apply_delay.

> ==
> src/include/replication/logical.h
>
> 15.
> @@ -64,6 +68,7 @@ typedef struct LogicalDecodingContext
>   LogicalOutputPluginWriterPrepareWrite prepare_write;
>   LogicalOutputPluginWriterWrite write;
>   LogicalOutputPluginWriterUpdateProgress update_progress;
> + LogicalOutputPluginWriterDelay delay;
>
> ~
>
> 15a.
> Question: Is there some advantage to introducing another callback,
> instead of just doing the delay inline?
>

This is required because we need to check the walsender's timeout and/or
process replies during the delay.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-20 Thread Peter Smith
Here are some review comments for the v3-0001 test code.

==
src/test/regress/sql/subscription.sql

1.
+-- fail - utilizing streaming = parallel with time-delayed
replication is not supported
+CREATE SUBSCRIPTION regress_testsub CONNECTION
'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
false, streaming = parallel, min_send_delay = 123);

"utilizing" --> "specifying"

~~~

2.
+-- success -- min_send_delay value without unit is take as milliseconds
+CREATE SUBSCRIPTION regress_testsub CONNECTION
'dbname=regress_doesnotexit' PUBLICATION testpub WITH (connect =
false, min_send_delay = 123);
+\dRs+

"without unit is take as" --> "without units is taken as"

~~~

3.
+-- success -- min_send_delay value with unit is converted into ms and
stored as an integer
+ALTER SUBSCRIPTION regress_testsub SET (min_send_delay = '1 d');
+\dRs+


"with unit is converted into ms" --> "with units other than ms is
converted to ms"

~~~

4. Missing tests?

Why have the previous ALTER SUBSCRIPTION tests been removed? AFAIK,
currently, there are no regression tests for error messages like:

test_sub=# ALTER SUBSCRIPTION sub1 SET (min_send_delay = 123);
ERROR:  cannot set min_send_delay for subscription in parallel streaming mode

==
src/test/subscription/t/001_rep_changes.pl

5.
+# This test is successful if and only if the LSN has been applied with at least
+# the configured apply delay.
+ok( time() - $publisher_insert_time >= $delay,
+ "subscriber applies WAL only after replication delay for
non-streaming transaction"
+);

It's not strictly an "apply delay". Maybe this comment only needs to
say like below:

SUGGESTION
# This test is successful only if at least the configured delay has elapsed.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-20 Thread Peter Smith
Here are some review comments for patch v3-0001.

(I haven't looked at the test code yet)

==
Commit Message

1.
If the subscription sets min_send_delay parameter, an apply worker passes the
value to the publisher as an output plugin option. And then, the walsender will
delay the transaction sending for given milliseconds.

~

1a.
"an apply worker" --> "the apply worker (via walrcv_startstreaming)".

~

1b.
"And then, the walsender" --> "The walsender"

~~~

2.
The combination of parallel streaming mode and min_send_delay is not allowed.
This is because in parallel streaming mode, we start applying the transaction
stream as soon as the first change arrives without knowing the transaction's
prepare/commit time. Always waiting for the full 'min_send_delay' period might
include unnecessary delay.

~

Is there another reason not to support this?

Even if streaming + min_send_delay incurs some extra delay, is that a
reason to reject outright the combination? What difference will the
potential of a few extra seconds overhead make when min_send_delay is
more likely to be far greater (e.g. minutes or hours)?

~~~

3.
The other possibility was to apply the delay at the end of the parallel apply
transaction but that would cause issues related to resource bloat and
locks being
held for a long time.

~

Is this explanation still relevant now you are doing pub-side delays?

==
doc/src/sgml/catalogs.sgml

4.
+ 
+  
+   subminsenddelay int4
+  
+  
+   The minimum delay, in milliseconds, by the publisher to send changes
+  
+ 

"by the publisher to send changes" --> "by the publisher before sending changes"

==
doc/src/sgml/logical-replication.sgml

5.
+  
+   A publication can delay sending changes to the subscription by specifying
+   the min_send_delay subscription parameter. See
+for details.
+  

~

This description seemed backwards because IIUC the PUBLICATION has
nothing to do with the delay really, the walsender is told what to do
by the SUBSCRIPTION. Anyway, this paragraph is in the "Subscriber"
section, so mentioning publications was a bit confusing.

SUGGESTION
A subscription can delay the receipt of changes by specifying the
min_send_delay subscription parameter. See ...

==
doc/src/sgml/monitoring.sgml

6.
+ 
+  WalSenderSendDelay
+  Waiting while sending changes for time-delayed logical replication
+  in the WAL sender process.
+ 

Should this say "Waiting before sending changes", instead of "Waiting
while sending changes"?

==
doc/src/sgml/ref/create_subscription.sgml

7.
+ 
+  By default, the publisher sends changes as soon as possible. This
+  parameter allows the user to delay the publisher to send changes by
+  given time period. If the value is specified without units, it is
+  taken as milliseconds. The default is zero (no delay). See
+   for details on the
+  available valid time units.
+ 

"to delay the publisher to send changes" --> "to delay changes"

~~~

8.
+ 
+  The delay is effective only when the initial table synchronization
+  has been finished and the publisher decides to send a particular
+  transaction downstream. The delay does not take into account the
+  overhead of time spent in transferring the transaction, which means
+  that the arrival time at the subscriber may be delayed more than the
+  given time.
+ 

I'm not sure about this mention about only "effective only when the
initial table synchronization has been finished"... Now that the delay
is pub-side I don't know that it is true anymore. The tablesync worker
will try to synchronize with the apply worker. IIUC during this
"synchronization" phase the apply worker might be getting delayed by
its own walsender, so therefore the tablesync might also be delayed
(due to syncing with the apply worker) won't it?

==
src/backend/commands/subscriptioncmds.c

9.
+ /*
+ * translator: the first %s is a string of the form "parameter > 0"
+ * and the second one is "option = value".
+ */
+ errmsg("%s and %s are mutually exclusive options",
+"min_send_delay > 0", "streaming = parallel"));
+
+
 }

Excessive whitespace.

==
src/backend/replication/logical/worker.c

10. ApplyWorkerMain

+ /*
+ * Time-delayed logical replication does not support tablesync
+ * workers, so only the leader apply worker can request walsenders to
+ * apply delay on the publisher side.
+ */
+ if (server_version >= 16 && MySubscription->minsenddelay > 0)
+ options.proto.logical.min_send_delay = MySubscription->minsenddelay;

"apply delay" --> "delay"

==
src/backend/replication/pgoutput/pgoutput.c

11.
+ errno = 0;
+ parsed = strtoul(strVal(defel->arg), &endptr, 10);
+ if (errno != 0 || *endptr != '\0')
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid min_send_delay")));
+
+ if (parsed > PG_INT32_MAX)
+ ereport(ERROR,
+ 

RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-19 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thank you for reviewing! PSA new version.

> 1.
> +  
> +   The minimum delay for publisher sends data, in milliseconds
> +  
> + 
> 
> It would probably be better to write it as "The minimum delay, in
> milliseconds, by the publisher to send changes"

Fixed.

> 2. The subminsenddelay is placed inconsistently in the patch. In the
> docs (catalogs.sgml), system_views.sql, and in some places in the
> code, it is after subskiplsn, but in the catalog table and
> corresponding structure, it is placed after subowner. It should be
> consistently placed after the subscription owner.

Basically moved. Note that some parts, like maybe_reread_subscription(),
were not changed because the ordering there had already been broken.

> 3.
> + 
> +  WalSenderSendDelay
> +  Waiting for sending changes to subscriber in WAL sender
> +  process.
> 
> How about writing it as follows: "Waiting while sending changes for
> time-delayed logical replication in the WAL sender process."?

Fixed.

> 4.
> + 
> +  Any delay becomes effective only after all initial table
> +  synchronization has finished and occurs before each transaction
> +  starts to get applied on the subscriber. The delay does not take 
> into
> +  account the overhead of time spent in transferring the transaction,
> +  which means that the arrival time at the subscriber may be delayed
> +  more than the given time.
> + 
> 
> This needs to change based on a new approach. It should be something
> like: "The delay is effective only when the publisher decides to send
> a particular transaction downstream."

Right, the first sentence is partially changed as you said.

> 5.
> + * allowed. This is because in parallel streaming mode, we start applying
> + * the transaction stream as soon as the first change arrives without
> + * knowing the transaction's prepare/commit time. Always waiting for the
> + * full 'min_send_delay' period might include unnecessary delay.
> + *
> + * The other possibility was to apply the delay at the end of the parallel
> + * apply transaction but that would cause issues related to resource bloat
> + * and locks being held for a long time.
> + */
> 
> This part of the comments seems to imply more of a subscriber-side
> delay approach. I think we should try to adjust these as per the
> changed approach.

Adjusted.

> 6.
> @@ -666,6 +666,10 @@ DecodeCommit(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf,
>   buf->origptr, buf->endptr);
>   }
> 
> + /* Delay given time if the context has 'delay' callback */
> + if (ctx->delay)
> + ctx->delay(ctx, commit_time);
> +
> 
> I think we should invoke delay functionality only when
> ctx->min_send_delay > 0. Otherwise, there will be some unnecessary
> overhead. We can change the comment along the lines of: "Delay sending
> the changes if required. For streaming transactions, this means a
> delay in sending the last stream but that is okay because on the
> downstream the changes will be applied only after receiving the last
> stream."

Changed accordingly.

> 7. For 2PC transactions, I think we should add the delay in
> DecodePrerpare. Because after receiving the PREPARE, the downstream
> will apply the xact. In this case, we shouldn't add a delay for the
> commit_prepared.

Right, the transaction will end when the subscriber receives PREPARE. Fixed.
I've tested locally and the delay seemed to occur at the PREPARE phase.

> 8.
> +#
> +# If the subscription sets min_send_delay parameter, the logical replication
> +# worker will delay the transaction apply for min_send_delay milliseconds.
> 
> I think here also comments should be updated as per the changed
> approach for applying the delay on the publisher side.

Fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v3-0001-Time-delayed-logical-replication-on-publisher-sid.patch
Description:  v3-0001-Time-delayed-logical-replication-on-publisher-sid.patch


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-17 Thread Amit Kapila
On Fri, Feb 17, 2023 at 12:14 PM Hayato Kuroda (Fujitsu) wrote:
>
> Thank you for replying! This direction seems OK, so I started to revise the 
> patch.
> PSA new version.
>

Few comments:
=
1.
+  
+   The minimum delay for publisher sends data, in milliseconds
+  
+ 

It would probably be better to write it as "The minimum delay, in
milliseconds, by the publisher to send changes"

2. The subminsenddelay is placed inconsistently in the patch. In the
docs (catalogs.sgml), system_views.sql, and in some places in the
code, it is after subskiplsn, but in the catalog table and
corresponding structure, it is placed after subowner. It should be
consistently placed after the subscription owner.

3.
+ 
+  WalSenderSendDelay
+  Waiting for sending changes to subscriber in WAL sender
+  process.

How about writing it as follows: "Waiting while sending changes for
time-delayed logical replication in the WAL sender process."?

4.
+ 
+  Any delay becomes effective only after all initial table
+  synchronization has finished and occurs before each transaction
+  starts to get applied on the subscriber. The delay does not take into
+  account the overhead of time spent in transferring the transaction,
+  which means that the arrival time at the subscriber may be delayed
+  more than the given time.
+ 

This needs to change based on a new approach. It should be something
like: "The delay is effective only when the publisher decides to send
a particular transaction downstream."

5.
+ * allowed. This is because in parallel streaming mode, we start applying
+ * the transaction stream as soon as the first change arrives without
+ * knowing the transaction's prepare/commit time. Always waiting for the
+ * full 'min_send_delay' period might include unnecessary delay.
+ *
+ * The other possibility was to apply the delay at the end of the parallel
+ * apply transaction but that would cause issues related to resource bloat
+ * and locks being held for a long time.
+ */

This part of the comments seems to imply more of a subscriber-side
delay approach. I think we should try to adjust these as per the
changed approach.

6.
@@ -666,6 +666,10 @@ DecodeCommit(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf,
  buf->origptr, buf->endptr);
  }

+ /* Delay given time if the context has 'delay' callback */
+ if (ctx->delay)
+ ctx->delay(ctx, commit_time);
+

I think we should invoke delay functionality only when
ctx->min_send_delay > 0. Otherwise, there will be some unnecessary
overhead. We can change the comment along the lines of: "Delay sending
the changes if required. For streaming transactions, this means a
delay in sending the last stream but that is okay because on the
downstream the changes will be applied only after receiving the last
stream."

7. For 2PC transactions, I think we should add the delay in
DecodePrerpare. Because after receiving the PREPARE, the downstream
will apply the xact. In this case, we shouldn't add a delay for the
commit_prepared.

8.
+#
+# If the subscription sets min_send_delay parameter, the logical replication
+# worker will delay the transaction apply for min_send_delay milliseconds.

I think here also comments should be updated as per the changed
approach for applying the delay on the publisher side.

-- 
With Regards,
Amit Kapila.




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-16 Thread Hayato Kuroda (Fujitsu)
Dear Andres,

Thank you for giving comments! I understand that you have agreed with the
approach in which the publisher delays sending the data.

> > I'm not sure why output plugin is involved in the delay mechanism.
> 
> +many
> 
> The output plugin absolutely never should be involved in something like
> this. It was a grave mistake that OutputPluginUpdateProgress() calls were
> added to the commit callback and then proliferated.
> 
> 
> > It appears to me that it would be simpler if the delay occurred in reorder
> > buffer or logical decoder instead.
> 
> This is a feature specific to walsender. So the triggering logic should either
> directly live in the walsender, or in a callback set in
> LogicalDecodingContext. That could be called from decode.c or such.

OK, I can follow the opinion.
I think the walsender function should not be called directly from decode.c,
so I implemented it as a callback in LogicalDecodingContext, which is
called from decode.c if set.

New patch was posted in [1].

[1]: 
https://www.postgresql.org/message-id/TYAPR01MB5866F00191375D0193320A4DF5A19%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-16 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> > > > Perhaps what I understand
> > > > correctly is that we could delay right before only sending commit
> > > > records in this case. If we delay at publisher end, all changes will
> > > > be sent at once if !streaming, and otherwise, all changes in a
> > > > transaction will be spooled at subscriber end. In any case, apply
> > > > worker won't be holding an active transaction unnecessarily.
> > >
> > > What about parallel case? Latest patch does not reject the combination of
> > > parallel streaming mode and delay. If delay is done at commit and
> > > subscriber uses a parallel apply worker, it may acquire lock for a long
> > > time.
> >
> > I didn't look too closely, but my guess is that transactions are
> > conveyed in spool files in parallel mode, with each file storing a
> > complete transaction.
> >
> 
> No, we don't try to collect all the data in files for parallel mode.
> Having said that, it doesn't matter because we won't know the time of
> the commit (which is used to compute delay) before we encounter the
> commit record in WAL. So, I feel for this approach, we can follow what
> you said.

Right. And new patch follows the opinion. 

> > > > Of
> > > > course we need to add the mechanism to process keep-alive and status
> > > > report messages.
> > >
> > > Could you share the good way to handle keep-alive and status messages if
> > > you have one?
> > > If we changed to the decoding layer, it is strange to call the walsender
> > > function directly.
> >
> > I'm sorry, but I don't have a concrete idea at the moment. When I read
> > through the last patch, I missed that WalSndDelay is actually a subset
> > of WalSndLoop. Although it can handle keep-alives correctly, I'm not
> > sure we can accept that structure..
> >
> 
> I think we can use update_progress_txn() for this purpose, but note
> that we are discussing changing the same in thread [1].
> 
> [1] -
> https://www.postgresql.org/message-id/20230210210423.r26ndnfmuifie4f6%40
> awork3.anarazel.de

I did not reuse update_progress_txn() because we cannot use it
straightforwardly, but I can change this if we have a better idea.

New patch was posted in [1].

[1]: 
https://www.postgresql.org/message-id/TYAPR01MB5866F00191375D0193320A4DF5A19%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-16 Thread Hayato Kuroda (Fujitsu)
Dear Horiguchi-san,

Thank you for replying! This direction seems OK, so I started to revise the 
patch.
PSA new version.

> > > As Amit-K mentioned, we may need to change the name of the option in
> > > this version, since the delay mechanism in this version causes a delay
> > > in sending from publisher than delaying apply on the subscriber side.
> >
> > Right, will be changed.
> >
> > > I'm not sure why output plugin is involved in the delay mechanism. It
> > > appears to me that it would be simpler if the delay occurred in
> > > reorder buffer or logical decoder instead.
> >
> > I'm planning to change, but..
> 
> Yeah, I don't think we've made up our minds about which way to go yet,
> so it's a bit too early to work on that.

The parameter name is changed to min_send_delay.
And the delaying spot is changed to logical decoder.

> > > Perhaps what I understand
> > > correctly is that we could delay right before only sending commit
> > > records in this case. If we delay at publisher end, all changes will
> > > be sent at once if !streaming, and otherwise, all changes in a
> > > transaction will be spooled at subscriber end. In any case, apply
> > > worker won't be holding an active transaction unnecessarily.
> >
> > What about parallel case? Latest patch does not reject the combination of
> > parallel streaming mode and delay. If delay is done at commit and
> > subscriber uses a parallel apply worker, it may acquire lock for a long
> > time.
> 
> I didn't look too closely, but my guess is that transactions are
> conveyed in spool files in parallel mode, with each file storing a
> complete transaction.

Based on the advice, I moved the delaying to DecodeCommit().
And the combination of parallel streaming mode and min_send_delay is
rejected again.

> > > Of
> > > course we need to add the mechanism to process keep-alive and status
> > > report messages.
> >
> > Could you share the good way to handle keep-alive and status messages if
> > you have one?
> > If we changed to the decoding layer, it is strange to call the walsender
> > function directly.
> 
> I'm sorry, but I don't have a concrete idea at the moment. When I read
> through the last patch, I missed that WalSndDelay is actually a subset
> of WalSndLoop. Although it can handle keep-alives correctly, I'm not
> sure we can accept that structure..

No issues. I have kept the current implementation.

Some bugs I found are also fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v2-0001-Time-delayed-logical-replication-on-publisher-sid.patch
Description:  v2-0001-Time-delayed-logical-replication-on-publisher-sid.patch


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-16 14:21:01 +0900, Kyotaro Horiguchi wrote:
> I'm not sure why output plugin is involved in the delay mechanism.

+many

The output plugin absolutely never should be involved in something like
this. It was a grave mistake that OutputPluginUpdateProgress() calls were
added to the commit callback and then proliferated.


> It appears to me that it would be simpler if the delay occurred in reorder
> buffer or logical decoder instead.

This is a feature specific to walsender. So the triggering logic should either
directly live in the walsender, or in a callback set in
LogicalDecodingContext. That could be called from decode.c or such.

Greetings,

Andres Freund




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-16 Thread Amit Kapila
On Thu, Feb 16, 2023 at 2:25 PM Kyotaro Horiguchi wrote:
>
> At Thu, 16 Feb 2023 06:20:23 +, "Hayato Kuroda (Fujitsu)" wrote in
> > Dear Horiguchi-san,
> >
> > Thank you for responding! Before modifying patches, I want to confirm 
> > something
> > you said.
> >
> > > As Amit-K mentioned, we may need to change the name of the option in
> > > this version, since the delay mechanism in this version causes a delay
> > > in sending from publisher than delaying apply on the subscriber side.
> >
> > Right, will be changed.
> >
> > > I'm not sure why output plugin is involved in the delay mechanism. It
> > > appears to me that it would be simpler if the delay occurred in
> > > reorder buffer or logical decoder instead.
> >
> > I'm planning to change, but..
>
> Yeah, I don't think we've made up our minds about which way to go yet,
> so it's a bit too early to work on that.
>
> > > Perhaps, if I understand correctly, we could delay right before
> > > sending only the commit records in this case. If we delay at the
> > > publisher end, all changes will be sent at once if !streaming, and
> > > otherwise all changes in a transaction will be spooled at the
> > > subscriber end. In either case, the apply worker won't be holding an
> > > active transaction unnecessarily.
> >
> > What about the parallel case? The latest patch does not reject the
> > combination of parallel streaming mode and delay. If the delay is done at
> > commit and the subscriber uses a parallel apply worker, it may acquire a
> > lock for a long time.
>
> I didn't look too closely, but my guess is that transactions are
> conveyed in spool files in parallel mode, with each file storing a
> complete transaction.
>

No, we don't try to collect all the data in files for parallel mode.
Having said that, it doesn't matter because we won't know the time of
the commit (which is used to compute the delay) before we encounter the
commit record in the WAL. So, I feel for this approach, we can follow what
you said.

> > > Of
> > > course we need to add the mechanism to process keep-alive and status
> > > report messages.
> >
> > Could you share a good way to handle keep-alive and status messages if
> > you have one?
> > If we moved this to the decoding layer, it would be strange to call a
> > walsender function directly.
>
> I'm sorry, but I don't have a concrete idea at the moment. When I read
> through the last patch, I missed that WalSndDelay is actually a subset
> of WalSndLoop. Although it can handle keep-alives correctly, I'm not
> sure we can accept that structure..
>

I think we can use update_progress_txn() for this purpose but note
that we are discussing to change the same in thread [1].

[1] - 
https://www.postgresql.org/message-id/20230210210423.r26ndnfmuifie4f6%40awork3.anarazel.de

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-16 Thread Kyotaro Horiguchi
At Thu, 16 Feb 2023 06:20:23 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> Dear Horiguchi-san,
> 
> Thank you for responding! Before modifying patches, I want to confirm 
> something
> you said.
> 
> > As Amit-K mentioned, we may need to change the name of the option in
> > this version, since the delay mechanism in this version causes a delay
> > in sending from the publisher rather than delaying apply on the subscriber side.
> 
> Right, will be changed.
> 
> > I'm not sure why output plugin is involved in the delay mechanism. It
> > appears to me that it would be simpler if the delay occurred in
> > reorder buffer or logical decoder instead.
> 
> I'm planning to change, but..

Yeah, I don't think we've made up our minds about which way to go yet,
so it's a bit too early to work on that.

> > Perhaps, if I understand correctly, we could delay right before
> > sending only the commit records in this case. If we delay at the
> > publisher end, all changes will be sent at once if !streaming, and
> > otherwise all changes in a transaction will be spooled at the
> > subscriber end. In either case, the apply worker won't be holding an
> > active transaction unnecessarily.
> 
> What about the parallel case? The latest patch does not reject the
> combination of parallel streaming mode and delay. If the delay is done at
> commit and the subscriber uses a parallel apply worker, it may acquire a
> lock for a long time.

I didn't look too closely, but my guess is that transactions are
conveyed in spool files in parallel mode, with each file storing a
complete transaction.

> > Of
> > course we need to add the mechanism to process keep-alive and status
> > report messages.
> 
> Could you share a good way to handle keep-alive and status messages if
> you have one?
> If we moved this to the decoding layer, it would be strange to call a
> walsender function directly.

I'm sorry, but I don't have a concrete idea at the moment. When I read
through the last patch, I missed that WalSndDelay is actually a subset
of WalSndLoop. Although it can handle keep-alives correctly, I'm not
sure we can accept that structure..

> > Those setups work fine when no
> > apply-delay involved, but they won't work with the patches we're
> > talking about because the subscriber won't respond to the keep-alive
> > packets from the peer.  So when I wrote "practically works" in the
> > last mail, this is what I meant.
> 
> I'm not sure about that part. I think in the latest patch, the subscriber
> can respond to the keepalive packets from the peer. Also, the publisher can
> respond to the peer.
> Could you please tell me if you know a case where the publisher or
> subscriber cannot respond to the opposite side? Note that if we apply the
> publisher-side patch, we don't have to apply the subscriber-side patch.

Sorry about that again, I missed that part in the last patch as
mentioned earlier..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-15 Thread Hayato Kuroda (Fujitsu)
Dear Horiguchi-san,

Thank you for responding! Before modifying patches, I want to confirm something
you said.

> As Amit-K mentioned, we may need to change the name of the option in
> this version, since the delay mechanism in this version causes a delay
> in sending from the publisher rather than delaying apply on the subscriber side.

Right, will be changed.

> I'm not sure why output plugin is involved in the delay mechanism. It
> appears to me that it would be simpler if the delay occurred in
> reorder buffer or logical decoder instead.

I'm planning to change, but..

> Perhaps, if I understand correctly, we could delay right before
> sending only the commit records in this case. If we delay at the
> publisher end, all changes will be sent at once if !streaming, and
> otherwise all changes in a transaction will be spooled at the
> subscriber end. In either case, the apply worker won't be holding an
> active transaction unnecessarily.

What about the parallel case? The latest patch does not reject the
combination of parallel streaming mode and delay. If the delay is done at
commit and the subscriber uses a parallel apply worker, it may acquire a
lock for a long time.

> Of
> course we need to add the mechanism to process keep-alive and status
> report messages.

Could you share a good way to handle keep-alive and status messages if you
have one?
If we moved this to the decoding layer, it would be strange to call a
walsender function directly.

> Those setups work fine when no
> apply-delay involved, but they won't work with the patches we're
> talking about because the subscriber won't respond to the keep-alive
> packets from the peer.  So when I wrote "practically works" in the
> last mail, this is what I meant.

I'm not sure about that part. I think in the latest patch, the subscriber can
respond to the keepalive packets from the peer. Also, the publisher can
respond to the peer.
Could you please tell me if you know a case where the publisher or subscriber
cannot respond to the opposite side? Note that if we apply the publisher-side
patch, we don't have to apply the subscriber-side patch.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-15 Thread Kyotaro Horiguchi
At Wed, 15 Feb 2023 11:29:18 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> Dear Andres and other hackers,
> 
> > OTOH, if we want to implement the functionality on publisher-side,
> > I think we need to first consider the interface.
> > We can think of two options (a) Have it as a subscription parameter as the 
> > patch
> > has now and
> > then pass it as an option to the publisher which it will use to delay;
> > (b) Have it defined on publisher-side, say via GUC or some other way.
> > The basic idea could be that while processing commit record (in 
> > DecodeCommit),
> > we can somehow check the value of delay and then use it there to delay 
> > sending
> > the xact.
> > Also, during delay, we need to somehow send the keepalive and process 
> > replies,
> > probably via a new callback or by some existing callback.
> > We also need to handle in-progress and 2PC xacts in a similar way.
> > For the former, probably we would need to apply the delay before sending 
> > the first
> > stream.
> > Could you please share what you feel on this direction as well ?
> 
> I implemented a patch in which the delaying is done on the publisher side.
> In this patch, approach (a) was chosen: min_apply_delay is specified as a
> subscription parameter, and the apply worker then passes it to the
> publisher as an output plugin option.

As Amit-K mentioned, we may need to change the name of the option in
this version, since the delay mechanism in this version causes a delay
in sending from the publisher rather than delaying apply on the subscriber side.

I'm not sure why the output plugin is involved in the delay mechanism. It
appears to me that it would be simpler if the delay occurred in the
reorder buffer or logical decoder instead. Perhaps, if I understand
correctly, we could delay right before sending only the commit records
in this case. If we delay at the publisher end, all changes will
be sent at once if !streaming, and otherwise all changes in a
transaction will be spooled at the subscriber end. In either case, the apply
worker won't be holding an active transaction unnecessarily.  Of
course we need to add the mechanism to process keep-alive and status
report messages.

> During the delay, the walsender periodically checks and processes replies
> from the apply worker and sends keepalive messages if needed. Therefore,
> the ability to handle keepalives is not lost.

My understanding is that keep-alive is a different mechanism with a
different objective from status reports. Even if the subscriber doesn't
send spontaneous or extra status reports at all, the connection can be
checked and maintained by keep-alive packets. It is possible to set up
an asymmetric configuration where only the walsender sends keep-alives,
but none are sent from the peer. Those setups work fine when no
apply-delay is involved, but they won't work with the patches we're
talking about because the subscriber won't respond to the keep-alive
packets from the peer.  So when I wrote "practically works" in the
last mail, this is what I meant.

Thus, if someone plans to enable apply_delay for logical replication,
that person should be aware of some additional subtle restrictions
compared to non-delayed setups.

> To delay the transaction in the output plugin layer, a new
> LogicalOutputPlugin API was added. For now, I chose the output plugin
> layer but can consider doing it in the core if there is a better way.
> 
> Could you please share your opinion?
> 
> Note: thanks to Osumi-san for helping with the implementation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-15 Thread Hayato Kuroda (Fujitsu)
Dear Andres and other hackers,

> OTOH, if we want to implement the functionality on the publisher side,
> I think we need to first consider the interface.
> We can think of two options: (a) have it as a subscription parameter as the
> patch has now, and then pass it as an option to the publisher, which it
> will use to delay; (b) have it defined on the publisher side, say via a GUC
> or some other way.
> The basic idea could be that while processing the commit record (in
> DecodeCommit), we can somehow check the value of the delay and then use it
> there to delay sending the xact.
> Also, during the delay, we need to somehow send keepalives and process
> replies, probably via a new callback or some existing callback.
> We also need to handle in-progress and 2PC xacts in a similar way.
> For the former, we would probably need to apply the delay before sending
> the first stream.
> Could you please share what you feel about this direction as well?

I implemented a patch in which the delaying is done on the publisher side. In
this patch, approach (a) was chosen: min_apply_delay is specified as a
subscription parameter, and the apply worker then passes it to the publisher
as an output plugin option.
During the delay, the walsender periodically checks and processes replies
from the apply worker and sends keepalive messages if needed. Therefore, the
ability to handle keepalives is not lost.
To delay the transaction in the output plugin layer, a new LogicalOutputPlugin
API was added. For now, I chose the output plugin layer but can consider doing
it in the core if there is a better way.

Could you please share your opinion?

Note: thanks to Osumi-san for helping with the implementation.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



0001-Time-delayed-logical-replication-on-publisher-side.patch
Description:  0001-Time-delayed-logical-replication-on-publisher-side.patch


RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-13 Thread Takamichi Osumi (Fujitsu)
Hi, Andres-san


On Tuesday, February 14, 2023 1:47 AM Andres Freund  wrote:
> On 2023-02-11 05:44:47 +, Takamichi Osumi (Fujitsu) wrote:
> > On Saturday, February 11, 2023 11:10 AM Andres Freund
>  wrote:
> > > Has there been any discussion about whether this is actually best
> > > implemented on the client side? You could alternatively implement it
> > > on the sender.
> > >
> > > That'd have quite a few advantages, I think - you e.g. wouldn't
> > > remove the ability to *receive* and send feedback messages.  We'd
> > > not end up filling up the network buffer with data that we'll not process
> anytime soon.
> > Thanks for your comments !
> >
> > We have discussed the publisher-side idea around here [1], but we chose
> > the current direction. Kindly have a look at the discussion.
> >
> > If we apply the delay on the publisher, then it can lead to extra delay
> > where we don't need it.
> > The current proposed approach can take other loads or factors
> > (network, busyness of the publisher, etc.) into account because it
> > calculates the required delay on the subscriber.
> 
> I don't think it's OK to just lose the ability to read / reply to keepalive
> messages.
> 
> I think as-is we should seriously consider just rejecting the feature, as
> it adds too much complexity without corresponding gain.
Thanks for your comments !

Could you please tell us a bit more about your concern?

The keepalive/reply messages are currently used for two purposes:
(a) sending the updated write/flush/apply locations; (b) avoiding timeouts in
case of idle periods.
Neither case should be impacted by this time-delayed LR patch, because during
the delay there won't be any progress, and to avoid timeouts we allow sending
the keepalive message during the delay.
We would just like to clarify the issue you have in mind.

OTOH, if we want to implement the functionality on the publisher side,
I think we need to first consider the interface.
We can think of two options: (a) have it as a subscription parameter as the
patch has now, and then pass it as an option to the publisher, which it will
use to delay; (b) have it defined on the publisher side, say via a GUC or
some other way.
The basic idea could be that while processing the commit record (in
DecodeCommit), we can somehow check the value of the delay and then use it
there to delay sending the xact.
Also, during the delay, we need to somehow send keepalives and process
replies, probably via a new callback or some existing callback.
We also need to handle in-progress and 2PC xacts in a similar way.
For the former, we would probably need to apply the delay before sending the
first stream.
Could you please share what you feel about this direction as well?


Best Regards,
Takamichi Osumi





Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-13 Thread Kyotaro Horiguchi
At Mon, 13 Feb 2023 15:51:25 +0530, Amit Kapila  wrote 
in 
> I think we can introduce a new variable last_feedback_time in the
> LogicalRepWorker structure, and for last_received we can probably use
> last_lsn in MyLogicalRepWorker, as that seems to be updated correctly.
> I think it would be good to avoid global variables.

MyLogicalRepWorker is a global variable :p, but it is far better than a
bare one.

By the way, we are trying to send the status messages regularly, but
as Andres pointed out, the worker does not read or reply to keepalive
messages from the publisher while delaying. That is unavoidable as long
as we choke the stream at the subscriber end. It doesn't seem to be a
practical problem, but IMHO he's right in terms of adherence
to the wire protocol, which was also one of my own initial concerns.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-11 05:44:47 +, Takamichi Osumi (Fujitsu) wrote:
> On Saturday, February 11, 2023 11:10 AM Andres Freund  
> wrote:
> > Has there been any discussion about whether this is actually best
> > implemented on the client side? You could alternatively implement it on the
> > sender.
> > 
> > That'd have quite a few advantages, I think - you e.g. wouldn't remove the
> > ability to *receive* and send feedback messages.  We'd not end up filling up
> > the network buffer with data that we'll not process anytime soon.
> Thanks for your comments !
> 
> We have discussed the publisher-side idea around here [1],
> but we chose the current direction. Kindly have a look at the discussion.
> 
> If we apply the delay on the publisher, then
> it can lead to extra delay where we don't need it.
> The current proposed approach can take other loads or factors
> (network, busyness of the publisher, etc.) into account
> because it calculates the required delay on the subscriber.

I don't think it's OK to just lose the ability to read / reply to keepalive
messages.

I think as-is we should seriously consider just rejecting the feature, as it
adds too much complexity without corresponding gain.

Greetings,

Andres Freund




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-13 Thread Amit Kapila
On Fri, Feb 10, 2023 at 4:56 PM Takamichi Osumi (Fujitsu)
 wrote:
>
> On Friday, February 10, 2023 2:05 PM wrote:
> > On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila 
> > wrote:
>
> In the previous patch, we couldn't prevent the publisher's timeout when we
> ran a scenario suggested by Horiguchi-san, reproduced in the attached test
> file 'test.sh'.
> But now we handle it by adjusting the timing of the first wait.
>
> FYI, we thought at first to implement the new variable 'send_time' in the
> LogicalRepWorker structure. But this structure is used when the launcher
> controls workers or reports statistics, and it stores TimestampTz values
> recorded in the received WAL, so we are not sure the struct is the right
> place for the variable. Moreover, there are other similar variables such
> as last_recv_time or reply_time, which would be confusing alongside a new
> one. So, it's declared separately.
>

I think we can introduce a new variable last_feedback_time in the
LogicalRepWorker structure, and for last_received we can probably use
last_lsn in MyLogicalRepWorker, as that seems to be updated correctly.
I think it would be good to avoid global variables.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-13 Thread Peter Smith
Here are my review comments for the v34 patch.

==
src/backend/replication/logical/worker.c

+/* The last time we send a feedback message */
+static TimestampTz send_time = 0;
+

IMO this is a bad variable name. When this variable was changed to be
global it ought to have been renamed.

The name "send_time" is almost meaningless without any contextual information.

But also it's bad because this global name is "shadowed" by several
other parameters and other local variables using that same name  (e.g.
see UpdateWorkerStats, LogicalRepApplyLoop, etc). It is too confusing.

How about using a unique/meaningful name with a comment to match to
improve readability and remove unwanted shadowing?

SUGGESTION
/* Timestamp of when the last feedback message was sent. */
static TimestampTz last_sent_feedback_ts = 0;

~~~

2. maybe_apply_delay

+ /* Apply the delay by the latch mechanism */
+ do
+ {
+ TimestampTz delayUntil;
+ long diffms;
+
+ ResetLatch(MyLatch);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* This might change wal_receiver_status_interval */
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ /*
+ * Before calculating the time duration, reload the catalog if needed.
+ */
+ if (!in_remote_transaction && !in_streamed_transaction)
+ {
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();
+ }
+
+ delayUntil = TimestampTzPlusMilliseconds(finish_ts,
MySubscription->minapplydelay);
+ diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
+
+ /*
+ * Exit without arming the latch if it's already past time to apply
+ * this transaction.
+ */
+ if (diffms <= 0)
+ break;
+
+ elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay
= %d ms, remaining wait time: %ld ms",
+ xid, MySubscription->minapplydelay, diffms);
+
+ /*
+ * Call send_feedback() to prevent the publisher from exiting by
+ * timeout during the delay, when the status interval is greater than
+ * zero.
+ */
+ if (!status_interval_ms)
+ {
+ TimestampTz nextFeedback;
+
+ /*
+ * Based on the last time when we send a feedback message, adjust
+ * the first delay time for this transaction. This ensures that
+ * the first feedback message follows wal_receiver_status_interval
+ * interval.
+ */
+ nextFeedback = TimestampTzPlusMilliseconds(send_time,
+wal_receiver_status_interval * 1000L);
+ status_interval_ms =
TimestampDifferenceMilliseconds(GetCurrentTimestamp(), nextFeedback);
+ }
+ else
+ status_interval_ms = wal_receiver_status_interval * 1000L;
+
+ if (status_interval_ms > 0 && diffms > status_interval_ms)
+ {
+ WaitLatch(MyLatch,
+   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+   status_interval_ms,
+   WAIT_EVENT_LOGICAL_APPLY_DELAY);
+ send_feedback(last_received, true, false, true);
+ }
+ else
+ WaitLatch(MyLatch,
+   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+   diffms,
+   WAIT_EVENT_LOGICAL_APPLY_DELAY);
+
+ } while (true);

~

IMO this logic has been tweaked too many times without revisiting the
variable names and logic from scratch, so it has become over-complex
- some variable names are assuming multiple meanings
- multiple * 1000L have crept back in again
- the 'diffms' is too generic now with so many vars so it has lost its meaning
- GetCurrentTimestamp call in multiple places

SUGGESTIONS
- rename some variables and simplify the logic.
- reduce all the if/else
- don't be sneaky with the meaning of status_interval_ms
- 'diffms' --> 'remaining_delay_ms'
- 'DelayUntil' --> 'delay_until_ts'
- introduce 'now' variable
- simplify the check of (next_feedback_due_ms < remaining_delay_ms)

SUGGESTION (WFM)

/* Apply the delay by the latch mechanism */
while (true)
{
TimestampTz now;
TimestampTz delay_until_ts;
long remaining_delay_ms;
long status_interval_ms;

ResetLatch(MyLatch);

CHECK_FOR_INTERRUPTS();

/* This might change wal_receiver_status_interval */
if (ConfigReloadPending)
{
ConfigReloadPending = false;
ProcessConfigFile(PGC_SIGHUP);
}

/*
* Before calculating the time duration, reload the catalog if needed.
*/
if (!in_remote_transaction && !in_streamed_transaction)
{
AcceptInvalidationMessages();
maybe_reread_subscription();
}

now = GetCurrentTimestamp();
delay_until_ts = TimestampTzPlusMilliseconds(finish_ts,
MySubscription->minapplydelay);
remaining_delay_ms = TimestampDifferenceMilliseconds(now, delay_until_ts);

/*
* Exit without arming the latch if it's already past time to apply
* this transaction.
*/
if (remaining_delay_ms <= 0)
break;

elog(DEBUG2, "time-delayed replication for txid %u, min_apply_delay =
%d ms, remaining wait time: %ld ms",
xid, MySubscription->minapplydelay, remaining_delay_ms);
/*
* If a status interval is defined then we may need to call send_feedback()
* early to prevent the publisher from exiting during a long apply delay.
*/
status_interval_ms = wal_receiver_status_interval * 1000L;
if (status_interval_ms > 0)
{
TimestampTz next_feedback_due_ts;
long next_feedback_due_ms;

/*
* Find if the next feedback 

RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-12 Thread Takamichi Osumi (Fujitsu)
Hi, Horiguchi-san


On Monday, February 13, 2023 10:26 AM Kyotaro Horiguchi 
 wrote:
> At Fri, 10 Feb 2023 10:34:49 +0530, Amit Kapila 
> wrote in
> > On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila 
> wrote:
> > >
> > > On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi
> > >  wrote:
> > > We have suffered this kind of feedback silence many times. Thus I
> > > don't want to rely on luck here. I had in mind exposing
> > > last_send itself or providing an interval-calculation function to the logic.
> > >
> > > I think we have last_send time in send_feedback(), so we can expose
> > > it if we want but how would that solve the problem you are worried
> about?
> 
> Wal receiver can avoid a too-long sleep by knowing when to wake up for the
> next feedback.
> 
> > I have an idea to use last_send time to avoid walsenders being
> > timeout. Instead of directly using wal_receiver_status_interval as a
> > minimum interval to send the feedback, we should check if it is
> > greater than last_send time then we should send the feedback after
> > (wal_receiver_status_interval - last_send). I think they can probably
> > be different only on the very first time. Any better ideas?
> 
> If it means MyLogicalRepWorker->last_send_time, it is not the last time
> when the walreceiver sent feedback but the last time when the
> wal*sender* sent data. So I'm not sure that works.
> 
> We could use the variable that way, but AFAIU, when so many changes
> have been spooled that control doesn't return to LogicalRepApplyLoop
> for longer than wal_r_s_interval, maybe_apply_delay() starts calling
> send_feedback() at every call after the first feedback timing.  Even in
> that case, send_feedback() won't actually send one until the next feedback
> timing; I don't think that behavior is great.
> 
> The only packets the walreceiver sends back are the feedback packets, and
> currently only send_feedback knows the last feedback time.
Thanks for your comments !

As described in your last sentence, in the latest patch v34 [1], we use the
last time set in send_feedback(), and based on it we calculate and adjust the
first feedback timing in maybe_apply_delay() so that the feedback message
follows the wal_receiver_status_interval interval. I wasn't sure if the
above concern is still valid for this implementation.

Could you please have a look at the latest patch and share your opinion ?


[1] - 
https://www.postgresql.org/message-id/TYCPR01MB83736C50C98CB2153728A7A8EDDE9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi





Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-12 Thread Kyotaro Horiguchi
At Fri, 10 Feb 2023 10:34:49 +0530, Amit Kapila  wrote 
in 
> On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila  wrote:
> >
> > On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi
> >  wrote:
> > > We have suffered this kind of feedback silence many times. Thus I
> > > don't want to rely on luck here. I had in mind exposing last_send
> > > itself or providing an interval-calculation function to the logic.
> >
> > I think we have last_send time in send_feedback(), so we can expose it
> > if we want but how would that solve the problem you are worried about?

Wal receiver can avoid a too-long sleep by knowing when to wake up for
the next feedback.

> I have an idea to use last_send time to avoid walsenders being
> timeout. Instead of directly using wal_receiver_status_interval as a
> minimum interval to send the feedback, we should check if it is
> greater than last_send time then we should send the feedback after
> (wal_receiver_status_interval - last_send). I think they can probably
> be different only on the very first time. Any better ideas?

If it means MyLogicalRepWorker->last_send_time, it is not the last
time when the walreceiver sent feedback but the last time when the
wal*sender* sent data. So I'm not sure that works.

We could use the variable that way, but AFAIU, when so many
changes have been spooled that control doesn't return to
LogicalRepApplyLoop for longer than wal_r_s_interval, maybe_apply_delay()
starts calling send_feedback() at every call after the first feedback
timing.  Even in that case, send_feedback() won't actually send one
until the next feedback timing; I don't think that behavior is great.

The only packets the walreceiver sends back are the feedback packets, and
currently only send_feedback knows the last feedback time.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-10 Thread Takamichi Osumi (Fujitsu)
Hi


On Saturday, February 11, 2023 11:10 AM Andres Freund  
wrote:
> On 2023-02-10 11:26:21 +, Takamichi Osumi (Fujitsu) wrote:
> > Subject: [PATCH v34] Time-delayed logical replication subscriber
> >
> > Similar to physical replication, a time-delayed copy of the data for
> > logical replication is useful for some scenarios (particularly to fix
> > errors that might cause data loss).
> >
> > This patch implements a new subscription parameter called
> 'min_apply_delay'.
> Has there been any discussion about whether this is actually best
> implemented on the client side? You could alternatively implement it on the
> sender.
> 
> That'd have quite a few advantages, I think - you e.g. wouldn't remove the
> ability to *receive* and send feedback messages.  We'd not end up filling up
> the network buffer with data that we'll not process anytime soon.
Thanks for your comments !

We have discussed the publisher-side idea around here [1],
but we chose the current direction. Kindly have a look at the discussion.

If we apply the delay on the publisher, then
it can lead to extra delay where we don't need it.
The current proposed approach can take other loads or factors
(network, busyness of the publisher, etc.) into account
because it calculates the required delay on the subscriber.


[1] - 
https://www.postgresql.org/message-id/20221215.105200.268327207020006785.horikyota.ntt%40gmail.com


Best Regards,
Takamichi Osumi





Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-10 Thread Andres Freund
Hi,

On 2023-02-10 11:26:21 +, Takamichi Osumi (Fujitsu) wrote:
> Subject: [PATCH v34] Time-delayed logical replication subscriber
> 
> Similar to physical replication, a time-delayed copy of the data for
> logical replication is useful for some scenarios (particularly to fix
> errors that might cause data loss).
> 
> This patch implements a new subscription parameter called 'min_apply_delay'.

Sorry for not reading through the thread, but it's very long.


Has there been any discussion about whether this is actually best implemented
on the client side? You could alternatively implement it on the sender.

That'd have quite a few advantages, I think - you e.g. wouldn't remove the
ability to *receive* and send feedback messages.  We'd not end up filling up
the network buffer with data that we'll not process anytime soon.

Greetings,

Andres Freund




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-10 Thread Takamichi Osumi (Fujitsu)
Hi,

On Friday, February 10, 2023 2:05 PM wrote:
> On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila 
> wrote:
> >
> > On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila
> > >  wrote in
> > > > On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi
> > > >  wrote:
> > > > >
> > > > > At Wed, 8 Feb 2023 09:03:03 +, "Hayato Kuroda (Fujitsu)"
> > > > >  wrote in
> > > > > > Thank you for reviewing! PSA new version.
> > > > >
> > > > > +   if (statusinterval_ms > 0 && diffms >
> > > > > + statusinterval_ms)
> > > > >
> > > > > The next expected feedback time is measured from the last status
> > > > > report.  Thus, it seems to me this may suppress feedbacks from
> > > > > being sent for an unexpectedly long time especially when
> > > > > min_apply_delay is shorter than wal_r_s_interval.
> > > > >
> > > >
> > > > I think the minimum time before we send any feedback during the
> > > > wait is wal_r_s_interval. Now, I think if there is no transaction
> > > > for a long time before we get a new transaction, there should be
> > > > keep-alive messages in between which would allow us to send
> > > > feedback at regular intervals (wal_receiver_status_interval). So,
> > > > I think we should be
> > >
> > > Right.
> > >
> > > > able to send feedback in less than 2 *
> > > > wal_receiver_status_interval unless wal_sender/receiver timeout is
> > > > very large and there is a very low volume of transactions. Now, we
> > > > can try to send the feedback
> > >
> > > We have suffered this kind of feedback silence many times. Thus I
> > > don't want to rely on luck here. I had in mind exposing last_send
> > > itself or providing an interval-calculation function to the logic.
> > >
> >
> > I think we have last_send time in send_feedback(), so we can expose it
> > if we want but how would that solve the problem you are worried about?
> >
> 
> I have an idea to use last_send time to avoid walsenders being timeout.
> Instead of directly using wal_receiver_status_interval as a minimum interval
> to send the feedback, we should check if it is greater than last_send time
> then we should send the feedback after (wal_receiver_status_interval -
> last_send). I think they can probably be different only on the very first 
> time.
> Any better ideas?
This idea sounds good to me, and I have implemented it
in the attached patch v34.

In the previous patch, we couldn't solve the
timeout of the publisher when we ran the scenario suggested by Horiguchi-san
(reproduced in the attached test file 'test.sh').
But now we handle it by adjusting the timing of the first wait time.

FYI, at first we thought to implement the new variable 'send_time'
in the LogicalRepWorker structure. But this structure
is used when the launcher controls workers or reports statistics,
and it stores TimestampTz values recorded in the received WAL,
so we were not sure the struct was the right place for the variable.
Moreover, there are other similar variables such as last_recv_time
or reply_time, which would be confusing if the new variable sat
alongside them. So, it is declared separately.

The new patch also includes some changes for the wait event.
Kindly have a look at the v34 patch.


Best Regards,
Takamichi Osumi



v34-0001-Time-delayed-logical-replication-subscriber.patch
Description:  v34-0001-Time-delayed-logical-replication-subscriber.patch


test.sh
Description: test.sh


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-09 Thread Amit Kapila
On Fri, Feb 10, 2023 at 10:11 AM Amit Kapila  wrote:
>
> On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila  
> > wrote in
> > > On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi
> > >  wrote:
> > > >
> > > > At Wed, 8 Feb 2023 09:03:03 +, "Hayato Kuroda (Fujitsu)" 
> > > >  wrote in
> > > > > Thank you for reviewing! PSA new version.
> > > >
> > > > +   if (statusinterval_ms > 0 && diffms > statusinterval_ms)
> > > >
> > > > The next expected feedback time is measured from the last status
> > > > report.  Thus, it seems to me this may suppress feedbacks from being
> > > > sent for an unexpectedly long time especially when min_apply_delay is
> > > > shorter than wal_r_s_interval.
> > > >
> > >
> > > I think the minimum time before we send any feedback during the wait
> > > is wal_r_s_interval. Now, I think if there is no transaction for a
> > > long time before we get a new transaction, there should be keep-alive
> > > messages in between which would allow us to send feedback at regular
> > > intervals (wal_receiver_status_interval). So, I think we should be
> >
> > Right.
> >
> > > able to send feedback in less than 2 * wal_receiver_status_interval
> > > unless wal_sender/receiver timeout is very large and there is a very
> > > low volume of transactions. Now, we can try to send the feedback
> >
> > We have suffered this kind of feedback silence many times. Thus I
> > don't want to rely on luck here. I had in mind exposing last_send
> > itself or providing an interval-calculation function to the logic.
> >
>
> I think we have last_send time in send_feedback(), so we can expose it
> if we want but how would that solve the problem you are worried about?
>

I have an idea to use last_send time to avoid walsenders being
timeout. Instead of directly using wal_receiver_status_interval as a
minimum interval to send the feedback, we should check if it is
greater than last_send time then we should send the feedback after
(wal_receiver_status_interval - last_send). I think they can probably
be different only on the very first time. Any better ideas?

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-09 Thread Amit Kapila
On Fri, Feb 10, 2023 at 6:27 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila  
> wrote in
> > On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Wed, 8 Feb 2023 09:03:03 +, "Hayato Kuroda (Fujitsu)" 
> > >  wrote in
> > > > Thank you for reviewing! PSA new version.
> > >
> > > +   if (statusinterval_ms > 0 && diffms > statusinterval_ms)
> > >
> > > The next expected feedback time is measured from the last status
> > > report.  Thus, it seems to me this may suppress feedbacks from being
> > > sent for an unexpectedly long time especially when min_apply_delay is
> > > shorter than wal_r_s_interval.
> > >
> >
> > I think the minimum time before we send any feedback during the wait
> > is wal_r_s_interval. Now, I think if there is no transaction for a
> > long time before we get a new transaction, there should be keep-alive
> > messages in between which would allow us to send feedback at regular
> > intervals (wal_receiver_status_interval). So, I think we should be
>
> Right.
>
> > able to send feedback in less than 2 * wal_receiver_status_interval
> > unless wal_sender/receiver timeout is very large and there is a very
> > low volume of transactions. Now, we can try to send the feedback
>
> We have suffered this kind of feedback silence many times. Thus I
> don't want to rely on luck here. I had in mind exposing last_send
> itself or providing an interval-calculation function to the logic.
>

I think we have last_send time in send_feedback(), so we can expose it
if we want but how would that solve the problem you are worried about?
The one simple idea as I shared in my last email was to send feedback
every wal_receiver_status_interval / 2. I think this should avoid any
timeout problem because we already recommend setting it to lesser than
wal_sender_timeout.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-09 Thread Kyotaro Horiguchi
Mmm. A part of the previous mail has gone missing for an unknown
reason and was replaced by mysterious blank lines...

At Fri, 10 Feb 2023 09:57:22 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila  
> wrote in 
> > On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Wed, 8 Feb 2023 09:03:03 +, "Hayato Kuroda (Fujitsu)" 
> > >  wrote in
> > > > Thank you for reviewing! PSA new version.
> > >
> > > +   if (statusinterval_ms > 0 && diffms > statusinterval_ms)
> > >
> > > The next expected feedback time is measured from the last status
> > > report.  Thus, it seems to me this may suppress feedbacks from being
> > > sent for an unexpectedly long time especially when min_apply_delay is
> > > shorter than wal_r_s_interval.
> > >
> > 
> > I think the minimum time before we send any feedback during the wait
> > is wal_r_s_interval. Now, I think if there is no transaction for a
> > long time before we get a new transaction, there should be keep-alive
> > messages in between which would allow us to send feedback at regular
> > intervals (wal_receiver_status_interval). So, I think we should be
> 
> Right.
> 
> > able to send feedback in less than 2 * wal_receiver_status_interval
> > unless wal_sender/receiver timeout is very large and there is a very
> > low volume of transactions. Now, we can try to send the feedback
> 
> We have suffered this kind of feedback silence many times. Thus I
> don't want to rely on luck here. I had in mind exposing last_send
> itself or providing an interval-calculation function to the logic.
> 
> > before we start waiting or maybe after every
> > wal_receiver_status_interval / 2 but I think that will lead to more
> > spurious feedback messages than we get the benefit from them.

Agreed. I think we don't want to do that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-09 Thread Kyotaro Horiguchi
At Thu, 9 Feb 2023 13:48:52 +0530, Amit Kapila  wrote 
in 
> On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Wed, 8 Feb 2023 09:03:03 +, "Hayato Kuroda (Fujitsu)" 
> >  wrote in
> > > Thank you for reviewing! PSA new version.
> >
> > +   if (statusinterval_ms > 0 && diffms > statusinterval_ms)
> >
> > The next expected feedback time is measured from the last status
> > report.  Thus, it seems to me this may suppress feedbacks from being
> > sent for an unexpectedly long time especially when min_apply_delay is
> > shorter than wal_r_s_interval.
> >
> 
> I think the minimum time before we send any feedback during the wait
> is wal_r_s_interval. Now, I think if there is no transaction for a
> long time before we get a new transaction, there should be keep-alive
> messages in between which would allow us to send feedback at regular
> intervals (wal_receiver_status_interval). So, I think we should be

Right.

> able to send feedback in less than 2 * wal_receiver_status_interval
> unless wal_sender/receiver timeout is very large and there is a very
> low volume of transactions. Now, we can try to send the feedback

We have suffered this kind of feedback silence many times. Thus I
don't want to rely on luck here. I had in mind exposing last_send
itself or providing an interval-calculation function to the logic.

> before we start waiting or maybe after every
> wal_receiver_status_interval / 2 but I think that will lead to more
> spurious feedback messages than we get the benefit from them.



regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-09 Thread Kyotaro Horiguchi
At Thu, 9 Feb 2023 13:26:19 +0530, Amit Kapila  wrote 
in 
amit.kapila16> On Thu, Feb 9, 2023 at 12:17 AM Peter Smith 
 wrote:
> > I understand in theory, your code is more efficient, but in practice,
> > I think the overhead of a single variable assignment every loop
> > iteration (which is doing WaitLatch anyway) is of insignificant
> > concern, whereas having one assignment is simpler than having two IMO.
> >
> 
> Yeah, that sounds better to me as well.

FWIW, I'm on board with this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-09 Thread Peter Smith
> The comment adjustment suggested by Peter-san above
> was also included in this v33.
> Please have a look at the attached patch.

Patch v33 LGTM.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-09 Thread Takamichi Osumi (Fujitsu)
Hi,


On Thursday, February 9, 2023 4:56 PM Amit Kapila  
wrote:
> On Thu, Feb 9, 2023 at 12:17 AM Peter Smith 
> wrote:
> >
> > On Wed, Feb 8, 2023 at 8:03 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > ...
> > > > ==
> > > >
> > > > src/backend/replication/logical/worker.c
> > > >
> > > > 2. maybe_apply_delay
> > > >
> > > > + if (wal_receiver_status_interval > 0 && diffms >
> > > > + wal_receiver_status_interval * 1000L) { WaitLatch(MyLatch,
> > > > +   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> > > > +   wal_receiver_status_interval * 1000L,
> > > > +   WAIT_EVENT_RECOVERY_APPLY_DELAY);
> send_feedback(last_received,
> > > > + true, false, true); }
> > > >
> > > > I felt that introducing another variable like:
> > > >
> > > > long statusinterval_ms = wal_receiver_status_interval * 1000L;
> > > >
> > > > would help here by doing 2 things:
> > > > 1) The condition would be easier to read because the ms units
> > > > would be the same
> > > > 2) Won't need * 1000L repeated in two places.
> > > >
> > > > Only, do take care to assign this variable in the right place in
> > > > this loop in case the configuration is changed.
> > >
> > > Fixed. Calculations are done in two places: the first one is at the
> > > entrance of the loop, and the second one is after SIGHUP is detected.
> > >
> >
> > TBH, I expected you would write this as just a *single* variable
> > assignment before the condition like below:
> >
> > SUGGESTION (tweaked comment and put single assignment before
> > condition)
> > /*
> >  * Call send_feedback() to prevent the publisher from exiting by
> >  * timeout during the delay, when the status interval is greater than
> >  * zero.
> >  */
> > status_interval_ms = wal_receiver_status_interval * 1000L; if
> > (status_interval_ms > 0 && diffms > status_interval_ms) { ...
> >
> > ~
> >
> > I understand in theory, your code is more efficient, but in practice,
> > I think the overhead of a single variable assignment every loop
> > iteration (which is doing WaitLatch anyway) is of insignificant
> > concern, whereas having one assignment is simpler than having two IMO.
> >
> 
> Yeah, that sounds better to me as well.
OK, fixed.

The comment adjustment suggested by Peter-san above
was also included in this v33.
Please have a look at the attached patch.



Best Regards,
Takamichi Osumi



v33-0001-Time-delayed-logical-replication-subscriber.patch
Description:  v33-0001-Time-delayed-logical-replication-subscriber.patch


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-09 Thread Amit Kapila
On Thu, Feb 9, 2023 at 10:45 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 8 Feb 2023 09:03:03 +, "Hayato Kuroda (Fujitsu)" 
>  wrote in
> > Thank you for reviewing! PSA new version.
>
> +   if (statusinterval_ms > 0 && diffms > statusinterval_ms)
>
> The next expected feedback time is measured from the last status
> report.  Thus, it seems to me this may suppress feedbacks from being
> sent for an unexpectedly long time especially when min_apply_delay is
> shorter than wal_r_s_interval.
>

I think the minimum time before we send any feedback during the wait
is wal_r_s_interval. Now, I think if there is no transaction for a
long time before we get a new transaction, there should be keep-alive
messages in between which would allow us to send feedback at regular
intervals (wal_receiver_status_interval). So, I think we should be
able to send feedback in less than 2 * wal_receiver_status_interval
unless wal_sender/receiver timeout is very large and there is a very
low volume of transactions. Now, we can try to send the feedback
before we start waiting or maybe after every
wal_receiver_status_interval / 2 but I think that will lead to more
spurious feedback messages than we get the benefit from them.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-08 Thread Amit Kapila
On Thu, Feb 9, 2023 at 12:17 AM Peter Smith  wrote:
>
> On Wed, Feb 8, 2023 at 8:03 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> ...
> > > ==
> > >
> > > src/backend/replication/logical/worker.c
> > >
> > > 2. maybe_apply_delay
> > >
> > > + if (wal_receiver_status_interval > 0 &&
> > > + diffms > wal_receiver_status_interval * 1000L)
> > > + {
> > > + WaitLatch(MyLatch,
> > > +   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> > > +   wal_receiver_status_interval * 1000L,
> > > +   WAIT_EVENT_RECOVERY_APPLY_DELAY);
> > > + send_feedback(last_received, true, false, true);
> > > + }
> > >
> > > I felt that introducing another variable like:
> > >
> > > long statusinterval_ms = wal_receiver_status_interval * 1000L;
> > >
> > > would help here by doing 2 things:
> > > 1) The condition would be easier to read because the ms units would be 
> > > the same
> > > 2) Won't need * 1000L repeated in two places.
> > >
> > > Only, do take care to assign this variable in the right place in this
> > > loop in case the configuration is changed.
> >
> > Fixed. Calculations are done in two places: the first one is at the
> > entrance of the loop,
> > and the second one is after SIGHUP is detected.
> >
>
> TBH, I expected you would write this as just a *single* variable
> assignment before the condition like below:
>
> SUGGESTION (tweaked comment and put single assignment before condition)
> /*
>  * Call send_feedback() to prevent the publisher from exiting by
>  * timeout during the delay, when the status interval is greater than
>  * zero.
>  */
> status_interval_ms = wal_receiver_status_interval * 1000L;
> if (status_interval_ms > 0 && diffms > status_interval_ms)
> {
> ...
>
> ~
>
> I understand in theory, your code is more efficient, but in practice,
> I think the overhead of a single variable assignment every loop
> iteration (which is doing WaitLatch anyway) is of insignificant
> concern, whereas having one assignment is simpler than having two IMO.
>

Yeah, that sounds better to me as well.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-08 Thread Kyotaro Horiguchi
At Wed, 8 Feb 2023 09:03:03 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> Thank you for reviewing! PSA new version.

+   if (statusinterval_ms > 0 && diffms > statusinterval_ms)

The next expected feedback time is measured from the last status
report.  Thus, it seems to me this may suppress feedbacks from being
sent for an unexpectedly long time especially when min_apply_delay is
shorter than wal_r_s_interval.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-08 Thread Peter Smith
On Wed, Feb 8, 2023 at 8:03 PM Hayato Kuroda (Fujitsu)
 wrote:
>
...
> > ==
> >
> > src/backend/replication/logical/worker.c
> >
> > 2. maybe_apply_delay
> >
> > + if (wal_receiver_status_interval > 0 &&
> > + diffms > wal_receiver_status_interval * 1000L)
> > + {
> > + WaitLatch(MyLatch,
> > +   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> > +   wal_receiver_status_interval * 1000L,
> > +   WAIT_EVENT_RECOVERY_APPLY_DELAY);
> > + send_feedback(last_received, true, false, true);
> > + }
> >
> > I felt that introducing another variable like:
> >
> > long statusinterval_ms = wal_receiver_status_interval * 1000L;
> >
> > would help here by doing 2 things:
> > 1) The condition would be easier to read because the ms units would be the 
> > same
> > 2) Won't need * 1000L repeated in two places.
> >
> > Only, do take care to assign this variable in the right place in this
> > loop in case the configuration is changed.
>
> Fixed. Calculations are done in two places: the first one is at the entrance 
> of the loop,
> and the second one is after SIGHUP is detected.
>

TBH, I expected you would write this as just a *single* variable
assignment before the condition like below:

SUGGESTION (tweaked comment and put single assignment before condition)
/*
 * Call send_feedback() to prevent the publisher from exiting by
 * timeout during the delay, when the status interval is greater than
 * zero.
 */
status_interval_ms = wal_receiver_status_interval * 1000L;
if (status_interval_ms > 0 && diffms > status_interval_ms)
{
...

~

I understand in theory, your code is more efficient, but in practice,
I think the overhead of a single variable assignment every loop
iteration (which is doing WaitLatch anyway) is of insignificant
concern, whereas having one assignment is simpler than having two IMO.

But, if you want to keep it the way you have then that is OK.

Otherwise, this patch v32 LGTM.

--
Kind Regards,
Peter Smith.
Fujitsu Australia.




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-08 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thank you for reviewing! PSA new version.

> ==
> doc/src/sgml/glossary.sgml
> 
> 1.
> + 
> +  Replication setup that applies time-delayed copy of the data.
> +
> 
> That sentence seemed a bit strange to me.
> 
> SUGGESTION
> Replication setup that delays the application of changes by a
> specified minimum time-delay period.

Fixed.

> ==
> 
> src/backend/replication/logical/worker.c
> 
> 2. maybe_apply_delay
> 
> + if (wal_receiver_status_interval > 0 &&
> + diffms > wal_receiver_status_interval * 1000L)
> + {
> + WaitLatch(MyLatch,
> +   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> +   wal_receiver_status_interval * 1000L,
> +   WAIT_EVENT_RECOVERY_APPLY_DELAY);
> + send_feedback(last_received, true, false, true);
> + }
> 
> I felt that introducing another variable like:
> 
> long statusinterval_ms = wal_receiver_status_interval * 1000L;
> 
> would help here by doing 2 things:
> 1) The condition would be easier to read because the ms units would be the 
> same
> 2) Won't need * 1000L repeated in two places.
> 
> Only, do take care to assign this variable in the right place in this
> loop in case the configuration is changed.

Fixed. Calculations are done in two places: the first one is at the entrance 
of the loop,
and the second one is after SIGHUP is detected.

> ==
> src/test/subscription/t/001_rep_changes.pl
> 
> 3.
> +# Test time-delayed logical replication
> +#
> +# If the subscription sets min_apply_delay parameter, the logical replication
> +# worker will delay the transaction apply for min_apply_delay milliseconds. 
> We
> +# look the time duration between tuples are inserted on publisher and then
> +# changes are replicated on subscriber.
> 
> This comment and the other one appearing later in this test are both
> explaining the same test strategy. I think both comments should be
> combined into one big one up-front, like this:
> 
> SUGGESTION
> If the subscription sets min_apply_delay parameter, the logical
> replication worker will delay the transaction apply for
> min_apply_delay milliseconds. We verify this by looking at the time
> difference between a) when tuples are inserted on the publisher, and
> b) when those changes are replicated on the subscriber. Even on slow
> machines, this strategy will give predictable behavior.

Changed.

> 4.
> +my $delay = 3;
> +
> +# Set min_apply_delay parameter to 3 seconds
> +$node_subscriber->safe_psql('postgres',
> + "ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay =
> '${delay}s')");
> 
> IMO that "my $delay = 3;" assignment should be *after* the comment:
> 
> e.g.
> +
> +# Set min_apply_delay parameter to 3 seconds
> +my $delay = 3;
> +$node_subscriber->safe_psql('postgres',
> + "ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay =
> '${delay}s')");

Right, changed.

> 5.
> +# Make new content on publisher and check its presence in subscriber
> depending
> +# on the delay applied above. Before doing the insertion, get the
> +# current timestamp that will be used as a comparison base. Even on slow
> +# machines, this allows to have a predictable behavior when comparing the
> +# delay between data insertion moment on publisher and replay time on
> subscriber.
> 
> Most of this comment is now redundant because this was already
> explained in the big comment up-front (see #3). Only one useful
> sentence is left.
> 
> SUGGESTION
> Before doing the insertion, get the current timestamp that will be
> used as a comparison base.

Removed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v32-0001-Time-delayed-logical-replication-subscriber.patch
Description:  v32-0001-Time-delayed-logical-replication-subscriber.patch


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-07 Thread Peter Smith
Here are my review comments for v31-0001

==
doc/src/sgml/glossary.sgml

1.
+ 
+  Replication setup that applies time-delayed copy of the data.
+

That sentence seemed a bit strange to me.

SUGGESTION
Replication setup that delays the application of changes by a
specified minimum time-delay period.

==

src/backend/replication/logical/worker.c

2. maybe_apply_delay

+ if (wal_receiver_status_interval > 0 &&
+ diffms > wal_receiver_status_interval * 1000L)
+ {
+ WaitLatch(MyLatch,
+   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+   wal_receiver_status_interval * 1000L,
+   WAIT_EVENT_RECOVERY_APPLY_DELAY);
+ send_feedback(last_received, true, false, true);
+ }

I felt that introducing another variable like:

long statusinterval_ms = wal_receiver_status_interval * 1000L;

would help here by doing 2 things:
1) The condition would be easier to read because the ms units would be the same
2) Won't need * 1000L repeated in two places.

Only, do take care to assign this variable in the right place in this
loop in case the configuration is changed.

==
src/test/subscription/t/001_rep_changes.pl

3.
+# Test time-delayed logical replication
+#
+# If the subscription sets min_apply_delay parameter, the logical replication
+# worker will delay the transaction apply for min_apply_delay milliseconds. We
+# look the time duration between tuples are inserted on publisher and then
+# changes are replicated on subscriber.

This comment and the other one appearing later in this test are both
explaining the same test strategy. I think both comments should be
combined into one big one up-front, like this:

SUGGESTION
If the subscription sets min_apply_delay parameter, the logical
replication worker will delay the transaction apply for
min_apply_delay milliseconds. We verify this by looking at the time
difference between a) when tuples are inserted on the publisher, and
b) when those changes are replicated on the subscriber. Even on slow
machines, this strategy will give predictable behavior.

~~

4.
+my $delay = 3;
+
+# Set min_apply_delay parameter to 3 seconds
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");

IMO that "my $delay = 3;" assignment should be *after* the comment:

e.g.
+
+# Set min_apply_delay parameter to 3 seconds
+my $delay = 3;
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");

~~~

5.
+# Make new content on publisher and check its presence in subscriber depending
+# on the delay applied above. Before doing the insertion, get the
+# current timestamp that will be used as a comparison base. Even on slow
+# machines, this allows to have a predictable behavior when comparing the
+# delay between data insertion moment on publisher and replay time on
subscriber.

Most of this comment is now redundant because this was already
explained in the big comment up-front (see #3). Only one useful
sentence is left.

SUGGESTION
Before doing the insertion, get the current timestamp that will be
used as a comparison base.

--
Kind Regards,
Peter Smith.
Fujitsu Australia.




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-07 Thread Takamichi Osumi (Fujitsu)
Hi, Horiguchi-san


Thanks for your review !
On Tuesday, February 7, 2023 1:43 PM From: Kyotaro Horiguchi 
 wrote:
> At Mon, 6 Feb 2023 13:10:01 +, "Takamichi Osumi (Fujitsu)"
>  wrote in
> subscriptioncmds.c
> 
> + if (opts.streaming ==
> LOGICALREP_STREAM_PARALLEL &&
> + !IsSet(opts.specified_opts,
> SUBOPT_MIN_APPLY_DELAY) &&
> +sub->minapplydelay > 0)
> ..
> + if (opts.min_apply_delay > 0 &&
> + !IsSet(opts.specified_opts,
> SUBOPT_STREAMING) && sub->stream ==
> +LOGICALREP_STREAM_PARALLEL)
> 
> Don't we wrap the lines?
Yes, those lines should have looked nicer.
Updated. Kindly have a look at the latest patch v31 in [1].
There are also other some changes in the patch.

[1] - 
https://www.postgresql.org/message-id/TYCPR01MB8373BA483A6D2C924C600968EDDB9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi





RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-07 Thread Takamichi Osumi (Fujitsu)
Hi,


On Tuesday, February 7, 2023 2:26 PM Amit Kapila  
wrote:
> On Tue, Feb 7, 2023 at 10:13 AM Kyotaro Horiguchi 
> wrote:
> >
> > At Mon, 6 Feb 2023 13:10:01 +, "Takamichi Osumi (Fujitsu)"
> >  wrote in
> > > The attached patch v29 has included your changes.
> >
> > catalogs.sgml
> >
> > +  
> > +   The minimum delay (ms) for applying changes.
> > +  
> >
> > I think we don't use unit symbols that way. Namely I think we would
> > write it as "The minimum delay for applying changes in milliseconds"
> >
> 
> Okay, if we prefer to use milliseconds, then how about: "The minimum delay, in
> milliseconds, for applying changes"?
This looks good to me. Adopted.

> 
> >
> > alter_subscription.sgml
> >
> >are slot_name,
> >synchronous_commit,
> >binary, streaming,
> > -  disable_on_error, and
> > -  origin.
> > +  disable_on_error,
> > +  origin, and
> > +  min_apply_delay.
> >   
> >
> > By the way, is there any rule for the order among the words?
> >
> 
> Currently, it is in the order in which the corresponding features are added.
Yes. So, I keep it as it is.

> 
> > They
> > don't seem in alphabetical order nor in the same order to the
> > create_sbuscription page.
> >
> 
> In create_subscription page also, it appears to be in the order in which those
> are added with a difference that they are divided into two categories
> (parameters that control what happens during subscription creation and
> parameters that control the subscription's replication behavior after it has 
> been
> created)
Same as here. The current order should be fine.

> 
> >  (It seems to be in the order of the SUBOPT_* symbols, but I'm not sure
> > that's a good idea..)
> >
> >
> > subscriptioncmds.c
> >
> > +   if (opts.streaming ==
> LOGICALREP_STREAM_PARALLEL &&
> > +
> > + !IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY) &&
> > + sub->minapplydelay > 0)
> > ..
> > +   if (opts.min_apply_delay > 0 &&
> > +
> > + !IsSet(opts.specified_opts, SUBOPT_STREAMING) && sub->stream ==
> > + LOGICALREP_STREAM_PARALLEL)
> >
> > Don't we wrap the lines?
> >
> >
> > worker.c
> >
> > +   if (wal_receiver_status_interval > 0 &&
> > +   diffms > wal_receiver_status_interval * 1000L)
> > +   {
> > +   WaitLatch(MyLatch,
> > + WL_LATCH_SET |
> WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> > + wal_receiver_status_interval *
> 1000L,
> > +
> WAIT_EVENT_RECOVERY_APPLY_DELAY);
> > +   send_feedback(last_received, true, false, true);
> > +   }
> > +   else
> > +   WaitLatch(MyLatch,
> > + WL_LATCH_SET |
> WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> > + diffms,
> > +
> > + WAIT_EVENT_RECOVERY_APPLY_DELAY);
> >
> > send_feedback always handles the case where
> > wal_receiver_status_interval == 0.
> >
> 
> It only handles the case when force is false, but here we are using it as
> true. So, not sure if what you said would be an improvement.
Agreed. So, I keep it as it is.

> 
> >  thus we can simply wait for
> > min(wal_receiver_status_interval, diffms) then call send_feedback()
> > unconditionally.
> >
> >
> > -start_apply(XLogRecPtr origin_startpos)
> > +start_apply(void)
> >
> > -LogicalRepApplyLoop(XLogRecPtr last_received)
> > +LogicalRepApplyLoop(void)
> >
> > Does this patch requires this change?
> >
> 
> I think this is because the scope of last_received has been changed so that it
> can be passed to send_feedback() during the delay.
Yes, that's our intention.


Kindly have a look at the latest patch v31 shared in [1].

[1] - 
https://www.postgresql.org/message-id/TYCPR01MB8373BA483A6D2C924C600968EDDB9%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi





RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-07 Thread Takamichi Osumi (Fujitsu)
Hi,


On Tuesday, February 7, 2023 6:56 PM Amit Kapila  
wrote:
> On Tue, Feb 7, 2023 at 8:22 AM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Thank you for reviewing! PSA new version.
> >
> 
> Few comments:
> =
Thanks for your comments!

> 1.
> @@ -74,6 +74,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId)
> BKI_SHARED_RELATION BKI_ROW
> 
>   Oid subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */
> 
> + int32 subminapplydelay; /* Replication apply delay (ms) */
> +
>   bool subenabled; /* True if the subscription is enabled (the
>   * worker should be running) */
> 
> @@ -120,6 +122,7 @@ typedef struct Subscription
>   * in */
>   XLogRecPtr skiplsn; /* All changes finished at this LSN are
>   * skipped */
> + int32 minapplydelay; /* Replication apply delay (ms) */
>   char*name; /* Name of the subscription */
>   Oid owner; /* Oid of the subscription owner */
> 
> Why is the new parameter placed at different locations in the above two
> structures? I think it should be after owner in both cases, and accordingly its
> order should be changed in GetSubscription() or any other place it is used.
Fixed.


> 
> 2. A minor comment change suggestion:
>  /*
>   * Common spoolfile processing.
>   *
> - * The commit/prepare time (finish_ts) for streamed transactions is required
> - * for time-delayed logical replication.
> + * The commit/prepare time (finish_ts) is required for time-delayed
> + logical
> + * replication.
>   */
Fixed.

 
> 3. I find the newly added tests take about 8s on my machine, which is close
> to the highest in the subscription folder. I understand that it can't be less than 3s
> because of the delay but checking multiple cases makes it take that long. I
> think we can avoid the tests for streaming and disable the subscription. Also,
> after removing those, I think it would be better to add the remaining test in
> 001_rep_changes to save set-up and tear-down costs as well.
Sounds good to me. Moved the test to 001_rep_changes.pl.


> 4.
> +$node_publisher->append_conf('postgresql.conf',
> + 'logical_decoding_work_mem = 64kB');
> 
> I think this setting is also not required.
Yes. It was removed in the process of moving the test.

Attached the v31 patch.

Note that regarding the translator style, I chose to keep the parameters
outside the errmsg format string at this stage. If there is a need to
change it, then I'll follow it.

Other changes are minor alignment fixes that fold 'if' conditions
exceeding 80 characters to make them look nicer.

Also conducted pgindent and pgperltidy.


Best Regards,
Takamichi Osumi



v31-0001-Time-delayed-logical-replication-subscriber.patch
Description:  v31-0001-Time-delayed-logical-replication-subscriber.patch


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-07 Thread Amit Kapila
On Tue, Feb 7, 2023 at 8:22 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Thank you for reviewing! PSA new version.
>

Few comments:
=
1.
@@ -74,6 +74,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId)
BKI_SHARED_RELATION BKI_ROW

  Oid subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */

+ int32 subminapplydelay; /* Replication apply delay (ms) */
+
  bool subenabled; /* True if the subscription is enabled (the
  * worker should be running) */

@@ -120,6 +122,7 @@ typedef struct Subscription
  * in */
  XLogRecPtr skiplsn; /* All changes finished at this LSN are
  * skipped */
+ int32 minapplydelay; /* Replication apply delay (ms) */
  char*name; /* Name of the subscription */
  Oid owner; /* Oid of the subscription owner */

Why is the new parameter placed at different locations in the above two
structures? I think it should be after owner in both cases, and
accordingly its order should be changed in GetSubscription() or any
other place it is used.

2. A minor comment change suggestion:
 /*
  * Common spoolfile processing.
  *
- * The commit/prepare time (finish_ts) for streamed transactions is required
- * for time-delayed logical replication.
+ * The commit/prepare time (finish_ts) is required for time-delayed logical
+ * replication.
  */

3. I find the newly added tests take about 8s on my machine, which is
close to the highest in the subscription folder. I understand that it can't
be less than 3s because of the delay but checking multiple cases makes
it take that long. I think we can avoid the tests for streaming and
disable the subscription. Also, after removing those, I think it would
be better to add the remaining test in 001_rep_changes to save set-up
and tear-down costs as well.

4.
+$node_publisher->append_conf('postgresql.conf',
+ 'logical_decoding_work_mem = 64kB');

I think this setting is also not required.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-06 Thread Amit Kapila
On Tue, Feb 7, 2023 at 10:42 AM Peter Smith  wrote:
>
> On Tue, Feb 7, 2023 at 4:02 PM Amit Kapila  wrote:
> >
> > On Tue, Feb 7, 2023 at 10:07 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Tue, 7 Feb 2023 09:10:01 +0530, Amit Kapila  
> > > wrote in
> > > > On Tue, Feb 7, 2023 at 6:03 AM Peter Smith  
> > > > wrote:
> > > > > 5b.
> > > > > Since there are no translator considerations here why not write the
> > > > > second error like:
> > > > >
> > > > > errmsg("%d ms is outside the valid range for parameter
> > > > > \"min_apply_delay\" (%d .. %d)",
> > > > > result, 0, PG_INT32_MAX))
> > > > >
> > > >
> > > > I see that existing usage in the code matches what the patch had
> > > > before this comment. See below and similar usages in the code.
> > > > if (start <= 0)
> > > > ereport(ERROR,
> > > > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > > errmsg("invalid value for parameter \"%s\": %d",
> > > > "start", start)));
> > >
> > > The same errmsg text occurs many times in the tree. On the other hand,
> > > the message in question appears only once.  I suppose Peter considered this
> > > aspect.
> > >
> > > # "%d%s%s is outside the valid range for parameter \"%s\" (%d .. %d)"
> > > # also appears just once
> > >
> > > As for me, it seems a good practice to do that regardless of the
> > > number of duplicates, to (semi)mechanically avoid them.
> > >
> > > (But I believe I would do as Peter suggests by myself for the first
> > > cut, though:p)
> > >
> >
> > Personally, I would prefer consistency. I think we can later start a
> > new thread to change the existing message and if there is a consensus
> > and value in the same then we could use the same style here as well.
> >
>
> Of course, if there is a convention then we should stick to it.
>
> My understanding was that (string literal) message parameters are
> specified separately from the message format string primarily as an
> aid to translators. That makes good sense for parameters with names
> that are also English words (like "start" etc), but for non-word
> parameters like "min_apply_delay" there is no such ambiguity in the
> first place.
>

TBH, I am not an expert in this matter. So, to avoid making any
mistakes, I thought of keeping it close to the existing style.

-- 
With Regards,
Amit Kapila.



