Re: hot updates and fillfactor

2024-03-22 Thread Fabrice Chapuis
Thanks for your explanation and for the links


On Tue, Mar 19, 2024 at 11:17 AM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi Fabrice,
>
> > I do not understand why the hot_updates value is not 0 for pg_database.
> > Given that reloptions is empty for this table, that means it has a
> > default fillfactor of 100%.
>
> Maybe I didn't entirely understand your question, but why would you
> assume they are somehow related?
>
> According to the documentation [1][2]:
>
> pg_class.reloptions:
>   Access-method-specific options, as “keyword=value” strings
>
> pg_stat_all_tables.n_tup_hot_upd:
>   Number of rows HOT updated. These are updates where no successor
> versions are required in indexes.
>
> The value of n_tup_hot_upd is not zero because there are tuples that
> were HOT-updated. That's it. You can read more about HOT here [3].
>
> [1]: https://www.postgresql.org/docs/current/catalog-pg-class.html
> [2]: https://www.postgresql.org/docs/current/monitoring-stats.html
> [3]: https://www.postgresql.org/docs/current/storage-hot.html
>
> --
> Best regards,
> Aleksander Alekseev
>
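
A minimal sketch of the behaviour described above (the table name hot_demo
is hypothetical, and the counters in pg_stat_user_tables are updated
asynchronously): HOT updates can occur at the default fillfactor of 100 as
long as the page has free space and no indexed column is changed.

-- hypothetical demo table, default fillfactor (100)
CREATE TABLE hot_demo (id int PRIMARY KEY, payload text);
INSERT INTO hot_demo VALUES (1, 'a');
UPDATE hot_demo SET payload = 'b' WHERE id = 1;   -- touches no indexed column

SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'hot_demo';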


hot updates and fillfactor

2024-03-19 Thread Fabrice Chapuis
Hi,

I do not understand why the hot_updates value is not 0 for pg_database.
Given that reloptions is empty for this table, that means it has a default
fillfactor of 100%.

Regards,

Fabrice

 SELECT
relname AS table_name,
seq_scan AS sequential_scans,
idx_scan AS index_scans,
n_tup_ins AS inserts,
n_tup_upd AS updates,
n_tup_hot_upd AS hot_updates
FROM
pg_stat_all_tables
ORDER BY
hot_updates DESC;

 table_name  | sequential_scans | index_scans | inserts | updates | hot_updates
-------------+------------------+-------------+---------+---------+-------------
 pg_database |          5038104 |   187427486 |       3 |      16 |          16


postgres [1728332]=# select t.relname as table_name,
   t.reloptions
from pg_class t
  join pg_namespace n on n.oid = t.relnamespace
where t.relname in ('pg_database')
  and n.nspname = 'pg_catalog';

 table_name  | reloptions
-------------+------------
 pg_database |
(1 row)
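
For comparison, fillfactor only appears in reloptions once it has been set
explicitly; a hypothetical example on a user table (system catalogs such as
pg_database cannot be altered this way):

-- leave 20% of each page free to give HOT updates room to work
ALTER TABLE my_table SET (fillfactor = 80);
SELECT relname, reloptions FROM pg_class WHERE relname = 'my_table';
--  relname  |   reloptions
-- ----------+-----------------
--  my_table | {fillfactor=80}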


double precision type

2024-03-01 Thread Fabrice Chapuis
Hello,

postgres [1264904]=# select 123456789.123456789123456::double precision;
┌────────────────────┐
│       float8       │
├────────────────────┤
│ 123456789.12345679 │
└────────────────────┘
(1 row)

I do not understand why this number is truncated to 123456789.12345679,
which is 17 digits and not 15 digits.

Any idea?

Fabrice

The documentation says:
double precision: 8 bytes, variable-precision, inexact, 15 decimal digits
precision
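
The 17 digits come from the output rounding mode, not from extra stored
precision; a small sketch (the default extra_float_digits = 1, which prints
the shortest exact representation, applies to PostgreSQL 12 and later):

SET extra_float_digits = 0;                           -- classic 15-digit output
SELECT 123456789.123456789123456::double precision;  -- 123456789.123457
RESET extra_float_digits;                             -- shortest exact form again
SELECT 123456789.123456789123456::double precision;  -- 123456789.12345679
SELECT 123456789.123456789123456::numeric;            -- exact, if all digits matter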


pg_restore problem to load constraints with tables

2024-02-20 Thread Fabrice Chapuis
Hi,
When a table is reloaded with pg_restore, it is recreated without indexes
or constraints. They are automatically skipped. Is there a reason for this?

pg_restore -j 8 -v -d zof /shared/pgdump/aq/backup/dbtest/shtest --no-owner
--role=test -t mytable 2>&1 | tee -a dbest.log

pg_restore: skipping item 7727 SEQUENCE SET xxx_seq
pg_restore: skipping item 5110 INDEX x_cons
pg_restore: skipping item 5143 CONSTRAINT xxx
pg_restore: skipping item 5670 FK CONSTRAINT xxx

Thanks for your feedback

Fabrice
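
With -t, pg_restore selects only the named table's archive entry; its
indexes and constraints are separate items, hence the "skipping item"
messages. A sketch (schema and table names hypothetical) for extracting
their definitions from the source database so they can be recreated after
the table-only restore:

SELECT conname, pg_get_constraintdef(oid) AS constraint_ddl
FROM pg_constraint
WHERE conrelid = 'public.mytable'::regclass;

SELECT indexdef
FROM pg_indexes
WHERE schemaname = 'public' AND tablename = 'mytable';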


pg_restore option --clean

2024-02-19 Thread Fabrice Chapuis
Hi,
The --clean option of pg_restore drops an existing object before it is
imported again. However, dependencies such as foreign keys or views prevent
the object from being dropped. Is there a way to add the CASCADE option to
force the drop?
Thanks for helping
Fabrice
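
As far as I know pg_dump/pg_restore offer no CASCADE variant of --clean, so
the usual workaround is to drop the dependent objects manually before the
restore; a hypothetical example:

DROP TABLE IF EXISTS my_schema.my_table CASCADE;  -- also drops dependent views and FKs
-- then restore that object without --clean, e.g. pg_restore -d mydb -t my_table dump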


Re: pg_waldump

2023-12-19 Thread Fabrice Chapuis
OK, thanks for all these clarifications.
Regards
Fabrice

On Tue, Dec 19, 2023 at 2:00 PM Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:

> On Tue, 19 Dec 2023, 12:27 Fabrice Chapuis, 
> wrote:
> >
> > Hi,
> > Is it possible to visualize DDL with the pg_waldump tool? I created
> > a Postgres user but I cannot find the creation command in the WALs.
>
> Not really, no. PostgreSQL does not log DDL or DML as such in WAL.
> Essentially all catalog updates are logged only as changes on a
> certain page in some file: a new user getting inserted would be
> approximately "Insert tuple [user's pg_role row data] on page X in
> file [the file corresponding to the pg_role table]".
>
> You could likely derive most DDL commands from Heap/Insert,
> Heap/Delete, and Heap/Update records (after cross-referencing the
> database's relfilemap), as most DDL is "just" a lot of in-memory
> operations plus some record insertions/updates/deletes in catalog
> tables. You'd also need to keep track of any relfilemap changes while
> processing the WAL, as VACUUM FULL on the catalog tables would change
> the file numbering of catalog tables...
>
> Kind regards,
>
> Matthias van de Meent
> Neon (https://neon.tech)
>
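
To do the cross-referencing Matthias mentions, the file names reported by
pg_waldump can be matched against a catalog's current filenode; for example
(role data actually lives in the shared catalog pg_authid):

SELECT pg_relation_filenode('pg_authid')   AS authid_filenode,
       pg_relation_filepath('pg_authid')   AS authid_path,
       pg_relation_filenode('pg_database') AS database_filenode;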


pg_waldump

2023-12-19 Thread Fabrice Chapuis
Hi,
Is it possible to visualize DDL with the pg_waldump tool? I created a
Postgres user but I cannot find the creation command in the WALs.

Thanks for help

Fabrice


Re: wal recycling problem

2023-10-17 Thread Fabrice Chapuis
Thanks for your feedback.

> How would you know which part of WAL is needed for any specific
> replication slot?

Changes are captured for each published table and written twice: once in
the current WAL and once in the slot-specific WAL.

> How would you handle multiple replications for the same table

Additional information about which publication a table belongs to is
recorded in the slot-specific WAL.

> be it logical or physical replication, retains WAL up to
> max_slot_wal_keep_size

OK, but if max_slot_wal_keep_size is exceeded the changes are lost and all
of the replicated tables must be resynchronized.

Regards

Fabrice

On Sun, Oct 8, 2023 at 3:57 PM Christoph Moench-Tegeder 
wrote:

> ## Fabrice Chapuis (fabrice636...@gmail.com):
>
> > From a conceptual point of view I think that specific wals per
> subscription
> > should be used and stored in the pg_replslot folder in order to avoid
> > working directly on the wals of the instance.
> > What do you think about this proposal?
>
> I think that would open a wholly new can of worms.
> The most obvious point here is: that WAL is primarily generated for
> the operation of the database itself - it's our kind of transaction
> log, or "Redo Log" in other systems' lingo. Replication (be it physical
> or logical) is a secondary purpose (an obvious and important one, but
> still secondary).
> How would you know which part of WAL is needed for any specific
> replication slot? You'd have to decode and filter it, and already
> you're back at square one. How would you handle multiple replications
> for the same table (in the same publication, or even over multiple
> (overlapping) publications) - do you multiply the WAL?
>
> For now, we have "any replication using replication slots, be it logical
> or physical replication, retains WAL up to max_slot_wal_keep_size
> (or "unlimited" if not set - and on PostgreSQL 12 and before); and you
> need to monitor the state of your replication slots", which is a
> totally usable rule, I think.
>
> Regards,
> Christoph
>
> --
> Spare Space
>
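
A monitoring sketch along the lines Christoph describes (the wal_status and
safe_wal_size columns exist since PostgreSQL 13; the 100GB value is only an
example):

SELECT slot_name, slot_type, active, restart_lsn, wal_status, safe_wal_size
FROM pg_replication_slots;

-- cap how much WAL any slot may retain, then reload the configuration
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';
SELECT pg_reload_conf();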


Re: wal recycling problem

2023-10-06 Thread Fabrice Chapuis
Thanks Christoph for your message.
Now I understand why the WALs are preserved when logical replication is
configured and enabled. The problem is that when a large volume of data is
loaded into a database, for example during a pg_restore, the walsender
process associated with the logical replication slot has to decode all of
the WAL generated during this operation, which takes a long time, and the
restart_lsn is not advanced.
From a conceptual point of view I think that specific WALs per subscription
should be used and stored in the pg_replslot folder in order to avoid
working directly on the WALs of the instance.

What do you think about this proposal?

Regards

Fabrice


On Mon, Oct 2, 2023 at 12:06 PM Christoph Moench-Tegeder 
wrote:

> Hi,
>
> ## Fabrice Chapuis (fabrice636...@gmail.com):
>
> > on the other hand there are 2 slots for logical replication which display
> > status extended. I don't understand why given that the
> confirmed_flush_lsn
> > field that is up to date. The restart_lsn remains frozen, for what
> reason?
>
> There you have it - "extended" means "holding wal". And as long as the
> restart_lsn does not advance, checkpointer cannot free any wal beyond
> that lsn. My first idea would be some long-running (or huge) transaction
> which is in process (active or still being streamed). I'd recommend
> looking into what the clients on these slots are doing.
>
> Regards,
> Christoph
>
> --
> Spare Space
>


Re: wal recycling problem

2023-09-29 Thread Fabrice Chapuis
Yes, barman replication can keep up with the primary; the WAL volume stays
under max_wal_size (24 GB in our configuration).

Here is the pg_replication_slots view:

 barman_ge  physical  f  t  39409  1EE2/4900  reserved  f
 barman_be  physical  f  t  39434  1EE2/3D00  reserved  f

On the other hand, there are 2 slots for logical replication which display
the status "extended". I don't understand why, given that the
confirmed_flush_lsn field is up to date. The restart_lsn remains frozen,
for what reason?

 pgoutput │ logical │ 2667915 │ db019a00 │ f │ t │ 1880162 │ │ 68512101 │ 1ECA/37C3F1B8 │ 1EE2/4D6BDCF8 │ extended │ │ f │
 pgoutput │ logical │ 2668584 │ db038a00 │ f │ t │  363230 │ │ 68512101 │ 1ECA/37C3F1B8 │ 1EE2/4D6BDCF8 │ extended │ │ f │
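
One way to quantify what an "extended" slot is holding back is to compare
its restart_lsn with the current WAL position (assuming this is run on the
primary):

SELECT slot_name,
       restart_lsn,
       confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical';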

Regards
Fabrice

On Thu, Sep 28, 2023 at 7:59 PM Christoph Moench-Tegeder 
wrote:

> ## Fabrice Chapuis (fabrice636...@gmail.com):
>
> > We have a cluster of 2 members (1 primary and 1 standby) with Postgres
> > version 14.9 and 2 barman server, slots are only configured for barman,
> > barman is version 3.7.
>
> The obvious question here is: can both of those barmans keep up with
> your database, or are you seeing WAL retention due to exactly these
> replication slots? (Check pg_replication_slots).
>
> Regards,
> Christoph
>
> --
> Spare Space
>


wal recycling problem

2023-09-28 Thread Fabrice Chapuis
Hello,

I have a question about the automatic removal of unused WAL files. When
loading data with pg_restore (200 GB) we noticed that a lot of WAL files
are generated and they are neither purged nor recycled automatically,
despite frequent checkpoints; the pg_wal folder (150 GB) then fills up and
runs out of space.
We have a cluster of 2 members (1 primary and 1 standby) with Postgres
version 14.9 and 2 barman servers; slots are only configured for barman,
and barman is version 3.7.
The archive command is deactivated (archive_command=':').
I use pg_archivecleanup (passing the WAL file generated by the last
checkpoint as parameter) to remove files manually before the 150 GB limit
is reached, so that the restore can finish.

Why does Postgres not do this cleanup automatically, and which part of the
code is responsible for removing or recycling the WALs?

Thanks for your help

Fabrice


drop table in transaction

2023-05-09 Thread Fabrice Chapuis
Where in the code is the mechanism implemented that handles isolation when
DROP TABLE is executed in a transaction?

Thanks for your help

Fabrice


Re: Logical replication timeout problem

2022-11-04 Thread Fabrice Chapuis
Hello Wang,
I tested the draft patch in my lab on Postgres 14.4; the refresh of the
materialized view ran without triggering the timeout on the worker.
Do you plan to propose this patch at the next commitfest?

Regards,
Fabrice
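
The workaround discussed in this thread is the "streaming" subscription
option added in PostgreSQL 14; the subscription name below is taken from
the log excerpt and is only illustrative:

ALTER SUBSCRIPTION subxxx_sxxxa00 SET (streaming = on);
-- the setting is visible afterwards in pg_subscription
SELECT subname, substream FROM pg_subscription;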

On Wed, Oct 19, 2022 at 10:15 AM wangw.f...@fujitsu.com <
wangw.f...@fujitsu.com> wrote:

> On Tue, Oct 18, 2022 at 22:35 PM Fabrice Chapuis 
> wrote:
> > Hello Amit,
> >
> > In version 14.4 the timeout problem for logical replication happens
> again despite
> > the patch provided for this issue in this version. When bulky
> materialized views
> > are reloaded it broke logical replication. It is possible to solve this
> problem by
> > using your new "streaming" option.
> > Have you ever had this issue reported to you?
> >
> > Regards
> >
> > Fabrice
> >
> > 2022-10-10 17:19:02 CEST [538424]: [17-1]
> > user=postgres,db=dbxxxa00,client=[local] CONTEXT:  SQL statement "REFRESH
> > MATERIALIZED VIEW sxxxa00.table_base"
> > PL/pgSQL function refresh_materialized_view(text) line 5 at
> EXECUTE
> > 2022-10-10 17:19:02 CEST [538424]: [18-1]
> > user=postgres,db=dbxxxa00,client=[local] STATEMENT:  select
> > refresh_materialized_view('sxxxa00.table_base');
> > 2022-10-10 17:19:02 CEST [538424]: [19-1]
> > user=postgres,db=dbxxxa00,client=[local] LOG:  duration: 264815.652
> > ms  statement: select refresh_materialized_view('sxxxa00.table_base');
> > 2022-10-10 17:19:27 CEST [559156]: [1-1] user=,db=,client= LOG:
> automatic
> > vacuum of table "dbxxxa00.sxxxa00.table_base": index scans: 0
> > pages: 0 removed, 296589 remain, 0 skipped due to pins, 0
> skipped frozen
> > tuples: 0 removed, 48472622 remain, 0 are dead but not yet
> removable,
> > oldest xmin: 1501528
> > index scan not needed: 0 pages from table (0.00% of total) had 0
> dead item
> > identifiers removed
> > I/O timings: read: 1.494 ms, write: 0.000 ms
> > avg read rate: 0.028 MB/s, avg write rate: 107.952 MB/s
> > buffer usage: 593301 hits, 77 misses, 294605 dirtied
> > WAL usage: 296644 records, 46119 full page images, 173652718
> bytes
> > system usage: CPU: user: 17.26 s, system: 0.29 s, elapsed: 21.32
> s
> > 2022-10-10 17:19:28 CEST [559156]: [2-1] user=,db=,client= LOG:
> automatic
> > analyze of table "dbxxxa00.sxxxa00.table_base"
> > I/O timings: read: 0.043 ms, write: 0.000 ms
> > avg read rate: 0.026 MB/s, avg write rate: 0.026 MB/s
> > buffer usage: 30308 hits, 2 misses, 2 dirtied
> > system usage: CPU: user: 0.54 s, system: 0.00 s, elapsed: 0.59 s
> > 2022-10-10 17:19:34 CEST [3898111]: [6840-1] user=,db=,client= LOG:
> checkpoint
> > complete: wrote 1194 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0
> recycled;
> > write=269.551 s, sync=0.002 s, total=269.560 s; sync files=251,
> longest=0.00
> > 1 s, average=0.001 s; distance=583790 kB, estimate=583790 kB
> > 2022-10-10 17:20:02 CEST [716163]: [2-1] user=,db=,client= ERROR:
> terminating
> > logical replication worker due to timeout
> > 2022-10-10 17:20:02 CEST [3897921]: [13-1] user=,db=,client= LOG:
> background
> > worker "logical replication worker" (PID 716163) exited with exit code 1
> > 2022-10-10 17:20:02 CEST [561346]: [1-1] user=,db=,client= LOG:  logical
> > replication apply worker for subscription "subxxx_sxxxa00" has started
>
> Thanks for reporting!
>
> There is one thing I want to confirm:
> Is the statement `select refresh_materialized_view('sxxxa00.table_base');`
> executed on the publisher-side?
>
> If so, I think the reason for this timeout problem could be that during DDL
> (`REFRESH MATERIALIZED VIEW`), lots of temporary data is generated due to
> rewrite. Since these temporary data will not be processed by the pgoutput
> plugin, our previous fix for DML had no impact on this case.
> I think setting "streaming" option to "on" could work around this problem.
>
> I tried to write a draft patch (see attachment) on REL_14_4 to fix this.
> I tried it locally and it seems to work.
> Could you please confirm whether this problem is fixed after applying this
> draft patch?
>
> If this draft patch works, I will improve it and try to fix this problem.
>
> Regards,
> Wang wei
>


Re: Logical replication timeout problem

2022-10-19 Thread Fabrice Chapuis
Yes, the refresh of the MV is on the publisher side.
Thanks for your draft patch, I'll try it.
I'll get back to you as soon as possible.

One question: why is the refresh of the MV a DDL and not a DML?

Regards

Fabrice

On Wed, 19 Oct 2022, 10:15 wangw.f...@fujitsu.com 
wrote:

> On Tue, Oct 18, 2022 at 22:35 PM Fabrice Chapuis 
> wrote:
> > Hello Amit,
> >
> > In version 14.4 the timeout problem for logical replication happens
> again despite
> > the patch provided for this issue in this version. When bulky
> materialized views
> > are reloaded it broke logical replication. It is possible to solve this
> problem by
> > using your new "streaming" option.
> > Have you ever had this issue reported to you?
> >
> > Regards
> >
> > Fabrice
> >
> > 2022-10-10 17:19:02 CEST [538424]: [17-1]
> > user=postgres,db=dbxxxa00,client=[local] CONTEXT:  SQL statement "REFRESH
> > MATERIALIZED VIEW sxxxa00.table_base"
> > PL/pgSQL function refresh_materialized_view(text) line 5 at
> EXECUTE
> > 2022-10-10 17:19:02 CEST [538424]: [18-1]
> > user=postgres,db=dbxxxa00,client=[local] STATEMENT:  select
> > refresh_materialized_view('sxxxa00.table_base');
> > 2022-10-10 17:19:02 CEST [538424]: [19-1]
> > user=postgres,db=dbxxxa00,client=[local] LOG:  duration: 264815.652
> > ms  statement: select refresh_materialized_view('sxxxa00.table_base');
> > 2022-10-10 17:19:27 CEST [559156]: [1-1] user=,db=,client= LOG:
> automatic
> > vacuum of table "dbxxxa00.sxxxa00.table_base": index scans: 0
> > pages: 0 removed, 296589 remain, 0 skipped due to pins, 0
> skipped frozen
> > tuples: 0 removed, 48472622 remain, 0 are dead but not yet
> removable,
> > oldest xmin: 1501528
> > index scan not needed: 0 pages from table (0.00% of total) had 0
> dead item
> > identifiers removed
> > I/O timings: read: 1.494 ms, write: 0.000 ms
> > avg read rate: 0.028 MB/s, avg write rate: 107.952 MB/s
> > buffer usage: 593301 hits, 77 misses, 294605 dirtied
> > WAL usage: 296644 records, 46119 full page images, 173652718
> bytes
> > system usage: CPU: user: 17.26 s, system: 0.29 s, elapsed: 21.32
> s
> > 2022-10-10 17:19:28 CEST [559156]: [2-1] user=,db=,client= LOG:
> automatic
> > analyze of table "dbxxxa00.sxxxa00.table_base"
> > I/O timings: read: 0.043 ms, write: 0.000 ms
> > avg read rate: 0.026 MB/s, avg write rate: 0.026 MB/s
> > buffer usage: 30308 hits, 2 misses, 2 dirtied
> > system usage: CPU: user: 0.54 s, system: 0.00 s, elapsed: 0.59 s
> > 2022-10-10 17:19:34 CEST [3898111]: [6840-1] user=,db=,client= LOG:
> checkpoint
> > complete: wrote 1194 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0
> recycled;
> > write=269.551 s, sync=0.002 s, total=269.560 s; sync files=251,
> longest=0.00
> > 1 s, average=0.001 s; distance=583790 kB, estimate=583790 kB
> > 2022-10-10 17:20:02 CEST [716163]: [2-1] user=,db=,client= ERROR:
> terminating
> > logical replication worker due to timeout
> > 2022-10-10 17:20:02 CEST [3897921]: [13-1] user=,db=,client= LOG:
> background
> > worker "logical replication worker" (PID 716163) exited with exit code 1
> > 2022-10-10 17:20:02 CEST [561346]: [1-1] user=,db=,client= LOG:  logical
> > replication apply worker for subscription "subxxx_sxxxa00" has started
>
> Thanks for reporting!
>
> There is one thing I want to confirm:
> Is the statement `select refresh_materialized_view('sxxxa00.table_base');`
> executed on the publisher-side?
>
> If so, I think the reason for this timeout problem could be that during DDL
> (`REFRESH MATERIALIZED VIEW`), lots of temporary data is generated due to
> rewrite. Since these temporary data will not be processed by the pgoutput
> plugin, our previous fix for DML had no impact on this case.
> I think setting "streaming" option to "on" could work around this problem.
>
> I tried to write a draft patch (see attachment) on REL_14_4 to fix this.
> I tried it locally and it seems to work.
> Could you please confirm whether this problem is fixed after applying this
> draft patch?
>
> If this draft patch works, I will improve it and try to fix this problem.
>
> Regards,
> Wang wei
>


Re: Logical replication timeout problem

2022-10-18 Thread Fabrice Chapuis
Hello Amit,

In version 14.4 the timeout problem for logical replication happens again,
despite the patch provided for this issue in that version. When bulky
materialized views are refreshed, it breaks logical replication. It is
possible to work around this problem by using your new "streaming" option.
Has this issue ever been reported to you?

Regards

Fabrice

2022-10-10 17:19:02 CEST [538424]: [17-1]
user=postgres,db=dbxxxa00,client=[local] CONTEXT:  SQL statement "REFRESH
MATERIALIZED VIEW sxxxa00.table_base"
PL/pgSQL function refresh_materialized_view(text) line 5 at EXECUTE
2022-10-10 17:19:02 CEST [538424]: [18-1]
user=postgres,db=dbxxxa00,client=[local] STATEMENT:  select
refresh_materialized_view('sxxxa00.table_base');
2022-10-10 17:19:02 CEST [538424]: [19-1]
user=postgres,db=dbxxxa00,client=[local] LOG:  duration: 264815.652 ms
statement: select refresh_materialized_view('sxxxa00.table_base');
2022-10-10 17:19:27 CEST [559156]: [1-1] user=,db=,client= LOG:  automatic
vacuum of table "dbxxxa00.sxxxa00.table_base": index scans: 0
pages: 0 removed, 296589 remain, 0 skipped due to pins, 0 skipped
frozen
tuples: 0 removed, 48472622 remain, 0 are dead but not yet
removable, oldest xmin: 1501528
index scan not needed: 0 pages from table (0.00% of total) had 0
dead item identifiers removed
I/O timings: read: 1.494 ms, write: 0.000 ms
avg read rate: 0.028 MB/s, avg write rate: 107.952 MB/s
buffer usage: 593301 hits, 77 misses, 294605 dirtied
WAL usage: 296644 records, 46119 full page images, 173652718 bytes
system usage: CPU: user: 17.26 s, system: 0.29 s, elapsed: 21.32 s
2022-10-10 17:19:28 CEST [559156]: [2-1] user=,db=,client= LOG:  automatic
analyze of table "dbxxxa00.sxxxa00.table_base"
I/O timings: read: 0.043 ms, write: 0.000 ms
avg read rate: 0.026 MB/s, avg write rate: 0.026 MB/s
buffer usage: 30308 hits, 2 misses, 2 dirtied
system usage: CPU: user: 0.54 s, system: 0.00 s, elapsed: 0.59 s
2022-10-10 17:19:34 CEST [3898111]: [6840-1] user=,db=,client= LOG:
checkpoint complete: wrote 1194 buffers (0.0%); 0 WAL file(s) added, 0
removed, 0 recycled; write=269.551 s, sync=0.002 s, total=269.560 s; sync
files=251, longest=0.00
1 s, average=0.001 s; distance=583790 kB, estimate=583790 kB
2022-10-10 17:20:02 CEST [716163]: [2-1] user=,db=,client= ERROR:
terminating logical replication worker due to timeout
2022-10-10 17:20:02 CEST [3897921]: [13-1] user=,db=,client= LOG:
background worker "logical replication worker" (PID 716163) exited with
exit code 1
2022-10-10 17:20:02 CEST [561346]: [1-1] user=,db=,client= LOG:  logical
replication apply worker for subscription "subxxx_sxxxa00" has started

On Fri, Apr 1, 2022 at 6:09 AM Amit Kapila  wrote:

> On Fri, Apr 1, 2022 at 8:28 AM Euler Taveira  wrote:
> >
> > On Thu, Mar 31, 2022, at 11:27 PM, Amit Kapila wrote:
> >
> > This is exactly our initial analysis and we have tried a patch on
> > these lines and it has a noticeable overhead. See [1]. Calling this
> > for each change or each skipped change can bring noticeable overhead
> > that is why we decided to call it after a certain threshold (100) of
> > skipped changes. Now, surely as mentioned in my previous reply we can
> > make it generic such that instead of calling this (update_progress
> > function as in the patch) for skipped cases, we call it always. Will
> > that make it better?
> >
> > That's what I have in mind but using a different approach.
> >
> > > The functions CreateInitDecodingContext and CreateDecodingContext
> receives the
> > > update_progress function as a parameter. These functions are called in
> 2
> > > places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT)
> and (b)
> > > SQL logical decoding functions (pg_logical_*_changes). Case (a) uses
> > > WalSndUpdateProgress as a progress function. Case (b) does not have
> one because
> > > it is not required -- local decoding/communication. There is no custom
> update
> > > progress routine for each output plugin which leads me to the question:
> > > couldn't we encapsulate the update progress call into the callback
> functions?
> > >
> >
> > Sorry, I don't get your point. What exactly do you mean by this?
> > AFAIS, currently we call this output plugin API in pgoutput functions
> > only, do you intend to get it invoked from a different place?
> >
> > It seems I didn't make myself clear. The callbacks I'm referring to the
> > *_cb_wrapper functions. After every ctx->callbacks.foo_cb() call into a
> > *_cb_wrapper() function, we have something like:
> >
> > if (ctx->progress & PGOUTPUT_PROGRESS_FOO)
> > NewUpdateProgress(ctx, false);
> >
> > The NewUpdateProgress function would contain a logic similar to the
> > update_progress() from the proposed patch. (A different function name
> here just
> > to avoid confusion.)
> >
> > The output plugin is responsible to set ctx->progress with the callback
> > variables (for example, 

Re: Logical replication timeout problem

2022-02-09 Thread Fabrice Chapuis
Thanks for your patch, it works well in my test lab.
I added the declaration *extern int wal_sender_timeout;* in the
*output_plugin.h* file so that the compilation works.
I tested the patch for version 10, which is currently in production on our
systems.
The functions below exist only in the master branch:
pgoutput_prepare_txn,
pgoutput_commit_prepared_txn,
pgoutput_rollback_prepared_txn,
pgoutput_stream_commit,
pgoutput_stream_prepare_txn

Will the patch be backported to versions 13, 12, 11, and 10?

Best regards,

Fabrice

On Tue, Feb 8, 2022 at 3:59 AM wangw.f...@fujitsu.com <
wangw.f...@fujitsu.com> wrote:

> On Wed, Jan 26, 2022 at 11:37 AM I wrote:
> > On Sat, Jan 22, 2022 at 7:12 PM Amit Kapila 
> wrote:
> > > Now, one idea to solve this problem could be that whenever we skip
> > > sending any change we do try to update the plugin progress via
> > > OutputPluginUpdateProgress(for walsender, it will invoke
> > > WalSndUpdateProgress), and there it tries to process replies and send
> > > keep_alive if necessary as we do when we send some data via
> > > OutputPluginWrite(for walsender, it will invoke WalSndWriteData). I
> > > don't know whether it is a good idea to invoke such a mechanism for
> > > every change we skip to send or we should do it after we skip sending
> > > some threshold of continuous changes. I think later would be
> > > preferred. Also, we might want to introduce a new parameter
> > > send_keep_alive to this API so that there is flexibility to invoke
> > > this mechanism as we don't need to invoke it while we are actually
> > > sending data and before that, we just update the progress via this
> > > API.
> > ..
> > Based on above, I think the second idea that sending some threshold of
> > continuous changes might be better, I will do some research about this
> > approach.
> Based on the second idea, I wrote a new patch(see attachment).
>
> Regards,
> Wang wei
>


Re: Logical replication timeout problem

2022-01-28 Thread Fabrice Chapuis
Thanks for your new fix Wang.

TimestampTz ping_time = TimestampTzPlusMilliseconds(sendTime,
wal_sender_timeout / 2);

Shouldn't we use wal_receiver_timeout in place of wal_sender_timeout,
because the problem comes from the consumer?

On Wed, Jan 26, 2022 at 4:37 AM wangw.f...@fujitsu.com <
wangw.f...@fujitsu.com> wrote:

> On Thu, Jan 22, 2022 at 7:12 PM Amit Kapila 
> wrote:
> > Now, one idea to solve this problem could be that whenever we skip
> > sending any change we do try to update the plugin progress via
> > OutputPluginUpdateProgress(for walsender, it will invoke
> > WalSndUpdateProgress), and there it tries to process replies and send
> > keep_alive if necessary as we do when we send some data via
> > OutputPluginWrite(for walsender, it will invoke WalSndWriteData). I
> > don't know whether it is a good idea to invoke such a mechanism for
> > every change we skip to send or we should do it after we skip sending
> > some threshold of continuous changes. I think later would be
> > preferred. Also, we might want to introduce a new parameter
> > send_keep_alive to this API so that there is flexibility to invoke
> > this mechanism as we don't need to invoke it while we are actually
> > sending data and before that, we just update the progress via this
> > API.
>
> I tried out the patch according to your advice.
> I found if I invoke ProcessRepliesIfAny and WalSndKeepaliveIfNecessary in
> function OutputPluginUpdateProgress, the running time of the newly added
> function OutputPluginUpdateProgress invoked in pgoutput_change brings
> notable
> overhead:
> --11.34%--pgoutput_change
>   |
>   |--8.94%--OutputPluginUpdateProgress
>   |  |
>   |   --8.70%--WalSndUpdateProgress
>   | |
>   | |--7.44%--ProcessRepliesIfAny
>
> So I tried another way of sending keepalive message to the standby machine
> based on the timeout without asking for a reply(see attachment), the
> running
> time of the newly added function OutputPluginUpdateProgress invoked in
> pgoutput_change also brings slight overhead:
> --3.63%--pgoutput_change
>   |
>   |--1.40%--get_rel_sync_entry
>   |  |
>   |   --1.14%--hash_search
>   |
>--1.08%--OutputPluginUpdateProgress
>  |
>   --0.85%--WalSndUpdateProgress
>
> Based on above, I think the second idea that sending some threshold of
> continuous changes might be better, I will do some research about this
> approach.
>
> Regards,
> Wang wei
>


Re: Logical replication timeout problem

2022-01-21 Thread Fabrice Chapuis
I kept your patch 0001 and added these two calls in the function
WalSndUpdateProgress, without modifying WalSndKeepaliveIfNecessary; it
works too.
What do you think of this patch?

static void
WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn,
                     TransactionId xid)
{
    static TimestampTz sendTime = 0;
    TimestampTz now = GetCurrentTimestamp();

    ProcessRepliesIfAny();
    WalSndKeepaliveIfNecessary();

    /*
     * Track lag no more than once per WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS
     * to avoid flooding the lag tracker when we commit frequently.
     */
...
Regards

Fabrice

On Fri, Jan 21, 2022 at 2:17 PM Fabrice Chapuis 
wrote:

> Thanks for your patch, it also works well when executing our use case, the
> timeout no longer appears in the logs. Is it necessary now to refine this
> patch and make as few changes as possible in order for it to be released?
>
> On Fri, Jan 21, 2022 at 10:51 AM wangw.f...@fujitsu.com <
> wangw.f...@fujitsu.com> wrote:
>
>> On Thu, Jan 20, 2022 at 9:18 PM Amit Kapila 
>> wrote:
>> > It might be not reaching the actual send_keep_alive logic in
>> > WalSndKeepaliveIfNecessary because of below code:
>> > {
>> > ...
>> > /*
>> > * Don't send keepalive messages if timeouts are globally disabled or
>> > * we're doing something not partaking in timeouts.
>> > */
>> > if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0) return; ..
>> > }
>> >
>> > I think you can add elog before the above return and before updating
>> progress
>> > in the below code:
>> > case REORDER_BUFFER_CHANGE_INSERT:
>> >   if (!relentry->pubactions.pubinsert)
>> > + {
>> > + OutputPluginUpdateProgress(ctx);
>> >   return;
>> >
>> > This will help us to rule out one possibility.
>>
>> Thanks for your advices!
>>
>> According to your advices, I applied 0001,0002 and 0003 to run the test
>> script.
>> When subscriber timeout, I filter publisher-side log:
>> $ grep "before invoking update progress" pub.log | wc -l
>> 60373557
>> $ grep "return because wal_sender_timeout or last_reply_timestamp"
>> pub.log | wc -l
>> 0
>> $ grep "return because waiting_for_ping_response" pub.log | wc -l
>> 0
>>
>> Based on this result, I think function WalSndKeepaliveIfNecessary was
>> invoked,
>> but function WalSndKeepalive was not invoked because (last_processing >=
>> ping_time) is false.
>> So I tried to see changes about last_processing and last_reply_timestamp
>> (because ping_time is based on last_reply_timestamp).
>>
>> I found last_processing and last_reply_timestamp is set in function
>> ProcessRepliesIfAny.
>> last_processing is set to the time when function ProcessRepliesIfAny is
>> invoked.
>> Only when publisher receive a response from subscriber,
>> last_reply_timestamp is
>> set to last_processing and the flag waiting_for_ping_response is reset to
>> false.
>>
>> When we are during the loop to skip all the changes of transaction, IIUC,
>> we do
>> not invoke function ProcessRepliesIfAny. So I think last_processing and
>> last_reply_timestamp will not be changed in this loop.
>> Therefore I think about our use case, we should modify the condition of
>> invoking WalSndKeepalive.(please refer to
>> 0004-Simple-modification-of-timing.patch, and note that this is only a
>> patch
>> for testing).
>> At the same time I modify the input of WalSndKeepalive from true to
>> false. This
>> is because when input is true, waiting_for_ping_response is set to true in
>> WalSndKeepalive. As mentioned above, ProcessRepliesIfAny is not invoked
>> in the
>> loop, so I think waiting_for_ping_response will not be reset to false and
>> keepalive messages will not be sent.
>>
>> I tested after applying patches(0001 and 0004), I found the timeout was
>> not
>> printed in subscriber-side log. And the added messages "begin load
>> changes" and
>> "commit the log" were printed in publisher-side log:
>> $ grep -ir "begin load changes" pub.log
>> 2022-01-21 11:17:06.934 CST [2577699] LOG:  begin load changes
>> $ grep -ir "commit the log" pub.log
>> 2022-01-21 11:21:15.564 CST [2577699] LOG:  commit the log
>>
>> Attach the patches and test script mentioned above, in case someone wants
>> to
>> try.
>>
>> Regards,
>> Wang wei
>>
>


Re: Logical replication timeout problem

2022-01-21 Thread Fabrice Chapuis
Thanks for your patch, it also works well when executing our use case, the
timeout no longer appears in the logs. Is it necessary now to refine this
patch and make as few changes as possible in order for it to be released?

On Fri, Jan 21, 2022 at 10:51 AM wangw.f...@fujitsu.com <
wangw.f...@fujitsu.com> wrote:

> On Thu, Jan 20, 2022 at 9:18 PM Amit Kapila 
> wrote:
> > It might be not reaching the actual send_keep_alive logic in
> > WalSndKeepaliveIfNecessary because of below code:
> > {
> > ...
> > /*
> > * Don't send keepalive messages if timeouts are globally disabled or
> > * we're doing something not partaking in timeouts.
> > */
> > if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0) return; ..
> > }
> >
> > I think you can add elog before the above return and before updating
> progress
> > in the below code:
> > case REORDER_BUFFER_CHANGE_INSERT:
> >   if (!relentry->pubactions.pubinsert)
> > + {
> > + OutputPluginUpdateProgress(ctx);
> >   return;
> >
> > This will help us to rule out one possibility.
>
> Thanks for your advices!
>
> According to your advices, I applied 0001,0002 and 0003 to run the test
> script.
> When subscriber timeout, I filter publisher-side log:
> $ grep "before invoking update progress" pub.log | wc -l
> 60373557
> $ grep "return because wal_sender_timeout or last_reply_timestamp" pub.log
> | wc -l
> 0
> $ grep "return because waiting_for_ping_response" pub.log | wc -l
> 0
>
> Based on this result, I think function WalSndKeepaliveIfNecessary was
> invoked,
> but function WalSndKeepalive was not invoked because (last_processing >=
> ping_time) is false.
> So I tried to see changes about last_processing and last_reply_timestamp
> (because ping_time is based on last_reply_timestamp).
>
> I found last_processing and last_reply_timestamp is set in function
> ProcessRepliesIfAny.
> last_processing is set to the time when function ProcessRepliesIfAny is
> invoked.
> Only when publisher receive a response from subscriber,
> last_reply_timestamp is
> set to last_processing and the flag waiting_for_ping_response is reset to
> false.
>
> When we are during the loop to skip all the changes of transaction, IIUC,
> we do
> not invoke function ProcessRepliesIfAny. So I think last_processing and
> last_reply_timestamp will not be changed in this loop.
> Therefore I think about our use case, we should modify the condition of
> invoking WalSndKeepalive.(please refer to
> 0004-Simple-modification-of-timing.patch, and note that this is only a
> patch
> for testing).
> At the same time I modify the input of WalSndKeepalive from true to false.
> This
> is because when input is true, waiting_for_ping_response is set to true in
> WalSndKeepalive. As mentioned above, ProcessRepliesIfAny is not invoked in
> the
> loop, so I think waiting_for_ping_response will not be reset to false and
> keepalive messages will not be sent.
>
> I tested after applying patches(0001 and 0004), I found the timeout was not
> printed in subscriber-side log. And the added messages "begin load
> changes" and
> "commit the log" were printed in publisher-side log:
> $ grep -ir "begin load changes" pub.log
> 2022-01-21 11:17:06.934 CST [2577699] LOG:  begin load changes
> $ grep -ir "commit the log" pub.log
> 2022-01-21 11:21:15.564 CST [2577699] LOG:  commit the log
>
> Attach the patches and test script mentioned above, in case someone wants
> to
> try.
>
> Regards,
> Wang wei
>


Re: Logical replication timeout problem

2022-01-19 Thread Fabrice Chapuis
Hello Amit,

If it takes little work for you, can you please send me a piece of code
with the change needed to do the test?

Thanks

Regards,

Fabrice


On Fri, Jan 14, 2022 at 1:03 PM Amit Kapila  wrote:

> On Fri, Jan 14, 2022 at 3:47 PM Fabrice Chapuis 
> wrote:
> >
> > If I can follow you, I have to make the following changes:
> >
>
> No, not like that but we can try that way as well to see if that helps
> to avoid your problem. Am, I understanding correctly even after
> modification, you are seeing the problem. Can you try by calling
> WalSndKeepaliveIfNecessary() instead of WalSndKeepalive()?
>
> --
> With Regards,
> Amit Kapila.
>


Re: Logical replication timeout problem

2022-01-14 Thread Fabrice Chapuis
If it takes little work for you, can you please send me a piece of code
with the change needed to do the test?

Thanks

Regards,

Fabrice

On Fri, Jan 14, 2022 at 1:03 PM Amit Kapila  wrote:

> On Fri, Jan 14, 2022 at 3:47 PM Fabrice Chapuis 
> wrote:
> >
> > If I can follow you, I have to make the following changes:
> >
>
> No, not like that but we can try that way as well to see if that helps
> to avoid your problem. Am, I understanding correctly even after
> modification, you are seeing the problem. Can you try by calling
> WalSndKeepaliveIfNecessary() instead of WalSndKeepalive()?
>
> --
> With Regards,
> Amit Kapila.
>


Re: Logical replication timeout problem

2022-01-14 Thread Fabrice Chapuis
If I follow you correctly, I have to make the following changes:

1. In walsender.c:

static void
WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn,
                     TransactionId xid)
{
    static TimestampTz sendTime = 0;
    TimestampTz now = GetCurrentTimestamp();

    /* Keep the worker process alive */
    WalSndKeepalive(true);

    /*
     * Track lag no more than once per WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS to
     * avoid flooding the lag tracker when we commit frequently.
     */
#define WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS 1000
    if (!TimestampDifferenceExceeds(sendTime, now,
                                    WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS))
        return;

    LagTrackerWrite(lsn, now);
    sendTime = now;
}

I set the *requestReply* parameter to true, is that correct?

2. In pgoutput.c

/*
 * Sends the decoded DML over wire.
 *
 * This is called both in streaming and non-streaming modes.
 */
static void
pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                Relation relation, ReorderBufferChange *change)
{
    PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
    MemoryContext old;
    RelationSyncEntry *relentry;
    TransactionId xid = InvalidTransactionId;
    Relation ancestor = NULL;

    WalSndUpdateProgress(ctx, txn->origin_lsn, change->txn->xid);

    if (!is_publishable_relation(relation))
        return;
...

Make a call to *WalSndUpdateProgress* in function *pgoutput_change.*

For info: here is the log output after reproducing the problem.

2022-01-13 11:19:46.340 CET [82233] LOCATION:  WalSndKeepaliveIfNecessary,
walsender.c:3389
2022-01-13 11:19:46.340 CET [82233] STATEMENT:  START_REPLICATION SLOT
"sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s012a00"')
2022-01-13 11:19:46.340 CET [82233] LOG:  0: attempt to send keep alive
message
2022-01-13 11:19:46.340 CET [82233] LOCATION:  WalSndKeepaliveIfNecessary,
walsender.c:3389
2022-01-13 11:19:46.340 CET [82233] STATEMENT:  START_REPLICATION SLOT
"sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s012a00"')
2022-01-13 11:19:46.340 CET [82233] LOG:  0: attempt to send keep alive
message
2022-01-13 11:19:46.340 CET [82233] LOCATION:  WalSndKeepaliveIfNecessary,
walsender.c:3389
2022-01-13 11:19:46.340 CET [82233] STATEMENT:  START_REPLICATION SLOT
"sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s012a00"')
2022-01-13 11:20:46.418 CET [82232] ERROR:  XX000: terminating logical
replication worker due to timeout
2022-01-13 11:20:46.418 CET [82232] LOCATION:  LogicalRepApplyLoop,
worker.c:1267
2022-01-13 11:20:46.421 CET [82224] LOG:  0: worker process: logical
replication worker for subscription 26994 (PID 82232) exited with exit code
1
2022-01-13 11:20:46.421 CET [82224] LOCATION:  LogChildExit,
postmaster.c:3625

Thanks a lot for your help.

Fabrice

On Thu, Jan 13, 2022 at 2:59 PM Amit Kapila  wrote:

> On Thu, Jan 13, 2022 at 3:43 PM Fabrice Chapuis 
> wrote:
> >
> > first phase: postgres read WAL files and generate 1420 snap files.
> > second phase: I guess, but on this point maybe you can clarify, postgres
> has to decode the snap files and remove them if no statement must be
> applied on a replicated table.
> > It is from this point that the worker process exit after 1 minute
> timeout.
> >
>
> Okay, I think the problem could be that because we are skipping all
> the changes of transaction there is no communication sent to the
> subscriber and it eventually timed out. Actually, we try to send
> keep-alive at transaction boundaries like when we call
> pgoutput_commit_txn. The pgoutput_commit_txn will call
> OutputPluginWrite->WalSndWriteData. I think to tackle the problem we
> need to try to send such keepalives via WalSndUpdateProgress and
> invoke that in pgoutput_change when we skip sending the change.
>
> --
> With Regards,
> Amit Kapila.
>


Re: Logical replication timeout problem

2022-01-13 Thread Fabrice Chapuis
First phase: Postgres reads the WAL files and generates 1420 snap files.
Second phase (I am guessing here, maybe you can clarify this point):
Postgres has to decode the snap files and remove them if no statement must
be applied on a replicated table.
It is from this point that the worker process exits after a 1-minute
timeout.

On Wed, Jan 12, 2022 at 11:54 AM Amit Kapila 
wrote:

> On Tue, Jan 11, 2022 at 8:13 PM Fabrice Chapuis 
> wrote:
>
>> Can you explain why you think this will help in solving your current
>> problem?
>>
>> Indeed your are right this function won't help, we have to look elsewhere.
>>
>> It is still not clear to me why the problem happened? IIUC, after
>> restoring 4096 changes from snap files, we send them to the subscriber, and
>> then apply worker should apply those one by one. Now, is it taking one
>> minute to restore 4096 changes due to which apply worker is timed out?
>>
>> Now I can easily reproduce the problem.
>> In a first phase, snap files are generated and stored in pg_replslot.
>> This process end when1420 files are present in pg_replslots (this is in
>> relation with statements that must be replayed from WAL). In the
>> pg_stat_replication view, the state field is set to *catchup*.
>> In a 2nd phase, the snap files must be decoded. However after one minute
>> (wal_receiver_timeout parameter set to 1 minute) the worker process stop
>> with a timeout.
>>
>>
> What exactly do you mean by the first and second phase in the above
> description?
>
> --
> With Regards,
> Amit Kapila.
>


Re: Logical replication timeout problem

2022-01-11 Thread Fabrice Chapuis
> Can you explain why you think this will help in solving your current
> problem?

Indeed you are right, this function won't help, we have to look elsewhere.

> It is still not clear to me why the problem happened? IIUC, after restoring
> 4096 changes from snap files, we send them to the subscriber, and then
> apply worker should apply those one by one. Now, is it taking one minute to
> restore 4096 changes due to which apply worker is timed out?

Now I can easily reproduce the problem.
In a first phase, snap files are generated and stored in pg_replslot. This
process ends when 1420 files are present in pg_replslot (this is related to
the statements that must be replayed from the WAL). In the
pg_stat_replication view, the state field is set to *catchup*.
In a second phase, the snap files must be decoded. However, after one
minute (wal_receiver_timeout parameter set to 1 minute) the worker process
stops with a timeout.

I can put a debug point to check if a timeout is sent to the worker
process. Do you have any other clue?
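
A small sketch for watching both phases from SQL while reproducing this
(the first query runs on the publisher, the SHOW on the subscriber); a
sent_lsn that stops advancing while state stays at 'catchup' would match
the decoding phase described above:

SELECT application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;

SHOW wal_receiver_timeout;   -- set to 1min in this setup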

Thank you for your help

Fabrice




On Fri, Jan 7, 2022 at 11:26 AM Amit Kapila  wrote:

> On Wed, Dec 29, 2021 at 5:02 PM Fabrice Chapuis 
> wrote:
>
>> I put the instance with high level debug mode.
>> I try to do some log interpretation: After having finished writing the
>> modifications generated by the insert in the snap files,
>> then these files are read (restored). One minute after this work starts,
>> the worker process exit with an error code = 1.
>> I see that keepalive messages were sent before the work process work
>> leave.
>>
>> 2021-12-28 10:50:01.894 CET [55792] LOCATION:  WalSndKeepalive,
>> walsender.c:3365
>> ...
>> 2021-12-28 10:50:31.854 CET [55792] STATEMENT:  START_REPLICATION SLOT
>> "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
>> '"pub008_s012a00"')
>> 2021-12-28 10:50:31.907 CET [55792] DEBUG:  0: StartTransaction(1)
>> name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
>> 2021-12-28 10:50:31.907 CET [55792] LOCATION:  ShowTransactionStateRec,
>> xact.c:5075
>> 2021-12-28 10:50:31.907 CET [55792] STATEMENT:  START_REPLICATION SLOT
>> "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
>> '"pub008_s012a00"')
>> 2021-12-28 10:50:31.907 CET [55792] DEBUG:  0: spill 2271 changes in
>> XID 14312 to disk
>> 2021-12-28 10:50:31.907 CET [55792] LOCATION:  ReorderBufferSerializeTXN,
>> reorderbuffer.c:2245
>> 2021-12-28 10:50:31.907 CET [55792] STATEMENT:  START_REPLICATION SLOT
>> "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
>> '"pub008_s012a00"')
>> *2021-12-28 10:50:32.110 CET [55792] DEBUG:  0: restored
>> 4096/22603999 changes from disk*
>> 2021-12-28 10:50:32.110 CET [55792] LOCATION:  ReorderBufferIterTXNNext,
>> reorderbuffer.c:1156
>> 2021-12-28 10:50:32.110 CET [55792] STATEMENT:  START_REPLICATION SLOT
>> "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
>> '"pub008_s012a00"')
>> 2021-12-28 10:50:32.138 CET [55792] DEBUG:  0: restored 4096/22603999
>> changes from disk
>> ...
>>
>> *2021-12-28 10:50:35.341 CET [55794] DEBUG:  0: sending replication
>> keepalive2021-12-28 10:50:35.341 CET [55794] LOCATION:  WalSndKeepalive,
>> walsender.c:3365*
>> ...
>> *2021-12-28 10:51:31.995 CET [55791] ERROR:  XX000: terminating logical
>> replication worker due to timeout*
>>
>> *2021-12-28 10:51:31.995 CET [55791] LOCATION:  LogicalRepApplyLoop,
>> worker.c:1267*
>>
>>
> It is still not clear to me why the problem happened? IIUC, after
> restoring 4096 changes from snap files, we send them to the subscriber, and
> then apply worker should apply those one by one. Now, is it taking one
> minute to restore 4096 changes due to which apply worker is timed out?
>
> Could this function in* Apply main loop* in worker.c help to find a
>> solution?
>>
>> rc = WaitLatchOrSocket(MyLatch,
>> WL_SOCKET_READABLE | WL_LATCH_SET |
>> WL_TIMEOUT | WL_POSTMASTER_DEATH,
>> fd, wait_time,
>> WAIT_EVENT_LOGICAL_APPLY_MAIN);
>>
>
> Can you explain why you think this will help in solving your current
> problem?
>
> --
> With Regards,
> Amit Kapila.
>


Re: Logical replication timeout problem

2021-12-29 Thread Fabrice Chapuis
I put the instance in high-level debug mode.
I will try some log interpretation: after the modifications generated by
the insert have been written to the snap files, these files are read
(restored). One minute after this work starts, the worker process exits
with an error code = 1.
I see that keepalive messages were sent before the worker process left.

2021-12-28 10:50:01.894 CET [55792] LOCATION:  WalSndKeepalive,
walsender.c:3365
...
2021-12-28 10:50:31.854 CET [55792] STATEMENT:  START_REPLICATION SLOT
"sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG:  0: StartTransaction(1)
name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2021-12-28 10:50:31.907 CET [55792] LOCATION:  ShowTransactionStateRec,
xact.c:5075
2021-12-28 10:50:31.907 CET [55792] STATEMENT:  START_REPLICATION SLOT
"sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG:  0: spill 2271 changes in
XID 14312 to disk
2021-12-28 10:50:31.907 CET [55792] LOCATION:  ReorderBufferSerializeTXN,
reorderbuffer.c:2245
2021-12-28 10:50:31.907 CET [55792] STATEMENT:  START_REPLICATION SLOT
"sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s012a00"')
*2021-12-28 10:50:32.110 CET [55792] DEBUG:  0: restored 4096/22603999
changes from disk*
2021-12-28 10:50:32.110 CET [55792] LOCATION:  ReorderBufferIterTXNNext,
reorderbuffer.c:1156
2021-12-28 10:50:32.110 CET [55792] STATEMENT:  START_REPLICATION SLOT
"sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s012a00"')
2021-12-28 10:50:32.138 CET [55792] DEBUG:  0: restored 4096/22603999
changes from disk
...

*2021-12-28 10:50:35.341 CET [55794] DEBUG:  0: sending replication
keepalive2021-12-28 10:50:35.341 CET [55794] LOCATION:  WalSndKeepalive,
walsender.c:3365*
...
*2021-12-28 10:51:31.995 CET [55791] ERROR:  XX000: terminating logical
replication worker due to timeout*

*2021-12-28 10:51:31.995 CET [55791] LOCATION:  LogicalRepApplyLoop,
worker.c:1267*

Could this function in* Apply main loop* in worker.c help to find a
solution?

rc = WaitLatchOrSocket(MyLatch,
WL_SOCKET_READABLE | WL_LATCH_SET |
WL_TIMEOUT | WL_POSTMASTER_DEATH,
fd, wait_time,
WAIT_EVENT_LOGICAL_APPLY_MAIN);
Thanks for your help

Fabrice

On Thu, Dec 23, 2021 at 11:52 AM Amit Kapila 
wrote:

> On Wed, Dec 22, 2021 at 8:50 PM Fabrice Chapuis 
> wrote:
> >
> > Hello Amit,
> >
> > I was able to reproduce the timeout problem in the lab.
> > After loading more than 20 millions of rows in a table which is not
> replicated (insert command ends without error), errors related to logical
> replication processes appear in the postgres log.
> > Approximately every 5 minutes worker process is restarted. The snap
> files in the slot directory are still present. The replication system seems
> to be blocked. Why these snap files are not removed. What do they contain?
> >
>
> These contain changes of insert. I think these are not removed for
> your case as your long transaction is never finished. As mentioned
> earlier, for such cases, it is better to use 'streaming' feature
> released as part of PG-14 but anyway here we are trying to debug
> timeout problem.
>
> > I will recompile postgres with your patch to debug.
> >
>
> Okay, that might help.
>
> --
> With Regards,
> Amit Kapila.
>


Re: Logical replication timeout problem

2021-12-22 Thread Fabrice Chapuis
Hello Amit,

I was able to reproduce the timeout problem in the lab.
After loading more than 20 million rows into a table which is not
replicated (the insert command ends without error), errors related to the
logical replication processes appear in the Postgres log.
Approximately every 5 minutes the worker process is restarted. The snap
files in the slot directory are still present. The replication system seems
to be blocked. Why are these snap files not removed? What do they contain?
I will recompile Postgres with your patch to debug.

2021-12-22 14:54:21.506 CET [64939] STATEMENT:  START_REPLICATION SLOT
"sub008_s00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s00"')
2021-12-22 15:01:20.908 CET [64938] ERROR:  terminating logical replication
worker due to timeout
2021-12-22 15:01:20.911 CET [61827] LOG:  worker process: logical
replication worker for subscription 26994 (PID 64938) exited with exit code
1
2021-12-22 15:01:20.923 CET [65037] LOG:  logical replication apply worker
for subscription "sub008_s00" has started
2021-12-22 15:01:20.932 CET [65038] ERROR:  replication slot
"sub008_s00" is active for PID 64939
2021-12-22 15:01:20.932 CET [65038] STATEMENT:  START_REPLICATION SLOT
"sub008_s00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s00"')
2021-12-22 15:01:20.932 CET [65037] ERROR:  could not start WAL streaming:
ERROR:  replication slot "sub008_s00" is active for PID 64939
2021-12-22 15:01:20.933 CET [61827] LOG:  worker process: logical
replication worker for subscription 26994 (PID 65037) exited with exit code
1
2021-12-22 15:01:25.944 CET [65039] LOG:  logical replication apply worker
for subscription "sub008_s00" has started
2021-12-22 15:01:25.951 CET [65040] ERROR:  replication slot
"sub008_s00" is active for PID 64939
2021-12-22 15:01:25.951 CET [65040] STATEMENT:  START_REPLICATION SLOT
"sub008_s00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s00"')
2021-12-22 15:01:25.951 CET [65039] ERROR:  could not start WAL streaming:
ERROR:  replication slot "sub008_s00" is active for PID 64939
2021-12-22 15:01:25.952 CET [61827] LOG:  worker process: logical
replication worker for subscription 26994 (PID 65039) exited with exit code
1
2021-12-22 15:01:30.962 CET [65041] LOG:  logical replication apply worker
for subscription "sub008_s00" has started
2021-12-22 15:01:30.970 CET [65042] ERROR:  replication slot
"sub008_s00" is active for PID 64939
2021-12-22 15:01:30.970 CET [65042] STATEMENT:  START_REPLICATION SLOT
"sub008_s00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s00"')
2021-12-22 15:01:30.970 CET [65041] ERROR:  could not start WAL streaming:
ERROR:  replication slot "sub008_s00" is active for PID 64939
2021-12-22 15:01:30.971 CET [61827] LOG:  worker process: logical
replication worker for subscription 26994 (PID 65041) exited with exit code
1
2021-12-22 15:01:35.982 CET [65043] LOG:  logical replication apply worker
for subscription "sub008_s00" has started
2021-12-22 15:01:35.990 CET [65044] ERROR:  replication slot
"sub008_s00" is active for PID 64939
2021-12-22 15:01:35.990 CET [65044] STATEMENT:  START_REPLICATION SLOT
"sub008_s00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s00"')

-rw---. 1 postgres postgres 16270723 Dec 22 16:02
xid-14312-lsn-23-9900.snap
-rw---. 1 postgres postgres 16145717 Dec 22 16:02
xid-14312-lsn-23-9A000000.snap
-rw---. 1 postgres postgres 10889437 Dec 22 16:02
xid-14312-lsn-23-9B00.snap
[postgres@s729058a debug]$ ls -ltr pg_replslot/sub008_s012a00/ | wc -l
1420

On Fri, Nov 12, 2021 at 5:57 PM Fabrice Chapuis 
wrote:

> I made a mistake in the configuration of my test script, in fact I cannot
> reproduce the problem at the moment.
> Yes, on the original environment there is physical replication, that's why
> for the lab I configured 2 nodes with physical replication.
> I'll try new tests next week
> Regards
>
> On Fri, Nov 12, 2021 at 7:23 AM Amit Kapila 
> wrote:
>
>> On Thu, Nov 11, 2021 at 11:15 PM Fabrice Chapuis
>>  wrote:
>> >
>> > Hello,
>> > Our lab is ready now. Amit,  I compile Postgres 10.18 with your
>> patch.Tang, I used your script to configure logical replication between 2
>> databases and to generate 10 million entries in an unreplicated foo table.
>> On a standalone instance no error message appears in log.
>> > I activate the physical replication between 2 nodes, and I got
>> following error:
>> >
>> > 2021-11-10 10:49:12.297 CET [12126] LOG:  attempt to send keep alive
>> message
>> >

Re: Logical replication timeout problem

2021-11-12 Thread Fabrice Chapuis
I made a mistake in the configuration of my test script; in fact I cannot
reproduce the problem at the moment.
Yes, in the original environment there is physical replication; that's why
for the lab I configured 2 nodes with physical replication.
I'll try new tests next week.
Regards

On Fri, Nov 12, 2021 at 7:23 AM Amit Kapila  wrote:

> On Thu, Nov 11, 2021 at 11:15 PM Fabrice Chapuis
>  wrote:
> >
> > Hello,
> > Our lab is ready now. Amit,  I compile Postgres 10.18 with your
> patch.Tang, I used your script to configure logical replication between 2
> databases and to generate 10 million entries in an unreplicated foo table.
> On a standalone instance no error message appears in log.
> > I activate the physical replication between 2 nodes, and I got following
> error:
> >
> > 2021-11-10 10:49:12.297 CET [12126] LOG:  attempt to send keep alive
> message
> > 2021-11-10 10:49:12.297 CET [12126] STATEMENT:  START_REPLICATION
> 0/300 TIMELINE 1
> > 2021-11-10 10:49:15.127 CET [12064] FATAL:  terminating logical
> replication worker due to administrator command
> > 2021-11-10 10:49:15.127 CET [12036] LOG:  worker process: logical
> replication worker for subscription 16413 (PID 12064) exited with exit code
> 1
> > 2021-11-10 10:49:15.155 CET [12126] LOG:  attempt to send keep alive
> message
> >
> > This message look like strange because no admin command have been
> executed during data load.
> > I did not find any error related to the timeout.
> > The message coming from the modification made with the patch comes back
> all the time: attempt to send keep alive message. But there is no "sent
> keep alive message".
> >
> > Why logical replication worker exit when physical replication is
> configured?
> >
>
> I am also not sure why that happened may be due to
> max_worker_processes reaching its limit. This can happen because it
> seems you configured both publisher and subscriber in the same
> cluster. Tang, did you also see the same problem?
>
> BTW, why are you bringing physical standby configuration into the
> test? Does in your original setup where you observe the problem the
> physical standbys were there?
>
> --
> With Regards,
> Amit Kapila.
>
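
Regarding the max_worker_processes question above, a quick way to compare
the limit with the number of replication workers actually running is a
query like the following (a sketch using pg_stat_activity.backend_type,
available from PG 10):

SHOW max_worker_processes;

-- background workers currently used by logical replication on this instance
SELECT backend_type, count(*)
FROM pg_stat_activity
WHERE backend_type IN ('logical replication worker',
                       'logical replication launcher')
GROUP BY backend_type;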


Re: Logical replication timeout problem

2021-11-11 Thread Fabrice Chapuis
Hello,
Our lab is ready now. Amit, I compiled Postgres 10.18 with your patch.
Tang, I used your script to configure logical replication between two
databases and to generate 10 million entries in an unreplicated foo table.
On a standalone instance no error message appears in the log.
I then activated physical replication between two nodes and got the
following error:

2021-11-10 10:49:12.297 CET [12126] LOG:  attempt to send keep alive message
2021-11-10 10:49:12.297 CET [12126] STATEMENT:  START_REPLICATION 0/300
TIMELINE 1
2021-11-10 10:49:15.127 CET [12064] FATAL:  terminating logical replication
worker due to administrator command
2021-11-10 10:49:15.127 CET [12036] LOG:  worker process: logical
replication worker for subscription 16413 (PID 12064) exited with exit code
1
2021-11-10 10:49:15.155 CET [12126] LOG:  attempt to send keep alive message

This message looks strange because no admin command was executed during the
data load.
I did not find any error related to the timeout.
The message added by the patch comes back all the time: "attempt to send
keep alive message". But there is never a "sent keep alive message".

Why does the logical replication worker exit when physical replication is
configured?

Thanks for your help

Fabrice



On Fri, Oct 8, 2021 at 9:33 AM Fabrice Chapuis 
wrote:

> Thanks Tang for your script.
> Our debugging environment will be ready soon. I will test your script and
> we will try to reproduce the problem by integrating the patch provided by
> Amit. As soon as I have results I will let you know.
>
> Regards
>
> Fabrice
>
> On Thu, Sep 30, 2021 at 3:15 AM Tang, Haiying/唐 海英 <
> tanghy.f...@fujitsu.com> wrote:
>
>> On Friday, September 24, 2021 12:04 AM, Fabrice Chapuis <
>> fabrice636...@gmail.com> wrote:
>>
>> >
>>
>> > Thanks for your patch, we are going to set up a lab in order to debug
>> the function.
>>
>>
>>
>> Hi
>>
>>
>>
>> I tried to reproduce this timeout problem on version10.18 but failed.
>>
>> In my trial, I inserted large amounts of data at publisher, which took
>> more than 1 minute to replicate.
>>
>> And with the patch provided by Amit, I saw that the frequency of invoking
>>
>> WalSndKeepaliveIfNecessary function is raised after I inserted data.
>>
>>
>>
>> The test script is attached. Maybe you can try it on your machine and
>> check if this problem could happen.
>>
>> If I miss something in the script, please let me know.
>>
>> Of course, it will be better if you can provide your script to reproduce
>> the problem.
>>
>>
>>
>> Regards
>>
>> Tang
>>
>>


Re: Logical replication timeout problem

2021-10-08 Thread Fabrice Chapuis
Thanks Tang for your script.
Our debugging environment will be ready soon. I will test your script and
we will try to reproduce the problem by integrating the patch provided by
Amit. As soon as I have results I will let you know.

Regards

Fabrice

On Thu, Sep 30, 2021 at 3:15 AM Tang, Haiying/唐 海英 
wrote:

> On Friday, September 24, 2021 12:04 AM, Fabrice Chapuis <
> fabrice636...@gmail.com> wrote:
>
> >
>
> > Thanks for your patch, we are going to set up a lab in order to debug
> the function.
>
>
>
> Hi
>
>
>
> I tried to reproduce this timeout problem on version10.18 but failed.
>
> In my trial, I inserted large amounts of data at publisher, which took
> more than 1 minute to replicate.
>
> And with the patch provided by Amit, I saw that the frequency of invoking
>
> WalSndKeepaliveIfNecessary function is raised after I inserted data.
>
>
>
> The test script is attached. Maybe you can try it on your machine and
> check if this problem could happen.
>
> If I miss something in the script, please let me know.
>
> Of course, it will be better if you can provide your script to reproduce
> the problem.
>
>
>
> Regards
>
> Tang
>
>


Re: Logical replication timeout problem

2021-09-23 Thread Fabrice Chapuis
Thanks for your patch; we are going to set up a lab in order to debug the
function.
Regards
Fabrice

On Thu, Sep 23, 2021 at 3:50 PM Amit Kapila  wrote:

> On Wed, Sep 22, 2021 at 9:46 PM Fabrice Chapuis 
> wrote:
> >
> > If you would like I can test the patch you send to me.
> >
>
> Okay, please find an attached patch for additional logs. I would like
> to see the logs during the time when walsender appears to be writing
> to files. We might need to add more logs to find the exact problem but
> let's start with this.
>
> --
> With Regards,
> Amit Kapila.
>


Re: Logical replication timeout problem

2021-09-22 Thread Fabrice Chapuis
If you would like, I can test the patch you send me.

Regards

Fabrice

On Wed, Sep 22, 2021 at 11:02 AM Amit Kapila 
wrote:

> On Tue, Sep 21, 2021 at 9:12 PM Fabrice Chapuis 
> wrote:
> >
> > > IIUC, these are called after processing each WAL record so not
> > sure how is it possible in your case that these are not reached?
> >
> > I don't know, as you say, to highlight the problem we would have to
> debug the WalSndKeepaliveIfNecessary function
> >
> > > I was curious to know if the walsender has exited before walreceiver
> >
> > During the last tests we made we didn't observe any timeout of the wal
> sender process.
> >
> > > Do you mean you are planning to change from 1 minute to 5 minutes?
> >
> > We set wal_sender_timeout/wal_receiver_timeout to 5' and launch new
> test. The result is surprising and rather positive there is no timeout any
> more in the log and the 20Gb of snap files are removed in less than 5
> minutes.
> > How to explain that behaviour, why the snap files are consumed suddenly
> so quickly.
> >
>
> I think it is because we decide that the data in those snap files
> doesn't need to be sent at xact end, so we remove them.
>
> > I choose the value arbitrarily for
> wal_sender_timeout/wal_receiver_timeout parameters, are theses values
> appropriate from your point of view?
> >
>
> It is difficult to say what is the appropriate value for these
> parameters unless in some way we debug WalSndKeepaliveIfNecessary() to
> find why it didn't send keep alive when it is expected. Would you be
> able to make code changes and test or if you want I can make changes
> and send the patch if you can test it? If not, is it possible that in
> some way you send a reproducible test?
>
> --
> With Regards,
> Amit Kapila.
>


Re: Logical replication timeout problem

2021-09-21 Thread Fabrice Chapuis
> IIUC, these are called after processing each WAL record so not
sure how is it possible in your case that these are not reached?

I don't know; as you say, to pinpoint the problem we would have to debug
the WalSndKeepaliveIfNecessary function.

> I was curious to know if the walsender has exited before walreceiver

During our last tests we did not observe any timeout of the walsender
process.

> Do you mean you are planning to change from 1 minute to 5 minutes?

We set wal_sender_timeout/wal_receiver_timeout to 5 minutes and launched a
new test. The result is surprising and rather positive: there is no timeout
any more in the log, and the 20 GB of snap files are removed in less than 5
minutes.
How can that behaviour be explained? Why are the snap files suddenly
consumed so quickly?
I chose the values for the wal_sender_timeout/wal_receiver_timeout
parameters arbitrarily; are these values appropriate from your point of
view?

Best Regards

Fabrice



On Tue, Sep 21, 2021 at 11:52 AM Amit Kapila 
wrote:

> On Tue, Sep 21, 2021 at 1:52 PM Fabrice Chapuis 
> wrote:
> >
> > If I understand, the instruction to send keep alive by the wal sender
> has not been reached in the for loop, for what reason?
> > ...
> > * Check for replication timeout. */
> >   WalSndCheckTimeOut();
> >
> > /* Send keepalive if the time has come */
> >   WalSndKeepaliveIfNecessary();
> > ...
> >
>
> Are you sure that these functions have not been called? Or the case is
> that these are called but due to some reason the keep-alive is not
> sent? IIUC, these are called after processing each WAL record so not
> sure how is it possible in your case that these are not reached?
>
> > The data load is performed on a table which is not replicated, I do not
> understand why the whole transaction linked to an insert is copied to snap
> files given that table does not take part of the logical replication.
> >
>
> It is because we don't know till the end of the transaction (where we
> start sending the data) whether the table will be replicated or not. I
> think specifically for this purpose the new 'streaming' feature
> introduced in PG-14 will help us to avoid writing data of such tables
> to snap/spill files. See 'streaming' option in Create Subscription
> docs [1].
>
> > We are going to do a test by modifying parameters
> wal_sender_timeout/wal_receiver_timeout from 1' to 5'. The problem is that
> these parameters are global and changing them will also impact the physical
> replication.
> >
>
> Do you mean you are planning to change from 1 minute to 5 minutes? I
> agree with the global nature of parameters and I think your approach
> to finding out the root cause is good here because otherwise, under
> some similar or more heavy workload, it might lead to the same
> situation.
>
> > Concerning the walsender timeout, when the worker is started again after
> a timeout, it will trigger a new walsender associated with it.
> >
>
> Right, I know that but I was curious to know if the walsender has
> exited before walreceiver.
>
> [1] - https://www.postgresql.org/docs/devel/sql-createsubscription.html
>
> --
> With Regards,
> Amit Kapila.
>
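
For reference, the 'streaming' option mentioned above (available from PG 14)
is set per subscription; a minimal sketch with placeholder names and a
placeholder connection string:

-- stream large in-progress transactions to the subscriber once
-- logical_decoding_work_mem is exceeded, instead of spilling them to disk
CREATE SUBSCRIPTION sub_stream
    CONNECTION 'host=publisher_host dbname=db user=repuser'
    PUBLICATION pub_stream
    WITH (streaming = on);

-- or enable it on an existing subscription
ALTER SUBSCRIPTION sub_stream SET (streaming = on);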


Re: Logical replication timeout problem

2021-09-21 Thread Fabrice Chapuis
If I understand correctly, the instructions that make the walsender send a
keepalive are not reached in the for loop. For what reason?
...
/* Check for replication timeout. */
  WalSndCheckTimeOut();

/* Send keepalive if the time has come */
  WalSndKeepaliveIfNecessary();
...

The data load is performed on a table which is not replicated; I do not
understand why the whole transaction linked to the insert is copied to snap
files, given that the table is not part of the logical replication.
We are going to run a test after changing the
wal_sender_timeout/wal_receiver_timeout parameters from 1 minute to 5
minutes. The problem is that these parameters are global, so changing them
will also impact physical replication (a sketch of the change is shown
below).
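
A minimal sketch of that change, assuming superuser access (both settings
take effect after a reload, no restart needed):

-- raise both timeouts from the default 1 minute to 5 minutes, cluster-wide
ALTER SYSTEM SET wal_sender_timeout = '5min';
ALTER SYSTEM SET wal_receiver_timeout = '5min';
SELECT pg_reload_conf();

-- confirm the values currently in effect
SHOW wal_sender_timeout;
SHOW wal_receiver_timeout;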

Concerning the walsender timeout: when the worker is started again after a
timeout, it triggers a new walsender associated with it.

postgres 55680 12546  0 Sep20 ?00:00:02 postgres: aq: bgworker:
logical replication worker for subscription 24651602
postgres 55681 12546  0 Sep20 ?00:00:00 postgres: aq: wal sender
process repuser 127.0.0.1(57930) idle

Kind Regards

Fabrice

On Tue, Sep 21, 2021 at 8:38 AM Amit Kapila  wrote:

> On Mon, Sep 20, 2021 at 9:43 PM Fabrice Chapuis 
> wrote:
> >
> > By passing the autovacuum parameter to off the problem did not occur
> right after loading the table as in our previous tests. However, the
> timeout occurred later. We have seen the accumulation of .snap files for
> several Gb.
> >
> > ...
> > -rw---. 1 postgres postgres 16791226 Sep 20 15:26
> xid-1238444701-lsn-2D2B-F500.snap
> > -rw---. 1 postgres postgres 16973268 Sep 20 15:26
> xid-1238444701-lsn-2D2B-F600.snap
> > -rw---. 1 postgres postgres 16790984 Sep 20 15:26
> xid-1238444701-lsn-2D2B-F700.snap
> > -rw---. 1 postgres postgres 16988112 Sep 20 15:26
> xid-1238444701-lsn-2D2B-F800.snap
> > -rw---. 1 postgres postgres 16864593 Sep 20 15:26
> xid-1238444701-lsn-2D2B-F900.snap
> > -rw---. 1 postgres postgres 16902167 Sep 20 15:26
> xid-1238444701-lsn-2D2B-FA00.snap
> > -rw---. 1 postgres postgres 16914638 Sep 20 15:26
> xid-1238444701-lsn-2D2B-FB00.snap
> > -rw---. 1 postgres postgres 16782471 Sep 20 15:26
> xid-1238444701-lsn-2D2B-FC00.snap
> > -rw---. 1 postgres postgres 16963667 Sep 20 15:27
> xid-1238444701-lsn-2D2B-FD00.snap
> > ...
> >
>
> Okay, still not sure why the publisher is not sending keep_alive
> messages in between spilling such a big transaction. If you see, we
> have logic in WalSndLoop() wherein each time after sending data we
> check whether we need to send a keep-alive message via function
> WalSndKeepaliveIfNecessary(). I think to debug this problem further
> you need to add some logs in function WalSndKeepaliveIfNecessary() to
> see why it is not sending keep_alive messages when all these files are
> being created.
>
> Did you change the default value of
> wal_sender_timeout/wal_receiver_timeout? What is the value of those
> variables in your environment? Did you see the message "terminating
> walsender process due to replication timeout" in your server logs?
>
> --
> With Regards,
> Amit Kapila.
>


Re: Logical replication timeout problem

2021-09-20 Thread Fabrice Chapuis
With the autovacuum parameter set to off, the problem did not occur right
after loading the table as in our previous tests. However, the timeout
occurred later, and we have seen several GB of .snap files accumulate.
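
While a large transaction is being decoded, the walsender's progress (and
whether it is still advancing at all) can be watched on the publisher with
something like this; a monitoring sketch using columns available in PG 10:

-- one row per walsender; if sent_lsn stops moving while snap files keep
-- growing, the sender is busy decoding/spilling rather than sending
SELECT pid, application_name, state, sent_lsn, write_lsn, flush_lsn
FROM pg_stat_replication;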

...
-rw---. 1 postgres postgres 16791226 Sep 20 15:26
xid-1238444701-lsn-2D2B-F500.snap
-rw---. 1 postgres postgres 16973268 Sep 20 15:26
xid-1238444701-lsn-2D2B-F600.snap
-rw---. 1 postgres postgres 16790984 Sep 20 15:26
xid-1238444701-lsn-2D2B-F700.snap
-rw---. 1 postgres postgres 16988112 Sep 20 15:26
xid-1238444701-lsn-2D2B-F800.snap
-rw---. 1 postgres postgres 16864593 Sep 20 15:26
xid-1238444701-lsn-2D2B-F900.snap
-rw---. 1 postgres postgres 16902167 Sep 20 15:26
xid-1238444701-lsn-2D2B-FA00.snap
-rw---. 1 postgres postgres 16914638 Sep 20 15:26
xid-1238444701-lsn-2D2B-FB00.snap
-rw---. 1 postgres postgres 16782471 Sep 20 15:26
xid-1238444701-lsn-2D2B-FC00.snap
-rw---. 1 postgres postgres 16963667 Sep 20 15:27
xid-1238444701-lsn-2D2B-FD00.snap
...
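
As an aside, from PG 14 the amount of decoded transaction data spilled to
disk per slot can be read from a statistics view instead of counting files
under pg_replslot; a sketch:

-- spill_txns/spill_bytes show how much reorder-buffer data went to disk
SELECT slot_name, spill_txns, spill_count, spill_bytes
FROM pg_stat_replication_slots;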



2021-09-20 17:11:29 CEST [12687]: [1283-1] user=,db=,client= LOG:
checkpoint starting: time
2021-09-20 17:11:31 CEST [12687]: [1284-1] user=,db=,client= LOG:
checkpoint complete: wrote 13 buffers (0.0%); 0 WAL file(s) added, 0
removed, 0 recycled; write=1.713 s, sync=0.001 s, total=1.718 s
; sync files=12, longest=0.001 s, average=0.001 s; distance=29 kB,
estimate=352191 kB
2021-09-20 17:12:43 CEST [59986]: [2-1] user=,db=,client= ERROR:
terminating logical replication worker due to timeout
2021-09-20 17:12:43 CEST [12546]: [1068-1] user=,db=,client= LOG:  worker
process: logical replication worker for subscription 24215702 (PID 59986)
exited with exit code 1
2021-09-20 17:12:43 CEST [39945]: [1-1] user=,db=,client= LOG:  logical
replication apply worker for subscription "sub" has started
2021-09-20 17:12:43 CEST [39946]: [1-1] user=repuser,db=db,client=127.0.0.1
LOG:  received replication command: IDENTIFY_SYSTEM

Regards,

Fabrice



On Mon, Sep 20, 2021 at 1:51 PM Amit Kapila  wrote:

> On Mon, Sep 20, 2021 at 4:10 PM Fabrice Chapuis 
> wrote:
> >
> > Hi Amit,
> >
> > We can replay the problem: we load a table of several Gb in the schema
> of the publisher, this generates the worker's timeout after one minute from
> the end of this load. The table on which this load is executed is not
> replicated.
> >
> > 2021-09-16 12:06:50 CEST [24881]: [1-1]
> user=postgres,db=db012a00,client=[local] LOG:  duration: 1281408.171 ms
> statement: COPY db.table (col1, col2) FROM stdin;
> >
> > 2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG:
> automatic analyze of table "db.table " system usage: CPU: user: 4.13 s,
> system: 0.55 s, elapsed: 9.58 s
> >
> > 2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR:
> terminating logical replication worker due to timeout
> >
> > Before increasing value for wal_sender_timeout and wal_receiver_timeout
> I thought to further investigate the mechanisms leading to this timeout.
> >
>
> The basic problem here seems to be that WAL Sender is not able to send
> a keepalive or any other message for the configured
> wal_receiver_timeout. I am not sure how that can happen but can you
> once try by switching autovacuum = off? I wanted to ensure that
> WALSender is not blocked due to the background process autovacuum.
>
> --
> With Regards,
> Amit Kapila.
>


Re: Logical replication timeout problem

2021-09-20 Thread Fabrice Chapuis
Hi Amit,

We can reproduce the problem: we load several GB into a table in the
publisher's schema, which triggers the worker's timeout one minute after
the end of the load. The table into which this load is performed is not
replicated (a rough reproduction sketch follows below).
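
The load is roughly equivalent to the following hypothetical sketch (the
real load is a COPY from our application, see the log below): even though
the table is not in any publication, the walsender still has to decode the
whole transaction and spill it to snap files before discarding it.

-- table that is NOT part of the publication
CREATE TABLE big_unreplicated (id int, payload text);

-- one large transaction, comparable to the COPY shown in the log
INSERT INTO big_unreplicated
SELECT g, repeat('x', 1000)
FROM generate_series(1, 10000000) AS g;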

2021-09-16 12:06:50 CEST [24881]: [1-1]
user=postgres,db=db012a00,client=[local] LOG:  duration: 1281408.171 ms
statement: COPY db.table (col1, col2) FROM stdin;

2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG:  automatic
analyze of table "db.table " system usage: CPU: user: 4.13 s, system: 0.55
s, elapsed: 9.58 s

2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR:
terminating logical replication worker due to timeout

Before increasing the values of wal_sender_timeout and wal_receiver_timeout,
I would like to investigate further the mechanisms leading to this timeout.

Thanks for your help

Fabrice



On Sun, Sep 19, 2021 at 6:25 AM Amit Kapila  wrote:

> On Fri, Sep 17, 2021 at 8:08 PM Fabrice Chapuis 
> wrote:
> >
> > the publisher and the subscriber run on the same postgres instance.
> >
>
> Okay, but there is no log corresponding to operations being performed
> by the publisher. By looking at current logs it is not very clear to
> me what might have caused this. Did you try increasing
> wal_sender_timeout and wal_receiver_timeout?
>
> --
> With Regards,
> Amit Kapila.
>


Re: Logical replication timeout problem

2021-09-17 Thread Fabrice Chapuis
The publisher and the subscriber run on the same Postgres instance.

Regards,
Fabrice

On Fri, Sep 17, 2021 at 12:26 PM Amit Kapila 
wrote:

> On Fri, Sep 17, 2021 at 3:29 PM Fabrice Chapuis 
> wrote:
> >
> > Hi,
> >
> > Logical replication is configured on one instance in version 10.18.
> Timeout errors occur regularly and the worker process exit with an exit
> code 1
> >
> > 2021-09-16 12:06:50 CEST [24881]: [1-1]
> user=postgres,db=foo,client=[local] LOG:  duration: 1281408.171 ms
> statement: COPY schem.tab (col1, col2) FROM stdin;
> > 2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG:
> automatic analyze of table "foo.schem.tab" system usage: CPU: user: 4.13 s,
> system: 0.55 s, elapsed: 9.58 s
> > 2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR:
> terminating logical replication worker due to timeout
> > 2021-09-16 12:07:50 CEST [12546]: [11-1] user=,db=,client= LOG:  worker
> process: logical replication worker for subscription 24106654 (PID 3770)
> exited with exit code 1
> > 2021-09-16 12:07:50 CEST [13872]: [1-1] user=,db=,client= LOG:  logical
> replication apply worker for subscription "sub" has started
> > 2021-09-16 12:07:50 CEST [13873]: [1-1]
> user=repuser,db=foo,client=127.0.0.1 LOG:  received replication command:
> IDENTIFY_SYSTEM
> >
>
> Can you share the publisher-side log as well?
>
>
> --
> With Regards,
> Amit Kapila.
>


Logical replication timeout problem

2021-09-17 Thread Fabrice Chapuis
Hi,

Logical replication is configured on one instance running version 10.18.
Timeout errors occur regularly and the worker process exits with exit code 1:

2021-09-16 12:06:50 CEST [24881]: [1-1] user=postgres,db=foo,client=[local]
LOG:  duration: 1281408.171 ms  statement: COPY schem.tab (col1, col2) FROM
stdin;
2021-09-16 12:07:11 CEST [12161]: [1-1] user=,db=,client= LOG:  automatic
analyze of table "foo.schem.tab" system usage: CPU: user: 4.13 s, system:
0.55 s, elapsed: 9.58 s
2021-09-16 12:07:50 CEST [3770]: [2-1] user=,db=,client= ERROR:
terminating logical replication worker due to timeout
2021-09-16 12:07:50 CEST [12546]: [11-1] user=,db=,client= LOG:  worker
process: logical replication worker for subscription 24106654 (PID 3770)
exited with exit code 1
2021-09-16 12:07:50 CEST [13872]: [1-1] user=,db=,client= LOG:  logical
replication apply worker for subscription "sub" has started
2021-09-16 12:07:50 CEST [13873]: [1-1]
user=repuser,db=foo,client=127.0.0.1 LOG:  received replication command:
IDENTIFY_SYSTEM
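
The subscription OID in these messages can be mapped back to a subscription
name with a quick catalog lookup (a sketch; the OID is taken from the log
above):

-- resolve the subscription OID logged by the launcher
SELECT oid, subname, subenabled, subslotname
FROM pg_subscription
WHERE oid = 24106654;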

Why does this happen?

Thanks a lot for your help

Fabrice