Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding

2024-05-21 Thread Ajin Cherian
On Tue, Mar 19, 2024 at 1:49 PM Ajin Cherian  wrote:

>
>
>> Of course you can, but this will only convert disk space into memory
>> space.
>>  For details, please see the case in Email [1].
>>
>> [1]
>> https://www.postgresql.org/message-id/CAGfChW51P944nM5h0HTV9HistvVfwBxNaMt_s-OZ9t%3DuXz%2BZbg%40mail.gmail.com
>>
>> Regards, lijie
>>
>>
>
In some testing, I see a crash:
(gdb) bt
#0  0x7fa5bcbfd277 in raise () from /lib64/libc.so.6
#1  0x7fa5bcbfe968 in abort () from /lib64/libc.so.6
#2  0x009e0940 in ExceptionalCondition (
conditionName=conditionName@entry=0x7fa5ab8b9842 "RelationSyncCache !=
NULL",
fileName=fileName@entry=0x7fa5ab8b9820 "pgoutput.c",
lineNumber=lineNumber@entry=1991)
at assert.c:66
#3  0x7fa5ab8b7804 in get_rel_sync_entry (data=data@entry=0x2492288,
relation=relation@entry=0x7fa5be30a768) at pgoutput.c:1991
#4  0x7fa5ab8b7cda in pgoutput_table_filter (ctx=,
relation=0x7fa5be30a768,
change=0x24c5c20) at pgoutput.c:1671
#5  0x00813761 in filter_by_table_cb_wrapper (ctx=ctx@entry=0x2491fd0,

relation=relation@entry=0x7fa5be30a768, change=change@entry=0x24c5c20)
at logical.c:1268
#6  0x0080e20f in FilterByTable (ctx=ctx@entry=0x2491fd0,
change=change@entry=0x24c5c20)
at decode.c:690
#7  0x0080e8e3 in DecodeInsert (ctx=ctx@entry=0x2491fd0,
buf=buf@entry=0x7fff0db92550)
at decode.c:1070
#8  0x0080f43d in heap_decode (ctx=ctx@entry=0x2491fd0,
buf=buf@entry=0x7fff0db92550)
at decode.c:485
#9  0x0080eca6 in LogicalDecodingProcessRecord
(ctx=ctx@entry=0x2491fd0,
record=0x2492368)
at decode.c:118
#10 0x0081338f in DecodingContextFindStartpoint
(ctx=ctx@entry=0x2491fd0)
at logical.c:672
#11 0x0083c650 in CreateReplicationSlot (cmd=cmd@entry=0x2490970)
at walsender.c:1323
#12 0x0083fd48 in exec_replication_command (
cmd_string=cmd_string@entry=0x239c880 "CREATE_REPLICATION_SLOT
\"pg_16387_sync_16384_7371301304766135621\" LOGICAL pgoutput (SNAPSHOT
'use')") at walsender.c:2116

The reason for the crash is that the RelationSyncCache was NULL prior to
reaching a consistent point.
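One way to address this (a minimal sketch of the idea; not necessarily how
the attached patch fixes it) is to skip the filter callback until the
snapshot builder has reached a consistent state, since pgoutput only sets
up RelationSyncCache once decoding is properly underway:

/* Hypothetical guard at the top of FilterByTable() in decode.c */
if (SnapBuildCurrentState(ctx->snapshot_builder) < SNAPBUILD_CONSISTENT)
    return false;               /* don't filter; keep the change */
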
Hi li jie, I see that you created a new thread with an updated version of
this patch [1]. I took that patch, fixed the crash seen above, rebased it,
and addressed a few other comments.
I'm happy to help you with this patch and address comments if you are not
available.

regards,
Ajin Cherian
Fujitsu Australia
[1] -
https://www.postgresql.org/message-id/CAGfChW7%2BZMN4_NHPgz24MM42HVO83ecr9TLfpihJ%3DM0s1GkXFw%40mail.gmail.com


v2-0001-Reduce-useless-changes-before-reassemble-during-l.patch
Description: Binary data


Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-22 Thread Ajin Cherian
On Thu, Apr 18, 2024 at 4:26 PM Ajin Cherian  wrote:

>
> Attaching the patch for your review and comments. Big thanks to Kuroda-san
> for also working on the patch.
>
>
Looking at this a bit more: rolling back all prepared transactions on the
subscriber when toggling two_phase from true to false might not be
desirable for the customer. Maybe we should have an option that lets
customers control whether transactions are rolled back or not; perhaps
transactions should only be rolled back if a "force" option is also set.
What do people think?

regards,
Ajin Cherian
Fujitsu Australia


Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-18 Thread Ajin Cherian
On Tue, Apr 16, 2024 at 4:25 PM Amit Kapila  wrote:

>
> >
>
> Can you please once consider the idea shared by me at [1] (One naive
> idea is that on the publisher ...) to solve this problem?
>
> [1] -
> https://www.postgresql.org/message-id/CAA4eK1K1fSkeK%3Dkc26G5cq87vQG4%3D1qs_b%2Bno4%2Bep654SeBy1w%40mail.gmail.com
>
>
>
Expanding on Amit's idea, we found that there is already a mechanism in the
code to fully decode prepared transactions whose PREPARE is prior to the
LSN at which two_phase was enabled, using the "two_phase_at" LSN in the
slot; see ReorderBufferFinishPrepared() for how this is done. This code was
not working as expected with our patch because we were setting two_phase on
the slot to true as soon as the ALTER command was received. That was not
correct: when two_phase is first enabled, the subscription's two_phase
state moves to "pending", and the two_phase option on the slot should only
be set to true when that state moves from pending to enabled, which happens
once replication is restarted with the two_phase option. See
CreateDecodingContext() for how "two_phase_at" is set in the slot when done
this way. So we changed the code to not remotely alter two_phase when
toggling from false to true. With this change, even if there are pending
prepared transactions on the publisher when toggling two_phase from false
to true, they will be fully decoded and sent once the COMMIT PREPARED is
decoded, because the pending prepared transactions are prior to the
"two_phase_at" LSN. With this patch, we can now handle pending prepared
transactions when altering two_phase in both directions.
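For reference, the key check is roughly the following (a paraphrase of the
logic in ReorderBufferFinishPrepared(); not the verbatim source):

/* Paraphrased from ReorderBufferFinishPrepared() in reorderbuffer.c */
if (txn->final_lsn < two_phase_at && is_commit)
{
    /*
     * The PREPARE predates two_phase_at, so nothing was sent at PREPARE
     * time; decode and send the whole transaction as a normal commit now.
     */
    ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn,
                        commit_time, origin_id, origin_lsn);
}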

Attaching the patch for your review and comments. Big thanks to Kuroda-san
for also working on the patch.

regards,
Ajin Cherian
Fujitsu Australia.


v4-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch
Description: Binary data


v4-0003-Abort-prepared-transactions-while-altering-two_ph.patch
Description: Binary data


v4-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description: Binary data


Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-16 Thread Ajin Cherian
On Tue, Apr 16, 2024 at 1:31 AM Давыдов Виталий 
wrote:

> Dear All,
> Just interested, has anyone tried to reproduce the problem with slow
> catchup of twophase transactions (pgbench should be used with a big number
> of clients)? I haven't seen any messages from anyone other than me saying
> that the problem takes place.
>
>
>

 Yes, I was able to reproduce the slow catchup of twophase transactions
with pgbench with 20 clients.

regards,
Ajin Cherian
Fujitsu Australia


Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-05 Thread Ajin Cherian
On Thu, Apr 4, 2024 at 4:38 PM Amit Kapila  wrote:

>
> I think this would probably be better than the current situation but
> can we think of a solution to allow toggling the value of two_phase
> even when prepared transactions are present? Can you please summarize
> the reason for the problems in doing that and the solutions, if any?
>
> --
> With Regards,
> Amit Kapila.
>

Updated the patch, as it wasn't handling the updating of two_phase in the
remote slot.

Currently the main issue that needs to be handled is the handling of
pending prepared transactions while two_phase is altered. I see 3 issues
with the current approach.

1. Uncommitted prepared transactions when toggling two_phase from true to
false.
When two_phase was true, prepared transactions were decoded at PREPARE time
and sent to the subscriber, where they were prepared with a new gid. Once
two_phase is toggled to false, the COMMIT PREPARED on the publisher is
converted to a commit, and the entire transaction is decoded and sent to
the subscriber. This will leave the previously prepared transaction
pending.

2. Uncommitted prepared transactions when toggling two_phase from false to
true.
When two_phase was false, prepared transactions were ignored and not
decoded at PREPARE time on the publisher. Once two_phase is toggled to
true, the apply worker and the walsender are restarted, and replication is
restarted from a new "start_decoding_at" LSN. This new "start_decoding_at"
could be past the LSN of the PREPARE record, in which case the PREPARE
record is skipped and not sent to the subscriber. Look at the comments in
DecodeTXNNeedSkip() for details (a paraphrase of the check follows this
list). Later, when the user issues COMMIT PREPARED, this is decoded and
sent to the subscriber, but there is no prepared transaction on the
subscriber, and this fails because the corresponding gid of the transaction
couldn't be found.

3. While altering the two_phase of the subscription, it is required to also
alter the two_phase field of the slot on the primary. The subscription
cannot remotely alter the two_phase option of the slot when the
subscription is enabled, as the slot is owned by the walsender on the
publisher side.
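The skip check referred to in issue 2 is roughly the following (a trimmed
paraphrase of DecodeTXNNeedSkip() in decode.c; the database and origin
checks are omitted):

/* A transaction is skipped when its WAL position is before the point
 * decoding starts at, which is how a pre-existing PREPARE gets dropped. */
return SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
       ctx->fast_forward;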

Possible solutions for the 3 problems:

1. While toggling two_phase from true to false, we could probably get the
list of prepared transactions for this subscriber id and rollback/abort
those prepared transactions (a rough sketch follows this list). This will
allow the transactions to be re-applied like normal transactions when the
commit comes. Alternatively, if it isn't appropriate to do this in the
ALTER SUBSCRIPTION context, we could store the xids of all prepared
transactions of this subscription in a list, and when the corresponding xid
is being committed by the apply worker, roll back the previously prepared
transaction just prior to the commit. But this would add the overhead of
checking this list every time a transaction is committed by the apply
worker.

2. No solution yet.

3. We could mandate that altering the two_phase state only be done after
disabling the subscription, just as is done for the failover option.
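For solution 1, the ALTER SUBSCRIPTION variant could look roughly like the
sketch below. This is only an illustration: TwoPhaseTransactionGid() is the
helper the apply worker already uses to build the subscriber-side gid for a
(subscription oid, remote xid) pair, but where the remote xids come from
and the loop shape are assumptions of the sketch:

/* Hypothetical: roll back this subscription's prepared transactions */
for (int i = 0; i < nxids; i++)
{
    char gid[GIDSIZE];

    TwoPhaseTransactionGid(MySubscription->oid, remote_xids[i],
                           gid, sizeof(gid));
    FinishPreparedTransaction(gid, false);  /* false => ROLLBACK PREPARED */
}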
Let me know your thoughts.

regards,
Ajin Cherian
Fujitsu Australia


v2-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description: Binary data


Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-03 Thread Ajin Cherian
On Wed, Mar 6, 2024 at 1:29 AM Давыдов Виталий 
wrote:

> In usual work, the subscription has two_phase = on. I have to change this
> option at catchup stage only, but this parameter can not be altered. There
> was a patch proposal in past to implement altering of two_phase option, but
> it was rejected. I think, the recreation of the subscription with two_phase
> = off will not work.
>
>
>
The altering of two_phase was restricted because if there was a previously
prepared transaction on the subscriber when the two_phase was on, and then
it was turned off, the apply worker on the subscriber would re-apply the
transaction a second time and this might result in an inconsistent replica.
Here's a patch that allows toggling two_phase option provided that there
are no pending uncommitted prepared transactions on the subscriber for that
subscription.

Thanks to Kuroda-san for working on the patch.

regards,
Ajin Cherian
Fujitsu Australia


v1-0001-Allow-altering-of-two_phase-option-in-subscribers.patch
Description: Binary data


Re: Skip collecting decoded changes of already-aborted transactions

2024-03-27 Thread Ajin Cherian
On Mon, Mar 18, 2024 at 7:50 PM Masahiko Sawada 
wrote:

>
> In addition to these changes, I've made some changes to the latest
> patch. Here is the summary:
>
> - Use txn_flags field to record the transaction status instead of two
> 'committed' and 'aborted' flags.
> - Add regression tests.
> - Update commit message.
>
> Regards,
>
>
Hi Sawada-san,

Thanks for the updated patch. Some comments:

1.
+ * already aborted, we discards all changes accumulated so far and ignore
+ * future changes, and return true. Otherwise return false.

we discards/we discard

2. In function ReorderBufferCheckTXNAbort(): I haven't tested this, but I
wonder how prepared transactions would be considered; they are neither
committed nor in progress.

regards,
Ajin Cherian
Fujitsu Australia


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Ajin Cherian
On Tue, Mar 26, 2024 at 7:57 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> Please see the attached v23 patches. I've addressed all the review
> comments received so far from Amit and Shveta.
>
>
In patch 0003:
+ SpinLockAcquire(&slot->mutex);
+ }
+
+ Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
+
+ if (slot->inactive_since > 0 &&
+ slot->data.inactive_timeout > 0)
+ {
+ TimestampTz now;
+
+ /* inactive_since is only tracked for inactive slots */
+ Assert(slot->active_pid == 0);
+
+ now = GetCurrentTimestamp();
+ if (TimestampDifferenceExceeds(slot->inactive_since, now,
+   slot->data.inactive_timeout * 1000))
+ inavidation_cause = RS_INVAL_INACTIVE_TIMEOUT;
+ }
+
+ if (need_locks)
+ {
+ SpinLockRelease(&slot->mutex);

Here, GetCurrentTimestamp() is still called with the spinlock held. Maybe
call it prior to acquiring the spinlock.
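Something along these lines (a sketch using the names from the quoted
hunk; the conditional need_locks handling is elided for brevity):

TimestampTz now = GetCurrentTimestamp();    /* before taking the lock */

SpinLockAcquire(&slot->mutex);
if (slot->inactive_since > 0 &&
    slot->data.inactive_timeout > 0 &&
    TimestampDifferenceExceeds(slot->inactive_since, now,
                               slot->data.inactive_timeout * 1000))
    inavidation_cause = RS_INVAL_INACTIVE_TIMEOUT;  /* name as in the patch */
SpinLockRelease(&slot->mutex);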

regards,
Ajin Cherian
Fujitsu Australia


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-22 Thread Ajin Cherian
On Fri, Mar 22, 2024 at 7:15 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Fri, Mar 22, 2024 at 12:39 PM Bertrand Drouvot
>  wrote:
> >
> > > > Please find the v14-0001 patch for now.
> >
> > Thanks!
> >
> > > LGTM. Let's wait for Bertrand to see if he has more comments on 0001
> > > and then I'll push it.
> >
> > LGTM too.
>
> Thanks. Here I'm implementing the following:
>
> 0001 Track invalidation_reason in pg_replication_slots
> 0002 Track last_inactive_at in pg_replication_slots
> 0003 Allow setting inactive_timeout for replication slots via SQL API
> 0004 Introduce new SQL function pg_alter_replication_slot
> 0005 Allow setting inactive_timeout in the replication command
> 0006 Add inactive_timeout based replication slot invalidation
>
> 1. Keep it last_inactive_at as a shared memory variable, but always
> set it at restart if the slot's inactive_timeout has non-zero value
> and reset it as soon as someone acquires that slot so that if the slot
> doesn't get acquired  till inactive_timeout, checkpointer will
> invalidate the slot.
> 2. Ensure with pg_alter_replication_slot one could "only" alter the
> timeout property for the time being, if not that could lead to the
> subscription inconsistency.
> 3. Have some notes in the CREATE and ALTER SUBSCRIPTION docs about
> using an existing slot to leverage inactive_timeout feature.
> 4. last_inactive_at should also be set to the current time during slot
> creation because if one creates a slot and does nothing with it then
> it's the time it starts to be inactive.
> 5. We don't set last_inactive_at to GetCurrentTimestamp() for failover
> slots.
> 6. Leave the patch that added support for inactive_timeout in
> subscriptions.
>
> Please see the attached v14 patch set. No change in the attached
> v14-0001 from the previous patch.
>
>
>
Some comments:
1. In patch 0005:
In ReplicationSlotAlter():
+ lock_acquired = false;
  if (MyReplicationSlot->data.failover != failover)
  {
  SpinLockAcquire(&MyReplicationSlot->mutex);
+ lock_acquired = true;
  MyReplicationSlot->data.failover = failover;
+ }
+
+ if (MyReplicationSlot->data.inactive_timeout != inactive_timeout)
+ {
+ if (!lock_acquired)
+ {
+ SpinLockAcquire(&MyReplicationSlot->mutex);
+ lock_acquired = true;
+ }
+
+ MyReplicationSlot->data.inactive_timeout = inactive_timeout;
+ }
+
+ if (lock_acquired)
+ {
  SpinLockRelease(&MyReplicationSlot->mutex);

Can't you make it shorter like below:
lock_acquired = false;

if (MyReplicationSlot->data.failover != failover ||
MyReplicationSlot->data.inactive_timeout != inactive_timeout) {
SpinLockAcquire(&MyReplicationSlot->mutex);
lock_acquired = true;
}

if (MyReplicationSlot->data.failover != failover) {
MyReplicationSlot->data.failover = failover;
}

if (MyReplicationSlot->data.inactive_timeout != inactive_timeout) {
MyReplicationSlot->data.inactive_timeout = inactive_timeout;
}

if (lock_acquired) {
SpinLockRelease(&MyReplicationSlot->mutex);
ReplicationSlotMarkDirty();
    ReplicationSlotSave();
}

2. In patch 0005: why change the walrcv_alter_slot option? It doesn't seem
to be used anywhere; is there a use case for it? If required, would the
intention be to add this as a CREATE SUBSCRIPTION option?

regards,
Ajin Cherian
Fujitsu Australia


Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding

2024-03-18 Thread Ajin Cherian
>
> Of course you can, but this will only convert disk space into memory space.
>  For details, please see the case in Email [1].
>
> [1]
> https://www.postgresql.org/message-id/CAGfChW51P944nM5h0HTV9HistvVfwBxNaMt_s-OZ9t%3DuXz%2BZbg%40mail.gmail.com
>
> Regards, lijie
>
>
Hi lijie,

Overall, I think the patch is a good improvement. Some comments from first
run through of patch:
1. The patch no longer applies cleanly, please rebase.

2. While testing the patch, I saw something strange. If I try to truncate a
table that is published, I still see the message:
2024-03-18 22:25:51.243 EDT [29385] LOG:  logical filter change by table
pg_class

This gives the impression that the truncate operation on the published
table has been filtered, but it hasn't. Also, the log message needs to be
reworded, maybe to something like "logical filtering change by
non-published table".

3. Below code:
@@ -1201,11 +1343,14 @@ DecodeMultiInsert(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
+
+ if (FilterByTable(ctx, change))
+ continue;;

extra semi-colon after continue.

4. I am not sure if this is possible, but is there a way to avoid the
overhead in the patch if the publication publishes "ALL TABLES"? (A
possible early exit is sketched after these comments.)

5. In function pgoutput_table_filter(): this code appears to be filtering
out not just unpublished tables but also applying row-based filters on
published tables as well. Is this really within the scope of the feature?
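For comment 4, a hypothetical early exit could look like the following
(assuming the filter callback can see the PGOutputData publication list;
'data' follows pgoutput.c conventions, but this is not code from the
posted patch):

/* Hypothetical: skip all filtering work for FOR ALL TABLES publications */
ListCell   *lc;

foreach(lc, data->publications)
{
    Publication *pub = (Publication *) lfirst(lc);

    if (pub->alltables)
        return false;       /* everything is published; nothing to filter */
}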

regards,
Ajin Cherian
Fujitsu Australia


Re: Skip collecting decoded changes of already-aborted transactions

2024-03-14 Thread Ajin Cherian
On Fri, Mar 15, 2024 at 3:17 PM Masahiko Sawada 
wrote:

>
> I resumed working on this item. I've attached the new version patch.
>
> I rebased the patch to the current HEAD and updated comments and
> commit messages. The patch is straightforward and I'm somewhat
> satisfied with it, but I'm thinking of adding some tests for it.
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com


I just had a look at the patch; it no longer applies because a header was
removed in a recent commit. Overall the patch looks fine, and I didn't find
any issues. Some cosmetic comments:
in ReorderBufferCheckTXNAbort()
+ /* Quick return if we've already knew the transaction status */
+ if (txn->aborted)
+ return true;

knew/know

/*
+ * If logical_replication_mode is "immediate", we don't check the
+ * transaction status so the caller always process this transaction.
+ */
+ if (debug_logical_replication_streaming ==
DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE)
+ return false;

/process/processes

regards,
Ajin Cherian
Fujitsu Australia


Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-03-11 Thread Ajin Cherian
On Tue, Mar 12, 2024 at 2:59 PM vignesh C  wrote:

>
> Thanks, I have created the following Commitfest entry for this:
> https://commitfest.postgresql.org/47/4816/
>
> Regards,
> Vignesh
>

Thanks for the patch. I have verified that the fix works well by following
the steps mentioned to reproduce the problem.
Reviewing the patch, it seems good and is well documented. Just one minor
comment: consider changing the name of the variable table_states_valid to
table_states_validity. The current name made sense when it was a bool, but
now that it is a tri-state enum, it doesn't fit well.

regards,
Ajin Cherian
Fujitsu Australia


Re: Synchronizing slots from primary to standby

2024-03-07 Thread Ajin Cherian
On Fri, Mar 8, 2024 at 2:33 PM Amit Kapila  wrote:

> On Thu, Mar 7, 2024 at 12:00 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> >
> > Attach the V108 patch set which addressed above and Peter's comments.
> > I also removed the check for "*" in guc check hook.
> >
>
>
> Pushed with minor modifications. I'll keep an eye on BF.
>
> BTW, one thing that we should try to evaluate a bit more is the
> traversal of slots in StandbySlotsHaveCaughtup() where we verify if
> all the slots mentioned in standby_slot_names have received the
> required WAL. Even if the standby_slot_names list is short the total
> number of slots can be much larger which can lead to an increase in
> CPU usage during traversal. There is an optimization that allows to
> cache ss_oldest_flush_lsn and ensures that we don't need to traverse
> the slots each time so it may not hit frequently but still there is a
> chance. I see it is possible to further optimize this area by caching
> the position of each slot mentioned in standby_slot_names in
> replication_slots array but not sure whether it is worth.
>
>
>
I tried to test this by configuring a large number of logical slots while
making sure the standby slots are at the end of the array and checking if
there was any performance hit in logical replication from these searches.

Setup:
1. 1 primary server configured with 3 servers in standby_slot_names, 1
extra logical slot (not configured for failover) + 1 logical subscriber
configured as failover + 3 physical standbys (all configured to sync
logical slots)

2. 1 primary server configured with 3 servers in standby_slot_names, 100
extra logical slots (not configured for failover) + 1 logical subscriber
configured as failover + 3 physical standbys (all configured to sync
logical slots)

3. 1 primary server configured with 3 servers in standby_slot_names, 500
extra logical slots (not configured for failover) + 1 logical subscriber
configured as failover + 3 physical standbys (all configured to sync
logical slots)

In the three setups, the 3 standby_slot_names are compared against lists of
2, 101, and 501 slots respectively.

I ran a pgbench for 15 minutes for all 3 setups:

Case 1: Average TPS - 8.143399 TPS
Case 2: Average TPS - 8.187462 TPS
Case 3: Average TPS - 8.190611 TPS

I see no degradation in performance; the differences are well within the
run-to-run variation seen.


Nisha also did some performance tests to measure the lag introduced by
traversing a large number of slots in StandbySlotsHaveCaughtup(). The tests
logged the time at the start and end of the XLogSendLogical() call (which
eventually calls WalSndWaitForWal() --> StandbySlotsHaveCaughtup()) and
calculated the total time taken by this function during the load run for
different total slot counts.

Setup:
--one primary with 3 standbys and one subscriber with one active
subscription
--hot_standby_feedback=off and sync_replication_slots=false
--made sure the standby slots remain at the end of the
ReplicationSlotCtl->replication_slots array, to measure the worst-case
performance of the standby slot search in StandbySlotsHaveCaughtup()

pgbench was run for 15 minutes. Here is the data:

Case1 : with 1 logical slot, standby_slot_names having 3 slots
Run1: 626.141642 secs
Run2: 631.930254 secs

Case2 : with 100 logical slots,  standby_slot_names having 3 slots
Run1: 629.38332 secs
Run2: 630.548432 secs

Case3 : with 500 logical slots,  standby_slot_names having 3 slots
Run1: 629.910829 secs
Run2: 627.924183 secs

There was no degradation in performance seen.

Thanks Nisha for helping with the testing.

regards,
Ajin Cherian
Fujitsu Australia


Re: Synchronizing slots from primary to standby

2024-03-01 Thread Ajin Cherian
-4890 v2 @ 2.80GHz, 800GB RAM

My additional configuration on each instance is:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

All tests are done using pgbench running for 15 minutes:
Creating tables: pgbench -p 6972 postgres -qis 2
Running benchmark: pgbench postgres -p 6972 -c 10 -j 3 -T 900 -P 5

HEAD code -
Primary with synchronous_commit=on, physical standby with
hot_standby_feedback=off
RUN1 (TPS) RUN2 (TPS) AVERAGE (TPS)
8.226658 8.17815 8.202404

HEAD code -
Primary with synchronous_commit=on, physical standby with
hot_standby_feedback=on
RUN1 (TPS) RUN2 (TPS) AVERAGE (TPS)
8.134901 8.229066 8.1819835 -- degradation from first config: -0.25%

PATCHED code (v101-0001) -
Primary with synchronous_commit=on, physical standby with
hot_standby_feedback=on, standby_slot_names not configured, logical
subscriber not failover enabled, physical standby not configured for sync
RUN1 (TPS) RUN2 (TPS) AVERAGE (TPS)
8.18839 8.18839 8.18839 -- degradation from first config: -0.17%

PATCHED code (v98-0001) -
synchronous_commit=on, hot_standby_feedback=on, standby_slot_names
configured to physical standby, logical subscriber failover enabled,
physical standby configured for sync
RUN1 (TPS) RUN2 (TPS) AVERAGE (TPS)
8.173062 8.068536 8.120799 -- degradation from first config: -0.99%

Overall, I do not see any significant performance degradation with the
patch and sync-slot enabled with one logical subscriber and one physical
standby.
Attaching script for my final test configuration for reference.

regards,
Ajin Cherian
Fujitsu Australia


Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-02-22 Thread Ajin Cherian
On Fri, Feb 23, 2024 at 12:29 AM Давыдов Виталий 
wrote:

> Dear All,
>
> I'd like to present and talk about a problem when 2PC transactions are
> applied quite slowly on a replica during logical replication. There is a
> master and a replica with established logical replication from the master
> to the replica with twophase = true. With some load level on the master,
> the replica starts to lag behind the master, and the lag will be
> increasing. We have to significantly decrease the load on the master to
> allow replica to complete the catchup. Such problem may create significant
> difficulties in the production. The problem appears at least on
> REL_16_STABLE branch.
>
> To reproduce the problem:
>
>- Setup logical replication from master to replica with subscription
>parameter twophase =  true.
>- Create some intermediate load on the master (use pgbench with custom
>sql with prepare+commit)
>- Optionally switch off the replica for some time (keep load on
>master).
>- Switch on the replica and wait until it reaches the master.
>
> The replica will never reach the master with even some low load on the
> master. If the load is removed, the replica takes much longer than
> expected to catch up. I tried the same for regular transactions, but
> such a problem doesn't appear even with a decent load.
>
>
>
I tried this setup and I do see that the logical subscriber does reach the
master in a short time. I'm not sure what I'm missing. I stopped the
logical subscriber in between while pgbench was running and then started it
again and ran the following:
postgres=# SELECT sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication;
 sent_lsn  | pg_current_wal_lsn
-----------+--------------------
 0/6793FA0 | 0/6793FA0 <=== caught up
(1 row)

My pgbench command:
pgbench postgres -p 6972 -c 2 -j 3 -f /home/ajin/test.sql -T 200 -P 5

my custom sql file:
cat test.sql
SELECT md5(random()::text) as mygid \gset
BEGIN;
DELETE FROM test WHERE v = pg_backend_pid();
INSERT INTO test(v) SELECT pg_backend_pid();
PREPARE TRANSACTION $$:mygid$$;
COMMIT PREPARED $$:mygid$$;

regards,
Ajin Cherian
Fujitsu Australia


Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Ajin Cherian
On Wed, Feb 21, 2024 at 4:09 PM Hayato Kuroda (Fujitsu) <
kuroda.hay...@fujitsu.com> wrote:

> Dear Horiguchi-san,
>
> > GetConnection()@streamutil.c wants to ensure conninfo has a fallback
> > database name ("replication"). However, the function seems to be
> > ignoring the case where neither dbname nor connection string is given,
> > which is the first case Kuroda-san raised. The second case is the
> > intended behavior of the function.
> >
> > > /* pg_recvlogical uses dbname only; others use connection_string
> only.
> > */
> > > Assert(dbname == NULL || connection_string == NULL);
> >
> > And the function incorrectly assumes that the connection string
> > requires "dbname=replication".
> >
> > >  * Merge the connection info inputs given in form of connection
> string,
> > >  * options and default values (dbname=replication,
> replication=true,
> > etc.)
> >
> > But the name is a pseudo database name only used by pg_hba.conf
> > (maybe) , which cannot be used as an actual database name (without
> > quotation marks, or unless it is actually created). The function
> > should not add the fallback database name because the connection
> > string for physical replication connection doesn't require the dbname
> > parameter. (attached patch)
>
> I had also missed this, but the requirement was that the dbname should be
> included only when the dbname option was explicitly specified [1]. Neither
> mine nor yours can handle that. The libpq function
> PQconnectdbParams()->pqConnectOptions2() fills in all the parameters of
> the PGconn, and at that point the information about whether a parameter
> was intentionally specified or not is discarded. Then
> GenerateRecoveryConfig() would just write down all the connection
> parameters from the PGconn; it cannot tell the difference.
>
Well, one option is that if a default dbname should be written to the
configuration file, then "postgres" is a better default than "replication"
or the username, since at least a database of that name exists.

regards,
Ajin Cherian
Fujitsu Australia


Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Ajin Cherian
On Wed, Feb 21, 2024 at 2:04 PM Kyotaro Horiguchi 
wrote:

>
> Although I haven't looked at the original thread, it seems that the
> dbname is used only by pg_sync_replication_slots(). If it is true,
> couldn't we make the SQL function require a database name to make a
> connection, instead of requiring it in physical-replication conninfo?
>
>
>
In the original thread, the intention is to not just provide this
functionality via the function pg_sync_replication_slots(), but to provide
a GUC option on standbys to sync logical replication slots periodically
even without calling that function. This requires connecting to a database.

regards,
Ajin Cherian
Fujitsu Australia


Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Ajin Cherian
On Wed, Feb 21, 2024 at 2:04 PM Kyotaro Horiguchi 
wrote:

>
> About the proposed patch, pg_basebackup cannot verify the validity of
> the dbname. It could be problematic.
>
> Although I haven't looked at the original thread, it seems that the
> dbname is used only by pg_sync_replication_slots(). If it is true,
> couldn't we make the SQL function require a database name to make a
> connection, instead of requiring it in physical-replication conninfo?
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>

I agree. If the intention is to meet the new requirement of the sync-slot
patch, which requires a dbname in primary_conninfo, then pseudo dbnames
will not work, whether it be the username or just "replication". I feel
that if the user does not specify dbname explicitly in pg_basebackup, it
should be left out of the generated primary_conninfo string as well.

regards,
Ajin Cherian
Fujitsu Australia


Re: Improve eviction algorithm in ReorderBuffer

2024-02-14 Thread Ajin Cherian
On Sat, Feb 10, 2024 at 2:23 AM Masahiko Sawada 
wrote:

> On Fri, Feb 9, 2024 at 7:35 PM Ajin Cherian  wrote:
> >
> >
> >
> > On Tue, Feb 6, 2024 at 5:06 PM Masahiko Sawada 
> wrote:
> >>
> >>
> >> I've attached the new version patch set.
> >>
> >> Regards,
> >>
> >>
> >> --
> >> Masahiko Sawada
> >> Amazon Web Services: https://aws.amazon.com
> >
> >
> > Thanks for the patch. I reviewed that patch and did minimal testing and
> it seems to show the speed up as claimed. Some minor comments:
>
> Thank you for the comments!
>
> > patch 0001:
> >
> > +static void
> > +bh_enlarge_node_array(binaryheap *heap)
> > +{
> > + if (heap->bh_size < heap->bh_space)
> > + return;
> >
> > why not check "if (heap->bh_size >= heap->bh_space)" outside this
> function to avoid calling this function when not necessary? This check was
> there in code before the patch.
>
> Agreed.
>
> >
> > patch 0003:
> >
> > +/*
> > + * The threshold of the number of transactions in the max-heap
> (rb->txn_heap)
> > + * to switch the state.
> > + */
> > +#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024
> >
> > Typo: I think you meant REORDER_ and not REORDE_
>
> Fixed.
>
> These comments are addressed in the v4 patch set I just shared[1].
>
>
These changes look good to me. I've done some tests with a few varying
levels of subtransactions and I could see that the patch was at least 5%
better in all of them.

regards,
Ajin Cherian
Fujitsu Australia


Re: Improve eviction algorithm in ReorderBuffer

2024-02-09 Thread Ajin Cherian
On Tue, Feb 6, 2024 at 5:06 PM Masahiko Sawada 
wrote:

>
> I've attached the new version patch set.
>
> Regards,
>
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com


Thanks for the patch. I reviewed that patch and did minimal testing and it
seems to show the speed up as claimed. Some minor comments:
patch 0001:

+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+ if (heap->bh_size < heap->bh_space)
+ return;

why not check "if (heap->bh_size >= heap->bh_space)" outside this function
to avoid calling this function when not necessary? This check was there in
code before the patch.

patch 0003:

+/*
+ * The threshold of the number of transactions in the max-heap
(rb->txn_heap)
+ * to switch the state.
+ */
+#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024

Typo: I think you meant REORDER_ and not REORDE_

regards,
Ajin Cherian
Fujitsu Australia


Re: Synchronizing slots from primary to standby

2024-02-04 Thread Ajin Cherian
On Mon, Feb 5, 2024 at 1:29 PM Zhijie Hou (Fujitsu) 
wrote:

> On Monday, February 5, 2024 10:17 AM Zhijie Hou (Fujitsu) <
> houzj.f...@fujitsu.com> wrote:
>
> There was one miss in the doc that cause CFbot failure,
> attach the correct version V77_2 here. There are no code changes compared
> to V77 version.
>
> Best Regards,
> Hou zj
>

Just noticed that doc/src/sgml/config.sgml still refers to enable_syncslot
instead of sync_replication_slots:

The standbys corresponding to the physical replication slots in
standby_slot_names must configure
enable_syncslot = true so they can receive
 failover logical slots changes from the primary.

regards,
Ajin Cherian
Fujitsu Australia


Re: Synchronizing slots from primary to standby

2024-01-31 Thread Ajin Cherian
On Tue, Jan 30, 2024 at 11:53 PM shveta malik 
wrote:

> On Tue, Jan 30, 2024 at 4:06 PM shveta malik 
> wrote:
> >
> > PFA v73-0001 which addresses the above comments. Other patches will be
> > rebased and posted after pushing this one.
>
> Since v73-0001 is pushed, PFA  rest of the patches. Changes are:
>
> 1) Rebased the patches.
> 2) Ran pg_indent on all.
> 3) patch001: Updated logicaldecoding.sgml for dbname requirement in
> primary_conninfo for slot-synchronization.
>
> thanks
> Shveta
>

Just to test the behaviour, I modified the code to set failover flag to
default to "true" while creating subscription and ran the regression tests.
I only saw the expected errors.
1. Make check in the postgres root folder - all failures are because of
differences when listing subscriptions, as the failover flag is now
enabled. The diff for regress is attached.

2. Make check in src/test/subscription - no failures. All tests successful.
Files=34, Tests=457, 81 wallclock secs ( 0.14 usr  0.05 sys +  9.53 cusr
13.00 csys = 22.72 CPU)
Result: PASS

3. Make check in src/test/recovery - 3 failures.
Test Summary Report
---
t/027_stream_regress.pl (Wstat: 256 Tests: 6 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1
t/035_standby_logical_decoding.pl   (Wstat: 7424 Tests: 8 Failed: 0)
  Non-zero exit status: 29
  Parse errors: No plan found in TAP output
t/050_standby_failover_slots_sync.pl (Wstat: 7424 Tests: 5 Failed: 0)
  Non-zero exit status: 29
  Parse errors: No plan found in TAP output

3a. Analysis of t/027_stream_regress.pl - 027 fails with the same issue as
"make check" in the postgres root folder (for which I attached the diffs).
027 runs the standard regression tests with streaming replication. Since
the regression tests fail because listing subscriptions now shows failover
enabled, 027 also fails in the same way with streaming replication.

3b. Analysis of t/035_standby_logical_decoding.pl - In this test case, they
attempt to create a subscription from the subscriber to the standby:
##
# Test that we can subscribe on the standby with the publication
# created on the primary.
##

Now, this fails because creating a subscription on the standby with
failover enabled will result in an error. I see the following in the log:
2024-01-28 23:51:30.425 EST [23332] tap_sub STATEMENT:
 CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput (FAILOVER, SNAPSHOT
'nothing')
2024-01-28 23:51:30.425 EST [23332] tap_sub ERROR:  cannot create
replication slot with failover enabled on the standby
I discussed this with Shveta, and she agreed that this is the expected
behaviour, as we don't support failover to cascading standbys yet.

3c. Analysis of t/050_standby_failover_slots_sync.pl - This is a new test
case created for this patch. It creates a subscription without failover
enabled to make sure that a subscription with failover disabled does not
depend on syncing on the standby, but this fails because we have failover
enabled by default.

In summary, I don't think these issues are actual bugs but rather expected
behaviour changes.

regards,
Ajin Cherian
Fujitsu Australia


Re: Synchronizing slots from primary to standby

2024-01-23 Thread Ajin Cherian
On Mon, Jan 22, 2024 at 10:30 PM shveta malik 
wrote:
>
> On Mon, Jan 22, 2024 at 3:10 PM Amit Kapila 
wrote:
> >
> > minor comments on the patch:
> > ===
>
> PFA v65 addressing the comments.
>
> Addressed comments by Peter in [1], comments by Hou-San in [2],
> comments by Amit in [3] and [4]
>
> TODO:
> Analyze the issue reported by Swada-san in [5] (pt 2)
> Disallow subscription creation on standby with failover=true (as we do
> not support sync on cascading standbys)
>
> [1]:
https://www.postgresql.org/message-id/CAHut%2BPt5Pk_xJkb54oahR%2Bf9oawgfnmbpewvkZPgnRhoJ3gkYg%40mail.gmail.com
> [2]:
https://www.postgresql.org/message-id/OS0PR01MB57160C7184E17C6765AAE38294752%40OS0PR01MB5716.jpnprd01.prod.outlook.com
> [3]:
https://www.postgresql.org/message-id/CAA4eK1JPB-zpGYTbVOP5Qp26tNQPMjDuYzNZ%2Ba9RFiN5nE1tEA%40mail.gmail.com
> [4]:
https://www.postgresql.org/message-id/CAA4eK1Jhy1-bsu6vc0%3DNja7aw5-EK_%3D101pnnuM3ATqTA8%2B%3DSg%40mail.gmail.com
> [5]:
https://www.postgresql.org/message-id/CAD21AoBgzONdt3o5mzbQ4MtqAE%3DWseiXUOq0LMqne-nWGjZBsA%40mail.gmail.com
>
>
I was doing some testing on this. What I noticed is that creating a
subscription with failover enabled takes a lot longer than creating one
with failover disabled. The setup has the primary configured with
standby_slot_names, and that standby has enable_syncslot turned on.

Publisher has one publication, no tables.
subscriber:
postgres=# \timing
Timing is on.
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres
host=localhost port=6972' PUBLICATION pub with (failover = true);
NOTICE:  created replication slot "sub" on publisher
CREATE SUBSCRIPTION
Time: 10011.829 ms (00:10.012)

== drop the sub

postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres
host=localhost port=6972' PUBLICATION pub with (failover = false);
NOTICE:  created replication slot "sub" on publisher
CREATE SUBSCRIPTION
Time: 46.317 ms

With failover=true, it takes 10011 ms while failover=false takes 46 ms.

I don't see a similar delay when creating a slot on the primary with
pg_create_logical_replication_slot() with the failover flag enabled.

Then on primary:
postgres=# SELECT 'init' FROM
pg_create_logical_replication_slot('lsub2_slot', 'pgoutput', false, false,
true);
?column?
--
init
(1 row)

Time: 36.125 ms
postgres=# SELECT 'init' FROM
pg_create_logical_replication_slot('lsub1_slot', 'pgoutput', false, false,
false);
?column?
--
init
(1 row)

Time: 53.981 ms

regards,
Ajin Cherian
Fujitsu Australia


Re: Synchronizing slots from primary to standby

2023-11-30 Thread Ajin Cherian
On Wed, Nov 29, 2023 at 8:17 PM Zhijie Hou (Fujitsu)
 wrote:
>
> This has been fixed.
>
> Best Regards,
> Hou zj

Thanks for addressing my comments. Some comments from my testing of patch v41

1. In my opinion, the second message "aborting the wait...moving to
the next slot" does not hold much value. There might not even be a
"next slot"; there might be just one slot. I think the first LOG is
enough to indicate that the sync-slot is waiting, as it repeats this
log until the slot catches up. I know these messages hold great value
for debugging, but in production "waiting..." and "aborting the wait..."
might not be as helpful; maybe change them to DEBUG?

2023-11-30 05:13:49.811 EST [6115] LOG:  waiting for remote slot
"sub1" LSN (0/3047A90) and catalog xmin (745) to pass local slot LSN
(0/3047AC8) and catalog xmin (745)
2023-11-30 05:13:57.909 EST [6115] LOG:  aborting the wait for remote
slot "sub1" and moving to the next slot, will attempt creating it
again
2023-11-30 05:14:07.921 EST [6115] LOG:  waiting for remote slot
"sub1" LSN (0/3047A90) and catalog xmin (745) to pass local slot LSN
(0/3047AC8) and catalog xmin (745)


2. If a slot on the standby is in the "i" state as it hasn't been
synced and it was invalidated on the primary, should you continuously
retry creating this invalidated slot on the standby?

2023-11-30 06:21:41.844 EST [10563] LOG:  waiting for remote slot
"sub1" LSN (0/0) and catalog xmin (785) to pass local slot LSN
(0/EED9330) and catalog xmin (785)
2023-11-30 06:21:41.845 EST [10563] WARNING:  slot "sub1" invalidated
on the primary server, slot creation aborted
2023-11-30 06:21:51.892 EST [10563] LOG:  waiting for remote slot
"sub1" LSN (0/0) and catalog xmin (785) to pass local slot LSN
(0/EED9330) and catalog xmin (785)
2023-11-30 06:21:51.893 EST [10563] WARNING:  slot "sub1" invalidated
on the primary server, slot creation aborted

3. If creation of a slot on the standby fails for one slot because a
slot of the same name exists, then thereafter no new sync slots are
created on the standby. Is this expected? I do see that previously
created slots are kept up to date, just that no new slots are created
after that.

regards,
Ajin Cherian
Fujitsu australia




Re: PATCH: Add REINDEX tag to event triggers

2023-11-27 Thread Ajin Cherian
On Mon, Nov 27, 2023 at 11:00 AM jian he  wrote:
>
> On Fri, Nov 24, 2023 at 10:44 AM Michael Paquier  wrote:
> hi.
> v5-0001. changed the REINDEX command tag from event_trigger_ok: false
> to event_trigger_ok: true.
> Move ReindexStmt moves from standard_ProcessUtility to ProcessUtilitySlow.
> By default ProcessUtilitySlow will call trigger related functions.
> So the event trigger facility can support reindex statements.
>
> v5-0002. In GetCommandLogLevel, change T_ReindexStmt from lev =
> LOGSTMT_ALL to lev = LOGSTMT_DDL. so log_statement (enum) >= 'ddl'
> will make the reindex statement be logged.
>
> v5-0003. Refactor the following functions: {ReindexIndex,
> ReindexTable, ReindexMultipleTables,
> ReindexPartitions,ReindexMultipleInternal
> ,ReindexRelationConcurrently, reindex_relation,reindex_index} by
> adding `const ReindexStmt *stmt` as their first argument.
> This is for event trigger support reindex. We need to pass both the
> newly generated indexId and the ReindexStmt to
> EventTriggerCollectSimpleCommand right after the newly index gets
> their lock. To do that, we have to refactor related functions.
>
> v5-0004. Event trigger support reindex command implementation,
> documentation, regress test, helper function pass reindex info to
> function EventTriggerCollectSimpleCommand.

I just started reviewing the patch. Some minor comments:
In patch 0001:
In standard_ProcessUtility(), since you are unconditionally calling
ProcessUtilitySlow() in the case of T_ReindexStmt, you don't really need
the case statement for T_ReindexStmt, just like all the other commands
that have event trigger support; it will fall through to
ProcessUtilitySlow() by default.

In patch 0004:
No need to duplicate reindex_event_trigger_collect in indexcmds.c
since it is already present in index.c. Declare it extern in index.h so
that it can be accessed from both index.c and indexcmds.c.

regards,
Ajin Cherian
Fujitsu Australia




Re: Synchronizing slots from primary to standby

2023-11-23 Thread Ajin Cherian
On Tue, Nov 21, 2023 at 8:32 PM shveta malik  wrote:
>
> v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd,
> rebased the patches.  PFA v37_2 patches.

Thanks for the patch. Some comments:
subscriptioncmds.c:
CreateSubscription()
and tablesync.c:
process_syncing_tables_for_apply()
 walrcv_create_slot(wrconn, opts.slot_name, false,
twophase_enabled,
-   CRS_NOEXPORT_SNAPSHOT, NULL);
-
-if (twophase_enabled)
-UpdateTwoPhaseState(subid,
LOGICALREP_TWOPHASE_STATE_ENABLED);
-
+   failover_enabled,
CRS_NOEXPORT_SNAPSHOT, NULL);

Either here or in libpqrcv_create_slot(), shouldn't you check whether the
remote server version supports the failover flag?


+
+/*
+ * If only the slot_name is specified, it is possible
that the user intends to
+ * use an existing slot on the publisher, so here we
enable failover for the
+ * slot if requested.
+ */
+else if (opts.slot_name && failover_enabled)
+{
+walrcv_alter_slot(wrconn, opts.slot_name, opts.failover);
+ereport(NOTICE,
+(errmsg("enabled failover for replication
slot \"%s\" on publisher",
+opts.slot_name)));
+}

Here, the code only alters the slot if failover = true. You could use
"else if (opts.slot_name && IsSet(opts.specified_opts, SUBOPT_FAILOVER))"
to check whether the failover flag was specified, and alter for
failover = false as well. Also, shouldn't you check whether the server
supports the ALTER_REPLICATION_SLOT command?
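For the version check, something like the following inside
libpqrcv_alter_slot() might work (a sketch; 170000 is a placeholder for
whichever release ends up adding the command):

if (PQserverVersion(conn->streamConn) < 170000)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("ALTER_REPLICATION_SLOT is not supported by the remote server")));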

slot.c:
ReplicationSlotAlter()

+void
+ReplicationSlotAlter(const char *name, bool failover)
+{
+Assert(MyReplicationSlot == NULL);
+
+ReplicationSlotAcquire(name, true);
+
+if (SlotIsPhysical(MyReplicationSlot))
+ereport(ERROR,
+errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("cannot use %s with a physical replication slot",
+   "ALTER_REPLICATION_SLOT"));

Shouldn't you release the slot by calling ReplicationSlotRelease()
before erroring out?

slot.c:
+/*
+ * A helper function to validate slots specified in standby_slot_names GUCs.
+ */
+static bool
+validate_standby_slots(char **newval)
+{
+char   *rawname;
+List   *elemlist;
+ListCell   *lc;
+
+/* Need a modifiable copy of string */
+rawname = pstrdup(*newval);

rawname is not always freed.

launcher.c:

+SlotSyncWorker->hdr.proc = MyProc;
+
+before_shmem_exit(slotsync_worker_detach, (Datum) 0);
+
+LWLockRelease(SlotSyncWorkerLock);
+}

before_shmem_exit() can error out, leaving the lock acquired. Maybe you
should release the lock prior to calling before_shmem_exit(), since you
don't need the lock there.
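That is, something like this (a sketch using the quoted names):

SlotSyncWorker->hdr.proc = MyProc;

/* Release first: before_shmem_exit() can ERROR out */
LWLockRelease(SlotSyncWorkerLock);

before_shmem_exit(slotsync_worker_detach, (Datum) 0);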

regards,
Ajin Cherian
Fujitsu Australia




Re: Synchronizing slots from primary to standby

2023-11-16 Thread Ajin Cherian
On Tue, Nov 14, 2023 at 12:57 AM Zhijie Hou (Fujitsu)
 wrote:
>
> 2) Raise error if user slot with same name already exists on standby.

"ERROR:  not synchronizing slot test; it is a user created slot"
I just tested this using v35, and to me the error message when this
happens is not very good. It neither sounds like an error, nor is it
clear what the underlying problem is or how to correct it.

regards,
Ajin Cherian
Fujitsu Australia




Re: Synchronizing slots from primary to standby

2023-11-13 Thread Ajin Cherian
On Mon, Nov 13, 2023 at 5:38 PM shveta malik  wrote:
> Okay. Thanks for testing Ajin. I think it needs a fix wherein we set
> the local-slot's invalidation status (provided it is not invalidated
> already) from the remote slot before this check itself. And if the
> slot is invalidated locally (either by itself) or by primary_slot
> being invalidated, then we should skip the sync. I will fix this in
> the next version.

Yes, that works.
Another bug I see in my testing is that
pg_get_slot_invalidation_cause() does not release the lock if it finds
the slot it is searching for.
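A sketch of the fix (the loop shape is assumed from similar slot-search
code, and 'name'/'cause' stand for the function's input and result; the
point is the single LWLockRelease() reached on all paths):

LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
for (int i = 0; i < max_replication_slots; i++)
{
    ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

    if (s->in_use && strcmp(NameStr(s->data.name), name) == 0)
    {
        cause = s->data.invalidated;
        break;                  /* previously returned here without releasing */
    }
}
LWLockRelease(ReplicationSlotControlLock);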

regards,
Ajin Cherian
Fujitsu Australia




Re: Synchronizing slots from primary to standby

2023-11-12 Thread Ajin Cherian
On Thu, Nov 9, 2023 at 9:54 PM shveta malik  wrote:
>
> PFA v32 patches which has below changes:
Testing with this patch, I see that if a failover-enabled slot is
invalidated on the primary, then the corresponding synced slot is not
invalidated on the standby. Instead, I see that it continuously gets
the below error:
" WARNING:  not synchronizing slot sub; synchronization would move it backwards"

In the code, I see that:
if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn)
{
ereport(WARNING,
errmsg("not synchronizing slot %s; synchronization
would move"
   " it backwards", remote_slot->name));

ReplicationSlotRelease();
CommitTransactionCommand();
return;
}

If the restart_lsn of the remote slot is behind, then the
local_slot_update() function is never called to set the invalidation
status on the local slot. And for invalidated slots, restart_lsn is
always NULL.
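A possible shape of the fix (a sketch; the 'invalidated' field on the
patch's remote-slot struct is an assumption, and the release/commit
cleanup done on the other early returns is elided):

/* Propagate invalidation before the backwards-LSN check */
if (remote_slot->invalidated != RS_INVAL_NONE &&
    MyReplicationSlot->data.invalidated == RS_INVAL_NONE)
{
    SpinLockAcquire(&MyReplicationSlot->mutex);
    MyReplicationSlot->data.invalidated = remote_slot->invalidated;
    SpinLockRelease(&MyReplicationSlot->mutex);

    ReplicationSlotMarkDirty();
    ReplicationSlotSave();
}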

regards,
Ajin Cherian
Fujitsu Australia




Re: Synchronizing slots from primary to standby

2023-10-30 Thread Ajin Cherian
On Thu, Oct 26, 2023 at 6:08 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Shveta,
>
> > PFA v25 patch set. The changes are:
>
> Thanks for making the patch! It seems that there are lots of comments, so
> I can put some high-level comments for 0001.
> Sorry if there are duplicated comments.
>
> 1.
> The patch seemed not to consider the case where the failover option on the
> replication slot and on the subscription differ. Currently the slot option
> will be overwritten by the subscription one.
> by subscription one.
>
> Actually, I'm not sure what specification is better. Regarding two_phase,
> 2PC will be decoded only when both of the settings are true. Should we follow?
>

But this is the intention; we want ALTER SUBSCRIPTION to be able to change
the failover behaviour of the slot.

> 2.
> Currently ctx->failover is set only in the pgoutput_startup(), but not sure 
> it is OK.
> Can we change the parameter in CreateDecodingContext() or similar functions?
>
> Because IIUC it means that only slots which have pgoutput can wait. Other
> output plugins must understand the change and set faliover flag as well -
> I felt it is not good. E.g., you might miss to enable the parameter in 
> test_decoding.
>
> Regarding the two_phase parameter, setting on plugin layer is good because it
> quite affects the output. As for the failover, it is not related with the
> content so that all of slots should be enabled.
>
> I think CreateDecodingContext or StartupDecodingContext() is the common path.
> Or, is it the out-of-scope for now?

Currently, the failover field is part of the options list in
StartReplicationCmd. This gives some flexibility: only plugins that are
interested in it need to handle it, since the options list is only deparsed
by plugins. If we move it outside of the options list, that effectively
changes the protocol for START_REPLICATION and will impact all plugins.
But I agree with your larger point that we need to do this in such a way
that other plugins do not unintentionally change the 'failover' behaviour
of the originally created slot.
Maybe I can code it such that the slot's failover flag is changed only if
the failover option is explicitly specified in the list of options passed
as part of START_REPLICATION, by adding another flag "failover_opt_given".
Plugins that set this will be able to change the failover flag of the slot,
while for plugins that do not, the failover flag of the created slot will
remain as it was.
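Roughly like this (a sketch; 'failover_given' is a hypothetical flag set
while parsing the START_REPLICATION option list, and locking/persistence
of the slot data is elided):

/* In StartLogicalReplication(), before creating the decoding context */
if (failover_given)
    MyReplicationSlot->data.failover = failover;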
What do you think?

regards,
Ajin Cherian
Fujitsu Australia




Re: Synchronizing slots from primary to standby

2023-10-23 Thread Ajin Cherian
On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand
 wrote:
>
> Hi,
>
> On 10/20/23 5:27 AM, shveta malik wrote:
> > On Wed, Oct 18, 2023 at 4:24 PM Amit Kapila  wrote:
> >
> > PFA v25 patch set. The changes are:
> >
> > 1) 'enable_failover' is changed to 'failover'
> > 2) Alter subscription changes to support 'failover'
> > 3) Fixes a bug in patch001 wherein any change in standby_slot_names
> > was not considered in the flow where logical walsenders wait for
> > standby's confirmation. Now during the wait, if standby_slot_names is
> > changed, wait is restarted using new standby_slot_names.
> > 4) Addresses comments by Bertrand and Amit in [1],[2],[3]
> >
> > The changes are mostly in patch001 and a very few in patch002.
> >
> > Thank You Ajin for working on alter-subscription changes and adding
> > more TAP-tests for 'failover'
> >
>
> Thanks for updating the patch!
>
> Looking at 0001 and doing some experiment:
>
> Creating a logical slot with failover = true and then launching
> pg_logical_slot_get_changes() or pg_recvlogical() on it results
> to setting failover back to false.
>
> It occurs while creating the decoding context here:
>
> @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>  SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
>  }
>
> +   /* set failover in the slot, as requested */
> +   slot->data.failover = ctx->failover;
> +
>
> I think we can get rid of this change in CreateDecodingContext().
>
Yes, I too noticed this in my testing, however just removing this from
CreateDecodingContext will not allow us to change the slot's failover flag
using Alter subscription. Currently alter subscription re-establishes
the connection
using START REPLICATION and failover is one of the options passed in along with
START REPLICATION. I am thinking of moving this change to
StartLogicalReplication prior to calling CreateDecodingContext by
parsing the command options in StartReplicationCmd
without adding it to the LogicalDecodingContext.

regards,
Ajin Cherian
Fujitsu Australia




Re: Synchronizing slots from primary to standby

2023-10-15 Thread Ajin Cherian
mention the cleanup.

> ~~~
>
> 22. slotsync_remove_obsolete_dbs
>
> This function says:
> +/*
> + * Slot-sync workers remove obsolete DBs from db-list
> + *
> + * If the DBIds fetched from the primary are lesser than the ones being 
> managed
> + * by slot-sync workers, remove extra dbs from worker's db-list. This
> may happen
> + * if some slots are removed on primary but 'synchronize_slot_names' has not
> + * been changed yet.
> + */
> +static void
> +slotsync_remove_obsolete_dbs(List *remote_dbs)
>
> But, there was another similar logic function too:
>
> +/*
> + * Drop obsolete slots
> + *
> + * Drop the slots which no longer need to be synced i.e. these either
> + * do not exist on primary or are no longer part of synchronize_slot_names.
> + *
> + * Also drop the slots which are valid on primary and got invalidated
> + * on standby due to conflict (say required rows removed on primary).
> + * The assumption is, these will get recreated in next sync-cycle and
> + * it is okay to drop and recreate such slots as long as these are not
> + * consumable on standby (which is the case currently).
> + */
> +static void
> +drop_obsolete_slots(Oid *dbids, List *remote_slot_list)
>
> Those function header comments suggest these have a lot of overlapping
> functionality.
>
> Can't those 2 functions be combined? Or maybe one delegate to the other?
>
==
One is called by the launcher, and the other is called by the slotsync
worker. While one prunes the list of dbs that needs to be passed to each
slot-sync worker, the other prunes the list of slots each slot-sync worker
handles in its dblist. They are different.

> ~~~
>
> 23.
> + ListCell   *lc;
> + Oid*dbids;
> + int widx;
> + int dbidx;
> + int i;
>
> Scope of some of these variable declarations can be different so they
> are declared closer to where they are used.
>
==
fixed

> ~~~
>
> 24.
> + /* If not found, then delete this db from worker's db-list */
> + if (!found)
> + {
> + for (i = dbidx; i < worker->dbcount; i++)
> + {
> + /* Shift the DBs and get rid of wdbid */
> + if (i < (worker->dbcount - 1))
> + dbids[i] = dbids[i + 1];
> + }
>
> IIUC, that shift/loop could just have been a memmove() call to remove
> one Oid element.
>
==
fixed

> ~~~
>
> 25.
> + /* If dbcount for any worker has become 0, shut it down */
> + for (widx = 0; widx < max_slotsync_workers; widx++)
> + {
> + SlotSyncWorker *worker = >ss_workers[widx];
> +
> + if (worker->hdr.in_use && !worker->dbcount)
> + slotsync_worker_stop_internal(worker);
> + }
>
> Is it safe to stop this unguarded by SlotSyncWorkerLock locking? Is
> there a window where another dbid decides to reuse this worker at the
> same time this process is about to stop it?
>
==
Only the launcher can do this, and there is only one launcher.

> ~~~
>
> 26. primary_connect
>
> +/*
> + * Connect to primary server for slotsync purpose and return the connection
> + * info. Disconnect previous connection if provided in wrconn_prev.
> + */
>
> /primary server/the primary server/
>
==
fixed

> ~~~
>
> 27.
> + if (!RecoveryInProgress())
> + return NULL;
> +
> + if (max_slotsync_workers == 0)
> + return NULL;
> +
> + if (strcmp(synchronize_slot_names, "") == 0)
> + return NULL;
> +
> + /* The primary_slot_name is not set */
> + if (!WalRcv || WalRcv->slotname[0] == '\0')
> + {
> + ereport(WARNING,
> + errmsg("Skipping slots synchronization as primary_slot_name "
> +"is not set."));
> + return NULL;
> + }
> +
> + /* The hot_standby_feedback must be ON for slot-sync to work */
> + if (!hot_standby_feedback)
> + {
> + ereport(WARNING,
> + errmsg("Skipping slots synchronization as hot_standby_feedback "
> +"is off."));
> + return NULL;
> + }
>
> How come some of these checks giving WARNING that slot synchronization
> will be skipped, but others are just silently returning NULL?
>
==
primary_slot_name and hot_standby_feedback are not GUCs exclusive to slot
synchronization; they already existed, so we only warn for them. The others
are specific to slot synchronization, so if users set them (which shows that
the user intends to use sync-slot), we warn to let the user know that these
others also need to be set.

> ~~~
>
> 28. SaveCurrentSlotSyncConfigs
>
> +static void
> +SaveCurrentSlotSyncConfigs()
> +{
> + PrimaryConnInfoPreReload = pstrdup(PrimaryConnInfo);
> + PrimarySlotNamePreReload = pstrdup(WalRcv->slotname);
> + SyncSlotNamesPreReload = pstrdup(synchronize_slot_names);
> +}
>
> Shouldn't this code also do pfree first? Otherwise these will slowly
> leak every time this function is called, right?
>
==
fixed

> ~~~
>
> 29. SlotSyncConfigsChanged
>
> +static bool
> +SlotSyncConfigsChanged()
> +{
> + if (strcmp(PrimaryConnInfoPreReload, PrimaryConnInfo) != 0)
> + return true;
> +
> + if (strcmp(PrimarySlotNamePreReload, WalRcv->slotname) != 0)
> + return true;
> +
> + if (strcmp(SyncSlotNamesPreReload, synchronize_slot_names) != 0)
> + return true;
>
> I felt those can all be combined to have 1 return instead of 3.
>
==
fixed

> ~~~
>
> 30.
> + /*
> + * If we have reached this stage, it means original value of
> + * hot_standby_feedback was 'true', so consider it changed if 'false' now.
> + */
> + if (!hot_standby_feedback)
> + return true;
>
> "If we have reached this stage" seems a bit vague. Can this have some
> more explanation? And, maybe also an Assert(hot_standby_feedback); is
> helpful in the calling code (before the config is reloaded)?
>
==
rewrote this without that comment.

regards,
Ajin Cherian
Fujitsu Australia




Re: Synchronizing slots from primary to standby

2023-08-15 Thread Ajin Cherian
On Mon, Aug 14, 2023 at 8:38 PM shveta malik  wrote:
>
> On Mon, Aug 14, 2023 at 3:22 PM shveta malik  wrote:
> >
> > On Tue, Aug 8, 2023 at 11:11 AM Drouvot, Bertrand
> >  wrote:
> > >
> > > Hi,
> > >
> > > On 8/8/23 7:01 AM, shveta malik wrote:
> > > > On Mon, Aug 7, 2023 at 3:17 PM Drouvot, Bertrand
> > > >  wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> On 8/4/23 1:32 PM, shveta malik wrote:
> > > >>> On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand
> > > >>>  wrote:
> > > >>>> On 7/28/23 4:39 PM, Bharath Rupireddy wrote:
> > > >>
> > > >
> > > > Agreed. That is why in v10,v11 patches, we have different infra for
> > > > sync-slot worker i.e. it is not relying on "logical replication
> > > > background worker" anymore.
> > >
> > > yeah saw that, looks like the right way to go to me.
> > >
> > > >> Maybe we should start some tests/benchmark with only one sync worker 
> > > >> to get numbers
> > > >> and start from there?
> > > >
> > > > Yes, we can do that performance testing to figure out the difference
> > > > between the two modes. I will try to get some statistics on this.
> > > >
> > >
> > > Great, thanks!
> > >
> >
> > We (myself and Ajin) performed the tests to compute the lag in standby
> > slots as compared to primary slots with different number of slot-sync
> > workers configured.
> >
> > 3 DBs were created, each with 30 tables and each table having one
> > logical-pub/sub configured. So this made a total of 90 logical
> > replication slots to be synced. Then the workload was run for aprox 10
> > mins. During this workload, at regular intervals, primary and standby
> > slots' lsns were captured (from pg_replication_slots) and compared. At
> > each capture, the intent was to know how much is each standby's slot
> > lagging behind corresponding primary's slot by taking the distance
> > between confirmed_flush_lsn of primary and standby slot. Then we took
> > the average (integer value) of this distance over the span of 10 min
> > workload and this is what we got:
> >
>
> I have attached the scripts for schema-setup, running workload and
> capturing lag. Please go through Readme for details.
>
>
I did some more tests with 10, 20 and 40 slots to calculate the average LSN
distance between primary and standby slots, comparing 1 worker and 3 workers.

My results are as follows (average LSN distance between the primary and
standby slot, per slot):

Slots   1 worker        3 workers
10      5529.75527426   2224.57589134
20      9592.87234043   3194.6293
40      20566.093       7885.80952381
90      36706.8405797   10236.6393162

regards,
Ajin Cherian
Fujitsu Australia


Re: Support logical replication of DDLs

2023-06-09 Thread Ajin Cherian
> + memcpy(&change->data.ddl.cmdtype, data, sizeof(DeparsedCommandType));
> + data += sizeof(int);
> + change->data.ddl.prefix = MemoryContextAlloc(rb->context, prefix_size);
> + memcpy(change->data.ddl.prefix, data, prefix_size);
> + Assert(change->data.ddl.prefix[prefix_size - 1] == '\0');
> + data += prefix_size;
>
> I had suggested before ([3] #23) that it would be better to use:
> data += sizeof(DeparsedCommandType);
>
> instead of:
> data += sizeof(int);
>
> You already changed this OK in another place but this instance got
> accidentally missed.
>

fixed.

> ==
> src/backend/replication/logical/worker.c
>
> 23. preprocess_create_table
>
> + if (castmt->objtype == OBJECT_TABLE)
> + {
> + /*
> + * Force skipping data population to avoid data
> + * inconsistency. Data should be replicated from the
> + * publisher instead.
> + */
> + castmt->into->skipData = true;
> + }
>
> I had suggested before ([4] #16b) that the "Force skipping" comments
> are not necessary because the function header comment already says the
> same thing. One of the "Force skipping" comments was removed OK, but
> there is still one more remaining that should be removed.
>

fixed.

> ~~~
>
> 24. postprocess_ddl_create_table
>
> + commandTag = CreateCommandTag((Node *) command);
> + cstmt = (CreateStmt *) command->stmt;
> + rv = cstmt->relation;
> +
> + if (commandTag != CMDTAG_CREATE_TABLE)
> + return;
> +
> + cstmt = (CreateStmt *) command->stmt;
> + rv = cstmt->relation;
> + if (!rv)
> + return;
>
> This code is still flawed as previously described (see [4]#18). There
> are duplicate assignments of 'cstmt' and 'rv'.
>

fixed.

> ~~~
>
> 25. apply_handle_ddl
>
> +/*
> + * Handle DDL replication messages. Convert the json string into a query
> + * string and run it through the query portal.
> + */
> +static void
> +apply_handle_ddl(StringInfo s)
>
> IMO for consistency this should use the same style as the other
> function comments. So after the first sentence, put a more detailed
> description after a blank line.
>

fixed.

> ~~~
>
> 26. apply_handle_ddl
>
> I previously ([4]#21) asked a questio:
> There seems to be an assumption here that the only kind of command
> processed here would be TABLE related. Maybe that is currently true,
> but shouldn't there be some error checking just to make sure it cannot
> execute unexpected commands?
>
> ~
>
> IMO this question remains relevant -- I think this ddl code needs some
> kind of additional guards/checks in it otherwise it will attempt to
> deparse commands that it does not understand (e.g. imagine a later
> version publisher which supports more DDL than the subscriber does).
>

Currently, according to the design, there is no distinction between what the
publisher supports and what the subscriber supports. Only the publication
decides what is replicated and what is not; the subscription has no control.

> ==
> src/backend/replication/pgoutput/pgoutput.c
>
> 27. PGOutputTxnData
>
>  typedef struct PGOutputTxnData
>  {
>   bool sent_begin_txn; /* flag indicating whether BEGIN has been sent */
> + List*deleted_relids; /* maintain list of deleted table oids */
>  } PGOutputTxnData;
>
> Actually, from my previous review (see [4]#22) I meant for this to be
> a more detailed *structure* level comment to say why this is necessary
> even to have this member; not just a basic field comment like what has
> been added.
>

fixed.

> ~~~
>
> 28. is_object_published
>
> + /*
> + * Only send this ddl if we don't publish ddl message or the ddl
> + * need to be published via its root relation.
> + */
> + if (relentry->pubactions.pubddl_table &&
> + relentry->publish_as_relid == objid)
> + return true;
>
> The comment seems wrong/confused – "Only send this ddl if we don't
> publish ddl message" (??)
>

fixed.

> ==
> src/bin/pg_dump/pg_dump.c
>
> 29. getEventTriggers
>
> + /* skip internally created event triggers by checking evtisinternal */
>   appendPQExpBufferStr(query,
>   "SELECT e.tableoid, e.oid, evtname, evtenabled, "
>   "evtevent, evtowner, "
>
> Uppercase the comment.
>

fixed.

> ==
> src/include/catalog/pg_event_trigger.h
>
> 33.
> @@ -36,7 +36,7 @@ CATALOG(pg_event_trigger,3466,EventTriggerRelationId)
>   * called */
>   char evtenabled; /* trigger's firing configuration WRT
>   * session_replication_role */
> -
> + bool evtisinternal; /* trigger is system-generated */
>  #ifdef CATALOG_VARLEN
>   text evttags[1]; /* command TAGs this event trigger targets */
>  #endif
>
> ~
>
> This change should not remove the blank line that previously existed
> before the #ifdef CATALOG_VARLEN.
>

fixed.

> ==
> src/include/catalog/pg_publication.
>
> 34.
> +/* Publication trigger events */
> +#define PUB_TRIG_DDL_CMD_START "ddl_command_start"
> +#define PUB_TRIG_DDL_CMD_END "ddl_command_end"
> +#define PUB_TRIG_TBL_REWRITE "table_rewrite"
> +#define PUB_TRIG_TBL_INIT_WRITE "table_init_write"
>
> Elsewhere in PG15 code there are already hardcoded literal strings for
> these triggers, so I am wondering if these constants should really be
> defined in some common place where everybody can make use of them
> instead of having a mixture of string literals and macros for the same
> strings.
>

fixed.

> ==
> src/include/commands/event_trigger.h
>
> 35.
> -extern Oid CreateEventTrigger(CreateEventTrigStmt *stmt);
> +extern Oid CreateEventTrigger(CreateEventTrigStmt *stmt, bool isinternal);
>  extern Oid get_event_trigger_oid(const char *trigname, bool missing_ok);
>
> IMO a better name is 'is_internal' (Using a snake-case name matches
> like the other 'missing_ok')
>

fixed.

> ==
> src/include/replication/ddlmessage.h
>
> 36.
> + * Copyright (c) 2022, PostgreSQL Global Development Group
>
> Copyright for the new file should be 2023?
>

fixed.

> ==
> src/include/tcop/ddldeparse.h
>
> 37.
>  * ddldeparse.h
>  *
>  * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
>  * Portions Copyright (c) 1994, Regents of the University of California
>  *
>  * src/include/tcop/ddldeparse.h
>
> ~
>
> I think this is a new file for the feature so why is the copyright
> talking about old dates like 1994,1996 etc?
>

fixed.

regards,
Ajin Cherian




Re: running logical replication as the subscription owner

2023-05-23 Thread Ajin Cherian
On Mon, May 22, 2023 at 10:36 PM Masahiko Sawada  wrote:
>
> On Wed, May 17, 2023 at 10:10 AM Ajin Cherian  wrote:
> >
> > On Mon, May 15, 2023 at 10:47 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, May 15, 2023 at 5:44 PM Ajin Cherian  wrote:
> > > >
> > > > On Fri, May 12, 2023 at 9:55 PM Ajin Cherian  wrote:
> > > > >
> > > > > If nobody else is working on this, I can come up with a patch to fix 
> > > > > this
> > > > >
> > > >
> > > > Attaching a patch which attempts to fix this.
> > > >
> > >
> > > Thank you for the patch! I think we might want to have tests for it.
> > >
> > I have updated the patch with a test case as well.
>
> Thank you for updating the patch! Here are review comments:
>
> +   /*
> +* Make sure that the copy command runs as the table owner, unless
> +* the user has opted out of that behaviour.
> +*/
> +   run_as_owner = MySubscription->runasowner;
> +   if (!run_as_owner)
> +   SwitchToUntrustedUser(rel->rd_rel->relowner, &ucxt);
> +
> /* Now do the initial data copy */
> PushActiveSnapshot(GetTransactionSnapshot());
>
> I think we should switch users before the acl check in
> LogicalRepSyncTableStart().
>
> ---
> +# Create a trigger on table alice.unpartitioned that writes
> +# to a table that regress_alice does not have permission.
> +$node_subscriber->safe_psql(
> +   'postgres', qq(
> +SET SESSION AUTHORIZATION regress_alice;
> +CREATE OR REPLACE FUNCTION alice.alice_audit()
> +RETURNS trigger AS
> +\$\$
> +   BEGIN
> +   insert into public.admin_audit values(2);
> +   RETURN NEW;
> +   END;
> +\$\$
> +LANGUAGE 'plpgsql';
> +CREATE TRIGGER ALICE_TRIGGER AFTER INSERT ON alice.unpartitioned FOR EACH ROW
> +EXECUTE PROCEDURE alice.alice_audit();
> +ALTER TABLE alice.unpartitioned ENABLE ALWAYS TRIGGER ALICE_TRIGGER;
> +));
>
> While this approach works, I'm not sure we really need a trigger for
> this test. I've attached a patch for discussion that doesn't use
> triggers for the regression tests. We create a new subscription owned
> by a user who doesn't have the permission to the target table. The
> test passes only if run_as_owner = true works.
>
This is better, thanks. Since you are testing the run_as_owner = false
behaviour during the table copy phase, you might as well add a test case
verifying that it behaves correctly during insert replication as well.

regards,
Ajin Cherian
Fujitsu Australia




Re: running logical replication as the subscription owner

2023-05-16 Thread Ajin Cherian
On Mon, May 15, 2023 at 10:47 PM Masahiko Sawada  wrote:
>
> On Mon, May 15, 2023 at 5:44 PM Ajin Cherian  wrote:
> >
> > On Fri, May 12, 2023 at 9:55 PM Ajin Cherian  wrote:
> > >
> > > If nobody else is working on this, I can come up with a patch to fix this
> > >
> >
> > Attaching a patch which attempts to fix this.
> >
>
> Thank you for the patch! I think we might want to have tests for it.
>
I have updated the patch with a test case as well.

regards,
Ajin Cherian
Fujitsu Australia


v2-0001-Fix-issue-where-the-copy-command-does-not-adhere-.patch
Description: Binary data


Re: running logical replication as the subscription owner

2023-05-15 Thread Ajin Cherian
On Fri, May 12, 2023 at 9:55 PM Ajin Cherian  wrote:
>
> If nobody else is working on this, I can come up with a patch to fix this
>

Attaching a patch which attempts to fix this.

regards,
Ajin Cherian
Fujitsu Australia


v1-0001-Fix-issue-where-the-copy-command-does-not-adhere-.patch
Description: Binary data


Re: running logical replication as the subscription owner

2023-05-12 Thread Ajin Cherian
On Fri, May 12, 2023 at 1:49 PM Amit Kapila  wrote:
>
> On Fri, May 12, 2023 at 9:10 AM Masahiko Sawada  wrote:
> >
> > On Fri, May 12, 2023 at 1:12 AM Robert Haas  wrote:
> > >
> > > On Thu, May 11, 2023 at 7:38 AM Amit Kapila  
> > > wrote:
> > > > Do we want the initial sync to also respect 'run_as_owner' option? I
> > > > might be missing something but I don't see anything in the docs about
> > > > initial sync interaction with this option. In the commit a2ab9c06ea,
> > > > we did the permission checking during the initial sync so I thought we
> > > > should do it here as well.
> > >
> > > It definitely should work that way. lf it doesn't, that's a bug.
> >
> > After some tests, it seems that the initial sync worker respects
> > 'run_as_owner' during catching up but not during COPYing.
> >
>
> Yeah, I was worried during copy phase only. During catchup, the code
> is common with apply worker code, so it will work.
>

I tried the following test:


Repeat On the publisher and subscriber:
 /* Create role regress_alice with  NOSUPERUSER on
   publisher and subscriber and a table for replication */

CREATE ROLE regress_alice NOSUPERUSER LOGIN;
CREATE ROLE regress_admin SUPERUSER LOGIN;
GRANT CREATE ON DATABASE postgres TO regress_alice;
SET SESSION AUTHORIZATION regress_alice;
CREATE SCHEMA alice;
GRANT USAGE ON SCHEMA alice TO regress_admin;
CREATE TABLE alice.test (i INTEGER);
ALTER TABLE alice.test REPLICA IDENTITY FULL;

On the publisher:
postgres=> insert into alice.test values(1);
postgres=> insert into alice.test values(2);
postgres=> insert into alice.test values(3);
postgres=> CREATE PUBLICATION alice FOR TABLE alice.test
WITH (publish_via_partition_root = true);

On the subscriber: /* create table admin_audit which regress_alice
does not have access to */
SET SESSION AUTHORIZATION regress_admin;
create table admin_audit (i integer);

On the subscriber: /* Create a trigger for table alice.test which
inserts on table admin_audit which the table owner of alice.test does
not have access to */
SET SESSION AUTHORIZATION regress_alice;
CREATE OR REPLACE FUNCTION alice.alice_audit()
RETURNS trigger AS
$$
BEGIN
insert into public.admin_audit values(2);
RETURN NEW;
END;
$$
LANGUAGE 'plpgsql';
create trigger test_alice after insert on alice.test for each row
execute procedure alice.alice_audit();
alter table alice.test enable always trigger test_alice;

On the subscriber: /* Create a subscription with run_as_owner = false */
CREATE SUBSCRIPTION admin_sub CONNECTION 'dbname=postgres
host=localhost port=6972' PUBLICATION alice WITH (run_as_owner =
false);
===

What I see is that, as part of tablesync, the trigger fires and updates
admin_audit, which it shouldn't, as the table owner of alice.test should not
have access to the table admin_audit. This means the table copy is being
invoked as the subscription owner and not the table owner.

However, I see that subsequent inserts fail on replication with a permission
denied error, so the apply worker correctly applies the inserts as the table
owner.

If nobody else is working on this, I can come up with a patch to fix this
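
For reference, a minimal sketch of the direction such a fix could take is
below. SwitchToUntrustedUser()/RestoreUserContext() already exist for the
apply path; wrapping the initial copy in tablesync.c with them is my
assumption of the shape of the fix, not an actual patch:

UserContext ucxt;
bool        run_as_owner;

/*
 * Make sure the initial COPY also runs as the table owner, unless the
 * user has opted out via the subscription's run_as_owner option.
 */
run_as_owner = MySubscription->runasowner;
if (!run_as_owner)
    SwitchToUntrustedUser(rel->rd_rel->relowner, &ucxt);

/* ... perform the initial data copy here ... */

if (!run_as_owner)
    RestoreUserContext(&ucxt);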

regards,
Ajin Cherian
Fujitsu Australia




Re: Support logical replication of DDLs

2023-01-14 Thread Ajin Cherian
On Fri, Jan 13, 2023 at 5:33 PM vignesh C  wrote:
> Adding support for CREATE/ALTER/DROP Publication ddl deparsing.
> The attached v61 patch has the changes for the same.
>

Hi Vignesh,
this doesn't seem to compile:

gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -g -ggdb -Og -g3
-fno-omit-frame-pointer -I../../../src/include  -D_GNU_SOURCE   -c -o
ddl_deparse.o ddl_deparse.c
ddl_deparse.c: In function ‘deparse_PublicationObjects’:
ddl_deparse.c:8956:3: error: unknown type name ‘publication_rel’
   publication_rel *pub_rel = (publication_rel *) lfirst(lc1);
   ^
ddl_deparse.c:8956:31: error: ‘publication_rel’ undeclared (first use
in this function)
   publication_rel *pub_rel = (publication_rel *) lfirst(lc1);
   ^
ddl_deparse.c:8956:31: note: each undeclared identifier is reported
only once for each function it appears in
ddl_deparse.c:8956:48: error: expected expression before ‘)’ token
   publication_rel *pub_rel = (publication_rel *) lfirst(lc1);

regards,
Ajin Cherian
Fujitsu Australia




Re: Support logical replication of DDLs

2022-12-12 Thread Ajin Cherian
> 20.
>
>  OBJS = \
>   decode.o \
> + ddlmessage.o\
>   launcher.o \
> Change should be in alphabetical order.
>

Fixed.

> ==
>
> src/backend/replication/logical/ddlmessage.c
>
> 21. File Comment
>
> + * Unlike generic logical messages, these DDL messages have only 
> transactional
> + * mode.Note by default DDLs in PostgreSQL are transactional.
>
> Missing space before "Note"
>

Fixed.

> ~~~
>
> 22. LogLogicalDDLMessage
>
> + /*
> + * Ensure we have a valid transaction id.
> + */
> + Assert(IsTransactionState());
> + GetCurrentTransactionId();
>
> Single line comment should be OK here
>

Fixed.

> ~
>
> 23.
>
> + /* trailing zero is critical; see logicalddlmsg_desc */
>
> Uppercase comment
>

fixed.

> ~
>
> 24.
>
> + /* allow origin filtering */
>
> Uppercase comment
>

fixed.

> ==
>
> src/backend/replication/logical/proto.c
>
> 25. logicalrep_read_ddlmessage
>
> + uint8 flags;
> + char *msg;
> +
> + //TODO double check when do we need to get TransactionId.
> +
> + flags = pq_getmsgint(in, 1);
> + if (flags != 0)
> + elog(ERROR, "unrecognized flags %u in ddl message", flags);
> + *lsn = pq_getmsgint64(in);
> + *prefix = pq_getmsgstring(in);
> + *sz = pq_getmsgint(in, 4);
> + msg = (char *) pq_getmsgbytes(in, *sz);
> +
> + return msg;
>
> 25a.
> This code will fail if the associated *write* function has sent a xid.
> Maybe additional param is needed to tell it when to read the xid?
>

removed to not send xid, not required.

> ~
>
> 25b.
> Will be tidier to have a blank line after the elog
>

fixed.

> ~~~
>
> 26. logicalrep_write_ddlmessage
>
> + /* transaction ID (if not valid, we're not streaming) */
> + if (TransactionIdIsValid(xid))
> + pq_sendint32(out, xid);
>
> Perhaps this "write" function should *always* write the xid even if it
> is invalid because then the "read" function will know to always read
> it.
>

changed it to never send xid.

> ==
>
> src/backend/replication/logical/reorderbuffer.c
>
> 27. ReorderBufferQueueDDLMessage
>
> + Assert(xid != InvalidTransactionId);
>
> SUGGESTION
> Assert(TransactionIdIsValid(xid));
>

fixed.

> ~~~
>
> 28. ReorderBufferSerializeChange
>
> + data += sizeof(int);
> + memcpy(data, change->data.ddlmsg.prefix,
> +prefix_size);
> + data += prefix_size;
>
> Unnecessary wrapping of memcpy.
>

fixed.

> ~
>
> 29.
>
> + memcpy(data, &change->data.ddlmsg.cmdtype, sizeof(int));
> + data += sizeof(int);
>
> Would that be better to write as:
>
> sizeof(DeparsedCommandType) instead of sizeof(int)
>

fixed.

> ~~~
>
> 30. ReorderBufferChangeSize
>
> + case REORDER_BUFFER_CHANGE_DDLMESSAGE:
> + {
> + Size prefix_size = strlen(change->data.ddlmsg.prefix) + 1;
> +
> + sz += prefix_size + change->data.ddlmsg.message_size +
> + sizeof(Size) + sizeof(Size) + sizeof(Oid) + sizeof(int);
>
> sizeof(DeparsedCommandType) instead of sizeof(int)
>

fixed.

Breaking this into two mails, next set of comments in next mail.

regards,
Ajin Cherian
Fujitsu Australia




Re: Support logical replication of DDLs

2022-11-28 Thread Ajin Cherian
On Tue, Nov 29, 2022 at 4:22 PM rajesh singarapu
 wrote:
>
> Isn't it a good idea to move triggers to CreateReplicationSlot() ? as
> create publication also create replication slot, AFAIK.
>
> thanks
> Rajesh
>
Currently we're trying to get this working using "Create Publication"; maybe
in the future we'll consider adding it as part of the replication slot
parameters.

regards,
Ajin Cherian
Fujitsu Australia




Re: Support logical replication of DDLs

2022-11-28 Thread Ajin Cherian
On Tue, Nov 29, 2022 at 3:39 PM rajesh singarapu
 wrote:
>
> Thanks Ajin for the reply.
>
> We "Create/Install" these trigger function at the time of "Create
> publication", (CreatePublication())
> but If I create a replication slot using something like "select * from
> pg_create_logical_replication_slot('test1', 'test_decoding')"
> we would not install these triggers in the system, so we dont get DDLs
> decoded, right ?
>
> I am a bit new to this postgres, is there anything missing in my 
> understanding ?
>
> thanks
> Raejsh
>

Currently this feature is only supported using "Create publication". We have
not added a slot-level parameter to trigger this.

regards,
Ajin Cherian
Fujitsu Australia




Re: Support logical replication of DDLs

2022-11-28 Thread Ajin Cherian
On Tue, Nov 29, 2022 at 1:29 PM rajesh singarapu
 wrote:
>
> One question,
>
> I understand that we create/enable triggers on create publication command 
> flow.
> I am wondering how this works in case of logical replication using slots.
>
>
> thanks
> Rajesh
>
Rajesh,

The trigger functions, when invoked, write these DDL commands to WAL, and
the logical decoding walsender that is registered for that replication slot
decodes the WAL-logged DDL commands and sends them as logical replication
messages to the subscriber side. The apply worker on the subscriber side
then converts these messages back into actual DDL commands and executes
them. Schematically, the pipeline looks like the sketch below.
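
A schematic of the pipeline, using the function names from the patch set
discussed in this thread (the signatures are simplified assumptions, not the
exact patch code):

/* 1. Publisher: the event trigger deparses the DDL into JSON and
 *    WAL-logs it as a transactional logical DDL message. */
LogLogicalDDLMessage(prefix, relid, cmdtype, json_cmd, size);

/* 2. Walsender: logical decoding picks the message up from WAL and
 *    writes it to the replication stream. */
logicalrep_write_ddlmessage(ctx->out, xid, lsn, prefix, sz, message);

/* 3. Subscriber: the apply worker converts the JSON back into a query
 *    string and executes it. */
apply_handle_ddl(s);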

regards,
Ajin Cherian
Fujitsu Australia




Re: Failed Assert in pgstat_assoc_relation

2022-11-28 Thread Ajin Cherian
On Mon, Nov 28, 2022 at 8:10 PM vignesh C  wrote:
>
> Hi,
>
> While reviewing/testing one of the patches I found the following Assert:
> #0  __pthread_kill_implementation (no_tid=0, signo=6,
> threadid=139624429171648) at ./nptl/pthread_kill.c:44
> #1  __pthread_kill_internal (signo=6, threadid=139624429171648) at
> ./nptl/pthread_kill.c:78
> #2  __GI___pthread_kill (threadid=139624429171648,
> signo=signo@entry=6) at ./nptl/pthread_kill.c:89
> #3  0x7efcda6e3476 in __GI_raise (sig=sig@entry=6) at
> ../sysdeps/posix/raise.c:26
> #4  0x7efcda6c97f3 in __GI_abort () at ./stdlib/abort.c:79
> #5  0x5590bf283139 in ExceptionalCondition
> (conditionName=0x5590bf468170 "rel->pgstat_info->relation == NULL",
> fileName=0x5590bf46812b "pgstat_relation.c", lineNumber=143) at
> assert.c:66
> #6  0x5590bf0ce5f8 in pgstat_assoc_relation (rel=0x7efcce996a48)
> at pgstat_relation.c:143
> #7  0x5590beb83046 in initscan (scan=0x5590bfbf4af8, key=0x0,
> keep_startblock=false) at heapam.c:343
> #8  0x5590beb8466f in heap_beginscan (relation=0x7efcce996a48,
> snapshot=0x5590bfb5a520, nkeys=0, key=0x0, parallel_scan=0x0,
> flags=449) at heapam.c:1223
> #9  0x5590bf02af39 in table_beginscan (rel=0x7efcce996a48,
> snapshot=0x5590bfb5a520, nkeys=0, key=0x0) at
> ../../../src/include/access/tableam.h:891
> #10 0x5590bf02bf8a in DefineQueryRewrite (rulename=0x5590bfb281d0
> "_RETURN", event_relid=16387, event_qual=0x0, event_type=CMD_SELECT,
> is_instead=true, replace=false, action=0x5590bfbf4aa8)
> at rewriteDefine.c:447
> #11 0x5590bf02b5ab in DefineRule (stmt=0x5590bfb285c0,
> queryString=0x5590bfb277a8 "CREATE RULE \"_RETURN\" AS ON SELECT TO t
> DO INSTEAD  SELECT * FROM t1;") at rewriteDefine.c:213
>
> I could reproduce this issue with the following steps:
> create table t1(c int);
> BEGIN;
> CREATE TABLE t (c int);
> SAVEPOINT q;
> CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD  SELECT * FROM t1;
> select * from t;
> ROLLBACK TO q;
> CREATE RULE "_RETURN" AS ON SELECT TO t DO INSTEAD  SELECT * FROM t1;
> ROLLBACK;
>
> Regards,
> Vignesh


I think what is happening here is that the previous relation is not unlinked
when pgstat_init_relation() is called, because the relation is now a view,
and for relations without storage the relation is not unlinked in
pgstat_init_relation():

void
pgstat_init_relation(Relation rel)
{
char        relkind = rel->rd_rel->relkind;

/*
 * We only count stats for relations with storage and partitioned tables
 */
if (!RELKIND_HAS_STORAGE(relkind) && relkind != RELKIND_PARTITIONED_TABLE)
{
rel->pgstat_enabled = false;
rel->pgstat_info = NULL;
return;
}

There is logic in DefineQueryRewrite() which converts a relation to a view
when you create such a rule, as the test case does. So initially the
relation had storage and its pgstat_info was linked; then the table was
converted to a view, but in pgstat_init_relation() the previous relation is
not unlinked, and when the code later tries to link a new relation, the
assert fails because a previous relation is already linked to pgstat_info.

I have made a small patch with a fix, but I am not sure if this is the right
way to fix this. One possible direction is sketched below.
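
Sketch of one possible fix (not necessarily what the attached patch does):
when the relation has no storage, explicitly unlink any stats entry left
over from when it still had storage, instead of only clearing
rel->pgstat_info. pgstat_unlink_relation() already exists in
pgstat_relation.c for the relcache-swap case; whether it is safe to reuse
here is an assumption.

void
pgstat_init_relation(Relation rel)
{
    char        relkind = rel->rd_rel->relkind;

    /* We only count stats for relations with storage and partitioned tables */
    if (!RELKIND_HAS_STORAGE(relkind) && relkind != RELKIND_PARTITIONED_TABLE)
    {
        rel->pgstat_enabled = false;
        pgstat_unlink_relation(rel);    /* detach any stale link */
        return;
    }

    rel->pgstat_enabled = true;
}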

regards,
Ajin Cherian
Fujitsu Australia


pgstat_assoc_fix_for_views.patch
Description: Binary data


Re: Support logical replication of DDLs

2022-10-06 Thread Ajin Cherian
On Fri, Oct 7, 2022 at 8:30 AM Zheng Li  wrote:
>
> > > Some tweaking is made in deparse_drop_command in order to make DROP
> > > TRANSFORM deparsing work. This is because the objidentity captured in
> > > currentEventTriggerState->SQLDropList contains the keyword 'on', for
> > > example "for typename on language lang", but the keyword 'on' is not
> > > needed in the current DROP TRANSFORM syntax. So we need to remove the
> > > 'on' keyword in objidentity. I'm not sure if this is the best way to
> > > handle it, maybe we can consider directly modifying what's captured in
> > > currentEventTriggerState->SQLDropList
> > > so we don't have the "on" keyword to begin with?
> >
> > The exact output format for identity is not set in stone; we should only
> > set it in stone once we have an actual working case for them.  This is
> > the first such use, so it seems OK to make minor modifications (such as
> > removing an undesirable ON) if it's a reasonable change and allows
> > consumer code to be more easily written.
>
> > So, +1 to dropping ON here.  However, if there are further strings that
> > need to be modified, let's see what they are.
>
> Thanks for confirming. Attaching the new patch set that removes the
> undesirable ON from getObjectIdentityParts() for TRANSFORM.
>
Thanks for the new patch-set.
Could you add the changes to patch 1 and patch 2, rather than adding a new
patch? Otherwise, we'll have a separate patch for each command and it will
take twice the work to keep it updated for each new command added.

thanks,
Ajin Cherian
Fujitsu Australia




Re: First draft of the PG 15 release notes

2022-05-12 Thread Ajin Cherian
On Wed, May 11, 2022 at 1:44 AM Bruce Momjian  wrote:
>
> I have completed the first draft of the PG 15 release notes and you can
> see the results here:
>
> https://momjian.us/pgsql_docs/release-15.html
>
> The feature count is similar to recent major releases:
>
> release-10 195
> release-11 185
> release-12 198
> release-13 183
> release-14 229
> --> release-15 186
>
> I assume there will be major adjustments in the next few weeks based on
> feedback.
>

I wonder if this is worth mentioning:

Skip empty transactions for logical replication.
commit d5a9d86d8ffcadc52ff3729cd00fbd83bc38643c

https://github.com/postgres/postgres/commit/d5a9d86d8ffcadc52ff3729cd00fbd83bc38643c

regards,
Ajin Cherian
Fujitsu Australia




Re: Support logical replication of DDLs

2022-05-10 Thread Ajin Cherian
On Fri, May 6, 2022 at 11:24 PM Amit Kapila  wrote:
> As we have hacked CreatePublication function for this POC, the
> regression tests are not passing but we can easily change it so that
> we invoke new functionality with the syntax proposed in this thread or
> with some other syntax and we shall do that in the next patch unless
> this approach is not worth pursuing.
>
> This POC is prepared by Ajin Cherian, Hou-San, and me.
>
> Thoughts?
>
> [1] - 
> https://www.postgresql.org/message-id/20150215044814.GL3391%40alvh.no-ip.org

I have updated Amit's patch by including a publication action "create" when
creating a publication, which is turned off by default. Now the 'make check'
tests pass. I also fixed a problem where table creation failed when the
table has a primary key.

regards,
Ajin Cherian
Fujitsu Australia


v2-0001-Fix-race-in-032_relfilenode_reuse.pl.patch
Description: Binary data


v2-0002-Functions-to-deparse-DDL-commands.patch
Description: Binary data


Re: deparsing utility commands

2022-04-15 Thread Ajin Cherian
On Thu, Apr 14, 2022 at 12:19 AM Peter Eisentraut
 wrote:
> The patch you posted contains neither a detailed commit message nor
> documentation or test changes, so it's impossible to tell what it's
> supposed to do.
>

Sorry, I was only rebasing the patches and have kept the same commit
messages as previously present.

regards,
Ajin Cherian




Re: deparsing utility commands

2022-04-15 Thread Ajin Cherian
On Wed, Apr 13, 2022 at 2:12 PM Shulgin, Oleksandr
 wrote:
>
>
>> You seem to have squashed the patches?  Please keep the split out.
>
>
> Well, if that makes the review process easier :-)
>
> --
> Alex
>

I've rebased patches 1, 2 and 5 (now 1, 2 and 3). Patches 3 and 4 seem to be
related to the testing of the extension and contrib module ddl_deparse,
which is not in this patch-set, so I have left those patches out, as I have
no way of testing the test cases added as part of them.

regards,
Ajin Cherian
Fujitsu Australia


0002-Move-some-ddl-deparsing-code-from-ruleutils-to-exten.patch
Description: Binary data


0003-Add-optional-constraint-name-to-field-IndexStmt.patch
Description: Binary data


0001-ddl_deparse-core-support.patch
Description: Binary data


Re: deparsing utility commands

2022-04-12 Thread Ajin Cherian
On Wed, Apr 13, 2022 at 2:29 PM Ajin Cherian  wrote:
>
>
> This patch-set has not been rebased for some time. I see that there is
> interest in this patch from the logical
> replication of DDL thread [1].
>

Forgot to add the link to the thread.

[1] - 
https://www.postgresql.org/message-id/202203162206.7spggyktx63e@alvherre.pgsql




Re: deparsing utility commands

2022-04-12 Thread Ajin Cherian
On Wed, Apr 13, 2022 at 2:12 PM Shulgin, Oleksandr
 wrote:

>> You seem to have squashed the patches?  Please keep the split out.
>
>
> Well, if that makes the review process easier :-)
>
> --
> Alex
>

This patch-set has not been rebased for some time. I see that there is
interest in this patch from the logical
replication of DDL thread [1].

I will take a stab at rebasing this patch-set; I have already rebased the
first patch and will work on the other patches in the coming days. Please
review and give me feedback.

regards,
Ajin Cherian
Fujitsu Australia


0001-ddl_deparse-core-support.patch
Description: Binary data


Re: Support logical replication of DDLs

2022-04-06 Thread Ajin Cherian
On Wed, Mar 23, 2022 at 4:09 PM Japin Li  wrote:
>
>
> On Tue, 22 Mar 2022 at 04:56, Zheng Li  wrote:
> > Hi Japin,
> >
> >> You should use a different user that has different length from your 
> >> current one.
> >> For example:
> >>
> >> px@localhost$ make check-world
> >
> > This is fixed in the latest commit:
> > https://github.com/zli236/postgres/commits/ddl_replication
> >
>
> Thanks for fixing this.  I rebase the patchset on current master (383f222119)
> and attach here for further review.
>

The patch no longer applies. The patch is a very good attempt, and I would
also like to contribute if required. I have a few comments but will hold
them until a rebased version is available.

regards,
Ajin Cherian
Fujitsu Australia




Re: Logical replication row filtering and TOAST

2022-04-05 Thread Ajin Cherian
On Tue, Apr 5, 2022 at 8:32 PM Amit Kapila  wrote:

>
> How about something like the attached?
>

LGTM.

regards,
Ajin Cherian
Fujitsu Australia




Re: logical replication empty transactions

2022-03-18 Thread Ajin Cherian
On Thu, Mar 17, 2022 at 10:43 PM Amit Kapila  wrote:
>
> Review comments/suggestions:
> =
> 1. Isn't it sufficient to call pgoutput_send_begin from
> maybe_send_schema as that is commonplace for all others and is always
> the first message we send? If so, I think we can remove it from other
> places?

I've done it the other way around: I've removed it from maybe_send_schema,
since pgoutput_send_begin is always called prior to calling
maybe_send_schema.

> 2. Can we write some comments to explain why we don't skip streaming
> or prepared empty transactions and some possible solutions (the
> protocol change and additional subscription parameter as discussed
> [1]) as discussed in this thread pgoutput.c?

I've added comments in the headers of pgoutput_begin_prepare_txn() and
pgoutput_stream_start().

> 3. Can we add a simple test for it in one of the existing test
> files(say in 001_rep_changes.pl)?

added a simple test.

> 4. I think we can drop the skip streaming patch as we can't do that for now.

Dropped.

In addition, I have also added a few more comments explaining why sending
BEGIN is delayed in pgoutput_change until the row filter is checked, and I
also ran pgindent.

regards,
Ajin Cherian
Fujitsu Australia


v26-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: logical replication empty transactions

2022-03-16 Thread Ajin Cherian
On Mon, Mar 7, 2022 at 11:44 PM Ajin Cherian  wrote:
>
> Fixed.
>
> regards,
> Ajin Cherian
> Fujitsu Australia

Rebased the patch and fixed some whitespace errors.
regards,
Ajin Cherian
Fujitsu Australia


v25-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


v25-0002-Skip-empty-streamed-transactions-for-logical-rep.patch
Description: Binary data


Re: Logical replication timeout problem

2022-03-07 Thread Ajin Cherian
On Tue, Mar 8, 2022 at 12:25 PM wangw.f...@fujitsu.com
 wrote:
> Attach the new patch.
> 1. Fix wrong variable setting and skip unnecessary time records.[suggestion 
> by Kuroda-San and me.]
> 2. Introduce version information.[suggestion by Peter, Kuroda-San]
>
> Regards,
> Wang wei

Some comments.

1. The comment  on top of SendKeepaliveIfNecessary

 Try to send a keepalive message if too many changes was skipped.

change to

Try to send a keepalive message if too many changes were skipped.

2. In pgoutput_change:

+ /* Reset the counter for skipped changes. */
+ SendKeepaliveIfNecessary(ctx, false);
+

This reset is called too early: this function might go on to skip the change
because of the row filter, so the reset fits better once we know for sure
that a change is sent out. You will also need to send a keepalive when the
change is skipped due to the row filter. A sketch of the placement I have in
mind follows.
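
Schematic placement inside pgoutput_change(), using the patch's
SendKeepaliveIfNecessary(); the bool argument follows the fragments above
(true = a change was skipped, false = reset the skipped-changes counter),
which is an assumption about the API, and the row-filter call is
abbreviated:

/* Decide first whether the row filter lets this change through. */
if (!pgoutput_row_filter(relation, old_slot, &new_slot, relentry))  /* args abbreviated */
{
    /* Change filtered out: count it and maybe send a keepalive. */
    SendKeepaliveIfNecessary(ctx, true);
    return;
}

/* Only here do we know a change is actually sent; reset the counter. */
SendKeepaliveIfNecessary(ctx, false);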

regards,
Ajin Cherian
Fujitsu Australia




Re: logical replication empty transactions

2022-03-07 Thread Ajin Cherian
On Mon, Mar 7, 2022 at 7:50 PM shiy.f...@fujitsu.com
 wrote:
>
> On Fri, Mar 4, 2022 9:41 AM Ajin Cherian  wrote:
> >
> > I have split the patch into two. I have kept the logic of skipping
> > streaming changes in the second patch.
> > I will work on the second patch once we can figure out a solution for
> > the COMMIT PREPARED after restart problem.
> >
>
> Thanks for updating the patch.
>
> A comment on v23-0001 patch.
>
> @@ -1429,6 +1520,19 @@ pgoutput_message(LogicalDecodingContext *ctx, 
> ReorderBufferTXN *txn,
> if (in_streaming)
> xid = txn->xid;
>
> +   /*
> +* Output BEGIN if we haven't yet.
> +* Avoid for non-transactional messages.
> +*/
> +   if (in_streaming || transactional)
> +   {
> +   PGOutputTxnData *txndata = (PGOutputTxnData *) 
> txn->output_plugin_private;
> +
> +   /* Send BEGIN if we haven't yet */
> +   if (txndata && !txndata->sent_begin_txn)
> +   pgoutput_send_begin(ctx, txn);
> +   }
> +
> OutputPluginPrepareWrite(ctx, true);
> logicalrep_write_message(ctx->out,
>  xid,
>
> I think we don't need to send BEGIN if in_streaming is true, right? The first
> patch doesn't skip streamed transaction, so should we modify
> +   if (in_streaming || transactional)
> to
> +   if (!in_streaming && transactional)
> ?
>

Fixed.

regards,
Ajin Cherian
Fujitsu Australia


v24-0002-Skip-empty-streamed-transactions-for-logical-rep.patch
Description: Binary data


v24-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: logical replication empty transactions

2022-03-03 Thread Ajin Cherian
I have split the patch into two. I have kept the logic of skipping
streaming changes in the second patch.
I will work on the second patch once we can figure out a solution for
the COMMIT PREPARED after restart problem.

regards,
Ajin Cherian


v23-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


v23-0002-Skip-empty-streamed-transactions-for-logical-rep.patch
Description: Binary data


Re: logical replication empty transactions

2022-03-02 Thread Ajin Cherian
On Wed, Mar 2, 2022 at 1:01 PM shiy.f...@fujitsu.com
 wrote:
>
> Hi,
>
> Here are some comments on the v21 patch.
>
> 1.
> +   WalSndKeepalive(false, 0);
>
> Maybe we can use InvalidXLogRecPtr here, instead of 0.
>

Fixed.

> 2.
> +   pq_sendint64(&output_message, writePtr ? writePtr : sentPtr);
>
> Similarly, should we use XLogRecPtrIsInvalid()?

Fixed

>
> 3.
> @@ -1183,6 +1269,20 @@ pgoutput_change(LogicalDecodingContext *ctx, 
> ReorderBufferTXN *txn,
> Assert(false);
> }
>
> +   if (in_streaming)
> +   {
> +   /* If streaming, send STREAM START if we haven't yet */
> +   if (txndata && !txndata->sent_stream_start)
> +   pgoutput_send_stream_start(ctx, txn);
> +   }
> +   else
> +   {
> +   /* If not streaming, send BEGIN if we haven't yet */
> +   if (txndata && !txndata->sent_begin_txn)
> +   pgoutput_send_begin(ctx, txn);
> +   }
> +
> +
> /* Avoid leaking memory by using and resetting our own context */
> old = MemoryContextSwitchTo(data->context);
>
>
> I am not sure if it is suitable to send begin or stream_start here, because 
> the
> row filter is not checked yet. That means, empty transactions caused by row
> filter are not skipped.
>

Moved the check down, so that row_filters are taken into account.

regards,
Ajin Cherian
Fujitsu Australia


v22-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: logical replication empty transactions

2022-03-02 Thread Ajin Cherian
On Wed, Mar 2, 2022 at 1:01 PM shiy.f...@fujitsu.com
 wrote:
>
> 4.
> @@ -1617,9 +1829,21 @@ pgoutput_stream_prepare_txn(LogicalDecodingContext 
> *ctx,
> ReorderBufferTXN *txn,
> XLogRecPtr 
> prepare_lsn)
>  {
> +   PGOutputTxnData *txndata = txn->output_plugin_private;
> +   bool        sent_begin_txn = txndata->sent_begin_txn;
> +
> Assert(rbtxn_is_streamed(txn));
>
> -   OutputPluginUpdateProgress(ctx);
> +   pfree(txndata);
> +   txn->output_plugin_private = NULL;
> +
> +   if (!sent_begin_txn)
> +   {
> +   elog(DEBUG1, "Skipping replication of an empty transaction in 
> stream prepare");
> +   return;
> +   }
> +
> +   OutputPluginUpdateProgress(ctx, false);
> OutputPluginPrepareWrite(ctx, true);
> logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
> OutputPluginWrite(ctx, true);
>
> I notice that the patch skips stream prepared transaction, this would cause an
> error on subscriber side when committing this transaction on publisher side, 
> so
> I think we'd better not do that.
>
> For example:
> (set logical_decoding_work_mem = 64kB, max_prepared_transactions = 10 in
> postgresql.conf)
>
> -- publisher
> create table test (a int, b text, primary key(a));
> create table test2 (a int, b text, primary key(a));
> create publication pub for table test;
>
> -- subscriber
> create table test (a int, b text, primary key(a));
> create table test2 (a int, b text, primary key(a));
> create subscription sub connection 'dbname=postgres port=5432' publication 
> pub with(two_phase=on, streaming=on);
>
> -- publisher
> begin;
> INSERT INTO test2 SELECT i, md5(i::text) FROM generate_series(1, 1000) s(i);
> prepare transaction 't';
> commit prepared 't';
>
> The error message in subscriber log:
> ERROR:  prepared transaction with identifier "pg_gid_16391_722" does not exist
>

Thanks for the test. I guess this mixed streaming + two-phase case runs into
the same problem we had when skipping two-phase transactions: if the
eventual COMMIT PREPARED comes after a restart, there is no way of knowing
whether the original transaction was skipped, so we can't know whether the
COMMIT PREPARED needs to be sent. I tried not skipping the "stream prepare",
but that causes a crash in the apply worker as it tries to find the
non-existent streamed file. We could add logic to silently ignore a spurious
"stream prepare", but that might not be ideal. Any thoughts on how to
address this? Otherwise, we will need to avoid skipping streamed
transactions as well.

regards,
Ajin Cherian
Fujitsu Australia




Re: logical replication empty transactions

2022-02-28 Thread Ajin Cherian
end the origin id in the subsequent streams.
>   */
> - if (rbtxn_is_streamed(txn))
> + if (txndata->sent_any_stream)
>   send_replication_origin = false;
>
> Given this usage, I wonder if there is a better name for the txndata
> member - e.g. 'sent_first_stream' ?
>
> ~~~

Changed.

>
> 14. src/backend/replication/pgoutput/pgoutput.c - pgoutput_send_stream_start
>
> - /* we're streaming a chunk of transaction now */
> - in_streaming = true;
> + /*
> + * Set the flags that indicate that changes were sent as part of
> + * the transaction and the stream.
> + */
> + txndata->sent_begin_txn = txndata->sent_stream_start = true;
> + txndata->sent_any_stream = true;
>
> Why is this setting member 'sent_begin_txn' true also? It seems odd to
> say so because the BEGIN was not actually sent at all, right?
>
> ~~~

You can have transactions that are partially streamed and partially not. So
if there is a transaction that started out streamed, but whose remaining
changes are replicated as part of the commit, then when those changes are
decoded we shouldn't be sending a "begin" again.

>
> 15. src/backend/replication/pgoutput/pgoutput.c - pgoutput_stream_abort
>
> @@ -1572,6 +1740,20 @@ pgoutput_stream_abort(struct LogicalDecodingContext 
> *ctx,
>
>   /* determine the toplevel transaction */
>   toptxn = (txn->toptxn) ? txn->toptxn : txn;
> + txndata = toptxn->output_plugin_private;
> + sent_begin_txn = txndata->sent_begin_txn;
> +
> + if (txn->toptxn == NULL)
> + {
> + pfree(txndata);
> + txn->output_plugin_private = NULL;
> + }
> +
> + if (!sent_begin_txn)
> + {
> + elog(DEBUG1, "Skipping replication of an empty transaction in stream 
> abort");
> + return;
> + }
>
> I didn't really understand why this code is checking the
> 'sent_begin_txn' member instead of the 'sent_stream_start' member?
>

Yes, changed this to check "sent_first_stream"
> ~~~
>
> 16. src/backend/replication/pgoutput/pgoutput.c - pgoutput_stream_commit
>
> @@ -1598,7 +1782,17 @@ pgoutput_stream_commit(struct
> LogicalDecodingContext *ctx,
>   Assert(!in_streaming);
>   Assert(rbtxn_is_streamed(txn));
>
> - OutputPluginUpdateProgress(ctx);
> + pfree(txndata);
> + txn->output_plugin_private = NULL;
> +
> + /* If no changes were part of this transaction then drop the commit */
> + if (!sent_begin_txn)
> + {
> + elog(DEBUG1, "Skipping replication of an empty transaction in stream 
> commit");
> + return;
> + }
>
> (Same as previous comment #15). I didn't really understand why this
> code is checking the 'sent_begin_txn' member instead of the
> 'sent_stream_start' member?
>
> ~~~

Changed.

>
> 17. src/backend/replication/syncrep.c - SyncRepEnabled
>
> @@ -539,6 +538,15 @@ SyncRepReleaseWaiters(void)
>  }
>
>  /*
> + * Check if synchronous replication is enabled
> + */
> +bool
> +SyncRepEnabled(void)
> +{
> + return SyncRepRequested() && ((volatile WalSndCtlData *)
> WalSndCtl)->sync_standbys_defined;
> +}
>
> That code was once inline in 'SyncRepWaitForLSN' before it was turned
> into a function, and there is a long comment in SyncRepWaitForLSN
> describing the risks of this logic. e.g.
>
> 
> ... If it's true, we need to check it again
> * later while holding the lock, to check the flag and operate the sync
> * rep queue atomically. This is necessary to avoid the race condition
> * described in SyncRepUpdateSyncStandbysDefined().
> 
>
> This same function is now called from walsender.c. I think maybe it is
> OK but please confirm it.
>
> Anyway, the point is maybe this SyncRepEnabled function should be
> better commented to make some reference about the race concerns of the
> original comment. Otherwise some future caller of this function may be
> unaware of it and come to grief.
>

Leaving this for now, not sure what wording is appropriate to use here.

On Wed, Feb 23, 2022 at 5:24 PM wangw.f...@fujitsu.com
 wrote:
>
> On Feb, Wed 23, 2022 at 10:58 PM Ajin Cherian  wrote:
> >
> Few comments to V19-0001:
>
> 1. I think we should adjust the alignment format.
> git am ../v19-0001-Skip-empty-transactions-for-logical-replication.patch
> .git/rebase-apply/patch:197: indent with spaces.
> * Before we send schema, make sure that STREAM START/BEGIN/BEGIN PREPARE
> .git/rebase-apply/patch:198: indent with spaces.
> * is sent. If not, send now.
> .git/rebase-apply/patch:199: indent with spaces.
> */
> .git/rebase-apply/patch:201: indent with spaces.
>pgoutput_send_stream_start(ctx, toptxn);
> .git/rebase-apply/patch:204: indent with spaces.
>pgoutput_begin(ctx, toptxn);
> warning: 5 lines add whitespace errors.

Fixed.


>
> 2. Structure member initialization.
>  static void
>  pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
>  {
> +   PGOutputTxnData *txndata = MemoryContextAllocZero(ctx->context,
> +                                                      sizeof(PGOutputTxnData));
> +
> +   txndata->sent_begin_txn = false;
> +   txn->output_plugin_private = txndata;
> +}
> Do we need to set sent_stream_start and sent_any_stream to false here?

Fixed

>
> 3. Maybe we should add Assert(txndata) like function pgoutput_commit_txn in
> other functions.
>
> 4. In addition, I think we should keep a unified style.
> a). log style (maybe first one is better.)
> First style  : "Skipping replication of an empty transaction in XXX"
> Second style : "skipping replication of an empty transaction"
> b) flag name (maybe second one is better.)
> First style  : variable "sent_begin_txn" in function pgoutput_stream_*.
> Second style : variable "skip" in function pgoutput_commit_txn.
>

Fixed,

Regards,
Ajin Cherian
Fujitsu Australia


v21-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: logical replication empty transactions

2022-02-25 Thread Ajin Cherian
On Fri, Feb 18, 2022 at 9:27 PM Amit Kapila  wrote:
>
>
> Yeah, I think there could be multiple ways (a) We can send such a keep
> alive in WalSndUpdateProgress() itself by using ctx->write_location.
> For this, we need to modify WalSndKeepalive() to take sentPtr as
> input. (b) set some flag in WalSndUpdateProgress() and then do it
> somewhere in WalSndLoop probably in WalSndKeepaliveIfNecessary, or
> maybe there is another better way.
>

Thanks for the suggestion, Amit and Osumi-san. I experimented with both
suggestions but finally decided to use (a): modifying WalSndKeepalive() to
optionally take an LSN as input and passing in ctx->write_location, roughly
as sketched below.
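
Sketch of the modified WalSndKeepalive() (approach (a)), based on the
fragments quoted in the review above; treat it as illustrative rather than
the exact patch:

static void
WalSndKeepalive(bool requestReply, XLogRecPtr writePtr)
{
    elog(DEBUG2, "sending replication keepalive");

    /* Construct the keepalive message... */
    resetStringInfo(&output_message);
    pq_sendbyte(&output_message, 'k');
    /* Fall back to sentPtr when no explicit write position is given. */
    pq_sendint64(&output_message,
                 XLogRecPtrIsInvalid(writePtr) ? sentPtr : writePtr);
    pq_sendint64(&output_message, GetCurrentTimestamp());
    pq_sendbyte(&output_message, requestReply ? 1 : 0);

    /* ... and send it wrapped in CopyData. */
    pq_putmessage_noblock('d', output_message.data, output_message.len);
}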

I also verified that if I block the WalSndKeepalive() in WalSndWaitForWal,
then my new code sends the keepalive when skipping transactions and syncrep
gets back feedback.

I will address comments from Peter and Wang in my next patch update.

regards,
Ajin Cherian
Fujitsu Australia


v20-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: logical replication empty transactions

2022-02-22 Thread Ajin Cherian
On Thu, Feb 17, 2022 at 9:42 PM Amit Kapila  wrote:
>
> On Mon, Jan 31, 2022 at 6:18 PM Ajin Cherian  wrote:
> >
>
> Few comments:
> =
> 1. Is there any particular why the patch is not skipping empty xacts
> for streaming (in-progress) transactions as noted in the commit
> message as well?
>

I have added support for skipping streamed transactions.

> 2.
> +static void
> +pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
> +{
>   bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
> + PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
> +
> + Assert(txndata);
>
> I think here you can add an assert for sent_begin_txn to be always false?
>

Added.

> 3.
> +/*
> + * Send BEGIN.
> + * This is where the BEGIN is actually sent. This is called
> + * while processing the first change of the transaction.
> + */
>
> Have an empty line between the first two lines to ensure consistency
> with nearby comments. Also, the formatting of these lines appears
> awkward, either run pgindent or make sure lines are not too short.
>

Changed.

> 4. Do we really need to make any changes in PREPARE
> transaction-related functions if can't skip in that case? I think you
> can have a check if the output plugin private variable is not set then
> ignore special optimization for sending begin.
>

I have modified this as well.

I have also rebased the patch after it did not apply due to a new commit.

I will next work on testing and improving the keepalive logic while
skipping transactions.

regards,
Ajin Cherian
Fujitsu Australia


v19-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: logical replication empty transactions

2022-02-18 Thread Ajin Cherian
On Wed, Feb 16, 2022 at 2:15 PM osumi.takami...@fujitsu.com
 wrote:
> Another idea would be, to create an empty file under the the 
> pg_replslot/slotname
> with a prefix different from "xid"  in the DecodePrepare before the shutdown
> if the prepare was empty, and bypass the cleanup of the serialized txns
> and check the existence after the restart. But, this is pretty ad-hoc and I 
> wasn't sure
> if to address the corner case of the restart has the strong enough 
> justification
> to create this new file format.
>

Yes, this doesn't look very efficient.

> Therefore, in my humble opinion, the idea of protocol change slightly wins,
> since the impact of the protocol change would not be big. We introduced
> the protocol version 3 in the devel version and the number of users should be 
> little.

Yes, but we don't want to break backward compatibility for this small
added optimization.

Amit,

I will work on your comments.

regards,
Ajin Cherian
Fujitsu Australia




Re: Logical replication timeout problem

2022-02-17 Thread Ajin Cherian
On Tue, Feb 8, 2022 at 1:59 PM wangw.f...@fujitsu.com
 wrote:
>
> On Wed, Jan 26, 2022 at 11:37 AM I wrote:
> > On Sat, Jan 22, 2022 at 7:12 PM Amit Kapila  wrote:
> > > Now, one idea to solve this problem could be that whenever we skip
> > > sending any change we do try to update the plugin progress via
> > > OutputPluginUpdateProgress(for walsender, it will invoke
> > > WalSndUpdateProgress), and there it tries to process replies and send
> > > keep_alive if necessary as we do when we send some data via
> > > OutputPluginWrite(for walsender, it will invoke WalSndWriteData). I
> > > don't know whether it is a good idea to invoke such a mechanism for
> > > every change we skip to send or we should do it after we skip sending
> > > some threshold of continuous changes. I think later would be
> > > preferred. Also, we might want to introduce a new parameter
> > > send_keep_alive to this API so that there is flexibility to invoke
> > > this mechanism as we don't need to invoke it while we are actually
> > > sending data and before that, we just update the progress via this
> > > API.
> > ..
> > Based on above, I think the second idea that sending some threshold of
> > continuous changes might be better, I will do some research about this
> > approach.
> Based on the second idea, I wrote a new patch(see attachment).

Hi Wang,

Some comments:
I see you only track skipped inserts, updates and deletes. What about DDL
operations that are skipped? What about truncate? What about changes made to
unpublished tables? I wonder: if you created a test script that only did DDL
operations and truncates, would this timeout still happen? Something along
the lines of the sketch below.
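
Something along these lines (table name is hypothetical; the table is
deliberately left out of any publication, so pgoutput skips every change):

CREATE TABLE test_unpublished (key int, value1 text);

DO
$do$
BEGIN
FOR i IN 1..10000 LOOP
    -- DML on an unpublished table: decoded but skipped by pgoutput
    INSERT INTO test_unpublished VALUES (i, 'BAH');
    -- DDL: never sent to the subscriber
    ALTER TABLE test_unpublished ALTER COLUMN value1 TYPE varchar(30);
    ALTER TABLE test_unpublished ALTER COLUMN value1 TYPE text;
    -- TRUNCATE of an unpublished table: also skipped
    TRUNCATE test_unpublished;
END LOOP;
END
$do$;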

regards,
Ajin Cherian
Fujitsu Australia




Re: row filtering for logical replication

2022-02-03 Thread Ajin Cherian
On Sat, Jan 29, 2022 at 11:31 AM Andres Freund  wrote:
>
> Hi,
>
> Are there any recent performance evaluations of the overhead of row filters? I
> think it'd be good to get some numbers comparing:
>
> 1) $workload with master
> 2) $workload with patch, but no row filters
> 3) $workload with patch, row filter matching everything
> 4) $workload with patch, row filter matching few rows
>
> For workload I think it'd be worth testing:
> a) bulk COPY/INSERT into one table
> b) Many transactions doing small modifications to one table
> c) Many transactions targetting many different tables
> d) Interspersed DDL + small changes to a table
>

Here are the performance results for scenario d (times in seconds, 5 runs;
columns are HEAD, the patch with no row filter, and the patch with row
filters allowing 0%, 25%, 50%, 75% and 100% of rows):

RUN  HEAD       no filter  0%        25%        50%        75%        100%
1    65.397639  64.414034  5.919732  20.012096  36.35911   49.412548  64.508842
2    65.641783  65.255775  5.715082  20.157575  36.957403  51.355821  65.708444
3    65.096526  64.795163  6.146072  21.130709  37.679346  49.568513  66.602145
4    65.173569  64.68      5.787197  20.784607  34.465133  55.397313  63.545337
5    65.791092  66.000412  5.642696  20.258802  36.493626  52.873252  63.511428

The performance is similar to the other scenarios.
The script used is below:

CREATE TABLE test (key int, value text, value1 text, data jsonb,
PRIMARY KEY(key, value));

CREATE PUBLICATION pub_1 FOR TABLE test WHERE (key > 0); -- 100% allowed
CREATE PUBLICATION pub_1 FOR TABLE test WHERE (key > 25); -- 75% allowed
CREATE PUBLICATION pub_1 FOR TABLE test WHERE (key > 50); -- 50% allowed
CREATE PUBLICATION pub_1 FOR TABLE test WHERE (key > 75); -- 25% allowed
CREATE PUBLICATION pub_1 FOR TABLE test WHERE (key > 100); -- 0% allowed

DO
$do$
BEGIN
FOR i IN 1..101 BY 4000 LOOP
Alter table test alter column value1 TYPE varchar(30);
INSERT INTO test VALUES(i,'BAH', row_to_json(row(i)));
Alter table test ALTER COLUMN value1 TYPE text;
UPDATE test SET value = 'FOO' WHERE key = i;
COMMIT;
END LOOP;
END
$do$;

regards,
Ajin Cherian
Fujitsu Australia




Re: row filtering for logical replication

2022-02-02 Thread Ajin Cherian
Hi Peter,

I just tried scenario b that Andres suggested:

For scenario b, I did some testing with row-filter-patch v74 and
various levels of filtering, from 0% to 100% of rows replicated.
The times are in seconds; I did 5 runs each.

Results:

(percentage columns = rows allowed by the row filter, patch v74)

RUN   HEAD        0%          25%         50%         75%         100%
1     17.26178    12.573736   12.869635   13.742167   17.977112   17.75814
2     17.522473   12.919554   12.640879   14.202737   14.515481   16.961836
3     17.124001   12.640879   12.706631   14.220245   15.686613   17.219355
4     17.24122    12.602345   12.674566   14.219423   15.564312   17.432765
5     17.25352    12.610657   12.689842   14.210725   15.613708   17.403821

As can be seen, the performance on HEAD is similar to what the patch
achieves with all rows (100%) replicated. The performance improves
linearly as more rows are filtered out.

The test scenario used was:

1. On publisher and subscriber:
CREATE TABLE test (key int, value text, data jsonb, PRIMARY KEY(key, value));

2. On publisher: (based on which scenario is being tested)
CREATE PUBLICATION pub_1 FOR TABLE test WHERE (key > 0); -- 100% allowed
CREATE PUBLICATION pub_1 FOR TABLE test WHERE (key > 25); -- 75% allowed
CREATE PUBLICATION pub_1 FOR TABLE test WHERE (key > 50); -- 50% allowed
CREATE PUBLICATION pub_1 FOR TABLE test WHERE (key > 75); -- 25% allowed
CREATE PUBLICATION pub_1 FOR TABLE test WHERE (key > 100); -- 0% allowed

3. On the subscriber:
CREATE SUBSCRIPTION sync_sub CONNECTION 'host=127.0.0.1 port=5432
dbname=postgres application_name=sync_sub' PUBLICATION pub_1;

4. now modify the postgresql.conf on the publisher side
synchronous_standby_names = 'sync_sub' and restart.

5. The test case:

DO
$do$
BEGIN
FOR i IN 1..101 BY 10 LOOP
INSERT INTO test VALUES(i,'BAH', row_to_json(row(i)));
UPDATE test SET value = 'FOO' WHERE key = i;
IF I % 1000 = 0 THEN
COMMIT;
END IF;
END LOOP;
END
$do$;


regards,
Ajin Cherian
Fujitsu Australia

On Tue, Feb 1, 2022 at 12:07 PM Peter Smith  wrote:
>
> On Sat, Jan 29, 2022 at 11:31 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > Are there any recent performance evaluations of the overhead of row 
> > filters? I
> > think it'd be good to get some numbers comparing:
> >
> > 1) $workload with master
> > 2) $workload with patch, but no row filters
> > 3) $workload with patch, row filter matching everything
> > 4) $workload with patch, row filter matching few rows
> >
> > For workload I think it'd be worth testing:
> > a) bulk COPY/INSERT into one table
> > b) Many transactions doing small modifications to one table
> > c) Many transactions targetting many different tables
> > d) Interspersed DDL + small changes to a table
> >
>
> I have gathered performance data for the workload case (a):
>
> HEAD 46743.75
> v74 no filters 46929.15
> v74 allow 100% 46926.09
> v74 allow 75% 40617.74
> v74 allow 50% 35744.17
> v74 allow 25% 29468.93
> v74 allow 0% 22540.58
>
> PSA.
>
> This was tested using patch v74 and synchronous pub/sub. There are 1M
> INSERTS for publications using differing amounts of row filtering (or
> none).
>
> Observations:
> - There seems insignificant row-filter overheads (e.g. viz no filter
> and 100% allowed versus HEAD).
> - The elapsed time decreases linearly as there is less data getting 
> replicated.
>
> I will post the results for other workload kinds (b, c, d) when I have them.
>
> --
> Kind Regards,
> Peter Smith.
> Fujitsu Australia.




Re: logical replication empty transactions

2022-01-31 Thread Ajin Cherian
On Sun, Jan 30, 2022 at 7:04 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Thursday, January 27, 2022 9:57 PM Ajin Cherian  wrote:
> Hi, thanks for your patch update.
>
>
> > On Wed, Jan 26, 2022 at 8:33 PM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> > > On Tuesday, January 11, 2022 6:43 PM Ajin Cherian 
> > wrote:
> > > (3) Is this patch's reponsibility to intialize the data in
> > pgoutput_begin_prepare_txn ?
> > >
> > > @@ -433,6 +487,8 @@ static void
> > >  pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
> > > ReorderBufferTXN *txn)  {
> > > boolsend_replication_origin = txn->origin_id !=
> > InvalidRepOriginId;
> > > +   PGOutputTxnData*txndata =
> > MemoryContextAllocZero(ctx->context,
> > > +
> > > + sizeof(PGOutputTxnData));
> > >
> > > OutputPluginPrepareWrite(ctx, !send_replication_origin);
> > > logicalrep_write_begin_prepare(ctx->out, txn);
> > >
> > >
> > > Even if we need this initialization for either non streaming case or
> > > non two_phase case, there can be another issue.
> > > We don't free the allocated memory for this data, right ?
> > > There's only one place to use free in the entire patch, which is in
> > > the pgoutput_commit_txn(). So, corresponding free of memory looked
> > > necessary in the two phase commit functions.
> > >
> >
> > Actually it is required for begin_prepare to set the data type, so that the 
> > checks
> > in the pgoutput_change can make sure that the begin prepare is sent. I've 
> > also
> > added a free in commit_prepared code.
> Okay, but if we choose the design that this patch takes
> care of the initialization in pgoutput_begin_prepare_txn(),
> we need another free in pgoutput_rollback_prepared_txn().
> Could you please add some codes similar to pgoutput_commit_prepared_txn() to 
> the same ?
> If we simply execute rollback prepared for non streaming transaction,
> we don't free it.
>

Fixed.

>
> Some other new minor comments.
>
> (a) can be "synchronous replication", instead of "Synchronous Replication"
>
> When we have a look at the syncrep.c, we use the former usually in
> a normal comment.
>
>  /*
> + * Check if Synchronous Replication is enabled
> + */

Fixed.

>
> (b) move below pgoutput_truncate two codes to the case where if nrelids > 0.
>
> @@ -770,6 +850,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, 
> ReorderBufferTXN *txn,
>   int nrelations, Relation relations[], 
> ReorderBufferChange *change)
>  {
> PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
> +   PGOutputTxnData *txndata = (PGOutputTxnData *) 
> txn->output_plugin_private;
> MemoryContext old;
> RelationSyncEntry *relentry;
> int i;
> @@ -777,6 +858,9 @@ pgoutput_truncate(LogicalDecodingContext *ctx, 
> ReorderBufferTXN *txn,
> Oid*relids;
> TransactionId xid = InvalidTransactionId;
>
> +   /* If not streaming, should have setup txndata as part of BEGIN/BEGIN 
> PREPARE */
> +   Assert(in_streaming || txndata);
> +
>

Fixed.

> (c) fix indent with spaces (for the one sentence of SyncRepEnabled)
>
> @@ -539,6 +538,15 @@ SyncRepReleaseWaiters(void)
>  }
>
>  /*
> + * Check if Synchronous Replication is enabled
> + */
> +bool
> +SyncRepEnabled(void)
> +{
> +return SyncRepRequested() && ((volatile WalSndCtlData *) 
> WalSndCtl)->sync_standbys_defined;
> +}
> +
> +/*
>
> This can be detected by git am.
>

Fixed.

regards,
Ajin Cherian
Fujitsu Australia


v18-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: logical replication empty transactions

2022-01-27 Thread Ajin Cherian
On Thu, Jan 27, 2022 at 12:16 AM osumi.takami...@fujitsu.com
 wrote:
>
> On Tuesday, January 11, 2022 6:43 PM From: Ajin Cherian  
> wrote:
> > Minor update to rebase the patch so that it applies clean on HEAD.
> Hi, let me share some additional comments on v16.
>
>
> (1) comment of pgoutput_change
>
> @@ -630,11 +688,15 @@ pgoutput_change(LogicalDecodingContext *ctx, 
> ReorderBufferTXN *txn,
> Relation relation, ReorderBufferChange 
> *change)
>  {
> PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
> +   PGOutputTxnData *txndata = (PGOutputTxnData *) 
> txn->output_plugin_private;
> MemoryContext old;
> RelationSyncEntry *relentry;
> TransactionId xid = InvalidTransactionId;
> Relationancestor = NULL;
>
> +   /* If not streaming, should have setup txndata as part of BEGIN/BEGIN 
> PREPARE */
> +   Assert(in_streaming || txndata);
> +
>
> In my humble opinion, the comment should not touch BEGIN PREPARE,
> because this patch's scope doesn't include two phase commit.
> (We could add this in another patch to extend the scope after the commit ?)
>

We have to include BEGIN PREPARE as well, since the txndata has to be
set up there too. The only difference is that we will not skip empty
transactions in the BEGIN PREPARE case.

> This applies to pgoutput_truncate's comment.
>
> (2) "keep alive" should be "keepalive" in WalSndUpdateProgress
>
> /*
> +* When skipping empty transactions in synchronous replication, we 
> need
> +* to send a keep alive to keep the MyWalSnd locations updated.
> +*/
> +   force_keepalive_syncrep = send_keepalive && SyncRepEnabled();
> +
>
> Also, this applies to the comment for force_keepalive_syncrep.

Fixed.

>
> (3) Should finish the second sentence with period in the comment of 
> pgoutput_message.
>
> @@ -845,6 +923,19 @@ pgoutput_message(LogicalDecodingContext *ctx, 
> ReorderBufferTXN *txn,
> if (in_streaming)
> xid = txn->xid;
>
> +   /*
> +* Output BEGIN if we haven't yet.
> +* Avoid for streaming and non-transactional messages
>

Fixed.

> (4) "begin" can be changed to "BEGIN" in the comment of PGOutputTxnData 
> definition.
>
> In the entire patch, when we express BEGIN message,
> we use capital letters "BEGIN" except for one place.
> We can apply the same to this place as well.
>
> +typedef struct PGOutputTxnData
> +{
> +   bool sent_begin_txn;/* flag indicating whether begin has been 
> sent */
> +} PGOutputTxnData;
> +
>

Fixed.

> (5) inconsistent way to write Assert statements with blank lines
>
> In the below case, it'd be better to insert one blank line
> after the Assert();
>
> +static void
> +pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
> +{
> boolsend_replication_origin = txn->origin_id != 
> InvalidRepOriginId;
> +   PGOutputTxnData *txndata = (PGOutputTxnData *) 
> txn->output_plugin_private;
>
> +   Assert(txndata);
> OutputPluginPrepareWrite(ctx, !send_replication_origin);
>
>

Fixed.

> (6) new codes in the pgoutput_commit_txn looks messy slightly
>
> @@ -419,7 +455,25 @@ static void
>  pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> XLogRecPtr commit_lsn)
>  {
> -   OutputPluginUpdateProgress(ctx);
> +   PGOutputTxnData *txndata = (PGOutputTxnData *) 
> txn->output_plugin_private;
> +   boolskip;
> +
> +   Assert(txndata);
> +
> +   /*
> +* If a BEGIN message was not yet sent, then it means there were no 
> relevant
> +* changes encountered, so we can skip the COMMIT message too.
> +*/
> +   skip = !txndata->sent_begin_txn;
> +   pfree(txndata);
> +   txn->output_plugin_private = NULL;
> +   OutputPluginUpdateProgress(ctx, skip);
>
> Could we conduct a refactoring for this new part ?
> IMO, writing codes to free the data structure at the top
> of function seems weird.
>
> One idea is to export some part there
> and write a new function, something like below.
>
> static bool
> txn_sent_begin(ReorderBufferTXN *txn)
> {
> PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
> boolneeds_skip;
>
> Assert(txndata);
>
> needs_skip = !txndata->sent_begin_txn;
>
> pfree(txndata);
> txn->output_plugin_private = NULL;
>
> return needs_skip;
> }
>
> FYI, I had a look at the 
> v12-0002-Skip-empty-prepared-transactions-for-logical-rep.patch
> for reference of pgoutput_rollback_prepared_txn and 
> pgoutput_commit_prepared_txn.
> Looks this kind of function might work for future extensions as well.
> What did you think ?

I changed it a bit, but I'd hold off on a comprehensive rewrite until a
future patch supports skipping empty transactions for two-phase and
streaming transactions.

regards,
Ajin Cherian




Re: logical replication empty transactions

2022-01-27 Thread Ajin Cherian
On Wed, Jan 26, 2022 at 8:33 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Tuesday, January 11, 2022 6:43 PM Ajin Cherian  wrote:
> > Minor update to rebase the patch so that it applies clean on HEAD.
> Hi, thanks for you rebase.
>
> Several comments.
>
> (1) the commit message
>
> "
> transactions, keepalive messages are sent to keep the LSN locations updated
> on the standby.
> This patch does not skip empty transactions that are "streaming" or 
> "two-phase".
> "
>
> I suggest that one blank line might be needed before the last paragraph.

Changed.

>
> (2) Could you please remove one pair of curly brackets for one sentence below 
> ?
>
> @@ -1546,10 +1557,13 @@ WalSndWaitForWal(XLogRecPtr loc)
>  * otherwise idle, this keepalive will trigger a reply. 
> Processing the
>  * reply will update these MyWalSnd locations.
>  */
> -   if (MyWalSnd->flush < sentPtr &&
> +   if (force_keepalive_syncrep ||
> +   (MyWalSnd->flush < sentPtr &&
> MyWalSnd->write < sentPtr &&
> -   !waiting_for_ping_response)
> +   !waiting_for_ping_response))
> +   {
> WalSndKeepalive(false);
> +   }
>
>

Changed.

> (3) Is this patch's reponsibility to intialize the data in 
> pgoutput_begin_prepare_txn ?
>
> @@ -433,6 +487,8 @@ static void
>  pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN 
> *txn)
>  {
> boolsend_replication_origin = txn->origin_id != 
> InvalidRepOriginId;
> +   PGOutputTxnData*txndata = MemoryContextAllocZero(ctx->context,
> + 
>sizeof(PGOutputTxnData));
>
> OutputPluginPrepareWrite(ctx, !send_replication_origin);
> logicalrep_write_begin_prepare(ctx->out, txn);
>
>
> Even if we need this initialization for either non streaming case
> or non two_phase case, there can be another issue.
> We don't free the allocated memory for this data, right ?
> There's only one place to use free in the entire patch,
> which is in the pgoutput_commit_txn(). So,
> corresponding free of memory looked necessary
> in the two phase commit functions.
>

Actually it is required for begin_prepare to set up the txndata, so
that the checks in pgoutput_change can make sure that
the BEGIN PREPARE has been sent. I've also added a free in the
commit_prepared code.

> (4) SyncRepEnabled's better alignment.
>
> IIUC, SyncRepEnabled is called not only by the walsender but also by other 
> backends
> via CommitTransaction -> RecordTransactionCommit -> SyncRepWaitForLSN.
> Then, the place to add the prototype function for SyncRepEnabled seems not 
> appropriate,
> strictly speaking or requires a comment like /* called by wal sender or other 
> backends */.
>
> @@ -90,6 +90,7 @@ extern void SyncRepCleanupAtProcExit(void);
>  /* called by wal sender */
>  extern void SyncRepInitConfig(void);
>  extern void SyncRepReleaseWaiters(void);
> +extern bool SyncRepEnabled(void);
>
> Even if we intend it is only used by the walsender, the current code place
> of SyncRepEnabled in the syncrep.c might not be perfect.
> In this file, seemingly we have a section for functions for wal sender 
> processes
> and the place where you wrote it is not here.
>
> at src/backend/replication/syncrep.c, find a comment below.
> /*
>  * ===
>  * Synchronous Replication functions for wal sender processes
>  * ===
>  */

Changed.
>
> (5) minor alignment for expressing a couple of messages.
>
> @@ -777,6 +846,9 @@ pgoutput_truncate(LogicalDecodingContext *ctx, 
> ReorderBufferTXN *txn,
> Oid*relids;
> TransactionId xid = InvalidTransactionId;
>
> +   /* If not streaming, should have setup txndata as part of BEGIN/BEGIN 
> PREPARE */
> +   Assert(in_streaming || txndata);
>
>
> In the commit message, the way you write is below.
> ...
> skip BEGIN / COMMIT messages for transactions that are empty. The patch
> ...
>
> In this case, we have spaces back and forth for "BEGIN / COMMIT".
> Then, I suggest to unify all of those to show better alignment.

Fixed.

regards,
Ajin Cherian


v17-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: logical replication empty transactions

2022-01-11 Thread Ajin Cherian
On Wed, Sep 1, 2021 at 8:57 PM Ajin Cherian  wrote:
>
> Thanks for the comments. Addressed them in the attached patch.
>
> regards,
> Ajin Cherian
> Fujitsu Australia

Minor update to rebase the patch so that it applies clean on HEAD.

regards,
Ajin Cherian



v16-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: row filtering for logical replication

2021-12-20 Thread Ajin Cherian
On Tue, Dec 21, 2021 at 5:58 AM Euler Taveira  wrote:
>
> I reviewed 0003. It uses TupleTableSlot instead of HeapTuple. I probably 
> missed
> the explanation but it requires more changes (logicalrep_write_tuple and 3 new
> entries into RelationSyncEntry). I replaced this patch with a slightly
> different one (0005 in this patch set) that uses HeapTuple instead. I didn't
> only simple tests and it requires tests. I noticed that this patch does not
> include a test to cover the case where TOASTed values are not included in the
> new tuple. We should probably add one.

The reason I changed the code to use virtual tuple slots is to reduce
tuple-deforming overhead.
Dilip raised this very valid comment in [1]:

On Tue, Sep 21, 2021 at 4:29 PM Dilip Kumar
 wrote:
>
>In pgoutput_row_filter_update(), first, we are deforming the tuple in
>local datum, then modifying the tuple, and then reforming the tuple.
>I think we can surely do better here. Currently, you are reforming
>the tuple so that you can store it in the scan slot by calling
>ExecStoreHeapTuple which will be used for expression evaluation.
>Instead of that what you need to do is to deform the tuple using
>tts_values of the scan slot and later call ExecStoreVirtualTuple(), so
>advantages are 1) you don't need to reform the tuple 2) the expression
>evaluation machinery doesn't need to deform again for fetching the
>value of the attribute, instead it can directly get from the value
>from the virtual tuple.

Storing the old tuple/new tuple in a slot and re-using the slot avoids
the overhead of repeatedly deforming the tuple at multiple levels in
the code.
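For reference, a minimal sketch of that approach (an illustration, not
the patch itself):

#include "postgres.h"
#include "access/htup_details.h"
#include "executor/tuptable.h"

/*
 * Deform the heap tuple once, straight into the slot's tts_values /
 * tts_isnull arrays, then mark the slot as holding a valid virtual
 * tuple.  The expression evaluation machinery can then read attribute
 * values directly from the arrays instead of deforming a second time.
 */
static void
store_deformed_tuple(TupleTableSlot *slot, HeapTuple tuple)
{
    ExecClearTuple(slot);
    heap_deform_tuple(tuple, slot->tts_tupleDescriptor,
                      slot->tts_values, slot->tts_isnull);
    ExecStoreVirtualTuple(slot);
}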

regards,
Ajin Cherian
[1] - 
https://www.postgresql.org/message-id/CAFiTN-vwBjy+eR+iodkO5UVN5cPv_xx1=s8ehzgcrjza+az...@mail.gmail.com




Re: row filtering for logical replication

2021-12-19 Thread Ajin Cherian
On Mon, Dec 20, 2021 at 12:51 PM houzj.f...@fujitsu.com
 wrote:
>
> I think it might not be hard to predict the current behavior. User only need 
> to be
> aware of that:
> 1) pubaction and row filter on different publications are combined with 'OR'.
> 2) FOR UPDATE, we execute the fiter for both OLD and NEW tuple and would 
> change
>the operation type accordingly.
>
> For the example mentioned:
> create table tbl1 (a int primary key, b int);
> create publication A for table tbl1 where (b<2) with(publish='insert');
> create publication B for table tbl1 where (a>1) with(publish='update');
>
> If we follow the rule 1) and 2), I feel we are able to predict the following
> conditions:
> --
> WHERE (action = 'insert' AND b < 2) OR (action = 'update' AND a > 1)
> --
>
> So, it seems acceptable to me.
>
> Personally, I think the current design could give user more flexibility to
> handle some complex scenario. If user want some simple setting for 
> publication,
> they can also set same row filter for the same table in different 
> publications.
> To avoid confusion, I think we can document about these rules clearly.
>
> BTW, From the document of IBM, I think IBM also support this kind of complex
> condition [1].
> [1] https://www.ibm.com/docs/en/idr/11.4.0?topic=rows-log-record-variables

Yes, I agree with this. It's better to give users more flexibility,
while warning them about the consequences, rather than restricting
them with constraints.
We could explain this in the documentation so that users can better
predict the effect of having pubaction-specific filters.

regards,
Ajin Cherian
Fujitsu Australia




Re: row filtering for logical replication

2021-12-17 Thread Ajin Cherian
On Fri, Dec 17, 2021 at 5:46 PM Greg Nancarrow  wrote:

> So using the v47 patch-set, I still find that the UPDATE above results in 
> publication of an INSERT of (2,1), rather than an UPDATE of (1,1) to (2,1).
> This is according to the 2nd UPDATE rule below, from patch 0003.
>
> + * old-row (no match)new-row (no match)  -> (drop change)
> + * old-row (no match)new row (match) -> INSERT
> + * old-row (match)   new-row (no match)  -> DELETE
> + * old-row (match)   new row (match) -> UPDATE
>
> This is because the old row (1,1) doesn't match the UPDATE filter "(a>1)", 
> but the new row (2,1) does.
> This functionality doesn't seem right to me. I don't think it can be assumed 
> that (1,1) was never published (and thus requires an INSERT rather than 
> UPDATE) based on these checks, because in this example, (1,1) was previously 
> published via a different operation - INSERT (and using a different filter 
> too).
> I think the fundamental problem here is that these UPDATE rules assume that 
> the old (current) row was previously UPDATEd (and published, or not 
> published, according to the filter applicable to UPDATE), but this is not 
> necessarily the case.
> Or am I missing something?

But it is not necessarily correct to assume that the old row was part
of a previous INSERT either (and published, or not published, according
to the filter applicable to an INSERT).
For example, change the sequence of inserts and updates prior to the
last update:

truncate tbl1 ;
insert into tbl1 values (1,5); ==> not replicated since insert and ! (b < 2);
update tbl1 set b = 1; ==> not replicated since update and ! (a > 1)
update tbl1 set a = 2; ==> replicated and update converted to insert
since (a > 1)

In this case, the last update ("update tbl1 set a = 2;") is updating a
row that was previously updated, not inserted, and was never replicated
to the subscriber.
How does the replication logic differentiate between these two cases
and decide whether the row was previously published?
I think it's futile for the publisher side to try to figure out the
history of published rows. In fact, if this level of logic is required
then it is best implemented on the subscriber side, which then defeats
the purpose of a publication filter.


regards,
Ajin Cherian
Fujitsu Australia




Re: row filtering for logical replication

2021-12-01 Thread Ajin Cherian
On Wed, Dec 1, 2021 at 3:27 AM vignesh C  wrote:
>
> Here we will not be able to do a direct comparison as we store the
> transformed where clause in the pg_publication_rel table. We will have
> to transform the where clause and then check. I have attached a patch
> where we can check the transformed where clause and see if the where
> clause is the same or not. If you are ok with this approach you could
> make similar changes.

Thanks for your patch. I have used the same logic with minor changes
and shared it with Peter for v44.

regards,
Ajin Cherian
Fujitsu Australia




Re: row filtering for logical replication

2021-10-11 Thread Ajin Cherian
On Tue, Oct 12, 2021 at 1:37 AM Dilip Kumar  wrote:
>
> On Wed, Oct 6, 2021 at 2:33 PM Ajin Cherian  wrote:
> >
> > On Sat, Oct 2, 2021 at 5:44 PM Ajin Cherian  wrote:
> > >
> > > I have for now also rebased the patch and merged the first 5 patches
> > > into 1, and added my changes for the above into the second patch.
> >
> > I have split the patches back again, just to be consistent with the
> > original state of the patches. Sorry for the inconvenience.
>
> Thanks for the updated version of the patch, I was looking into the
> latest version and I have a few comments.
>
>
> +if ((att->attlen == -1 &&
> VARATT_IS_EXTERNAL_ONDISK(tmp_new_slot->tts_values[i])) &&
> +(!old_slot->tts_isnull[i] &&
> +!(VARATT_IS_EXTERNAL_ONDISK(old_slot->tts_values[i]
> +{
> +tmp_new_slot->tts_values[i] = old_slot->tts_values[i];
> +newtup_changed = true;
> +}
>
> If the attribute is stored EXTERNAL_ONDIS on the new tuple and it is
> not null in the old tuple then it must be logged completely in the old
> tuple, so instead of checking
> !(VARATT_IS_EXTERNAL_ONDISK(old_slot->tts_values[i]), it should be
> asserted,
>
>
> +heap_deform_tuple(newtuple, desc, new_slot->tts_values,
> new_slot->tts_isnull);
> +heap_deform_tuple(oldtuple, desc, old_slot->tts_values,
> old_slot->tts_isnull);
> +
> +if (newtup_changed)
> +tmpnewtuple = heap_form_tuple(desc, tmp_new_slot->tts_values,
> new_slot->tts_isnull);
> +
> +old_matched = pgoutput_row_filter(relation, NULL, oldtuple, entry);
> +new_matched = pgoutput_row_filter(relation, NULL,
> +  newtup_changed ? tmpnewtuple :
> newtuple, entry);
>
> I do not like the fact that, first we have deformed the tuples and we
> are again using the HeapTuple
> for expression evaluation machinery and later the expression
> evaluation we do the deform again.
>
> So why don't you use the deformed tuple as it is to store as a virtual tuple?
>
> Infact, if newtup_changed is true then you are forming back the tuple
> just to get it deformed again
> in the expression evaluation.
>
> I think I have already given this comment on the last version.

Right, I only used the deformed tuple later when it was written to the
stream. I will modify this as well.

regards,
Ajin Cherian
Fujitsu Australia




Re: extensible options syntax for replication parser?

2021-09-26 Thread Ajin Cherian
On Mon, Sep 27, 2021 at 11:20 AM Ajin Cherian  wrote:
>
> On Sat, Sep 25, 2021 at 4:28 AM tushar  wrote:
> >
> > On 9/24/21 10:36 PM, Robert Haas wrote:
> > > Here's v9, fixing the issue reported by Fujii Masao.
> >
> > Please refer this scenario where publication on v14RC1  and subscription
> > on HEAD (w/patch)
> >
> > --create a subscription with parameter two_phase=1 on HEAD
> >
> > postgres=# CREATE SUBSCRIPTION r1015 CONNECTION 'dbname=postgres
> > host=localhost port=5454' PUBLICATION p WITH (two_phase=1);
> > NOTICE:  created replication slot "r1015" on publisher
> > CREATE SUBSCRIPTION
> > postgres=#
> >
> > --check on 14RC1
> >
> > postgres=# select two_phase from pg_replication_slots where
> > slot_name='r105';
> >   two_phase
> > ---
> >   f
> > (1 row)
> >
> > so are we silently ignoring this parameter as it is not supported on
> > v14RC/HEAD ? and if yes then why not we just throw an message like
> > ERROR:  unrecognized subscription parameter: "two_phase"
> >
> > --
>
> There is usually a time lag between a subscription created with two_phase on 
> and
> the slot on the publisher enabling two_phase. It only happens after a
> tablesync is completed and
> the apply worker is restarted. There are logs which indicate that this
> has happened. If you could share the
> logs (on publisher and subscriber) when this happens, I could have a look.
>

And in case you do see a problem, I request that you create a separate
thread for it; I don't want to derail this patch.

regards,
Ajin Cherian
Fujitsu Australia




Re: extensible options syntax for replication parser?

2021-09-26 Thread Ajin Cherian
On Sat, Sep 25, 2021 at 4:28 AM tushar  wrote:
>
> On 9/24/21 10:36 PM, Robert Haas wrote:
> > Here's v9, fixing the issue reported by Fujii Masao.
>
> Please refer this scenario where publication on v14RC1  and subscription
> on HEAD (w/patch)
>
> --create a subscription with parameter two_phase=1 on HEAD
>
> postgres=# CREATE SUBSCRIPTION r1015 CONNECTION 'dbname=postgres
> host=localhost port=5454' PUBLICATION p WITH (two_phase=1);
> NOTICE:  created replication slot "r1015" on publisher
> CREATE SUBSCRIPTION
> postgres=#
>
> --check on 14RC1
>
> postgres=# select two_phase from pg_replication_slots where
> slot_name='r105';
>   two_phase
> ---
>   f
> (1 row)
>
> so are we silently ignoring this parameter as it is not supported on
> v14RC/HEAD ? and if yes then why not we just throw an message like
> ERROR:  unrecognized subscription parameter: "two_phase"
>
> --

There is usually a time lag between a subscription being created with
two_phase on and the slot on the publisher enabling two_phase. The
latter only happens after a tablesync is completed and the apply worker
is restarted. There are log messages which indicate that this has
happened. If you could share the logs (on publisher and subscriber)
from when this happens, I could have a look.
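In the meantime, one way to observe this lag on the subscriber
(assuming a build with the two_phase patch, which adds the
subtwophasestate column) is:

SELECT subname, subtwophasestate FROM pg_subscription;
-- 'd' = disabled, 'p' = pending (tablesync not yet done), 'e' = enabled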

regards,
Ajin Cherian
Fujitsu Australia




Re: row filtering for logical replication

2021-09-21 Thread Ajin Cherian
On Wed, Sep 22, 2021 at 1:50 PM Amit Kapila  wrote:
>
> On Wed, Sep 22, 2021 at 6:42 AM Ajin Cherian  wrote:
> >
>
> Why do you think that the second assumption (if there is an old tuple
> it will contain all RI key fields.) is broken? It seems to me even
> when we are planning to include unchanged toast as part of old_key, it
> will contain all the key columns, isn't that true?

Yes, I assumed wrongly. Just checked. What you say is correct.

>
> > I think we
> > still need to deform both old tuple and new tuple, just to handle this case.
> >
>
> Yeah, but we will anyway talking about saving that cost for later if
> we decide to send that tuple. I think we can further try to optimize
> it by first checking whether the new tuple has any toasted value, if
> so then only we need this extra pass of deforming.

Ok, I will go ahead with this approach.
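As a sketch of what I have in mind (my assumption of the plan, not
final code), the guard can be tested from the tuple header without any
deforming:

#include "postgres.h"
#include "access/htup_details.h"

/*
 * HEAP_HASEXTERNAL is set in the infomask whenever any attribute of
 * the tuple is stored out of line, so this check is cheap and needs no
 * deforming.  Only when it returns true do we need the extra pass that
 * copies unchanged toasted key values from the old tuple.
 */
static inline bool
update_needs_extra_deform_pass(HeapTuple newtuple)
{
    return newtuple != NULL && HeapTupleHasExternal(newtuple);
}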

>
> > There is currently logic in ReorderBufferToastReplace() which already
> > deforms the new tuple
> > to detoast changed toasted fields in the new tuple. I think if we can
> > enhance this logic for our
> > purpose, then we can avoid an extra deform of the new tuple.
> > But I think you had earlier indicated that having untoasted unchanged
> > values in  the new tuple
> > can be bothersome.
> >
>
> I think it will be too costly on the subscriber side during apply
> because it will update all the unchanged toasted values which will
> lead to extra writes both for WAL and data.
>

Ok, agreed.

regards,
Ajin Cherian
Fujitsu Australia




Re: row filtering for logical replication

2021-09-21 Thread Ajin Cherian
On Tue, Sep 21, 2021 at 9:42 PM Dilip Kumar  wrote:
>
> On Tue, Sep 21, 2021 at 4:29 PM Dilip Kumar  wrote:
> > Some more comments,
> >
> > In pgoutput_row_filter_update(), first, we are deforming the tuple in
> > local datum, then modifying the tuple, and then reforming the tuple.
> > I think we can surely do better here.  Currently, you are reforming
> > the tuple so that you can store it in the scan slot by calling
> > ExecStoreHeapTuple which will be used for expression evaluation.
> > Instead of that what you need to do is to deform the tuple using
> > tts_values of the scan slot and later call ExecStoreVirtualTuple(), so
> > advantages are 1) you don't need to reform the tuple 2) the expression
> > evaluation machinery doesn't need to deform again for fetching the
> > value of the attribute, instead it can directly get from the value
> > from the virtual tuple.
> >
>
> I have one more question, while looking into the
> ExtractReplicaIdentity() function, it seems that if any of the "rep
> ident key" fields is changed then we will write all the key fields in
> the WAL as part of the old tuple, not just the changed fields.  That
> means either the old tuple will be NULL or it will be having all the
> key attributes.  So if we are supporting filter only on the "rep ident
> key fields" then is there any need to copy the fields from the new
> tuple to the old tuple?
>

Yes, I just figured this out while testing. So we don't need to copy fields
from the new tuple to the old tuple.

But there is still the case of your fix for the unchanged toasted RI
key fields in the new tuple, which need to be copied from the old tuple
to the new tuple. That particular case seems to violate both rules:
that an old tuple will be present only when there are changed RI key
fields, and that if there is an old tuple it will contain all RI key
fields. I think we still need to deform both the old tuple and the new
tuple, just to handle this case.

There is currently logic in ReorderBufferToastReplace() which already
deforms the new tuple to detoast changed toasted fields in it. I think
if we can enhance this logic for our purpose, then we can avoid an
extra deform of the new tuple.
But I think you had earlier indicated that having untoasted unchanged
values in the new tuple can be bothersome.

Any suggestions?

regards,
Ajin Cherian
Fujitsu Australia





Re: row filtering for logical replication

2021-09-20 Thread Ajin Cherian
On Tue, Sep 21, 2021 at 12:03 AM Dilip Kumar  wrote:
>
> On Mon, Sep 20, 2021 at 5:37 PM Amit Kapila  wrote:
> > > >
> > >
> > > Adding a patch that strives to do the logic that I described above.
> > > For updates, the row filter is applied on both old_tuple
> > > and new_tuple. This patch assumes that the row filter only uses
> > > columns that are part of the REPLICA IDENTITY. (the current patch-set
> > > only
> > > restricts this for row-filters that are delete only)
> > > The old_tuple only has columns that are part of the old_tuple and have
> > > been changed, which is a problem while applying the row-filter. Since
> > > unchanged REPLICA IDENTITY columns
> > > are not present in the old_tuple, this patch creates a temporary
> > > old_tuple by getting such column values from the new_tuple and then
> > > applies the filter on this hand-created temp old_tuple. The way the
> > > old_tuple is created can be better optimised in future versions.
>
> I understand why this is done, but I have 2 concerns here 1) We are
> having extra deform and copying the field from new to old in case it
> is unchanged replica identity.  2) The same unchanged attribute values
> get qualified in the old tuple as well as in the new tuple.  What
> exactly needs to be done is that the only updated field should be
> validated as part of the old as well as the new tuple, the unchanged
> field does not make sense to have redundant validation.   For that we
> will have to change the filter for the old tuple to just validate the
> attributes which are actually modified and remaining unchanged and new
> values will anyway get validated in the new tuple.
>
But what if the filter expression depends on multiple columns, say
(a+b) > 100, where a is unchanged while b is changed? Then we still
need both columns to apply the filter, even though one is unchanged.
Also, I am not aware of any mechanism by which we can apply a filter
expression to individual attributes; the current mechanism applies it
to a tuple. Do let me know if you have any ideas there.
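For instance, with a filter like the following (a sketch using this
patch's WHERE syntax), neither column alone determines the result:

CREATE PUBLICATION pub_sum FOR TABLE tbl1 WHERE ((a + b) > 100);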

Even if it were done, there would still be the overhead of deforming
the tuple. I will run some performance tests, as Amit suggested, to see
what the overhead is and try to minimise it.

regards,
Ajin Cherian
Fujitsu Australia




Re: row filtering for logical replication

2021-09-08 Thread Ajin Cherian
On Wed, Sep 1, 2021 at 9:23 PM Euler Taveira  wrote:
>
> On Sun, Aug 29, 2021, at 11:14 PM, Peter Smith wrote:
>
> Here are the new v26* patches. This is a refactoring of the row-filter
> caches to remove all the logic from the get_rel_sync_entry function
> and delay it until if/when needed in the pgoutput_row_filter function.
> This is now implemented per Amit's suggestion to move all the cache
> code [1]. It is a replacement for the v25* patches.
>
> The make check and TAP subscription tests are all OK. I have repeated
> the performance tests [2] and those results are good too.
>
> v26-0001 <--- v23 (base RF patch)
> v26-0002 <--- ExprState cache mods (refactored row filter caching)
> v26-0002 <--- ExprState cache extra debug logging (temp)
>
> Peter, I'm still reviewing this new cache mechanism. I will provide a feedback
> as soon as I integrate it as part of this recent modification.
>
> I'm attaching a new version that simply including Houzj review [1]. This is
> based on v23.
>
> There has been a discussion about which row should be used by row filter. We
> don't have a unanimous choice, so I think it is prudent to provide a way for
> the user to change it. I suggested in a previous email [2] that a publication
> option should be added. Hence, row filter can be applied to old tuple, new
> tuple, or both. This approach is simpler than using OLD/NEW references (less
> code and avoid validation such as NEW reference for DELETEs and OLD reference
> for INSERTs). I think about a reasonable default value and it seems _new_ 
> tuple
> is a good one because (i) it is always available and (ii) user doesn't have
> to figure out that replication is broken due to a column that is not part
> of replica identity. I'm attaching a POC that implements it. I'm still
> polishing it. Add tests for multiple row filters and integrate Peter's caching
> mechanism [3] are the next steps.
>

Assuming this _new_tuple option is enabled, consider:
1. An UPDATE where the new_tuple satisfies the row filter, but the
old_tuple did not (not checked). Since the row filter check passed
but the actual row never existed on the subscriber, would this patch
convert the UPDATE to an INSERT or would this UPDATE be ignored? Based
on the tests that I did, I see that it is ignored.
2. An UPDATE where the new tuple does not satisfy the row filter but
the old_tuple did. Since the new_tuple did not match the row filter,
wouldn't this row now remain divergent on the replica?

This approach of using either the new_tuple or the old_tuple doesn't
seem very fruitful if the user requires that the replica stay
up-to-date with respect to the filter condition. For that, I think you
will need to convert UPDATEs to INSERTs or DELETEs when only one of
new_tuple and old_tuple matches the filter condition:

UPDATE
old-row (match)     new-row (no match)  -> DELETE
old-row (no match)  new-row (match)     -> INSERT
old-row (match)     new-row (match)     -> UPDATE
old-row (no match)  new-row (no match)  -> (drop change)
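In pseudocode, the transformation I am describing would look something
like this (the names are illustrative only, not from the patch):

typedef enum
{
    FILTER_ACTION_DROP,
    FILTER_ACTION_INSERT,
    FILTER_ACTION_UPDATE,
    FILTER_ACTION_DELETE
} FilterAction;

/* old_matched / new_matched: whether each tuple version passes the filter */
static FilterAction
transform_update(bool old_matched, bool new_matched)
{
    if (old_matched && new_matched)
        return FILTER_ACTION_UPDATE;    /* normal UPDATE */
    if (old_matched)
        return FILTER_ACTION_DELETE;    /* row leaves the filtered set */
    if (new_matched)
        return FILTER_ACTION_INSERT;    /* row enters the filtered set */
    return FILTER_ACTION_DROP;          /* row never visible downstream */
}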

regards,
Ajin Cherian
Fujitsu Australia




Re: [BUG]Update Toast data failure in logical replication

2021-09-07 Thread Ajin Cherian
On Wed, Aug 11, 2021 at 10:45 PM Dilip Kumar  wrote:

> Yeah we can avoid that by detecting any toasted replica identity key
> in HeapDetermineModifiedColumns, check the attached patch.
>

I had a second look at this, and I have a small doubt. The convention
is that for UPDATEs, the old tuple/key is written to WAL only if one of
the columns in the key has changed as part of the update, and we are
breaking that convention with this patch by also including the old key
if it is toasted and stored on disk, even if it has not changed.
Why do we not include the detoasted key as part of the new tuple
rather than the old tuple? Then we don't really break this convention.

And one small typo in the patch:

The header above ExtractReplicaIdentity()

Before:
 * key_required should be false if caller knows that no replica identity
 * columns changed value and it doesn't has any external data.
 * It's always true in the DELETE case.

After:
 * key_required should be false if caller knows that no replica identity
 * columns changed value and it doesn't have any external data.
 * It's always true in the DELETE case.

regards,
Ajin Cherian
Fujitsu Australia




Re: [BUG]Update Toast data failure in logical replication

2021-09-02 Thread Ajin Cherian
On Wed, Aug 11, 2021 at 10:45 PM Dilip Kumar  wrote:
>
> Yeah we can avoid that by detecting any toasted replica identity key
> in HeapDetermineModifiedColumns, check the attached patch.
>

The patch applies cleanly and all tests pass. I tried out a few toast
combination tests and they seem to be working fine.
I have no review comments; the patch looks good to me.
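For reference, the kind of combination this patch addresses (a sketch,
not my exact tests): an UPDATE of a non-key column on a row whose
replica identity key is toasted but unchanged.

CREATE TABLE t (id text PRIMARY KEY, b int);
ALTER TABLE t ALTER COLUMN id SET STORAGE EXTERNAL;  -- force out-of-line TOAST
INSERT INTO t VALUES (repeat('x', 100000), 1);       -- RI key gets toasted
UPDATE t SET b = 2;                                  -- key itself unchanged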

regards,
Ajin Cherian
Fujitsu Australia




Re: logical replication empty transactions

2021-09-01 Thread Ajin Cherian
On Wed, Aug 25, 2021 at 5:15 PM Peter Smith  wrote:
>
> I reviewed the v14-0001 patch.
>
> All my previous comments have been addressed.
>
> Apply / build / test was all OK.
>
> --
>
> More review comments:
>
> 1. Params names in the function declarations should match the rest of the 
> code.
>
> 1a. src/include/replication/logical.h
>
> @@ -26,7 +26,8 @@ typedef LogicalOutputPluginWriterWrite
> LogicalOutputPluginWriterPrepareWrite;
>
>  typedef void (*LogicalOutputPluginWriterUpdateProgress) (struct
> LogicalDecodingContext *lr,
>   XLogRecPtr Ptr,
> - TransactionId xid
> + TransactionId xid,
> + bool send_keep_alive
>
> =>
> Change "send_keep_alive" --> "send_keepalive"
>
> ~~
>
> 1b. src/include/replication/output_plugin.h
>
> @@ -243,6 +243,6 @@ typedef struct OutputPluginCallbacks
>  /* Functions in replication/logical/logical.c */
>  extern void OutputPluginPrepareWrite(struct LogicalDecodingContext
> *ctx, bool last_write);
>  extern void OutputPluginWrite(struct LogicalDecodingContext *ctx,
> bool last_write);
> -extern void OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx);
> +extern void OutputPluginUpdateProgress(struct LogicalDecodingContext
> *ctx, bool send_keep_alive);
>
> =>
> Change "send_keep_alive" --> "send_keepalive"
>
> --
>
> 2. Comment should be capitalized - src/backend/replication/walsender.c
>
> @@ -170,6 +170,9 @@ static TimestampTz last_reply_timestamp = 0;
>  /* Have we sent a heartbeat message asking for reply, since last reply? */
>  static bool waiting_for_ping_response = false;
>
> +/* force keep alive when skipping transactions in synchronous
> replication mode */
> +static bool force_keepalive_syncrep = false;
>
> =>
> "force" --> "Force"
>
> --
>
> Otherwise, v14-0001 LGTM.
>

Thanks for the comments. Addressed them in the attached patch.

regards,
Ajin Cherian
Fujitsu Australia


v15-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: Failure of subscription tests with topminnow

2021-08-31 Thread Ajin Cherian
On Tue, Aug 31, 2021 at 3:47 PM Masahiko Sawada  wrote:
>
> On Tue, Aug 31, 2021 at 12:11 PM Amit Kapila  wrote:
> >
> > On Mon, Aug 30, 2021 at 5:48 PM Ajin Cherian  wrote:
> > >
> > > On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila  
> > > wrote:
> > >
> > > I have made the above changes on HEAD.
> > >
> >
> > Thanks, this looks mostly good to me. I'll push and backpatch this
> > tomorrow unless you or someone else thinks otherwise.
> >
> > Minor comments
> > ==
> > 1.
> >  $oldpid = $node_publisher->safe_psql('postgres',
> > - "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
> > + "SELECT pid FROM pg_stat_replication WHERE application_name =
> > 'tap_sub' AND state = 'streaming';;"
> >  );
> >
> > An extra semicolon at the end of the statement.
> >
> > 2.
> > +# restart of subscription workers. We check the state along with
> > application_name
> > +# to ensure that the walsender is (re)started.
> >
> > It is better to keep application_name in an above comment in the
> > second line as that will make this line looks a bit more consistent
> > with other comments.
> >
> > 3. In commit message, the text: "The reason was that the test was
> > assuming the walsender started before it reaches the 'streaming' state
> > and The check to test whether the subscription workers were restarting
> > after a change in the subscription was failing." seems to be
> > repeated/redundant.
> >
> > 4. Kindly submit the patches for back-branches.
>
> The patch with the above comments looks good to me. One minor
> suggestion is to change the two messages of "die" to make the
> investigation a bit easier. For example,
>
> $node_publisher->poll_query_until('postgres',
> "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = 'tap_sub' AND state = 'streaming';"
> ) or die "Timed out while waiting for apply to restart after changing
> CONNECTION";
>
> and
>
> $node_publisher->poll_query_until('postgres',
> "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = 'tap_sub' AND state = 'streaming';"
> ) or die "Timed out while waiting for apply to restart after changing
> PUBLICATION";
>
> Regards,
>

Thanks Masahiko-san. I have included this change and made a new patch-set.

Hi Amit,

I have included your comments as well and also attached the patches
for the back-branches.

regards,
Ajin Cherian
Fujitsu Australia


head-v4-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
Description: Binary data


REL-10-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
Description: Binary data


REL-11-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
Description: Binary data


REL-12-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
Description: Binary data


REL-13-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
Description: Binary data


REL-14-STABLE-v2-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
Description: Binary data


Re: Failure of subscription tests with topminnow

2021-08-30 Thread Ajin Cherian
On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila  wrote:

> Isn't it better to check the streaming state when we are fetching
> oldpid? If we don't add, then I suspect that the next time someone
> adding tests on similar lines might get confused about where to check
> the state and where not. Also, if you agree, add some comments before
> the test on why it is important to check states.
>
> For ex., in below queries, the queries used for $oldpid.
> my $oldpid = $node_publisher->safe_psql('postgres',
> "SELECT pid FROM pg_stat_replication WHERE application_name =
> 'tap_sub';"
> );
> $node_subscriber->safe_psql('postgres',
> "ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
> sslmode=disable'"
> );
> $node_publisher->poll_query_until('postgres',
> "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = 'tap_sub' AND state = 'streaming';"
> ) or die "Timed out while waiting for apply to restart";
>
> $oldpid = $node_publisher->safe_psql('postgres',
> "SELECT pid FROM pg_stat_replication WHERE application_name =
> 'tap_sub';"
> );
> $node_subscriber->safe_psql('postgres',
> "ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only
> WITH (copy_data = false)"
> );
> $node_publisher->poll_query_until('postgres',
> "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = 'tap_sub' AND state = 'streaming';"
>

I have made the above changes on HEAD.

regards,
Ajin Cherian
Fujitsu Australia


v3-0001-Fix-the-random-test-failure-in-001_rep_changes.patch
Description: Binary data


Re: Failure of subscription tests with topminnow

2021-08-27 Thread Ajin Cherian
On Fri, Aug 27, 2021 at 3:29 PM Amit Kapila  wrote:
>
>
> I think the fix is correct but similar changes are required in
> 022_twophase_cascade.pl as well (search for $oldpid in tests). I am
> not completely sure but I think it is better to make this test change
> in back branches as well to make it stable and reduce such random
> build farm failures.

I have made the changes in 022_twophase_cascade.pl for HEAD as well as
patches for older branches.

regards,
Ajin Cherian
Fujitsu Australia


HEAD-v2-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
Description: Binary data


REL-13-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
Description: Binary data


REL-12-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
Description: Binary data


REL-10-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
Description: Binary data


REL-11-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
Description: Binary data


REL-14-STABLE-v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
Description: Binary data


Re: Failure of subscription tests with topminnow

2021-08-26 Thread Ajin Cherian
On Thu, Aug 26, 2021 at 2:45 PM Masahiko Sawada  wrote:

> I think that it’s possible that the orders of the process writing
> disconnections logs and setting 0 to walsender's pid are mismatched.
> We set 0 to walsender's pid in WalSndKill() that is called during
> on_shmem_exit callback. Once we set 0, pg_stat_replication doesn't
> show the entry. On the other hand, disconnections logs are written by
> log_disconnections() that is called during on_proc_exit callback. That
> is, the following sequence could happen:
>
> 1. the second walsender (pid = 16475) raises an error as the slot is
> already active (held by the first walsender).
> 2. the first walsender (pid = 16336) clears its pid on the shmem.
> 3. the polling query checks the walsender’s pid, and returns true
> (since there is only the second walsender now).
> 4. the second walsender clears its pid on the shmem.
> 5. the second walsender writes disconnection log.
> 6. the first walsender writes disconneciton log.

I agree with this.

Attaching a patch on HEAD that modifies this particular script to also
check the state of the walsender.
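The shape of the change is to extend the existing poll with a state
check, e.g.:

$node_publisher->poll_query_until('postgres',
    "SELECT pid != $oldpid FROM pg_stat_replication WHERE
    application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";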

regards,
Ajin Cherian
Fujitsu Australia


v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
Description: Binary data


Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Thu, Aug 26, 2021 at 1:54 PM Amit Kapila  wrote:
>
> On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian  wrote:
> >
> > On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila  wrote:
> >
> > >
> > > You have a point but if we see the below logs, it seems the second
> > > walsender (#step6) seemed to exited before the first walsender
> > > (#step4).
> > >
> > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> > > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> > > session time: 0:00:06.367 user=nm database=postgres host=[local]
> > >
> > > Isn't it possible that pid is cleared in the other order due to which
> > > we are seeing this problem?
> >
> > If the pid is cleared in the other order, wouldn't the query [1] return a 
> > false?
> >
> > [1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE
> > application_name =  'tap_sub';"
> >
>
> I think it should return true because pid for 16336 is cleared first
> and the remaining one will be 16475.

Yes, that is what I explained as well. 16336 is pid 'a' (the first
walsender) in my explanation. The first walsender's pid should be
cleared first for this theory to work.

regards,
Ajin Cherian
Fujitsu Australia




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila  wrote:

>
> You have a point but if we see the below logs, it seems the second
> walsender (#step6) seemed to exited before the first walsender
> (#step4).
>
> 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> session time: 0:00:00.036 user=nm database=postgres host=[local]
> 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> session time: 0:00:06.367 user=nm database=postgres host=[local]
>
> Isn't it possible that pid is cleared in the other order due to which
> we are seeing this problem?

If the pid is cleared in the other order, wouldn't the query [1] return false?

[1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE
application_name =  'tap_sub';"

regards,
Ajin Cherian
Fujitsu Australia




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada  wrote:
>
> On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian  wrote:
> >
> > On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > I did a quick check with the following tap test code:
> > > >
> > > > $node_publisher->poll_query_until('postgres',
> > > >   qq(
> > > > select 1 != foo.column1 from (values(0), (1)) as foo;
> > > > ));
> > > >
> > > > The query returns {t, f} but poll_query_until() never finished. The
> > > > same is true when the query returns {f, t}.
> > > >
> >
> > Yes, this is true, I also see the same behaviour.
> >
> > >
> > > This means something different is going on in Ajin's setup. Ajin, can
> > > you please share how did you confirm your findings about poll_query?
> >
> > Relooking at my logs, I think what happens is this:
> >
> > 1. First walsender 'a' is running.
> > 2. Second walsender 'b' starts and attempts at acquiring the slot
> > finds that the slot is active for pid a.
> > 3. Now both walsenders are active, the query does not return.
> > 4. First walsender 'a' times out and exits.
> > 5. Now only the second walsender is active and the query returns OK
> > because pid != a.
> > 6. Second walsender exits with error.
> > 7. Another query attempts to get the pid of the running walsender for
> > tap_sub but returns null because both walsender exits.
> > 8. This null return value results in the next query erroring out and
> > the test failing.
>
> So this is slightly different than what we can see in the topminnow
> logs? According to the server logs, step #5 happened (at 18:44:38.016)
> before step #4 happened (at 18:44:38.043).
>

Luckily these logs have the disconnection times of the tap test client
sessions as well (I am not sure why I don't see these when I run the
tests myself).

Step 5 could have happened anywhere between 18:44:38.016 and 18:44:38.063
18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG:  statement: SELECT
pid != 16336 FROM pg_stat_replication WHERE application_name =
'tap_sub';
:
:
18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG:  disconnection:
session time: 0:00:00.063 user=nm database=postgres host=[local]

When the query starts, both walsenders are present, but by the time it
completes, both walsenders are gone; the actual query evaluation could
have happened at any point in between. This is the rare timing window
that causes the problem.

regards,
Ajin Cherian
Fujitsu Australia




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila  wrote:
>
> On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada  wrote:
> >
> > I did a quick check with the following tap test code:
> >
> > $node_publisher->poll_query_until('postgres',
> >   qq(
> > select 1 != foo.column1 from (values(0), (1)) as foo;
> > ));
> >
> > The query returns {t, f} but poll_query_until() never finished. The
> > same is true when the query returns {f, t}.
> >

Yes, this is true; I also see the same behaviour.

>
> This means something different is going on in Ajin's setup. Ajin, can
> you please share how did you confirm your findings about poll_query?

Relooking at my logs, I think what happens is this:

1. First walsender 'a' is running.
2. Second walsender 'b' starts and, on attempting to acquire the slot,
finds that the slot is active for pid 'a'.
3. Now both walsenders are active, and the query does not return.
4. First walsender 'a' times out and exits.
5. Now only the second walsender is active, and the query returns OK
because pid != 'a'.
6. Second walsender exits with an error.
7. Another query attempts to get the pid of the running walsender for
tap_sub but returns null because both walsenders have exited.
8. This null return value results in the next query erroring out and
the test failing.

>Can you additionally check the value of 'state' from
>pg_stat_replication for both the old and new walsender sessions?

Yes, will try this and post a patch tomorrow.

regards,
Ajin Cherian
Fujitsu Australia




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Wed, Aug 25, 2021 at 9:32 PM Masahiko Sawada  wrote:

>
> IIUC the query[1] used for polling returns two rows in this case: {t,
> f} or {f, t}. But did poll_query_until() returned OK in this case even
> if we expected one row of 't'? My guess of how this issue happened is:
>
> 1. the first polling query after "ATLER SUBSCRIPTION CONNECTION"
> passed (for some reason).
> 2. all wal senders exited.
> 3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
> 4. the second polling query resulted in a syntax error since $oldpid is null.
>
> If the fact that two walsender with the same application_name could
> present in pg_stat_replication view was the cause of this issue,
> poll_query_until() should return OK even if we expected just 't'. I
> might be missing something, though.
>
> [1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = '$appname';"

Yes, the query [1] returns OK with a {f,t} or {t,f} result.

[1] - "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname';"

regards,
Ajin Cherian
Fujitsu Australia




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian  wrote:
>
> On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila  wrote:
> >
> > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian  wrote:
> > >
> > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila  
> > > wrote:
> > >
> > > > But will poll function still poll or exit? Have you tried that?
> > >
> > > I have forced that condition with a changed query and found that the
> > > poll will not exit in case of a NULL return.
> > >
> >
> > What if the query in a poll is fired just before we get an error
> > "tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
> > Won't at that stage both old and new walsender's are present, so the
> > query might return true. You can check that via debugger by stopping
> > just before this error occurs and then check pg_stat_replication view.
>
> If this error happens then the PID is NOT updated as the pid in the
> Replication slot. I have checked this
> and explained this in my first email itself
>

Sorry about the above email, I misunderstood: I was looking at
pg_stat_replication_slot rather than pg_stat_replication, hence the
confusion. Amit is correct; just prior to the walsender erroring out, it
briefly appears in pg_stat_replication, and that is why this error
happens. I just confirmed it by stopping both walsenders in the debugger:
I just confirmed it, got both the walsenders stopped in the debugger:

postgres=# select pid from pg_stat_replication where application_name = 'sub';
 pid
--
 7899
 7993
(2 rows)


ajin  7896  3326  0 05:22 pts/2    00:00:00 psql -p 6972 postgres
ajin  7897  7882  0 05:22 ?        00:00:00 postgres: ajin postgres [local] idle
ajin  7899  7882  0 05:22 ?        00:00:00 postgres: walsender ajin ::1(37719) START_REPLICATION
ajin  7992  3647  0 05:24 ?        00:00:00 postgres: logical replication worker for subscription 16426
ajin  7993  7882  0 05:24 ?        00:00:00 postgres: walsender ajin ::1(37720) START_REPLICATION

regards,
Ajin Cherian
Fujitsu Australia




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila  wrote:
>
> On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian  wrote:
> >
> > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila  
> > wrote:
> >
> > > But will poll function still poll or exit? Have you tried that?
> >
> > I have forced that condition with a changed query and found that the
> > poll will not exit in case of a NULL return.
> >
>
> What if the query in a poll is fired just before we get an error
> "tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
> Won't at that stage both old and new walsender's are present, so the
> query might return true. You can check that via debugger by stopping
> just before this error occurs and then check pg_stat_replication view.

If this error happens then the PID is NOT updated as the pid in the
Replication slot. I have checked this
and explained this in my first email itself

regards,
Ajin Cherian
Fujitsu Australia




Re: Failure of subscription tests with topminnow

2021-08-24 Thread Ajin Cherian
On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila  wrote:

> But will poll function still poll or exit? Have you tried that?

I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.

> Because it is not clear from your explanation how in the first
> statement it returns a valid value which leads poll to exit and then
> in second statement it returns NULL or maybe nothing. Can you share

I don't have an explanation for it either. Maybe things are different
in REL_11_STABLE.

> the log also when you are getting "replication slot "tap_sub" is
> active for ..."? If you see in the below log [1], the STATEMENT is
> printed twice, I want to see if you also get prints in a similar way
> or is it something different? Do you know why it is printed twice?
>

Yes, I get the same. For every LOG or ERROR message, there is an
associated STATEMENT message with it: first there is a LOG "received
replication command" followed by its STATEMENT, and then the ERROR
"replication slot ... is active ..." followed by its STATEMENT.

I will try with REL_11_STABLE and see if the behaviour is different.

regards,
Ajin Cherian
Fujitsu Australia




Re: Failure of subscription tests with topminnow

2021-08-24 Thread Ajin Cherian
On Tue, Aug 24, 2021 at 9:08 PM Amit Kapila  wrote:

> What happens when there is neither a new walsender nor an old
> walsender is present? It means to run the above statement when a new
> walsender is exited due to error "... slot is active ..." and before a
> new walsender could start. Also, allow old walsender (due to which the
> error occurs) to exit before executing the statement.
>
I tried this; the query then returns null instead of 'false' because
there is no entry for that application_name.

postgres=# select pid != 31876 from pg_stat_replication where
application_name = 'sub';
 ?column?
--
(0 rows)


> Also, it seems this failure happens on REL_11_STABLE branch, not sure
> if that matters, but it is better to use the same branch if you are
> using a different branch to reproduce the issue.
>

Ok, I didn't realise this. I will try this.

regards,
Ajin Cherian
Fujitsu Australia




Re: Failure of subscription tests with topminnow

2021-08-23 Thread Ajin Cherian
On Mon, Aug 16, 2021 at 6:33 PM Amit Kapila  wrote:

> The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the
> walsender process. Now, here the problem seems to be that the previous
> walsender process (16336) didn't exit and the new process started to
> use the slot which leads to the error shown in the log. This is
> evident from the below part of log where we can see that 16336 has
> exited after new process started to use the slot.
>
> 2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG:  received
> replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
> 0/16BEEB0 (proto_version '1', publication_names
> '"tap_pub","tap_pub_ins_only"')
> 2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
> START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
> publication_names '"tap_pub","tap_pub_ins_only"')
> 2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR:  replication
> slot "tap_sub" is active for PID 16336
> 2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
> START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
> publication_names '"tap_pub","tap_pub_ins_only"')
> 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> session time: 0:00:00.036 user=nm database=postgres host=[local]
> 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> session time: 0:00:06.367 user=nm database=postgres host=[local]
>
> One idea to solve this is to first disable the subscription, wait for
> the walsender process to exit, and then change the connection string
> and re-enable the subscription.

I tried to simulate this by adding delays prior to the exit of the
walsender. Even though I see the same error in the logs, that the
replication slot is active for the previous PID, the test case does not
fail in the way seen here.

The new walsender tries to acquire the slot, but on seeing that another
PID is active on the slot, it errors out and exits. At no point does
this failed walsender update its pid for the slot. As a result, the
polled query below would not return a true value:

$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub';"
) or die "Timed out while waiting for apply to restart";

In my runs I see that this query is repeated until a new walsender is
able to successfully acquire the slot. I am not able to explain why this
query returned true in the topminnow tap test; that would imply that a
walsender was able to acquire the slot and update its pid, but I don't
see how. We also know that if this polled query returns true (implying a
pid other than $oldpid), then the next query in the test tries to
identify the pid:

SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';

From the topminnow logs we know that this query returned NULL, as the
value extracted from it is used to formulate the next query, which
errored out and eventually caused the test case to fail.

 [16482:3] 001_rep_changes.pl ERROR:  syntax error at or near "FROM"
at character 16
 [16482:4] 001_rep_changes.pl STATEMENT:  SELECT pid !=  FROM
pg_stat_replication WHERE application_name = 'tap_sub';

I am not an expert in Perl, but I looked at the function used in the
tap test, poll_query_until(); it polls until the query returns 'true'
(unless specified otherwise). In my tests I don't see the polled
function exit if the query returns NULL. I don't know whether
differences in the installed Perl could cause this difference in
behaviour. Can an empty (NULL) result set be construed as true and cause
the poll to exit?
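
For reference, here is a minimal sketch of the comparison logic as I
read it (simplified and not the actual code; the real function lives in
src/test/perl/PostgresNode.pm, runs psql directly, and defaults the
expected value to 't'):

# A simplified sketch of the poll_query_until() loop.
sub poll_query_until_sketch
{
    my ($node, $dbname, $query, $expected) = @_;
    $expected = 't' unless defined $expected;

    my $attempts = 0;
    while ($attempts < 180)
    {
        my $result = $node->safe_psql($dbname, $query);

        # A zero-row (NULL) result is the empty string here, which does
        # not equal 't', so the loop keeps polling; likewise a two-row
        # result like "f\nt" never matches. Neither should make the
        # poll exit early.
        return 1 if ($result eq $expected);

        sleep 1;
        $attempts++;
    }
    return 0;    # timed out
}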

 Any suggestions?

regards,
Ajin Cherian
Fujitsu Australia




Re: logical replication empty transactions

2021-08-17 Thread Ajin Cherian
On Mon, Aug 16, 2021 at 4:44 PM Peter Smith  wrote:

> I have reviewed the v13-0001 patch.
>
> Apply / build / test was all OK
>
> Below are my code review comments.
>
> //
>
> Comments for v13-0001
> =
>
> 1. Patch comment
>
> =>
>
> Probably this comment should include some description for the new
> "keepalive" logic as well.

Added.

>
> --
>
> 2. src/backend/replication/syncrep.c - new function
>
> @@ -330,6 +330,18 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
>  }
>
>  /*
> + * Check if Sync Rep is enabled
> + */
> +bool
> +SyncRepEnabled(void)
> +{
> + if (SyncRepRequested() && ((volatile WalSndCtlData *)
> WalSndCtl)->sync_standbys_defined)
> + return true;
> + else
> + return false;
> +}
> +
>
> 2a. Function comment =>
>
> Why abbreviations in the comment? Why not say "synchronous
> replication" instead of "Sync Rep".
>

Changed.

> ~~
>
> 2b. if/else =>
>
> Remove the if/else. e.g.
>
> return SyncRepRequested() && ((volatile WalSndCtlData *)
> WalSndCtl)->sync_standbys_defined;
>
> ~~

Changed.

>
> 2c. Call the new function =>
>
> There is some existing similar code in SyncRepWaitForLSN(), e.g.
>
> if (!SyncRepRequested() ||
> !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
> return;
>
> Now that you have a new function you maybe can call it from here, e.g.
>
> if (!SyncRepEnabled())
> return;
>

Updated.

> --
>
> 3. src/backend/replication/walsender.c - whitespace
>
> + if (send_keep_alive)
> + force_keep_alive_syncrep = true;
> +
> +
>
> =>
>
> Extra blank line?

Removed.

>
> --
>
> 4. src/backend/replication/walsender.c - call keepalive
>
>   if (MyWalSnd->flush < sentPtr &&
>   MyWalSnd->write < sentPtr &&
>   !waiting_for_ping_response)
> + {
>   WalSndKeepalive(false);
> + }
> + else
> + {
> + if (force_keep_alive_syncrep && SyncRepEnabled())
> + WalSndKeepalive(false);
> + }
>
>
> 4a. Move the SynRepEnabled() call =>
>
> I think it is not necessary to call the SynRepEnabled() here. Instead,
> it might be better if this is called back when you assign the
> force_keep_alive_syncrep flag. So change the WalSndUpdateProgress,
> e.g.
>
> BEFORE
> if (send_keep_alive)
>   force_keep_alive_syncrep = true;
> AFTER
> force_keep_alive_syncrep = send_keep_alive && SyncRepEnabled();
>
> Note: Also, that assignment also deserves a big comment to say what it is 
> doing.
>
> ~~

changed.

>
> 4b. Change the if/else =>
>
> If you make the change for 4a. then perhaps the keepalive if/else is
> overkill and could be changed.e.g.
>
> if (force_keep_alive_syncrep ||
>   MyWalSnd->flush < sentPtr &&
>   MyWalSnd->write < sentPtr &&
>   !waiting_for_ping_response)
>   WalSndKeepalive(false);
>

Changed.
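
As an aside, the synchronous-replication case these keepalives target
can be sketched in TAP terms roughly as below (a sketch only; the node,
table, and publication names are made up):

# tab_skip is not in the publication, so a transaction touching only it
# decodes to nothing once empty transactions are skipped. Without a
# keepalive carrying the commit LSN, the subscriber has nothing to flush
# and confirm, and a synchronous commit on the publisher could stall
# until the next periodic keepalive.
$node_publisher->safe_psql('postgres', "CREATE TABLE tab_pub (a int)");
$node_publisher->safe_psql('postgres', "CREATE TABLE tab_skip (a int)");
$node_publisher->safe_psql('postgres',
    "CREATE PUBLICATION tap_pub FOR TABLE tab_pub");

# ... create the subscription on the other node, then make it a
# synchronous standby of the publisher:
$node_publisher->safe_psql('postgres',
    "ALTER SYSTEM SET synchronous_standby_names = 'tap_sub'");
$node_publisher->safe_psql('postgres', "SELECT pg_reload_conf()");

# This commit is an 'empty' transaction as far as tap_pub is concerned;
# with the forced keepalive it should still complete promptly instead of
# waiting for the next periodic keepalive cycle.
$node_publisher->safe_psql('postgres', "INSERT INTO tab_skip VALUES (1)");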

regards,
Ajin Cherian
Fujitsu Australia


v14-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data

