Logical Replication Conflict Resolution

2023-05-25 Thread Stavros Koureas
Working with PostgreSQL Logical Replication is just great! It helps a lot
doing real time replication for analytical purposes without using any other
3d party service. Although all these years working as a product architect
of reporting I have noted a few requirements which are always a challenge
and may help enhance logical replication even better.

To the point:
PostgreSQL15 Logical Replication is going to stop while there is a conflict
on the destination database. On the other hand, PGLogical can handle the
conflicts with more options like error, apply_remote,
keep_local, last_update_wins, first_update_wins while streaming.

I was thinking that it would be great to add those capabilities into
Logical Replication during streaming and even better on snapshot if it is
possible. This enhancement method is going to solve a lot of issues while
there are hybrid solutions which are updating databases with SQL Scripts
and Logical Replication. At the same time will make Logical Replication
"more reliable" as it will not stop replicating lines in live
environments when the decision of conflict is already decided and
configured.

In my case I am consolidating data from multiple erp system databases to a
destination database for reporting purposes. All erps, have the same table
schema as the destination database, source database has the tenant_id
identifier in non primary keys but has replica identity index. Now there
are scenarios where we may need to update manually the destination database
using scripts which are having the ONCONFLICT statement, but during that
time if a new record is inserted into the database and the batch statement
finishes earlier than replication, the replication will find a conflict.


Re: Logical Replication Custom Column Expression

2023-02-06 Thread Stavros Koureas
>> And yes, probably you need to change the way you reply to email on
>> this list. Top-posting is generally avoided. See
>> https://wiki.postgresql.org/wiki/Mailing_Lists.

>Thanks for bringing this into the discussion :)

Thinking these days more about this topic, subscriber name is not a bad
idea, although it makes sense to be able to give your own value even on
subscriber level, for example an integer.
Having a custom integer value is better as definitely this integer will
participate later in all joins beside the tables and for sure joining with
an integer it would be quicker rather than joining on a character varying
(plus the rest of the columns).
In addition, discussing with other people and also on Stack
Overflow/DBAExchange I have found that other people think it is a great
enhancement for analytical purposes.

Στις Τετ 30 Νοε 2022 στις 10:39 π.μ., ο/η Stavros Koureas <
koureasstav...@gmail.com> έγραψε:

>
>
> Στις Τρί 29 Νοε 2022 στις 3:27 μ.μ., ο/η Ashutosh Bapat <
> ashutosh.bapat@gmail.com> έγραψε:
> > That would be too restrictive - not necessarily in your application
> > but generally. There could be some tables where consolidating rows
> > with same PK from different publishers into a single row in subscriber
> > would be desirable. I think we need to enable the property for every
> > subscriber that intends to add publisher column to the desired and
> > subscribed tables. But there should be another option per table which
> > will indicate that receiver should add publisher when INSERTING row to
> > that table.
>
> So we are discussing the scope level of this property, if this property
> will be implemented on subscriber level or on subscriber table.
> In that case I am not sure how this will be implemented as currently
> postgres subscribers can have multiple tables streamed from a single
> publisher.
> In that case we may have an additional syntax on subscriber, for example:
>
> CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost port=5432
> user=postgres password=XX dbname=publisher1' PUBLICATION pub1 with
> (enabled = false, create_slot = false, slot_name = NONE, tables =
> {tableA:union, tableB:none, });
>
> Something like this?
>
> > And yes, probably you need to change the way you reply to email on
> > this list. Top-posting is generally avoided. See
> > https://wiki.postgresql.org/wiki/Mailing_Lists.
>
> Thanks for bringing this into the discussion :)
>


Re: Logical Replication Custom Column Expression

2022-11-30 Thread Stavros Koureas
Στις Τρί 29 Νοε 2022 στις 3:27 μ.μ., ο/η Ashutosh Bapat <
ashutosh.bapat@gmail.com> έγραψε:
> That would be too restrictive - not necessarily in your application
> but generally. There could be some tables where consolidating rows
> with same PK from different publishers into a single row in subscriber
> would be desirable. I think we need to enable the property for every
> subscriber that intends to add publisher column to the desired and
> subscribed tables. But there should be another option per table which
> will indicate that receiver should add publisher when INSERTING row to
> that table.

So we are discussing the scope level of this property, if this property
will be implemented on subscriber level or on subscriber table.
In that case I am not sure how this will be implemented as currently
postgres subscribers can have multiple tables streamed from a single
publisher.
In that case we may have an additional syntax on subscriber, for example:

CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost port=5432 user=postgres
password=XX dbname=publisher1' PUBLICATION pub1 with (enabled = false,
create_slot = false, slot_name = NONE, tables = {tableA:union, tableB:none,
});

Something like this?

> And yes, probably you need to change the way you reply to email on
> this list. Top-posting is generally avoided. See
> https://wiki.postgresql.org/wiki/Mailing_Lists.

Thanks for bringing this into the discussion :)


Re: Logical Replication Custom Column Expression

2022-11-28 Thread Stavros Koureas
Sure I understand and neither do I have good knowledge of what else could
be influenced by such a change.
If the value of the column is the subscriber name has no benefit to this
idea of merging multiple upstreams with same primary keys, later you
describe the "connection dbname", yes this could be a possibility.
I do not fully understand that part "how will the initial tablesync COPY
efficiently assign these subscriber name column values?"
Why is difficult that during the initial sync put everywhere the same value
for all rows of the same origin?

Στις Δευ 28 Νοε 2022 στις 10:16 π.μ., ο/η Peter Smith 
έγραψε:

> On Fri, Nov 25, 2022 at 9:43 PM Stavros Koureas
>  wrote:
> >
> > Yes, if the property is on the subscription side then it should be
> applied for all the tables that the connected publication is exposing.
> > So if the property is enabled you should be sure that this origin column
> exists to all of the tables that the publication is exposing...
> >
> > Sure this is the complete idea, that the subscriber should match the PK
> of origin, 
> > As the subscription table will contain same key values from different
> origins, for example:
> >
> > For publisher1 database table
> > id pk integer | value character varying
> > 1   | testA1
> > 2   | testA2
> >
> > For publisher2 database table
> > id pk integer | value character varying
> > 1   | testB1
> > 2   | testB2
> >
> > For subscriber database table
> > origin pk character varying | id pk integer | value character varying
> > publisher1   | 1   | testA1
> > publisher1   | 2   | testA2
> > publisher2   | 1   | testB1
> > publisher2   | 2   | testB2
> >
> > All statements INSERT, UPDATE, DELETE should always include the
> predicate of the origin.
> >
>
> This sounds similar to what I had posted [1] although I was saying the
> generated column value might be the *subscriber* name, not the origin
> publisher name. (where are you getting that value from -- somehow from
> the subscriptions' CONNECTION dbname?)
>
> Anyway, regardless of the details, please note -- my idea was really
> intended just as a discussion starting point to demonstrate that
> required functionality might be achieved using a simpler syntax than
> what had been previously suggested. But in practice there may be some
> problems with this approach -- e.g. how will the initial tablesync
> COPY efficiently assign these subscriber name column values?
>
> --
> [1]
> https://www.postgresql.org/message-id/CAHut%2BPuZowXd7Aa7t0nqjP6afHMwJarngzeMq%2BQP0vE2KKLOgQ%40mail.gmail.com
>
> Kind Regards,
> Peter Smith.
> Fujitsu Australia.
>


Re: Logical Replication Custom Column Expression

2022-11-25 Thread Stavros Koureas
Yes, if the property is on the subscription side then it should be applied
for all the tables that the connected publication is exposing.
So if the property is enabled you should be sure that this origin column
exists to all of the tables that the publication is exposing...

Sure this is the complete idea, that the subscriber should match the PK of
origin, 
As the subscription table will contain same key values from different
origins, for example:

*For publisher1 database **table*
id pk integer | value character varying
1   | testA1
2   | testA2

*For publisher2 database **table*
id pk integer | value character varying
1   | testB1
2   | testB2

*For subscriber database table*
origin *pk *character varying | id *pk *integer | value character varying
publisher1   | 1   | testA1
publisher1   | 2   | testA2
publisher2   | 1   | testB1
publisher2   | 2   | testB2

All statements INSERT, UPDATE, DELETE should always include the predicate
of the origin.

Στις Παρ 25 Νοε 2022 στις 12:21 μ.μ., ο/η Ashutosh Bapat <
ashutosh.bapat@gmail.com> έγραψε:

> On Wed, Nov 23, 2022 at 4:54 AM Peter Smith  wrote:
> >
> > On Wed, Nov 23, 2022 at 7:38 AM Stavros Koureas
> >  wrote:
> > >
> > > Reading more carefully what you described, I think you are interested
> in getting something you call origin from publishers, probably some
> metadata from the publications.
> > >
> > > This identifier in those metadata maybe does not have business value
> on the reporting side. The idea is to use a value which has specific
> meaning to the user at the end.
> > >
> > > For example assigning 1 for tenant 1, 2 for tenant 2 and so one, at
> the end based on a dimension table which holds this mapping the user would
> be able to filter the data. So programmatically the user can set the id
> value of the column plus creating the mapping table from an application
> let’s say and be able to distinguish the data.
> > >
> > > In addition this column should have the ability to be part of the
> primary key on the subscription table in order to not conflict with lines
> from other tenants having the same keys.
> > >
> > >
> >
> > I was wondering if a simpler syntax solution might also work here.
> >
> > Imagine another SUBSCRIPTION parameter that indicates to write the
> > *name* of the subscription to some pre-defined table column:
> > e.g. CREATE SUBSCRIPTION subname FOR PUBLICATION pub_tenant_1
> > CONNECTION '...' WITH (subscription_column);
> >
> > Logical Replication already allows the subscriber table to have extra
> > columns, so you just need to manually create the extra 'subscription'
> > column up-front.
> >
> > Then...
> >
> > ~~
> >
> > On Publisher:
> >
> > test_pub=# CREATE TABLE tab(id int primary key, description varchar);
> > CREATE TABLE
> >
> > test_pub=# INSERT INTO tab VALUES (1,'one'),(2,'two'),(3,'three');
> > INSERT 0 3
> >
> > test_pub=# CREATE PUBLICATION tenant1 FOR ALL TABLES;
> > CREATE PUBLICATION
> >
> > ~~
> >
> > On Subscriber:
> >
> > test_sub=# CREATE TABLE tab(id int, description varchar, subscription
> varchar);
> > CREATE TABLE
> >
> > test_sub=# CREATE SUBSCRIPTION sub_tenant1 CONNECTION 'host=localhost
> > dbname=test_pub' PUBLICATION tenant1 WITH (subscription_column);
> > CREATE SUBSCRIPTION
> >
> > test_sub=# SELECT * FROM tab;
> >  id | description | subscription
> > +-+--
> >   1 | one | sub_tenant1
> >   2 | two | sub_tenant1
> >   3 | three   | sub_tenant1
> > (3 rows)
> >
> > ~~
> >
> Thanks for the example. This is more concrete than just verbal description.
>
> In this example, do all the tables that a subscription subscribes to
> need that additional column or somehow the pglogical receiver will
> figure out which tables have that column and populate rows
> accordingly?
>
> My further fear is that the subscriber will also need to match the
> subscription column along with the rest of PK so as not to update rows
> from other subscriptions.
> --
> Best Wishes,
> Ashutosh Bapat
>


Re: Logical Replication Custom Column Expression

2022-11-23 Thread Stavros Koureas
Just one correction for the subscriber
On Subscriber:

test_sub=# CREATE TABLE tab(id int *pkey*, description varchar,
subscription varchar *pkey*);
CREATE TABLE

The subscription table should have the same primary key columns as the
publisher plus one more.
We need to make sure that on update only the same origin data is
being updated.

Στις Τετ 23 Νοε 2022 στις 1:24 π.μ., ο/η Peter Smith 
έγραψε:

> On Wed, Nov 23, 2022 at 7:38 AM Stavros Koureas
>  wrote:
> >
> > Reading more carefully what you described, I think you are interested in
> getting something you call origin from publishers, probably some metadata
> from the publications.
> >
> > This identifier in those metadata maybe does not have business value on
> the reporting side. The idea is to use a value which has specific meaning
> to the user at the end.
> >
> > For example assigning 1 for tenant 1, 2 for tenant 2 and so one, at the
> end based on a dimension table which holds this mapping the user would be
> able to filter the data. So programmatically the user can set the id value
> of the column plus creating the mapping table from an application let’s say
> and be able to distinguish the data.
> >
> > In addition this column should have the ability to be part of the
> primary key on the subscription table in order to not conflict with lines
> from other tenants having the same keys.
> >
> >
>
> I was wondering if a simpler syntax solution might also work here.
>
> Imagine another SUBSCRIPTION parameter that indicates to write the
> *name* of the subscription to some pre-defined table column:
> e.g. CREATE SUBSCRIPTION subname FOR PUBLICATION pub_tenant_1
> CONNECTION '...' WITH (subscription_column);
>
> Logical Replication already allows the subscriber table to have extra
> columns, so you just need to manually create the extra 'subscription'
> column up-front.
>
> Then...
>
> ~~
>
> On Publisher:
>
> test_pub=# CREATE TABLE tab(id int primary key, description varchar);
> CREATE TABLE
>
> test_pub=# INSERT INTO tab VALUES (1,'one'),(2,'two'),(3,'three');
> INSERT 0 3
>
> test_pub=# CREATE PUBLICATION tenant1 FOR ALL TABLES;
> CREATE PUBLICATION
>
> ~~
>
> On Subscriber:
>
> test_sub=# CREATE TABLE tab(id int, description varchar, subscription
> varchar);
> CREATE TABLE
>
> test_sub=# CREATE SUBSCRIPTION sub_tenant1 CONNECTION 'host=localhost
> dbname=test_pub' PUBLICATION tenant1 WITH (subscription_column);
> CREATE SUBSCRIPTION
>
> test_sub=# SELECT * FROM tab;
>  id | description | subscription
> +-+--
>   1 | one | sub_tenant1
>   2 | two | sub_tenant1
>   3 | three   | sub_tenant1
> (3 rows)
>
> ~~
>
> Subscriptions to different tenants would be named differently.
>
> And using other SQL you can map/filter those names however your
> application wants.
>
> --
> Kind Regards,
> Peter Smith.
> Fujitsu Australia
>


Re: Logical Replication Custom Column Expression

2022-11-23 Thread Stavros Koureas
It's easy to answer this question.

Imagine that in a software company who sells the product and also offers
reporting solutions, the ERP tables will not have this additional column to
all the tables.
Now the reporting department comes and needs to consolidate all that data
from different databases (publishers) and create one multitenant database
to have all the data.
So in an ERP like NAV or anything else you cannot suggest change all the
code to all of the tables plus all functions to add one additional column
to this table, even that was possible then you cannot work with integers
but you need to work with GUIDs as this column should be predefined to each
ERP. Then joining with GUID in the second phase for reporting
definitely will slow down the performance.

In summary:

   1. Cannot touch the underlying source (important)
   2. GUID identifier column will slow down the reporting performance


Στις Τετ 23 Νοε 2022 στις 5:19 π.μ., ο/η Amit Kapila <
amit.kapil...@gmail.com> έγραψε:

> On Wed, Nov 23, 2022 at 1:40 AM Stavros Koureas
>  wrote:
> >
> > Reading more carefully what you described, I think you are interested in
> getting something you call origin from publishers, probably some metadata
> from the publications.
> >
> > This identifier in those metadata maybe does not have business value on
> the reporting side. The idea is to use a value which has specific meaning
> to the user at the end.
> >
> > For example assigning 1 for tenant 1, 2 for tenant 2 and so one, at the
> end based on a dimension table which holds this mapping the user would be
> able to filter the data. So programmatically the user can set the id value
> of the column plus creating the mapping table from an application let’s say
> and be able to distinguish the data.
> >
>
> In your example, are different tenants represent different publisher
> nodes? If so, why can't we have a predefined column and value for the
> required tables on each publisher rather than logical replication
> generate that value while replicating data?
>
> --
> With Regards,
> Amit Kapila.
>


Re: Logical Replication Custom Column Expression

2022-11-22 Thread Stavros Koureas
Reading more carefully what you described, I think you are interested in 
getting something you call origin from publishers, probably some metadata from 
the publications.

This identifier in those metadata maybe does not have business value on the 
reporting side. The idea is to use a value which has specific meaning to the 
user at the end.

For example assigning 1 for tenant 1, 2 for tenant 2 and so one, at the end 
based on a dimension table which holds this mapping the user would be able to 
filter the data. So programmatically the user can set the id value of the 
column plus creating the mapping table from an application let’s say and be 
able to distinguish the data.

In addition this column should have the ability to be part of the primary key 
on the subscription table in order to not conflict with lines from other 
tenants having the same keys.

> 
> 22 Νοε 2022, 14:52, ο χρήστης «Stavros Koureas » 
> έγραψε:
> 
> 
> Sure, this can be implemented as a subscription option, and it will cover 
> this use case scenario as each subscriber points only to one database.
> I also have some more analytical/reporting use-cases which need additions in 
> logical-replication, I am not sure if I need to open different discussions 
> for each one, all ideas are for publication/subscription.
> 
> Στις Τρί 22 Νοε 2022 στις 2:22 μ.μ., ο/η Amit Kapila 
>  έγραψε:
>> On Mon, Nov 21, 2022 at 5:05 PM Ashutosh Bapat
>>  wrote:
>> >
>> > On Sat, Nov 19, 2022 at 6:47 PM Stavros Koureas
>> >  wrote:
>> > >
>> > > What does not support is the option for defining custom column 
>> > > expressions, as keys or values, into the upstream (publication). This 
>> > > will give more flexibility into making replication from multiple 
>> > > upstreams into less downstreams adding more logic. For instance, in a 
>> > > project for analytical purposes there is the need to consolidate data 
>> > > from multiple databases into one and at the same time keep the origin of 
>> > > each replicated data identified by a tenanant_id column. In this case we 
>> > > also need the ability to define the new column as an additional key 
>> > > which will participate into the destination table.
>> > >
>> > > Tenant 1 table
>> > > id serial pk
>> > > description varchar
>> > >
>> > > Tenant 2 table
>> > > id integer pk
>> > > description varchar
>> > >
>> > > Group table
>> > > tenant integer pk
>> > > id integer pk
>> > > description varchar
>> > >
>> > > Possible syntax to archive that
>> > > CREATE PUBLICATION pb_test FOR TABLE test ({value:datatype:iskey:alias} 
>> > > ,"id", "name")
>> > >
>> > > Example
>> > > CREATE PUBLICATION pb_test FOR TABLE test ({1:integer:true:tenant} 
>> > > ,"id", "name")
>> >
>> > I think that's a valid usecase.
>> >
>> > This looks more like a subscription option to me. In multi-subscriber
>> > multi-publisher scenarios, on one subscriber a given upstream may be
>> > tenant 1 but on some other it could be 2. But I don't think we allow
>> > specifying subscription options for a single table. AFAIU, the origin
>> > ids are available as part of the commit record which contained this
>> > change; that's how conflict resolution is supposed to know it. So
>> > somehow the subscriber will need to fetch those from there and set the
>> > tenant.
>> >
>> 
>> Yeah, to me also it appears that we can handle it on the subscriber
>> side. We have the provision of sending origin information in proto.c.
>> But note that by default publishers won't have any origin associated
>> with change unless someone has defined it. I think this work needs
>> more thought but sounds to be an interesting feature.
>> 
>> -- 
>> With Regards,
>> Amit Kapila.


Re: Logical Replication Custom Column Expression

2022-11-22 Thread Stavros Koureas
Sure, this can be implemented as a subscription option, and it will cover
this use case scenario as each subscriber points only to one database.
I also have some more analytical/reporting use-cases which need additions
in logical-replication, I am not sure if I need to open
different discussions for each one, all ideas are for
publication/subscription.

Στις Τρί 22 Νοε 2022 στις 2:22 μ.μ., ο/η Amit Kapila <
amit.kapil...@gmail.com> έγραψε:

> On Mon, Nov 21, 2022 at 5:05 PM Ashutosh Bapat
>  wrote:
> >
> > On Sat, Nov 19, 2022 at 6:47 PM Stavros Koureas
> >  wrote:
> > >
> > > What does not support is the option for defining custom column
> expressions, as keys or values, into the upstream (publication). This will
> give more flexibility into making replication from multiple upstreams into
> less downstreams adding more logic. For instance, in a project for
> analytical purposes there is the need to consolidate data from multiple
> databases into one and at the same time keep the origin of each replicated
> data identified by a tenanant_id column. In this case we also need the
> ability to define the new column as an additional key which will
> participate into the destination table.
> > >
> > > Tenant 1 table
> > > id serial pk
> > > description varchar
> > >
> > > Tenant 2 table
> > > id integer pk
> > > description varchar
> > >
> > > Group table
> > > tenant integer pk
> > > id integer pk
> > > description varchar
> > >
> > > Possible syntax to archive that
> > > CREATE PUBLICATION pb_test FOR TABLE test
> ({value:datatype:iskey:alias} ,"id", "name")
> > >
> > > Example
> > > CREATE PUBLICATION pb_test FOR TABLE test ({1:integer:true:tenant}
> ,"id", "name")
> >
> > I think that's a valid usecase.
> >
> > This looks more like a subscription option to me. In multi-subscriber
> > multi-publisher scenarios, on one subscriber a given upstream may be
> > tenant 1 but on some other it could be 2. But I don't think we allow
> > specifying subscription options for a single table. AFAIU, the origin
> > ids are available as part of the commit record which contained this
> > change; that's how conflict resolution is supposed to know it. So
> > somehow the subscriber will need to fetch those from there and set the
> > tenant.
> >
>
> Yeah, to me also it appears that we can handle it on the subscriber
> side. We have the provision of sending origin information in proto.c.
> But note that by default publishers won't have any origin associated
> with change unless someone has defined it. I think this work needs
> more thought but sounds to be an interesting feature.
>
> --
> With Regards,
> Amit Kapila.
>


Logical Replication Custom Column Expression

2022-11-19 Thread Stavros Koureas
Hi all,

Working with PostgreSQL Logical Replication is just great! It helps a lot
doing real time replication for analytical purposes without using any other
3d party service. Although all these years working as product architect of
reporting i have noted a few requirements which are always a challenge and
may help enhance logical replication even better.

To the point:
PostgreSQL14 Logical Replication allows replication of a table to another
table that exists in another database or even in another host. It also
allows multiple upstream tables using the same structure to downstream into
a single table.
*CREATE PUBLICATION pb_test FOR TABLE test*

PostgreSQL15 Logical Replication allows even better replication options,
like selecting subsets of the columns from publisher tables. It also
supports plenty of options like disable_on_error etc.
*CREATE PUBLICATION pb_test FOR TABLE test ("id", "name")*

What does not support is the option for defining custom column expressions,
as keys or values, into the upstream (publication). This will give more
flexibility into making replication from multiple upstreams into less
downstreams adding more logic. For instance, in a project for analytical
purposes there is the need to consolidate data from multiple databases into
one and at the same time keep the origin of each replicated data
identified by a tenanant_id column. In this case we also need the ability
to define the new column as an additional key which will participate into
the destination table.

Tenant 1 table
id serial pk
description varchar

Tenant 2 table
id integer pk
description varchar

Group table
tenant integer pk
id integer pk
description varchar

Possible syntax to archive that
*CREATE PUBLICATION pb_test FOR TABLE test ({value:datatype:iskey:alias}
,"id", "name")*

Example
*CREATE PUBLICATION pb_test FOR TABLE test ({1:integer:true:tenant} ,"id",
"name")*

I suppose the column definition should exist in the publication syntax as
the publication should know from before the datatype and if is a key before
being consumed by a subscriber which may already have the column.

So making an insert or update or delete statement into a tenant 1 database:
INSERT INTO test (id, description) VALUES (5, 'data')
UPDATE test SET description = 'data' WHERE id = 5
DELETE FROM test WHERE id = 5
Will be reflected into subscriber as the following
INSERT INTO test (tenant, id, description) VALUES (1, 5, 'data')
UPDATE test SET description = 'data' WHERE tenant=1 AND id = 5
DELETE FROM test WHERE tenant=1 AND id = 5

For more clarifications please reach me at koureasstav...@gmail.com
Thanks!