Re: [HACKERS] Design for In-Core Logical Replication

2016-08-04 Thread Simon Riggs
On 29 July 2016 at 16:53, Robert Haas  wrote:

> I think that to really understand exactly what you and Petr have in
> mind, we'd need a description of where publication and subscription
> data is stored within the server, and exactly what gets stored.
> Perhaps that will come in a later email.  I'm not bashing the design,
> exactly, I just can't quite see how all of the pieces fit together
> yet.

Sure no problem.

It's clear there are about 12 ways of putting this together and each
way has various terms needed.

We need input from many people to get this right, since this is much
more about UI than system architecture.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-29 Thread Alvaro Herrera
Robert Haas wrote:

> One minor comment is that this document makes extensive use of Terms
> With Initial Capitals, which seems stylistically odd, although I'm not
> precisely sure what would be better.

We use publication on the first use only, which is turned
into italics.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-29 Thread Robert Haas
On Wed, Jul 20, 2016 at 4:08 AM, Simon Riggs  wrote:
> In this post, Petr and I present a joint view on a design for how this
> should work in-core, based upon our implementation experiences with physical
> replication, pglogical and various comments so far.
>
> Note that this has substantial user-visible differences from pglogical,
> though much of the underlying architecture is reused.
>
> I should stress that not all of the aspects are implemented yet. The post
> here today is a combination of all of our attempts to bring architecture,
> usability and security into one place, including a coherent way of
> describing the features and how they work.
>
> Your comments and questions are sought now as we begin the main development
> effort to get this into PostgreSQL 10.0

Thanks for publishing this.

One minor comment is that this document makes extensive use of Terms
With Initial Capitals, which seems stylistically odd, although I'm not
precisely sure what would be better.

I would have expected that there would be a concept of a REPLICATION
SET, defining which tables are to be replicated; here, that seems to
be the Publication.  That may fine, but I wonder if there is any value
in separating those things.  It's clear, for example, that a
replication set can be dumped: which tables are members of which
replication sets is durable metadata.  It's less clear that a
publication can be dumped; that might include things which are not
durable metadata, such as associations with slots.

It's generally not really clear to me based on reading this exactly
what information is encapsulated in a Publication or a Subscription,
which makes it hard to evaluate design decisions like this one:

>   
> The definition of a Publication object will be included within
> pg_dump by default when all of the objects in the Publication are
> requested as part of the dump specification.
>   
>   
> Subscriptions are not dumped by pg_dump by default, but can be
> requested using --subscriptions parameter.
>   

I think that to really understand exactly what you and Petr have in
mind, we'd need a description of where publication and subscription
data is stored within the server, and exactly what gets stored.
Perhaps that will come in a later email.  I'm not bashing the design,
exactly, I just can't quite see how all of the pieces fit together
yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-26 Thread Petr Jelinek

On 26/07/16 00:05, Hannu Krosing wrote:


CREATE PUBLICATION mypub;
ALTER PUBLICATION mypub ADD TABLE users, departments;


Would a subscription just be a logical grouping or would it be something
stronger
like meaning atomic subscriptions and/or a dedicated replication slot ?



Not sure what you mean by atomic subscription but subscription creation 
adds replication slot to the provider node. Other than that subscription 
lives on the downstream node only.



Can you subscribe to multiple publications through single SUBSCRIPTION ?



Yes.


What is supposed to happen if table A is in two subscriptions S1 and S2,
and you
subscribe to both? Will you get table a only once (both initial copy and
events)?


Yes only once, the replication works with tables, publication is really 
just grouping/filtering, what you get is union of tables in the 
publications.




Would a subscription of "mypub" pop up on subscriber side atomically, or
will subscribed
tables appear one-by one when they are ready (initial copy + catchup
event replay completed) ?



Yes that's my plan as that makes it easier to parallelize and recover 
from crashes (also makes this faster as tables that are already done 
don't need to be copied again) during the initialization. Also makes it 
easier to reuse the table initialization code for adding new tables at 
later time.




CREATE SUBSCRIPTION mysub WITH CONNECTION dbname=foo host=bar
user=repuser PUBLICATION mypub;


For the pgq-like version which consider a PUBLICATION just as list of
tables to subscribe, I would add

CREATE SUBSCRIPTION mysub WITH CONNECTION 'dbname=foo host=bar
user=repuser' PUBLICATION mypub, mypub1;



Yes that works as well.


ALTER SUBSCRIPTION mysub DROP PUBLICATION mypub1;

ALTER SUBSCRIPTION mysub ADD PUBLICATION mypub2;



This does not yet, but I agree we should have it.

--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-25 Thread Hannu Krosing
On 07/20/2016 10:08 AM, Simon Riggs wrote:
> 
>   Monitoring
>   
> pg_stat_replication
>   
>   
> pg_stat_subscription
>   
> 

and probably also `pg_stat_publication` or some other way to see, what
tables are currently in a PUBLICATION, who has subscribed etc.

> 
> CREATE PUBLICATION mypub;
> ALTER PUBLICATION mypub ADD TABLE users, departments;
> 
Would a subscription just be a logical grouping or would it be something
stronger
like meaning atomic subscriptions and/or a dedicated replication slot ?

Can you subscribe to multiple publications through single SUBSCRIPTION ?

What is supposed to happen if table A is in two subscriptions S1 and S2,
and you
subscribe to both? Will you get table a only once (both initial copy and
events)?

Would a subscription of "mypub" pop up on subscriber side atomically, or
will subscribed
tables appear one-by one when they are ready (initial copy + catchup
event replay completed) ?

I recall that one of the drivers of developing pgq/skytools to replace
Slony was the
fact that Slony's "replication group" design made it very easy to
blunder subscription
changes in more complex topologies which manifested in deadlocks.

PGQ-s table-by-table subscription avoided this entirely at the cost on
non-atomic
subscribed table appearance.

Of course once subscribed, everything was transaction-consistent again.

> 
> CREATE SUBSCRIPTION mysub WITH CONNECTION dbname=foo host=bar
> user=repuser PUBLICATION mypub;
> 
For the pgq-like version which consider a PUBLICATION just as list of
tables to subscribe, I would add

CREATE SUBSCRIPTION mysub WITH CONNECTION 'dbname=foo host=bar
user=repuser' PUBLICATION mypub, mypub1;

ALTER SUBSCRIPTION mysub DROP PUBLICATION mypub1;

ALTER SUBSCRIPTION mysub ADD PUBLICATION mypub2;





-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-22 Thread Jim Nasby

On 7/20/16 8:35 PM, Craig Ringer wrote:

I'm not sure how Petr's current design for in-core replication addresses
this, if it does, or whether it's presently focused only on
point-to-point replication like pglogical. As far as I'm concerned so
long as it does direct point-to-point replication with no forwarding
that's good enough for a first cut feature, so long as the UI, catalog
and schema design leaves room for adding more later.


That's the part I'm pretty worried about. I'd feel a lot better if there 
was at least a wiki page of future topology/communication features and a 
sketch of a design.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-22 Thread Jim Nasby
ISTM pg_dump should alter publication/subscription when doing partial 
dumps, similar to how it handles foreign keys.


Subscribed tables should be read-only by default. If you need something 
more advanced than that adding an extra option to the subscription (or 
wherever else) is going to be the least of your worries. If we don't 
want to get too wrapped up in config/syntax I think there should at 
least be a read-only statement level trigger installed by default that 
users can then drop or disable if desired.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-22 Thread Craig Ringer
On 22 July 2016 at 22:55, Simon Riggs  wrote:


>
> Filtering data at both the sender and receiver has been requested, so
> I guess it makes sense to have a WHERE clause on both the Publication
> and the Subscription, but I will think more on that.
>

Yeah, output plugin level filtering predicates are certainly a desirable
feature for down the track.

One important factor to consider there is that we're very restricted in
what we can safely access. Currently the historical snapshot infrastructure
set up by logical decoding doesn't stop you trying to access tables that
you can't safely access, you might just get surprising/wrong results or
failures. If we're going to allow arbitrary user defined predicates we'll
likely need to change that so we explicitly ERROR when an attempt to
heap_open a table that isn't accessible during logical decoding is made.
Probably not hard, likely a good idea anyway.

Such predicates should generally be simple, though with less strict
requirements than CHECK constraints. We can probably do subqueries and I
don't think functions have to be immutable, though obviously anything that
tries to allocate an xid will fail.

Transforming data at the receiver/Apply side we had envisaged would be
> done using standard triggers. So we had envisaged you would add the
> trigger onto the table using the existing trigger syntax, so that when
> an INSERT was applied the trigger would execute, using the standard
> mechanisms.
>

Yeah. In the medium term at least.

I imagine eventually we'll want more powerful transforms that doesn't
require the overhead of trigger function calls, but that's a long way
down the road.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-22 Thread Simon Riggs
On 22 July 2016 at 09:54, Dmitriy Sarafannikov  wrote:
>
>>
>>CREATE PUBLICATION mypub;
>>ALTER PUBLICATION mypub ADD TABLE users, departments;
>
>>CREATE SUBSCRIPTION mysub WITH CONNECTION dbname=foo host=bar 
>>user=repuser PUBLICATION mypub;
>
>>The above will start the replication process which synchronizes the
>>initial table contents of users and
>>departments tables and then starts replicating
>>incremental changes to those tables.
>
> Hi, hackers.
>
> it is very good to have logical replication in core. Also i have some 
> proposal.

Thanks for your input.

> What if we would have ability to execute custom trigger functions on events 
> on particular table? Also it would be useful if would have ability to ignore 
> some tables in publication or replicatie with some WHERE condition.

Both of those requirements are eventual goals for this. The initial
commits of Logical Replication for 10.0 won't include those features,
keeping the scope tight so we make sure we get this into 10.0, but we
want to design it with those things in mind and we may be lucky enough
to get it in the first release.

> For example, we want replicate table "users" as is (maybe with some WHERE 
> conditions), but on events on table "departments" we want execute trigger 
> function departments_event_handler().

> ALTER SUBSCRIPTION mysub ADD TABLE users REPLICATE TO LOCAL TABLE my_users 
> INSERT WHERE new.id_user > 1000 UPDATE WHERE old.id_user < 1000; -- we don't 
> want replicate deletes, for example.
> ALTER SUBSCRIPTION mysub ADD TABLE departments ON INSERT WHEN (new.id_user > 
> 1000)  EXECUTE PROCEDURE departments_event_handler(); -- just like trigger

Filtering data at both the sender and receiver has been requested, so
I guess it makes sense to have a WHERE clause on both the Publication
and the Subscription, but I will think more on that.

Transforming data at the receiver/Apply side we had envisaged would be
done using standard triggers. So we had envisaged you would add the
trigger onto the table using the existing trigger syntax, so that when
an INSERT was applied the trigger would execute, using the standard
mechanisms.
So I don't think we need to add trigger style syntax onto the
Subscription, but it is an option to consider, I guess.

(Note to self: make sure we get the dependencies right here so when we
drop objects they are automatically removed from Subscriptions, whilst
allowing Subscriptions to be dropped without causing tables to be
dropped (and Publications also)).

> And we don't want handle events on third table which was added to publication.

Not sure what you mean for that part. Could you explain a little more?

(I'm on leave for some days, so I don't respond quickly I will respond
eventually)

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] Design for In-Core Logical Replication

2016-07-22 Thread Dmitriy Sarafannikov

>
>CREATE PUBLICATION mypub;
>ALTER PUBLICATION mypub ADD TABLE users, departments;

>CREATE SUBSCRIPTION mysub WITH CONNECTION dbname=foo host=bar 
>user=repuser PUBLICATION mypub;

>    The above will start the replication process which synchronizes the
>    initial table contents of users and
>    departments tables and then starts replicating
>    incremental changes to those tables.

Hi, hackers.

it is very good to have logical replication in core. Also i have some proposal. 
What if we would have ability to execute custom trigger functions on events on 
particular table? Also it would be useful if would have ability to ignore some 
tables in publication or replicatie with some WHERE condition. For example, we 
want replicate table "users" as is (maybe with some WHERE conditions), but on 
events on table "departments" we want execute trigger function 
departments_event_handler(). And we don't want handle events on third table 
which was added to publication.

Something like this:

CREATE PUBLICATION mypub;
ALTER PUBLICATION mypub ADD TABLE users, departments, unnecessary_tbl;

CREATE SUBSCRIPTION mysub WITH CONNECTION dbname=foo host=bar 
user=repuser PUBLICATION mypub;
ALTER SUBSCRIPTION mysub ADD TABLE users REPLICATE TO LOCAL TABLE my_users 
INSERT WHERE new.id_user > 1000 UPDATE WHERE old.id_user < 1000; -- we don't 
want replicate deletes, for example.
ALTER SUBSCRIPTION mysub ADD TABLE departments ON INSERT WHEN (new.id_user > 
1000)  EXECUTE PROCEDURE departments_event_handler(); -- just like trigger


Regards,
Dmitriy Sarafannikov
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-20 Thread Craig Ringer
On 21 July 2016 at 11:05, Joshua D. Drake  wrote:

> On 07/20/2016 06:35 PM, Craig Ringer wrote:
>
> First, I'd like to emphasise that logical replication has been stalled
>> for ages now because we can no longer make forward progress on core
>> features needed until we have in-core logical replication (they're
>> dismissed as irrelevant, no in core users, etc) - but we have also had
>> difficulty getting logical replication into core. To break this impasse
>> we really need logical replication in core and need to focus on getting
>> the minimum viable feature in place, not trying to make it do everything
>> all at once. Point-to-point replication with no forwarding should be
>> just fine for the first release. Lets not bog this in extra "must have"
>> features that aren't actually crucial.
>>
>
> I don't think any person who actually works on postgresql with customers
> and clearly deals with "competition" can state with any sanity that we
> don't need logical replication in core.


No, and while people used to, we're past that now.

However, infrastructure improvements to make out-of-tree logical
replication that we can get into user hands *now* rather than two+ years
from now have been getting knocked back because there's no in-tree user,
and tools like pglogical dismissed as irrelevant. Once we have logical
replication in core hopefully we can start making infrastructure progress
again as well.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-20 Thread Joshua D. Drake

On 07/20/2016 06:35 PM, Craig Ringer wrote:


First, I'd like to emphasise that logical replication has been stalled
for ages now because we can no longer make forward progress on core
features needed until we have in-core logical replication (they're
dismissed as irrelevant, no in core users, etc) - but we have also had
difficulty getting logical replication into core. To break this impasse
we really need logical replication in core and need to focus on getting
the minimum viable feature in place, not trying to make it do everything
all at once. Point-to-point replication with no forwarding should be
just fine for the first release. Lets not bog this in extra "must have"
features that aren't actually crucial.


I don't think any person who actually works on postgresql with customers 
and clearly deals with "competition" can state with any sanity that we 
don't need logical replication in core.


JD


--
Command Prompt, Inc.  http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-20 Thread Craig Ringer
On 21 July 2016 at 01:20, Simon Riggs  wrote:

> On 20 July 2016 at 17:52, Rod Taylor  wrote:
>
>
>> I think it's important for communication channels to be defined
>> separately from the subscriptions.
>>
>
> I agree and believe it will be that way.
>
> Craig is working on allowing Replication Slots to failover between nodes,
> to provide exactly that requested behaviour.
>
>

First, I'd like to emphasise that logical replication has been stalled for
ages now because we can no longer make forward progress on core features
needed until we have in-core logical replication (they're dismissed as
irrelevant, no in core users, etc) - but we have also had difficulty
getting logical replication into core. To break this impasse we really need
logical replication in core and need to focus on getting the minimum viable
feature in place, not trying to make it do everything all at once.
Point-to-point replication with no forwarding should be just fine for the
first release. Lets not bog this in extra "must have" features that aren't
actually crucial.

That said:

I had a patch in it for 9.6 to provide the foundations for logical
replication to follow physical failover, but it got pulled at the last
minute. It'll be submitted for 10.0 along with some other enhancements to
make it usable without hacky extensions, most notably support for using a
physical replication slot and hot standby feedback to pin a master's
catalog_xmin where it's needed by slots on a physical replica.

That's for when we're combining physical and logical replication though,
e.g. "node A" is a master/standby pair, and "node B" is also a
master/standby pair.

For non-star logical topologies, which is what I think you might've been
referring to, it's necessary to have:

- Node identity
- Which nodes we want to receive data from
- How we connect to each node

all of which are separate things. Who's out there, what we want from them,
and how to get it.

pglogical doesn't really separate the latter two much at this point.
Subscriptions identify both the node to connect to and the data we want to
receive from a node; there's no selective data forwarding from one node to
another. Though there's room for that in pglogical's hooks/filters by using
filtering by replication origin, it just doesn't do it yet.

It sounds like that's what you're getting at. Wanting to be able to say
"node A wants to get data from node B and node C" separately to "node A
connects to node B to receive data", with the replication system somehow
working out that that means data written from C to B should be forwarded to
A.

Right?

If so, it's not always easy to figure that out. If you create connections
to both B and C, we then have to automagically work out that we should stop
forwarding data from C over our connection to B.

The plan with pglogical has been to allow connections to specify forwarding
options, so the connection explicitly says what nodes it wants to get data
from. It's users' job to ensure that they don't declare connections that
overlap. This is simpler to implement, but harder to admin.

One challenge with either approach is ensuring a consistent switchover. If
you have a single connection A=>B receiving data from [B,C], then you
switch to two connections A=>B and A=>C with neither forwarding, you must
ensure that the switchover occurs in such a way as that no data is
replicated twice or skipped. That's made easier by the fact that we have
replication origins and we can actually safely receive from both at the
same time then discard from one of them, even use upstream filtering to
avoid sending it over the wire twice. But it does take care and caution.

Note that none of this is actually for logical _failover_, where we lose a
node. For that we need some extra help in the form of placeholder slots
maintained on other peers. This can be done at the application /
replication system level without the need for new core features, but it
might not be something we can support in the first iteration.

I'm not sure how Petr's current design for in-core replication addresses
this, if it does, or whether it's presently focused only on point-to-point
replication like pglogical. As far as I'm concerned so long as it does
direct point-to-point replication with no forwarding that's good enough for
a first cut feature, so long as the UI, catalog and schema design leaves
room for adding more later.



> I also suspect multiple publications will be normal even if only 2 nodes.
>> Old slow moving data almost always got different treatment than fast-moving
>> data; even if only defining which set needs to hit the other node first and
>> which set can trickle through later.
>>
>
> Agreed
>
>
Yes, especially since we can currently only stream transactions one by one
in commit order after commit.

Even once we have interleaved xact streaming, though, there will still be
plenty of times we want to receive different sets of data from the 

Re: [HACKERS] Design for In-Core Logical Replication

2016-07-20 Thread Petr Jelinek

On 20/07/16 19:07, Simon Riggs wrote:

On 20 July 2016 at 16:39, Joshua D. Drake > wrote:

  Logical Replication uses a Publish and Subscribe model
with one or
  more Subscribers subscribing to one or more Publications on a
  Provider node. Subscribers pull data from the Publications
they
  subscribe to and may subsequently re-publish data to allow
  cascading replication or more complex configurations.


Is that somehow different than Origin/Subscriber or Master/Slave? If
not, why are we using yet more terms?


Thanks for asking, an important question that we have a chance to get
right before we go too far down the road of implementation.

I'll explain my thinking, so we can discuss the terms I've recommended,
which can be summarized as:
A Provider node has one or more Databases, each of which can publish its
data in zero, one or more PUBLICATIONs. A Subscribing node can receive
data in the form of zero, one or more SUBSCRIBERs, where each SUBSCRIBER
may bring together data from one or more PUBLICATIONs

Here's why...



Just to add to what Simon wrote. There is one more reason for not using 
term origin for this - origin of data does not necessarily have to be on 
the provider database once there is a cascading so it does not really 
map all that well.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-20 Thread Simon Riggs
On 20 July 2016 at 17:52, Rod Taylor  wrote:


> I think it's important for communication channels to be defined separately
> from the subscriptions.
>

I agree and believe it will be that way.

Craig is working on allowing Replication Slots to failover between nodes,
to provide exactly that requested behaviour.


> So, I'd rather have:
>
> CREATE CONNECTION machine1;
> CREATE CONNECTION machine2;
>

I think those map to replication slots. (This discussion might get a bit
confusing if we try to guess exactly what each others terms mean, so I'll
go no further than "I think").


> CREATE SUBSCRIPTION TO PUBLICATION mypub;
>

Yep


> I'm not certain the subscription needs to be named. IMO, a publication
> should have the same properties on all nodes (so any node may become the
> primary source). If a subscriber needs different behaviour for a
> publication, it should be created as a different publication.
>

Understood, its mostly to allow it to be dropped or altered and monitored.
It's kindof like an index, it needs a name, we just don't much care what it
is.


> Documenting that ThisPub is different from ThatPub is easier than
> documenting that ThisPub on node 1/2/4 is different from ThisPub on node
> 7/8, except Node 7 is temporarily on Node 4 too (database X instead of
> database Y) due to that power problem.
>

Which is why pg_dump support is important to allow us to sync up the
definitions.


> Clearly this is advanced. An initial implementation may only allow mypub
> from a single connection.
>

Good input and clearly explained, thanks. If any of the above changes,
these requirements will remain noted.


> I also suspect multiple publications will be normal even if only 2 nodes.
> Old slow moving data almost always got different treatment than fast-moving
> data; even if only defining which set needs to hit the other node first and
> which set can trickle through later.
>

Agreed


-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-20 Thread Simon Riggs
On 20 July 2016 at 16:39, Joshua D. Drake  wrote:


>  Logical Replication is a method of replicating data objects and their
>>  changes, based upon their Primary Keys (or Replication Identity). We
>>
>
> Do we want a limitation based on Primary Key, or would it be possible to
> use just UNIQUE or is that covered under Replication Identity?


That is covered by replication identity.


>
>>  Logical Replication uses a Publish and Subscribe model with one or
>>  more Subscribers subscribing to one or more Publications on a
>>  Provider node. Subscribers pull data from the Publications they
>>  subscribe to and may subsequently re-publish data to allow
>>  cascading replication or more complex configurations.
>>
>
> Is that somehow different than Origin/Subscriber or Master/Slave? If not,
> why are we using yet more terms?


Thanks for asking, an important question that we have a chance to get right
before we go too far down the road of implementation.

Issue: We need a noun for CREATE "SOMETHING" (or pg_create_$NOUN). So what
noun to use? SQLStandard gives us no guidance here.

I'll explain my thinking, so we can discuss the terms I've recommended,
which can be summarized as:
A Provider node has one or more Databases, each of which can publish its
data in zero, one or more PUBLICATIONs. A Subscribing node can receive data
in the form of zero, one or more SUBSCRIBERs, where each SUBSCRIBER may
bring together data from one or more PUBLICATIONs

Here's why...

Master/Slave is not appropriate, since both sending and receiving nodes are
Masters.

Origin/Subscriber is used by Slony. The term "Replication Origin" is
already used in PG9.5 for something related, but not identical.
Provider/Subscriber is used by Londiste.
Bucardo seems to use Master/Slave according to FAQ.

The Noun we are discussing is something that a single Database can have >1
of, so those terms aren't quite appropriate.

pglogical uses Provider/Subscriber and Replication Sets, so I started with
the thought that we might want CREATE REPLICATION SET or
pg_create_replication_set(). After some time considering this, ISTM that
the term "replication set" may not be that useful since we foresee a future
where data is actually filtered and transformed and the feature set extends
well beyond what we have with Slony, so I began looking for a term that was
general and obvious (POLA).

After some thought, I realised that we are describing this as "Publish &
Subscribe", so it makes a lot of sense to just use the terms Publication &
Subscription. Those phrases are commonly used by SQLServer, Sybase, Oracle,
Redis, RabbitMQ etc which is a pretty big set.
It's also a commonly used Enterprise Integration Design pattern
https://en.wikipedia.org/wiki/Publish–subscribe_pattern
I note especially that Publish/Subscribe does not imply any particular
topology (a mistake I made earlier when I called this stuff BDR, which
confused everybody when we tried to talk about a subset of that
functionality called UDR).
http://www.slideshare.net/ishraqabd/publish-subscribe-model-overview-13368808

So that brings us to...
A Provider node has one or more Databases, each of which can publish its
data in zero, one or more PUBLICATIONs. A Subscribing node can receive data
in the form of zero, one or more SUBSCRIBERs, where each SUBSCRIBER may
bring together data from one or more PUBLICATIONs.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-20 Thread Rod Taylor
On Wed, Jul 20, 2016 at 4:08 AM, Simon Riggs  wrote:


>
>   
> And on Subscriber database:
> 
> CREATE SUBSCRIPTION mysub WITH CONNECTION dbname=foo host=bar
> user=repuser PUBLICATION mypub;
> 
>   
>   
> The above will start the replication process which synchronizes the
> initial table contents of users and
> departments tables and then starts replicating
> incremental changes to those tables.
>   
> 
> 
>

I think it's important for communication channels to be defined separately
from the subscriptions.

If I have nodes 1/2 + 3/4 which operate in pairs, I don't really want to
have to have a script reconfigure replication on 3/4 every-time we do
maintenance on 1 or 2.

3/4 need to know they subscribe to mypub and that they have connections to
machine 1 and machine 2. The replication system should be able to figure
out which (of 1/2) has the most recently available data.


So, I'd rather have:

CREATE CONNECTION machine1;
CREATE CONNECTION machine2;
CREATE SUBSCRIPTION TO PUBLICATION mypub;

Notice I explicitly did not tell it how to get the publication but if we
did have a preference the DNS weighting model might be appropriate.

I'm not certain the subscription needs to be named. IMO, a publication
should have the same properties on all nodes (so any node may become the
primary source). If a subscriber needs different behaviour for a
publication, it should be created as a different publication.

Documenting that ThisPub is different from ThatPub is easier than
documenting that ThisPub on node 1/2/4 is different from ThisPub on node
7/8, except Node 7 is temporarily on Node 4 too (database X instead of
database Y) due to that power problem.


Clearly this is advanced. An initial implementation may only allow mypub
from a single connection.


I also suspect multiple publications will be normal even if only 2 nodes.
Old slow moving data almost always got different treatment than fast-moving
data; even if only defining which set needs to hit the other node first and
which set can trickle through later.

regards,

Rod Taylor


Re: [HACKERS] Design for In-Core Logical Replication

2016-07-20 Thread Joshua D. Drake

On 07/20/2016 01:08 AM, Simon Riggs wrote:







   Logical Replication
   
 Logical Replication is a method of replicating data objects and their
 changes, based upon their Primary Keys (or Replication Identity). We


Do we want a limitation based on Primary Key, or would it be possible to 
use just UNIQUE or is that covered under Replication Identity?



   
 Logical Replication uses a Publish and Subscribe model with one or
 more Subscribers subscribing to one or more Publications on a
 Provider node. Subscribers pull data from the Publications they
 subscribe to and may subsequently re-publish data to allow
 cascading replication or more complex configurations.


Is that somehow different than Origin/Subscriber or Master/Slave? If 
not, why are we using yet more terms?




   Publication
   
 A Publication object can be defined on any master node, owned by one
 user. A Publication is a set of changes generated from a group of
 tables, and might also be described as a Change Set or Replication Set.
 Each Publication exists in only one database.


  Then on Provider database:


CREATE PUBLICATION mypub;
ALTER PUBLICATION mypub ADD TABLE users, departments;

   


Outside of my previous comments on reusing terminology that is known to 
our community, I like this. Basically a user creates a pool that is 
replicating, throws various ducks and small children into the pool and 
then replicates. Nice.


Sincerely,

JD


--
Command Prompt, Inc.  http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Design for In-Core Logical Replication

2016-07-20 Thread Simon Riggs
At PgCon we discussed that Petr Jelinek would be working on the code for an
in-core logical replication implementation, while I would work on user
interface/security models. Petr has been actively working on the code and
will post patch in a few weeks, as discussed and agreed. Craig Ringer is
also active in coding necessary aspects. None of those things are discussed
further here at this time.

In this post, Petr and I present a joint view on a design for how this
should work in-core, based upon our implementation experiences with
physical replication, pglogical and various comments so far.

Note that this has substantial user-visible differences from pglogical,
though much of the underlying architecture is reused.

I should stress that not all of the aspects are implemented yet. The post
here today is a combination of all of our attempts to bring architecture,
usability and security into one place, including a coherent way of
describing the features and how they work.

Your comments and questions are sought now as we begin the main development
effort to get this into PostgreSQL 10.0






  Logical Replication
  
Logical Replication is a method of replicating data objects and their
changes, based upon their Primary Keys (or Replication Identity). We
use the term Logical in contrast to Physical replication which
uses exact block addresses and byte-by-byte replication.
PostgreSQL supports both mechanisms concurrently, see
. Logical Replication allows
fine-grained control over both data replication and security.
  
  
Logical Replication uses a Publish and Subscribe model with one or
more Subscribers subscribing to one or more Publications on a
Provider node. Subscribers pull data from the Publications they
subscribe to and may subsequently re-publish data to allow
cascading replication or more complex configurations.
  
  
Data for committed transactions is streamed in real-time to each
Subscriber.  Logical replication might also be described as Change
Data Capture (CDC) or Transactional Replication.
  
  
The typical use-cases for logical replication are:
  
  

  
Replicating between different major versions of the PostgreSQL
  


  
Replicating a database in full to another master node.
  


  
Replicating a subset of a database to another master node.
  


  
Firing triggers for individual changes as they are incoming to
subscriber.
  


  
Gathering data from multiple databases into a single one (for
example
for analytical purposes).
  

  


  Publication
  
A Publication object can be defined on any master node, owned by one
user. A Publication is a set of changes generated from a group of
tables, and might also be described as a Change Set or Replication Set.
Each Publication exists in only one database.
  
  
Publications are different from table schema and do not affect
how the table is accessed. Each table can be added to multiple
Publications if needed.  Publications may include both tables
and materialized views. Objects must be added explicitly, except
when a Publication is created for "ALL TABLES". There is no
default name for a Publication which specifies all tables.
  
  
Tables added to a Publication must be accessible via SELECT
privilege for the user owning the Publication. Usage on the
Publication can be GRANTed to other users.
  
  
Publications can choose to limit the changes they show using any
combination of INSERT, UPDATE, DELETE and TRUNCATE in a similar
way to the way triggers are fired by particular event types.
  
  
When UPDATEs and DELETEs are replicated by a Publication, all tables
added must have a unique index present on the REPLICA IDENTITY for
the table, or the addition will be refused.
  
  
The definition of a Publication object will be included within
pg_dump by default when all of the objects in the Publication are
requested as part of the dump specification.
  
  
Every Publication can have zero, one or more Subscribers.
  
  
Publications are created using the 
command and may be later altered or dropped using corresponding
commands.
  
  
The individual tables can be added and removed dynamically using
. Both the ADD TABLE and DROP
TABLE operations are transactional so the table will start or stop
replicating at the correct snapshot once the transaction has committed.
  


  Subscription
  
A Subscription is the downstream side of the Logical Replication. The
node where Subscription is defined is referred to as Subscriber.
Subscription defines the connection to another database and set of
Publications (one or more) to which it wants to be subscribed.
It is possible to have a Subscription that currently has no
Publications.