subject:"Re\: \[HACKERS\] \[PATCH 10\/16\] Introduce the concept that wal has a 'origin' node"

On 20 June 2012 11:26, Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Simon Riggs si...@2ndquadrant.com wrote:
 The proposal is to use WAL to generate the logical change stream.
 That has been shown in testing to be around x4 faster than having
 a separate change stream, which must also be WAL logged (as Jan
 noted).

 Sure, that's why I want it.

 I think this argument is basically circular.  The reason it's 4x faster
 is that the WAL stream doesn't actually contain all the information
 needed to generate LCRs (thus all the angst about maintaining catalogs
 in sync, what to do about unfriendly datatypes, etc).  By the time the
 dust has settled and you have a workable system, you will have bloated
 WAL and given back a large chunk of that multiple, thereby invalidating
 the design premise.  Or at least that's my prediction.

The tests were conducted with the additional field added, so your
prediction is not verified.

The additional fields do not bloat WAL records - they take up exactly
the same space as before.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node


On 20.06.2012 01:27, Kevin Grittner wrote:

Andres Freundand...@2ndquadrant.com  wrote:


Yes, thats definitely a valid use-case. But that doesn't preclude
the other - also not uncommon - use-case where you want to have
different master which all contain up2date data.


I agree.  I was just saying that while one requires an origin_id,
the other doesn't.  And those not doing MM replication definitely
don't need it.


I think it would be helpful to list down a few concrete examples of 
this. The stereotypical multi-master scenario is that you have a single 
table that's replicated to two servers, and you can insert/update/delete 
on either server. Conflict resolution stretegies vary.


The reason we need an origin id in this scenario is that otherwise this 
will happen:


1. A row is updated on node A
2. Node B receives the WAL record from A, and updates the corresponding 
row in B. This generates a new WAL record.
3. Node A receives the WAL record from B, and updates the rows again. 
This again generates a new WAL record, which is replicated to A, and you 
loop indefinitely.


If each WAL record carries an origin id, node A can use it to refrain 
from applying the WAL record it receives from B, which breaks the loop.


However, note that in this simple scenario, if the logical log replay / 
conflict resolution is smart enough to recognize that the row has 
already been updated, because the old and the new rows are identical, 
the loop is broken at step 3 even without the origin id. That works for 
the newest-update-wins and similar strategies. So the origin id is not 
absolutely necessary in this case.


Another interesting scenario is that you maintain a global counter, like 
in an inventory system, and conflicts are resolved by accumulating the 
updates. For example, if you do UPDATE SET counter = counter + 1 
simultaneously on two nodes, the result is that the counter is 
incremented by two. The avoid-update-if-already-identical optimization 
doesn't help in this case, the origin id is necessary.


Now, let's take the inventory system example further. There are actually 
two ways to update a counter. One is when an item is checked in or out 
of the warehouse, ie. UPDATE counter = counter + 1. Those updates 
should accumulate. But another operation resets the counter to a 
specific value, UPDATE counter = 10, like when taking an inventory. 
That should not accumulate with other changes, but should be 
newest-update-wins. The origin id is not enough for that, because by 
looking at the WAL record and the origin id, you don't know which type 
of an update it was.


So, I don't like the idea of adding the origin id to the record header. 
It's only required in some occasions, and on some record types. And I'm 
worried it might not even be enough in more complicated scenarios.


Perhaps we need a more generic WAL record annotation system, where a 
plugin can tack arbitrary information to WAL records. The extra 
information could be stored in the WAL record after the rmgr payload, 
similar to how backup blocks are stored. WAL replay could just ignore 
the annotations, but a replication system could use it to store the 
origin id or whatever extra information it needs.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 08:35, Robert Haas robertmh...@gmail.com wrote:

 I expect it would be fine to have a tool that pulls LCRs out of WAL to
 prepare that to be sent to remote locations.  Is that what you have in
 mind?

 Yes.  I think it should be possible to generate LCRs from WAL, but I
 think that the on-the-wire format for LCRs should be different from
 the WAL format.  Trying to use the same format for both things seems
 like an unpleasant straightjacket.

You're confusing things here.

Yes, we can use a different on-the-wire format. No problem.

As I've already said, the information needs to be in WAL first before
we can put it into LCRs, so the don't use same format argument is
not relevant as to why the info must be on the WAL record in the first
place. And as said elsewhere, doing that does not cause problems in
any of these areas:  wal bloat, performance, long term restrictions on
numbers of nodeids, as has so far been claimed on this thread.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 14:40, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 The reason we need an origin id in this scenario is that otherwise this will
 happen:

 1. A row is updated on node A
 2. Node B receives the WAL record from A, and updates the corresponding row
 in B. This generates a new WAL record.
 3. Node A receives the WAL record from B, and updates the rows again. This
 again generates a new WAL record, which is replicated to A, and you loop
 indefinitely.

 If each WAL record carries an origin id, node A can use it to refrain from
 applying the WAL record it receives from B, which breaks the loop.

 However, note that in this simple scenario, if the logical log replay /
 conflict resolution is smart enough to recognize that the row has already
 been updated, because the old and the new rows are identical, the loop is
 broken at step 3 even without the origin id. That works for the
 newest-update-wins and similar strategies. So the origin id is not
 absolutely necessary in this case.

Including the origin id in the WAL allows us to filter out WAL records
when we generate LCRs, so we can completely avoid step 3, including
all of the CPU, disk and network overhead that implies. Simply put, we
know the change came from A, so no need to send the change to A again.

If we do allow step 3 to exist, we still need to send the origin id.
This is because there is no apriori way to know the origin id is not
required, such as the case where we have concurrent updates which is
effectively a race condition between actions on separate nodes. Not
sending the origin id because it is not required in some cases is
equivalent to saying we can skip locking because a race condition does
not happen in all cases. Making a case that the race condition is rare
is still not a case for skipping locking. Same here: we need the
information to avoid making errors in the general case.

 Another interesting scenario is that you maintain a global counter, like in
 an inventory system, and conflicts are resolved by accumulating the updates.
 For example, if you do UPDATE SET counter = counter + 1 simultaneously on
 two nodes, the result is that the counter is incremented by two. The
 avoid-update-if-already-identical optimization doesn't help in this case,
 the origin id is necessary.

 Now, let's take the inventory system example further. There are actually two
 ways to update a counter. One is when an item is checked in or out of the
 warehouse, ie. UPDATE counter = counter + 1. Those updates should
 accumulate. But another operation resets the counter to a specific value,
 UPDATE counter = 10, like when taking an inventory. That should not
 accumulate with other changes, but should be newest-update-wins. The origin
 id is not enough for that, because by looking at the WAL record and the
 origin id, you don't know which type of an update it was.

Yes, of course. Conflict handling in the general case requires much
additional work. This thread is one minor change of many related
changes. The patches are being submitted in smaller chunks to ease
review, so sometimes there are cross links between things.

 So, I don't like the idea of adding the origin id to the record header. It's
 only required in some occasions, and on some record types.

That conclusion doesn't follow from your stated arguments.

 And I'm worried
 it might not even be enough in more complicated scenarios.

It is not the only required conflict mechanism, and has never been
claimed to be so. It is simply one piece of information needed, at
various times.


 Perhaps we need a more generic WAL record annotation system, where a plugin
 can tack arbitrary information to WAL records. The extra information could
 be stored in the WAL record after the rmgr payload, similar to how backup
 blocks are stored. WAL replay could just ignore the annotations, but a
 replication system could use it to store the origin id or whatever extra
 information it needs.

Additional information required by logical information will be handled
by a new wal_level.

The discussion here is about adding origin_node_id *only*, which needs
to be added on each WAL record.

One question I raised in my review was whether this extra information
should be added by a variable length header, so I already asked this
very question. So far there is no evidence that the additional code
complexity would be warranted. If it became so in the future, it can
be modified again. At this stage there is no need, so the proposal is
to add the field to every WAL record without regard to the setting of
wal_level because there is no measurable downside to doing so. The
downsides of additional complexity are clear and real however, so I
wish to avoid them.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node


On 20.06.2012 10:32, Simon Riggs wrote:

On 20 June 2012 14:40, Heikki Linnakangas

And I'm worried
it might not even be enough in more complicated scenarios.


It is not the only required conflict mechanism, and has never been
claimed to be so. It is simply one piece of information needed, at
various times.


So, if the origin id is not sufficient for some conflict resolution 
mechanisms, what extra information do you need for those, and where do 
you put it?



Perhaps we need a more generic WAL record annotation system, where a plugin
can tack arbitrary information to WAL records. The extra information could
be stored in the WAL record after the rmgr payload, similar to how backup
blocks are stored. WAL replay could just ignore the annotations, but a
replication system could use it to store the origin id or whatever extra
information it needs.


Additional information required by logical information will be handled
by a new wal_level.

The discussion here is about adding origin_node_id *only*, which needs
to be added on each WAL record.


If that's all we can discuss here is, and all other options are off the 
table, then I'll have to just outright object to this patch. Let's 
implement what we can without the origin id, and revisit this later.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 15:45, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 20.06.2012 10:32, Simon Riggs wrote:

 On 20 June 2012 14:40, Heikki Linnakangas

 And I'm worried

 it might not even be enough in more complicated scenarios.


 It is not the only required conflict mechanism, and has never been
 claimed to be so. It is simply one piece of information needed, at
 various times.


 So, if the origin id is not sufficient for some conflict resolution
 mechanisms, what extra information do you need for those, and where do you
 put it?

As explained elsewhere, wal_level = logical (or similar) would be used
to provide any additional logical information required.

Update and Delete WAL records already need to be different in that
mode, so additional info would be placed there, if there were any.

In the case of reflexive updates you raised, a typical response in
other DBMS would be to represent the query
  UPDATE SET counter = counter + 1
by sending just the +1 part, not the current value of counter, as
would be the case with the non-reflexive update
  UPDATE SET counter = 1

Handling such things in Postgres would require some subtlety, which
would not be resolved in first release but is pretty certain not to
require any changes to the WAL record header as a way of resolving it.
Having already thought about it, I'd estimate that is a very long
discussion and not relevant to the OT, but if you wish to have it
here, I won't stop you.


 Perhaps we need a more generic WAL record annotation system, where a
 plugin
 can tack arbitrary information to WAL records. The extra information
 could
 be stored in the WAL record after the rmgr payload, similar to how backup
 blocks are stored. WAL replay could just ignore the annotations, but a
 replication system could use it to store the origin id or whatever extra
 information it needs.


 Additional information required by logical information will be handled
 by a new wal_level.

 The discussion here is about adding origin_node_id *only*, which needs
 to be added on each WAL record.


 If that's all we can discuss here is, and all other options are off the
 table, then I'll have to just outright object to this patch. Let's implement
 what we can without the origin id, and revisit this later.

As explained, we can do nothing without the origin id. It is not
optional or avoidable in the way you've described.

We have the choice to add the required information as a static or as a
variable length addition to the WAL record header. Since there is no
additional requirement for expansion of the header at this point and
no measurable harm in doing so, I suggest we avoid the complexity and
lack of robustness that a variable length header would cause. Of
course, we can go there later if needed, but there is no current need.
If a variable length header had been suggested, it is certain somebody
would say that was overcooked and just do it as static.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node


On 20.06.2012 11:17, Simon Riggs wrote:

On 20 June 2012 15:45, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 20.06.2012 10:32, Simon Riggs wrote:


On 20 June 2012 14:40, Heikki Linnakangas


And I'm worried
it might not even be enough in more complicated scenarios.


It is not the only required conflict mechanism, and has never been
claimed to be so. It is simply one piece of information needed, at
various times.


So, if the origin id is not sufficient for some conflict resolution
mechanisms, what extra information do you need for those, and where do you
put it?


As explained elsewhere, wal_level = logical (or similar) would be used
to provide any additional logical information required.

Update and Delete WAL records already need to be different in that
mode, so additional info would be placed there, if there were any.

In the case of reflexive updates you raised, a typical response in
other DBMS would be to represent the query
   UPDATE SET counter = counter + 1
by sending just the +1 part, not the current value of counter, as
would be the case with the non-reflexive update
   UPDATE SET counter = 1

Handling such things in Postgres would require some subtlety, which
would not be resolved in first release but is pretty certain not to
require any changes to the WAL record header as a way of resolving it.
Having already thought about it, I'd estimate that is a very long
discussion and not relevant to the OT, but if you wish to have it
here, I won't stop you.


Yeah, I'd like to hear briefly how you would handle that without any 
further changes to the WAL record header.



Additional information required by logical information will be handled
by a new wal_level.

The discussion here is about adding origin_node_id *only*, which needs
to be added on each WAL record.


If that's all we can discuss here is, and all other options are off the
table, then I'll have to just outright object to this patch. Let's implement
what we can without the origin id, and revisit this later.


As explained, we can do nothing without the origin id. It is not
optional or avoidable in the way you've described.


It's only needed for multi-master replication, where the same table can 
be updated from multiple nodes. Just leave that out for now. There's 
plenty of functionality and issues left even without that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 16:23, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 It's only needed for multi-master replication, where the same table can be
 updated from multiple nodes. Just leave that out for now. There's plenty of
 functionality and issues left even without that.

Huh? Multi-master replication is what is being built here and many
people want that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node


On 20.06.2012 11:34, Simon Riggs wrote:

On 20 June 2012 16:23, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


It's only needed for multi-master replication, where the same table can be
updated from multiple nodes. Just leave that out for now. There's plenty of
functionality and issues left even without that.


Huh? Multi-master replication is what is being built here and many
people want that.


Sure, but presumably you're going to implement master-slave first, and 
build multi-master on top of that. What I'm saying is that we can leave 
out the origin-id for now, since we don't have agreement on it, and 
revisit it after master-slave replication is working.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wednesday, June 20, 2012 02:35:59 AM Robert Haas wrote:
 On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne cbbro...@gmail.com 
wrote:
  On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas robertmh...@gmail.com 
wrote:
  Btw, what do you mean with conflating the stream? I don't really see
  that being proposed.
  
  It seems to me that you are intent on using the WAL stream as the
  logical change stream.  I think that's a bad design.  Instead, you
  should extract changes from WAL and then ship them around in a format
  that is specific to logical replication.
  
  Yeah, that seems worth elaborating on.
  
  What has been said several times is that it's pretty necessary to
  capture the logical changes into WAL.  That seems pretty needful, in
  order that the replication data gets fsync()ed avidly, and so that we
  don't add in the race condition of needing to fsync() something *else*
  almost exactly as avidly as is the case for WAL today..
 
 Check.
 
  But it's undesirable to pull *all* the bulk of contents of WAL around
  if it's only part of the data that is going to get applied.  On a
  physical streaming replica, any logical data that gets captured will
  be useless.  And on a logical replica, they physical bits of WAL
  will be useless.
  
  What I *want* you to mean is that there would be:
  a) WAL readers that pull the physical bits, and
  b) WAL readers that just pull logical bits.
  
  I expect it would be fine to have a tool that pulls LCRs out of WAL to
  prepare that to be sent to remote locations.  Is that what you have in
  mind?
 Yes.  I think it should be possible to generate LCRs from WAL, but I
 think that the on-the-wire format for LCRs should be different from
 the WAL format.  Trying to use the same format for both things seems
 like an unpleasant straightjacket.  This discussion illustrates why:
 we're talking about consuming scarce bit-space in WAL records for a
 feature that only a tiny minority of users will use, and it's still
 not really enough bit space.  That stinks.  If LCR transmission is a
 separate protocol, this problem can be engineered away at a higher
 level.
As I said before, I definitely agree that we want to have a separate transport 
format once we have decoding nailed down. We still need to ship wal around if 
the decoding happens in a different instance, but *after* that it can be 
shipped in something more convenient/appropriate.

 Suppose we have three servers, A, B, and C, that are doing
 multi-master replication in a loop.  A sends LCRs to B, B sends them
 to C, and C sends them back to A.  Obviously, we need to make sure
 that each server applies each set of changes just once, but it
 suffices to have enough information in WAL to distinguish between
 replication transactions and non-replication transactions - that is,
 one bit.  So suppose a change is made on server A.  A generates LCRs
 from WAL, and tags each LCR with node_id = A.  It then sends those
 LCRs to B.  B applies them, flagging the apply transaction in WAL as a
 replication transaction, AND ALSO sends the LCRs to C.  The LCR
 generator on B sees the WAL from apply, but because it's flagged as a
 replication transaction, it does not generate LCRs.  So C receives
 LCRs from B just once, without any need for the node_id to to be known
 in WAL.  C can now also apply those LCRs (again flagging the apply
 transaction as replication) and it can also skip sending them to A,
 because it seems that they originated at A.
One bit is fine if you have only very simple replication topologies. Once you 
think about globally distributed databases its a bit different. You describe 
some of that below, but just to reiterate: 
Imagine having 6 nodes, 3 on one of two continents (ABC in north america, DEF 
in europe). You may only want to have full intercontinental interconnect 
between two of those (say A and D). If you only have one bit to represent the 
origin thats not going to work because you won't be able discern the changes 
from BC on A from the changes from those originating on DEF.

Another topology which is interesting is circular replications (i.e. changes 
get shipped A-B, B-C, C-A) which is a sensible topology if you only have a 
low change rate and a relatively high number of nodes because you don't need 
the full combinatorial amount of connections.

You still have the origin_id's be meaningful in the local context though. As 
described before, in the communication between the different nodes you can 
simply replace 16bit node id with some fancy UUID or such. And do the reverse 
when replaying LCRs.

 Now suppose we have a more complex topology.  Suppose we have a
 cluster of four servers A .. D which, for improved tolerance against
 network outages, are all connected pairwise.  Normally all the links
 are up, so each server sends all the LCRs it generates directly to all
 other servers.  But how do we prevent cycles?  A generates a change
 and sends it to B, C, and D.  B then sees that the change

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 16:44, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 20.06.2012 11:34, Simon Riggs wrote:

 On 20 June 2012 16:23, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 It's only needed for multi-master replication, where the same table can
 be
 updated from multiple nodes. Just leave that out for now. There's plenty
 of
 functionality and issues left even without that.


 Huh? Multi-master replication is what is being built here and many
 people want that.


 Sure, but presumably you're going to implement master-slave first, and build
 multi-master on top of that. What I'm saying is that we can leave out the
 origin-id for now, since we don't have agreement on it, and revisit it after
 master-slave replication is working.

I am comfortable with the idea of deferring applying the patch, but I
don't see any need to defer agreeing the patch is OK, so it can be
applied easily later. It does beg the question of when exactly we
would defer it to though. When would that be?

If you have a reason for disagreement, please raise it now, having
seen explanations/comments on various concerns. Of course, people have
made initial objections, which is fine but its not reasonable to
assume that such complaints continue to exist after. Perhaps there are
other thoughts?

The idea that logical rep is some kind of useful end goal in itself is
slightly misleading. If the thought is to block multi-master
completely on that basis, that would be a shame. Logical rep is the
mechanism for implementing multi-master.

Deferring this could easily end up with a huge patch in last CF, and
then it will be rejected/deferred. Patch submission here is following
the requested process - as early as possible, production ready, small
meaningful patches that build towards a final goal. This is especially
true for format changes, which is why this patch is here now. Doing it
differently just makes patch wrangling and review more difficult,
which reduces overall quality and slows down development.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Kevin Grittner

 Heikki Linnakangas  wrote:
 
 I don't like the idea of adding the origin id to the record header.
 It's only required in some occasions, and on some record types.
 
Right.
 
 And I'm worried it might not even be enough in more complicated
 scenarios.

 Perhaps we need a more generic WAL record annotation system, where
 a plugin can tack arbitrary information to WAL records. The extra
 information could be stored in the WAL record after the rmgr
 payload, similar to how backup blocks are stored. WAL replay could
 just ignore the annotations, but a replication system could use it
 to store the origin id or whatever extra information it needs.
 
Not only would that handle absolute versus relative updates and
origin id, but application frameworks could take advantage of such a
system for passing transaction metadata.  I've held back on one
concern so far that I'll bring up now because this suggestion would
address it nicely.
 
Our current trigger-driven logical replication includes a summary
which includes transaction run time, commit time, the transaction
type identifier, the source code line from which that transaction was
invoked, the user ID with which the user connected to the application
(which isn't the same as the database login), etc.  Being able to
decorate a database transaction with arbitrary (from the DBMS POV)
metadata would be very valuable.  In fact, our shop can't maintain
the current level of capabilities without *some* way to associate
such information with a transaction.
 
I think that using up the only unused space in the fixed header to
capture one piece of the transaction metadata needed for logical
replication, and that only in some configurations, is short-sighted.
If we solve the general problem of transaction metadata, this one
specific case will fall out of that.
 
I think removing origin ID from this patch and submitting a separate
patch for a generalized transaction metadata system is the sensible
way to go.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund and...@2ndquadrant.com wrote:
 As I said before, I definitely agree that we want to have a separate transport
 format once we have decoding nailed down. We still need to ship wal around if
 the decoding happens in a different instance, but *after* that it can be
 shipped in something more convenient/appropriate.

Right, OK, we agree on this then.

 One bit is fine if you have only very simple replication topologies. Once you
 think about globally distributed databases its a bit different. You describe
 some of that below, but just to reiterate:
 Imagine having 6 nodes, 3 on one of two continents (ABC in north america, DEF
 in europe). You may only want to have full intercontinental interconnect
 between two of those (say A and D). If you only have one bit to represent the
 origin thats not going to work because you won't be able discern the changes
 from BC on A from the changes from those originating on DEF.

I don't see the problem.  A certainly knows via which link the LCRs arrived.

So: change happens on A.  A sends the change to B, C, and D.  B and C
apply the change.  One bit is enough to keep them from regenerating
new LCRs that get sent back to A.  So they're fine.  D also receives
the changes (from A) and applies them, but it also does not need to
regenerate LCRs.  Instead, it can take the LCRs that it has already
got (from A) and send those to E and F.

Or: change happens on B.  B sends the changes to A.  Since A knows the
network topology, it sends the changes to C and D.  D sends them to E
and F.  Nobody except B needs to *generate* LCRs.  All any other node
needs to do is suppress *redundant* LCR generation.

 Another topology which is interesting is circular replications (i.e. changes
 get shipped A-B, B-C, C-A) which is a sensible topology if you only have a
 low change rate and a relatively high number of nodes because you don't need
 the full combinatorial amount of connections.

I think this one is OK too.  You just generate LCRs on the origin node
and then pass them around the ring at every step.  When the next hop
would be the origin node then you're done.

I think you may be imagining that A generates LCRs and sends them to
B.  B applies them, and then from the WAL just generated, it produces
new LCRs which then get sent to C.  If you do that, then, yes,
everything that you need to disentangle various network topologies
must be present in WAL.  But what I'm saying is: don't do it like
that.  Generate the LCRs just ONCE, at the origin node, and then pass
them around the network, applying them at every node.  Then, the
information that is needed in WAL is confined to one bit: the
knowledge of whether or not a particular transaction is local (and
thus LCRs should be generated) or non-local (and thus they shouldn't,
because the origin already generated them and thus we're just handing
them around to apply everywhere).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The idea that logical rep is some kind of useful end goal in itself is
 slightly misleading. If the thought is to block multi-master
 completely on that basis, that would be a shame. Logical rep is the
 mechanism for implementing multi-master.

If you're saying that single-master logical replication isn't useful,
I disagree.  Of course, having both single-master and multi-master
replication together is even more useful.  But I think getting even
single-master logical replication working well in a single release
cycle is going to be a job and a half.  Thinking that we're going to
get MMR in one release is not realistic.  The only way to make it
realistic is to put MMR ahead of every other goal that people have for
logical replication, including robustness and stability.  It's
entirely premature to be designing features for MMR when we don't even
have the design for SMR nailed down yet.  And that's even assuming we
EVER want MMR in core, which has not even really been argued, let
alone agreed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wednesday, June 20, 2012 03:19:55 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote:
  The idea that logical rep is some kind of useful end goal in itself is
  slightly misleading. If the thought is to block multi-master
  completely on that basis, that would be a shame. Logical rep is the
  mechanism for implementing multi-master.
 
 If you're saying that single-master logical replication isn't useful,
 I disagree.  Of course, having both single-master and multi-master
 replication together is even more useful.  But I think getting even
 single-master logical replication working well in a single release
 cycle is going to be a job and a half.  Thinking that we're going to
 get MMR in one release is not realistic.  The only way to make it
 realistic is to put MMR ahead of every other goal that people have for
 logical replication, including robustness and stability.  It's
 entirely premature to be designing features for MMR when we don't even
 have the design for SMR nailed down yet.  And that's even assuming we
 EVER want MMR in core, which has not even really been argued, let
 alone agreed.
I agree it has not been agreed uppon, but I certainly would consider 
submitting a prototype implementing it an argument for doing it ;)

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 21:19, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The idea that logical rep is some kind of useful end goal in itself is
 slightly misleading. If the thought is to block multi-master
 completely on that basis, that would be a shame. Logical rep is the
 mechanism for implementing multi-master.

 If you're saying that single-master logical replication isn't useful,
 I disagree.  Of course, having both single-master and multi-master
 replication together is even more useful.

 But I think getting even
 single-master logical replication working well in a single release
 cycle is going to be a job and a half.

OK, so your estimate is 1.5 people to do that. And if we have more
people, should they sit around doing nothing?

 Thinking that we're going to
 get MMR in one release is not realistic.

If you block it, then the above becomes true, whether or not it starts true.

You may not want MMR, but others do. I see no reason to prevent people
from having it, which is what you suggest.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 20 June 2012 21:19, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The idea that logical rep is some kind of useful end goal in itself is
 slightly misleading. If the thought is to block multi-master
 completely on that basis, that would be a shame. Logical rep is the
 mechanism for implementing multi-master.

 If you're saying that single-master logical replication isn't useful,
 I disagree.  Of course, having both single-master and multi-master
 replication together is even more useful.

 But I think getting even
 single-master logical replication working well in a single release
 cycle is going to be a job and a half.

 OK, so your estimate is 1.5 people to do that. And if we have more
 people, should they sit around doing nothing?

Oh, give me a break.  You're willfully missing my point.  And to quote
Fred Brooks, nine women can't make a baby in one month.

 Thinking that we're going to
 get MMR in one release is not realistic.

 If you block it, then the above becomes true, whether or not it starts true.

If I had no rational basis for my objections, that would be true.
You've got four people objecting to this patch now, all of whom happen
to be committers.  Whether or not MMR goes into core, who knows, but
it doesn't seem that this patch is going to fly.

My main point in bringing this up is that if you pick a project that
is too large, you will fail.  As I would rather see this project
succeed, I recommend that you don't do that.  Both you and Andres seem
to believe that MMR is a reasonable first target to shoot at, but I
don't think anyone else - including the Slony developers who have
commented on this issue - endorses that position.  At PGCon, you were
talking about getting a new set of features into PG over the next 3-5
years.  Now, it seems like you want to compress that timeline to a
year.  I don't think that's going to work.  You also requested that
people tell you sooner when large patches were in danger of not making
the release.  Now I'm doing that, VERY early, and you're apparently
angry about it.  If the only satisfactory outcome of this conversation
is that everyone agrees with the design pattern you've already decided
on, then you haven't left yourself very much room to walk away
satisfied.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wednesday, June 20, 2012 03:02:28 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  One bit is fine if you have only very simple replication topologies. Once
  you think about globally distributed databases its a bit different. You
  describe some of that below, but just to reiterate:
  Imagine having 6 nodes, 3 on one of two continents (ABC in north america,
  DEF in europe). You may only want to have full intercontinental
  interconnect between two of those (say A and D). If you only have one
  bit to represent the origin thats not going to work because you won't be
  able discern the changes from BC on A from the changes from those
  originating on DEF.
 
 I don't see the problem.  A certainly knows via which link the LCRs
 arrived.

 So: change happens on A.  A sends the change to B, C, and D.  B and C
 apply the change.  One bit is enough to keep them from regenerating
 new LCRs that get sent back to A.  So they're fine.  D also receives
 the changes (from A) and applies them, but it also does not need to
 regenerate LCRs.  Instead, it can take the LCRs that it has already
 got (from A) and send those to E and F.

 Or: change happens on B.  B sends the changes to A.  Since A knows the
 network topology, it sends the changes to C and D.  D sends them to E
 and F.  Nobody except B needs to *generate* LCRs.  All any other node
 needs to do is suppress *redundant* LCR generation.
 
  Another topology which is interesting is circular replications (i.e.
  changes get shipped A-B, B-C, C-A) which is a sensible topology if
  you only have a low change rate and a relatively high number of nodes
  because you don't need the full combinatorial amount of connections.
 
 I think this one is OK too.  You just generate LCRs on the origin node
 and then pass them around the ring at every step.  When the next hop
 would be the origin node then you're done.
 
 I think you may be imagining that A generates LCRs and sends them to
 B.  B applies them, and then from the WAL just generated, it produces
 new LCRs which then get sent to C. 
Yes, thats what I am proposing.

 If you do that, then, yes,
 everything that you need to disentangle various network topologies
 must be present in WAL.  But what I'm saying is: don't do it like
 that.  Generate the LCRs just ONCE, at the origin node, and then pass
 them around the network, applying them at every node.  Then, the
 information that is needed in WAL is confined to one bit: the
 knowledge of whether or not a particular transaction is local (and
 thus LCRs should be generated) or non-local (and thus they shouldn't,
 because the origin already generated them and thus we're just handing
 them around to apply everywhere).
Sure, you can do it that way, but I don't think its a good idea. If you do it 
my way you *guarantee* that when replaying changes from node B on node C you 
have replayed changes from A at least as far as B has. Thats a really nice 
property for MM.
You *can* get same with your solution but it starts to get complicated rather 
fast. While my/our proposed solution is trivial to implement.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 20:37, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Heikki Linnakangas  wrote:

 I don't like the idea of adding the origin id to the record header.
 It's only required in some occasions, and on some record types.

 Right.

Wrong, as explained.

 And I'm worried it might not even be enough in more complicated
 scenarios.

 Perhaps we need a more generic WAL record annotation system, where
 a plugin can tack arbitrary information to WAL records. The extra
 information could be stored in the WAL record after the rmgr
 payload, similar to how backup blocks are stored. WAL replay could
 just ignore the annotations, but a replication system could use it
 to store the origin id or whatever extra information it needs.

 Not only would that handle absolute versus relative updates and
 origin id, but application frameworks could take advantage of such a
 system for passing transaction metadata.  I've held back on one
 concern so far that I'll bring up now because this suggestion would
 address it nicely.

 Our current trigger-driven logical replication includes a summary
 which includes transaction run time, commit time, the transaction
 type identifier, the source code line from which that transaction was
 invoked, the user ID with which the user connected to the application
 (which isn't the same as the database login), etc.  Being able to
 decorate a database transaction with arbitrary (from the DBMS POV)
 metadata would be very valuable.  In fact, our shop can't maintain
 the current level of capabilities without *some* way to associate
 such information with a transaction.

 I think that using up the only unused space in the fixed header to
 capture one piece of the transaction metadata needed for logical
 replication, and that only in some configurations, is short-sighted.
 If we solve the general problem of transaction metadata, this one
 specific case will fall out of that.

The proposal now includes flag bits that would allow the addition of a
variable length header, should that ever become necessary. So the
unused space in the fixed header is not being used up as you say. In
any case, the fixed header still has 4 wasted bytes on 64bit systems
even after the patch is applied. So this claim of short sightedness is
just plain wrong.

It isn't true that this is needed only for some configurations of
multi-master, per discussion.

This is not transaction metadata, it is WAL record metadata required
for multi-master replication, see later point.

We need to add information to every WAL record that is used as the
source for generating LCRs. It is also possible to add this to HEAP
and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
using the *currently wasted* bytes on a WAL record header does *not*
bloat the WAL stream.

 I think removing origin ID from this patch and submitting a separate
 patch for a generalized transaction metadata system is the sensible
 way to go.

We already have a very flexible WAL system for recording data of
interest to various resource managers. If you wish to annotate a
transaction, you can either generate a new kind of WAL record or you
can enhance a commit record. There are already unused flag bits on
commit records for just such a purpose.

XLOG_NOOP records can already be generated by your application if you
wish to inject additional metadata to the WAL stream. So no changes
are required for you to implement the generalised transaction metadata
scheme you say you require.

Not sure how or why that relates to requirements for multi-master.


Please note that I've suggested review changes to Andres' work myself.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Hannu Valtonen

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 06/19/2012 01:47 AM, Christopher Browne wrote:
 That numbering scheme gets pretty anti-intuitive fairly quickly,
 from whence we took the approach of having a couple digits
 indicating data centre followed by a digit indicating which node in
 that data centre.
 
 If that all sounds incoherent, well, the more nodes you have
 around, the more difficult it becomes to make sure you *do* have a
 coherent picture of your cluster.
 
 I recall the Slony-II project having a notion of attaching a
 permanent UUID-based node ID to each node.  As long as there is
 somewhere decent to find a symbolically significant node name, I
 like the idea of the ID *not* being in a tiny range, and being
 UUID/OID-like...
 

Just as a sidenote, MySQL's new global transaction ids use a UUID for
the serverid bit of it. [1]

- - Hannu Valtonen

[1]
http://d2-systems.blogspot.fi/2012/04/global-transaction-identifiers-are-in.html
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJP4B2ZAAoJENx2aQJr96ohm/4QAOKqFkyBlbwn1k69z6jmqbyq
vn1pCtAKgpdL3wdOEIipotylJnyl9Aqd8IRQT/j3u2dDOWbFqf6Xs3t8x6NbE06T
4kBANs7VjZwSgUlnmU6zsGhX5RB+mHEmnjKn+GG3uA4m5Qo5WvvzfwxKXDdOfSuw
ChLgTTse28OV5xOeQnU4FSPPrsc0AxL8suoUDAi3Qp3EkefgN7zzkfLkw1B+E119
mfxSus9nu4rl4givP+z8VMtfIhU4EqQvQvcI5w6E4aW88iYxzkH/BICyBYGg73e4
SnJfLpaXg/C/Ll0NjKRr9Gsuxl1yStb7vPzc0AQahyE2IyspN+Ga5Y1eyvEfcv5s
cOpKFLMPE5sQpvTdcSfZ3/nGotos1PDijVpAy7qVY3m0ow5ECyGv/8sdXSQXut+x
f9QVNe3rOznsy3J38Z8OOEaftq30UZt7+cXl4fMvI/eVkva+MyHjdR+aIqzMVh1d
S6uCpkK/UUC11PCENdRmVIka6EHAsvs+m9B4kmsHpR4+T/qKVaLdBX27L7k2prse
OmJ1qFtOmMqZJHFZ8oCVYoVoK5UEaJfZBLbSyf9vU0iSDh2XlKgVNe6TTE0f+OVE
GcqIT2qbc9XG6hFO/C0zy5qdNXem96feBszjWLhf+F9ULq1d0HNEg2WghJJPpan1
adx5JGZK5k0xtSewTj7O
=lK2f
-END PGP SIGNATURE-

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund and...@2ndquadrant.com wrote:
 If you do that, then, yes,
 everything that you need to disentangle various network topologies
 must be present in WAL.  But what I'm saying is: don't do it like
 that.  Generate the LCRs just ONCE, at the origin node, and then pass
 them around the network, applying them at every node.  Then, the
 information that is needed in WAL is confined to one bit: the
 knowledge of whether or not a particular transaction is local (and
 thus LCRs should be generated) or non-local (and thus they shouldn't,
 because the origin already generated them and thus we're just handing
 them around to apply everywhere).
 Sure, you can do it that way, but I don't think its a good idea. If you do it
 my way you *guarantee* that when replaying changes from node B on node C you
 have replayed changes from A at least as far as B has. Thats a really nice
 property for MM.
 You *can* get same with your solution but it starts to get complicated rather
 fast. While my/our proposed solution is trivial to implement.

That's an interesting point.  I agree that's a useful property, and
might be a reason not to just use a single-bit flag, but I still think
you'd be better off handling that requirement via some other method,
like logging the node ID once per transaction or something.  That lets
you have as much metadata as you end up needing, which is a lot more
flexible than a 16-bit field, as Kevin, Heikki, and Tom have also
said.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wednesday, June 20, 2012 03:42:39 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote:
  On 20 June 2012 21:19, Robert Haas robertmh...@gmail.com wrote:
  On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com 
wrote:
  The idea that logical rep is some kind of useful end goal in itself is
  slightly misleading. If the thought is to block multi-master
  completely on that basis, that would be a shame. Logical rep is the
  mechanism for implementing multi-master.
  
  If you're saying that single-master logical replication isn't useful,
  I disagree.  Of course, having both single-master and multi-master
  replication together is even more useful.
  
  But I think getting even
  single-master logical replication working well in a single release
  cycle is going to be a job and a half.
  Thinking that we're going to
  get MMR in one release is not realistic.
  If you block it, then the above becomes true, whether or not it starts
  true.
 My main point in bringing this up is that if you pick a project that
 is too large, you will fail.
Were not the only ones here that are performing scope creep though... I think 
about all people who have posted in the whole thread except maybe Tom and 
Marko are guilty of doing so.

I still think its rather sensible to focus on exactly duplicated schemas in a 
very first version just because that leaves out some of the complexity while 
paving the road for other nice things.

 You've got four people objecting to this patch now, all of whom happen
 to be committers.  Whether or not MMR goes into core, who knows, but
 it doesn't seem that this patch is going to fly.
I find that a bit too early to say. Sure it won't fly exactly as proposed, but 
hell, who cares? What I want to get in is a solution to the specific problem 
the patch targets. At least you have, not sure about others, accepted that the 
problem needs a solution.
We do not agree yet how that solution looks should like but thats not exactly 
surprising as we started discussing the problem only a good day ago.

If people agree that your proposed way of just one flag bit is the way to go 
we will have to live with that. But thats different from saying the whole 
thing is dead.

 As I would rather see this project
 succeed, I recommend that you don't do that.  Both you and Andres seem
 to believe that MMR is a reasonable first target to shoot at, but I
 don't think anyone else - including the Slony developers who have
 commented on this issue - endorses that position. 
I don't think we get full MMR into 9.3. What I am proposing is that we build 
in the few pieces that are required to implement MMR *ontop* of whats 
hopefully in 9.3.
And I think thats a realistic goal.

 At PGCon, you were
 talking about getting a new set of features into PG over the next 3-5
 years.  Now, it seems like you want to compress that timeline to a
 year.
Well, I obviously would like to be everything be done in a release, but I also 
would like to go hiking for a year, have a restored sailing boat and some 
more. That doesn't make it reality...
To make it absolutely clear: I definitely don't think its realistic to have 
everything in 9.3. And I don't think Simon does so either. What I want is to 
have the basic building blocks in 9.3.
The difference probably just is whats determined as a building block.

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wednesday, June 20, 2012 03:54:43 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  If you do that, then, yes,
  everything that you need to disentangle various network topologies
  must be present in WAL.  But what I'm saying is: don't do it like
  that.  Generate the LCRs just ONCE, at the origin node, and then pass
  them around the network, applying them at every node.  Then, the
  information that is needed in WAL is confined to one bit: the
  knowledge of whether or not a particular transaction is local (and
  thus LCRs should be generated) or non-local (and thus they shouldn't,
  because the origin already generated them and thus we're just handing
  them around to apply everywhere).
  
  Sure, you can do it that way, but I don't think its a good idea. If you
  do it my way you *guarantee* that when replaying changes from node B on
  node C you have replayed changes from A at least as far as B has. Thats
  a really nice property for MM.
  You *can* get same with your solution but it starts to get complicated
  rather fast. While my/our proposed solution is trivial to implement.
 
 That's an interesting point.  I agree that's a useful property, and
 might be a reason not to just use a single-bit flag, but I still think
 you'd be better off handling that requirement via some other method,
 like logging the node ID once per transaction or something.  That lets
 you have as much metadata as you end up needing, which is a lot more
 flexible than a 16-bit field, as Kevin, Heikki, and Tom have also
 said.
If it comes down to that I can definitely live with that. I still think making 
the filtering trivial so it can be done without any logic on a low level is a 
very desirable property but if not, so be it.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 21:42, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 20 June 2012 21:19, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The idea that logical rep is some kind of useful end goal in itself is
 slightly misleading. If the thought is to block multi-master
 completely on that basis, that would be a shame. Logical rep is the
 mechanism for implementing multi-master.

 If you're saying that single-master logical replication isn't useful,
 I disagree.  Of course, having both single-master and multi-master
 replication together is even more useful.

 But I think getting even
 single-master logical replication working well in a single release
 cycle is going to be a job and a half.

 OK, so your estimate is 1.5 people to do that. And if we have more
 people, should they sit around doing nothing?

 Oh, give me a break.  You're willfully missing my point.  And to quote
 Fred Brooks, nine women can't make a baby in one month.


No, I'm not. The question is not how quickly can N people achieve a
single thing, but how long will it take a few skilled people working
on carefully selected tasks that have few dependencies between them to
achieve something.

We have significantly more preparation, development time and resources
than any other project previously performed for PostgreSQL, that I am
aware of.

Stating that it is impossible to perform a task in a certain period of
time without even considering those points is clearly rubbish. I've
arrived at my thinking based upon detailed project planning of what
was possible in the time.

How exactly did you arrive at your conclusion? Why is yours right and
mine wrong?



 Thinking that we're going to
 get MMR in one release is not realistic.

 If you block it, then the above becomes true, whether or not it starts true.

 If I had no rational basis for my objections, that would be true.
 You've got four people objecting to this patch now, all of whom happen
 to be committers.  Whether or not MMR goes into core, who knows, but
 it doesn't seem that this patch is going to fly.

No, I have four people who had initial objections and who have not
commented on the fact that the points made are regrettably incorrect.
I don't expect everybody commenting on the design to have perfect
knowledge of the whole design, so I expect people to make errors in
their comments. I also expect people to take note of what has been
said before making further objections or drawing conclusions.

Since at least 3 of the people making such comments did not attend the
full briefing meeting in Ottawa, I am not particularly surprised.
However, I do expect people that didn't come to the meeting to
recognise that they are likely to be missing information and to listen
closely, as I listen to them.

When the facts change, I change my mind. What do you do, sir?


 My main point in bringing this up is that if you pick a project that
 is too large, you will fail.  As I would rather see this project
 succeed, I recommend that you don't do that.  Both you and Andres seem
 to believe that MMR is a reasonable first target to shoot at, but I
 don't think anyone else - including the Slony developers who have
 commented on this issue - endorses that position.  At PGCon, you were
 talking about getting a new set of features into PG over the next 3-5
 years.  Now, it seems like you want to compress that timeline to a
 year.  I don't think that's going to work.  You also requested that
 people tell you sooner when large patches were in danger of not making
 the release.  Now I'm doing that, VERY early, and you're apparently
 angry about it.  If the only satisfactory outcome of this conversation
 is that everyone agrees with the design pattern you've already decided
 on, then you haven't left yourself very much room to walk away
 satisfied.

Note that I have already myself given review feedback to Andres and
that change has visibly occurred during this thread via public debate.

Claiming that I only stick to what has already been decided is
patently false, with me at least.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 16:23, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 20.06.2012 11:17, Simon Riggs wrote:

 On 20 June 2012 15:45, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 On 20.06.2012 10:32, Simon Riggs wrote:


 On 20 June 2012 14:40, Heikki Linnakangas


 And I'm worried
 it might not even be enough in more complicated scenarios.


 It is not the only required conflict mechanism, and has never been
 claimed to be so. It is simply one piece of information needed, at
 various times.


 So, if the origin id is not sufficient for some conflict resolution
 mechanisms, what extra information do you need for those, and where do
 you
 put it?


 As explained elsewhere, wal_level = logical (or similar) would be used
 to provide any additional logical information required.

 Update and Delete WAL records already need to be different in that
 mode, so additional info would be placed there, if there were any.

 In the case of reflexive updates you raised, a typical response in
 other DBMS would be to represent the query
   UPDATE SET counter = counter + 1
 by sending just the +1 part, not the current value of counter, as
 would be the case with the non-reflexive update
   UPDATE SET counter = 1

 Handling such things in Postgres would require some subtlety, which
 would not be resolved in first release but is pretty certain not to
 require any changes to the WAL record header as a way of resolving it.
 Having already thought about it, I'd estimate that is a very long
 discussion and not relevant to the OT, but if you wish to have it
 here, I won't stop you.


 Yeah, I'd like to hear briefly how you would handle that without any further
 changes to the WAL record header.

I already did:

 Update and Delete WAL records already need to be different in that
 mode, so additional info would be placed there, if there were any.

The case you mentioned relates to UPDATEs only, so I would suggest
that we add that information to a new form of update record only.

That has nothing to do with the WAL record header.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Kevin Grittner

Simon Riggs si...@2ndquadrant.com wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Heikki Linnakangas  wrote:

 I don't like the idea of adding the origin id to the record
 header. It's only required in some occasions, and on some record
 types.

 Right.
 
 Wrong, as explained.
 
The point is not wrong; you are simply not responding to what is
being said.
 
You have not explained why an origin ID is required when there is no
replication, or if there is master/slave logical replication, or
there are multiple masters with non-overlapping primary keys
replicating to a single table in a consolidated database, or each
master replicates to all other masters directly, or any of various
other scenarios raised on this thread.  You've only explained why
it's necessary for certain configurations of multi-master
replication where all rows in a table can be updated on any of the
masters.  I understand that this is the configuration you find most
interesting, at least for initial implementation.  That does not
mean that the other situations don't exist as use cases or should be
not be considered in the overall design.
 
I don't think there is anyone here who would not love to see this
effort succeed, all the way to multi-master replication in the
configuration you are emphasizing.  What is happening is that people
are expressing concerns about parts of the design which they feel
are problematic, and brainstorming about possible alternatives.  As
I'm sure you know, fixing a design problem at this stage in
development is a lot less expensive than letting the problem slide
and trying to deal with it later.
 
 It isn't true that this is needed only for some configurations of
 multi-master, per discussion.
 
I didn't get that out of the discussion; I saw a lot of cases
mentioned as not needing it to which you simply did not respond.
 
 This is not transaction metadata, it is WAL record metadata
 required for multi-master replication, see later point.
 
 We need to add information to every WAL record that is used as the
 source for generating LCRs.
 
If the origin ID of a transaction doesn't count as transaction
metadata (i.e., data about the transaction), what does?  It may be a
metadata element about which you have special concerns, but it is
transaction metadata.  You don't plan on supporting individual WAL
records within a transaction containing different values for origin
ID, do you?  If not, why is it something to store in every WAL
record rather than once per transaction?  That's not intended to be
a rhetorical question.  I think it's because you're still thinking
of the WAL stream as *the medium* for logical replication data
rather than *the source* of logical replication data.
 
As long as the WAL stream is the medium, options are very
constrained.  You can code a very fast engine to handle a single
type of configuration that way, and perhaps that should be a
supported feature, but it's not a configuration I've needed yet. 
(Well, on reflection, if it had been available and easy to use, I
can think of *one* time I *might* have used it for a pair of nodes.)
It seems to me that you are so focused on this one use case that you
are not considering how design choices which facilitate fast
development of that use case paint us into a corner in terms of
expanding to other use cases.
 
 I think removing origin ID from this patch and submitting a
 separate patch for a generalized transaction metadata system is
 the sensible way to go.
 
 We already have a very flexible WAL system for recording data of
 interest to various resource managers. If you wish to annotate a
 transaction, you can either generate a new kind of WAL record or
 you can enhance a commit record.
 
Right.  Like many of us are suggesting should be done for origin ID.
 
 XLOG_NOOP records can already be generated by your application if
 you wish to inject additional metadata to the WAL stream. So no
 changes are required for you to implement the generalised
 transaction metadata scheme you say you require.
 
I'm glad it's that easy.  Are there SQL functions to for that yet?
 
 Not sure how or why that relates to requirements for multi-master.
 
That depends on whether you want to leave the door open to other
logical replication than the one use case on which you are currently
focused.  I even consider some of those other cases multi-master,
especially when multiple databases are replicating to a single table
on another server.  I'm not clear on your definition -- it seems to
be rather more narrow.  Maybe we need to define some terms somewhere
to facilitate discussion.  Is there a Wiki page where that would
make sense?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund and...@2ndquadrant.com wrote:
 Were not the only ones here that are performing scope creep though... I think
 about all people who have posted in the whole thread except maybe Tom and
 Marko are guilty of doing so.

 I still think its rather sensible to focus on exactly duplicated schemas in a
 very first version just because that leaves out some of the complexity while
 paving the road for other nice things.

Well, I guess what I want to know is: what does focusing on exactly
duplicated schemas mean?  If it means we'll disable DDL for tables
when we turn on replication, that's basically the Slony approach: when
you want to make a DDL change, you have to quiesce replication, do it,
and then resume replication.  I would possibly be OK with that
approach.  If it means that we'll hope that the schemas are duplicated
and start spewing garbage data when they're not, then I'm not
definitely not OK with that approach.  If it means using event
triggers to keep the catalogs synchronized, then I don't think I don't
think that's adequately robust.  The user could add more event
triggers that run before or after the ones the replication system
adds, and then you are back to garbage decoding (or crashes).  They
could also modify the catalogs directly, although it's possible we
don't care quite as much about that case (but on the other hand people
do sometimes need to do it to solve real problems).  Although I am
100% OK with pairing back the initial feature set - indeed, I strongly
recommend it - I think that robustness is not a feature which can be
left out in v1 and added in later.  All the robustness has to be
designed in at the start, or we will never have it.

On the whole, I think we're spending far too much time talking about
code and far too little time talking about what the overall design
should look like.  We are having a discussion about whether or not MMR
should be supported by sticking a 16-bit node ID into every WAL record
without having first decided whether we should support MMR, whether
that requires node IDs, whether they should be integers, whether those
integers should be 16 bits in size, whether they should be present in
WAL, and whether or not the record header is the right place to put
them.  There's a right order in which to resolve those questions, and
this isn't it.  More generally, I think there is a ton of complexity
that we're probably overlooking here in focusing in on specific coding
details.  I think the most interesting comment made to date is Steve
Singer's observation that very little of Slony is concerned with
changeset extraction or apply.  Now, on the flip side, all of these
patches seem to be concerned with changeset extraction and apply.
That suggests that we're missing some pretty significant pieces
somewhere in this design.  I think those pieces are things like error
recovery, fault tolerance, user interface design, and control logic.
Slony has spent years trying to get those things right.  Whether or
not they actually have gotten them right is of course an arguable
point, but we're unlikely to do better by ignoring all of those issues
and implementing whatever is most technically expedient.

 You've got four people objecting to this patch now, all of whom happen
 to be committers.  Whether or not MMR goes into core, who knows, but
 it doesn't seem that this patch is going to fly.
 I find that a bit too early to say. Sure it won't fly exactly as proposed, but
 hell, who cares? What I want to get in is a solution to the specific problem
 the patch targets. At least you have, not sure about others, accepted that the
 problem needs a solution.
 We do not agree yet how that solution looks should like but thats not exactly
 surprising as we started discussing the problem only a good day ago.

Oh, no argument with any of that.  I strongly object to the idea of
shoving this patch through as-is, but I don't object to solving the
problem in some other, more appropriate way.  I think that won't look
much like this patch, though; it will be some new patch.

 If people agree that your proposed way of just one flag bit is the way to go
 we will have to live with that. But thats different from saying the whole
 thing is dead.

I think you've convinced me that a single flag-bit is not enough, but
I don't think you've convinced anyone that it belongs in the record
header.

 As I would rather see this project
 succeed, I recommend that you don't do that.  Both you and Andres seem
 to believe that MMR is a reasonable first target to shoot at, but I
 don't think anyone else - including the Slony developers who have
 commented on this issue - endorses that position.
 I don't think we get full MMR into 9.3. What I am proposing is that we build
 in the few pieces that are required to implement MMR *ontop* of whats
 hopefully in 9.3.
 And I think thats a realistic goal.

I can't quite follow that sentence, but my general sense is that,
while you're

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
 Simon Riggs si...@2ndquadrant.com wrote:
  This is not transaction metadata, it is WAL record metadata
  required for multi-master replication, see later point.

  We need to add information to every WAL record that is used as the
  source for generating LCRs.
 If the origin ID of a transaction doesn't count as transaction
 metadata (i.e., data about the transaction), what does?  It may be a
 metadata element about which you have special concerns, but it is
 transaction metadata.  You don't plan on supporting individual WAL
 records within a transaction containing different values for origin
 ID, do you?  If not, why is it something to store in every WAL
 record rather than once per transaction?  That's not intended to be
 a rhetorical question.
Its definitely possible to store it per transaction (see the discussion around 
http://archives.postgresql.org/message-
id/201206201605.43634.and...@2ndquadrant.com) it just makes the filtering via 
the originating node a considerably more complex thing. With our proposal you 
can do it without any complexity involved, on a low level. Storing it per 
transaction means you can only stream out the data to other nodes *after* 
fully reassembling the transaction. Thats a pitty, especially if we go for a 
design where the decoding happens in a proxy instance.

Other metadata will not be needed on such a low level.

I also have to admit that I am very hesitant to start developing some generic 
transaction metadata framework atm. That seems to be a good way to spend a 
good part of time in discussion and disagreeing. Imo thats something for 
later.

 I think it's because you're still thinking
 of the WAL stream as *the medium* for logical replication data
 rather than *the source* of logical replication data.
I don't think thats true. See the above referenced subthread for reasons why I 
think the origin id is important.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs si...@2ndquadrant.com wrote:
 But I think getting even
 single-master logical replication working well in a single release
 cycle is going to be a job and a half.

 OK, so your estimate is 1.5 people to do that. And if we have more
 people, should they sit around doing nothing?

 Oh, give me a break.  You're willfully missing my point.  And to quote
 Fred Brooks, nine women can't make a baby in one month.

 No, I'm not. The question is not how quickly can N people achieve a
 single thing, but how long will it take a few skilled people working
 on carefully selected tasks that have few dependencies between them to
 achieve something.

The bottleneck is getting the design right, not writing the code.
Selecting tasks for people to work on without an agreement on the
design will not advance the process, unless we just accept whatever
code you choose to right based on whatever design you happen to pick.

 How exactly did you arrive at your conclusion? Why is yours right and
 mine wrong?

I estimated the amount of work that would be required to do this right
and compared it to other large projects that have been successfully
done in the past.  I think you are looking at something on the order
of magnitude of the Windows port, which took about four releases to
become stable, or the SE-Linux project, which still isn't
feature-complete.  Even if it's only a HS-sized project, that took two
releases, as did streaming replication.  SSI got committed within one
release cycle, but there were several years of design and advocacy
work before any code was written, so that, too, was really a
multi-year project.

I'll confine my comments on the second part of the question to the
observation that it is a bit early to know who is right and who is
wrong, but the question could just as easily be turned on its head.

 No, I have four people who had initial objections and who have not
 commented on the fact that the points made are regrettably incorrect.

I think Kevin addressed this point better than I can.  Asserting
something doesn't make it true, and you haven't offered any rational
argument against the points that have been made, probably because
there isn't one.  We *cannot* justify steeling 100% of the available
bit space for a feature that many people won't use and may not be
enough to address the real requirement anyway.

 Since at least 3 of the people making such comments did not attend the
 full briefing meeting in Ottawa, I am not particularly surprised.
 However, I do expect people that didn't come to the meeting to
 recognise that they are likely to be missing information and to listen
 closely, as I listen to them.

Participation in the community development process is not contingent
on having flown to Ottawa in May, or on having decided to spend that
evening at your briefing meeting.  Attributing to ignorance what is
adequately explained by honest disagreement is impolite.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

Hi,

On Wednesday, June 20, 2012 05:44:09 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  Were not the only ones here that are performing scope creep though... I
  think about all people who have posted in the whole thread except maybe
  Tom and Marko are guilty of doing so.
  
  I still think its rather sensible to focus on exactly duplicated schemas
  in a very first version just because that leaves out some of the
  complexity while paving the road for other nice things.
 
 Well, I guess what I want to know is: what does focusing on exactly
 duplicated schemas mean?  If it means we'll disable DDL for tables
 when we turn on replication, that's basically the Slony approach: when
 you want to make a DDL change, you have to quiesce replication, do it,
 and then resume replication.  I would possibly be OK with that
 approach.  If it means that we'll hope that the schemas are duplicated
 and start spewing garbage data when they're not, then I'm not
 definitely not OK with that approach.  If it means using event
 triggers to keep the catalogs synchronized, then I don't think I don't
 think that's adequately robust.  The user could add more event
 triggers that run before or after the ones the replication system
 adds, and then you are back to garbage decoding (or crashes). 
I would prefer the event trigger way because that seems to be the most 
extensible/reusable. It would allow a fully replicated databases and catalog 
only instances.
I think we need to design event triggers in a way you cannot simply circumvent 
them. We already have the case that if users try to screw around system 
triggers we give back wrong answers with the planner relying on foreign keys 
btw.
If the problem is having user trigger after system triggers: Lets make that 
impossible. Forbidding DDL on the other instances once we have that isn't that 
hard.

Perhaps all that will get simpler if we can make reading the catalog via 
custom built snapshots work as you proposed otherwhere in this thread. That 
would make checking errors way much easier even if you just want to apply to  
a database with exactly the same schema. Thats the next thing I plan to work 
on.

 They could also modify the catalogs directly, although it's possible we
 don't care quite as much about that case (but on the other hand people
 do sometimes need to do it to solve real problems).
With that you already can crash the database perfectly fine today. I think 
trying to care for that is a waste of time.

 Although I am
 100% OK with pairing back the initial feature set - indeed, I strongly
 recommend it - I think that robustness is not a feature which can be
 left out in v1 and added in later.  All the robustness has to be
 designed in at the start, or we will never have it.
I definitely don't intend to cut down on robustness.

 On the whole, I think we're spending far too much time talking about
 code and far too little time talking about what the overall design
 should look like.
Agreed.

 We are having a discussion about whether or not MMR
 should be supported by sticking a 16-bit node ID into every WAL record
 without having first decided whether we should support MMR, whether
 that requires node IDs, whether they should be integers, whether those
 integers should be 16 bits in size, whether they should be present in
 WAL, and whether or not the record header is the right place to put
 them.  There's a right order in which to resolve those questions, and
 this isn't it.  More generally, I think there is a ton of complexity
 that we're probably overlooking here in focusing in on specific coding
 details.  I think the most interesting comment made to date is Steve
 Singer's observation that very little of Slony is concerned with
 changeset extraction or apply.  Now, on the flip side, all of these
 patches seem to be concerned with changeset extraction and apply.
 That suggests that we're missing some pretty significant pieces
 somewhere in this design.  I think those pieces are things like error
 recovery, fault tolerance, user interface design, and control logic.
 Slony has spent years trying to get those things right.  Whether or
 not they actually have gotten them right is of course an arguable
 point, but we're unlikely to do better by ignoring all of those issues
 and implementing whatever is most technically expedient.
I agree that the focus isn't 100% optimal and that there are *loads* of issues 
we haven't event started to look at. But you need a point to start and 
extraction  apply seems to be a good one because you can actually test it 
without the other issues solved which is not really the case the other way 
round.
Also its possible to plug in the newly built changeset extraction into 
existing solutions to make them more efficient while retaining most of their 
respective framework.

So I disagree that thats the wrong part to start with.

  You've got four people objecting to this patch now,

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 23:34, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Simon Riggs si...@2ndquadrant.com wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Heikki Linnakangas  wrote:

 I don't like the idea of adding the origin id to the record
 header. It's only required in some occasions, and on some record
 types.

 Right.

 Wrong, as explained.

 The point is not wrong; you are simply not responding to what is
 being said.

Heikki said that the origin ID was not required for all MMR
configs/scenarios. IMHO that is wrong, with explanation given.

By agreeing with him, I assumed you were sharing that assertion,
rather than saying something else.


 You have not explained why an origin ID is required when there is no
 replication, or if there is master/slave logical replication, or
...

You're right; I never claimed it was needed. Origin Id is only needed
for multi-master replication and that is the only context I've
discussed it.


 This is not transaction metadata, it is WAL record metadata
 required for multi-master replication, see later point.

 We need to add information to every WAL record that is used as the
 source for generating LCRs.

 If the origin ID of a transaction doesn't count as transaction
 metadata (i.e., data about the transaction), what does?  It may be a
 metadata element about which you have special concerns, but it is
 transaction metadata.  You don't plan on supporting individual WAL
 records within a transaction containing different values for origin
 ID, do you?  If not, why is it something to store in every WAL
 record rather than once per transaction?  That's not intended to be
 a rhetorical question.  I think it's because you're still thinking
 of the WAL stream as *the medium* for logical replication data
 rather than *the source* of logical replication data.

 As long as the WAL stream is the medium, options are very
 constrained.  You can code a very fast engine to handle a single
 type of configuration that way, and perhaps that should be a
 supported feature, but it's not a configuration I've needed yet.
 (Well, on reflection, if it had been available and easy to use, I
 can think of *one* time I *might* have used it for a pair of nodes.)
 It seems to me that you are so focused on this one use case that you
 are not considering how design choices which facilitate fast
 development of that use case paint us into a corner in terms of
 expanding to other use cases.

 I think removing origin ID from this patch and submitting a
 separate patch for a generalized transaction metadata system is
 the sensible way to go.

 We already have a very flexible WAL system for recording data of
 interest to various resource managers. If you wish to annotate a
 transaction, you can either generate a new kind of WAL record or
 you can enhance a commit record.

 Right.  Like many of us are suggesting should be done for origin ID.

 XLOG_NOOP records can already be generated by your application if
 you wish to inject additional metadata to the WAL stream. So no
 changes are required for you to implement the generalised
 transaction metadata scheme you say you require.

 I'm glad it's that easy.  Are there SQL functions to for that yet?


Yes, another possible design is to generate a new kind of WAL record
for the origin id.

Doing it that way will slow down multi-master by a measurable amount,
and slightly bloat the WAL stream.

The proposed way uses space that is currently wasted and likely to
remain so. Only 2 bytes of 6 bytes available are proposed for use,
with a flag design that allows future extension if required. When MMR
is not in use, the WAL records would look completely identical to the
way they look now, in size, settings and speed of writing them.

Putting the origin id onto each WAL record allows very fast and simple
stateless filtering. I suggest using it because those bytes have been
sitting there unused for close to 10 years now and no better use
springs to mind.

The proposed design is the fastest way of implementing MMR, without
any loss for non-users.

As I noted before, slowing down MMR by a small amount causes geometric
losses in performance across the whole cluster.


 Not sure how or why that relates to requirements for multi-master.

 That depends on whether you want to leave the door open to other
 logical replication than the one use case on which you are currently
 focused.  I even consider some of those other cases multi-master,
 especially when multiple databases are replicating to a single table
 on another server.  I'm not clear on your definition -- it seems to
 be rather more narrow.  Maybe we need to define some terms somewhere
 to facilitate discussion.  Is there a Wiki page where that would
 make sense?

The project is called BiDirectional Replication to ensure that people
understood this is not just multi-master. But that doesn't mean that
multi-master can't have its own specific requirements.

Adding originid is also useful for the use case

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Christopher Browne

On Wed, Jun 20, 2012 at 11:50 AM, Andres Freund and...@2ndquadrant.com wrote:
 On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
 Simon Riggs si...@2ndquadrant.com wrote:
  This is not transaction metadata, it is WAL record metadata
  required for multi-master replication, see later point.

  We need to add information to every WAL record that is used as the
  source for generating LCRs.
 If the origin ID of a transaction doesn't count as transaction
 metadata (i.e., data about the transaction), what does?  It may be a
 metadata element about which you have special concerns, but it is
 transaction metadata.  You don't plan on supporting individual WAL
 records within a transaction containing different values for origin
 ID, do you?  If not, why is it something to store in every WAL
 record rather than once per transaction?  That's not intended to be
 a rhetorical question.
 Its definitely possible to store it per transaction (see the discussion around
 http://archives.postgresql.org/message-
 id/201206201605.43634.and...@2ndquadrant.com) it just makes the filtering via
 the originating node a considerably more complex thing. With our proposal you
 can do it without any complexity involved, on a low level. Storing it per
 transaction means you can only stream out the data to other nodes *after*
 fully reassembling the transaction. Thats a pitty, especially if we go for a
 design where the decoding happens in a proxy instance.

I guess I'm not seeing the purpose to having the origin node id in the
WAL stream either.

We have it in the Slony sl_log_* stream, however there is a crucial
difference, in that sl_log_* is expressly a shared structure.  In
contrast, WAL isn't directly sharable; you don't mix together multiple
WAL streams.

It seems as though the point in time at which you need to know the
origin ID is the moment at which you're deciding to read data from the
WAL files, and knowing which stream you are reading from is an
assertion that might be satisfied by looking at configuration that
doesn't need to be in the WAL stream itself.  It might be *nice* for
the WAL stream to be self-identifying, but that doesn't seem to be
forcibly necessary.

The case where it *would* be needful is if you are in the process of
assembling together updates coming in from multiple masters, and need
to know:
   - This INSERT was replicated from node #1, so should be ignored downstream
   - That INSERT was replicated from node #2, so should be ignored downstream
   - This UPDATE came from the local node, so needs to be passed to
downstream users

Or perhaps something else is behind the node id being deeply embedded
into the stream that I'm not seeing altogether.

 Other metadata will not be needed on such a low level.

 I also have to admit that I am very hesitant to start developing some generic
 transaction metadata framework atm. That seems to be a good way to spend a
 good part of time in discussion and disagreeing. Imo thats something for
 later.

Well, I see there being a use in there being at least 3 sorts of LCR records:
a) Capturing literal SQL that is to replayed downstream.  This
parallels two use cases existing in existing replication systems:
   i) In pre-2.2 versions of Slony, statements are replayed literally.
 So there's a stream of INSERT/UPDATE/DELETE statements.
   ii) DDL capture and replay.  In existing replication systems, DDL
isn't captured implicitly, the way Dimitri's Event Triggers are to do,
but rather is captured explicitly.
   There should be a function to allow injecting such SQL explicitly;
that is sure to be a useful sort of thing to be able to do.

b) Capturing tuple updates in a binary form that can be turned readily
into heap updates on a replica.
Unfortunately, this form is likely not to play well when
replicating across platforms or Postgres versions, so I suspect that
this performance optimization should be implemented as a *last*
resort, rather than first.  Michael Jackson had some rules of
optimization that said don't do it, and, for the expert, don't do
it YET...

c) Capturing tuple data in some reasonably portable and readily
re-writable form.
Slony 2.2 changes from SQL fragments (of a) i) above) to storing
updates as an array of text values indicating:
 - relation name
 - attribute names
 - attribute values, serialized into strings
I don't know that this provably represents the *BEST*
representation, but it definitely will be portable where b) would not
be, and lends itself to being able to reuse query plans, where a)
requires extraordinary amounts of parsing work, today.  So I'm pretty
sure it's better than a) and b) for a sizable set of cases.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund and...@2ndquadrant.com wrote:
 I would prefer the event trigger way because that seems to be the most
 extensible/reusable. It would allow a fully replicated databases and catalog
 only instances.
 I think we need to design event triggers in a way you cannot simply circumvent
 them. We already have the case that if users try to screw around system
 triggers we give back wrong answers with the planner relying on foreign keys
 btw.
 If the problem is having user trigger after system triggers: Lets make that
 impossible. Forbidding DDL on the other instances once we have that isn't that
 hard.

So, this is interesting.  I think something like this could solve the
problem, but then why not just make it built-in code that runs from
the same place as the event trigger rather than using the trigger
mechanism per se?  Presumably the trigger code's purpose is going to
be to inject additional data into the WAL stream (am I wrong?) which
is not something you're going to be able to do from PL/pgsql anyway,
so you don't really need a trigger, just a call to some C function -
which not only has the advantage of being not bypassable, but is also
faster.

 Perhaps all that will get simpler if we can make reading the catalog via
 custom built snapshots work as you proposed otherwhere in this thread. That
 would make checking errors way much easier even if you just want to apply to
 a database with exactly the same schema. Thats the next thing I plan to work
 on.

I realized a problem with that idea this morning: it might work for
reading things, but if anyone attempts to write data you've got big
problems.  Maybe we could get away with forbidding that, not sure.
Would be nice to get some input from other hackers on this.

 They could also modify the catalogs directly, although it's possible we
 don't care quite as much about that case (but on the other hand people
 do sometimes need to do it to solve real problems).
 With that you already can crash the database perfectly fine today. I think
 trying to care for that is a waste of time.

You're probably right.

 I agree that the focus isn't 100% optimal and that there are *loads* of issues
 we haven't event started to look at. But you need a point to start and
 extraction  apply seems to be a good one because you can actually test it
 without the other issues solved which is not really the case the other way
 round.
 Also its possible to plug in the newly built changeset extraction into
 existing solutions to make them more efficient while retaining most of their
 respective framework.

 So I disagree that thats the wrong part to start with.

I think extraction is a very sensible place to start; actually, I
think it's the best possible place to start.  But this particular
thread is about adding origin_ids to WAL, which I think is definitely
not the best place to start.

 I definitely do want to provide code that generates a textual representation
 of the changes. As you say, even if its not used for anything its needed for
 debugging. Not sure if it should be sql or maybe the new slony representation.
 If thats provided and reusable it should make sure that ontop of that other
 solutions can be built.

Oh, yeah.  If we can get that, I will throw a party.

 I find your supposition that I/we just want to get MMR without regard for
 anything else a bit offensive. I wrote at least three times in this thread
 that I do think its likely that we will not get more than the minimal basis
 for implementing MMR into 9.3. I wrote multiple times that I want to provide
 the basis for multiple solutions. The prototype - while obviously being
 incomplete - tried hard to be modular.
 You cannot blame us that we want the work we do to be *also* usable for what
 one of our major aims?
 What can I do to convince you/others that I am not planning to do something
 evil but that I try to reach as many goals at once as possible?

Sorry.  I don't think you're planning to do something evil, but before
I thought you said you did NOT want to write the code to extract
changes as text or something similar.  I think that would be a really
bad thing to skip for all kinds of reasons.  I think we need that as a
foundational technology before we do much else.  Now, once we have
that, if we can safely detect cases where it's OK to bypass decoding
to text and skip it in just those cases, I think that's great
(although possibly difficult to implement correctly).  I basically
feel that without decode-to-text, this can't possibly be a basis for
multiple solutions; it will be a basis only for itself, and extremely
difficult to debug, too.  No other replication solution can even
theoretically have any use for the raw on-disk tuple, at least not
without horrible kludgery.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wednesday, June 20, 2012 07:17:57 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  I would prefer the event trigger way because that seems to be the most
  extensible/reusable. It would allow a fully replicated databases and
  catalog only instances.
  I think we need to design event triggers in a way you cannot simply
  circumvent them. We already have the case that if users try to screw
  around system triggers we give back wrong answers with the planner
  relying on foreign keys btw.
  If the problem is having user trigger after system triggers: Lets make
  that impossible. Forbidding DDL on the other instances once we have that
  isn't that hard.
 
 So, this is interesting.  I think something like this could solve the
 problem, but then why not just make it built-in code that runs from
 the same place as the event trigger rather than using the trigger
 mechanism per se?  Presumably the trigger code's purpose is going to
 be to inject additional data into the WAL stream (am I wrong?) which
 is not something you're going to be able to do from PL/pgsql anyway,
 so you don't really need a trigger, just a call to some C function -
 which not only has the advantage of being not bypassable, but is also
 faster.
I would be totally fine with that. As long as event triggers provide the 
infrastructure that shouldn't be a big problem.

  Perhaps all that will get simpler if we can make reading the catalog via
  custom built snapshots work as you proposed otherwhere in this thread.
  That would make checking errors way much easier even if you just want to
  apply to a database with exactly the same schema. Thats the next thing I
  plan to work on.
 I realized a problem with that idea this morning: it might work for
 reading things, but if anyone attempts to write data you've got big
 problems.  Maybe we could get away with forbidding that, not sure.
Hm, why is writing a problem? You mean io conversion routines writing data? 
Yes, that will be a problem. I am fine with simply forbidding that, we should 
be able to catch that and provide a sensible error message, since SSI we have 
the support for that.

 Would be nice to get some input from other hackers on this.
Oh, yes!


  I agree that the focus isn't 100% optimal and that there are *loads* of
  issues we haven't event started to look at. But you need a point to
  start and extraction  apply seems to be a good one because you can
  actually test it without the other issues solved which is not really the
  case the other way round.
  Also its possible to plug in the newly built changeset extraction into
  existing solutions to make them more efficient while retaining most of
  their respective framework.
  
  So I disagree that thats the wrong part to start with.
 
 I think extraction is a very sensible place to start; actually, I
 think it's the best possible place to start.  But this particular
 thread is about adding origin_ids to WAL, which I think is definitely
 not the best place to start.
Yep. I think the reason everyone started at it is that the patch was actually 
really simple ;).
Note that the wal enrichement  decoding patches were before the origin_id 
patch in the patchseries ;)

  I definitely do want to provide code that generates a textual
  representation of the changes. As you say, even if its not used for
  anything its needed for debugging. Not sure if it should be sql or maybe
  the new slony representation. If thats provided and reusable it should
  make sure that ontop of that other solutions can be built.
 Oh, yeah.  If we can get that, I will throw a party.
Good ;)

  I find your supposition that I/we just want to get MMR without regard for
  anything else a bit offensive. I wrote at least three times in this
  thread that I do think its likely that we will not get more than the
  minimal basis for implementing MMR into 9.3. I wrote multiple times that
  I want to provide the basis for multiple solutions. The prototype -
  while obviously being incomplete - tried hard to be modular.
  You cannot blame us that we want the work we do to be *also* usable for
  what one of our major aims?
  What can I do to convince you/others that I am not planning to do
  something evil but that I try to reach as many goals at once as
  possible?
 Sorry.  I don't think you're planning to do something evil, but before
 I thought you said you did NOT want to write the code to extract
 changes as text or something similar.
Hm. I might have been a bit ambiguous when saying that I do not want to 
provide everything for that use-case.
Once we have a callpoint that has a correct catalog snapshot for exactly the 
tuple in question text conversion is damn near trivial. The point where you 
get passed all that information (action, tuple, table, snapshot) is the one I 
think the patch should mainly provide.

 I think that would be a really
 bad thing to skip for all kinds of reasons.  I think we need that as a

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 23:56, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs si...@2ndquadrant.com wrote:
 But I think getting even
 single-master logical replication working well in a single release
 cycle is going to be a job and a half.

 OK, so your estimate is 1.5 people to do that. And if we have more
 people, should they sit around doing nothing?

 Oh, give me a break.  You're willfully missing my point.  And to quote
 Fred Brooks, nine women can't make a baby in one month.

 No, I'm not. The question is not how quickly can N people achieve a
 single thing, but how long will it take a few skilled people working
 on carefully selected tasks that have few dependencies between them to
 achieve something.

 The bottleneck is getting the design right, not writing the code.
 Selecting tasks for people to work on without an agreement on the
 design will not advance the process, unless we just accept whatever
 code you choose to right based on whatever design you happen to pick.

Right. Which is why there are multiple people doing design on
different aspects and much work going into prototyping to give some
useful information into debates that would otherwise be bikeshedding.


 How exactly did you arrive at your conclusion? Why is yours right and
 mine wrong?

 I estimated the amount of work that would be required to do this right
 and compared it to other large projects that have been successfully
 done in the past.  I think you are looking at something on the order
 of magnitude of the Windows port, which took about four releases to
 become stable, or the SE-Linux project, which still isn't
 feature-complete.  Even if it's only a HS-sized project, that took two
 releases, as did streaming replication.  SSI got committed within one
 release cycle, but there were several years of design and advocacy
 work before any code was written, so that, too, was really a
 multi-year project.

The current BDR project uses concepts that have been in discussion for
many years, something like 6-7 years. I first wrote multi-master for
Postgres in userspace in 2007 and the current designs build upon that.
My detailed design for WAL-based logical replication was produced in
2009. I'm not really breaking new ground in the design, and we're
reusing significant parts of existing code. You don't know that cos
you didn't think to ask, which is a little surprising.

On top of that, Andres has refined many of the basic ideas and what he
presents here is a level above my initial design work. I have every
confidence that he'll work independently and produce useful code while
other aspects are considered.


 I'll confine my comments on the second part of the question to the
 observation that it is a bit early to know who is right and who is
 wrong, but the question could just as easily be turned on its head.

 No, I have four people who had initial objections and who have not
 commented on the fact that the points made are regrettably incorrect.

 I think Kevin addressed this point better than I can.  Asserting
 something doesn't make it true, and you haven't offered any rational
 argument against the points that have been made, probably because
 there isn't one.  We *cannot* justify steeling 100% of the available
 bit space for a feature that many people won't use and may not be
 enough to address the real requirement anyway.


You've some assertions above that aren't correct, so its good that you
recognise assertions may not be true, on both sides. I ask that you
listen a little before casting judgements. We won't get anywhere
otherwise.

The current wasted space on 64bit systems is 6 bytes per record. The
current suggestion is to use 2 of those, in a flexible manner that
allows future explansion if it is required, and only if it is
required. That flexibility covers anything else we need. Asserting
that 100% of the available space is being used is wrong, and to
continue to assert that even when told otherwise multiple times is
something else.


 Since at least 3 of the people making such comments did not attend the
 full briefing meeting in Ottawa, I am not particularly surprised.
 However, I do expect people that didn't come to the meeting to
 recognise that they are likely to be missing information and to listen
 closely, as I listen to them.

 Participation in the community development process is not contingent
 on having flown to Ottawa in May, or on having decided to spend that
 evening at your briefing meeting.  Attributing to ignorance what is
 adequately explained by honest disagreement is impolite.

I don't think people need to have attended that briefing. But if they
did not, then they do need to be aware they missed hours of community
discussion and explanation, built on top of months of careful
investigation, all of which was aimed at helping everybody understand.

Jumping straight into a discussion is what the community is about
though and I welcome people's input. It seems reasonable

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wed, Jun 20, 2012 at 1:40 PM, Andres Freund and...@2ndquadrant.com wrote:
 I realized a problem with that idea this morning: it might work for
 reading things, but if anyone attempts to write data you've got big
 problems.  Maybe we could get away with forbidding that, not sure.
 Hm, why is writing a problem? You mean io conversion routines writing data?
 Yes, that will be a problem. I am fine with simply forbidding that, we should
 be able to catch that and provide a sensible error message, since SSI we have
 the support for that.

I think we could do something a little more vigorous than that, like
maybe error out if anyone tries to do anything that would write WAL or
acquire an XID.  Of course, then the question becomes: then what?  We
probably need to think about what happens after that - we don't want
an error replicating one row in one table to permanently break
replication for the entire system.

 Sorry.  I don't think you're planning to do something evil, but before
 I thought you said you did NOT want to write the code to extract
 changes as text or something similar.
 Hm. I might have been a bit ambiguous when saying that I do not want to
 provide everything for that use-case.
 Once we have a callpoint that has a correct catalog snapshot for exactly the
 tuple in question text conversion is damn near trivial. The point where you
 get passed all that information (action, tuple, table, snapshot) is the one I
 think the patch should mainly provide.

This is actually a very interesting list.  We could rephrase the
high-level question about the design of this feature as what is the
best way to make sure that you have these things available?.  Action
and tuple are trivial to get, and table isn't too hard either.  It's
really the snapshot - and all the downstream information that can only
be obtained via using that snapshot - that is the hard part.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 21 June 2012 01:40, Andres Freund and...@2ndquadrant.com wrote:

 I think extraction is a very sensible place to start; actually, I
 think it's the best possible place to start.  But this particular
 thread is about adding origin_ids to WAL, which I think is definitely
 not the best place to start.

 Yep. I think the reason everyone started at it is that the patch was actually
 really simple ;).
 Note that the wal enrichement  decoding patches were before the origin_id
 patch in the patchseries ;)

Actually, I thought it would be non contentious...

A format change, so early in release cycle. Already openly discussed
with no comment. Using wasted space for efficiency, for information
already in use by Slony for similar reasons.

Oh well.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

Hi Chris!

On Wednesday, June 20, 2012 07:06:28 PM Christopher Browne wrote:
 On Wed, Jun 20, 2012 at 11:50 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
  Simon Riggs si...@2ndquadrant.com wrote:
   This is not transaction metadata, it is WAL record metadata
   required for multi-master replication, see later point.
   
   We need to add information to every WAL record that is used as the
   source for generating LCRs.
  
  If the origin ID of a transaction doesn't count as transaction
  metadata (i.e., data about the transaction), what does?  It may be a
  metadata element about which you have special concerns, but it is
  transaction metadata.  You don't plan on supporting individual WAL
  records within a transaction containing different values for origin
  ID, do you?  If not, why is it something to store in every WAL
  record rather than once per transaction?  That's not intended to be
  a rhetorical question.
  
  Its definitely possible to store it per transaction (see the discussion
  around http://archives.postgresql.org/message-
  id/201206201605.43634.and...@2ndquadrant.com) it just makes the filtering
  via the originating node a considerably more complex thing. With our
  proposal you can do it without any complexity involved, on a low level.
  Storing it per transaction means you can only stream out the data to
  other nodes *after* fully reassembling the transaction. Thats a pitty,
  especially if we go for a design where the decoding happens in a proxy
  instance.
 
 I guess I'm not seeing the purpose to having the origin node id in the
 WAL stream either.
 
 We have it in the Slony sl_log_* stream, however there is a crucial
 difference, in that sl_log_* is expressly a shared structure.  In
 contrast, WAL isn't directly sharable; you don't mix together multiple
 WAL streams.
 
 It seems as though the point in time at which you need to know the
 origin ID is the moment at which you're deciding to read data from the
 WAL files, and knowing which stream you are reading from is an
 assertion that might be satisfied by looking at configuration that
 doesn't need to be in the WAL stream itself.  It might be *nice* for
 the WAL stream to be self-identifying, but that doesn't seem to be
 forcibly necessary.
 
 The case where it *would* be needful is if you are in the process of
 assembling together updates coming in from multiple masters, and need
 to know:
- This INSERT was replicated from node #1, so should be ignored
 downstream - That INSERT was replicated from node #2, so should be ignored
 downstream - This UPDATE came from the local node, so needs to be passed
 to downstream users
Exactly that is the point. And you want to do that in an efficient manner 
without too much logic, thats why something simple like the record header is 
so appealing.

  I also have to admit that I am very hesitant to start developing some
  generic transaction metadata framework atm. That seems to be a good
  way to spend a good part of time in discussion and disagreeing. Imo
  thats something for later.

 Well, I see there being a use in there being at least 3 sorts of LCR
 records:
 a) Capturing literal SQL that is to replayed downstream
 b) Capturing tuple updates in a binary form that can be turned readily
 into heap updates on a replica.
 c) Capturing tuple data in some reasonably portable and readily
 re-writable form
I think we should provide the utilities to do all of those. a) is a 
consequence of being able to do c). 

That doesn't really have something to do with this subthread though? The part 
you quoted above was my response to the suggestion to add some generic 
framework to attach metadata to individual transactions on the generating 
side. We quite possibly will end up needing that but I personally don't think 
we should designing that part atm.

 b) Capturing tuple updates in a binary form that can be turned readily
 into heap updates on a replica.
 Unfortunately, this form is likely not to play well when
 replicating across platforms or Postgres versions, so I suspect that
 this performance optimization should be implemented as a *last*
 resort, rather than first.  Michael Jackson had some rules of
 optimization that said don't do it, and, for the expert, don't do
 it YET...
Well, apply is a bottleneck. Besides field experience I/We have benchmarked it 
and its rather plausible that it is. And I don't think we can magically make 
that faster in pg in general so my plan is to remove the biggest cost factor I 
can see.
And yes, it will have restrictions...

Regards,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 21 June 2012 01:06, Christopher Browne cbbro...@gmail.com wrote:

 I guess I'm not seeing the purpose to having the origin node id in the
 WAL stream either.

 We have it in the Slony sl_log_* stream, however there is a crucial
 difference, in that sl_log_* is expressly a shared structure.  In
 contrast, WAL isn't directly sharable; you don't mix together multiple
 WAL streams.

Unfortunately you do. That's really the core of how this differs from
current Slony.

Every change we make creates WAL records. Whether that is changes
originating on the current node, or changes originating on upstream
nodes that need to be applied on the current node.

The WAL stream is then read and filtered for changes to pass onto
other nodes. So we want to be able to filter out the applied changes
to avoid passing them back to the original nodes.

Having each record know the origin makes the filtering much simpler,
so if its possible to do it efficiently then its the best design. It
turns out to be the best way to do this so far known. There are other
design however, as noted. In all cases we need the origin id in the
WAL.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node


On 20.06.2012 17:35, Simon Riggs wrote:

On 20 June 2012 16:23, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 20.06.2012 11:17, Simon Riggs wrote:


On 20 June 2012 15:45, Heikki Linnakangas
heikki.linnakan...@enterprisedb.comwrote:


So, if the origin id is not sufficient for some conflict resolution
mechanisms, what extra information do you need for those, and where do
you put it?


As explained elsewhere, wal_level = logical (or similar) would be used
to provide any additional logical information required.

Update and Delete WAL records already need to be different in that
mode, so additional info would be placed there, if there were any.

In the case of reflexive updates you raised, a typical response in
other DBMS would be to represent the query
   UPDATE SET counter = counter + 1
by sending just the +1 part, not the current value of counter, as
would be the case with the non-reflexive update
   UPDATE SET counter = 1

Handling such things in Postgres would require some subtlety, which
would not be resolved in first release but is pretty certain not to
require any changes to the WAL record header as a way of resolving it.
Having already thought about it, I'd estimate that is a very long
discussion and not relevant to the OT, but if you wish to have it
here, I won't stop you.



Yeah, I'd like to hear briefly how you would handle that without any further
changes to the WAL record header.


I already did:


Update and Delete WAL records already need to be different in that
mode, so additional info would be placed there, if there were any.


The case you mentioned relates to UPDATEs only, so I would suggest
that we add that information to a new form of update record only.

That has nothing to do with the WAL record header.


Hmm, so you need the origin id in the WAL record header to do filtering. 
Except when that's not enough, you add some more fields to heap update 
and delete records.


Don't you think it would be simpler to only add the extra fields to heap 
insert, update and delete records, and leave the WAL record header 
alone? Do you ever need extra information on other record types?


The other question is, *what* information do you need to put there? We 
don't necessarily need to have a detailed design of that right now, but 
it seems quite clear to me that we need more flexibility there than this 
patch provides, to support more complicated conflict resolution.


I'm not saying that we need to implement all possible conflict 
resolution algorithms right now - on the contrary I think conflict 
resolution belongs outside core - but if we're going to change the WAL 
record format to support such conflict resolution, we better make sure 
the foundation we provide for it is solid.


BTW, one way to work around the lack of origin id in the WAL record 
header is to just add an origin-id column to the table, indicating the 
last node that updated the row. That would be a kludge, but I thought 
I'd mention it..


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wednesday, June 20, 2012 07:50:37 PM Robert Haas wrote:
 On Wed, Jun 20, 2012 at 1:40 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  I realized a problem with that idea this morning: it might work for
  reading things, but if anyone attempts to write data you've got big
  problems.  Maybe we could get away with forbidding that, not sure.
  
  Hm, why is writing a problem? You mean io conversion routines writing
  data? Yes, that will be a problem. I am fine with simply forbidding
  that, we should be able to catch that and provide a sensible error
  message, since SSI we have the support for that.
 I think we could do something a little more vigorous than that, like
 maybe error out if anyone tries to do anything that would write WAL or
 acquire an XID.
I would go for all of them ;). The read only transaction warnings will 
probably result in the best error messages.

 Of course, then the question becomes: then what?  We
 probably need to think about what happens after that - we don't want
 an error replicating one row in one table to permanently break
 replication for the entire system.
Interesting problem, yes. My first reaction was to just to warn and log and 
throw the transaction away... But thats not likely to work very well on the 
apply side...

I don't think its a particularly likely problem though. An io function doing 
that would probably fail in lots of other situations (HS, read only 
transactions, permission problems).

  Sorry.  I don't think you're planning to do something evil, but before
  I thought you said you did NOT want to write the code to extract
  changes as text or something similar.
  Hm. I might have been a bit ambiguous when saying that I do not want to
  provide everything for that use-case.
  Once we have a callpoint that has a correct catalog snapshot for exactly
  the tuple in question text conversion is damn near trivial. The point
  where you get passed all that information (action, tuple, table,
  snapshot) is the one I think the patch should mainly provide.
 This is actually a very interesting list.  We could rephrase the
 high-level question about the design of this feature as what is the
 best way to make sure that you have these things available?.  Action
 and tuple are trivial to get, and table isn't too hard either.  It's
 really the snapshot - and all the downstream information that can only
 be obtained via using that snapshot - that is the hard part.
For others, a sensible entry point into this discussion before switching 
subthreads probably is in http://archives.postgresql.org/message-
id/201206192023.20589.and...@2ndquadrant.com

The table isn't as easy as you might think as the wal record only contains the 
relfilenode. Unless you want to log more its basically solved by solving the 
snapshot issue.

And yes, agreeing on how to do that is the thing we need to solve next. Patch 
07 (add enough information to wal) and this are the first ones that should get 
committed imo. And be ready for that obviously.

Just noticed that I failed to properly add Patch 07 to the commitfest. I have 
done so now, hope thats ok.


Greetings,

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node


On 20.06.2012 16:46, Simon Riggs wrote:

The proposal now includes flag bits that would allow the addition of a
variable length header, should that ever become necessary. So the
unused space in the fixed header is not being used up as you say. In
any case, the fixed header still has 4 wasted bytes on 64bit systems
even after the patch is applied. So this claim of short sightedness is
just plain wrong.


 ...


We need to add information to every WAL record that is used as the
source for generating LCRs. It is also possible to add this to HEAP
and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
using the *currently wasted* bytes on a WAL record header does *not*
bloat the WAL stream.


Or, we could provide a mechanism for resource managers to use those 
padding bytes for whatever data the wish to use. Or modify the record 
format so that the last 4 bytes of the data in the WAL record are always 
automatically stored in those padding bytes, thus making all WAL records 
4 bytes shorter. That would make the WAL even more compact, with only a 
couple of extra CPU instructions in the critical path.


My point is that it's wrong to think that it's free to use those bytes, 
just because they're currently unused. If we use them for one thing, we 
can't use them for other things anymore. If we're so concerned about WAL 
bloat that we can't afford to add any more bytes to the WAL record 
header or heap WAL records, then it would be equally fruitful to look at 
ways to use those padding bytes to save that precious WAL space.


I don't think we're *that* concerned about the WAL bloat, however. So 
let's see what is the most sensible place to add whatever extra 
information we need in the WAL, from the point of view of 
maintainability, flexibility, readability etc. Then we can decide where 
to put it.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 21 June 2012 02:32, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 20.06.2012 17:35, Simon Riggs wrote:

 On 20 June 2012 16:23, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 On 20.06.2012 11:17, Simon Riggs wrote:


 On 20 June 2012 15:45, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com    wrote:


 So, if the origin id is not sufficient for some conflict resolution
 mechanisms, what extra information do you need for those, and where do
 you put it?


 As explained elsewhere, wal_level = logical (or similar) would be used
 to provide any additional logical information required.

 Update and Delete WAL records already need to be different in that
 mode, so additional info would be placed there, if there were any.

 In the case of reflexive updates you raised, a typical response in
 other DBMS would be to represent the query
   UPDATE SET counter = counter + 1
 by sending just the +1 part, not the current value of counter, as
 would be the case with the non-reflexive update
   UPDATE SET counter = 1

 Handling such things in Postgres would require some subtlety, which
 would not be resolved in first release but is pretty certain not to
 require any changes to the WAL record header as a way of resolving it.
 Having already thought about it, I'd estimate that is a very long
 discussion and not relevant to the OT, but if you wish to have it
 here, I won't stop you.



 Yeah, I'd like to hear briefly how you would handle that without any
 further
 changes to the WAL record header.


 I already did:

 Update and Delete WAL records already need to be different in that
 mode, so additional info would be placed there, if there were any.


 The case you mentioned relates to UPDATEs only, so I would suggest
 that we add that information to a new form of update record only.

 That has nothing to do with the WAL record header.


 Hmm, so you need the origin id in the WAL record header to do filtering.
 Except when that's not enough, you add some more fields to heap update and
 delete records.

Yes

 Don't you think it would be simpler to only add the extra fields to heap
 insert, update and delete records, and leave the WAL record header alone? Do
 you ever need extra information on other record types?

No extra info on other record types, in general at least. Doing it
that way is the most logical place, just not the most efficient.

 The other question is, *what* information do you need to put there? We don't
 necessarily need to have a detailed design of that right now, but it seems
 quite clear to me that we need more flexibility there than this patch
 provides, to support more complicated conflict resolution.

Another patch already covers that for the common non-reflexive case.
More on conflict resolution soon, I guess. I was going to begin
working on that in July. Starting with the design discussed on list,
of course. This patch has thrown up stuff I thought was
compartmentalised, but I was wrong.

 I'm not saying that we need to implement all possible conflict resolution
 algorithms right now - on the contrary I think conflict resolution belongs
 outside core

It's a pretty standard requirement to have user defined conflict
handling, if that's what you mean.

I'm OK with conflict handling being outside of core as a module, if
that's what people think. We just need a commit hook. That is more
likely to allow a workable solution with 9.3 as well, ISTM. It's
likely to be contentious...

 - but if we're going to change the WAL record format to support
 such conflict resolution, we better make sure the foundation we provide for
 it is solid.

Agreed

 BTW, one way to work around the lack of origin id in the WAL record header
 is to just add an origin-id column to the table, indicating the last node
 that updated the row. That would be a kludge, but I thought I'd mention it..

err, I hope you mean that to be funny. (It wouldn't actually work either)

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 21 June 2012 02:45, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 20.06.2012 16:46, Simon Riggs wrote:

 The proposal now includes flag bits that would allow the addition of a
 variable length header, should that ever become necessary. So the
 unused space in the fixed header is not being used up as you say. In
 any case, the fixed header still has 4 wasted bytes on 64bit systems
 even after the patch is applied. So this claim of short sightedness is
 just plain wrong.

 ...



 We need to add information to every WAL record that is used as the
 source for generating LCRs. It is also possible to add this to HEAP
 and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
 using the *currently wasted* bytes on a WAL record header does *not*
 bloat the WAL stream.


Wonderful ideas, these look good.

 Or, we could provide a mechanism for resource managers to use those padding
 bytes for whatever data the wish to use.

Sounds better to me.

 Or modify the record format so that
 the last 4 bytes of the data in the WAL record are always automatically
 stored in those padding bytes, thus making all WAL records 4 bytes shorter.
 That would make the WAL even more compact, with only a couple of extra CPU
 instructions in the critical path.

Sounds cool, but a little weird, even for me.

 My point is that it's wrong to think that it's free to use those bytes, just
 because they're currently unused. If we use them for one thing, we can't use
 them for other things anymore. If we're so concerned about WAL bloat that we
 can't afford to add any more bytes to the WAL record header or heap WAL
 records, then it would be equally fruitful to look at ways to use those
 padding bytes to save that precious WAL space.

Agreed. Thanks for sharing those ideas. Exactly why I like the list (really...)

 I don't think we're *that* concerned about the WAL bloat, however. So let's
 see what is the most sensible place to add whatever extra information we
 need in the WAL, from the point of view of maintainability, flexibility,
 readability etc. Then we can decide where to put it.

Removing FPW is still most important aspect there.

I think allowing rmgrs to redefine the wasted bytes in the header is
the best idea.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 21 June 2012 02:56, Simon Riggs si...@2ndquadrant.com wrote:

 I think allowing rmgrs to redefine the wasted bytes in the header is
 the best idea.

Hmm, I think the best idea is to save 2 bytes off the WAL header for
all records, so there are no wasted bytes on 64bit or 32bit.

That way the potential for use goes away and there's benefit for all,
plus no argument about how to use those bytes in rarer cases.

I'll work on that.

And then we just put the originid on each heap record for MMR, in some
manner, discussed later.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node


On 20.06.2012 21:51, Simon Riggs wrote:

On 21 June 2012 02:32, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

I'm not saying that we need to implement all possible conflict resolution
algorithms right now - on the contrary I think conflict resolution belongs
outside core


It's a pretty standard requirement to have user defined conflict
handling, if that's what you mean.

I'm OK with conflict handling being outside of core as a module, if
that's what people think. We just need a commit hook. That is more
likely to allow a workable solution with 9.3 as well, ISTM. It's
likely to be contentious...


Hmm, what do you need to happen at commit time?

We already have RegisterXactCallback(), if that's enough...


BTW, one way to work around the lack of origin id in the WAL record header
is to just add an origin-id column to the table, indicating the last node
that updated the row. That would be a kludge, but I thought I'd mention it..


err, I hope you mean that to be funny. (It wouldn't actually work either)


No, I wasn't serious that we should implement it that way. But now you 
made me curious; why would it not work? If there's an origin-id column 
in a table, it's included in every heap insert/delete/update WAL record. 
Just set it to the current node's id on a local modification, and to the 
origin's id when replaying changes from another node, and you have the 
exact same information as you would with the extra field in WAL record 
header.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

Hi,

On Wednesday, June 20, 2012 08:32:53 PM Heikki Linnakangas wrote:
 On 20.06.2012 17:35, Simon Riggs wrote:
  On 20 June 2012 16:23, Heikki Linnakangas
  
  heikki.linnakan...@enterprisedb.com  wrote:
  On 20.06.2012 11:17, Simon Riggs wrote:
  On 20 June 2012 15:45, Heikki Linnakangas
  
  heikki.linnakan...@enterprisedb.comwrote:
  So, if the origin id is not sufficient for some conflict resolution
  mechanisms, what extra information do you need for those, and where do
  you put it?
  
  As explained elsewhere, wal_level = logical (or similar) would be used
  to provide any additional logical information required.
  
  Update and Delete WAL records already need to be different in that
  mode, so additional info would be placed there, if there were any.
  
  In the case of reflexive updates you raised, a typical response in
  other DBMS would be to represent the query
  
 UPDATE SET counter = counter + 1
  
  by sending just the +1 part, not the current value of counter, as
  would be the case with the non-reflexive update
  
 UPDATE SET counter = 1
  
  Handling such things in Postgres would require some subtlety, which
  would not be resolved in first release but is pretty certain not to
  require any changes to the WAL record header as a way of resolving it.
  Having already thought about it, I'd estimate that is a very long
  discussion and not relevant to the OT, but if you wish to have it
  here, I won't stop you.
  
  Yeah, I'd like to hear briefly how you would handle that without any
  further changes to the WAL record header.
  
  I already did:
  Update and Delete WAL records already need to be different in that
  mode, so additional info would be placed there, if there were any.
  
  The case you mentioned relates to UPDATEs only, so I would suggest
  that we add that information to a new form of update record only.
  
  That has nothing to do with the WAL record header.
 
 Hmm, so you need the origin id in the WAL record header to do filtering.
 Except when that's not enough, you add some more fields to heap update
 and delete records.
Imo the whole +1 stuff doesn't have anything to do with the origin_id proposal 
and should be ignored for quite a while. We might go to something like it 
sometime in the future but its nothing we work on (as far as I know ;)).

wal_level=logical (in patch 07) currently only changes the following things 
about the wal stream:

For HEAP_(INSERT|(HOT_)?UPDATE|DELETE)
* prevent full page writes from removing the row data (could be optimized at 
some point to just store the tuple slot)

For HEAP_DELETE
* add the primary key of the changed row

HEAP_MULTI_INSERT obviously needs to get the same treatment in future.

The only real addition that I forsee in the near future is logging the old 
primary key when the primary key changes in HEAP_UPDATE.

Kevin wants an option for full pre-images of rows in HEAP_(UPDATE|DELETE)

 Don't you think it would be simpler to only add the extra fields to heap
 insert, update and delete records, and leave the WAL record header
 alone? Do you ever need extra information on other record types?
Its needed in some more locations: HEAP_HOT_UPDATE, HEAP2_MULTI_INSERT, 
HEAP_NEWPAGE, HEAP_XACT_(ASSIGN, COMMIT, COMMIT_PREPARED, COMMIT_COMPACT, 
ABORT, ABORT_PREPARED) and probably some I didn't remember right now.

Sure, we can add it to all those but then you need to have individual 
knowledge about *all* of those because the location where its stored will be 
different for each of them.

To recap why we think origin_id is a sensible design choice:

There are many sensible replication topologies where it does make sense that 
you want to receive changes (on node C) from one node (say B) that originated 
from some other node (say A).
Reasons include:
* the order of applying changes should be as similar as possible on all nodes. 
That means when applying a change on C that originated on B and if changes 
replicated faster from A-B than from A-C you want to be at least as far with 
the replication from A as B was. Otherwise the conflict ratio will increase. 
If you can recreate the stream from the wal of every node and still detect 
where an individual change originated, thats easy.
* the interconnects between some nodes may be more expensive than from others
* an interconnect between two nodes may fail but others dont

Because of that we think its sensible to be able generate the full LCR stream 
with all changes, local and remote ones, on each individual node. If you then 
can filter on individual origin_id's you can build complex replication 
topologies without much additional complexity.

 I'm not saying that we need to implement all possible conflict
 resolution algorithms right now - on the contrary I think conflict
 resolution belongs outside core - but if we're going to change the WAL
 record format to support such conflict resolution, we better make sure
 the foundation we provide for it is solid.
I think this already

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node


On 20.06.2012 22:11, Simon Riggs wrote:

On 21 June 2012 02:56, Simon Riggssi...@2ndquadrant.com  wrote:


I think allowing rmgrs to redefine the wasted bytes in the header is
the best idea.


Hmm, I think the best idea is to save 2 bytes off the WAL header for
all records, so there are no wasted bytes on 64bit or 32bit.

That way the potential for use goes away and there's benefit for all,
plus no argument about how to use those bytes in rarer cases.

I'll work on that.


I don't think that's actually necessary, the WAL bloat isn't *that* bad 
that we need to start shaving bytes from there. I was just trying to 
make a point.



And then we just put the originid on each heap record for MMR, in some
manner, discussed later.


I reserve the right to object to that, too :-). Others raised the 
concern that a 16-bit integer is not a very intuitive identifier. Also, 
as discussed, for more complex scenarios just the originid is not 
sufficient. ISTM that we need more flexibility.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Aidan Van Dyk

On Wed, Jun 20, 2012 at 3:15 PM, Andres Freund and...@2ndquadrant.com wrote:

To recap why we think origin_id is a sensible design choice:

There are many sensible replication topologies where it does make sense that
you want to receive changes (on node C) from one node (say B) that originated
from some other node (say A).
Reasons include:
* the order of applying changes should be as similar as possible on all nodes.
That means when applying a change on C that originated on B and if changes
replicated faster from A-B than from A-C you want to be at least as far with
the replication from A as B was. Otherwise the conflict ratio will increase.
If you can recreate the stream from the wal of every node and still detect
where an individual change originated, thats easy.

OK, so in this case, I still don't see how the origin_id is even enough.

C applies the change originally from A (routed through B, because it's
faster). But when it get's the change directly from A, how does it
know to *not* apply it again?

* the interconnects between some nodes may be more expensive than from others
* an interconnect between two nodes may fail but others dont

Because of that we think its sensible to be able generate the full LCR stream
with all changes, local and remote ones, on each individual node. If you then
can filter on individual origin_id's you can build complex replication
topologies without much additional complexity.

I'm not saying that we need to implement all possible conflict
resolution algorithms right now - on the contrary I think conflict
resolution belongs outside core - but if we're going to change the WAL
record format to support such conflict resolution, we better make sure
the foundation we provide for it is solid.
I think this already provides a lot. At some point we probably want to have
support for looking on which node a certain local xid originated and when that
was originally executed. While querying that efficiently requires additional
support we already have all the information for that.

There are some more complexities with consistently determining conflicts on
changes that happened in a very small timewindown on different nodes but thats
something for another day.

BTW, one way to work around the lack of origin id in the WAL record
header is to just add an origin-id column to the table, indicating the
last node that updated the row. That would be a kludge, but I thought
I'd mention it..
Yuck. The aim is to improve on whats done today ;)

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Aidan Van Dyk Create like a god,
ai...@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 21 June 2012 03:13, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 20.06.2012 21:51, Simon Riggs wrote:

 On 21 June 2012 02:32, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 I'm not saying that we need to implement all possible conflict resolution

 algorithms right now - on the contrary I think conflict resolution
 belongs
 outside core


 It's a pretty standard requirement to have user defined conflict
 handling, if that's what you mean.

 I'm OK with conflict handling being outside of core as a module, if
 that's what people think. We just need a commit hook. That is more
 likely to allow a workable solution with 9.3 as well, ISTM. It's
 likely to be contentious...


 Hmm, what do you need to happen at commit time?

 We already have RegisterXactCallback(), if that's enough...

Let me refer back to my notes and discuss this later, I'm just going
from vague memory here, which isn't helpful.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 21 June 2012 03:23, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 And then we just put the originid on each heap record for MMR, in some
 manner, discussed later.


 I reserve the right to object to that, too :-).

OK. But that would be only for MMR, using special record types.

 Others raised the concern
 that a 16-bit integer is not a very intuitive identifier.

Of course

 Also, as
 discussed, for more complex scenarios just the originid is not sufficient.
 ISTM that we need more flexibility.

Of course

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wednesday, June 20, 2012 09:24:29 PM Aidan Van Dyk wrote:
 On Wed, Jun 20, 2012 at 3:15 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  To recap why we think origin_id is a sensible design choice:
  
  There are many sensible replication topologies where it does make sense
  that you want to receive changes (on node C) from one node (say B) that
  originated from some other node (say A).
  Reasons include:
  * the order of applying changes should be as similar as possible on all
  nodes. That means when applying a change on C that originated on B and
  if changes replicated faster from A-B than from A-C you want to be at
  least as far with the replication from A as B was. Otherwise the
  conflict ratio will increase. If you can recreate the stream from the
  wal of every node and still detect where an individual change
  originated, thats easy.
 
 OK, so in this case, I still don't see how the origin_id is even enough.
 
 C applies the change originally from A (routed through B, because it's
 faster).  But when it get's the change directly from A, how does it
 know to *not* apply it again?
The lsn of the change.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wednesday, June 20, 2012 09:23:34 PM Heikki Linnakangas wrote:
  And then we just put the originid on each heap record for MMR, in some
  manner, discussed later.
 
 I reserve the right to object to that, too :-). Others raised the 
 concern that a 16-bit integer is not a very intuitive identifier. Also, 
 as discussed, for more complex scenarios just the originid is not 
 sufficient. ISTM that we need more flexibility.
I think the '16bit integer is unintiuitive' argument isn't that interesting. 
As pointed out by multiple people in the thread that origin_id can be local 
and mapped to something more complex in communication between the different 
nodes and the configuration.
Before applying changes from another node you lookup their complex id into 
the locally mapped 16bit origin_id which then gets written into the wal 
stream. When decoding the wal stream into the LCR stream its mapped the other 
way.

We might need more information than that at a later point but those probably 
won't needed during low-level filtering of wal before reassembling it into 
transactions...

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Aidan Van Dyk

On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund and...@2ndquadrant.com wrote:

 OK, so in this case, I still don't see how the origin_id is even enough.

 C applies the change originally from A (routed through B, because it's
 faster).  But when it get's the change directly from A, how does it
 know to *not* apply it again?
 The lsn of the change.

So why isn't the LSN good enough for when C propagates the change back to A?

Why does A need more information than C?

a.


-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
 On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  OK, so in this case, I still don't see how the origin_id is even
  enough.
  
  C applies the change originally from A (routed through B, because it's
  faster).  But when it get's the change directly from A, how does it
  know to *not* apply it again?
  
  The lsn of the change.
 
 So why isn't the LSN good enough for when C propagates the change back to
 A?
Because every node has individual progress in the wal so the lsn doesn't mean 
anything unless you know from which node it originally is.

 Why does A need more information than C?
Didn't I explain that two mails up?

Perhaps Chris' phrasing explains the basic idea better:

On Wednesday, June 20, 2012 07:06:28 PM Christopher Browne wrote:
 The case where it would be needful is if you are in the process of
 assembling together updates coming in from multiple masters, and need
 to know:
- This INSERT was replicated from node #1, so should be ignored
 downstream - That INSERT was replicated from node #2, so should be ignored
 downstream - This UPDATE came from the local node, so needs to be passed
 to downstream users

Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in north 
america.
#1 wants changes from #3 and #4 when talking to #3 but not those from #2  and 
itself (because that would be cheaper to get locally).

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-20 Thread Aidan Van Dyk

On Wed, Jun 20, 2012 at 3:49 PM, Andres Freund and...@2ndquadrant.com wrote:
 On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
 On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund and...@2ndquadrant.com
 wrote:
  OK, so in this case, I still don't see how the origin_id is even
  enough.
 
  C applies the change originally from A (routed through B, because it's
  faster).  But when it get's the change directly from A, how does it
  know to *not* apply it again?
 
  The lsn of the change.

 So why isn't the LSN good enough for when C propagates the change back to
 A?
 Because every node has individual progress in the wal so the lsn doesn't mean
 anything unless you know from which node it originally is.

 Why does A need more information than C?
 Didn't I explain that two mails up?

Probably, but that didn't mean I understood it... I'm trying to keep up here ;-)

So the origin_id isn't strictly for the origin node to know filter an
LCR it's applied already, but it is also to correlate the LSN's
because the LSN's of the re-generated LCR's are meant to contain the
originating nodes's LSN, and every every node applying LCRs needs to
be able to know where it is in every node's LSN progress.

I had assumed any LCR's generated on a node woudl be relative to the
LSN sequencing of that node.

 Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in north
 america.
 #1 wants changes from #3 and #4 when talking to #3 but not those from #2  and
 itself (because that would be cheaper to get locally).

Right, but if the link between #1 and #2 ever slows down, changes
from #3 and #4 may very well already have #2's changes, and even
require them.

#1 has to apply them, or is it going to stop applying LCR's from #3
when it see's LCRs from #3 coming in that originate on #2 and have
LSNs greater than what it's so far received from #2?


-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wednesday, June 20, 2012 10:12:46 PM Aidan Van Dyk wrote:
 On Wed, Jun 20, 2012 at 3:49 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
  On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund and...@2ndquadrant.com
  
  wrote:
   OK, so in this case, I still don't see how the origin_id is even
   enough.
   
   C applies the change originally from A (routed through B, because
   it's faster).  But when it get's the change directly from A, how
   does it know to *not* apply it again?
   
   The lsn of the change.
  
  So why isn't the LSN good enough for when C propagates the change back
  to A?
  
  Because every node has individual progress in the wal so the lsn doesn't
  mean anything unless you know from which node it originally is.
  
  Why does A need more information than C?
  
  Didn't I explain that two mails up?
 
 Probably, but that didn't mean I understood it... I'm trying to keep up
 here ;-)
Heh. Yes. This developed into a huge mess already ;)

 So the origin_id isn't strictly for the origin node to know filter an
 LCR it's applied already, but it is also to correlate the LSN's
 because the LSN's of the re-generated LCR's are meant to contain the
 originating nodes's LSN, and every every node applying LCRs needs to
 be able to know where it is in every node's LSN progress.
There are multiple use-cases for it, this is one of them.

 I had assumed any LCR's generated on a node woudl be relative to the
 LSN sequencing of that node.
We have the original lsn in the commit record. Thats needed to support crash 
recovery because you need to know where to restart applying changes again 
from.

  Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in
  north america.
  #1 wants changes from #3 and #4 when talking to #3 but not those from #2
   and itself (because that would be cheaper to get locally).
 
 Right, but if the link between #1 and #2 ever slows down, changes
 from #3 and #4 may very well already have #2's changes, and even
 require them.
 
 #1 has to apply them, or is it going to stop applying LCR's from #3
 when it see's LCRs from #3 coming in that originate on #2 and have
 LSNs greater than what it's so far received from #2?
We will see ;). There are several solutions possible, possibly even depending 
on the use-case. This patch just want's to provide the very basic information 
required to implement such solutions...

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

Andres Freund and...@2ndquadrant.com writes:
 On Monday, June 18, 2012 11:51:27 PM Daniel Farina wrote:
 What's the cost of going a lot higher?  Because if one makes enough
 numerical space available, one can assign node identities without a
 coordinator, a massive decrease in complexity.

 It would increase the size of every wal record. We just have 16bit left there
 by chance...

Every WAL record?  Why in heck would you attach it to every record?
Surely putting it in WAL page headers would be sufficient.  We could
easily afford to burn a page switch (if not a whole segment switch)
when changing masters.

I'm against the idea of eating any spare space we have in WAL record
headers for this purpose, anyway; there are likely to be more pressing
needs in future.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Tuesday, June 19, 2012 12:47:54 AM Christopher Browne wrote:
 On Mon, Jun 18, 2012 at 11:50 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  Hi Simon,
  
  On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote:
  On 13 June 2012 19:28, Andres Freund and...@2ndquadrant.com wrote:
   This adds a new configuration parameter multimaster_node_id which
   determines the id used for wal originating in one cluster.
  
  Looks good and it seems this aspect at least is commitable in this CF.
  
  I think we need to agree on the parameter name. It currently is
  'multimaster_node_id'. In the discussion with Steve we got to
  replication_node_id. I don't particularly like either.
  
  Other suggestions?
 I wonder if it should be origin_node_id?  That is the term Slony uses.
I think in the slony context its clear that thats related to replication. Less 
so in a general postgres guc. So maybe replication_origin_node_id?

  * Size of field. 16 bits is enough for 32,000 master nodes, which is
  quite a lot. Do we need that many? I think we may have need for a few
  flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe
  8 bits. Even if we don't need them in this release, I'd like to have
  them. If they remain unused after a few releases, we may choose to
  redeploy some of them as additional nodeids in future. I don't foresee
  complaints that 256 master nodes is too few anytime soon, so we can
  defer that decision.
  
  I wished we had some flag bits available before as well. I find 256 nodes
  a pretty low value to start with though, 4096 sounds better though, so I
  would be happy with 4 flag bits. I think for cascading setups and such
  you want to add node ids for every node, not only masters...
 Even though the number of nodes that can reasonably participate in
 replication is likely to be not too terribly large, it might be good
 to allow larger values, in case someone is keen on encoding something
 descriptive in the node number.
Well, having a large number space makes the wal records bigger which is not 
something I can see us getting through with. People have gone through 
considerable length to avoid that.
Perhaps we can have a mapping system catalog at some point that includes 
additional infromation to each node id like a name and where its at wrt 
replication...

 I recall the Slony-II project having a notion of attaching a permanent
 UUID-based node ID to each node.  As long as there is somewhere decent
 to find a symbolically significant node name, I like the idea of the
 ID *not* being in a tiny range, and being UUID/OID-like...
I think adding 14 bytes (16bytes of an ooid - 2 bytes available) is out of 
question...

  Any opinions from others on this?
  
  * Do we want origin_id as a parameter or as a setting in pgcontrol?
  IIRC we go to a lot of trouble elsewhere to avoid problems with
  changing on/off parameter values. I think we need some discussion to
  validate where that should live.
  
  Hm. I don't really forsee any need to have it in pg_control. What do you
  want to protect against with that?
  It would need to be changeable anyway, because otherwise it would need to
  become a parameter for initdb which would suck for anybody migrating to
  use replication at some point.
  
  Do you want to protect against problems in replication setups after
  changing the value?
 In Slony, changing the node ID is Not Something That Is Done.  The ID
 is captured in *way* too many places to be able to have any hope of
 updating it in a coordinated way.  I should be surprised if it wasn't
 similarly troublesome here.
If you update you will need to reset consuming nodes, yes. Imo thats still 
something else than disallowing changing the parameter entirely. Requiring 
initdb for that seems like it would make experimentation too hard.

We need to allow at least changing the setting from no node id to an initial 
one.

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

Hi,

On Tuesday, June 19, 2012 08:03:04 AM Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On Monday, June 18, 2012 11:51:27 PM Daniel Farina wrote:
  What's the cost of going a lot higher?  Because if one makes enough
  numerical space available, one can assign node identities without a
  coordinator, a massive decrease in complexity.
  
  It would increase the size of every wal record. We just have 16bit left
  there by chance...
 
 Every WAL record?  Why in heck would you attach it to every record?
 Surely putting it in WAL page headers would be sufficient.  We could
 easily afford to burn a page switch (if not a whole segment switch)
 when changing masters.
The idea is that you can have cascading, circular and whatever replication 
topologies if you include the logical origin of a wal causing action into 
it.
That is, if you have nodes A(1) and B(2) and a insert happens on A the wal 
records generated by that will get an xl_origin_id = 1 and when it will be 
decoded and replayed on B it will *also* get the id 1. Only when a change 
originally is generated on Bit will get xl_origin_id = 2.
That way you can easily have circular or hierarchical replication topologies 
including diamonds.

 I'm against the idea of eating any spare space we have in WAL record
 headers for this purpose, anyway; there are likely to be more pressing
 needs in future.
Every other solution to allowing this seems to be far more complicated than 
this, thats why I arrived at the conclusion that its a good idea.

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Tuesday, June 19, 2012 04:12:47 AM Steve Singer wrote:
 On 12-06-18 07:30 AM, Andres Freund wrote:
  Hrmpf #666. I will go through through the series commit-by-commit again
  to make sure everything compiles again. Reordinging this late definitely
  wasn't a good idea...
  
  I pushed a rebased version with all those fixups (and removal of the
  zeroRecPtr patch).
 
 Where did you push that rebased version to? I don't see an attachment,
 or an updated patch in the commitfest app and your repo at
 http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summa
 ry hasn't been updated in 5 days.
To the 2ndquadrant internal repo. Which strangely doesn't help you. 
*Headdesk*. Pushed to the correct repo and manually verified.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

Andres Freund and...@2ndquadrant.com writes:
 On Tuesday, June 19, 2012 08:03:04 AM Tom Lane wrote:
 Every WAL record?  Why in heck would you attach it to every record?
 Surely putting it in WAL page headers would be sufficient.

 The idea is that you can have cascading, circular and whatever replication 
 topologies if you include the logical origin of a wal causing action into 
 it.
 That is, if you have nodes A(1) and B(2) and a insert happens on A the wal 
 records generated by that will get an xl_origin_id = 1 and when it will be 
 decoded and replayed on B it will *also* get the id 1. Only when a change 
 originally is generated on Bit will get xl_origin_id = 2.

None of this explains to me why each individual WAL record would need to
carry a separate copy of the indicator.  We don't have multiple masters
putting records into the same WAL file, nor does it seem to me that it
could possibly be workable to merge WAL streams.  (If you are thinking
of something sufficiently high-level that merging could possibly work,
then it's not WAL, and we shouldn't be trying to make the WAL
representation cater for it.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Tuesday, June 19, 2012 04:17:01 PM Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On Tuesday, June 19, 2012 08:03:04 AM Tom Lane wrote:
  Every WAL record?  Why in heck would you attach it to every record?
  Surely putting it in WAL page headers would be sufficient.
  
  The idea is that you can have cascading, circular and whatever
  replication topologies if you include the logical origin of a wal
  causing action into it.
  That is, if you have nodes A(1) and B(2) and a insert happens on A the
  wal records generated by that will get an xl_origin_id = 1 and when it
  will be decoded and replayed on B it will *also* get the id 1. Only when
  a change originally is generated on Bit will get xl_origin_id = 2.
 
 None of this explains to me why each individual WAL record would need to
 carry a separate copy of the indicator.  We don't have multiple masters
 putting records into the same WAL file, nor does it seem to me that it
 could possibly be workable to merge WAL streams.  (If you are thinking
 of something sufficiently high-level that merging could possibly work,
 then it's not WAL, and we shouldn't be trying to make the WAL
 representation cater for it.)
The idea is that if youre replaying changes on node A originating from node B 
you set the origin to *B* in the wal records that are generated during that. 
So when B, in a bidirectional setup, replays the changes that A has made it 
can simply ignore all changes which originated on itself.

That works rather simple  performant if you have a conflict avoidance scheme.

For many scenarios you need to be able to separate locally generated changes 
and changes that have been replayed from another node. Without support of 
something like this this is really hard to achieve.

Greetings,

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

Andres Freund and...@2ndquadrant.com writes:
 On Tuesday, June 19, 2012 04:17:01 PM Tom Lane wrote:
 ...  (If you are thinking
 of something sufficiently high-level that merging could possibly work,
 then it's not WAL, and we shouldn't be trying to make the WAL
 representation cater for it.)

 The idea is that if youre replaying changes on node A originating from node B
 you set the origin to *B* in the wal records that are generated during that. 
 So when B, in a bidirectional setup, replays the changes that A has made it 
 can simply ignore all changes which originated on itself.

This is most certainly not possible at the level of WAL.  As I said
above, we shouldn't be trying to shoehorn high level logical-replication
commands into WAL streams.  No good can come of confusing those concepts.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On Tuesday, June 19, 2012 04:17:01 PM Tom Lane wrote:
  ...  (If you are thinking
  of something sufficiently high-level that merging could possibly work,
  then it's not WAL, and we shouldn't be trying to make the WAL
  representation cater for it.)
  
  The idea is that if youre replaying changes on node A originating from
  node B you set the origin to *B* in the wal records that are generated
  during that. So when B, in a bidirectional setup, replays the changes
  that A has made it can simply ignore all changes which originated on
  itself.
 
 This is most certainly not possible at the level of WAL. 
Huh? This isn't used during normal crash-recovery replay. The information is 
used when decoding the wal into logical changes and applying those. Its just a 
common piece of information thats needed for a large number of records. 

Alternatively it could be added to all the records that need it, but that 
would smear the necessary logic - which is currently trivial - over more of 
the backend. And it would increase the actual size of wal which this one did 
not.

 As I said above, we shouldn't be trying to shoehorn high level logical-
 replication commands into WAL streams.  No good can come of confusing those
 concepts.
Its not doing anything high-level in there? All that patch does is embedding 
one single piece of information in previously unused space.

I can follow the argument that you do not want *any* logical information in 
the wal. But as I said in the patchset-introducing email: I don't really see 
an alternative. Otherwise we would just duplicate all the locking/scalability 
issues of xlog as well as the amount of writes.
This is, besides logging some more informations when wal_level = logical in 
some particular records (HEAP_UPDATE|DELETE and ensuring fpw's don't remove 
the record data in HEAP_(INSERT|UPDATE|DELETE) in patch 07/16 the only change 
that I really forsee being needed for doing the logical stuff.

Do you really see this as such a big problem?

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

Andres Freund and...@2ndquadrant.com writes:
 On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
 ...  (If you are thinking
 of something sufficiently high-level that merging could possibly work,
 then it's not WAL, and we shouldn't be trying to make the WAL
 representation cater for it.)

 Do you really see this as such a big problem?

It looks suspiciously like I have a hammer, therefore every problem
must be a nail.  I don't like the design concept of cramming logical
replication records into WAL in the first place.

However, if we're dead set on doing it that way, let us put information
that is only relevant to logical replication records into only the
logical replication records.  Saving a couple bytes in each such record
is penny-wise and pound-foolish, I'm afraid; especially when you're
nailing down hard, unexpansible limits at the very beginning of the
development process in order to save those bytes.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

Hi,

On Tuesday, June 19, 2012 06:11:20 PM Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
  ...  (If you are thinking
  of something sufficiently high-level that merging could possibly work,
  then it's not WAL, and we shouldn't be trying to make the WAL
  representation cater for it.)
  
  Do you really see this as such a big problem?
 It looks suspiciously like I have a hammer, therefore every problem
 must be a nail.  I don't like the design concept of cramming logical
 replication records into WAL in the first place.
There are - so far - no specific logical replication records. Its a 
relatively minor amount of additional data under wal_level=logical for 
existing records. HEAP_UPDATE gets the old primary key on updates changing the 
pkey and HEAP_DELETE always has the pkey. HEAP_INSERT|UPDATE|
DELETE,HEAP2_MULTI_INSERT put their information in another XLogRecData block 
than the page to handle full page writes. Thats it.

I can definitely understand hesitation about that, but I simply see no 
realistic way to solve the issues of existing replication solutions otherwise.
Do you have a better idea to solve those than the above? Without significant 
complications of the backend code and without loads of additional writes going 
on?
I *really* would like to hear them if you do.

 However, if we're dead set on doing it that way, let us put information
 that is only relevant to logical replication records into only the
 logical replication records. 
I found, and still do, the idea of having the origin_id in there rather 
elegant. If people prefer adding the same block to all of the above xlog 
records: I can live with that and will then do so. It makes some things more 
complicated, but its not too bad.

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Tue, Jun 19, 2012 at 12:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Andres Freund and...@2ndquadrant.com writes:
 On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
 ...  (If you are thinking
 of something sufficiently high-level that merging could possibly work,
 then it's not WAL, and we shouldn't be trying to make the WAL
 representation cater for it.)

 Do you really see this as such a big problem?

 It looks suspiciously like I have a hammer, therefore every problem
 must be a nail.  I don't like the design concept of cramming logical
 replication records into WAL in the first place.

Me, neither.  I think it's necessary to try to find a way of
generating logical replication records from WAL.  But once generated,
I think those records should form their own stream, independent of
WAL.  If you take the contrary position that they should be included
in WAL, then when you filter the WAL stream down to just the records
of interest to logical replication, you end up with a WAL stream with
holes in it, which is one of the things that Andres listed as an
unresolved design problem in his original email.

Moreover, this isn't necessary at all for single-master replication,
or even multi-source replication where each table has a single master.
 It's only necessary for full multi-master replication, which we have
no consensus to include in core, and even if we did have a consensus
to include it in core, it certainly shouldn't be the first feature we
design.

 However, if we're dead set on doing it that way, let us put information
 that is only relevant to logical replication records into only the
 logical replication records.

Right.  If we decide we need this, and if we did decide to conflate
the WAL stream, both of which I disagree with as noted above, then we
still don't need it on every record.  It would probably be sufficient
for local transactions to do nothing at all (and we can implicitly
assume that they have master node ID = local node ID) and transactions
which are replaying remote changes to emit one record per XID per
checkpoint cycle containing the remote node ID.

 Saving a couple bytes in each such record
 is penny-wise and pound-foolish, I'm afraid; especially when you're
 nailing down hard, unexpansible limits at the very beginning of the
 development process in order to save those bytes.

I completely agree.  I think that, as Dan said upthread, having a 64
or 128 bit ID so that it can be generated automatically rather than
configured by an administrator who must be careful not to duplicate
node IDs in any pair of systems that could ever end up talking to each
other would be a vast usability improvement.  Perhaps systems A, B,
and C are replicating to each other today, as are systems D and E.
But now suppose that someone decides they want to replicate one table
between A and D.  Suddenly the node IDs have to be distinct where they
didn't before, and now there's potentially a problem to hassle with
that wouldn't have been an issue if the node IDs had been wide enough
to begin with.  It is not unusual for people to decide after-the-fact
to begin replicating between machines where this wasn't originally
anticipated and which may even be even be under different
administrative control.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Tuesday, June 19, 2012 07:24:13 PM Robert Haas wrote:
 On Tue, Jun 19, 2012 at 12:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Andres Freund and...@2ndquadrant.com writes:
  On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
  ...  (If you are thinking
  of something sufficiently high-level that merging could possibly work,
  then it's not WAL, and we shouldn't be trying to make the WAL
  representation cater for it.)
  
  Do you really see this as such a big problem?
  
  It looks suspiciously like I have a hammer, therefore every problem
  must be a nail.  I don't like the design concept of cramming logical
  replication records into WAL in the first place.
 
 Me, neither.  I think it's necessary to try to find a way of
 generating logical replication records from WAL.  But once generated,
 I think those records should form their own stream, independent of
 WAL.  If you take the contrary position that they should be included
 in WAL, then when you filter the WAL stream down to just the records
 of interest to logical replication, you end up with a WAL stream with
 holes in it, which is one of the things that Andres listed as an
 unresolved design problem in his original email.
Yes.

 Moreover, this isn't necessary at all for single-master replication,
 or even multi-source replication where each table has a single master.
  It's only necessary for full multi-master replication, which we have
 no consensus to include in core, and even if we did have a consensus
 to include it in core, it certainly shouldn't be the first feature we
 design.
Well, you can't blame a patch/prototype trying to implement what it claims to 
implement ;)

More seriously: Even if we don't put MM in core I think putting the basis for 
it in core so that somebody can build such a solution reusing the existing 
infrastructure is a sensible idea. Imo the only thing that requires explicit 
support which is hard to add outside of core is prevention of loops (aka this 
patch). Everything else should be doable reusing the hopefully modular pieces.

  However, if we're dead set on doing it that way, let us put information
  that is only relevant to logical replication records into only the
  logical replication records.
 Right.  If we decide we need this, and if we did decide to conflate
 the WAL stream, both of which I disagree with as noted above, then we
 still don't need it on every record.  It would probably be sufficient
 for local transactions to do nothing at all (and we can implicitly
 assume that they have master node ID = local node ID) and transactions
 which are replaying remote changes to emit one record per XID per
 checkpoint cycle containing the remote node ID.
Youve gone from a pretty trivial 150 line patch without any runtime/space 
overhead to something *considerably* more complex in that case though. 

You suddently need to have relatively complex logic to remember which 
information you got for a certain xid (and forget that information afterwards) 
and whether you already logged that xid and you need to have to decide about 
logging that information at multiple places.

Btw, what do you mean with conflating the stream? I don't really see that 
being proposed.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Kevin Grittner

Andres Freund and...@2ndquadrant.com wrote:
 Robert Haas wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
 
 However, if we're dead set on doing it that way, let us put
 information that is only relevant to logical replication records
 into only the logical replication records.
 Right.  If we decide we need this, and if we did decide to
 conflate the WAL stream, both of which I disagree with as noted
 above, then we still don't need it on every record.  It would
 probably be sufficient for local transactions to do nothing at
 all (and we can implicitly assume that they have master node ID =
 local node ID) and transactions which are replaying remote
 changes to emit one record per XID per checkpoint cycle
 containing the remote node ID.
 Youve gone from a pretty trivial 150 line patch without any
 runtime/space overhead to something *considerably* more complex in
 that case though. 
 
I think it might be worth it.  I've done a lot of MM replication,
and so far have not had to use a topology which allowed loops.  Not
only would you be reserving space in the WAL stream which was not
useful for those not using MM replication, you would be reserving it
when even many MM configurations would not need it.  Now you could
argue that the 16 bits you want to use are already there and are not
yet used for anything; but there are two counter-arguments to that:
you lose the opportunity to use them for something else, and you
might want more than 16 bits.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Marko Kreen

On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 13 June 2012 19:28, Andres Freund and...@2ndquadrant.com wrote:
 This adds a new configuration parameter multimaster_node_id which determines
 the id used for wal originating in one cluster.

 Looks good and it seems this aspect at least is commitable in this CF.

 Design decisions I think we need to review are

 * Naming of field. I think origin is the right term, borrowing from Slony.

I have not read too deeply here, so maybe I am missing
some important detail here, but idea that users need
to coordinate a integer config parameter globally does not
sound too attractive to me.

Why not limit integers to local storage only and map
them to string idents on config, UI and transport?

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 04:58, Marko Kreen mark...@gmail.com wrote:
 On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 13 June 2012 19:28, Andres Freund and...@2ndquadrant.com wrote:
 This adds a new configuration parameter multimaster_node_id which determines
 the id used for wal originating in one cluster.

 Looks good and it seems this aspect at least is commitable in this CF.

 Design decisions I think we need to review are

 * Naming of field. I think origin is the right term, borrowing from Slony.

 I have not read too deeply here, so maybe I am missing
 some important detail here, but idea that users need
 to coordinate a integer config parameter globally does not
 sound too attractive to me.

 Why not limit integers to local storage only and map
 them to string idents on config, UI and transport?

Yes, that can be done. This is simply how the numbers are used
internally, similar to oids.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Tue, Jun 19, 2012 at 3:18 PM, Andres Freund and...@2ndquadrant.com wrote:
 More seriously: Even if we don't put MM in core I think putting the basis for
 it in core so that somebody can build such a solution reusing the existing
 infrastructure is a sensible idea. Imo the only thing that requires explicit
 support which is hard to add outside of core is prevention of loops (aka this
 patch). Everything else should be doable reusing the hopefully modular pieces.

I don't think prevention of loops is hard to do outside of core
either, unless you insist on tying logical replication so closely to
WAL that anyone doing MMR is necessarily getting the change stream
from WAL.  In fact, I'd go so far as to say that the ONLY part of this
that's hard to do outside of core is change extraction.  Even
low-level apply can be implemented as a loadable module.

 Right.  If we decide we need this, and if we did decide to conflate
 the WAL stream, both of which I disagree with as noted above, then we
 still don't need it on every record.  It would probably be sufficient
 for local transactions to do nothing at all (and we can implicitly
 assume that they have master node ID = local node ID) and transactions
 which are replaying remote changes to emit one record per XID per
 checkpoint cycle containing the remote node ID.
 Youve gone from a pretty trivial 150 line patch without any runtime/space
 overhead to something *considerably* more complex in that case though.

 You suddently need to have relatively complex logic to remember which
 information you got for a certain xid (and forget that information afterwards)
 and whether you already logged that xid and you need to have to decide about
 logging that information at multiple places.

You need a backend-local hash table inside the wal reader process, and
that hash table needs to map XIDs to node IDs.  And you occasionally
need to prune it, so that it doesn't eat too much memory.  None of
that sounds very hard.

 Btw, what do you mean with conflating the stream? I don't really see that
 being proposed.

It seems to me that you are intent on using the WAL stream as the
logical change stream.  I think that's a bad design.  Instead, you
should extract changes from WAL and then ship them around in a format
that is specific to logical replication.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 19 June 2012 14:03, Tom Lane t...@sss.pgh.pa.us wrote:

 Every WAL record?  Why in heck would you attach it to every record?
 Surely putting it in WAL page headers would be sufficient.  We could
 easily afford to burn a page switch (if not a whole segment switch)
 when changing masters.

This does appear to be a reasonable idea at first glance, since it
seems that each node has just a single node id, but that is not the
case.

As we pass changes around we maintain the same origin id for a change,
so there is a mix of origin node ids at the WAL record level, not the
page level. The concept of originating node id is essentially same as
that used in Slony.

 I'm against the idea of eating any spare space we have in WAL record
 headers for this purpose, anyway; there are likely to be more pressing
 needs in future.

Not sure what those pressing needs are, but I can't see any. What we
are doing here is fairly important, just not as important as crash
recovery. But then that has worked pretty much unchanged for some time
now.

I raised the possibility of having variable length headers, but there
is no requirement for that yet.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Christopher Browne

On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas robertmh...@gmail.com wrote:
 Btw, what do you mean with conflating the stream? I don't really see that
 being proposed.

 It seems to me that you are intent on using the WAL stream as the
 logical change stream.  I think that's a bad design.  Instead, you
 should extract changes from WAL and then ship them around in a format
 that is specific to logical replication.

Yeah, that seems worth elaborating on.

What has been said several times is that it's pretty necessary to
capture the logical changes into WAL.  That seems pretty needful, in
order that the replication data gets fsync()ed avidly, and so that we
don't add in the race condition of needing to fsync() something *else*
almost exactly as avidly as is the case for WAL today..

But it's undesirable to pull *all* the bulk of contents of WAL around
if it's only part of the data that is going to get applied.  On a
physical streaming replica, any logical data that gets captured will
be useless.  And on a logical replica, they physical bits of WAL
will be useless.

What I *want* you to mean is that there would be:
a) WAL readers that pull the physical bits, and
b) WAL readers that just pull logical bits.

I expect it would be fine to have a tool that pulls LCRs out of WAL to
prepare that to be sent to remote locations.  Is that what you have in
mind?  Or are you feeling that the logical bits shouldn't get
captured in WAL altogether, so we need to fsync() them into a
different stream of files?
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 05:46, Robert Haas robertmh...@gmail.com wrote:

 It seems to me that you are intent on using the WAL stream as the
 logical change stream.  I think that's a bad design.  Instead, you
 should extract changes from WAL and then ship them around in a format
 that is specific to logical replication.

The proposal is to read the WAL in order to generate LCRs. The
information needs to be in the original data in order to allow it to
be passed onwards.

In a multi-master config there will be WAL records with many origin
node ids at any time, so this information must be held at WAL record
level not page level.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Tuesday, June 19, 2012 10:58:44 PM Marko Kreen wrote:
 On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs si...@2ndquadrant.com wrote:
  On 13 June 2012 19:28, Andres Freund and...@2ndquadrant.com wrote:
  This adds a new configuration parameter multimaster_node_id which
  determines the id used for wal originating in one cluster.
  
  Looks good and it seems this aspect at least is commitable in this CF.
  
  Design decisions I think we need to review are
  
  * Naming of field. I think origin is the right term, borrowing from
  Slony.
 
 I have not read too deeply here, so maybe I am missing
 some important detail here, but idea that users need
 to coordinate a integer config parameter globally does not
 sound too attractive to me.
 
 Why not limit integers to local storage only and map
 them to string idents on config, UI and transport?
That should be possible. I don't want to go there till we have the base stuff 
in but it shouldn't be too hard.

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 04:31, Kevin Grittner kevin.gritt...@wicourts.gov wrote:

 I've done a lot of MM replication,
 and so far have not had to use a topology which allowed loops.

The proposal is to use WAL to generate the logical change stream. That
has been shown in testing to be around x4 faster than having a
separate change stream, which must also be WAL logged (as Jan noted).

If we use WAL in this way, multi-master implies that the data will
*always* be in a loop. So in any configuration we must be able to tell
difference between changes made by one node and another.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 05:59, Christopher Browne cbbro...@gmail.com wrote:

 But it's undesirable to pull *all* the bulk of contents of WAL around
 if it's only part of the data that is going to get applied.  On a
 physical streaming replica, any logical data that gets captured will
 be useless.  And on a logical replica, they physical bits of WAL
 will be useless.

Completely agree.

What we're discussing here is the basic information content of a WAL
record to enable the construction of suitable LCRs.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On 20 June 2012 00:11, Tom Lane t...@sss.pgh.pa.us wrote:
 Andres Freund and...@2ndquadrant.com writes:
 On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
 ...  (If you are thinking
 of something sufficiently high-level that merging could possibly work,
 then it's not WAL, and we shouldn't be trying to make the WAL
 representation cater for it.)

 Do you really see this as such a big problem?

 It looks suspiciously like I have a hammer, therefore every problem
 must be a nail.  I don't like the design concept of cramming logical
 replication records into WAL in the first place.

The evidence from prototypes shown at PgCon was that using the WAL in
this way was very efficient and that this level of efficiency is
necessary to make it work in a practical manner. Postgres would not be
the first database to follow this broad design.

 However, if we're dead set on doing it that way, let us put information
 that is only relevant to logical replication records into only the
 logical replication records.

Agreed. In this case, the origin node id information needs to be
available on the WAL record so we can generate the logical changes
(LCRs).

 Saving a couple bytes in each such record
 is penny-wise and pound-foolish, I'm afraid; especially when you're
 nailing down hard, unexpansible limits at the very beginning of the
 development process in order to save those bytes.

Restricting the number of node ids is being done so that there is no
impact on anybody not using this feature.

In later implementations, it should be possible to support greater
numbers of node ids by having a variable length header. But that is an
unnecessary complication for a first release.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Tuesday, June 19, 2012 11:46:56 PM Robert Haas wrote:
 On Tue, Jun 19, 2012 at 3:18 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  More seriously: Even if we don't put MM in core I think putting the basis
  for it in core so that somebody can build such a solution reusing the
  existing infrastructure is a sensible idea. Imo the only thing that
  requires explicit support which is hard to add outside of core is
  prevention of loops (aka this patch). Everything else should be doable
  reusing the hopefully modular pieces.
 I don't think prevention of loops is hard to do outside of core
 either, unless you insist on tying logical replication so closely to
 WAL that anyone doing MMR is necessarily getting the change stream
 from WAL.  In fact, I'd go so far as to say that the ONLY part of this
 that's hard to do outside of core is change extraction.  Even
 low-level apply can be implemented as a loadable module.
I definitely agree that low-level apply is possible as a module. Sure change 
extraction needs core support but I was talking about what you need to 
implement it reusing the plain logical support...

What I do not understand is how you want to prevent loops in a simple manner 
without in core support:

A generates a HEAP_INSERT record. Gets decoded into the lcr stream as a INSERT 
action.
B reads the lcr stream from A and applies the changes. A new HEAP_INSERT 
record. Gets decoded into the lcr stream as a INSERT action.
A reads the lcr stream from B and ???

At this point you need to prevent a loop. If you have the information where a 
change originally happened (xl_origin_id = A in this case) you can have the 
simple filter on A which ignores change records if lcr_origin_id == 
local_replication_origin_id).


  Right.  If we decide we need this, and if we did decide to conflate
  the WAL stream, both of which I disagree with as noted above, then we
  still don't need it on every record.  It would probably be sufficient
  for local transactions to do nothing at all (and we can implicitly
  assume that they have master node ID = local node ID) and transactions
  which are replaying remote changes to emit one record per XID per
  checkpoint cycle containing the remote node ID.
  
  Youve gone from a pretty trivial 150 line patch without any runtime/space
  overhead to something *considerably* more complex in that case though.
  
  You suddently need to have relatively complex logic to remember which
  information you got for a certain xid (and forget that information
  afterwards) and whether you already logged that xid and you need to have
  to decide about logging that information at multiple places.
 You need a backend-local hash table inside the wal reader process, and
 that hash table needs to map XIDs to node IDs.  And you occasionally
 need to prune it, so that it doesn't eat too much memory.  None of
 that sounds very hard.
Its not very hard. Its just more complex than what I propose(d).

  Btw, what do you mean with conflating the stream? I don't really see
  that being proposed.
 It seems to me that you are intent on using the WAL stream as the
 logical change stream.  I think that's a bad design.  Instead, you
 should extract changes from WAL and then ship them around in a format
 that is specific to logical replication.
No, I don't want that. I think we will need some different format once we have 
agreed how changeset extraction works.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Kevin Grittner

Simon Riggs si...@2ndquadrant.com wrote:
 
 The proposal is to use WAL to generate the logical change stream.
 That has been shown in testing to be around x4 faster than having
 a separate change stream, which must also be WAL logged (as Jan
 noted).
 
Sure, that's why I want it.
 
 If we use WAL in this way, multi-master implies that the data will
 *always* be in a loop. So in any configuration we must be able to
 tell difference between changes made by one node and another.
 
Only if you assume that multi-master means identical databases all
replicating the same data to all the others.  If I have 72 master
replicating non-conflicting data to one consolidated database, I
consider that to be multi-master, too.  Especially if I have other
types of databases replicating disjoint data to the same
consolidated database, and the 72 sources have several databases
replicating disjoint sets of data to them.  We have about 1000
replications paths, none of which create a loop which can send data
back to the originator or cause conflicts.
 
Of course, none of these databases have the same OID for any given
object, and there are numerous different schemas among the
replicating databases, so I need to get to table and column names
before the data is of any use to me.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Wednesday, June 20, 2012 12:15:03 AM Kevin Grittner wrote:
 Simon Riggs si...@2ndquadrant.com wrote:
  If we use WAL in this way, multi-master implies that the data will
  *always* be in a loop. So in any configuration we must be able to
  tell difference between changes made by one node and another.
 
 Only if you assume that multi-master means identical databases all
 replicating the same data to all the others.  If I have 72 master
 replicating non-conflicting data to one consolidated database, I
 consider that to be multi-master, too.
 ...
 Of course, none of these databases have the same OID for any given
 object, and there are numerous different schemas among the
 replicating databases, so I need to get to table and column names
 before the data is of any use to me.
Yes, thats definitely a valid use-case. But that doesn't preclude the other - 
also not uncommon - use-case where you want to have different master which all 
contain up2date data.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Kevin Grittner

Andres Freund and...@2ndquadrant.com wrote:
 
 Yes, thats definitely a valid use-case. But that doesn't preclude
 the other - also not uncommon - use-case where you want to have
 different master which all contain up2date data.
 
I agree.  I was just saying that while one requires an origin_id,
the other doesn't.  And those not doing MM replication definitely
don't need it.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne cbbro...@gmail.com wrote:
 On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas robertmh...@gmail.com wrote:
 Btw, what do you mean with conflating the stream? I don't really see that
 being proposed.

 It seems to me that you are intent on using the WAL stream as the
 logical change stream.  I think that's a bad design.  Instead, you
 should extract changes from WAL and then ship them around in a format
 that is specific to logical replication.

 Yeah, that seems worth elaborating on.

 What has been said several times is that it's pretty necessary to
 capture the logical changes into WAL.  That seems pretty needful, in
 order that the replication data gets fsync()ed avidly, and so that we
 don't add in the race condition of needing to fsync() something *else*
 almost exactly as avidly as is the case for WAL today..

Check.

 But it's undesirable to pull *all* the bulk of contents of WAL around
 if it's only part of the data that is going to get applied.  On a
 physical streaming replica, any logical data that gets captured will
 be useless.  And on a logical replica, they physical bits of WAL
 will be useless.

 What I *want* you to mean is that there would be:
 a) WAL readers that pull the physical bits, and
 b) WAL readers that just pull logical bits.

 I expect it would be fine to have a tool that pulls LCRs out of WAL to
 prepare that to be sent to remote locations.  Is that what you have in
 mind?

Yes.  I think it should be possible to generate LCRs from WAL, but I
think that the on-the-wire format for LCRs should be different from
the WAL format.  Trying to use the same format for both things seems
like an unpleasant straightjacket.  This discussion illustrates why:
we're talking about consuming scarce bit-space in WAL records for a
feature that only a tiny minority of users will use, and it's still
not really enough bit space.  That stinks.  If LCR transmission is a
separate protocol, this problem can be engineered away at a higher
level.

Suppose we have three servers, A, B, and C, that are doing
multi-master replication in a loop.  A sends LCRs to B, B sends them
to C, and C sends them back to A.  Obviously, we need to make sure
that each server applies each set of changes just once, but it
suffices to have enough information in WAL to distinguish between
replication transactions and non-replication transactions - that is,
one bit.  So suppose a change is made on server A.  A generates LCRs
from WAL, and tags each LCR with node_id = A.  It then sends those
LCRs to B.  B applies them, flagging the apply transaction in WAL as a
replication transaction, AND ALSO sends the LCRs to C.  The LCR
generator on B sees the WAL from apply, but because it's flagged as a
replication transaction, it does not generate LCRs.  So C receives
LCRs from B just once, without any need for the node_id to to be known
in WAL.  C can now also apply those LCRs (again flagging the apply
transaction as replication) and it can also skip sending them to A,
because it seems that they originated at A.

Now suppose we have a more complex topology.  Suppose we have a
cluster of four servers A .. D which, for improved tolerance against
network outages, are all connected pairwise.  Normally all the links
are up, so each server sends all the LCRs it generates directly to all
other servers.  But how do we prevent cycles?  A generates a change
and sends it to B, C, and D.  B then sees that the change came from A
so it sends it to C and D.  C, receiving that change, sees that came
from A via B, so it sends it to D again, whereupon D, which got it
from C and knows that the origin is A, sends it to B, who will then
send it right back over to D.  Obviously, we've got an infinite loop
here, so this topology will not work.  However, there are several
obvious ways to patch it by changing the LCR protocol.  Most
obviously, and somewhat stupidly, you could add a TTL.  A bit smarter,
you could have each LCR carry a LIST of node_ids that it had already
visited, refusing to send it to any node it had already been to it,
instead of a single node_id.  Smarter still, you could send
handshaking messages around the cluster so that each node can build up
a spanning tree and prefix each LCR it sends with the list of
additional nodes to which the recipient must deliver it.  So,
normally, A would send a message to each of B, C, and D destined only
for that node; but if the A-C link went down, A would choose either B
or D and send each LCR to that node destined for that node *and C*;
then, A would forward the message.  Or perhaps you think this is too
complex and not worth supporting anyway, and that might be true, but
the point is that if you insist that all of the identifying
information must be carried in WAL, you've pretty much ruled it out,
because we are not going to put TTL fields, or lists of node IDs, or
lists of destinations, in WAL.  But there is no reason they can't be
attached to LCRs, which is

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

On Tue, Jun 19, 2012 at 6:14 PM, Andres Freund and...@2ndquadrant.com wrote:
 I definitely agree that low-level apply is possible as a module. Sure change
 extraction needs core support but I was talking about what you need to
 implement it reusing the plain logical support...

 What I do not understand is how you want to prevent loops in a simple manner
 without in core support:

 A generates a HEAP_INSERT record. Gets decoded into the lcr stream as a INSERT
 action.
 B reads the lcr stream from A and applies the changes. A new HEAP_INSERT
 record. Gets decoded into the lcr stream as a INSERT action.
 A reads the lcr stream from B and ???

 At this point you need to prevent a loop. If you have the information where a
 change originally happened (xl_origin_id = A in this case) you can have the
 simple filter on A which ignores change records if lcr_origin_id ==
 local_replication_origin_id).

See my email to Chris Browne, which I think covers this.  It needs a
bit in WAL (per txn, or, heck, if it's one bit, maybe per record) but
not a whole node ID.

 You need a backend-local hash table inside the wal reader process, and
 that hash table needs to map XIDs to node IDs.  And you occasionally
 need to prune it, so that it doesn't eat too much memory.  None of
 that sounds very hard.
 Its not very hard. Its just more complex than what I propose(d).

True, but not a whole lot more complex, and a moderate amount of
complexity to save bit-space is a good trade.  Especially when Tom has
come down against eating up the bit space.  And I agree with him.  If
we've only got 16 bits of padding to work with, we surely be judicious
in burning them when it can be avoided for the expense of a few
hundred lines of code.

  Btw, what do you mean with conflating the stream? I don't really see
  that being proposed.
 It seems to me that you are intent on using the WAL stream as the
 logical change stream.  I think that's a bad design.  Instead, you
 should extract changes from WAL and then ship them around in a format
 that is specific to logical replication.
 No, I don't want that. I think we will need some different format once we have
 agreed how changeset extraction works.

I think you are saying that you agree with me that the formats should
be different, but that the LCR format is undecided as yet.  If that is
in fact what you are saying, great.  We'll need to decide that, of
course, but I think there is a lot of cool stuff that can be done that
way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node