Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-20 Thread Andres Freund
Hi Robert, Hi All!

On Wednesday, June 20, 2012 03:08:48 AM Robert Haas wrote:
 On Tue, Jun 19, 2012 at 2:23 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  Well, the words are fuzzy, but I would define logical replication to
  be something which is independent of the binary format in which stuff
  gets stored on disk.  If it's not independent of the disk format, then
  you can't do heterogenous replication (between versions, or between
  products).  That precise limitation is the main thing that drives
  people to use anything other than SR in the first place, IME.
  
  Not in mine. The main limitation I see is that you cannot write anything
  on the standby. Which sucks majorly for many things. It's pretty much
  impossible to fix that for SR outside of very limited cases.
  While many scenarios don't need multimaster, *many* need to write outside
  of the standby's replication set.
 Well, that's certainly a common problem, even if it's not IME the most
 common, but I don't think we need to argue about which one is more
 common, because I'm not arguing against it.  The point, though, is
 that if the logical format is independent of the on-disk format, the
 things we can do are a strict superset of the things we can do if it
 isn't.  I don't want to insist that catalogs be the same (or else you
 get garbage when you decode tuples).  I want to tolerate the fact that
 they may very well be different.  That will in no way preclude writing
 outside the standby's replication set, nor will it prevent
 multi-master replication.  It will, however, enable heterogenous
 replication, which is a very important use case.  It will also mean
 that innocent mistakes (like somehow ending up with a column that is
 text on one server and numeric on another server) produce
 comprehensible error messages, rather than garbage.
I agree with most of that. I think that some parts of the above need to be 
optional because you lose too much for other scenarios otherwise.
I *definitely* want to build the *infrastructure* which makes it easy to 
implement all of the above, but I find it a bit much to require that from the 
get-go. It's important that everything is reusable for that, yes. Does a 
patchset that wants to implement tightly coupled multimaster need to implement 
everything for that? No.
If we raise the barrier for anything around this topic so high, we will *NEVER* 
get anywhere. It's a huge topic with loads of people wanting loads of different 
things. And that will hurt people wanting some feature which matches 90% of 
the proposed goals *far* more.

  It's not only the logging side which is a limitation in today's replication
  scenarios. The apply side scales even worse because it's *very* hard to
  distribute it between multiple backends.

 I don't think that making LCR format = on-disk format is going to
 solve that problem.  To solve that problem, we need to track
 dependencies between transactions, so that if tuple A is modified by
 T1 and T2, in that order, we apply T1 before T2.  But if T3 - which
 committed after both T1 and T2 - touches none of the same data as T1
 or T2 - then we can apply it in parallel, so long as we don't commit
 until T1 and T2 have committed (because allowing T3 to commit early
 would produce a serialization anomaly from the point of view of a
 concurrent reader).
Well, doing apply at such a low level, without re-encoding the data, increased 
throughput nearly threefold even for trivial types. So it pushes off the point 
where we need to do the above quite a bit.
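Robert's ordering rule (apply conflicting transactions in commit order, parallelize disjoint ones, but still commit in order) can be sketched as a toy model. Everything below is invented for illustration, not code from the patch:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TUPLES 8

/* Toy model of a decoded transaction: the set of tuple ids it modifies. */
typedef struct Txn {
    int tuples[MAX_TUPLES];
    int ntuples;
} Txn;

/* True if the two transactions modify any tuple in common. */
static bool txns_conflict(const Txn *a, const Txn *b)
{
    for (int i = 0; i < a->ntuples; i++)
        for (int j = 0; j < b->ntuples; j++)
            if (a->tuples[i] == b->tuples[j])
                return true;
    return false;
}

/* A transaction may start applying in parallel iff it conflicts with none
 * of the earlier, not-yet-committed transactions.  It must still commit
 * after them, or a concurrent reader could see a serialization anomaly. */
static bool can_apply_in_parallel(const Txn *txn,
                                  const Txn *earlier, int nearlier)
{
    for (int i = 0; i < nearlier; i++)
        if (txns_conflict(txn, &earlier[i]))
            return false;
    return true;
}
```

With T1 and T2 both touching tuple A and T3 touching only B, T3 passes the check and may be applied concurrently, while T2 has to wait for T1.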

  Because the routines that decode tuples don't include enough sanity
  checks to prevent running off the end of the block, or even the end of
  memory completely.  Consider a corrupt TOAST pointer that indicates
  that there is a gigabyte of data stored in an 8kB block.  One of the
  common symptoms of corruption IME is TOAST requests for -3 bytes of
  memory.
  Yes, but we need to put safeguards against that sort of thing anyway. So
  sure, we can have bugs but this is not a fundamental limitation.
 There's a reason we haven't done that already, though: it's probably
 going to stink for performance.  If it turns out that it doesn't stink
 for performance, great.  But if it causes a 5% slowdown on common use
 cases, I suspect we're not gonna do it, and I bet I can construct a
 case where it's worse than that (think: 400 column table with lots of
 varlenas, sorting by column 400 to return column 399).  I think it's
 treading on dangerous ground to assume we're going to be able to just
 go fix this.
I am talking about ensuring that the catalog is the same on the decoding site, 
not about making all decoding totally safe in the face of corrupted 
information.
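The kind of safeguard under discussion might, very roughly, look like the following; the names and the check are invented for this sketch and are not PostgreSQL's actual macros:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192   /* default PostgreSQL block size */

/* Reject a claimed datum length that cannot possibly be right before
 * dereferencing it: negative lengths (the "-3 bytes" symptom) or
 * lengths running past the space left in the buffer. */
static bool datum_length_plausible(int32_t claimed_len,
                                   size_t bytes_left_in_block)
{
    if (claimed_len < 0)
        return false;
    if ((size_t) claimed_len > bytes_left_in_block)
        return false;
    return true;
}
```

A corrupt TOAST pointer claiming a gigabyte of data inside an 8kB block fails the second test; the cost Robert worries about is paying checks like this on every attribute of every tuple.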

  PostGIS uses one information table in a few more complex functions but
  not in anything low-level. Evidenced by the fact that it was totally
  normal for that to go out of sync before 2.0.
  
  But even if such a thing would be needed, it wouldn't be problematic to
  make extension 

Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-19 Thread Andres Freund
Hi,

The most important part, even for people not following my discussion with 
Robert, is at the bottom, where the possible WAL decoding strategies are laid 
out.

On Tuesday, June 19, 2012 03:20:58 AM Robert Haas wrote:
 On Sat, Jun 16, 2012 at 7:43 AM, Andres Freund and...@2ndquadrant.com 
wrote:
   Hm. Yes, you could do that. But I have to say I don't really see a
   point. Maybe the fact that I do envision multimaster systems at some
   point is clouding my judgement though, as it's far less easy in that
   case.
  
  Why?  I don't think that particularly changes anything.
  
  Because it makes conflict detection very hard. I also don't think it's a
  feature worth supporting. What's the use-case of updating records you
  cannot properly identify?
 Don't ask me; I just work here.  I think it's something that some
 people want, though.  I mean, if you don't support replicating a table
 without a primary key, then you can't even run pgbench in a
 replication environment.
Well, I have no problem with INSERT-only tables not having a PK. And 
pgbench_history is the only pgbench table that doesn't have a pkey? And that's 
only truncated...

 Is that an important workload?  Well,
 objectively, no.  But I guarantee you that other people with more
 realistic workloads than that will complain if we don't have it.
 Absolutely required on day one?  Probably not.  Completely useless
 appendage that no one wants?  Not that, either.
Maybe. And I don't really care, so if others see that as important I am happy 
to appease them ;). 

  In my view, a logical replication solution is precisely one in which
  the catalogs don't need to be in sync.  If the catalogs have to be in
  sync, it's not logical replication.  ISTM that what you're talking
  about is sort of a hybrid between physical replication (pages) and
  logical replication (tuples) - you want to ship around raw binary
  tuple data, but not entire pages.
  Ok, that's a valid point. Simon argued at the cluster summit that
  everything that's not physical is logical. Which has some appeal because
  it seems hard to agree what exactly logical rep is. So definition by
  exclusion makes kind of sense ;)
 Well, the words are fuzzy, but I would define logical replication to
 be something which is independent of the binary format in which stuff
 gets stored on disk.  If it's not independent of the disk format, then
 you can't do heterogenous replication (between versions, or between
 products).  That precise limitation is the main thing that drives
 people to use anything other than SR in the first place, IME.
Not in mine. The main limitation I see is that you cannot write anything on 
the standby. Which sucks majorly for many things. It's pretty much impossible 
to fix that for SR outside of very limited cases.
While many scenarios don't need multimaster, *many* need to write outside of 
the standby's replication set.

  I think what you categorized as hybrid logical/physical rep solves an
  important use-case that's very hard to solve at the moment. Before my
  2ndquadrant days I had several clients who had huge problems using the
  trigger-based solutions because their overhead simply was too big a
  burden on the master. They couldn't use SR either because every
  consuming database kept loads of local data.
  I think such scenarios are getting more and more common.

 I think this is to some extent true, but I also think you're
 conflating two different things.  Change extraction via triggers
 introduces overhead that can be eliminated by reconstructing tuples
 from WAL in the background rather than forcing them to be inserted
 into a shadow table (and re-WAL-logged!) in the foreground.  I will
 grant that shipping the tuple as a binary blob rather than as text
 eliminates additional overhead on both ends, but it also closes off a
 lot of important use cases.  As I noted in my previous email, I think
 that ought to be a performance optimization that we do, if at all,
 when it's been proven safe, not a baked-in part of the design.  Even a
 solution that decodes WAL to text tuples and ships those around and
 reinserts them via pure SQL should be significantly faster than the
 replication solutions we have today; if it isn't, something's wrong.
It's not only the logging side which is a limitation in today's replication 
scenarios. The apply side scales even worse because it's *very* hard to 
distribute it between multiple backends.

  The problem with that is it's going to be tough to make robust.  Users
  could easily end up with answers that are total nonsense, or probably
  even crash the server.
  Why?
 Because the routines that decode tuples don't include enough sanity
 checks to prevent running off the end of the block, or even the end of
 memory completely.  Consider a corrupt TOAST pointer that indicates
 that there is a gigabyte of data stored in an 8kB block.  One of the
 common symptoms of corruption IME is TOAST requests for -3 bytes of
 memory.
Yes, but we need to put 

Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-19 Thread Kevin Grittner
Andres Freund and...@2ndquadrant.com wrote:
 
 The problem is just that to support basically arbitrary decoding
 requirements you need to provide at least those pieces of
 information in a transactionally consistent manner:
 * the data
 * table names
 * column names
 * type information
 * replication configuration
 
I'm not sure that the last one needs to be in scope for the WAL
stream, but the others I definitely agree eventually need to be
available to a logical transaction stream consumer.  You lay out the
alternative ways to get all of this pretty clearly, and I don't know
what the best answer is; it seems likely that there is not one best
answer.  In the long run, more than one of those options might need
to be supported, to support different environments.
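Schematically, a change record carrying Kevin's list (minus the replication configuration) might look like this; all names are invented for illustration and do not come from the patch:

```c
#include <string.h>

#define MAX_ATTS 4

/* One decoded change, bundling the tuple data with the transactionally
 * consistent metadata a consumer needs to interpret it. */
typedef struct LogicalChangeRecord {
    const char *table_name;            /* table names */
    const char *att_names[MAX_ATTS];   /* column names */
    const char *att_types[MAX_ATTS];   /* type information */
    const char *att_values[MAX_ATTS];  /* the data, here already as text */
    int         natts;
} LogicalChangeRecord;

/* Look up a column's value by name; NULL if the record has no such column. */
static const char *lcr_get_value(const LogicalChangeRecord *r, const char *att)
{
    for (int i = 0; i < r->natts; i++)
        if (strcmp(r->att_names[i], att) == 0)
            return r->att_values[i];
    return (const char *) 0;
}
```

The point of the thread's debate is where this metadata comes from and how it is kept transactionally consistent with the data, not the shape of the record itself.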
 
As an initial implementation, I'm leaning toward the position that
requiring a hot standby or a catalog-only proxy is acceptable.  I
think that should allow an application to be written which emits
everything except the replication configuration.  That will allow us
to hook up everything we need at our shop.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 2:23 PM, Andres Freund and...@2ndquadrant.com wrote:
 Well, the words are fuzzy, but I would define logical replication to
 be something which is independent of the binary format in which stuff
 gets stored on disk.  If it's not independent of the disk format, then
 you can't do heterogenous replication (between versions, or between
 products).  That precise limitation is the main thing that drives
 people to use anything other than SR in the first place, IME.
 Not in mine. The main limitation I see is that you cannot write anything on
 the standby. Which sucks majorly for many things. Its pretty much impossible
 to fix that for SR outside of very limited cases.
 While many scenarios don't need multimaster *many* need to write outside of
 the standby's replication set.

Well, that's certainly a common problem, even if it's not IME the most
common, but I don't think we need to argue about which one is more
common, because I'm not arguing against it.  The point, though, is
that if the logical format is independent of the on-disk format, the
things we can do are a strict superset of the things we can do if it
isn't.  I don't want to insist that catalogs be the same (or else you
get garbage when you decode tuples).  I want to tolerate the fact that
they may very well be different.  That will in no way preclude writing
outside the standby's replication set, nor will it prevent
multi-master replication.  It will, however, enable heterogenous
replication, which is a very important use case.  It will also mean
that innocent mistakes (like somehow ending up with a column that is
text on one server and numeric on another server) produce
comprehensible error messages, rather than garbage.

 It's not only the logging side which is a limitation in today's replication
 scenarios. The apply side scales even worse because it's *very* hard to
 distribute it between multiple backends.

I don't think that making LCR format = on-disk format is going to
solve that problem.  To solve that problem, we need to track
dependencies between transactions, so that if tuple A is modified by
T1 and T2, in that order, we apply T1 before T2.  But if T3 - which
committed after both T1 and T2 - touches none of the same data as T1
or T2 - then we can apply it in parallel, so long as we don't commit
until T1 and T2 have committed (because allowing T3 to commit early
would produce a serialization anomaly from the point of view of a
concurrent reader).

 Because the routines that decode tuples don't include enough sanity
 checks to prevent running off the end of the block, or even the end of
 memory completely.  Consider a corrupt TOAST pointer that indicates
 that there is a gigabyte of data stored in an 8kB block.  One of the
 common symptoms of corruption IME is TOAST requests for -3 bytes of
 memory.
 Yes, but we need to put safeguards against that sort of thing anyway. So sure,
 we can have bugs but this is not a fundamental limitation.

There's a reason we haven't done that already, though: it's probably
going to stink for performance.  If it turns out that it doesn't stink
for performance, great.  But if it causes a 5% slowdown on common use
cases, I suspect we're not gonna do it, and I bet I can construct a
case where it's worse than that (think: 400 column table with lots of
varlenas, sorting by column 400 to return column 399).  I think it's
treading on dangerous ground to assume we're going to be able to just
go fix this.

 PostGIS uses one information table in a few more complex functions but not in
 anything low-level. Evidenced by the fact that it was totally normal for that
 to go out of sync before 2.0.

 But even if such a thing would be needed, it wouldn't be problematic to make
 extension configuration tables be replicated as well.

Ugh.  That's a hack on top of a hack.  Now it all works great if type
X is installed as an extension but if it isn't installed as an
extension then the world blows up.

 I am pretty sure it's not bad-behaved. But how should the code know that? You
 want each type to explicitly say that it's unsafe if it is?

Yes, exactly.  Or maybe there are varying degrees of non-safety,
allowing varying degrees of optimization.  Like: wire format = binary
format is super-safe.  Then having to call an I/O function that
promises not to look at any catalogs is a bit less safe.  And then
there's really unsafe.
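The graduated scheme Robert sketches could be modelled along these lines; this enum and function are illustrative only, nothing like them exists in the patch:

```c
#include <string.h>

/* Hypothetical per-type declaration of how safely its values can be
 * shipped, enabling varying degrees of optimization as described above. */
typedef enum TypeTransferSafety {
    TRANSFER_BINARY_SAFE,    /* wire format == on-disk format: ship raw */
    TRANSFER_IO_NO_CATALOG,  /* output function reads no catalogs */
    TRANSFER_UNSAFE          /* needs full catalog access to decode */
} TypeTransferSafety;

/* Pick the cheapest representation the type's declaration allows. */
static const char *transfer_strategy(TypeTransferSafety s)
{
    switch (s)
    {
        case TRANSFER_BINARY_SAFE:   return "raw binary";
        case TRANSFER_IO_NO_CATALOG: return "text via catalog-free I/O";
        default:                     return "text with full catalog access";
    }
}
```

The design question in the thread is who makes this promise (the type author, explicitly) and what happens when a type is mislabelled.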

 I have played with several ideas:

 1.)
 keep the decoding catalog in sync with command/event triggers, correctly
 replicating oids. If those log into some internal event table, it's easy to keep
 the catalog in a correct transactional state because the events from that
 table get decoded in the transaction and replayed at exactly the right spot
 *after* it has been reassembled. The locking on the generating side
 takes care of the concurrency aspects.

I am not following this one completely.

 2.)
 Keep the decoding site up2date by replicating the catalog via normal recovery
 

Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-18 Thread Robert Haas
On Sat, Jun 16, 2012 at 3:03 PM, Steve Singer st...@ssinger.info wrote:
 I feel that in-core support to capture changes and turn them into change
 records that can be replayed on other databases, without relying on triggers
 and log tables, would be good to have.

 I think we want something flexible enough that people can write consumers of the
 LCRs to do conflict resolution for multi-master, but I am not sure that the
 conflict resolution support actually belongs in core.

I agree, on both counts.  Anyone else want to chime in here?

 Most of the complexity of slony (both in terms of lines of code, and issues
 people encounter using it) comes not from the log triggers or replay of the
 logged data but comes from the configuration of the cluster.
 Controlling things like

 * Which tables replicate from a node to which other nodes
 * How do you change the cluster configuration on a running system (adding
 nodes, removing nodes, moving the origin of a table, adding tables to
 replication etc...)

Not being as familiar with Slony as I probably ought to be, I hadn't
given this much thought, but it's an interesting point.  The number of
logical replication policies that someone might want to implement, and
the ways in which they might want to change them as the situation
develops, is very large.  Whole cluster, whole database, one or
several schemas, individual tables, perhaps even more fine-grained
than per-table.  Trying to figure all of that out is going to require
a lot of work and, frankly, I question the value of having that stuff
in core anyway.

 I see three catalogs in play here.
 1. The catalog on the origin
 2. The catalog on the proxy system (this is the catalog used to translate
 the WAL records to LCR's).  The proxy system will need essentially the same
 pgsql binaries (same architecture, important compile flags etc..) as the
 origin
 3. The catalog on the destination system(s).

 The catalog 2 must be in sync with catalog 1, catalog 3 shouldn't need to be
 in-sync with catalog 1.   I think catalogs 2 and 3 are combined in the
 current patch set (though I haven't yet looked at the code closely).   I
 think the performance optimizations Andres has implemented to update tuples
 through low-level functions should be left for later and that we should be
 generating SQL in the apply cache so we don't start assuming much about
 catalog 3.

+1.  Although there is a lot of performance benefit to be had there,
it seems better to me to get the basics working and then do
performance optimization later.  That is, if we can detect that the
catalogs are in sync, then by all means ship around the binary tuple
to make things faster.  But requiring that (without having any way to
know whether it actually holds) strikes me as a mess.

 Part of what people expect from a robust in-core solution is that it should
 work with the other in-core features.  If we have to list a bunch of
 in-core types as being incompatible with logical replication then people will
 look at logical replication with the same 'there be dragons here' attitude
 that scares many people away from the existing third party replication
 solutions.   Non-core or third party user defined types are a slightly
 different matter because we can't control what they do.

I agree, although I don't think either Andres or I are saying anything else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-18 Thread Robert Haas
On Sat, Jun 16, 2012 at 7:43 AM, Andres Freund and...@2ndquadrant.com wrote:
  Hm. Yes, you could do that. But I have to say I don't really see a point.
  Maybe the fact that I do envision multimaster systems at some point is
  clouding my judgement though, as it's far less easy in that case.
 Why?  I don't think that particularly changes anything.
 Because it makes conflict detection very hard. I also don't think it's a
 feature worth supporting. What's the use-case of updating records you cannot
 properly identify?

Don't ask me; I just work here.  I think it's something that some
people want, though.  I mean, if you don't support replicating a table
without a primary key, then you can't even run pgbench in a
replication environment.  Is that an important workload?  Well,
objectively, no.  But I guarantee you that other people with more
realistic workloads than that will complain if we don't have it.
Absolutely required on day one?  Probably not.  Completely useless
appendage that no one wants?  Not that, either.

 In my view, a logical replication solution is precisely one in which
 the catalogs don't need to be in sync.  If the catalogs have to be in
 sync, it's not logical replication.  ISTM that what you're talking
 about is sort of a hybrid between physical replication (pages) and
 logical replication (tuples) - you want to ship around raw binary
 tuple data, but not entire pages.
 Ok, that's a valid point. Simon argued at the cluster summit that everything
 that's not physical is logical. Which has some appeal because it seems hard to
 agree what exactly logical rep is. So definition by exclusion makes kind of
 sense ;)

Well, the words are fuzzy, but I would define logical replication to
be something which is independent of the binary format in which stuff
gets stored on disk.  If it's not independent of the disk format, then
you can't do heterogenous replication (between versions, or between
products).  That precise limitation is the main thing that drives
people to use anything other than SR in the first place, IME.

 I think what you categorized as hybrid logical/physical rep solves an
 important use-case that's very hard to solve at the moment. Before my
 2ndquadrant days I had several clients who had huge problems using the
 trigger-based solutions because their overhead simply was too big a burden
 on the master. They couldn't use SR either because every consuming database
 kept loads of local data.
 I think such scenarios are getting more and more common.

I think this is to some extent true, but I also think you're
conflating two different things.  Change extraction via triggers
introduces overhead that can be eliminated by reconstructing tuples
from WAL in the background rather than forcing them to be inserted
into a shadow table (and re-WAL-logged!) in the foreground.  I will
grant that shipping the tuple as a binary blob rather than as text
eliminates additional overhead on both ends, but it also closes off a
lot of important use cases.  As I noted in my previous email, I think
that ought to be a performance optimization that we do, if at all,
when it's been proven safe, not a baked-in part of the design.  Even a
solution that decodes WAL to text tuples and ships those around and
reinserts them via pure SQL should be significantly faster than the
replication solutions we have today; if it isn't, something's wrong.
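The "decode WAL to text tuples and reinsert via pure SQL" path amounts to plain statement generation. A minimal sketch, with deliberately simplified quoting and invented names (a real implementation would escape values properly):

```c
#include <stdio.h>

/* Turn one decoded row into an INSERT statement as text.  Values are
 * naively wrapped in single quotes; this is a sketch, not safe quoting. */
static int build_insert_sql(char *buf, size_t buflen,
                            const char *table,
                            const char **cols, const char **vals, int n)
{
    int off = snprintf(buf, buflen, "INSERT INTO %s (", table);
    for (int i = 0; i < n; i++)
        off += snprintf(buf + off, buflen - off, "%s%s",
                        cols[i], i < n - 1 ? ", " : ") VALUES (");
    for (int i = 0; i < n; i++)
        off += snprintf(buf + off, buflen - off, "'%s'%s",
                        vals[i], i < n - 1 ? ", " : ");");
    return off;
}
```

Robert's point is that even this naive text round-trip, applied through the normal SQL layer, should already beat trigger-based log tables, since the extraction happens in the background instead of being re-WAL-logged in the foreground.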

 The problem with that is it's going to be tough to make robust.  Users could
 easily end up with answers that are total nonsense, or probably even crash
 the server.
 Why?

Because the routines that decode tuples don't include enough sanity
checks to prevent running off the end of the block, or even the end of
memory completely.  Consider a corrupt TOAST pointer that indicates
that there is a gigabyte of data stored in an 8kB block.  One of the
common symptoms of corruption IME is TOAST requests for -3 bytes of
memory.

And, of course, even if you could avoid crashing, interpreting what
was originally intended as a series of int4s as a varlena isn't likely
to produce anything terribly meaningful.  Tuple data isn't
self-identifying; that's why this is such a hard problem.

 To step back and talk about DDL more generally, you've mentioned a few
 times the idea of using an SR instance that has been filtered down to
 just the system catalogs as a means of generating logical change
 records.  However, as things stand today, there's no reason to suppose
 that replicating anything less than the entire cluster is sufficient.
 For example, you can't translate enum labels to strings without access
 to the pg_enum catalog, which would be there, because enums are
 built-in types.  But someone could supply a similar user-defined type
 that uses a user-defined table to do those lookups, and now you've got
 a problem.  I think this is a contractual problem, not a technical
 one.  From the point of view of logical replication, it would be nice
 if type output functions were basically guaranteed to 

Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-16 Thread Andres Freund
Hi Robert,

On Friday, June 15, 2012 10:03:38 PM Robert Haas wrote:
 On Thu, Jun 14, 2012 at 4:13 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  I don't plan to throw in loads of conflict resolution smarts. The aim is
  to get to the place where all the infrastructure is there so that a MM
  solution can be built by basically plugging in a conflict resolution
  mechanism. Maybe providing a very simple one.
  I think without in-core support it's really, really hard to build a
  sensible MM implementation. Which doesn't mean it has to live entirely
  in core.
 
 Of course, several people have already done it, perhaps most notably
 Bucardo.
Bucardo certainly is nice, but it's not usable for many things just from an 
overhead perspective.

 Anyway, it would be good to get opinions from more people here.  I am
 sure I am not the only person with an opinion on the appropriateness
 of trying to build a multi-master replication solution in core or,
 indeed, the only person with an opinion on any of these other issues.
 It is not good for those other opinions to be saved for a later date.
Agreed.

  Hm. Yes, you could do that. But I have to say I don't really see a point.
  Maybe the fact that I do envision multimaster systems at some point is
  clouding my judgement though, as it's far less easy in that case.
 Why?  I don't think that particularly changes anything.
Because it makes conflict detection very hard. I also don't think it's a 
feature worth supporting. What's the use-case of updating records you cannot 
properly identify?

  It also complicates the WAL format, as you now need to specify whether you
  transport a full or a primary-key-only tuple...
 Why?  If the schemas are in sync, the target knows what the PK is
 perfectly well.  If not, you're probably in trouble anyway.
True. There already was the wish (from Kevin) of having the option of 
transporting full before/after images anyway, so the WAL format might want to 
be able to represent that.

  I think though that we do not want to enforce that mode of operation for
  tightly coupled instances. For those I was thinking of using command
  triggers to synchronize the catalogs.
  One of the big screwups of the current replication solutions is exactly
  that you cannot sensibly do DDL. Which is not a big problem if you have a
  huge system with loads of different databases and very knowledgeable
  people et al., but at the beginning it really sucks. I have no problem
  with making one of the nodes the schema master in that case.
  Also I would like to avoid the overhead of the proxy instance for
  use-cases where you really want one node replicated as fully as possible,
  with the slight exception of being able to have summing tables,
  different indexes et al.
 In my view, a logical replication solution is precisely one in which
 the catalogs don't need to be in sync.  If the catalogs have to be in
 sync, it's not logical replication.  ISTM that what you're talking
 about is sort of a hybrid between physical replication (pages) and
 logical replication (tuples) - you want to ship around raw binary
 tuple data, but not entire pages.
Ok, that's a valid point. Simon argued at the cluster summit that everything 
that's not physical is logical. Which has some appeal because it seems hard to 
agree what exactly logical rep is. So definition by exclusion makes kind of 
sense ;)

I think what you categorized as hybrid logical/physical rep solves an 
important use-case that's very hard to solve at the moment. Before my 
2ndquadrant days I had several clients who had huge problems using the 
trigger-based solutions because their overhead simply was too big a burden on 
the master. They couldn't use SR either because every consuming database kept 
loads of local data.
I think such scenarios are getting more and more common.

 The problem with that is it's going to be tough to make robust.  Users could
 easily end up with answers that are total nonsense, or probably even crash
 the server.
Why?

 To step back and talk about DDL more generally, you've mentioned a few
 times the idea of using an SR instance that has been filtered down to
 just the system catalogs as a means of generating logical change
 records.  However, as things stand today, there's no reason to suppose
 that replicating anything less than the entire cluster is sufficient.
 For example, you can't translate enum labels to strings without access
 to the pg_enum catalog, which would be there, because enums are
 built-in types.  But someone could supply a similar user-defined type
 that uses a user-defined table to do those lookups, and now you've got
 a problem.  I think this is a contractual problem, not a technical
 one.  From the point of view of logical replication, it would be nice
 if type output functions were basically guaranteed to look at nothing
 but the datum they get passed as an argument, or at the very least
 nothing other than the system catalogs, but there is no such
 guarantee.  And, without 

Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-16 Thread Steve Singer

On 12-06-15 04:03 PM, Robert Haas wrote:

On Thu, Jun 14, 2012 at 4:13 PM, Andres Freund and...@2ndquadrant.com  wrote:

I don't plan to throw in loads of conflict resolution smarts. The aim is to get
to the place where all the infrastructure is there so that a MM solution can
be built by basically plugging in a conflict resolution mechanism. Maybe
providing a very simple one.
I think without in-core support it's really, really hard to build a sensible MM
implementation. Which doesn't mean it has to live entirely in core.

Of course, several people have already done it, perhaps most notably Bucardo.

Anyway, it would be good to get opinions from more people here.  I am
sure I am not the only person with an opinion on the appropriateness
of trying to build a multi-master replication solution in core or,
indeed, the only person with an opinion on any of these other issues.


This sounds like a good place for me to chime in.

I feel that in-core support to capture changes and turn them into change 
records that can be replayed on other databases, without relying on 
triggers and log tables, would be good to have.


I think we want something flexible enough that people can write consumers of the 
LCRs to do conflict resolution for multi-master, but I am not sure that 
the conflict resolution support actually belongs in core.
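
The "plug in a conflict resolution mechanism" shape Steve describes could be sketched roughly as follows. This is purely illustrative: the function names, the row layout, and the timestamp-based conflict check are all invented for the sketch, not anything in the patch set.

```python
# Toy shape of a pluggable conflict resolver: the apply loop detects a
# conflicting update and delegates the decision to a user-supplied
# callback instead of hard-coding a policy in core.

def last_update_wins(local_row, remote_row):
    """One trivial resolver: keep whichever change is newer."""
    return remote_row if remote_row["ts"] >= local_row["ts"] else local_row

def apply_change(table, remote_row, resolver):
    """Apply a remote change to `table`, consulting `resolver` on conflict."""
    local_row = table.get(remote_row["id"])
    if local_row is not None and local_row["ts"] != remote_row["prev_ts"]:
        winner = resolver(local_row, remote_row)   # conflict: delegate
    else:
        winner = remote_row                        # fast path: no conflict
    table[remote_row["id"]] = winner
    return winner
```

The point of the shape is that core only detects the conflict; the policy (`last_update_wins` here, but it could be anything) is the pluggable part.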


Most of the complexity of Slony (both in terms of lines of code and 
issues people encounter using it) comes not from the log triggers or 
replay of the logged data but from the configuration of the cluster.

Controlling things like

* Which tables replicate from a node to which other nodes
* How do you change the cluster configuration on a running system 
(adding nodes, removing nodes, moving the origin of a table, adding 
tables to replication, etc.)


This is the harder part of the problem. I think we need to first get the 
infrastructure committed (that the current patch set deals with) for 
capturing, transporting and translating the LCRs into the system before 
we get too caught up in the configuration aspects.  I think we will have a 
hard time agreeing on behaviours for some of that other stuff that are 
both flexible enough for the various use cases and simple enough for 
administrators.  I'd like to see in-core support for a lot of that stuff 
but I'm not holding my breath.



It is not good for those other opinions to be saved for a later date.


Hm. Yes, you could do that. But I have to say I don't really see a point.
Maybe the fact that I do envision multimaster systems at some point is
clouding my judgement though, as it's far less easy in that case.

Why?  I don't think that particularly changes anything.


It also complicates the WAL format, as you now need to specify whether you
transport a full or a primary-key-only tuple...

Why?  If the schemas are in sync, the target knows what the PK is
perfectly well.  If not, you're probably in trouble anyway.





I think though that we do not want to enforce that mode of operation for
tightly coupled instances. For those I was thinking of using command triggers
to synchronize the catalogs.
One of the big screwups of the current replication solutions is exactly that
you cannot sensibly do DDL. That is not a big problem if you have a huge
system with loads of different databases and very knowledgeable people et al.,
but at the beginning it really sucks. I have no problem with making one of the
nodes the schema master in that case.
Also I would like to avoid the overhead of the proxy instance for use-cases
where you really want one node replicated as fully as possible, with the slight
exception of being able to have summing tables, different indexes et al.

In my view, a logical replication solution is precisely one in which
the catalogs don't need to be in sync.  If the catalogs have to be in
sync, it's not logical replication.  ISTM that what you're talking
about is sort of a hybrid between physical replication (pages) and
logical replication (tuples) - you want to ship around raw binary
tuple data, but not entire pages.  The problem with that is it's going
to be tough to make robust.  Users could easily end up with answers
that are total nonsense, or probably even crash the server.



I see three catalogs in play here.
1. The catalog on the origin
2. The catalog on the proxy system (this is the catalog used to 
translate the WAL records to LCRs).  The proxy system will need 
essentially the same pgsql binaries (same architecture, important 
compile flags, etc.) as the origin

3. The catalog on the destination system(s).

Catalog 2 must be in sync with catalog 1; catalog 3 shouldn't need 
to be in sync with catalog 1.   I think catalogs 2 and 3 are combined in 
the current patch set (though I haven't yet looked at the code 
closely).   I think the performance optimizations Andres has implemented 
to update tuples through low-level functions should be left for later 
and that we should be generating SQL in the apply cache so we don't 
start 
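
Steve's idea of generating SQL in the apply cache (rather than applying tuples through low-level functions) could be sketched as follows. The LCR dict layout and function names are invented for illustration only; nothing here reflects the patch's actual structures, and the quoting is deliberately minimal.

```python
# Toy sketch: turn a logical change record (LCR) into SQL text that any
# PostgreSQL version could apply. All names here are illustrative.

def quote_literal(value):
    """Minimal literal quoting for the sketch (not production-safe)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"

def lcr_to_sql(lcr):
    """Render an INSERT/UPDATE/DELETE change record as a SQL statement."""
    table = lcr["table"]
    if lcr["action"] == "INSERT":
        cols = ", ".join(lcr["new"].keys())
        vals = ", ".join(quote_literal(v) for v in lcr["new"].values())
        return f"INSERT INTO {table} ({cols}) VALUES ({vals});"
    if lcr["action"] == "UPDATE":
        sets = ", ".join(f"{c} = {quote_literal(v)}" for c, v in lcr["new"].items())
        keys = " AND ".join(f"{c} = {quote_literal(v)}" for c, v in lcr["key"].items())
        return f"UPDATE {table} SET {sets} WHERE {keys};"
    if lcr["action"] == "DELETE":
        keys = " AND ".join(f"{c} = {quote_literal(v)}" for c, v in lcr["key"].items())
        return f"DELETE FROM {table} WHERE {keys};"
    raise ValueError(f"unknown action {lcr['action']!r}")
```

The trade-off Steve is pointing at: SQL text is slower to apply than low-level tuple insertion, but it decouples the destination's catalog from the origin's.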

Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-15 Thread Robert Haas
On Thu, Jun 14, 2012 at 4:13 PM, Andres Freund and...@2ndquadrant.com wrote:
 I don't plan to throw in loads of conflict resolution smarts. The aim is to 
 get
 to the place where all the infrastructure is there so that a MM solution can
 be built by basically plugging in a conflict resolution mechanism. Maybe
 providing a very simple one.
 I think without in-core support it's really, really hard to build a sensible MM
 implementation. Which doesn't mean it has to live entirely in core.

Of course, several people have already done it, perhaps most notably Bucardo.

Anyway, it would be good to get opinions from more people here.  I am
sure I am not the only person with an opinion on the appropriateness
of trying to build a multi-master replication solution in core or,
indeed, the only person with an opinion on any of these other issues.
It is not good for those other opinions to be saved for a later date.

 Hm. Yes, you could do that. But I have to say I don't really see a point.
 Maybe the fact that I do envision multimaster systems at some point is
 clouding my judgement though, as it's far less easy in that case.

Why?  I don't think that particularly changes anything.

 It also complicates the WAL format, as you now need to specify whether you
 transport a full or a primary-key-only tuple...

Why?  If the schemas are in sync, the target knows what the PK is
perfectly well.  If not, you're probably in trouble anyway.
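
One way to picture the format question being argued here (the flag values, record layout, and function are invented for illustration, not the patch's actual WAL format): an UPDATE/DELETE record can carry either the full old tuple or only its key columns, and either a per-record flag or the apply side's own catalog has to tell the decoder which it is.

```python
# Sketch: a change record carries either the full old tuple or only the
# primary-key columns; a flag byte distinguishes the two. With schemas
# in sync, the apply side can project the key out of a full tuple using
# its own catalog, which is Robert's point.

FULL_OLD_TUPLE = 0x01   # record carries every old column
PK_ONLY = 0x02          # record carries only the key columns

def old_row_key(record, pk_columns):
    """Return the column values usable to locate the old row on apply."""
    if record["flags"] & FULL_OLD_TUPLE:
        # Full old tuple shipped: project out the key columns.
        return {c: record["old"][c] for c in pk_columns}
    # PK-only: the record already contains exactly the key columns.
    return record["old"]
```

Either way the apply side ends up with the same key, which is why shipping the full tuple does not strictly require extra format information when the schemas match.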

 I think though that we do not want to enforce that mode of operation for
 tightly coupled instances. For those I was thinking of using command triggers
 to synchronize the catalogs.
 One of the big screwups of the current replication solutions is exactly that
 you cannot sensibly do DDL which is not a big problem if you have a huge
 system with loads of different databases and very knowledgeable people et al.
 but at the beginning it really sucks. I have no problem with making one of the
 nodes the schema master in that case.
 Also I would like to avoid the overhead of the proxy instance for use-cases
 where you really want one node replicated as fully as possible with the slight
 exception of being able to have summing tables, different indexes et al.

In my view, a logical replication solution is precisely one in which
the catalogs don't need to be in sync.  If the catalogs have to be in
sync, it's not logical replication.  ISTM that what you're talking
about is sort of a hybrid between physical replication (pages) and
logical replication (tuples) - you want to ship around raw binary
tuple data, but not entire pages.  The problem with that is it's going
to be tough to make robust.  Users could easily end up with answers
that are total nonsense, or probably even crash the server.

To step back and talk about DDL more generally, you've mentioned a few
times the idea of using an SR instance that has been filtered down to
just the system catalogs as a means of generating logical change
records.  However, as things stand today, there's no reason to suppose
that replicating anything less than the entire cluster is sufficient.
For example, you can't translate enum labels to strings without access
to the pg_enum catalog, which would be there, because enums are
built-in types.  But someone could supply a similar user-defined type
that uses a user-defined table to do those lookups, and now you've got
a problem.  I think this is a contractual problem, not a technical
one.  From the point of view of logical replication, it would be nice
if type output functions were basically guaranteed to look at nothing
but the datum they get passed as an argument, or at the very least
nothing other than the system catalogs, but there is no such
guarantee.  And, without such a guarantee, I don't believe that we can
create a high-performance, robust, in-core replication solution.

Now, the nice thing about being the people who make PostgreSQL happen
is we get to decide what the C code that people load into the server
is required to guarantee; we can change the rules.  Before, types were
allowed to do X, but now they're not.  Unfortunately, in this case, I
don't really find that an acceptable solution.  First, it might break
code that has worked with PostgreSQL for many years; but worse, it
won't break it in any obvious way, but rather only if you're using
logical replication, which will doubtless cause people to attribute
the failure to logical replication rather than to their own code.
Even if they do understand that we imposed a rule-change from on high,
there's no really good workaround: an enum type is a good example of
something that you *can't* implement without a side-table.  Second, it
flies in the face of our often-stated desire to make the server
extensible.  Also, even given the existence of such a restriction, you
still need to run any output function that relies on catalogs with
catalog contents that match what existed at the time that WAL was
generated, and under the correct snapshot, which is not 

Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-15 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 So maybe instead of trying to cobble together a set of catalog
 contents that we can use for decoding any tuple whatsoever, we
 should instead divide the world into well-behaved types and
 poorly-behaved types.  Well-behaved types are those that can be
 interpreted without the catalogs, provided that you know what type
 it is.  Poorly-behaved types (records, enums) are those where you
 can't.  For well-behaved types, we only need a small amount of
 additional information in WAL to identify which types we're trying
 to decode (not the type OID, which might fail in the presence of
 nasty catalog hacks, but something more universal, like a UUID
 that means this is text, or something that identifies the C
 entrypoint).  And then maybe we handle poorly-behaved types by
 pushing some of the work into the foreground task that's
 generating the WAL: in the worst case, the process logs a record
 before each insert/update/delete containing the text
 representation of any values that are going to be hard to decode. 
 In some cases (e.g. records all of whose constituent fields are
 well-behaved types) we could instead log enough additional
 information about the type to permit blind decoding.
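
Robert's proposed split could be modeled roughly as below. The registry, the string identifiers standing in for the proposed UUIDs, and the fallback argument are all invented for the sketch; this is the shape of the idea, not an implementation.

```python
# Toy model of the proposed split: well-behaved types decode from the
# datum alone via a registry keyed by a stable identifier (standing in
# for the proposed UUID); poorly-behaved types (enums, records) fall
# back to a textual representation the writing backend logged just
# before the change.

WELL_BEHAVED = {
    "type:int8": lambda datum: str(int(datum)),
    "type:text": lambda datum: str(datum),
}

def decode_column(type_id, datum, pre_logged_text=None):
    """Decode one column value without any catalog access."""
    decoder = WELL_BEHAVED.get(type_id)
    if decoder is not None:
        return decoder(datum)      # decodable from the datum alone
    if pre_logged_text is not None:
        return pre_logged_text     # the foreground backend paid the cost
    raise LookupError(f"cannot decode {type_id} without logged text")
```

The cost asymmetry is the interesting part: well-behaved types cost nothing extra at WAL-generation time, while poorly-behaved ones push work into the foreground transaction.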
 
What about matching those values up to the correct table name and
the respective columns names?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-14 Thread Robert Haas
On Wed, Jun 13, 2012 at 7:27 AM, Andres Freund and...@2ndquadrant.com wrote:
 === Design goals for logical replication === :
 - in core
 - fast
 - async
 - robust
 - multi-master
 - modular
 - as unintrusive as possible implementation wise
 - basis for other technologies (sharding, replication into other DBMSs, ...)

I agree with all of these goals except for multi-master.  I am not
sure that there is a need to have a multi-master replication solution
in core.  The big tricky part of multi-master replication is conflict
resolution, and that seems like an awful lot of logic to try to build
into core - especially given that we will want it to be extensible.

More generally, I would much rather see us focus on efficiently
extracting changesets from WAL and efficiently applying those
changesets on another server.  IMHO, those are the things that are
holding back the not-in-core replication solutions we have today,
particularly the first.  If we come up with a better way of applying
change-sets, well, Slony can implement that too; they are already
installing C code.  What neither they nor any other non-core solution
can implement is change-set extraction, and therefore I think that
ought to be the focus.

To put all that another way, I think it is a 100% bad idea to try to
kill off Slony, Bucardo, Londiste, or any of the many home-grown
solutions that are out there to do replication.  Even if there were no
technical place for third-party replication products (and I think
there is), we will not win many friends by making it harder to extend
and add on to the server.  If we build an in-core replication solution
that can be used all by itself, that is fine with me.  But I think it
should also be able to expose its constituent parts as a toolkit for
third-party solutions.

 While you may argue that most of the above design goals are already provided by
 various trigger-based replication solutions like Londiste or Slony, we think
 that that's not enough for various reasons:

 - not in core (and thus less trustworthy)
 - duplication of writes due to an additional log
 - performance in general (check the end of the above presentation)
 - complex to use because there is no native administration interface

I think that your parenthetical note (and thus less trustworthy)
gets at another very important point, which is that one of the
standards for inclusion in core is that it must in fact be trustworthy
enough to justify the confidence that users will place in it.  It will
NOT benefit the project to have two replication solutions in core, one
of which is crappy.  More, even if what we put in core is AS GOOD as
the best third-party solutions that are available, I don't think
that's adequate.  It has to be better.  If it isn't, there is no
excuse for preempting what's already out there.

I imagine you are thinking along similar lines, but I think that it
bears being explicit about.

 As we need a change stream that contains all required changes in the correct
 order, the requirement for this stream to reflect changes across multiple
 concurrent backends raises concurrency and scalability issues. Reusing the
 WAL stream for this seems a good choice since it is needed anyway and addresses
 those issues already, and it further means that we don't incur duplicate
 writes. Any other stream-generating component would introduce additional
 scalability issues.

Agreed.

 We need a change stream that contains all required changes in the correct 
 order
 which thus needs to be synchronized across concurrent backends which 
 introduces
 obvious concurrency/scalability issues.
 Reusing the WAL stream for this seems a good choice since it is needed anyway
 and addresses those issues already, and it further means we don't duplicate the
 writes and locks already performed for its maintenance.

Agreed.

 Unfortunately, in this case, the WAL is mostly a physical representation of 
 the
 changes and thus does not, by itself, contain the necessary information in a
 convenient format to create logical changesets.

Agreed.

 The biggest problem is that interpreting tuples in the WAL stream requires an
 up-to-date system catalog and needs to be done in a compatible backend and
 architecture. The requirement of an up-to-date catalog could be solved by
 adding more data to the WAL stream, but it seems likely that that would
 require relatively intrusive and complex changes. Instead we chose to require a
 synchronized catalog at the decoding site. That adds some complexity to use
 cases like replicating into a different database or cross-version
 replication. For those it is relatively straightforward to develop a proxy pg
 instance that only contains the catalog and does the transformation to textual
 changes.

The actual requirement here is more complex than an up-to-date
catalog.  Suppose transaction X begins, adds a column to a table,
inserts a row, and commits.  That tuple needs to be interpreted using
the tuple descriptor that 
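
The constraint Robert is describing (a record must be decoded with the tuple descriptor that was current when it was written, not the latest one) can be illustrated with a toy model; the LSN-keyed history list and lookup function are invented for the sketch.

```python
# Toy model: decoding must pick the tuple descriptor (column list) in
# effect at the record's position in the WAL stream, not the current one.
# Otherwise a row inserted before ALTER TABLE .. ADD COLUMN would be
# decoded against the wrong number of columns.

def descriptor_at(history, lsn):
    """history: list of (lsn, columns) in ascending LSN order; return the
    columns from the last catalog change at or before this record."""
    current = None
    for change_lsn, columns in history:
        if change_lsn > lsn:
            break
        current = columns
    return current

history = [
    (100, ["id", "data"]),          # table created
    (200, ["id", "data", "note"]),  # ALTER TABLE .. ADD COLUMN note
]
```

The hard part in the real system is that the "history" is the catalog itself, under MVCC, which is exactly why the snapshot has to match the WAL position.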

Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-14 Thread Andres Freund
Hi Robert,

Thanks for your answer.

On Thursday, June 14, 2012 06:17:26 PM Robert Haas wrote:
 On Wed, Jun 13, 2012 at 7:27 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  === Design goals for logical replication === :
  - in core
  - fast
  - async
  - robust
  - multi-master
  - modular
  - as unintrusive as possible implementation wise
  - basis for other technologies (sharding, replication into other DBMSs,
  ...)
 
 I agree with all of these goals except for multi-master.  I am not
 sure that there is a need to have a multi-master replication solution
 in core.  The big tricky part of multi-master replication is conflict
 resolution, and that seems like an awful lot of logic to try to build
 into core - especially given that we will want it to be extensible.
I don't plan to throw in loads of conflict resolution smarts. The aim is to get 
to the place where all the infrastructure is there so that a MM solution can 
be built by basically plugging in a conflict resolution mechanism. Maybe 
providing a very simple one.
I think without in-core support it's really, really hard to build a sensible MM 
implementation. Which doesn't mean it has to live entirely in core.

Loads of the use-cases we have seen lately have a relatively small, low-
conflict shared dataset and a far bigger sharded one. While that obviously 
isn't the only relevant use case, it is a sensible and important one.

 More generally, I would much rather see us focus on efficiently
 extracting changesets from WAL and efficiently applying those
 changesets on another server.  IMHO, those are the things that are
 holding back the not-in-core replication solutions we have today,
 particularly the first.  If we come up with a better way of applying
 change-sets, well, Slony can implement that too; they are already
 installing C code.  What neither they nor any other non-core solution
 can implement is change-set extraction, and therefore I think that
 ought to be the focus.
It definitely is a very important focus. I don't think it is the only one. But 
that doesn't seem to be a problem to me as long as everything is kept fairly 
modular (which I tried rather hard to do).

 To put all that another way, I think it is a 100% bad idea to try to
 kill off Slony, Bucardo, Londiste, or any of the many home-grown
 solutions that are out there to do replication.  Even if there were no
 technical place for third-party replication products (and I think
 there is), we will not win many friends by making it harder to extend
 and add on to the server.  If we build an in-core replication solution
 that can be used all by itself, that is fine with me.  But I think it
 should also be able to expose its constituent parts as a toolkit for
 third-party solutions.
I agree 100%. Unfortunately I forgot to explicitly make that point, but the 
plan definitely is to make the life of other replication solutions easier, 
not harder. I don't think there will ever be one replication solution that fits 
every use-case perfectly.
At pgcon I talked with some of the Slony guys and they were definitely 
interested in the changeset generation, and I have kept that in mind. If some 
problems that need resolving independently of that issue are resolved (namely 
DDL), it shouldn't take much to generate their output format. The 'apply' code 
is fully abstracted and separated.

  While you may argue that most of the above design goals are already
  provided by various trigger-based replication solutions like Londiste or
  Slony, we think that that's not enough for various reasons:
  
  - not in core (and thus less trustworthy)
  - duplication of writes due to an additional log
  - performance in general (check the end of the above presentation)
  - complex to use because there is no native administration interface
 
 I think that your parenthetical note (and thus less trustworthy)
 gets at another very important point, which is that one of the
 standards for inclusion in core is that it must in fact be trustworthy
 enough to justify the confidence that users will place in it.  It will
 NOT benefit the project to have two replication solutions in core, one
 of which is crappy.  More, even if what we put in core is AS GOOD as
 the best third-party solutions that are available, I don't think
 that's adequate.  It has to be better.  If it isn't, there is no
 excuse for preempting what's already out there.
I agree that it has to be very good. *But* I think it is totally acceptable 
if it doesn't have all the bells and whistles from the start. That would be a 
sure road to disaster. For one, implementing all that takes time, and for 
another, the amount of discussion until we are there is rather large.

 I imagine you are thinking along similar lines, but I think that it
 bears being explicit about.
Seems like we're thinking along the same lines, yes.

  The biggest problem is, that interpreting tuples in the WAL stream
  requires an up-to-date system catalog and needs to be done in a
  compatible backend 

Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-13 Thread Robert Haas
On Wed, Jun 13, 2012 at 7:27 AM, Andres Freund and...@2ndquadrant.com wrote:
Unless somebody objects I will add most of the individual patches marked as RFC 
to the current commitfest. I hope that with comments stemming from that round we 
can get several of the patches into the first or second commitfest. As soon as 
the design is clear/accepted we will try very hard to get the following patches 
into the second or third round.

I made a logical replication topic within the CommitFest for this
patch series.  I think you should add them all there.  I have some
substantive thoughts about the design as well, which I will write up
in a separate email.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-13 Thread Andres Freund
On Wednesday, June 13, 2012 05:53:31 PM Robert Haas wrote:
 On Wed, Jun 13, 2012 at 7:27 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  Unless somebody objects I will add most of the individual patches marked
  as RFC to the current commitfest. I hope that with comments stemming from
  that round we can get several of the patches into the first or second
  commitfest. As soon as the design is clear/accepted we will try very
  hard to get the following patches into the second or third round.
 
 I made a logical replication topic within the CommitFest for this
 patch series.  I think you should add them all there.  
Thanks. Added all but the bgworker patch (which is not ready) and the 
WalSndWakeup one (different category), which is not really relevant to the 
topic.

I have the feeling I am due quite some reviewing this round...

 I have some substantive thoughts about the design as well, which I will
 write up in a separate email.
Thanks. Looking forward to it. At least now, before I have read it.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-13 Thread Steve Singer

On 12-06-13 07:27 AM, Andres Freund wrote:

It's also available in the 'cabal-rebasing' branch on
git.postgresql.org/users/andresfreund/postgres.git . That branch will modify
history though.



That branch has a merge error in f685a11ce43b9694cbe61ffa42e396c9fbc32b05


gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include 
-D_GNU_SOURCE -c -o xact.o xact.c

xact.c:4684: error: expected identifier or ‘(’ before ‘’ token
xact.c:4690:46: warning: character constant too long for its type
xact.c:4712:46: warning: character constant too long for its type
xact.c:4719: error: expected identifier or ‘(’ before ‘’ token
xact.c:4740:46: warning: character constant too long for its type
make[4]: *** [xact.o] Error 1





Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-13 Thread Andres Freund
On Wednesday, June 13, 2012 07:11:40 PM Steve Singer wrote:
 On 12-06-13 07:27 AM, Andres Freund wrote:
  Its also available in the 'cabal-rebasing' branch on
  git.postgresql.org/users/andresfreund/postgres.git . That branch will
  modify history though.
 
 That branch has a merge error in f685a11ce43b9694cbe61ffa42e396c9fbc32b05
 
 
 gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
 -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
 -Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include
 -D_GNU_SOURCE -c -o xact.o xact.c
 xact.c:4684: error: expected identifier or ‘(’ before ‘’ token
 xact.c:4690:46: warning: character constant too long for its type
 xact.c:4712:46: warning: character constant too long for its type
 xact.c:4719: error: expected identifier or ‘(’ before ‘’ token
 xact.c:4740:46: warning: character constant too long for its type
 make[4]: *** [xact.o] Error 1
Aw crap. Will fix that. Sorry.

Andres



Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-13 Thread Andres Freund
On Wednesday, June 13, 2012 07:11:40 PM Steve Singer wrote:
 On 12-06-13 07:27 AM, Andres Freund wrote:
  Its also available in the 'cabal-rebasing' branch on
  git.postgresql.org/users/andresfreund/postgres.git . That branch will
  modify history though.
 
 That branch has a merge error in f685a11ce43b9694cbe61ffa42e396c9fbc32b05
 
 
 gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
 -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
 -Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include
 -D_GNU_SOURCE -c -o xact.o xact.c
 xact.c:4684: error: expected identifier or ‘(’ before ‘’ token
 xact.c:4690:46: warning: character constant too long for its type
 xact.c:4712:46: warning: character constant too long for its type
 xact.c:4719: error: expected identifier or ‘(’ before ‘’ token
 xact.c:4740:46: warning: character constant too long for its type
 make[4]: *** [xact.o] Error 1
Hrmpf. I reordered the patch series one last time to be more clear and somehow 
didn't notice this. I compiled and tested the now-pushed head 
(7e0340a3bef927f79b3d97a11f94ede4b911560c) and will submit an updated patch 
[10/16].

Andres



Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-13 Thread Andres Freund
Hi,

The patch series as of yet doesn't describe how you can actually use the prototype... 
Obviously at this point it's not very convenient:

I have two config files:
Node 1:
port = 5501
wal_level = logical
max_wal_senders = 10
wal_keep_segments = 200
multimaster_conninfo = 'port=5502 host=/tmp'
multimaster_node_id = 1

Node 2:
port = 5502
wal_level = logical
max_wal_senders = 10
wal_keep_segments = 200
multimaster_conninfo = 'port=5501 host=/tmp'
multimaster_node_id = 2

after initdb'ing the first cluster (initdb required):
$ ~/src/postgresql/build/assert/src/backend/postgres -D 
~/tmp/postgres/bdr/1/datadir/ -c 
config_file=~/tmp/postgres/bdr/1/postgresql.conf -c 
hba_file=~/tmp/postgres/bdr/1/pg_hba.conf -c 
ident_file=~/tmp/postgres/bdr/1/pg_ident.conf

$ psql -p 5501 -U andres postgres
CREATE TABLE data(id serial primary key, data bigint);
ALTER SEQUENCE data_id_seq INCREMENT 2;
SELECT setval('data_id_seq', 1);

shut down the cluster

$ rsync -raxv --delete /home/andres/tmp/postgres/bdr/1/datadir/* 
/home/andres/tmp/postgres/bdr/2/datadir

start both clusters, which should sync after some output.

$ psql -p 5502 -U andres postgres
SELECT setval('data_id_seq', 2);
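
The INCREMENT 2 plus per-node setval calls above partition the sequence space so the two nodes can never hand out the same id: node 1 generates odd values, node 2 even ones. Since setval's default leaves is_called true, the next nextval() on each node returns its seed plus the increment. A quick model of what each node then generates (the helper is purely illustrative):

```python
# Model of the two-node sequence setup: increment 2, node 1 seeded at 1
# and node 2 at 2 via setval(), so the generated ids interleave and the
# two nodes' inserts never collide on the shared primary key.

def nextvals(seed, increment, n):
    """The ids a node hands out: nextval() n times after setval(seq, seed),
    with setval's default is_called = true."""
    return [seed + increment * i for i in range(1, n + 1)]

node1 = nextvals(1, 2, 4)  # after SELECT setval('data_id_seq', 1) on node 1
node2 = nextvals(2, 2, 4)  # after SELECT setval('data_id_seq', 2) on node 2
```

This is the classic low-tech way to avoid primary-key conflicts in a two-master setup without any conflict resolution at all.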
