Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-22 Thread Kevin Grittner
Simon Riggs wrote:
 Greg Stark st...@mit.edu wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
 Isn't there an even more serious problem, namely that this
 assumes *all* transactions are serializable?

Do you mean in terms of the serializable transaction isolation level,
or something else?

I haven't read the patches, but I've been trying to follow the
discussion and I don't recall any hint of basing this on serializable
transactions on each source. Of course, when it comes down to
commits, both where a change is committed and where the work is
copied, there must be a commit order; and with asynchronous work
where data isn't partitioned such that there is a single clear owner
for each partition there will be conflicts which must be resolved.  I
don't get the impression that this point has been lost on Simon and
Andres.

 What happens when they aren't? Or even just that the effective
 commit order is not XID order?

 Firstly, I haven't read the code but I'm confident it doesn't make
 the elementary error of assuming commit order == xid order. I
 assume it's applying the reassembled transactions in commit order.

Same here.

 I don't think it assumes the transactions are serializable because
 it's only concerned with writes, not reads. So the transaction
 it's replaying may or may not have been able to view the data
 written by other transactions that committed earlier but it doesn't
 matter when trying to reproduce the effects using constants.

IIRC, the developers of this feature have explicitly said that they
will defer any consideration of trying to extend serializable
transaction isolation behavior to a multi-server basis until after
they have other things working. (Frankly, to do otherwise would not
be sane.) It appears to me that it can't be managed in a general
sense without destroying almost all the advantages of multi-master
replication, at least (as I said before) where data isn't partitioned
such that there is a single clear owner for each partition.

Where such partitioning is present and there are data sets maintained
exclusively by serializable transactions, anomaly-free reads of the
data could be accomplished by committing transactions on the replicas
in apparent order of execution rather than commit order. Apparent
order of execution must take both commit order and read-write
dependencies into consideration.

 The data written by this transaction is either written or not when
 the commit happens and it's all written or not at that time. Even
 in non-serializable mode updates take row locks and nobody can see
 the data or modify it until the transaction commits.

As with read-only transactions and hot standbys, the problem comes in
when a transaction commits and is replicated while a transaction
remains uncommitted which is basing its updates on the earlier state
of the data. It gets even more exciting with MMR since the
transaction working with the old version of the data might be on a
different machine, on another continent. With certain types of
workloads, it seems to me that it could get pretty crazy if certain
closely-related actions are not kept within a single database (like
controlling the active batch and adding items to a batch).
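
To make the batch case concrete, here is one problematic interleaving
(hypothetical control and items tables; a sketch, not the only failure
mode):

  -- Session 2 starts first and reads the currently active batch:
  BEGIN ISOLATION LEVEL REPEATABLE READ;
  SELECT active_batch FROM control;   -- returns 7

  -- Session 1 meanwhile closes batch 7:
  --   BEGIN; UPDATE control SET active_batch = 8; COMMIT;

  -- Session 2 commits last, adding an item to the closed batch:
  INSERT INTO items (batch_id, amount) VALUES (7, 100.00);
  COMMIT;

Replayed in commit order on another node, "batch 7 closes" arrives
before the item does, so a reader there can see the batch close and
then gain a row, an outcome that could not be observed under
SERIALIZABLE on a single database.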

In the wild, half-baked, hand-wavey suggestions department -- maybe
there should be some consideration of a long-term way within MMR to
direct activities to certain logical nodes, each of which could be
mapped to a single physical node at any one time. Basically, to route
a request through the MMR network to the current logical node for
handling something, and have the effects ripple back out through all
nodes.

 This uses Commit Serializability, which is valid, as you say.

Well, it is a type of concurrency control with a name and a
definition, if that's what you mean. I agree it provides sufficient
guarantees to create a workable MMR system, if you have adequate
conflict resolution. My concern is that it not be confused with
serializability in the mathematical sense or in the sense of
transaction isolation levels.

In general on this thread, when I've seen the terms "serializable"
and "serializability" I haven't been clear on whether the words are
being used in their more general sense as words in the English
language, or in a more particular technical sense.

-Kevin




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-15 Thread Tatsuo Ishii
 The design Andres and Simon have advanced already eliminates a lot of
 the common failure cases (now(), random(), nextval()) suffered by pgPool
 and similar tools.  But remember, this feature doesn't have to be
 *perfect*, it just has to be *better* than the alternatives.

Well, pgpool-II already solved the now() case by using a query
rewriting technique. The technique could be applied to random() as
well, but I'm not convinced it is worth the trouble. nextval() would
be a little harder because pgpool needs assistance from the
PostgreSQL core.
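
For illustration, the rewriting approach looks roughly like this
(hypothetical statement; pgpool-II's actual rewrite rules are more
involved):

  -- Statement as sent by the client:
  INSERT INTO orders (id, created_at) VALUES (42, now());

  -- Statement as forwarded to each backend: now() has been evaluated
  -- once and replaced by a constant, so every node stores the same value.
  INSERT INTO orders (id, created_at)
  VALUES (42, '2012-10-15 09:30:00.123456+09'::timestamptz);
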
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-11 Thread Hannu Krosing

On 10/11/2012 04:31 AM, Tom Lane wrote:

Greg Stark st...@mit.edu writes:

On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:

Isn't there an even more serious problem, namely that this assumes
*all* transactions are serializable?  What happens when they aren't?
Or even just that the effective commit order is not XID order?

I don't think it assumes the transactions are serializable because
it's only concerned with writes, not reads. So the transaction it's
replaying may or may not have been able to view the data written by
other transactions that committed earlier but it doesn't matter when
trying to reproduce the effects using constants.

I would believe that argument if the apply operations were at a
similar logical level to our current WAL records, namely "drop these
bits into that spot".  Unfortunately, they're not.  I think this argument
falls to the ground entirely as soon as you think about DDL being
applied by transactions A,B,C and then needing to express what
concurrent transactions X,Y,Z did in source terms.  Even something as
simple as a few column renames could break it, let alone anything as
esoteric as changing the meaning of datatype literals.

This is the whole reason for moving the reassembly to the source
side and having the possibility to use old snapshots to get the
catalog information.

Also, the locks that protect you from the effects of column-name
changes by concurrent DDL transactions also protect the logical
reassembly, if it is done in commit order.



regards, tom lane








Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-11 Thread Hannu Krosing

On 10/11/2012 03:10 AM, Robert Haas wrote:

On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:

The purpose of ApplyCache/transaction reassembly is to reassemble
interlaced records, and organise them by XID, so that the consumer
client code sees only streams (well, lists) of records split by XID.

I think I've mentioned it before, but in the interest of not being
seen to critique the bikeshed only after it's been painted: this
design gives up something very important that exists in our current
built-in replication solution, namely pipelining.

The lack of pipelining (and the ensuing complexity of applycache
and spilling to disk) is something we have discussed with Andres, and
to my understanding it is not a final design decision but just a
stepping stone in how this quite large development effort is structured.

Pipelining (or parallel apply, as I described it) requires either a
large number of apply backends and code to manage them, or autonomous
transactions.

It could (arguably!) be easier to implement autonomous transactions
instead of apply cache, but Andres had valid reasons to start with apply
cache and move to parallel apply later.

As I understand it, parallel apply is definitely one of the things that
will be coming, and after that the performance characteristics (fast AND
smooth) will be very similar to current physical WAL streaming.


With streaming
replication as it exists today, a transaction that modifies a huge
amount of data (such as a bulk load) can be applied on the standby as
it happens.  The rows thus inserted will become visible only if and
when the transaction commits on the master and the commit record is
replayed on the standby.  This has a number of important advantages,
perhaps most importantly that the lag between commit and data
visibility remains short.  With the proposed system, we can't start
applying the changes until the transaction has committed and the
commit record has been replayed, so a big transaction is going to have
a lot of apply latency.

Now, I am not 100% opposed to a design that surrenders this property
in exchange for other important benefits, but I think it would be
worth thinking about whether there is any way that we can design this
that either avoids giving that property up at all, or gives it up for
the time being but allows us to potentially get back to it in a later
version.  Reassembling complete transactions is surely cool and some
clients will want that, but being able to apply replicated
transactions *without* reassembling them in their entirety is even
cooler, and some clients will want that, too.

If we're going to stick with a design that reassembles transactions, I
think there are a number of issues that deserve careful thought.
First, memory usage.  I don't think it's acceptable for the decoding
process to assume that it can allocate enough backend-private memory
to store all of the in-flight changes (either as WAL or in some more
decoded form).  We have assiduously avoided such assumptions thus far;
you can write a terabyte of data in one transaction with just a
gigabyte of shared buffers if you so desire (and if you're patient).
Here's you making the same point in different words:


Applycache is presumably where you're going to want to spill
transaction streams to disk, eventually. That seems like a
prerequisite to commit.

Second, crash recovery.  I think whatever we put in place here has to
be able to survive a crash on any node.  Decoding must be able to
restart successfully after a system crash, and it has to be able to
apply exactly the set of transactions that were committed but not
applied prior to the crash.  Maybe an appropriate mechanism for this
already exists or has been discussed, but I haven't seen it go by;
sorry if I have missed the boat.


You consider this to be a throw-away function that won't ever be
committed. However, I strongly feel that you should move it into
/contrib, so that it can serve as a sort of reference implementation
for authors of decoder client code, in the same spirit as numerous
existing contrib modules (think contrib/spi).

Without prejudice to the rest of this review which looks quite
well-considered, I'd like to add a particular +1 to this point.







Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-11 Thread Josh Berkus
On 10/10/12 7:26 PM, Bruce Momjian wrote:
 How does Slony write its changes without causing serialization replay
 conflicts?

Since nobody from the Slony team answered this:

a) Slony replicates *rows*, not *statements*
b) Slony uses serializable mode internally for row replication
c) it's possible (though difficult) for creative usage to get Slony into
a deadlock situation

FWIW, I have always assumed that it is impossible (even theoretically)
to have statement-based replication without some constraints on the
statements you can run, or some replication failures.  I think we should
expect 9.3's logical replication out-the-gate to have some issues and
impose constraints on users, and improve with time but never be perfect.

The design Andres and Simon have advanced already eliminates a lot of
the common failure cases (now(), random(), nextval()) suffered by pgPool
and similar tools.  But remember, this feature doesn't have to be
*perfect*, it just has to be *better* than the alternatives.
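
For instance (hypothetical table, illustrative values), naive statement
replication diverges as soon as a statement contains a volatile
function, because each node evaluates it independently:

  INSERT INTO t (id, r, ts) VALUES (nextval('s'), random(), now());
  -- node A might store (101, 0.442, '2012-10-11 12:00:00.123')
  -- node B might store (37,  0.907, '2012-10-11 12:00:00.457')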

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-11 Thread Christopher Browne
On Wed, Oct 10, 2012 at 10:26 PM, Bruce Momjian br...@momjian.us wrote:
 How does Slony write its changes without causing serialization replay
 conflicts?

It uses a sequence to break any ordering conflicts at the time that
data is inserted into its log tables.

If there are two transactions, A and B, that were fighting over a
tuple on the origin, then either:

a) A went first, B went second, and the ordering in the log makes that
order clear;
b) A succeeds, then B fails, so there's no conflict;
c) A is doing its thing, and B is blocked behind it for a while, then
A fails, and B gets to go through, and there's no conflict.

Switch A and B as needed.

The sequence that is used establishes what is termed a "compatible
ordering".  There are multiple possible compatible orderings; ours
happens to interleave transactions together, with the sequence
guaranteeing absence of conflict.

If we could get commit orderings, then a different but still
compatible ordering would be to have each transaction establish its
own internal sequence, and apply things in order based on
(commit_tx_order, sequence_within).
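
A simplified sketch of that mechanism (hypothetical schema; Slony's
real catalog differs):

  CREATE SEQUENCE repl_log_seq;

  CREATE TABLE repl_log (
      log_seq    bigint PRIMARY KEY DEFAULT nextval('repl_log_seq'),
      origin_xid xid    NOT NULL,  -- transaction that made the change
      change_sql text   NOT NULL   -- row-level change, values as constants
  );

  -- Row triggers on replicated tables insert one repl_log row per
  -- modified tuple.  The sequence value is assigned while the origin
  -- transaction still holds the row lock, so changes that conflicted
  -- can only appear in the log in the order they actually resolved.

  -- The apply process replays everything between two SYNC boundaries
  -- in log_seq order (the placeholders stand for the two positions):
  SELECT change_sql
  FROM repl_log
  WHERE log_seq > :prev_sync_seq AND log_seq <= :this_sync_seq
  ORDER BY log_seq;
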
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-11 Thread Simon Riggs
On 11 October 2012 03:16, Greg Stark st...@mit.edu wrote:
 On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I think I've mentioned it before, but in the interest of not being
 seen to critique the bikeshed only after it's been painted: this
 design gives up something very important that exists in our current
 built-in replication solution, namely pipelining.

 Isn't there an even more serious problem, namely that this assumes
 *all* transactions are serializable?  What happens when they aren't?
 Or even just that the effective commit order is not XID order?

 Firstly, I haven't read the code but I'm confident it doesn't make the
 elementary error of assuming commit order == xid order. I assume it's
 applying the reassembled transactions in commit order.

 I don't think it assumes the transactions are serializable because
 it's only concerned with writes, not reads. So the transaction it's
 replaying may or may not have been able to view the data written by
 other transactions that committed earlier but it doesn't matter when
 trying to reproduce the effects using constants. The data written by
 this transaction is either written or not when the commit happens and
 it's all written or not at that time. Even in non-serializable mode
 updates take row locks and nobody can see the data or modify it until
 the transaction commits.

This uses Commit Serializability, which is valid, as you say.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-11 Thread Steve Singer

On 12-10-11 06:27 PM, Josh Berkus wrote:

On 10/10/12 7:26 PM, Bruce Momjian wrote:

How does Slony write its changes without causing serialization replay
conflicts?

Since nobody from the Slony team answered this:

a) Slony replicates *rows*, not *statements*


True, but the proposed logical replication also would replicate rows,
not the original statements.  I don't consider this proposal to be an
example of 'statement' replication like pgpool is.  If the original
SQL was 'update foo set x=x+1 where id > 10', there will be a WAL
record to decode for each row modified in the table. In a million-row
table I'd expect the replica will have to apply a million records
(whether they be binary heap tuples or SQL statements).
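
In other words, the decoded stream for that one statement looks
conceptually like this (format purely illustrative, not the patch's
actual output):

  UPDATE foo SET x = 5 WHERE id = 11;
  UPDATE foo SET x = 9 WHERE id = 12;
  UPDATE foo SET x = 2 WHERE id = 13;
  -- ... one change record per affected row, with the x = x + 1
  -- expression already folded to the concrete new value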

b) Slony uses serializable mode internally for row replication


Actually, recent versions of Slony apply transactions against the replica
in READ COMMITTED mode.  Older versions used SERIALIZABLE mode, but with
the SSI changes in 9.1 we found Slony tended to have serialization
conflicts with itself on the Slony-internal tables, resulting in a lot of
aborted transactions.


When Slony applies changes to a replica table, it does so in a single
transaction.  Slony finds the set of transactions that committed on the
master between two SYNC events.  It then applies all of the rows
changed by any of those transactions as part of a single transaction on
the replica. Chris's post explains this in more detail.


Conflicts with user transactions on the replica are possible.

c) it's possible (though difficult) for creative usage to get Slony into
a deadlock situation

FWIW, I have always assumed that it is impossible (even theoretically)
to have statement-based replication without some constraints on the
statements you can run, or some replication failures.  I think we should
expect 9.3's logical replication out-the-gate to have some issues and
impose constraints on users, and improve with time but never be perfect.

The design Andres and Simon have advanced already eliminates a lot of
the common failure cases (now(), random(), nextval()) suffered by pgPool
and similar tools.  But remember, this feature doesn't have to be
*perfect*, it just has to be *better* than the alternatives.







Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Peter Geoghegan
On 15 September 2012 01:39, Andres Freund and...@2ndquadrant.com wrote:
 (0008-Introduce-wal-decoding-via-catalog-timetravel.patch)

This patch is the 8th of 8 in a patch series that covers different
aspects of the bi-directional replication feature planned for
PostgreSQL 9.3. For those who are unfamiliar with the BDR project,
a useful summary is available in an e-mail sent to this list by Simon
Riggs back in April [1]. I should also point out that Andres has made
available a design document that discusses aspects of this patch in
particular, in another thread [2]. That document, "High level design
for logical replication in Postgres", emphasizes source data
generation in particular: generalising how PostgreSQL's WAL stream is
generated to represent the changes it describes logically, without
pointers to physical heap tuples and so forth (data generation as a
concept is described in detail in an earlier design document [3]).
This particular patch can be thought of as a response to the earlier
discussion [4] surrounding how to solve the problem of keeping system
catalogs consistent during WAL replay on followers: catalog time
travel is now used, rather than maintaining a synchronized catalog at
the decoding end. Andres' git tree (xlog-decoding-rebasing-cf2
branch) [5] provides additional useful comments in commit messages (he
rebases things such that each commit represents a distinct piece of
functionality/ patch for review).

This patch is not strictly speaking an atomic unit. It is necessary to
apply all 8 patches in order to get the code to compile. However, it
is approximately an atomic unit, representing a subdivision of the
entire BDR patch for which it is manageable and logical to write a
discrete review. This is idiomatic use of git-format-patch, but it is
unusual enough within our community for me to feel the need to note
these facts.

I briefly discussed this patch with Andres off-list. His feeling is
that the review process ought to focus on the design of WAL decoding,
including how it fits within the larger set of replication features
proposed for 9.3. There are a number of known omissions in this patch.
Andres has listed some of these above, and edge-cases and so on are
noted next to XXX and FIXME comments in the patch itself. I am
inclined to agree with Andres' view that we should attempt to solidify
community support for this prototype patch's design, or some variant,
before fixing the edge-cases and working towards committable code. I
will try my best to proceed on that basis.

What follows is an initial overview of the patch (or at least my
understanding of the patch, which you may need to correct), and some
points of concern.

 * applycache module which reassembles transactions from a stream of 
 interspersed changes


This is what the design doc [2] refers to as "4.5. TX reassembly".

This functionality is concentrated in applycache.c. As [2] notes, the
reassembly component was previously coined ApplyCache because it was
proposed to run on replication consumers just before applying changes.
This is not the case anymore. It is still called that way in the
source of the patch recently submitted.

The purpose of ApplyCache/transaction reassembly is to reassemble
interlaced records, and organise them by XID, so that the consumer
client code sees only streams (well, lists) of records split by XID.

I meant to avoid talking about anything other than the bigger picture
for now, but I must ask: Why the frequent use of malloc(),
particularly within applycache.c? The obvious answer is that it's
rough code and that that will change, but that still doesn't comport
with my idea about how rough Postgres code should look, so I have to
wonder if there's a better reason.

applycache.c has an acute paucity of comments, which makes it really
hard to review well. [2] doesn't have all that much to say about it
either. I'm going to not comment much on this here, except to say that
I think that the file should be renamed to reassembly.c or something
like that, to reflect its well-specified purpose, and not how it might
be used. Any cache really belongs in src/backend/utils/cache/ anyway.

Applycache is presumably where you're going to want to spill
transaction streams to disk, eventually. That seems like a
prerequisite to commit.

By the way, I see that you're doing this here:

+ /* most callers don't need snapshot.h */
+ typedef struct SnapshotData *Snapshot;

Tom, Alvaro and I had a discussion about whether or not this was an
acceptable way to reduce build dependencies back in July [8] – I lost
that one. You're relying on a GNU extension/C11 feature here (that is,
typedef redefinition). If you find it intensely irritating that you
cannot do this while following the standard to the letter, you are not
alone.

 * snapbuilder which builds catalog snapshots so that tuples from wal can be 
 understood


This component analyses xlog and builds a special kind of Snapshot.
This has been compared to the KnownAssignedXids machinery for Hot
Standby [6] (see SnapBuildEndTxn() et al to get an idea of what is
meant by this).

Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Bruce Momjian
On Thu, Oct 11, 2012 at 12:02:26AM +0100, Peter Geoghegan wrote:
 On 15 September 2012 01:39, Andres Freund and...@2ndquadrant.com wrote:
  (0008-Introduce-wal-decoding-via-catalog-timetravel.patch)
 
 This patch is the 8th of 8 in a patch series that covers different
 aspects of the bi-directional replication feature planned for
 PostgreSQL 9.3. For those that are unfamiliar with the BDR projection,
 a useful summary is available in an e-mail sent to this list by Simon
 Riggs back in April [1]. I should also point out that Andres has made

Does this design allow replication/communication between clusters running
different major versions of Postgres?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Josh Berkus

 Does this design allow replication/communcation between clusters running
 different major versions of Postgres?

Just catching up on your email, hmmm?

Yes, that's part of the design 2Q presented.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Bruce Momjian
On Thu, Oct 11, 2012 at 01:34:58AM +0200, anara...@anarazel.de wrote:
 
 
 Bruce Momjian br...@momjian.us schrieb:
 
 On Thu, Oct 11, 2012 at 12:02:26AM +0100, Peter Geoghegan wrote:
  On 15 September 2012 01:39, Andres Freund and...@2ndquadrant.com
 wrote:
   (0008-Introduce-wal-decoding-via-catalog-timetravel.patch)
  
  This patch is the 8th of 8 in a patch series that covers different
  aspects of the bi-directional replication feature planned for
  PostgreSQL 9.3. For those who are unfamiliar with the BDR project,
  a useful summary is available in an e-mail sent to this list by Simon
  Riggs back in April [1]. I should also point out that Andres has made
 
 Does this design allow replication/communication between clusters running
 different major versions of Postgres?
 This patchset contains only the decoding/changeset generation part of
 logical replication. It provides (as in the debugging example) the
 capability to generate a correct textual format and thus can be used to
 build a solution with support for cross-version/arch replication as long
 as the text format of the used types is compatible.
 
 Does that answer the question?

Yes.  This was posted so long ago I couldn't remember if that was part
of the design.
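
As a concrete instance of that text-format caveat, bytea output
switched from escape format to hex format in 9.0, so decoded text only
applies cleanly across that boundary if both ends agree on the format
(example values illustrative):

  SET bytea_output = 'escape';  -- pre-9.0 behaviour
  SELECT E'\\001'::bytea;       -- prints: \001
  SET bytea_output = 'hex';     -- default since 9.0
  SELECT E'\\001'::bytea;       -- prints: \x01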

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Robert Haas
On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 The purpose of ApplyCache/transaction reassembly is to reassemble
 interlaced records, and organise them by XID, so that the consumer
 client code sees only streams (well, lists) of records split by XID.

I think I've mentioned it before, but in the interest of not being
seen to critique the bikeshed only after it's been painted: this
design gives up something very important that exists in our current
built-in replication solution, namely pipelining.  With streaming
replication as it exists today, a transaction that modifies a huge
amount of data (such as a bulk load) can be applied on the standby as
it happens.  The rows thus inserted will become visible only if and
when the transaction commits on the master and the commit record is
replayed on the standby.  This has a number of important advantages,
perhaps most importantly that the lag between commit and data
visibility remains short.  With the proposed system, we can't start
applying the changes until the transaction has committed and the
commit record has been replayed, so a big transaction is going to have
a lot of apply latency.

Now, I am not 100% opposed to a design that surrenders this property
in exchange for other important benefits, but I think it would be
worth thinking about whether there is any way that we can design this
that either avoids giving that property up at all, or gives it up for
the time being but allows us to potentially get back to it in a later
version.  Reassembling complete transactions is surely cool and some
clients will want that, but being able to apply replicated
transactions *without* reassembling them in their entirety is even
cooler, and some clients will want that, too.

If we're going to stick with a design that reassembles transactions, I
think there are a number of issues that deserve careful thought.
First, memory usage.  I don't think it's acceptable for the decoding
process to assume that it can allocate enough backend-private memory
to store all of the in-flight changes (either as WAL or in some more
decoded form).  We have assiduously avoided such assumptions thus far;
you can write a terabyte of data in one transaction with just a
gigabyte of shared buffers if you so desire (and if you're patient).
Here's you making the same point in different words:

 Applycache is presumably where you're going to want to spill
 transaction streams to disk, eventually. That seems like a
 prerequisite to commit.
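
For a concrete sense of scale (hypothetical table big(id int, payload
text); numbers illustrative), a single transaction can easily dwarf
shared_buffers, since dirty pages are evicted and written out while it
runs:

  BEGIN;
  INSERT INTO big
  SELECT g, repeat('x', 1000) FROM generate_series(1, 100000000) g;
  -- ~100 GB of row data in one still-uncommitted transaction
  COMMIT;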

Second, crash recovery.  I think whatever we put in place here has to
be able to survive a crash on any node.  Decoding must be able to
restart successfully after a system crash, and it has to be able to
apply exactly the set of transactions that were committed but not
applied prior to the crash.  Maybe an appropriate mechanism for this
already exists or has been discussed, but I haven't seen it go by;
sorry if I have missed the boat.

 You consider this to be a throw-away function that won't ever be
 committed. However, I strongly feel that you should move it into
 /contrib, so that it can serve as a sort of reference implementation
 for authors of decoder client code, in the same spirit as numerous
 existing contrib modules (think contrib/spi).

Without prejudice to the rest of this review which looks quite
well-considered, I'd like to add a particular +1 to this point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Bruce Momjian
On Wed, Oct 10, 2012 at 09:10:48PM -0400, Robert Haas wrote:
  You consider this to be a throw-away function that won't ever be
  committed. However, I strongly feel that you should move it into
  /contrib, so that it can serve as a sort of reference implementation
  for authors of decoder client code, in the same spirit as numerous
  existing contrib modules (think contrib/spi).
 
 Without prejudice to the rest of this review which looks quite
 well-considered, I'd like to add a particular +1 to this point.

The review was _HUGE_!  :-O

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan pe...@2ndquadrant.com 
 wrote:
 The purpose of ApplyCache/transaction reassembly is to reassemble
 interlaced records, and organise them by XID, so that the consumer
 client code sees only streams (well, lists) of records split by XID.

 I think I've mentioned it before, but in the interest of not being
 seen to critique the bikeshed only after it's been painted: this
 design gives up something very important that exists in our current
 built-in replication solution, namely pipelining.

Isn't there an even more serious problem, namely that this assumes
*all* transactions are serializable?  What happens when they aren't?
Or even just that the effective commit order is not XID order?

regards, tom lane




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Greg Stark
On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I think I've mentioned it before, but in the interest of not being
 seen to critique the bikeshed only after it's been painted: this
 design gives up something very important that exists in our current
 built-in replication solution, namely pipelining.

 Isn't there an even more serious problem, namely that this assumes
 *all* transactions are serializable?  What happens when they aren't?
 Or even just that the effective commit order is not XID order?

Firstly, I haven't read the code but I'm confident it doesn't make the
elementary error of assuming commit order == xid order. I assume it's
applying the reassembled transactions in commit order.

I don't think it assumes the transactions are serializable because
it's only concerned with writes, not reads. So the transaction it's
replaying may or may not have been able to view the data written by
other transactions that committed earlier but it doesn't matter when
trying to reproduce the effects using constants. The data written by
this transaction is either written or not when the commit happens and
it's all written or not at that time. Even in non-serializable mode
updates take row locks and nobody can see the data or modify it until
the transaction commits.
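
That row-lock behaviour is easy to demonstrate directly (hypothetical
table t(id int primary key, v int), plain READ COMMITTED):

  -- session 1:
  BEGIN;
  UPDATE t SET v = 1 WHERE id = 1;  -- acquires the row lock

  -- session 2, concurrently:
  UPDATE t SET v = 2 WHERE id = 1;  -- blocks until session 1 commits
                                    -- or aborts; until then nobody
                                    -- sees session 1's uncommitted v = 1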

I have to say I was curious about Robert's point as well when I read
Peter's review. Especially because this is exactly how other logical
replication systems I've seen work too and I've always wondered about
it in those systems. Both MySQL and Oracle reassemble transactions and
don't write anything until they have the whole transaction
reassembled.  This always struck me as a bizarre and obviously
bad thing to do, though.  It seems to me it would be better to create
sessions (or autonomous transactions) for each transaction seen in the
stream and issue the DML as it shows up, committing and cleaning each
up when the commit or abort (or shutdown or startup) record comes
along.

I imagine the reason lies with dealing with locking and ensuring that
you get the correct results without deadlocks when multiple
transactions try to update the same record. But it seems to me that
the original locks the source database took should protect you against
any problems. As long as you can suspend a transaction when it takes a
lock that blocks and keep processing WAL for other transactions (or an
abort for that transaction if that happened due to a deadlock or user
interruption) you should be fine.

-- 
greg




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Andres Freund
Hi,

First off: awesome review.

On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote:
 What follows is an initial overview of the patch (or at least my
 understanding of the patch, which you may need to correct), and some
 points of concern.
 
  * applycache module which reassembles transactions from a stream of
  interspersed changes
 
 This is what the design doc [2] refers to as 4.5. TX reassembly.
 
 This functionality is concentrated in applycache.c. As [2] notes, the
 reassembly component was previously coined ApplyCache because it was
 proposed to run on replication consumers just before applying changes.
 This is not the case anymore. It is still called that way in the
 source of the patch recently submitted.
 
 The purpose of ApplyCache/transaction reassembly is to reassemble
 interlaced records, and organise them by XID, so that the consumer
 client code sees only streams (well, lists) of records split by XID.

 I meant to avoid talking about anything other than the bigger picture
 for now, but I must ask: Why the frequent use of malloc(),
 particularly within applycache.c? The obvious answer is that it's
 rough code and that that will change, but that still doesn't comport
 with my idea about how rough Postgres code should look, so I have to
 wonder if there's a better reason.
Several reasons, not sure how good:
- part of the code was supposed to be runnable on a target system without a
full postgres backend around
- all of the allocations would basically have to be in TopMemoryContext or
something equally long-lived, as we don't have a transaction context or
anything like that
- the first revision showed that allocating memory was the primary bottleneck,
so I added a small allocation cache to the critical pieces, which solved that.
After that there seems to be no point in using memory contexts for that kind
of memory anymore.

 applycache.c has an acute paucity of comments, which makes it really
 hard to review well.
Working on that. I don't think its internals are really all that interesting 
atm.

 I'm going to not comment much on this here, except to say that
 I think that the file should be renamed to reassembly.c or something
 like that, to reflect its well-specified purpose, and not how it might
 be used.
Agreed.

 Applycache is presumably where you're going to want to spill
 transaction streams to disk, eventually. That seems like a
 prerequisite to commit.
Yes.

 By the way, I see that you're doing this here:
 
 + /* most callers don't need snapshot.h */
 + typedef struct SnapshotData *Snapshot;
 
 Tom, Alvaro and I had a discussion about whether or not this was an
 acceptable way to reduce build dependencies back in July [8] – I lost
 that one. You're relying on a GNU extension/C11 feature here (that is,
 typedef redefinition). If you find it intensely irritating that you
 cannot do this while following the standard to the letter, you are not
 alone.
Yuck :(. So I will just use struct SnapshotData directly instead of a 
typedef...

  * snapbuilder which builds catalog snapshots so that tuples from wal can
  be understood
 
 This component analyses xlog and builds a special kind of Snapshot.
 This has been compared to the KnownAssignedXids machinery for Hot
 Standby [6] (see SnapBuildEndTxn() et al to get an idea of what is
 meant by this). Since decoding only has to occur within a single
 backend, I guess it's sufficient that it's all within local memory (in
 contrast to the KnownAssignedXids array, which is in shared memory).
I haven't found a convincing argument to share that information itself. If we 
want to parallelise this I think sharing the snapshots should be sufficient. 
Given the snapshot only covers system catalogs, the change rate should be fine.

 The design document [2] really just explains the problem (which is the
 need for catalog metadata at a point in time to make sense of heap
 tuples), without describing the solution that this patch offers with
 any degree of detail. Rather, [2] says "How we build snapshots is
 somewhat intricate and complicated and seems to be out of scope for
 this document", which is unsatisfactory. I look forward to reading the
 promised document that describes this mechanism in more detail. It's
 really hard to guess what you might have meant to do, and why you
 might have done it, much less verify the code's correctness.
Will concentrate on finishing that document.

 This functionality is concentrated in snapbuild.c. A comment in decode.c
 notes:
 
 + *Its possible that the separation between decode.c and snapbuild.c is
 a + *bit too strict, in the end they just about have the same struct.
 
 I prefer the current separation. I think it's reasonable that decode.c
 is sort of minimal glue code.
Good.

 Decoding means breaking up individual XLogRecord structs, and storing
 them in an applycache (applycache does this, and stores them as
 ApplyCacheChange records), while building a snapshot (which is needed
 in advance of adding tuples 

Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Bruce Momjian
On Thu, Oct 11, 2012 at 03:16:39AM +0100, Greg Stark wrote:
 On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  I think I've mentioned it before, but in the interest of not being
  seen to critique the bikeshed only after it's been painted: this
  design gives up something very important that exists in our current
  built-in replication solution, namely pipelining.
 
  Isn't there an even more serious problem, namely that this assumes
  *all* transactions are serializable?  What happens when they aren't?
  Or even just that the effective commit order is not XID order?
 
 Firstly, I haven't read the code but I'm confident it doesn't make the
 elementary error of assuming commit order == xid order. I assume it's
 applying the reassembled transactions in commit order.
 
 I don't think it assumes the transactions are serializable because
 it's only concerned with writes, not reads. So the transaction it's
 replaying may or may not have been able to view the data written by
  other transactions that committed earlier but it doesn't matter when
 trying to reproduce the effects using constants. The data written by
 this transaction is either written or not when the commit happens and
 it's all written or not at that time. Even in non-serializable mode
 updates take row locks and nobody can see the data or modify it until
 the transaction commits.

How does Slony write its changes without causing serialization replay
conflicts?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Isn't there an even more serious problem, namely that this assumes
 *all* transactions are serializable?  What happens when they aren't?
 Or even just that the effective commit order is not XID order?

 I don't think it assumes the transactions are serializable because
 it's only concerned with writes, not reads. So the transaction it's
 replaying may or may not have been able to view the data written by
 other transactions that committed earlier but it doesn't matter when
 trying to reproduce the effects using constants.

I would believe that argument if the apply operations were at a
similar logical level to our current WAL records, namely "drop these
bits into that spot".  Unfortunately, they're not.  I think this argument
falls to the ground entirely as soon as you think about DDL being
applied by transactions A,B,C and then needing to express what
concurrent transactions X,Y,Z did in source terms.  Even something as
simple as a few column renames could break it, let alone anything as
esoteric as changing the meaning of datatype literals.
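
A minimal illustration of the rename hazard (hypothetical table, and
assuming changes are decoded against the current catalog rather than
with catalog time travel):

  -- Transaction X updates a row and commits:
  UPDATE t SET old_name = 1 WHERE id = 1;

  -- Transaction A then renames the column and commits:
  ALTER TABLE t RENAME COLUMN old_name TO new_name;

  -- A decoder that reaches X's WAL records only after A has committed,
  -- and that consults the current catalog, resolves X's change against
  -- a column definition X never saw.  Reading the catalog as it stood
  -- when X wrote the tuple is what the time-travel machinery is for.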

regards, tom lane




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Andres Freund
On Thursday, October 11, 2012 03:10:48 AM Robert Haas wrote:
 On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan pe...@2ndquadrant.com 
wrote:
  The purpose of ApplyCache/transaction reassembly is to reassemble
  interlaced records, and organise them by XID, so that the consumer
  client code sees only streams (well, lists) of records split by XID.
 
 I think I've mentioned it before, but in the interest of not being
 seen to critique the bikeshed only after it's been painted: this
 design gives up something very important that exists in our current
 built-in replication solution, namely pipelining.  With streaming
 replication as it exists today, a transaction that modifies a huge
 amount of data (such as a bulk load) can be applied on the standby as
 it happens.  The rows thus inserted will become visible only if and
 when the transaction commits on the master and the commit record is
 replayed on the standby.  This has a number of important advantages,
 perhaps most importantly that the lag between commit and data
 visibility remains short.  With the proposed system, we can't start
 applying the changes until the transaction has committed and the
 commit record has been replayed, so a big transaction is going to have
 a lot of apply latency.
I don't think there is a fundamental problem here, just incremental ones.

The major problems are:
* transactions with DDL & DML currently need to be reassembled; it might be
possible to resolve this though, I haven't thought about it too much
* subtransactions are only assigned to toplevel transactions at commit time
* you need a variable number of backends/parallel transactions open at the
target system to apply all the transactions concurrently. You can't smash them
together because one of them might roll back.

All of those seem solvable to me, so I am not too worried about adding a
streaming mode somewhere down the line. I don't want to focus on it right now
though. Ok?

 Here's you making the same point in different words:
  Applycache is presumably where you're going to want to spill
  transaction streams to disk, eventually. That seems like a
  prerequisite to commit.
 
 Second, crash recovery.  I think whatever we put in place here has to
 be able to survive a crash on any node.  Decoding must be able to
 restart successfully after a system crash, and it has to be able to
 apply exactly the set of transactions that were committed but not
 applied prior to the crash.  Maybe an appropriate mechanism for this
 already exists or has been discussed, but I haven't seen it go by;
 sorry if I have missed the boat.
I have discussed it privately & roughly prototyped it, but not publicly. There
are two pieces to this:
1) being restartable after a crash/disconnection/shutdown
2) picking up exactly where it stopped

Those are somewhat different because 1) is relevant on the source side and can
be solved there. 2) depends on the target system because it needs to ensure
that it safely received the changes up to some point.

The idea for 1) is to serialize the applycache whenever we reach a checkpoint 
and have that as a starting point for every confirmed flush location of 2).

Obviously 2) will need cooperation by the receiving side.

  You consider this to be a throw-away function that won't ever be
  committed. However, I strongly feel that you should move it into
  /contrib, so that it can serve as a sort of reference implementation
  for authors of decoder client code, in the same spirit as numerous
  existing contrib modules (think contrib/spi).
 
 Without prejudice to the rest of this review which looks quite
 well-considered, I'd like to add a particular +1 to this point.
So we're in violent agreement here ;)

Andres
-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Andres Freund
On Thursday, October 11, 2012 04:16:39 AM Greg Stark wrote:
 On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  I think I've mentioned it before, but in the interest of not being
  seen to critique the bikeshed only after it's been painted: this
  design gives up something very important that exists in our current
  built-in replication solution, namely pipelining.
  
  Isn't there an even more serious problem, namely that this assumes
  *all* transactions are serializable?  What happens when they aren't?
  Or even just that the effective commit order is not XID order?
 
 Firstly, I haven't read the code but I'm confident it doesn't make the
 elementary error of assuming commit order == xid order. I assume it's
 applying the reassembled transactions in commit order.
Yes, it's commit order.

IMO replaying in commit order is more like assuming all transactions are run
in READ COMMITTED and not above, rather than assuming they are serializable?
Or am I missing something?

 I don't think it assumes the transactions are serializable because
 it's only concerned with writes, not reads. So the transaction it's
 replaying may or may not have been able to view the data written by
 other transactions that committed earlier but it doesn't matter when
 trying to reproduce the effects using constants. The data written by
 this transaction is either written or not when the commit happens and
 it's all written or not at that time. Even in non-serializable mode
 updates take row locks and nobody can see the data or modify it until
 the transaction commits.
Yes. There will be problems if you want to make serializable work across
nodes, but that seems like something fiendishly complex anyway. I don't plan to
work on it in the foreseeable future.

Greetings,

Andres
-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Andres Freund
On Thursday, October 11, 2012 04:31:21 AM Tom Lane wrote:
 Greg Stark st...@mit.edu writes:
  On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Isn't there an even more serious problem, namely that this assumes
  *all* transactions are serializable?  What happens when they aren't?
  Or even just that the effective commit order is not XID order?
  
  I don't think it assumes the transactions are serializable because
  it's only concerned with writes, not reads. So the transaction it's
  replaying may or may not have been able to view the data written by
  other transactions that committed earlier but it doesn't matter when
  trying to reproduce the effects using constants.
 
 I would believe that argument if the apply operations were at a
 similar logical level to our current WAL records, namely drop these bits
 into that spot.  Unfortunately, they're not.  I think this argument
 falls to the ground entirely as soon as you think about DDL being
 applied by transactions A,B,C and then needing to express what
 concurrent transactions X,Y,Z did in source terms.  Even something as
 simple as a few column renames could break it,
Not sure what you're getting at here? Are you talking about the problems on
the source side or the target side?

During decoding such problems should be handled already. As we reconstruct a
Snapshot that makes catalog access look like it did back when the tuple was
written into the WAL, we have the exact column names, data types and
everything. The locking used when making the original changes prevents the
data types and column names from being changed mid-transaction.

If you're talking about the receiving/apply side: sure, if you do careless DDL
and you don't replicate DDL (not included here, separate project), you're going
to have problems. I don't think there is much we can do about that.

 let alone anything as esoteric as changing the meaning of datatype literals.
Hm. You mean like changing the input format of a datatype? Yes, sure, that 
will cause havoc.

Greetings,

Andres
-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel

2012-10-10 Thread Andres Freund
On Thursday, October 11, 2012 04:49:20 AM Andres Freund wrote:
 On Thursday, October 11, 2012 04:31:21 AM Tom Lane wrote:
  Greg Stark st...@mit.edu writes:
   On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:
   Isn't there an even more serious problem, namely that this assumes
   *all* transactions are serializable?  What happens when they aren't?
   Or even just that the effective commit order is not XID order?
  
   
  
   I don't think it assumes the transactions are serializable because
   it's only concerned with writes, not reads. So the transaction it's
   replaying may or may not have been able to view the data written by
   other transactions that committed earlier but it doesn't matter when
   trying to reproduce the effects using constants.
 
  
 
  I would believe that argument if the apply operations were at a
  similar logical level to our current WAL records, namely drop these bits
  into that spot.  Unfortunately, they're not.  I think this argument
  falls to the ground entirely as soon as you think about DDL being
  applied by transactions A,B,C and then needing to express what
  concurrent transactions X,Y,Z did in source terms.  Even something as
  simple as a few column renames could break it,
 
 Not sure what you're getting at here? Are you talking about the problems on
 the source side or the target side?
 
 During decoding such problems should be handled already
Btw, the introductory email upthread shows a trivial example. As submitted
the code cannot handle intermingled DDL/DML transactions, but I fixed that now.

There are some problems with CLUSTER/VACUUM FULL on system catalogs, but
that's going to be a separate thread...

Andres
-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

