Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
Simon Riggs wrote: Greg Stark st...@mit.edu wrote: Tom Lane t...@sss.pgh.pa.us wrote: Isn't there an even more serious problem, namely that this assumes *all* transactions are serializable? Do you mean in terms of the serializable transaction isolation level, or something else? I haven't read the patches, but I've been trying to follow the discussion and I don't recall any hint of basing this on serializable transactions on each source. Of course, when it comes down to commits, both where a change is committed and where the work is copied, there must be a commit order; and with asynchronous work where data isn't partitioned such that there is a single clear owner for each partition there will be conflicts which must be resolved. I don't get the impression that this point has been lost on Simon and Andres. What happens when they aren't? Or even just that the effective commit order is not XID order? Firstly, I haven't read the code but I'm confident it doesn't make the elementary error of assuming commit order == xid order. I assume it's applying the reassembled transactions in commit order. Same here. I don't think it assumes the transactions are serializable because it's only concerned with writes, not reads. So the transaction it's replaying may or may not have been able to view the data written by other transactions that committed earlier but it doesn't matter when trying to reproduce the effects using constants. IIRC, the developers of this feature have explicitly said that they will defer any consideration of trying to extend serializable transaction isolation behavior to a multi-server basis until after they have other things working. (Frankly, to do otherwise would not be sane.) It appears to me that it can't be managed in a general sense without destroying almost all the advantages of multi-master replication, at least (as I said before) where data isn't partitioned such that there is a single clear owner for each partition. 
Where such partitioning is present and there are data sets maintained exclusively by serializable transactions, anomaly-free reads of the data could be accomplished by committing transactions on the replicas in apparent order of execution rather than commit order. Apparent order of execution must take both commit order and read-write dependencies into consideration. The data written by this transaction is either written or not when the commit happens and it's all written or not at that time. Even in non-serializable mode updates take row locks and nobody can see the data or modify it until the transaction commits. As with read-only transactions and hot standbys, the problem comes in when a transaction commits and is replicated while a transaction remains uncommitted which is basing its updates on the earlier state of the data. It gets even more exciting with MMR since the transaction working with the old version of the data might be on a different machine, on another continent. With certain types of workloads, it seems to me that it could get pretty crazy if certain closely-related actions are not kept within a single database (like controlling the active batch and adding items to a batch). In the wild, half-baked, hand-wavey suggestions department -- maybe there should be some consideration of a long-term way within MMR to direct activities to certain logical nodes, each of which could be mapped to a single physical node at any one time. Basically, to route a request through the MMR network to the current logical node for handling something, and have the effects ripple back out through all nodes. This uses Commit Serializability, which is valid, as you say. Well, it is a type of concurrency control with a name and a definition, if that's what you mean. I agree it provides sufficient guarantees to create a workable MMR system, if you have adequate conflict resolution. 
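To make the point above concrete, here is a toy model (all names invented for illustration, not taken from the patch) of why replaying only the *writes* of each reassembled transaction, whole-transaction-at-a-time in commit order, reproduces the origin's final state even when the source transactions were not serializable:

```python
# Toy model of commit-order replay. Because replication ships writes, not
# reads, the replica only needs to apply each transaction's write set in
# the order the transactions committed on the origin; what each transaction
# could *see* while running is irrelevant to the final state.
origin = {}
replica = {}

# Writes captured per transaction, plus the order in which they committed.
tx_writes = {
    "T1": [("a", 1), ("b", 1)],
    "T2": [("b", 2)],          # T2 overwrote b after T1 committed
}
commit_order = ["T1", "T2"]

# What happened on the master.
for tx in commit_order:
    for key, value in tx_writes[tx]:
        origin[key] = value

# Replay whole transactions on the replica, strictly in commit order.
for tx in commit_order:
    for key, value in tx_writes[tx]:
        replica[key] = value
```

The final states match because row locks on the origin already serialized the conflicting writes; commit order is sufficient for the write-only view.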
My concern is that it not be confused with serializability in the mathematical sense or in the sense of transaction isolation levels. In general on this thread, when I've seen the terms serializable and serializability I haven't been clear on whether the words are being used in their more general sense as words in the English language, or in a more particular technical sense. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
The design Andres and Simon have advanced already eliminates a lot of the common failure cases (now(), random(), nextval()) suffered by pgPool and similar tools. But remember, this feature doesn't have to be Well, pgpool-II already solved the now() case by using a query rewriting technique. The technique could be applied to random() as well but I'm not convinced it is worth the trouble. nextval() would be a little harder because pgpool needs assistance from the PostgreSQL core. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On 10/11/2012 04:31 AM, Tom Lane wrote: Greg Stark st...@mit.edu writes: On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: Isn't there an even more serious problem, namely that this assumes *all* transactions are serializable? What happens when they aren't? Or even just that the effective commit order is not XID order? I don't think it assumes the transactions are serializable because it's only concerned with writes, not reads. So the transaction it's replaying may or may not have been able to view the data written by other transactions that committed earlier but it doesn't matter when trying to reproduce the effects using constants. I would believe that argument if the apply operations were at a similar logical level to our current WAL records, namely drop these bits into that spot. Unfortunately, they're not. I think this argument falls to the ground entirely as soon as you think about DDL being applied by transactions A,B,C and then needing to express what concurrent transactions X,Y,Z did in source terms. Even something as simple as a few column renames could break it, let alone anything as esoteric as changing the meaning of datatype literals. This is the whole reason for moving the reassembly to the source side and having the possibility to use old snapshots to get the catalog information. Also, the locks that protect you from the effects of field name changes by concurrent DDL transactions also protect the logical reassembly if done in commit order. regards, tom lane
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On 10/11/2012 03:10 AM, Robert Haas wrote: On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: The purpose of ApplyCache/transaction reassembly is to reassemble interlaced records, and organise them by XID, so that the consumer client code sees only streams (well, lists) of records split by XID. I think I've mentioned it before, but in the interest of not being seen to critique the bikeshed only after it's been painted: this design gives up something very important that exists in our current built-in replication solution, namely pipelining. The lack of pipelining (and the ensuing complexity of applycache and spilling to disk) is something we have discussed with Andres and to my understanding it is not a final design decision but just a stepping stone in how this quite large development is structured. The pipelining (or parallel apply as I described it) requires either a large number of apply backends and code to manage them, or autonomous transactions. It could (arguably!) be easier to implement autonomous transactions instead of apply cache, but Andres had valid reasons to start with apply cache and move to parallel apply later. As I understand it, parallel apply is definitely one of the things that will be coming, and after that the performance characteristics (fast AND smooth) will be very similar to current physical WAL streaming. With streaming replication as it exists today, a transaction that modifies a huge amount of data (such as a bulk load) can be applied on the standby as it happens. The rows thus inserted will become visible only if and when the transaction commits on the master and the commit record is replayed on the standby. This has a number of important advantages, perhaps most importantly that the lag between commit and data visibility remains short. 
With the proposed system, we can't start applying the changes until the transaction has committed and the commit record has been replayed, so a big transaction is going to have a lot of apply latency. Now, I am not 100% opposed to a design that surrenders this property in exchange for other important benefits, but I think it would be worth thinking about whether there is any way that we can design this that either avoids giving that property up at all, or gives it up for the time being but allows us to potentially get back to it in a later version. Reassembling complete transactions is surely cool and some clients will want that, but being able to apply replicated transactions *without* reassembling them in their entirety is even cooler, and some clients will want that, too. If we're going to stick with a design that reassembles transactions, I think there are a number of issues that deserve careful thought. First, memory usage. I don't think it's acceptable for the decoding process to assume that it can allocate enough backend-private memory to store all of the in-flight changes (either as WAL or in some more decoded form). We have assiduously avoided such assumptions thus far; you can write a terabyte of data in one transaction with just a gigabyte of shared buffers if you so desire (and if you're patient). Here's you making the same point in different words: Applycache is presumably where you're going to want to spill transaction streams to disk, eventually. That seems like a prerequisite to commit. Second, crash recovery. I think whatever we put in place here has to be able to survive a crash on any node. Decoding must be able to restart successfully after a system crash, and it has to be able to apply exactly the set of transactions that were committed but not applied prior to the crash. Maybe an appropriate mechanism for this already exists or has been discussed, but I haven't seen it go by; sorry if I have missed the boat. 
You consider this to be a throw-away function that won't ever be committed. However, I strongly feel that you should move it into /contrib, so that it can serve as a sort of reference implementation for authors of decoder client code, in the same spirit as numerous existing contrib modules (think contrib/spi). Without prejudice to the rest of this review which looks quite well-considered, I'd like to add a particular +1 to this point.
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On 10/10/12 7:26 PM, Bruce Momjian wrote: How does Slony write its changes without causing serialization replay conflicts? Since nobody from the Slony team answered this: a) Slony replicates *rows*, not *statements* b) Slony uses serializable mode internally for row replication c) it's possible (though difficult) for creative usage to get Slony into a deadlock situation FWIW, I have always assumed that it is impossible (even theoretically) to have statement-based replication without some constraints on the statements you can run, or some replication failures. I think we should expect 9.3's logical replication out-the-gate to have some issues and impose constraints on users, and improve with time but never be perfect. The design Andres and Simon have advanced already eliminates a lot of the common failure cases (now(), random(), nextval()) suffered by pgPool and similar tools. But remember, this feature doesn't have to be *perfect*, it just has to be *better* than the alternatives. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On Wed, Oct 10, 2012 at 10:26 PM, Bruce Momjian br...@momjian.us wrote: How does Slony write its changes without causing serialization replay conflicts? It uses a sequence to break any ordering conflicts at the time that data is inserted into its log tables. If there are two transactions, A and B, that were fighting over a tuple on the origin, then either: a) A went first, B went second, and the ordering in the log makes that order clear; b) A succeeds, then B fails, so there's no conflict; c) A is doing its thing, and B is blocked behind it for a while, then A fails, and B gets to go through, and there's no conflict. Switch A and B as needed. The sequence that is used establishes what is termed a compatible ordering. There are multiple possible compatible orderings; ours happen to interleave transactions together, with the sequence guaranteeing absence of conflict. If we could get commit orderings, then a different but still compatible ordering would be to have each transaction establish its own internal sequence, and apply things in order based on (commit_tx_order, sequence_within). -- When confronted by a difficult problem, solve it by reducing it to the question, How would the Lone Ranger handle this?
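The compatible-ordering scheme described above can be sketched as follows (an illustrative model, not Slony's actual schema or code): the log sequence number is drawn while the origin's row lock is still held, so two conflicting changes to the same tuple can never be captured out of order, and replaying strictly in sequence order reproduces the origin's final state:

```python
# Toy model of Slony-style log capture: a single global sequence assigned
# at capture time (while the row lock is held) yields a compatible
# ordering, even though the transactions themselves interleave.
import itertools

seq = itertools.count(1)
log = []  # (seq_no, xid, row, value) in capture order

def capture(xid, row, value):
    """Record a row change; the sequence ticks under the row lock."""
    log.append((next(seq), xid, row, value))

# A (xid 1) and B (xid 2) interleave; B's write to row 2 was blocked
# behind A's row lock, so it is necessarily captured later than A's.
capture(1, 2, "A:r2")
capture(2, 3, "B:r3")
capture(2, 2, "B:r2")   # B overwrites row 2 after A released its lock

# Replay on the replica strictly in sequence order.
replica = {}
for _, _, row, value in sorted(log):
    replica[row] = value
```

Interleaving transactions this way is what distinguishes Slony's ordering from the commit-order replay discussed elsewhere in this thread; both are compatible orderings.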
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On 11 October 2012 03:16, Greg Stark st...@mit.edu wrote: On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: I think I've mentioned it before, but in the interest of not being seen to critique the bikeshed only after it's been painted: this design gives up something very important that exists in our current built-in replication solution, namely pipelining. Isn't there an even more serious problem, namely that this assumes *all* transactions are serializable? What happens when they aren't? Or even just that the effective commit order is not XID order? Firstly, I haven't read the code but I'm confident it doesn't make the elementary error of assuming commit order == xid order. I assume it's applying the reassembled transactions in commit order. I don't think it assumes the transactions are serializable because it's only concerned with writes, not reads. So the transaction it's replaying may or may not have been able to view the data written by other transactions that committed earlier but it doesn't matter when trying to reproduce the effects using constants. The data written by this transaction is either written or not when the commit happens and it's all written or not at that time. Even in non-serializable mode updates take row locks and nobody can see the data or modify it until the transaction commits. This uses Commit Serializability, which is valid, as you say. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On 12-10-11 06:27 PM, Josh Berkus wrote: On 10/10/12 7:26 PM, Bruce Momjian wrote: How does Slony write its changes without causing serialization replay conflicts? Since nobody from the Slony team answered this: a) Slony replicates *rows*, not *statements* True, but the proposed logical replication also would replicate rows, not the original statements. I don't consider this proposal to be an example of 'statement' replication like pgpool is. If the original SQL was 'update foo set x=x+1 where id < 10'; there will be a WAL record to decode for each row modified in the table. In a million row table I'd expect the replica will have to apply a million records (whether they be binary heap tuples or SQL statements). b) Slony uses serializable mode internally for row replication Actually, recent versions of slony apply transactions against the replica in read committed mode. Older versions used serializable mode, but with the SSI changes in 9.1 we found slony tended to have serialization conflicts with itself on the slony internal tables, resulting in a lot of aborted transactions. When slony applies changes on a replica table it does so in a single transaction. Slony finds a set of transactions that committed on the master in between two SYNC events. It then applies all of the rows changed by any of those transactions as part of a single transaction on the replica. Chris's post explains this in more detail. Conflicts with user transactions on the replica are possible. c) it's possible (though difficult) for creative usage to get Slony into a deadlock situation FWIW, I have always assumed that it is impossible (even theoretically) to have statement-based replication without some constraints on the statements you can run, or some replication failures. I think we should expect 9.3's logical replication out-the-gate to have some issues and impose constraints on users, and improve with time but never be perfect. 
The design Andres and Simon have advanced already eliminates a lot of the common failure cases (now(), random(), nextval()) suffered by pgPool and similar tools. But remember, this feature doesn't have to be *perfect*, it just has to be *better* than the alternatives.
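The point above, that row replication decodes one change per affected tuple rather than shipping the statement, can be sketched like this (a toy model with invented names, not the patch's decoding API):

```python
# Toy illustration: a single statement-level UPDATE on the origin decodes
# to one row-level change per affected tuple, so a replica applies N row
# images rather than re-executing one statement.
rows = {i: {"id": i, "x": 0} for i in range(1, 6)}

def decode_update(table, predicate, transform):
    """Return the per-row changes a statement-level UPDATE generates."""
    changes = []
    for row in table.values():
        if predicate(row):
            new = dict(row)   # new tuple image; original left untouched
            transform(new)
            changes.append(("UPDATE", row["id"], new))
    return changes

# "UPDATE foo SET x = x + 1 WHERE id < 4" becomes three row changes:
changes = decode_update(rows, lambda r: r["id"] < 4,
                        lambda r: r.__setitem__("x", r["x"] + 1))
```

For a million-row table the same statement would decode to a million such changes, which is exactly the apply-volume concern raised above.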
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On 15 September 2012 01:39, Andres Freund and...@2ndquadrant.com wrote: (0008-Introduce-wal-decoding-via-catalog-timetravel.patch) This patch is the 8th of 8 in a patch series that covers different aspects of the bi-directional replication feature planned for PostgreSQL 9.3. For those that are unfamiliar with the BDR project, a useful summary is available in an e-mail sent to this list by Simon Riggs back in April [1]. I should also point out that Andres has made available a design document that discusses aspects of this patch in particular, in another thread [2]. That document, High level design for logical replication in Postgres, emphasizes source data generation in particular: generalising how PostgreSQL's WAL stream is generated to represent the changes it describes logically, without pointers to physical heap tuples and so forth (data generation as a concept is described in detail in an earlier design document [3]). This particular patch can be thought of as a response to the earlier discussion [4] surrounding how to solve the problem of keeping system catalogs consistent during WAL replay on followers: catalog time travel is now used, rather than maintaining a synchronized catalog at the decoding end. Andres' git tree (xlog-decoding-rebasing-cf2 branch) [5] provides additional useful comments in commit messages (he rebases things such that each commit represents a distinct piece of functionality/ patch for review). This patch is not strictly speaking an atomic unit. It is necessary to apply all 8 patches in order to get the code to compile. However, it is approximately an atomic unit, representing a subdivision of the entire BDR patch for which it is manageable and logical to write a discrete review. This is idiomatic use of git-format-patch, but it is unusual enough within our community for me to feel the need to note these facts. I briefly discussed this patch with Andres off-list. 
His feeling is that the review process ought to focus on the design of WAL decoding, including how it fits within the larger set of replication features proposed for 9.3. There are a number of known omissions in this patch. Andres has listed some of these above, and edge-cases and so on are noted next to XXX and FIXME comments in the patch itself. I am inclined to agree with Andres' view that we should attempt to solidify community support for this prototype patch's design, or some variant, before fixing the edge-cases and working towards committable code. I will try my best to proceed on that basis. What follows is an initial overview of the patch (or at least my understanding of the patch, which you may need to correct), and some points of concern. * applycache module which reassembles transactions from a stream of interspersed changes This is what the design doc [2] refers to as 4.5. TX reassembly. This functionality is concentrated in applycache.c. As [2] notes, the reassembly component was previously coined ApplyCache because it was proposed to run on replication consumers just before applying changes. This is not the case anymore. It is still called that way in the source of the patch recently submitted. The purpose of ApplyCache/transaction reassembly is to reassemble interlaced records, and organise them by XID, so that the consumer client code sees only streams (well, lists) of records split by XID. I meant to avoid talking about anything other than the bigger picture for now, but I must ask: Why the frequent use of malloc(), particularly within applycache.c? The obvious answer is that it's rough code and that that will change, but that still doesn't comport with my idea about how rough Postgres code should look, so I have to wonder if there's a better reason. applycache.c has an acute paucity of comments, which makes it really hard to review well. [2] doesn't have all that much to say about it either. 
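The ApplyCache/reassembly behaviour described above can be sketched roughly as follows (a simplified model with invented names, not the patch's actual API): changes arrive interleaved across XIDs, are buffered per transaction, and are handed to the consumer as one ordered list only when the commit record for that XID is decoded:

```python
# Simplified model of transaction reassembly: interleaved per-XID changes
# from the WAL stream are buffered per transaction and emitted as a single
# ordered stream at commit, so consumers see whole transactions, split by
# XID, in commit order.
from collections import defaultdict

class ApplyCache:
    def __init__(self):
        self.in_progress = defaultdict(list)  # xid -> buffered changes

    def add_change(self, xid, change):
        self.in_progress[xid].append(change)

    def commit(self, xid):
        """Return the reassembled change stream for xid, in original order."""
        return self.in_progress.pop(xid, [])

    def abort(self, xid):
        self.in_progress.pop(xid, None)  # aborted work is simply discarded

cache = ApplyCache()
# Interleaved WAL: two concurrent transactions; 100 commits before 101.
cache.add_change(100, "INSERT a")
cache.add_change(101, "INSERT b")
cache.add_change(100, "UPDATE a")
tx100 = cache.commit(100)
tx101 = cache.commit(101)
```

This is also where the pipelining objection bites: nothing is emitted for xid 100 until its commit record arrives, however large the transaction.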
I'm going to not comment much on this here, except to say that I think that the file should be renamed to reassembly.c or something like that, to reflect its well-specified purpose, and not how it might be used. Any cache really belongs in src/backend/utils/cache/ anyway. Applycache is presumably where you're going to want to spill transaction streams to disk, eventually. That seems like a prerequisite to commit. By the way, I see that you're doing this here: + /* most callers don't need snapshot.h */ + typedef struct SnapshotData *Snapshot; Tom, Alvaro and I had a discussion about whether or not this was an acceptable way to reduce build dependencies back in July [8] – I lost that one. You're relying on a GNU extension/C11 feature here (that is, typedef redefinition). If you find it intensely irritating that you cannot do this while following the standard to the letter, you are not alone. * snapbuilder which builds catalog snapshots so that tuples from wal can be understood This component analyses xlog and builds a special kind of Snapshot. This has been compared to the
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On Thu, Oct 11, 2012 at 12:02:26AM +0100, Peter Geoghegan wrote: On 15 September 2012 01:39, Andres Freund and...@2ndquadrant.com wrote: (0008-Introduce-wal-decoding-via-catalog-timetravel.patch) This patch is the 8th of 8 in a patch series that covers different aspects of the bi-directional replication feature planned for PostgreSQL 9.3. For those that are unfamiliar with the BDR project, a useful summary is available in an e-mail sent to this list by Simon Riggs back in April [1]. I should also point out that Andres has made Does this design allow replication/communication between clusters running different major versions of Postgres? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
Does this design allow replication/communication between clusters running different major versions of Postgres? Just catching up on your email, hmmm? Yes, that's part of the design 2Q presented. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On Thu, Oct 11, 2012 at 01:34:58AM +0200, anara...@anarazel.de wrote: Bruce Momjian br...@momjian.us schrieb: On Thu, Oct 11, 2012 at 12:02:26AM +0100, Peter Geoghegan wrote: On 15 September 2012 01:39, Andres Freund and...@2ndquadrant.com wrote: (0008-Introduce-wal-decoding-via-catalog-timetravel.patch) This patch is the 8th of 8 in a patch series that covers different aspects of the bi-directional replication feature planned for PostgreSQL 9.3. For those that are unfamiliar with the BDR project, a useful summary is available in an e-mail sent to this list by Simon Riggs back in April [1]. I should also point out that Andres has made Does this design allow replication/communication between clusters running different major versions of Postgres? This patchset contains only the decoding/changeset generation part of logical replication. It provides (as in the debugging example) the capabilities to generate a correct textual format and thus can be used to build a solution with support for cross version/arch replication as long as the text format of the used types is compatible. Does that answer the question? Yes. This was posted so long ago I couldn't remember if that was part of the design. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: The purpose of ApplyCache/transaction reassembly is to reassemble interlaced records, and organise them by XID, so that the consumer client code sees only streams (well, lists) of records split by XID. I think I've mentioned it before, but in the interest of not being seen to critique the bikeshed only after it's been painted: this design gives up something very important that exists in our current built-in replication solution, namely pipelining. With streaming replication as it exists today, a transaction that modifies a huge amount of data (such as a bulk load) can be applied on the standby as it happens. The rows thus inserted will become visible only if and when the transaction commits on the master and the commit record is replayed on the standby. This has a number of important advantages, perhaps most importantly that the lag between commit and data visibility remains short. With the proposed system, we can't start applying the changes until the transaction has committed and the commit record has been replayed, so a big transaction is going to have a lot of apply latency. Now, I am not 100% opposed to a design that surrenders this property in exchange for other important benefits, but I think it would be worth thinking about whether there is any way that we can design this that either avoids giving that property up at all, or gives it up for the time being but allows us to potentially get back to it in a later version. Reassembling complete transactions is surely cool and some clients will want that, but being able to apply replicated transactions *without* reassembling them in their entirety is even cooler, and some clients will want that, too. If we're going to stick with a design that reassembles transactions, I think there are a number of issues that deserve careful thought. First, memory usage. 
I don't think it's acceptable for the decoding process to assume that it can allocate enough backend-private memory to store all of the in-flight changes (either as WAL or in some more decoded form). We have assiduously avoided such assumptions thus far; you can write a terabyte of data in one transaction with just a gigabyte of shared buffers if you so desire (and if you're patient). Here's you making the same point in different words: Applycache is presumably where you're going to want to spill transaction streams to disk, eventually. That seems like a prerequisite to commit. Second, crash recovery. I think whatever we put in place here has to be able to survive a crash on any node. Decoding must be able to restart successfully after a system crash, and it has to be able to apply exactly the set of transactions that were committed but not applied prior to the crash. Maybe an appropriate mechanism for this already exists or has been discussed, but I haven't seen it go by; sorry if I have missed the boat. You consider this to be a throw-away function that won't ever be committed. However, I strongly feel that you should move it into /contrib, so that it can serve as a sort of reference implementation for authors of decoder client code, in the same spirit as numerous existing contrib modules (think contrib/spi). Without prejudice to the rest of this review which looks quite well-considered, I'd like to add a particular +1 to this point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
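A hand-wavey sketch of the spill-to-disk idea raised above (every name here is invented for illustration; a real applycache would also need crash-safe files, ordering across multiple spills per transaction, and more): once a transaction's buffered changes exceed a memory budget, they are written to a per-XID temp file and streamed back in order at commit, so decoding never has to hold an entire bulk-load transaction in backend-private memory.

```python
# Toy bounded-memory change buffer: spill to a temp file past a budget,
# replay spilled changes first (they are older), then in-memory ones.
import tempfile

MEMORY_BUDGET = 4  # changes kept in memory per transaction (tiny, for demo)

class SpillableTx:
    def __init__(self, xid):
        self.xid = xid
        self.mem = []
        self.spill_file = None

    def add(self, change):
        self.mem.append(change)
        if len(self.mem) > MEMORY_BUDGET:
            if self.spill_file is None:
                self.spill_file = tempfile.TemporaryFile(mode="w+")
            for c in self.mem:           # append batch to the spill file
                self.spill_file.write(c + "\n")
            self.mem.clear()

    def replay(self):
        """Yield all changes in original order: spilled first, then in-memory."""
        if self.spill_file is not None:
            self.spill_file.seek(0)
            for line in self.spill_file:
                yield line.rstrip("\n")
        yield from self.mem

tx = SpillableTx(xid=42)
for i in range(10):
    tx.add(f"change-{i}")
changes = list(tx.replay())
```

With a terabyte-scale transaction the budget, not the transaction size, bounds memory use, which is the property Robert is asking for.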
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On Wed, Oct 10, 2012 at 09:10:48PM -0400, Robert Haas wrote: You consider this to be a throw-away function that won't ever be committed. However, I strongly feel that you should move it into /contrib, so that it can serve as a sort of reference implementation for authors of decoder client code, in the same spirit as numerous existing contrib modules (think contrib/spi). Without prejudice to the rest of this review which looks quite well-considered, I'd like to add a particular +1 to this point. The review was _HUGE_! :-O -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
Robert Haas robertmh...@gmail.com writes: On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: The purpose of ApplyCache/transaction reassembly is to reassemble interlaced records, and organise them by XID, so that the consumer client code sees only streams (well, lists) of records split by XID. I think I've mentioned it before, but in the interest of not being seen to critique the bikeshed only after it's been painted: this design gives up something very important that exists in our current built-in replication solution, namely pipelining. Isn't there an even more serious problem, namely that this assumes *all* transactions are serializable? What happens when they aren't? Or even just that the effective commit order is not XID order? regards, tom lane
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: I think I've mentioned it before, but in the interest of not being seen to critique the bikeshed only after it's been painted: this design gives up something very important that exists in our current built-in replication solution, namely pipelining. Isn't there an even more serious problem, namely that this assumes *all* transactions are serializable? What happens when they aren't? Or even just that the effective commit order is not XID order? Firstly, I haven't read the code but I'm confident it doesn't make the elementary error of assuming commit order == xid order. I assume it's applying the reassembled transactions in commit order. I don't think it assumes the transactions are serializable because it's only concerned with writes, not reads. So the transaction it's replaying may or may not have been able to view the data written by other transactions that committed earlier but it doesn't matter when trying to reproduce the effects using constants. The data written by this transaction is either written or not when the commit happens and it's all written or not at that time. Even in non-serializable mode updates take row locks and nobody can see the data or modify it until the transaction commits. I have to say I was curious about Robert's point as well when I read Peter's review. Especially because this is exactly how other logical replication systems I've seen work too and I've always wondered about it in those systems. Both MySQL and Oracle reassemble transactions and don't write anything until they have the whole transaction reassembled. This always struck me as a bizarre and obviously bad thing to do, though. It seems to me it would be better to create sessions (or autonomous transactions) for each transaction seen in the stream and issue the DML as it shows up, committing and cleaning each up when the commit or abort (or shutdown or startup) record comes along.
I imagine the reason lies with dealing with locking and ensuring that you get the correct results without deadlocks when multiple transactions try to update the same record. But it seems to me that the original locks the source database took should protect you against any problems. As long as you can suspend a transaction when it takes a lock that blocks and keep processing WAL for other transactions (or an abort for that transaction if that happened due to a deadlock or user interruption) you should be fine. -- greg
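Greg's eager-apply scheme can be sketched as a dispatcher that lazily opens one apply "session" per upstream XID, routes each decoded change to it as it arrives, and finalizes the session when the commit or abort record shows up. The sessions below are stand-ins that merely count statements; a real implementation would hold an actual database connection per XID and would need the blocking/suspension handling Greg describes. All names here are made up for illustration:

```c
#include <assert.h>
#include <string.h>

#define MAX_SESSIONS 8

typedef struct Session {
    unsigned xid;       /* upstream transaction this session mirrors */
    int      nstmts;    /* changes issued so far */
    int      committed; /* 1 once the commit record was applied */
} Session;

static Session sessions[MAX_SESSIONS];
static int     nsessions;

/* Find the session mirroring an upstream XID, starting one lazily. */
static Session *session_for_xid(unsigned xid)
{
    for (int i = 0; i < nsessions; i++)
        if (sessions[i].xid == xid)
            return &sessions[i];
    Session *s = &sessions[nsessions++];
    s->xid = xid;
    s->nstmts = 0;
    s->committed = 0;
    return s;
}

/* Route one decoded WAL record to its transaction's session. */
void handle_record(unsigned xid, const char *kind)
{
    Session *s = session_for_xid(xid);
    if (strcmp(kind, "commit") == 0)
        s->committed = 1;            /* issue COMMIT on that session */
    else if (strcmp(kind, "abort") == 0)
        s->nstmts = 0;               /* issue ROLLBACK: discard its work */
    else
        s->nstmts++;                 /* issue the DML immediately */
}
```

The point of the sketch: changes from interleaved transactions never need to be reassembled first, because each session sees only its own transaction's statements in WAL order.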
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
Hi, First off: awesome review. On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote: What follows is an initial overview of the patch (or at least my understanding of the patch, which you may need to correct), and some points of concern. * applycache module which reassembles transactions from a stream of interspersed changes This is what the design doc [2] refers to as 4.5. TX reassembly. This functionality is concentrated in applycache.c. As [2] notes, the reassembly component was previously coined ApplyCache because it was proposed to run on replication consumers just before applying changes. This is not the case anymore. It is still called that way in the source of the recently submitted patch. The purpose of ApplyCache/transaction reassembly is to reassemble interlaced records, and organise them by XID, so that the consumer client code sees only streams (well, lists) of records split by XID. I meant to avoid talking about anything other than the bigger picture for now, but I must ask: Why the frequent use of malloc(), particularly within applycache.c? The obvious answer is that it's rough code and that that will change, but that still doesn't comport with my idea about how rough Postgres code should look, so I have to wonder if there's a better reason. Several reasons, not sure how good:
- part of the code was supposed to be runnable on a target system without a full postgres backend around
- all of the allocations would basically have to be in TopMemoryContext or something equally long-lived, as we don't have a transaction context or anything like that
- the first revision showed that allocating memory was the primary bottleneck, so I added a small allocation cache to the critical pieces, which solved that. After that there seems to be no point in using memory contexts for that kind of memory anymore.
applycache.c has an acute paucity of comments, which makes it really hard to review well. Working on that.
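The allocation cache Andres mentions can be as simple as a free list of fixed-size change structs, so that in the steady state memory is recycled rather than going through malloc()/free() for every change. This is an illustrative sketch, not the patch's actual code; the struct and counter names are invented:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct CachedChange {
    struct CachedChange *next;  /* free-list link while cached */
    int   kind;
    char  payload[64];
} CachedChange;

static CachedChange *freelist;
static long cache_hits, cache_misses;

/* Pop a recycled change struct if one is available, else allocate. */
CachedChange *change_alloc(void)
{
    if (freelist) {
        CachedChange *c = freelist;
        freelist = c->next;
        cache_hits++;
        return c;
    }
    cache_misses++;                  /* fall back to the allocator */
    return calloc(1, sizeof(CachedChange));
}

/* Push the struct back for reuse instead of calling free(). */
void change_free(CachedChange *c)
{
    c->next = freelist;
    freelist = c;
}
```

Once the working set of in-flight changes has been allocated, every subsequent allocation is a pointer pop, which is why this removed the bottleneck the first revision showed.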
I don't think its internals are really all that interesting atm. I'm not going to comment much on this here, except to say that I think that the file should be renamed to reassembly.c or something like that, to reflect its well-specified purpose, and not how it might be used. Agreed. Applycache is presumably where you're going to want to spill transaction streams to disk, eventually. That seems like a prerequisite to commit. Yes. By the way, I see that you're doing this here: + /* most callers don't need snapshot.h */ + typedef struct SnapshotData *Snapshot; Tom, Alvaro and I had a discussion about whether or not this was an acceptable way to reduce build dependencies back in July [8] – I lost that one. You're relying on a GNU extension/C11 feature here (that is, typedef redefinition). If you find it intensely irritating that you cannot do this while following the standard to the letter, you are not alone. Yuck :(. So I will just use struct SnapshotData directly instead of a typedef... * snapbuilder which builds catalog snapshots so that tuples from WAL can be understood This component analyses xlog and builds a special kind of Snapshot. This has been compared to the KnownAssignedXids machinery for Hot Standby [6] (see SnapBuildEndTxn() et al to get an idea of what is meant by this). Since decoding only has to occur within a single backend, I guess it's sufficient that it's all within local memory (in contrast to the KnownAssignedXids array, which is in shared memory). I haven't found a convincing argument to share that information itself. If we want to parallelise this I think sharing the snapshots should be sufficient. Given the snapshot only covers system catalogs, the change rate should be fine. The design document [2] really just explains the problem (which is the need for catalog metadata at a point in time to make sense of heap tuples), without describing the solution that this patch offers with any degree of detail.
Rather, [2] says How we build snapshots is somewhat intricate and complicated and seems to be out of scope for this document, which is unsatisfactory. I look forward to reading the promised document that describes this mechanism in more detail. It's really hard to guess what you might have meant to do, and why you might have done it, much less verify the code's correctness. Will concentrate on finishing that document. This functionality is concentrated in snapbuild.c. A comment in decode.c notes: + *Its possible that the separation between decode.c and snapbuild.c is a + *bit too strict, in the end they just about have the same struct. I prefer the current separation. I think it's reasonable that decode.c is sort of minimal glue code. Good. Decoding means breaking up individual XLogRecord structs, and storing them in an applycache (applycache does this, and stores them as ApplyCacheChange records), while building a snapshot (which is needed in advance of adding tuples
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On Thu, Oct 11, 2012 at 03:16:39AM +0100, Greg Stark wrote: On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: I think I've mentioned it before, but in the interest of not being seen to critique the bikeshed only after it's been painted: this design gives up something very important that exists in our current built-in replication solution, namely pipelining. Isn't there an even more serious problem, namely that this assumes *all* transactions are serializable? What happens when they aren't? Or even just that the effective commit order is not XID order? Firstly, I haven't read the code but I'm confident it doesn't make the elementary error of assuming commit order == xid order. I assume it's applying the reassembled transactions in commit order. I don't think it assumes the transactions are serializable because it's only concerned with writes, not reads. So the transaction it's replaying may or may not have been able to view the data written by other transactions that committed earlier but it doesn't matter when trying to reproduce the effects using constants. The data written by this transaction is either written or not when the commit happens and it's all written or not at that time. Even in non-serializable mode updates take row locks and nobody can see the data or modify it until the transaction commits. How does Slony write its changes without causing serialization replay conflicts? -- Bruce Momjian
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
Greg Stark st...@mit.edu writes: On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: Isn't there an even more serious problem, namely that this assumes *all* transactions are serializable? What happens when they aren't? Or even just that the effective commit order is not XID order? I don't think it assumes the transactions are serializable because it's only concerned with writes, not reads. So the transaction it's replaying may or may not have been able to view the data written by other transactions that committed earlier but it doesn't matter when trying to reproduce the effects using constants. I would believe that argument if the apply operations were at a similar logical level to our current WAL records, namely drop these bits into that spot. Unfortunately, they're not. I think this argument falls to the ground entirely as soon as you think about DDL being applied by transactions A,B,C and then needing to express what concurrent transactions X,Y,Z did in source terms. Even something as simple as a few column renames could break it, let alone anything as esoteric as changing the meaning of datatype literals. regards, tom lane
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On Thursday, October 11, 2012 03:10:48 AM Robert Haas wrote: On Wed, Oct 10, 2012 at 7:02 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: The purpose of ApplyCache/transaction reassembly is to reassemble interlaced records, and organise them by XID, so that the consumer client code sees only streams (well, lists) of records split by XID. I think I've mentioned it before, but in the interest of not being seen to critique the bikeshed only after it's been painted: this design gives up something very important that exists in our current built-in replication solution, namely pipelining. With streaming replication as it exists today, a transaction that modifies a huge amount of data (such as a bulk load) can be applied on the standby as it happens. The rows thus inserted will become visible only if and when the transaction commits on the master and the commit record is replayed on the standby. This has a number of important advantages, perhaps most importantly that the lag between commit and data visibility remains short. With the proposed system, we can't start applying the changes until the transaction has committed and the commit record has been replayed, so a big transaction is going to have a lot of apply latency. I don't think there is a fundamental problem here, just incremental ones. The major problems are:
* transactions containing both DDL and DML currently need to be reassembled; it might be possible to resolve this though, I haven't thought about it too much
* subtransactions are only assigned to their toplevel transaction at commit time
* you need a variable number of backends/parallel transactions open on the target system to apply all the transactions concurrently. You can't smash them together because one of them might roll back.
All of those seem solvable to me, so I am not too worried about the addition of a streaming mode somewhere down the line. I don't want to focus on it right now though. Ok?
Here's you making the same point in different words: Applycache is presumably where you're going to want to spill transaction streams to disk, eventually. That seems like a prerequisite to commit. Second, crash recovery. I think whatever we put in place here has to be able to survive a crash on any node. Decoding must be able to restart successfully after a system crash, and it has to be able to apply exactly the set of transactions that were committed but not applied prior to the crash. Maybe an appropriate mechanism for this already exists or has been discussed, but I haven't seen it go by; sorry if I have missed the boat. I have discussed it privately and roughly prototyped it, but not publicly. There are two pieces to this:
1) being restartable after a crash/disconnection/shutdown
2) picking up exactly where it stopped
Those are somewhat different because 1) is relevant on the source side and can be solved there. 2) depends on the target system because it needs to ensure that it safely received the changes up to some point. The idea for 1) is to serialize the applycache whenever we reach a checkpoint and have that as a starting point for every confirmed flush location of 2). Obviously 2) will need cooperation from the receiving side. You consider this to be a throw-away function that won't ever be committed. However, I strongly feel that you should move it into /contrib, so that it can serve as a sort of reference implementation for authors of decoder client code, in the same spirit as numerous existing contrib modules (think contrib/spi). Without prejudice to the rest of this review which looks quite well-considered, I'd like to add a particular +1 to this point. So we're in violent agreement here ;) Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
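The interplay of the two pieces Andres describes can be sketched as a tiny restart-point computation: given the LSNs at which applycache state was serialized at checkpoints (piece 1) and the receiver's confirmed flush location (piece 2), restart decoding from the newest serialized state at or before that confirmation. This is a hypothetical sketch of the idea only; the real mechanism was merely prototyped at this point, and LSNs are modeled as plain integers:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Return the newest serialized-state LSN that is <= confirmed_flush,
 * or 0 if none qualifies (i.e., decoding must restart from the start). */
XLogRecPtr restart_point(const XLogRecPtr *serialized, int n,
                         XLogRecPtr confirmed_flush)
{
    XLogRecPtr best = 0;
    for (int i = 0; i < n; i++)
        if (serialized[i] <= confirmed_flush && serialized[i] > best)
            best = serialized[i];
    return best;
}
```

Restarting from that point re-decodes at most the WAL between the chosen checkpoint and the crash, and the confirmed-flush bound guarantees the receiver never sees a transaction it already durably applied skipped or duplicated.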
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On Thursday, October 11, 2012 04:16:39 AM Greg Stark wrote: On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: I think I've mentioned it before, but in the interest of not being seen to critique the bikeshed only after it's been painted: this design gives up something very important that exists in our current built-in replication solution, namely pipelining. Isn't there an even more serious problem, namely that this assumes *all* transactions are serializable? What happens when they aren't? Or even just that the effective commit order is not XID order? Firstly, I haven't read the code but I'm confident it doesn't make the elementary error of assuming commit order == xid order. I assume it's applying the reassembled transactions in commit order. Yes, it's commit order. IMO applying in commit order is more like assuming all transactions are done in read committed (and not above) than assuming serializable? Or am I missing something? I don't think it assumes the transactions are serializable because it's only concerned with writes, not reads. So the transaction it's replaying may or may not have been able to view the data written by other transactions that committed earlier but it doesn't matter when trying to reproduce the effects using constants. The data written by this transaction is either written or not when the commit happens and it's all written or not at that time. Even in non-serializable mode updates take row locks and nobody can see the data or modify it until the transaction commits. Yes. There will be problems if you want to make serializable work across nodes, but that seems like something fiendishly complex anyway. I don't plan to work on it in the foreseeable future. Greetings, Andres
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On Thursday, October 11, 2012 04:31:21 AM Tom Lane wrote: Greg Stark st...@mit.edu writes: On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: Isn't there an even more serious problem, namely that this assumes *all* transactions are serializable? What happens when they aren't? Or even just that the effective commit order is not XID order? I don't think it assumes the transactions are serializable because it's only concerned with writes, not reads. So the transaction it's replaying may or may not have been able to view the data written by other transactions that committed earlier but it doesn't matter when trying to reproduce the effects using constants. I would believe that argument if the apply operations were at a similar logical level to our current WAL records, namely drop these bits into that spot. Unfortunately, they're not. I think this argument falls to the ground entirely as soon as you think about DDL being applied by transactions A,B,C and then needing to express what concurrent transactions X,Y,Z did in source terms. Even something as simple as a few column renames could break it, Not sure what you're getting at here? Are you talking about the problems at the source side or the target side? During decoding such problems should be handled already. As we reconstruct a Snapshot that lets catalog access look like it did back when the tuple was thrown into WAL, we have the exact column names, data types and everything. The locking used when making the original changes prevents the data types and column names from being changed mid-transaction. If you're talking about the receiving/apply side: Sure, if you do careless DDL and you don't replicate DDL (not included here, separate project), you're going to have problems. I don't think there is much we can do about that. let alone anything as esoteric as changing the meaning of datatype literals. Hm. You mean like changing the input format of a datatype? Yes, sure, that will cause havoc.
Greetings, Andres
Re: [HACKERS] [PATCH 8/8] Introduce wal decoding via catalog timetravel
On Thursday, October 11, 2012 04:49:20 AM Andres Freund wrote: On Thursday, October 11, 2012 04:31:21 AM Tom Lane wrote: Greg Stark st...@mit.edu writes: On Thu, Oct 11, 2012 at 2:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: Isn't there an even more serious problem, namely that this assumes all transactions are serializable? What happens when they aren't? Or even just that the effective commit order is not XID order? I don't think it assumes the transactions are serializable because it's only concerned with writes, not reads. So the transaction it's replaying may or may not have been able to view the data written by other transactions that committed earlier but it doesn't matter when trying to reproduce the effects using constants. I would believe that argument if the apply operations were at a similar logical level to our current WAL records, namely drop these bits into that spot. Unfortunately, they're not. I think this argument falls to the ground entirely as soon as you think about DDL being applied by transactions A,B,C and then needing to express what concurrent transactions X,Y,Z did in source terms. Even something as simple as a few column renames could break it, Not sure what you're getting at here? Are you talking about the problems at the source side or the target side? During decoding such problems should be handled already Btw, the introductory email upthread shows a trivial example. As submitted the code cannot handle intermingled DDL/DML transactions, but I fixed that now. There are some problems with CLUSTER/VACUUM FULL on system catalogs, but that's going to be a separate thread... Andres