Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 11:26, Tom Lane t...@sss.pgh.pa.us wrote:
> Kevin Grittner kevin.gritt...@wicourts.gov writes:
>> Simon Riggs si...@2ndquadrant.com wrote:
>>> The proposal is to use WAL to generate the logical change stream.
>>> That has been shown in testing to be around x4 faster than having a
>>> separate change stream, which must also be WAL logged (as Jan noted).
>>
>> Sure, that's why I want it.
>
> I think this argument is basically circular. The reason it's 4x faster
> is that the WAL stream doesn't actually contain all the information
> needed to generate LCRs (thus all the angst about maintaining catalogs
> in sync, what to do about unfriendly datatypes, etc). By the time the
> dust has settled and you have a workable system, you will have bloated
> WAL and given back a large chunk of that multiple, thereby invalidating
> the design premise. Or at least that's my prediction.

The tests were conducted with the additional field added, so your prediction is not verified. The additional fields do not bloat WAL records - they take up exactly the same space as before.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 01:27, Kevin Grittner wrote:
> Andres Freund and...@2ndquadrant.com wrote:
>> Yes, that's definitely a valid use-case. But that doesn't preclude the
>> other - also not uncommon - use-case where you want to have different
>> masters which all contain up2date data.
>
> I agree. I was just saying that while one requires an origin_id, the
> other doesn't. And those not doing MM replication definitely don't
> need it.

I think it would be helpful to list a few concrete examples of this. The stereotypical multi-master scenario is that you have a single table that's replicated to two servers, and you can insert/update/delete on either server. Conflict resolution strategies vary.

The reason we need an origin id in this scenario is that otherwise this will happen:

1. A row is updated on node A.
2. Node B receives the WAL record from A, and updates the corresponding row in B. This generates a new WAL record.
3. Node A receives the WAL record from B, and updates the row again. This again generates a new WAL record, which is replicated back to B, and you loop indefinitely.

If each WAL record carries an origin id, node A can use it to refrain from applying the WAL record it receives from B, which breaks the loop.

However, note that in this simple scenario, if the logical log replay / conflict resolution is smart enough to recognize that the row has already been updated, because the old and the new rows are identical, the loop is broken at step 3 even without the origin id. That works for the newest-update-wins and similar strategies. So the origin id is not absolutely necessary in this case.

Another interesting scenario is that you maintain a global counter, like in an inventory system, and conflicts are resolved by accumulating the updates. For example, if you do "UPDATE SET counter = counter + 1" simultaneously on two nodes, the result is that the counter is incremented by two. The avoid-update-if-already-identical optimization doesn't help in this case; the origin id is necessary.

Now, let's take the inventory system example further. There are actually two ways to update a counter. One is when an item is checked in or out of the warehouse, ie. "UPDATE counter = counter + 1". Those updates should accumulate. But another operation resets the counter to a specific value, "UPDATE counter = 10", like when taking an inventory. That should not accumulate with other changes, but should be newest-update-wins. The origin id is not enough for that, because by looking at the WAL record and the origin id, you don't know which type of an update it was.

So, I don't like the idea of adding the origin id to the record header. It's only required on some occasions, and on some record types. And I'm worried it might not even be enough in more complicated scenarios.

Perhaps we need a more generic WAL record annotation system, where a plugin can tack arbitrary information onto WAL records. The extra information could be stored in the WAL record after the rmgr payload, similar to how backup blocks are stored. WAL replay could just ignore the annotations, but a replication system could use it to store the origin id or whatever extra information it needs.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
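To make the origin-id check concrete, a minimal apply-side sketch of the filter described above (the LCR struct and field names are illustrative, not from the patch):

#include <stdbool.h>
#include <stdint.h>

typedef struct LCR
{
    uint16_t    origin_id;      /* node the change originated on */
    /* ... change payload ... */
} LCR;

static bool
should_apply(const LCR *lcr, uint16_t local_node_id)
{
    /*
     * Skip changes that originated here: they have already been applied
     * locally, so re-applying what a peer sends back would start the
     * loop described in step 3.
     */
    return lcr->origin_id != local_node_id;
}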
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 08:35, Robert Haas robertmh...@gmail.com wrote:
>> I expect it would be fine to have a tool that pulls LCRs out of WAL to
>> prepare that to be sent to remote locations. Is that what you have in
>> mind?
>
> Yes. I think it should be possible to generate LCRs from WAL, but I
> think that the on-the-wire format for LCRs should be different from the
> WAL format. Trying to use the same format for both things seems like an
> unpleasant straitjacket.

You're confusing things here. Yes, we can use a different on-the-wire format. No problem.

As I've already said, the information needs to be in WAL first before we can put it into LCRs, so the "don't use the same format" argument is not relevant to why the info must be on the WAL record in the first place.

And as said elsewhere, doing that does not cause problems in any of the areas claimed so far on this thread: WAL bloat, performance, or long-term restrictions on the number of node ids.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 14:40, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> The reason we need an origin id in this scenario is that otherwise this
> will happen:
> 1. A row is updated on node A.
> 2. Node B receives the WAL record from A, and updates the corresponding
> row in B. This generates a new WAL record.
> 3. Node A receives the WAL record from B, and updates the row again.
> This again generates a new WAL record, which is replicated back to B,
> and you loop indefinitely.
>
> If each WAL record carries an origin id, node A can use it to refrain
> from applying the WAL record it receives from B, which breaks the loop.

Including the origin id in the WAL allows us to filter out WAL records when we generate LCRs, so we can completely avoid step 3, including all of the CPU, disk and network overhead that implies. Simply put, we know the change came from A, so there is no need to send the change to A again.

> However, note that in this simple scenario, if the logical log replay /
> conflict resolution is smart enough to recognize that the row has
> already been updated [...] the loop is broken at step 3 even without
> the origin id. [...] So the origin id is not absolutely necessary in
> this case.

If we do allow step 3 to exist, we still need to send the origin id. This is because there is no a priori way to know the origin id is not required, such as the case where we have concurrent updates, which is effectively a race condition between actions on separate nodes. Not sending the origin id because it is not required in some cases is equivalent to saying we can skip locking because a race condition does not happen in all cases. Making a case that the race condition is rare is still not a case for skipping locking. Same here: we need the information to avoid making errors in the general case.

> Another interesting scenario is that you maintain a global counter,
> like in an inventory system, and conflicts are resolved by accumulating
> the updates. [...] The origin id is not enough for that, because by
> looking at the WAL record and the origin id, you don't know which type
> of an update it was.

Yes, of course. Conflict handling in the general case requires much additional work. This thread is one minor change of many related changes. The patches are being submitted in smaller chunks to ease review, so sometimes there are cross-links between things.

> So, I don't like the idea of adding the origin id to the record header.
> It's only required on some occasions, and on some record types.

That conclusion doesn't follow from your stated arguments.

> And I'm worried it might not even be enough in more complicated
> scenarios.

It is not the only required conflict mechanism, and has never been claimed to be so. It is simply one piece of information needed, at various times.

> Perhaps we need a more generic WAL record annotation system, where a
> plugin can tack arbitrary information onto WAL records. [...]

Additional information required for logical replication will be handled by a new wal_level. The discussion here is about adding origin_node_id *only*, which needs to be added to each WAL record.

One question I raised in my review was whether this extra information should be added via a variable-length header, so I already asked this very question. So far there is no evidence that the additional code complexity would be warranted. If it became so in the future, it can be modified again. At this stage there is no need, so the proposal is to add the field to every WAL record without regard to the setting of wal_level, because there is no measurable downside to doing so. The downsides of additional complexity are clear and real, however, so I wish to avoid them.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
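For concreteness, the proposed static addition amounts to something like the following (a sketch only: the actual field name, width and placement belong to the patch; a 16-bit node id is what is mentioned elsewhere in the thread):

#include <stdint.h>

/*
 * Sketch of the fixed-field approach: one small id stamped into every
 * WAL record header, regardless of wal_level.
 */
typedef struct XLogRecordSketch
{
    /* ... existing fields: total length, xid, CRC, resource manager id ... */
    uint16_t    xl_origin_id;   /* node on which the change originated */
} XLogRecordSketch;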
Re: [HACKERS] pgbench--new transaction type
On 01.06.2012 03:02, Jeff Janes wrote:
> I've attached a new patch which addresses several of your concerns, and
> adds the documentation.
>
> The description is much longer than the descriptions of other nearby
> options, which mostly just give a simple statement of what they do
> rather than a description of why that is useful. I don't know if that
> means I'm starting a good trend, or a bad one, or I'm just putting the
> exposition in the wrong place.
>
> In addition to showing the benefits of coding things on the server side
> when that is applicable, it also allows hackers to stress parts of the
> server code that are not easy to stress otherwise.

As you mentioned in your original email over a year ago, most of this could be done as a custom script. It's nice to have another workload to test, but then again, there's an infinite number of workloads that might be interesting. You can achieve the same with this custom script:

\set loops 512
do $$ DECLARE sum integer default 0; amount integer; account_id integer; BEGIN FOR i IN 1..:loops LOOP account_id = 1 + floor(random() * :scale); SELECT abalance into strict amount FROM pgbench_accounts WHERE aid = account_id; sum := sum + amount; END LOOP; END; $$;

It's a bit awkward because it has to be all on one line, and you don't get the auto-detection of scale. But those are really the only differences AFAICS. I think we would be better off improving pgbench to support those things in custom scripts. It would be nice to be able to write initialization steps that only run once in the beginning of the test. You could then put the SELECT COUNT(*) FROM pgbench_branches there, to do the scale auto-detection.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] pgbench--new transaction type
On 20 June 2012 15:32, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> As you mentioned in your original email over a year ago, most of this
> could be done as a custom script. [...] I think we would be better off
> improving pgbench to support those things in custom scripts. It would
> be nice to be able to write initialization steps that only run once in
> the beginning of the test. You could then put the SELECT COUNT(*) FROM
> pgbench_branches there, to do the scale auto-detection.

I'm sure Jeff submitted this because of the need for a standard test, rather than the wish to actually modify pgbench itself.

Can I suggest that we include a list of standard scripts with pgbench for this purpose? These can then be copied alongside the binary when we do an install. That way this patch can become a standard script, plus an entry in the docs to list this. We could even include scripts for the usual cases, to allow them to be more easily viewed/copied/modified.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 10:32, Simon Riggs wrote:
> On 20 June 2012 14:40, Heikki Linnakangas wrote:
>> And I'm worried it might not even be enough in more complicated
>> scenarios.
>
> It is not the only required conflict mechanism, and has never been
> claimed to be so. It is simply one piece of information needed, at
> various times.

So, if the origin id is not sufficient for some conflict resolution mechanisms, what extra information do you need for those, and where do you put it?

>> Perhaps we need a more generic WAL record annotation system, where a
>> plugin can tack arbitrary information onto WAL records. [...]
>
> Additional information required for logical replication will be handled
> by a new wal_level. The discussion here is about adding origin_node_id
> *only*, which needs to be added to each WAL record.

If that's all we can discuss here, and all other options are off the table, then I'll have to just outright object to this patch. Let's implement what we can without the origin id, and revisit this later.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
[HACKERS] Too frequent message of pgbench -i?
Currently pgbench -i prints the following message every 10k tuples created:

fprintf(stderr, "%d tuples done.\n", j);

The frequency of this message was chosen a long time ago; computers have become so fast since then that a message every 10k tuples seems too frequent to me. Can we change the frequency from 10k to 100k? Or should we make the frequency adjustable from the command line? Or, even better, should the frequency be changed automatically according to the scale factor?

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 15:45, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> On 20.06.2012 10:32, Simon Riggs wrote:
>> It is not the only required conflict mechanism, and has never been
>> claimed to be so. It is simply one piece of information needed, at
>> various times.
>
> So, if the origin id is not sufficient for some conflict resolution
> mechanisms, what extra information do you need for those, and where do
> you put it?

As explained elsewhere, wal_level = logical (or similar) would be used to provide any additional logical information required. Update and Delete WAL records already need to be different in that mode, so additional info would be placed there, if there were any.

In the case of reflexive updates you raised, a typical response in other DBMSs would be to represent the query

UPDATE SET counter = counter + 1

by sending just the "+1" part, not the current value of counter, as would be the case with the non-reflexive update

UPDATE SET counter = 1

Handling such things in Postgres would require some subtlety, which would not be resolved in the first release, but is pretty certain not to require any changes to the WAL record header as a way of resolving it. Having already thought about it, I'd estimate that is a very long discussion and not relevant to the OT, but if you wish to have it here, I won't stop you.

>> Additional information required for logical replication will be
>> handled by a new wal_level. The discussion here is about adding
>> origin_node_id *only*, which needs to be added to each WAL record.
>
> If that's all we can discuss here, and all other options are off the
> table, then I'll have to just outright object to this patch. Let's
> implement what we can without the origin id, and revisit this later.

As explained, we can do nothing without the origin id. It is not optional or avoidable in the way you've described.

We have the choice to add the required information as a static or as a variable-length addition to the WAL record header. Since there is no additional requirement for expansion of the header at this point, and no measurable harm in doing so, I suggest we avoid the complexity and lack of robustness that a variable-length header would cause. Of course, we can go there later if needed, but there is no current need. If a variable-length header had been suggested, it is certain somebody would say that was overcooked, and to just do it as static.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
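For illustration only, "sending just the +1 part" could be modelled in an LCR payload along these lines (types and names are invented here, not part of any submitted patch):

#include <stdint.h>

typedef enum LCRUpdateKind
{
    LCR_SET_VALUE,      /* "UPDATE SET counter = 10": newest-update-wins */
    LCR_APPLY_DELTA     /* "UPDATE SET counter = counter + 1": accumulate */
} LCRUpdateKind;

typedef struct CounterLCR
{
    LCRUpdateKind   kind;       /* how the value should be applied */
    int64_t         value;      /* absolute value, or delta to add */
} CounterLCR;

The apply side would then add the delta for LCR_APPLY_DELTA but overwrite for LCR_SET_VALUE, which is exactly the distinction the origin id alone cannot express.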
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 11:17, Simon Riggs wrote:
> On 20 June 2012 15:45, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
>> So, if the origin id is not sufficient for some conflict resolution
>> mechanisms, what extra information do you need for those, and where do
>> you put it?
>
> As explained elsewhere, wal_level = logical (or similar) would be used
> to provide any additional logical information required. [...] Handling
> such things in Postgres would require some subtlety, which would not be
> resolved in the first release, but is pretty certain not to require any
> changes to the WAL record header as a way of resolving it. Having
> already thought about it, I'd estimate that is a very long discussion
> and not relevant to the OT, but if you wish to have it here, I won't
> stop you.

Yeah, I'd like to hear briefly how you would handle that without any further changes to the WAL record header.

>> If that's all we can discuss here, and all other options are off the
>> table, then I'll have to just outright object to this patch. Let's
>> implement what we can without the origin id, and revisit this later.
>
> As explained, we can do nothing without the origin id. It is not
> optional or avoidable in the way you've described.

It's only needed for multi-master replication, where the same table can be updated from multiple nodes. Just leave that out for now. There's plenty of functionality and issues left even without that.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
[HACKERS] Release versioning inconsistency
Hi list,

The recent 9.2 beta releases have used a slightly different numbering scheme than all previous releases. It used to be that tarballs for version $VER were always available at:

http://ftp.postgresql.org/pub/source/v$VER/postgresql-$VER.tar.bz2

However, the new releases now use v9.2.0beta2 for the directory name, but 9.2beta2 in the tarball file. No big deal for most people, but it will confuse people who have scripts to download PostgreSQL tarballs automatically (e.g. packagers). For example:

http://ftp.postgresql.org/pub/source/v9.2.0beta2/postgresql-9.2beta2.tar.bz2

Is there any reason behind this change?

Regards,
Marti
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 16:23, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> It's only needed for multi-master replication, where the same table can
> be updated from multiple nodes. Just leave that out for now. There's
> plenty of functionality and issues left even without that.

Huh? Multi-master replication is what is being built here, and many people want that.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 11:34, Simon Riggs wrote:
> Huh? Multi-master replication is what is being built here, and many
> people want that.

Sure, but presumably you're going to implement master-slave first, and build multi-master on top of that. What I'm saying is that we can leave out the origin id for now, since we don't have agreement on it, and revisit it after master-slave replication is working.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Too frequent message of pgbench -i?
On 20.06.2012 11:04, Tatsuo Ishii wrote:
> Currently pgbench -i prints the following message every 10k tuples
> created:
>
> fprintf(stderr, "%d tuples done.\n", j);
>
> [...] Can we change the frequency from 10k to 100k? Or should we make
> the frequency adjustable from the command line? Or, even better, should
> the frequency be changed automatically according to the scale factor?

Or print the message every n seconds, instead of every n lines. I think a new command-line option would be overkill. I would be fine with just bumping it to every 100k lines. So +1 on doing something about it, not sure what.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
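A minimal standalone sketch of the time-based variant (pgbench's actual init loop and variable names differ; LOG_STEP_SECONDS is an assumed constant):

#include <stdio.h>
#include <time.h>

#define LOG_STEP_SECONDS 5      /* assumed reporting interval */

/*
 * Call once per generated row; prints at most one progress line every
 * LOG_STEP_SECONDS, however fast the rows are produced.
 */
static void
report_progress(time_t *last_log, int done, int total)
{
    time_t  now = time(NULL);

    if (now - *last_log >= LOG_STEP_SECONDS)
    {
        fprintf(stderr, "%d of %d tuples done.\n", done, total);
        *last_log = now;
    }
}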
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 02:35:59 AM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne cbbro...@gmail.com wrote:
>> On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas robertmh...@gmail.com wrote:
>>>> Btw, what do you mean with "conflating" the stream? I don't really
>>>> see that being proposed.
>>>
>>> It seems to me that you are intent on using the WAL stream as the
>>> logical change stream. I think that's a bad design. Instead, you
>>> should extract changes from WAL and then ship them around in a format
>>> that is specific to logical replication.
>>
>> Yeah, that seems worth elaborating on. What has been said several
>> times is that it's pretty necessary to capture the logical changes
>> into WAL. That seems pretty needful, in order that the replication
>> data gets fsync()ed avidly, and so that we don't add in the race
>> condition of needing to fsync() something *else* almost exactly as
>> avidly as is the case for WAL today.
>
> Check.
>
>> But it's undesirable to pull *all* the bulk of contents of WAL around
>> if it's only part of the data that is going to get applied. On a
>> physical streaming replica, any logical data that gets captured will
>> be useless. And on a logical replica, the physical bits of WAL will be
>> useless.
>>
>> What I *want* you to mean is that there would be:
>> a) WAL readers that pull the physical bits, and
>> b) WAL readers that just pull logical bits.
>>
>> I expect it would be fine to have a tool that pulls LCRs out of WAL to
>> prepare that to be sent to remote locations. Is that what you have in
>> mind?
>
> Yes. I think it should be possible to generate LCRs from WAL, but I
> think that the on-the-wire format for LCRs should be different from the
> WAL format. Trying to use the same format for both things seems like an
> unpleasant straitjacket. This discussion illustrates why: we're talking
> about consuming scarce bit-space in WAL records for a feature that only
> a tiny minority of users will use, and it's still not really enough bit
> space. That stinks. If LCR transmission is a separate protocol, this
> problem can be engineered away at a higher level.

As I said before, I definitely agree that we want to have a separate transport format once we have decoding nailed down. We still need to ship WAL around if the decoding happens in a different instance, but *after* that it can be shipped in something more convenient/appropriate.

> Suppose we have three servers, A, B, and C, that are doing multi-master
> replication in a loop. A sends LCRs to B, B sends them to C, and C
> sends them back to A. Obviously, we need to make sure that each server
> applies each set of changes just once, but it suffices to have enough
> information in WAL to distinguish between replication transactions and
> non-replication transactions - that is, one bit. So suppose a change is
> made on server A. A generates LCRs from WAL, and tags each LCR with
> node_id = A. It then sends those LCRs to B. B applies them, flagging
> the apply transaction in WAL as a replication transaction, AND ALSO
> sends the LCRs to C. The LCR generator on B sees the WAL from apply,
> but because it's flagged as a replication transaction, it does not
> generate LCRs. So C receives LCRs from B just once, without any need
> for the node_id to be known in WAL. C can now also apply those LCRs
> (again flagging the apply transaction as replication) and it can also
> skip sending them to A, because it sees that they originated at A.

One bit is fine if you have only very simple replication topologies. Once you think about globally distributed databases, it's a bit different.

You describe some of that below, but just to reiterate: imagine having 6 nodes, 3 on each of two continents (A, B, C in North America; D, E, F in Europe). You may only want to have full intercontinental interconnect between two of those (say A and D). If you only have one bit to represent the origin, that's not going to work, because you won't be able to discern the changes from B and C on A from those originating on D, E or F.

Another topology which is interesting is circular replication (i.e. changes get shipped A->B, B->C, C->A), which is a sensible topology if you only have a low change rate and a relatively high number of nodes, because you don't need the full combinatorial number of connections.

You still have the origin_ids be meaningful in the local context, though. As described before, in the communication between the different nodes you can simply replace the 16-bit node id with some fancy UUID or such, and do the reverse when replaying LCRs.

> Now suppose we have a more complex topology. Suppose we have a cluster
> of four servers A .. D which, for improved tolerance against network
> outages, are all connected pairwise. Normally all the links are up, so
> each server sends all the LCRs it generates directly to all other
> servers. But how do we prevent cycles? A generates a change and sends
> it to B, C, and D. B then sees that the change
Re: [HACKERS] Release versioning inconsistency
On Wed, Jun 20, 2012 at 10:28 AM, Marti Raudsepp ma...@juffo.org wrote:
> Hi list,
>
> The recent 9.2 beta releases have used a slightly different numbering
> scheme than all previous releases. [...] However, the new releases now
> use v9.2.0beta2 for the directory name, but 9.2beta2 in the tarball
> file. [...] Is there any reason behind this change?

Not behind the first one. I believe that's just me and a bad memory - I got it wrong.

(I do believe that using the v9.2.0beta marker is *better*, because then it sorts properly. But likely not enough better to justify being inconsistent with previous versions.)

For beta2, the only reason was to keep it consistent with beta1.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] Release versioning inconsistency
On Wed, Jun 20, 2012 at 12:18 PM, Magnus Hagander mag...@hagander.net wrote:
> (I do believe that using the v9.2.0beta marker is *better*, because
> then it sorts properly. But likely not enough better to justify being
> inconsistent with previous versions.)

Good point. Maybe that's a reason to change the versioning scheme and stick with 9.2.0betaX everywhere. Including calling the final release 9.2.0 instead of simply 9.2?

Regards,
Marti
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 16:44, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> Sure, but presumably you're going to implement master-slave first, and
> build multi-master on top of that. What I'm saying is that we can leave
> out the origin id for now, since we don't have agreement on it, and
> revisit it after master-slave replication is working.

I am comfortable with the idea of deferring applying the patch, but I don't see any need to defer agreeing that the patch is OK, so it can be applied easily later. It does beg the question of when exactly we would defer it to, though. When would that be?

If you have a reason for disagreement, please raise it now, having seen explanations/comments on the various concerns. Of course, people have made initial objections, which is fine, but it's not reasonable to assume that such objections still stand after those responses. Perhaps there are other thoughts?

The idea that logical rep is some kind of useful end goal in itself is slightly misleading. If the thought is to block multi-master completely on that basis, that would be a shame. Logical rep is the mechanism for implementing multi-master.

Deferring this could easily end up with a huge patch in the last CF, which would then be rejected/deferred. Patch submission here is following the requested process - as early as possible, production ready, small meaningful patches that build towards a final goal. This is especially true for format changes, which is why this patch is here now. Doing it differently just makes patch wrangling and review more difficult, which reduces overall quality and slows down development.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] sortsupport for text
On sön, 2012-06-17 at 23:58 +0100, Peter Geoghegan wrote:
> So if you take the word "Aßlar" here - that is equivalent to "Asslar",
> and so strcoll("Aßlar", "Asslar") will return 0 if you have the right
> LC_COLLATE

This is not actually correct. glibc will sort Asslar before Aßlar, and that is correct in my mind.

> When a Wikipedia page on some particular language's alphabet says
> something like "$letterA and $letterB are equivalent", what it really
> means is that they are sorted the same compared to other letters, but
> are distinct when ties are broken.
>
> (if you tried this out for yourself and found that I was actually lying
> through my teeth, pretend I said Hungarian instead of German and some
> really obscure character rather than ß).

Yeah, there are obviously exceptions, which led to the original change being made, but they are not as widespread as they appear to be.

The real issue in this area, I suspect, will be dealing with Unicode combining sequences versus equivalent precombined characters. But support for that is generally crappy, so it's not urgent to deal with it.
Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture
Hi Robert, Hi all!

On Wednesday, June 20, 2012 03:08:48 AM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 2:23 PM, Andres Freund and...@2ndquadrant.com wrote:
>>> Well, the words are fuzzy, but I would define logical replication to
>>> be something which is independent of the binary format in which stuff
>>> gets stored on disk. If it's not independent of the disk format, then
>>> you can't do heterogeneous replication (between versions, or between
>>> products). That precise limitation is the main thing that drives
>>> people to use anything other than SR in the first place, IME.
>>
>> Not in mine. The main limitation I see is that you cannot write
>> anything on the standby. Which sucks majorly for many things. It's
>> pretty much impossible to fix that for SR outside of very limited
>> cases. While many scenarios don't need multimaster, *many* need to
>> write outside of the standby's replication set.
>
> Well, that's certainly a common problem, even if it's not IME the most
> common, but I don't think we need to argue about which one is more
> common, because I'm not arguing against it. The point, though, is that
> if the logical format is independent of the on-disk format, the things
> we can do are a strict superset of the things we can do if it isn't. I
> don't want to insist that catalogs be the same (or else you get garbage
> when you decode tuples). I want to tolerate the fact that they may very
> well be different. That will in no way preclude writing outside the
> standby's replication set, nor will it prevent multi-master
> replication. It will, however, enable heterogeneous replication, which
> is a very important use case. It will also mean that innocent mistakes
> (like somehow ending up with a column that is text on one server and
> numeric on another server) produce comprehensible error messages,
> rather than garbage.

I agree with most of that. I think that some parts of the above need to be optional, because you lose too much for other scenarios otherwise. I *definitely* want to build the *infrastructure* which makes it easy to implement all of the above, but I find it a bit much to require that from the get-go. It's important that everything is reusable for that, yes. Does a patchset that wants to implement tightly coupled multimaster need to implement everything for that? No. If we raise the barrier for anything around this topic so high, we will *NEVER* get anywhere. It's a huge topic with loads of people wanting loads of different things. And that will hurt people wanting some feature which matches 90% of the proposed goals *far* more.

>> It's not only the logging side which is a limitation in today's
>> replication scenarios. The apply side scales even worse because it's
>> *very* hard to distribute it between multiple backends.
>
> I don't think that making LCR format = on-disk format is going to solve
> that problem. To solve that problem, we need to track dependencies
> between transactions, so that if tuple A is modified by T1 and T2, in
> that order, we apply T1 before T2. But if T3 - which committed after
> both T1 and T2 - touches none of the same data as T1 or T2 - then we
> can apply it in parallel, so long as we don't commit until T1 and T2
> have committed (because allowing T3 to commit early would produce a
> serialization anomaly from the point of view of a concurrent reader).

Well, doing apply at such a low level, without re-encoding the data, increased throughput nearly threefold even for trivial types. So it pushes out the point where we need to do the above quite a bit.

> Because the routines that decode tuples don't include enough sanity
> checks to prevent running off the end of the block, or even the end of
> memory completely. Consider a corrupt TOAST pointer that indicates that
> there is a gigabyte of data stored in an 8kB block. One of the common
> symptoms of corruption IME is TOAST requests for -3 bytes of memory.

Yes, but we need to put safeguards against that sort of thing anyway. So sure, we can have bugs, but this is not a fundamental limitation.

> There's a reason we haven't done that already, though: it's probably
> going to stink for performance. If it turns out that it doesn't stink
> for performance, great. But if it causes a 5% slowdown on common use
> cases, I suspect we're not gonna do it, and I bet I can construct a
> case where it's worse than that (think: 400 column table with lots of
> varlenas, sorting by column 400 to return column 399). I think it's
> treading on dangerous ground to assume we're going to be able to just
> go fix this.

I am talking about ensuring that the catalog is the same on the decoding site, not about making all decoding totally safe in the face of corrupted information.

PostGIS uses one information table in a few more complex functions, but not in anything low-level - evidenced by the fact that it was totally normal for that to go out of sync before 2.0. But even if such a thing were needed, it wouldn't be problematic to make extension
Re: [HACKERS] performance regression in 9.2 when loading lots of small tables
On Tue, Jun 19, 2012 at 8:06 PM, Robert Haas robertmh...@gmail.com wrote:
> On Tue, Jun 19, 2012 at 10:56 PM, Jeff Janes jeff.ja...@gmail.com wrote:
>> But in the 9.2 branch, the slow phenotype was re-introduced in
>> 1575fbcb795fc331f4, although perhaps the details of who is locking
>> what differ. I haven't yet sorted that out.
>
> It very much does. That commit prevents people from creating a relation
> in - or renaming a relation into - a schema that is being concurrently
> dropped, which in previous releases would have resulted in inconsistent
> catalog contents. I admit that it harms your test case, but how likely
> is it that someone is going to put every single table into its own
> schema?

Yep, I see that even having 10 tables per schema (which I think was the case for the person who started this line of inquiry) rather than just 1 greatly reduces this regression. It basically goes from about 2x slower to about 1.1x slower. So I think that pretty much settles the issue for 9.2.

> And have shared_buffers low enough for this to be the dominant cost?

That one is not so unlikely. After all, lowering shared_buffers is a defensive move to get around the FlushRelationBuffers problem, so hopefully people faced with the task of restoring such a dump would take advantage of it, if they have a maintenance window in which to do so. Also, the FlushRelationBuffers issue is linear while the locks are quadratic, so eventually the locking would regain dominance even if shared_buffers is large.

> I think in real-world scenarios this isn't going to be a problem -
> although, of course, making the lock manager faster would be nifty if
> we can do it, and this might be a good test case.

The resource owner reassign lock patch does effectively solve the lock manager problem of restoring dumps of such databases, as well as the similar problem in creating such dumps.

Cheers,

Jeff
Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables
Hi KaiGai-san,

Thank you for the review.

> -----Original Message-----
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kohei KaiGai
> Sent: Wednesday, June 20, 2012 1:26 AM
> To: Etsuro Fujita
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV
> file foreign tables
>
> Hi Fujita-san,
>
> Could you rebase this patch towards the latest tree? It was unavailable
> to apply the patch cleanly.

Sorry, I updated the patch. Please find attached an updated version of the patch.

> I looked over the patch, then noticed a few points.
>
> At ProcessCopyOptions(), defel->arg can be NIL, isn't it? If so,
> cstate->convert_binary is not a suitable flag to check the redundant
> option. It seems to me cstate->convert_selectively is a more suitable
> flag to check it.
>
> + else if (strcmp(defel->defname, "convert_binary") == 0)
> + {
> +     if (cstate->convert_binary)
> +         ereport(ERROR,
> +                 (errcode(ERRCODE_SYNTAX_ERROR),
> +                  errmsg("conflicting or redundant options")));
> +     cstate->convert_selectively = true;
> +     if (defel->arg == NULL || IsA(defel->arg, List))
> +         cstate->convert_binary = (List *) defel->arg;
> +     else
> +         ereport(ERROR,
> +                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                  errmsg("argument to option \"%s\" must be a list of column names",
> +                         defel->defname)));
> + }

Yes, defel->arg can be NIL. defel->arg is a List structure listing all the columns that need to be converted to binary representation, which is NIL for the case where no columns need to be converted. For example, defel->arg is NIL for SELECT COUNT(*). In this case, while cstate->convert_selectively is set to true, no columns are converted at NextCopyFrom(). The most efficient case! In short, cstate->convert_selectively represents whether to do selective binary conversion at NextCopyFrom(), and cstate->convert_binary represents all the columns to be converted at NextCopyFrom(), which can be NIL.

> At NextCopyFrom(), this routine computes default values if configured.
> In case when these values are not referenced, it might be possible to
> skip unnecessary calculations. Is it unavailable to add logic to avoid
> constructing cstate->defmap on unreferenced columns at BeginCopyFrom()?

I think that we don't need to add the above logic, because file_fdw does BeginCopyFrom() with attnamelist = NIL, in which case BeginCopyFrom() doesn't construct cstate->defmap at all.

I fixed a bug plus some minor optimization in check_binary_conversion(), which is renamed to check_selective_binary_conversion() in the updated version, and now file_fdw gives up selective binary conversion in the following cases:

a) BINARY format
b) CSV/TEXT format and whole-row reference
c) CSV/TEXT format and all the user attributes needed

Best regards,
Etsuro Fujita

> Thanks,
>
> 2012/5/11 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
>> -----Original Message-----
>> From: Robert Haas [mailto:robertmh...@gmail.com]
>> Sent: Friday, May 11, 2012 1:36 AM
>>
>>> On Tue, May 8, 2012 at 7:26 AM, Etsuro Fujita
>>> fujita.ets...@lab.ntt.co.jp wrote:
>>>> I would like to propose to improve parsing efficiency of
>>>> contrib/file_fdw by selective parsing proposed by Alagiannis et
>>>> al. [1], which means that for a CSV/TEXT file foreign table,
>>>> file_fdw performs binary conversion only for the columns needed for
>>>> query processing. Attached is a WIP patch implementing the feature.
>>>
>>> Can you add this to the next CommitFest? Looks interesting.
>>
>> Done.
>> Best regards,
>> Etsuro Fujita

> --
> KaiGai Kohei kai...@kaigai.gr.jp

(Attachment: file_fdw_sel_bin_conv_v2.patch)
Re: [HACKERS] sortsupport for text
On 20 June 2012 11:00, Peter Eisentraut pete...@gmx.net wrote:
> On sön, 2012-06-17 at 23:58 +0100, Peter Geoghegan wrote:
>> So if you take the word "Aßlar" here - that is equivalent to "Asslar",
>> and so strcoll("Aßlar", "Asslar") will return 0 if you have the right
>> LC_COLLATE
>
> This is not actually correct. glibc will sort Asslar before Aßlar, and
> that is correct in my mind.

Uh, what happened here was that I assumed that it was correct, then went to verify it and found that it wasn't before sending the mail, and since I couldn't immediately find any hard data about which characters this did apply to, I decided to turn it into a joke. I say this, and yet you've included that bit of the e-mail inline in your reply, so maybe it just wasn't a very good joke.

> When a Wikipedia page on some particular language's alphabet says
> something like "$letterA and $letterB are equivalent", what it really
> means is that they are sorted the same compared to other letters, but
> are distinct when ties are broken.

I know.

> Yeah, there are obviously exceptions, which led to the original change
> being made, but they are not as widespread as they appear to be.

True.

> The real issue in this area, I suspect, will be dealing with Unicode
> combining sequences versus equivalent precombined characters. But
> support for that is generally crappy, so it's not urgent to deal with
> it.

I agree that it isn't urgent. However, I have an ulterior motive, which is that in allowing for this, we remove the need to strcmp() after each strcoll(), and consequently it becomes possible to use strxfrm() instead. Now, we could also use a hack that's going to make the strxfrm() blobs even bulkier still (basically, concatenate the original text to the blob before strcmp()), but I don't want to go there if it can possibly be avoided. I should also point out that we pride ourselves on following the letter of the standard when that makes sense, and we are currently not doing that in respect of the Unicode standard.

--
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
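For context, the strxfrm() approach pays the transform cost once per value and then compares cheap binary keys - a minimal sketch under that assumption, not the proposed patch itself:

#include <stdlib.h>
#include <string.h>

/*
 * Build a collation key for s once; comparing two such keys with
 * strcmp()/memcmp() then gives the same ordering as strcoll() on the
 * original strings, with no per-comparison strcoll() call.
 */
static char *
make_sort_key(const char *s)
{
    size_t  len = strxfrm(NULL, s, 0);  /* length the key will need */
    char   *key = malloc(len + 1);

    if (key != NULL)
        strxfrm(key, s, len + 1);
    return key;
}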
Re: [HACKERS] libpq compression
On Jun 19, 2012, at 17:36, Robert Haas wrote:
> On Mon, Jun 18, 2012 at 1:42 PM, Martijn van Oosterhout klep...@svana.org wrote:
>> On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote:
>>> The fly in the ointment with any of these ideas is that the configure
>>> list is not a list of exact cipher names, as per Magnus' comment that
>>> the current default includes tests like !aNULL. I am not sure that we
>>> know how to evaluate such conditions if we are applying an
>>> after-the-fact check on the selected cipher. Does OpenSSL expose any
>>> API for evaluating whether a selected cipher meets such a test?
>>
>> I'm not sure whether there's an API for it, but you can certainly
>> check manually with "openssl ciphers -v", for example:
>>
>> $ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP'
>> NULL-SHA    SSLv3 Kx=RSA Au=RSA Enc=None Mac=SHA1
>> NULL-MD5    SSLv3 Kx=RSA Au=RSA Enc=None Mac=MD5
>> ...etc...
>>
>> So unless openssl includes the code twice there must be a way to
>> extract the list from the library.
>
> There doubtless is, but I'd be willing to wager that you won't be able
> to figure out the exact method without reading the source code for
> 'openssl ciphers' to see how it was done there, and most likely you'll
> find that at least one of the functions they use has no man page.
> Documentation isn't their strong point.

Yes, unfortunately. I wonder though if we shouldn't restrict the allowed ciphers list to being a simple list of supported ciphers. If our goal is to support multiple SSL libraries transparently, then surely having openssl-specific syntax in the config file isn't exactly great anyway...

best regards,
Florian Pflug
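If an after-the-fact check were wanted despite that, one plausible approach (a sketch only; error handling elided, and assuming 2012-era OpenSSL APIs) is to expand the configured list in a scratch context and compare the negotiated cipher's name against it - effectively what "openssl ciphers -v <list>" does:

#include <string.h>
#include <openssl/ssl.h>

/*
 * Return 1 if the cipher negotiated on "conn" appears in the expansion
 * of a configured list such as "ALL:!aNULL", else 0.
 */
static int
cipher_allowed(SSL *conn, const char *cipher_list)
{
    int         allowed = 0;
    const char *name = SSL_CIPHER_get_name(SSL_get_current_cipher(conn));
    SSL_CTX    *ctx = SSL_CTX_new(SSLv23_method());

    if (ctx != NULL && SSL_CTX_set_cipher_list(ctx, cipher_list) == 1)
    {
        SSL    *scratch = SSL_new(ctx);
        STACK_OF(SSL_CIPHER) *ciphers = SSL_get_ciphers(scratch);
        int     i;

        for (i = 0; i < sk_SSL_CIPHER_num(ciphers); i++)
        {
            if (strcmp(name,
                       SSL_CIPHER_get_name(sk_SSL_CIPHER_value(ciphers, i))) == 0)
                allowed = 1;
        }
        SSL_free(scratch);
    }
    SSL_CTX_free(ctx);
    return allowed;
}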
Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)
On Wednesday, June 20, 2012 05:01:16 AM Robert Haas wrote: On Tue, Jun 19, 2012 at 4:22 PM, Andres Freund and...@2ndquadrant.com wrote: 1. dllist.h has double the element overhead by having an inline value pointer (which is not needed when embedding) and a pointer to the list (which I have a hard time seing as being useful) 2. only double linked list, mine provided single and double linked ones 3. missing macros to use when embedded in a larger struct (containerof() wrappers and for(...) support basically) 4. most things are external function calls... 5. way much more branches/complexity in most of the functions. My implementation doesn't use any branches for the typical easy modifications (push, pop, remove element somewhere) and only one for the typical tests (empty, has-next, ...) The performance and memory aspects were crucial for the aforementioned toy project (slab allocator for postgres). Its not that crucial for the applycache where the lists currently are mostly used although its also relatively performance sensitive and obviously does a lot of list manipulation/iteration. If I had to decide I would add the missing api in dllist.h to my implementation and then remove it. Its barely used - and only in an embedded fashion - as far as I can see. I can understand though if that argument is met with doubt by others ;). If thats the way it has to go I would add some more convenience support for embedding data to dllist.h and settle for that. I think it might be simpler to leave the name as Dllist and just overhaul the implementation along the lines you suggest, rather than replacing it with something completely different. Mostly, I don't want to add a third thing if we can avoid it, given that Dllist as it exists today is used only lightly. Well, if its the name, I have no problem with changing it, but I don't see how you can keep the api as it currently is and address my points. If there is some buyin I can try to go either way (keeping the existing name, changing the api, adjusting the callers or just adjust the callers, throw away the old implementation) I just don't want to get into that just to see somebody isn't agreeing with the fundamental idea. My guess is that it wouldn't be too hard to remove some of the extra pointers. Anyone who is using Dllist as a non-inline list could be converted to List * instead. There are only three users of the whole dllist.h. Catcache, autovacuum and postmaster. The latter two just keep a list of databases around. So any change will only be moderatively intrusive. Also, the performance-critical things could be reimplemented as macros. I question, though, whether we really need both single and doubly linked lists. That seems like it's almost certainly micro-optimization that we are better off not doing. It was certainly worthwile for the memory manager (lower per allocation overhead). You might be right that its not worth it for many other possible usecases in pg. Its not much code though. *looks around* A quick grep found: single linked list like code: guc_private.h, aset.c, elog.h, regguts.h (ok, irrelevant), dynahash.c, resowner.c, extension.c, pgstat.c, xlog.c Double linked like code: shmqueue.c, lwlock.c, dynahash.c, xact.c I stopped at that point because while surely not of all of the above usecases could be replaced by a common implementation several could from a quick look. Also, several pg_list.h users could benefit from a conversion. So I think adding a single linked list implementation is worthwile. 
The most contentious point is probably relying on USE_INLINE being available anywhere - which I believe to be a safe assumption now that we have gotten rid of some platforms. I would be hesitant to chuck that even though I realize it's unlikely that we really need !USE_INLINE. But see sortsupport for an example of how we've handled this in the recent past.

I agree it's possible to resolve this. But hell, do we really need to add all that ugliness in 2012? I don't think it's worth the effort of supporting ancient compilers that don't support inline anymore. If we could stop catering for probably non-existent compilers we could remove some *very* ugly long macros, for example (e.g. in htup.h). If support for !USE_INLINE is required I would prefer to have a header define the functions like

#ifdef USE_INLINE
#define OPTIONALLY_INLINE static inline
#define USE_LINKED_LIST_IMPL
#endif

#ifdef USE_LINKED_LIST_IMPL
OPTIONALLY_INLINE void
myFuncCall()
{
    ...
}
#endif

which then gets included with #define USE_LINKED_LIST_IMPL by some .c file defining OPTIONALLY_INLINE to something empty if !USE_INLINE. It's too much code to duplicate, IMO.

Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list
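Spelled out end to end, that scheme could look like the following sketch (mylist.h, mylist.c and myFuncCall are hypothetical names; include guards omitted for brevity):

/* mylist.h */
#ifdef USE_INLINE
#define OPTIONALLY_INLINE static inline
#define MYLIST_INCLUDE_DEFINITIONS
#else
extern void myFuncCall(void);       /* out-of-line fallback */
#endif

#ifdef MYLIST_INCLUDE_DEFINITIONS
OPTIONALLY_INLINE void
myFuncCall(void)
{
    /* ... the implementation, written exactly once ... */
}
#endif

/* mylist.c -- exists so that !USE_INLINE builds get exactly one copy */
#ifndef USE_INLINE
#define OPTIONALLY_INLINE           /* expands to nothing */
#define MYLIST_INCLUDE_DEFINITIONS
#endif
#include "mylist.h"

With USE_INLINE, every includer sees static inline definitions and mylist.c adds nothing extra; without it, ordinary callers see only the extern declaration, and mylist.c emits the single out-of-line definition.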
Re: [HACKERS] Ability to listen on two unix sockets
On 06/15/2012 05:40 PM, Honza Horak wrote: I realized the patch has some difficulties -- namely the socket path in the data dir lock file, which currently uses one port for the socket and the same for the interface. So to allow users to use an arbitrary port for all unix sockets, we'd need to add another line only for the unix socket, which doesn't apply for other platforms. Or we could just say that the first socket will always use the default port (PostPortNumber), which is the solution I currently prefer, but I'll be glad for any other opinion. This is also why there is still unnecessary string splitting in pg_ctl.c, which will be removed after the issue above is solved.

This is an enhanced patch, which forbids using a port number in the first socket directory entry.

Honza

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index cfdb33a..679c40a 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -838,7 +838,7 @@ omicron bryanh guest1
 <varname>unix_socket_permissions</varname> (and possibly
 <varname>unix_socket_group</varname>) configuration parameters as
 described in <xref linkend="runtime-config-connection">.  Or you
-could set the <varname>unix_socket_directory</varname>
+could set the <varname>unix_socket_directories</varname>
 configuration parameter to place the socket file in a suitably
 restricted directory.
 </para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 074afee..d1727f8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -453,17 +453,26 @@ SET ENABLE_SEQSCAN TO OFF;
      </listitem>
     </varlistentry>

-     <varlistentry id="guc-unix-socket-directory" xreflabel="unix_socket_directory">
-      <term><varname>unix_socket_directory</varname> (<type>string</type>)</term>
+     <varlistentry id="guc-unix-socket-directories" xreflabel="unix_socket_directories">
+      <term><varname>unix_socket_directories</varname> (<type>string</type>)</term>
       <indexterm>
-       <primary><varname>unix_socket_directory</> configuration parameter</primary>
+       <primary><varname>unix_socket_directories</> configuration parameter</primary>
       </indexterm>
       <listitem>
        <para>
-        Specifies the directory of the Unix-domain socket on which the
+        Specifies the directories of the Unix-domain sockets on which the
         server is to listen for connections from client applications.
         The default is normally <filename>/tmp</filename>, but can be
         changed at build time.
+        Directories are separated by ',' and an additional <replaceable>port</>
+        number can be set, separated from the directory by ':'.  The port
+        number will only be used as a part of the socket file name.  For example,
+        <literal>'/var/run, /tmp:5431'</literal> would create socket files
+        <literal>/var/run/.s.PGSQL.5432</literal> and
+        <literal>/tmp/.s.PGSQL.5431</literal>.
+        It is not possible to define a <replaceable>port</> number in the first
+        entry.  If set, it will be ignored and the default <varname>port</varname>
+        will be used.
         This parameter can only be set at server start.
        </para>

@@ -472,7 +481,7 @@ SET ENABLE_SEQSCAN TO OFF;
        <literal>.s.PGSQL.<replaceable></replaceable></literal> where
        <replaceable></replaceable> is the server's port number, an ordinary file
        named <literal>.s.PGSQL.<replaceable></replaceable>.lock</literal> will be
-       created in the <varname>unix_socket_directory</> directory.  Neither
+       created in the <varname>unix_socket_directories</> directories.  Neither
        file should ever be removed manually.
 </para>
@@ -6593,7 +6602,7 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
      </row>
      <row>
       <entry><option>-k <replaceable>x</replaceable></option></entry>
-      <entry><literal>unix_socket_directory = <replaceable>x</replaceable></literal></entry>
+      <entry><literal>unix_socket_directories = <replaceable>x</replaceable></literal></entry>
      </row>
      <row>
       <entry><option>-l</option></entry>
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 7ba18f0..6c74844 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1784,7 +1784,7 @@ pg_dumpall -p 5432 | psql -d postgres -p 5433
  <para>
   The simplest way to prevent spoofing for <literal>local</literal>
   connections is to use a Unix domain socket directory (<xref
-  linkend="guc-unix-socket-directory">) that has write permission only
+  linkend="guc-unix-socket-directories">) that has write permission only
   for a trusted local user.  This prevents a malicious user from creating
   their own socket file in that directory.  If you are concerned that
   some applications might still reference <filename>/tmp</filename> for the
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index e3ae92d..50d4167 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -31,6 +31,7 @@
 #include "postmaster/bgwriter.h"
 #include
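For concreteness, the setting proposed in the config.sgml hunk above would be used like this (a hypothetical postgresql.conf fragment illustrating the patch's proposed syntax, not a released feature):

# postgresql.conf (sketch of the proposed syntax)
port = 5432
unix_socket_directories = '/var/run/postgresql, /tmp:5431'

# Resulting socket files:
#   /var/run/postgresql/.s.PGSQL.5432   (first entry always uses "port")
#   /tmp/.s.PGSQL.5431                  (per-directory port override)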
Re: [HACKERS] [ADMIN] pg_basebackup blocking all queries with horrible performance
On Mon, Jun 11, 2012 at 6:06 PM, Fujii Masao masao.fu...@gmail.com wrote: On Tue, Jun 12, 2012 at 12:47 AM, Magnus Hagander mag...@hagander.net wrote: On Mon, Jun 11, 2012 at 5:37 PM, Fujii Masao masao.fu...@gmail.com wrote: On Mon, Jun 11, 2012 at 3:24 AM, Magnus Hagander mag...@hagander.net wrote: On Sun, Jun 10, 2012 at 6:08 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Jun 10, 2012 at 11:45 PM, Magnus Hagander mag...@hagander.net wrote: On Sun, Jun 10, 2012 at 4:29 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Jun 10, 2012 at 11:10 PM, Magnus Hagander mag...@hagander.net wrote: On Sun, Jun 10, 2012 at 4:08 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Jun 10, 2012 at 10:34 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Jun 10, 2012 at 9:25 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Jun 10, 2012 at 7:43 PM, Magnus Hagander mag...@hagander.net wrote: On Sat, Jun 9, 2012 at 2:51 PM, Tom Lane t...@sss.pgh.pa.us wrote: Fujii Masao masao.fu...@gmail.com writes:

This seems like a bug. I think we should prevent pg_basebackup from becoming a synchronous standby. Thoughts?

Absolutely. If we have replication clients that are not actually capable of being standbys, there *must* be a way for the master to know that.

I thought we fixed this already by sending InvalidXlogRecPtr as the flush location? And that this only applied in 9.2? Are you saying we picked pg_basebackup *in backup mode* (not log streaming) as synchronous standby?

Yes.

If so then yes, that is *definitely* a bug that should be fixed. We should never select a connection that's not even streaming log as standby!

Agreed. Attached patch prevents pg_basebackup from becoming a sync standby. Also this patch fixes another problem: currently only a walsender which has reached the STREAMING state can become the sync walsender. OTOH, the sync walsender assumes that a walsender with higher priority will be the sync one whether or not its state is STREAMING, and demotes itself to a potential sync walsender. So when a standby with higher priority connects to the master, we might have no sync standby until it reaches the STREAMING state. To fix this problem, the patch switches a walsender's state from sync to potential *after* the walsender with higher priority has reached the STREAMING state.

We also should not select (1) the background stream process forked from pg_basebackup and (2) pg_receivexlog as sync standbys, because they don't send back replication progress. To address this, I'm thinking to introduce a new option NOSYNC in the START_REPLICATION command as follows, and to change (1) and (2) so that they specify NOSYNC: START_REPLICATION XXX/XXX [NOSYNC]. If the standby specifies the NOSYNC option, it's never assigned as sync standby even if its name is in synchronous_standby_names. Thoughts?

A standby which always sends InvalidXLogRecPtr back should not become the sync one. So instead of a NOSYNC option, by checking whether InvalidXLogRecPtr is sent, we can avoid a problematic sync standby.

We should not do this, because Magnus is proposing a patch (http://archives.postgresql.org/pgsql-hackers/2012-06/msg00348.php) which breaks that assumption. So we should introduce something like a NOSYNC option.

Wouldn't the better choice in that case be to give a switch to pg_receivexlog if you *want* it to be able to become a sync replica, and by default disallow it? And then keep the backend just treating InvalidXlogRecPtr as don't-become-sync-replica.

I don't object to making pg_receivexlog a sync standby at all. So at least for me, that switch is not necessary.
What I'm worried about is the background stream process forked from pg_basebackup. I think that it should not run as a sync standby, but sending back its replication progress seems helpful because a user can see the progress from pg_stat_replication. So I'm thinking that something like a NOSYNC option is required.

On principle, no. By default, yes. How about:
- pg_basebackup background: *never* sends flush location, and therefore won't become a sync replica
- pg_receivexlog: *optionally* sends flush location; by default it won't become a sync replica, but can be made so with a switch

Wouldn't a user who sees NULL in flush_location from pg_stat_replication misunderstand and think that pg_receivexlog (in default mode) and the pg_basebackup background process don't flush WAL files at all?

That sounds like a documentable issue. But maybe you're right, and we need the "never become sync" behaviour as a flag.

So you agree to add something like a NOSYNC option to the START_REPLICATION command?

I'm on the fence. I was hoping somebody else would chime in with an opinion as well.

+1

Nobody else with any opinion on this? :( -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription:
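A sketch of the backend-side rule under discussion, assuming 9.2-era walsender structures (WalSnd, WALSNDSTATE_STREAMING, XLogRecPtrIsInvalid); the skip-on-invalid-flush policy is the proposal being debated here, not committed behaviour:

/* Inside the routine that picks the synchronous standby (sketch only;
 * i, priority and sync_walsnd are locals of that routine) */
for (i = 0; i < max_wal_senders; i++)
{
    /* use volatile pointer to prevent code rearrangement */
    volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];

    if (walsnd->pid == 0)
        continue;               /* slot not in use */
    if (walsnd->state != WALSNDSTATE_STREAMING)
        continue;               /* base backup, catchup, ... */
    if (XLogRecPtrIsInvalid(walsnd->flush))
        continue;               /* never reports a flush position:
                                 * pg_basebackup's WAL child, or
                                 * pg_receivexlog without the proposed
                                 * "may become sync" switch */
    if (walsnd->sync_standby_priority > 0 &&
        (priority == 0 || priority > walsnd->sync_standby_priority))
    {
        priority = walsnd->sync_standby_priority;
        sync_walsnd = walsnd;   /* best candidate so far */
    }
}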
Re: [HACKERS] WAL format changes
On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Well, that was easier than I thought. Attached is a patch to make XLogRecPtr a uint64, on top of my other WAL format patches. I think we should go ahead with this. +1.

The LSNs on pages are still stored in the old format, to avoid changing the on-disk format and breaking pg_upgrade. The XLogRecPtrs stored in the control file and WAL are changed, however, so an initdb (or at least pg_resetxlog) is required. Seems fine.

Should we keep the old representation in the replication protocol messages? That would make it simpler to write a client that works with different server versions (like pg_receivexlog). Or, while we're at it, perhaps we should mandate network byte order for all the integer and XLogRecPtr fields in the replication protocol. That would make it easier to write a client that works across different architectures, in >= 9.3. The contents of the WAL would of course be architecture-dependent, but it would be nice if pg_receivexlog and similar tools could nevertheless be architecture-independent.

I share Andres' question about how we're doing this already. I think if we're going to break this, I'd rather do it in 9.3 than 5 years from now. At this point it's just a minor annoyance, but it'll probably get worse as people write more tools that understand WAL. If we are looking at breaking it, and we are especially concerned about something like pg_receivexlog... Is it something we could/should change in the protocol *now* for 9.2, to make it non-broken in any released version? As in, can we extract just the protocol change and backpatch that to 9.2beta? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
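Mandating network byte order for a 64-bit LSN is cheap to do portably; for instance (a sketch, not the actual protocol code, and assuming the uint64 representation being proposed):

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Serialize a uint64 LSN into a message buffer in network byte order. */
static void
put_lsn_be(unsigned char *buf, uint64_t lsn)
{
    uint32_t hi = htonl((uint32_t) (lsn >> 32));
    uint32_t lo = htonl((uint32_t) lsn);

    memcpy(buf, &hi, 4);
    memcpy(buf + 4, &lo, 4);
}

/* The inverse, for the receiving side (pg_receivexlog and friends). */
static uint64_t
get_lsn_be(const unsigned char *buf)
{
    uint32_t hi, lo;

    memcpy(&hi, buf, 4);
    memcpy(&lo, buf + 4, 4);
    return ((uint64_t) ntohl(hi) << 32) | ntohl(lo);
}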
Re: [HACKERS] Release versioning inconsistency
On Wed, Jun 20, 2012 at 11:23 AM, Marti Raudsepp ma...@juffo.org wrote: On Wed, Jun 20, 2012 at 12:18 PM, Magnus Hagander mag...@hagander.net wrote: (I do believe that using the v9.2.0beta marker is *better*, because then it sorts properly. But likely not enough better to justify being inconsistent with previous versions.) Good point. Maybe that's a reason to change the versioning scheme and stick with 9.2.0betaX everywhere. Including calling the final release 9.2.0 instead of simply 9.2? That might actually be a good idea. We can't really change the way we named the betas, but it's not too late to consider naming the actual release 9.2.0... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Release versioning inconsistency
2012/6/20 Magnus Hagander mag...@hagander.net: On Wed, Jun 20, 2012 at 11:23 AM, Marti Raudsepp ma...@juffo.org wrote: On Wed, Jun 20, 2012 at 12:18 PM, Magnus Hagander mag...@hagander.net wrote: (I do believe that using the v9.2.0beta marker is *better*, because then it sorts properly. But likely not enough better to justify being inconsistent with previous versions.) Good point. Maybe that's a reason to change the versioning scheme and stick with 9.2.0betaX everywhere. Including calling the final release 9.2.0 instead of simply 9.2? That might actually be a good idea. We can't really change the way we named the betas, but it's not too late to consider naming the actual release 9.2.0...

Maybe a symlink could be created, just to fit the same pattern that other versions follow and keep the actual links (for beta) working. I'm using the same pattern in `pgvm` [1] and it is failing to fetch beta versions :(

[1] https://github.com/guedes/pgvm/blob/master/include/sites

regards -- Dickson S. Guedes mail/xmpp: gue...@guedesoft.net - skype: guediz http://guedesoft.net - http://www.postgresql.org.br -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ADMIN] pg_basebackup blocking all queries with horrible performance
On 11 June 2012 23:47, Magnus Hagander mag...@hagander.net wrote You agreed to add something like NOSYNC option into START_REPLICATION command? I'm on the fence. I was hoping somebody else would chime in with an opinion as well. Why would you add it to synchronous_standby_names and then explicitly ignore it? I don't see why you'd want this. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Release versioning inconsistency
On ons, 2012-06-20 at 13:26 +0200, Magnus Hagander wrote: On Wed, Jun 20, 2012 at 11:23 AM, Marti Raudsepp ma...@juffo.org wrote: On Wed, Jun 20, 2012 at 12:18 PM, Magnus Hagander mag...@hagander.net wrote: (I do believe that using the v9.2.0beta marker is *better*, because then it sorts properly. But likely not enough better to justify being inconsistent with previous versions.) Good point. Maybe that's a reason to change the versioning scheme and stick with 9.2.0betaX everywhere. Including calling the final release 9.2.0 instead of simply 9.2? That might actually be a good idea. We can't really change the way we named the betas, but it's not too late to consider naming the actual release 9.2.0...

The final release was always going to be called 9.2.0, but naming the beta 9.2.0betaX is wrong. There was a previous discussion about that particular point. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] reviving AC_PROG_INSTALL
As I had written in [0], using /usr/bin/install instead of install-sh can significantly speed up the time it takes to run make install, and hence make check. Using /usr/bin/install is standard in all autotools-using projects. PostgreSQL removed the use of /usr/bin/install because of the incident described in [1], which was basically configure picking up some random install program in the path which didn't do what we wanted. The Autoconf test has been much improved in the meantime; it now checks whether calling the program actually installs a file, so it's unlikely that another false positive would be picked up. Therefore I propose to put back the AC_PROG_INSTALL configure check and the associated bits that were taken out back then. See attached patch.

[0]: http://petereisentraut.blogspot.fi/2012/03/postgresql-make-install-times.html
[1]: http://archives.postgresql.org/pgsql-hackers/2001-03/msg00312.php

diff --git a/configure b/configure
index 6f8ebd8..fd29770 100755
--- a/configure
+++ b/configure
@@ -693,6 +693,9 @@ MKDIR_P
 AWK
 LN_S
 TAR
+INSTALL_DATA
+INSTALL_SCRIPT
+INSTALL_PROGRAM
 WINDRES
 DLLWRAP
 DLLTOOL
@@ -6855,6 +6858,106 @@ fi
 fi

+# Find a good install program.  We prefer a C program (faster),
+# so one script is as good as another.  But avoid the broken or
+# incompatible versions:
+# SysV /etc/install, /usr/sbin/install
+# SunOS /usr/etc/install
+# IRIX /sbin/install
+# AIX /bin/install
+# AmigaOS /C/install, which installs bootblocks on floppy discs
+# AIX 4 /usr/bin/installbsd, which doesn't work without a -g flag
+# AFS /usr/afsws/bin/install, which mishandles nonexistent args
+# SVR4 /usr/ucb/install, which tries to use the nonexistent group "staff"
+# OS/2's system install, which has a completely different semantic
+# ./install, which can be erroneously created by make from ./install.sh.
+# Reject install programs that cannot install multiple files.
+{ $as_echo "$as_me:$LINENO: checking for a BSD-compatible install" >&5
+$as_echo_n "checking for a BSD-compatible install... " >&6; }
+if test -z "$INSTALL"; then
+if test "${ac_cv_path_install+set}" = set; then
+  $as_echo_n "(cached) " >&6
+else
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+  # Account for people who put trailing slashes in PATH elements.
+case $as_dir/ in
+  ./ | .// | /cC/* | \
+  /etc/* | /usr/sbin/* | /usr/etc/* | /sbin/* | /usr/afsws/bin/* | \
+  ?:\\/os2\\/install\\/* | ?:\\/OS2\\/INSTALL\\/* | \
+  /usr/ucb/* ) ;;
+  *)
+    # OSF1 and SCO ODT 3.0 have their own names for install.
+    # Don't use installbsd from OSF since it installs stuff as root
+    # by default.
+    for ac_prog in ginstall scoinst install; do
+      for ac_exec_ext in '' $ac_executable_extensions; do
+	if { test -f "$as_dir/$ac_prog$ac_exec_ext" && $as_test_x "$as_dir/$ac_prog$ac_exec_ext"; }; then
+	  if test $ac_prog = install &&
+	    grep dspmsg "$as_dir/$ac_prog$ac_exec_ext" >/dev/null 2>&1; then
+	    # AIX install.  It has an incompatible calling convention.
+	    :
+	  elif test $ac_prog = install &&
+	    grep pwplus "$as_dir/$ac_prog$ac_exec_ext" >/dev/null 2>&1; then
+	    # program-specific install script used by HP pwplus--don't use.
+	    :
+	  else
+	    rm -rf conftest.one conftest.two conftest.dir
+	    echo one > conftest.one
+	    echo two > conftest.two
+	    mkdir conftest.dir
+	    if "$as_dir/$ac_prog$ac_exec_ext" -c conftest.one conftest.two "`pwd`/conftest.dir" &&
+	      test -s conftest.one && test -s conftest.two &&
+	      test -s conftest.dir/conftest.one &&
+	      test -s conftest.dir/conftest.two
+	    then
+	      ac_cv_path_install="$as_dir/$ac_prog$ac_exec_ext -c"
+	      break 3
+	    fi
+	  fi
+	fi
+      done
+    done
+    ;;
+esac
+
+done
+IFS=$as_save_IFS
+
+rm -rf conftest.one conftest.two conftest.dir
+
+fi
+  if test "${ac_cv_path_install+set}" = set; then
+    INSTALL=$ac_cv_path_install
+  else
+    # As a last resort, use the slow shell script.  Don't cache a
+    # value for INSTALL within a source directory, because that will
+    # break other packages using the cache if that directory is
+    # removed, or if the value is a relative name.
+    INSTALL=$ac_install_sh
+  fi
+fi
+{ $as_echo "$as_me:$LINENO: result: $INSTALL" >&5
+$as_echo "$INSTALL" >&6; }
+
+# Use test -z because SunOS4 sh mishandles braces in ${var-val}.
+# It thinks the first close brace ends the variable substitution.
+test -z "$INSTALL_PROGRAM" && INSTALL_PROGRAM='${INSTALL}'
+
+test -z "$INSTALL_SCRIPT" && INSTALL_SCRIPT='${INSTALL}'
+
+test -z "$INSTALL_DATA" && INSTALL_DATA='${INSTALL} -m 644'
+
+# When Autoconf chooses install-sh as install program it tries to generate
+# a relative path to it in each makefile where it substitutes it. This clashes
+# with our Makefile.global concept. This workaround helps.
+case $INSTALL in
+  *install-sh*) INSTALL='';;
+esac
+
 # Extract the first word of "tar", so it can be a program name with args.
 set dummy tar; ac_word=$2
 { $as_echo "$as_me:$LINENO: checking for $ac_word" >&5
@@
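For reference, the consuming side is just standard Autotools practice (sketched here rather than quoted from the patch; the makefile fragment and rule name are illustrative):

# configure.in
AC_PROG_INSTALL

# Makefile.global.in (sketch)
INSTALL         = @INSTALL@
INSTALL_PROGRAM = @INSTALL_PROGRAM@
INSTALL_SCRIPT  = @INSTALL_SCRIPT@
INSTALL_DATA    = @INSTALL_DATA@

# a rule can then install with whichever program configure chose:
install-bin: postgres
	$(INSTALL_PROGRAM) postgres '$(DESTDIR)$(bindir)/postgres'

When configure finds a usable BSD-compatible /usr/bin/install, each $(INSTALL) invocation runs a compiled program instead of the install-sh shell script, which is where the make install / make check speedup measured in [0] comes from.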
Re: [HACKERS] sortsupport for text
On 19 June 2012 19:44, Peter Geoghegan pe...@2ndquadrant.com wrote: PostgreSQL supported Unicode before 2005, when the tie-breaker was introduced. I know at least one Swede who used Postgres95. I just took a look at the REL6_4 branch, and it looks much the same in 1999 as it did in 2005, in that there is no tie-breaker after the strcoll(). Now, that being the case, and Hungarian in particular having a whole bunch of these equivalencies, I have to wonder if the original complainant's problem really was diagnosed correctly. It could have had something to do with the fact that texteq() was confused about whether it reported equality or equivalency - it may have taken that long for the (len1 != len2) fastpath thing (which only holds for equality, not equivalence, despite the fact that the 2005-era strcoll() call checks equivalence within texteq()) to trip someone up, because texteq() would have thereby given inconsistent answers in a very subtle way - answers that were correct neither according to the Hungarian locale, nor according to simple bitwise equality.

It seems likely that this is more-or-less correct. The two equivalent strings had different numbers of characters, so the fastpath made texteq() not accord with varstr_cmp(), even though texteq() itself also only had a single strcoll() call. So there were some tuples with the Hungarian string 'potty' in the 2005 bug report. They were not visible to any of the queries shown in the test cases, even the good ones. There were a few tuples with equivalent strings like 'potyty' that were visible. The index scans didn't fail to return the expected tuples because the indexes were internally inconsistent or otherwise corrupt. Rather, the _bt_checkkeys() function returned early because the faulty half-equivalence, half-equality texteq() comparator reported that the tuple returned didn't satisfy the qual, and on that basis the index scan stopped. This usually wouldn't happen with Swedish, because their equivalencies tend to be one character long, and are less common.

So far, so good, but how did this not blow up sooner? Did Hungarians only start using Postgres in late 2005, immediately after the 8.1 release? Hardly. Commit c1d62bfd00f4d1ea0647e12947ca1de9fea39b33, made in late 2003, "Add operator strategy and comparison-value datatype fields to ScanKey", may be part of the problem here. Consider this test within _bt_checkkeys(), which was changed by that commit:

-		if (key->sk_flags & SK_COMMUTE)
-			test = FunctionCall2(&key->sk_func,
-								 key->sk_argument, datum);
-		else
-			test = FunctionCall2(&key->sk_func,
-								 datum, key->sk_argument);
+		test = FunctionCall2(&key->sk_func, datum, key->sk_argument);

-		if (DatumGetBool(test) == !!(key->sk_flags & SK_NEGATE))
+		if (!DatumGetBool(test))

I think that this change may have made the difference between the Hungarians getting away with it and not getting away with it. Might it have been that for text, they were using some operator that wasn't '=' (perhaps one which has no fastpath, and thus correctly made a representation about equivalency) rather than texteq() prior to this commit? I didn't eyeball the pg_amop entries of the era myself, but it seems quite possible. In any case, I find it hard to believe that it took at least ten years for this problem to manifest itself just because it took that long for a Hungarian with a strcoll() implementation that correctly represented equivalency to use Postgres. A year is a more plausible window.
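The trap can be shown in miniature (a self-contained sketch; it assumes a platform whose hu_HU locale treats the two spellings as collation-equivalent, as in the 2005 report):

#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    const char *a = "potty";
    const char *b = "potyty";   /* equivalent under Hungarian collation */

    if (setlocale(LC_COLLATE, "hu_HU.UTF-8") == NULL)
        return 1;               /* locale not installed */

    /* 2005-era texteq()-style fastpath: different lengths => "not equal" */
    int fastpath_eq = (strlen(a) == strlen(b) && strcmp(a, b) == 0);

    /* varstr_cmp()-style comparison: strcoll() reports equivalence */
    int strcoll_eq = (strcoll(a, b) == 0);

    /* An index scan that trusts both answers gets inconsistent results. */
    printf("fastpath: %d, strcoll: %d\n", fastpath_eq, strcoll_eq);
    return 0;
}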
If we do introduce an idea of equivalency to make all this work, that means there'll have to be equivalency verification whenever equality verification returns false, in a number of places, including the above. For GiST, there is an equivalent test - dubbed the "consistent" function - which will still vary based on the strategy number used; it seems analogous to the above. So, you're going to have an extra strcoll()/strxfrm() + strcmp() here, as part of a not-equal-but-maybe-equivalent test, which is bad. However, if that means that we can cache a text constant as a strxfrm() blob, and compare in a strxfrm()-wise fashion, that will more than pay for itself, even for btree traversal alone. For a naive strxfrm() + strcoll() implementation, that will save just under half of the work, and everyone knows that the cost of the comparison is what dominates here, particularly for certain collations. We'd probably formalise it to the point where there'd be a btree strategy number and a fully-fledged equivalency operator that the user could conceivably use themselves.

There seems to be scope creep here. I'm not sure that I should continue with this as part of this review. Maybe this should be something that I work on for the next commitfest. It would be nice to hear what others thought of these ideas before I actually start writing a patch that both fixes these problems (our behaviour is incorrect for some locales
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Heikki Linnakangas wrote: I don't like the idea of adding the origin id to the record header. It's only required in some occasions, and on some record types. Right. And I'm worried it might not even be enough in more complicated scenarios. Perhaps we need a more generic WAL record annotation system, where a plugin can tack arbitrary information to WAL records. The extra information could be stored in the WAL record after the rmgr payload, similar to how backup blocks are stored. WAL replay could just ignore the annotations, but a replication system could use it to store the origin id or whatever extra information it needs. Not only would that handle absolute versus relative updates and origin id, but application frameworks could take advantage of such a system for passing transaction metadata. I've held back on one concern so far that I'll bring up now because this suggestion would address it nicely. Our current trigger-driven logical replication includes a summary which includes transaction run time, commit time, the transaction type identifier, the source code line from which that transaction was invoked, the user ID with which the user connected to the application (which isn't the same as the database login), etc. Being able to decorate a database transaction with arbitrary (from the DBMS POV) metadata would be very valuable. In fact, our shop can't maintain the current level of capabilities without *some* way to associate such information with a transaction. I think that using up the only unused space in the fixed header to capture one piece of the transaction metadata needed for logical replication, and that only in some configurations, is short-sighted. If we solve the general problem of transaction metadata, this one specific case will fall out of that. I think removing origin ID from this patch and submitting a separate patch for a generalized transaction metadata system is the sensible way to go. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
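To make the annotation idea concrete, an annotated record might be laid out roughly like this (purely a sketch of the proposal in this thread, not an implemented format; the struct and field names are invented):

/*
 * Hypothetical layout: annotations trail the rmgr payload the same way
 * backup blocks do.  Plain WAL redo skips them; a logical replication
 * system (or an application framework) reads the chunks it registered.
 *
 *   +----------------------+
 *   | XLogRecord header    |
 *   | rmgr payload         |
 *   | backup blocks ...    |
 *   | annotation chunks    |   <- origin id, commit metadata, ...
 *   +----------------------+
 */
typedef struct XLogRecAnnotation
{
    uint16      ann_plugin_id;  /* which plugin wrote this chunk */
    uint16      ann_len;        /* length of data[] in bytes */
    char        data[1];        /* variable-length payload */
} XLogRecAnnotation;

Under such a scheme the transaction metadata described above (run time, application user ID, source line, ...) would just be one more chunk attached to the commit record.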
Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)
On Wed, Jun 20, 2012 at 6:59 AM, Andres Freund and...@2ndquadrant.com wrote: My guess is that it wouldn't be too hard to remove some of the extra pointers. Anyone who is using Dllist as a non-inline list could be converted to List * instead. There are only three users of the whole dllist.h: Catcache, autovacuum and postmaster. The latter two just keep a list of databases around. So any change will only be moderately intrusive. Yeah.

Also, the performance-critical things could be reimplemented as macros. I question, though, whether we really need both singly and doubly linked lists. That seems like it's almost certainly a micro-optimization that we are better off not doing. It was certainly worthwhile for the memory manager (lower per-allocation overhead). You might be right that it's not worth it for many other possible use cases in pg. It's not much code, though. *looks around* A quick grep found: singly-linked-list-like code: guc_private.h, aset.c, elog.h, regguts.h (ok, irrelevant), dynahash.c, resowner.c, extension.c, pgstat.c, xlog.c. Doubly-linked-like code: shmqueue.c, lwlock.c, dynahash.c, xact.c. I stopped at that point because, while surely not all of the above use cases could be replaced by a common implementation, several could, from a quick look. Also, several pg_list.h users could benefit from a conversion. So I think adding a singly linked list implementation is worthwhile.

I can believe that, although I fear it may be a distraction in the grand scheme of getting logical replication implemented. There should be very few places where this is actually performance-critical, and Tom will complain about large amounts of code churn that don't improve performance. If we're going to do that, how about transforming dllist.h into the doubly-linked list and adding sllist.h for the singly-linked list?

The most contentious point is probably relying on USE_INLINE being available anywhere. Which I believe to be a safe assumption now that we have gotten rid of some platforms. I would be hesitant to chuck that even though I realize it's unlikely that we really need !USE_INLINE. But see sortsupport for an example of how we've handled this in the recent past. I agree it's possible to resolve this. But hell, do we really need to add all that ugliness in 2012? I don't think it's worth the effort of supporting ancient compilers that don't support inline anymore. If we could stop catering for probably non-existent compilers we could remove some *very* ugly long macros, for example (e.g. in htup.h).

I don't feel qualified to make a decision on this one, so will defer to the opinions of others.

If support for !USE_INLINE is required I would prefer to have a header define the functions like #ifdef USE_INLINE #define OPTIONALLY_INLINE static inline #define USE_LINKED_LIST_IMPL #endif #ifdef USE_LINKED_LIST_IMPL OPTIONALLY_INLINE void myFuncCall(){ ... } #endif which then gets included with #define USE_LINKED_LIST_IMPL by some .c file defining OPTIONALLY_INLINE to something empty if !USE_INLINE. It's too much code to duplicate, IMO.

Neat trick. Maybe we should revise the sortsupport stuff to do it that way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund and...@2ndquadrant.com wrote: As I said before, I definitely agree that we want to have a separate transport format once we have decoding nailed down. We still need to ship wal around if the decoding happens in a different instance, but *after* that it can be shipped in something more convenient/appropriate. Right, OK, we agree on this then. One bit is fine if you have only very simple replication topologies. Once you think about globally distributed databases its a bit different. You describe some of that below, but just to reiterate: Imagine having 6 nodes, 3 on one of two continents (ABC in north america, DEF in europe). You may only want to have full intercontinental interconnect between two of those (say A and D). If you only have one bit to represent the origin thats not going to work because you won't be able discern the changes from BC on A from the changes from those originating on DEF. I don't see the problem. A certainly knows via which link the LCRs arrived. So: change happens on A. A sends the change to B, C, and D. B and C apply the change. One bit is enough to keep them from regenerating new LCRs that get sent back to A. So they're fine. D also receives the changes (from A) and applies them, but it also does not need to regenerate LCRs. Instead, it can take the LCRs that it has already got (from A) and send those to E and F. Or: change happens on B. B sends the changes to A. Since A knows the network topology, it sends the changes to C and D. D sends them to E and F. Nobody except B needs to *generate* LCRs. All any other node needs to do is suppress *redundant* LCR generation. Another topology which is interesting is circular replications (i.e. changes get shipped A-B, B-C, C-A) which is a sensible topology if you only have a low change rate and a relatively high number of nodes because you don't need the full combinatorial amount of connections. I think this one is OK too. You just generate LCRs on the origin node and then pass them around the ring at every step. When the next hop would be the origin node then you're done. I think you may be imagining that A generates LCRs and sends them to B. B applies them, and then from the WAL just generated, it produces new LCRs which then get sent to C. If you do that, then, yes, everything that you need to disentangle various network topologies must be present in WAL. But what I'm saying is: don't do it like that. Generate the LCRs just ONCE, at the origin node, and then pass them around the network, applying them at every node. Then, the information that is needed in WAL is confined to one bit: the knowledge of whether or not a particular transaction is local (and thus LCRs should be generated) or non-local (and thus they shouldn't, because the origin already generated them and thus we're just handing them around to apply everywhere). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)
On Wednesday, June 20, 2012 02:51:30 PM Robert Haas wrote: On Wed, Jun 20, 2012 at 6:59 AM, Andres Freund and...@2ndquadrant.com wrote: Also, the performance-critical things could be reimplemented as macros. I question, though, whether we really need both singly and doubly linked lists. That seems like it's almost certainly a micro-optimization that we are better off not doing. It was certainly worthwhile for the memory manager (lower per-allocation overhead). You might be right that it's not worth it for many other possible use cases in pg. It's not much code, though. *looks around* A quick grep found: singly-linked-list-like code: guc_private.h, aset.c, elog.h, regguts.h (ok, irrelevant), dynahash.c, resowner.c, extension.c, pgstat.c, xlog.c. Doubly-linked-like code: shmqueue.c, lwlock.c, dynahash.c, xact.c. I stopped at that point because, while surely not all of the above use cases could be replaced by a common implementation, several could, from a quick look. Also, several pg_list.h users could benefit from a conversion. So I think adding a singly linked list implementation is worthwhile. I can believe that, although I fear it may be a distraction in the grand scheme of getting logical replication implemented. There should be very few places where this is actually performance-critical, and Tom will complain about large amounts of code churn that don't improve performance.

Uh, I don't want to just go around and replace anything randomly. Actually I don't want to change anything for now except what's necessary to get the patch in. The point I tried to make was just that the relatively widespread usage of similar structures makes it likely that it can be used in more places in the future.

If we're going to do that, how about transforming dllist.h into the doubly-linked list and adding sllist.h for the singly-linked list?

I would be fine with that. I will go and try to cook up a patch, assuming for now that we rely on inline; the ugliness can be added back afterwards.

The most contentious point is probably relying on USE_INLINE being available anywhere. Which I believe to be a safe assumption now that we have gotten rid of some platforms. I would be hesitant to chuck that even though I realize it's unlikely that we really need !USE_INLINE. But see sortsupport for an example of how we've handled this in the recent past. I agree it's possible to resolve this. But hell, do we really need to add all that ugliness in 2012? I don't think it's worth the effort of supporting ancient compilers that don't support inline anymore. If we could stop catering for probably non-existent compilers we could remove some *very* ugly long macros, for example (e.g. in htup.h). I don't feel qualified to make a decision on this one, so will defer to the opinions of others.

Ok.

If support for !USE_INLINE is required I would prefer to have a header define the functions like #ifdef USE_INLINE #define OPTIONALLY_INLINE static inline #define USE_LINKED_LIST_IMPL #endif #ifdef USE_LINKED_LIST_IMPL OPTIONALLY_INLINE void myFuncCall(){ ... } #endif which then gets included with #define USE_LINKED_LIST_IMPL by some .c file defining OPTIONALLY_INLINE to something empty if !USE_INLINE. It's too much code to duplicate, IMO. Neat trick. Maybe we should revise the sortsupport stuff to do it that way.

Either that, or at least add a comment to both that it's duplicated...
Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote: The idea that logical rep is some kind of useful end goal in itself is slightly misleading. If the thought is to block multi-master completely on that basis, that would be a shame. Logical rep is the mechanism for implementing multi-master. If you're saying that single-master logical replication isn't useful, I disagree. Of course, having both single-master and multi-master replication together is even more useful. But I think getting even single-master logical replication working well in a single release cycle is going to be a job and a half. Thinking that we're going to get MMR in one release is not realistic. The only way to make it realistic is to put MMR ahead of every other goal that people have for logical replication, including robustness and stability. It's entirely premature to be designing features for MMR when we don't even have the design for SMR nailed down yet. And that's even assuming we EVER want MMR in core, which has not even really been argued, let alone agreed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 03:19:55 PM Robert Haas wrote: On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote: The idea that logical rep is some kind of useful end goal in itself is slightly misleading. If the thought is to block multi-master completely on that basis, that would be a shame. Logical rep is the mechanism for implementing multi-master. If you're saying that single-master logical replication isn't useful, I disagree. Of course, having both single-master and multi-master replication together is even more useful. But I think getting even single-master logical replication working well in a single release cycle is going to be a job and a half. Thinking that we're going to get MMR in one release is not realistic. The only way to make it realistic is to put MMR ahead of every other goal that people have for logical replication, including robustness and stability. It's entirely premature to be designing features for MMR when we don't even have the design for SMR nailed down yet. And that's even assuming we EVER want MMR in core, which has not even really been argued, let alone agreed.

I agree it has not been agreed upon, but I certainly would consider submitting a prototype implementing it an argument for doing it ;)

Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata
I'm almost inclined to suggest that we not get next-LSN from WAL, but by scanning all the pages in the main data store and computing the max observed LSN. This is clearly not very attractive from a performance standpoint, but it would avoid the obvious failure mode where you lost some recent WAL segments along with pg_control.

According to my analysis, this will have a problem. I will explain it with an example scenario. Let us assume that the database crashes and could be recovered by doing crash recovery. Now assume we have the data files and WAL files intact and only the control file is lost. The user runs pg_resetxlog to generate a pg_control file, and we use the new algorithm to generate the next-LSN.

Summary of events before the database crash:
1. A checkpoint is in progress; it has already noted the next-LSN location (LSN-107) and marked the dirty pages as BM_CHECKPOINT_NEEDED.
2. At this point a new transaction dirties 2 pages: first it dirties a fresh page (for this change, LSN-108) and then one which is already marked as BM_CHECKPOINT_NEEDED (for this change, LSN-109).
3. The checkpoint starts flushing pages.
4. It flushes the page with LSN-109 but not the page with LSN-108.
5. The checkpoint finishes.
6. The database crashes.

Normal crash recovery will start the replay from 107, and after recovery the database will be in a consistent state. pg_resetxlog, however, will generate the next-LSN point as 109, which, when used for recovery, will produce an inconsistent database: the page stamped LSN-108 never reached disk, and replay starting at 109 will never redo that change. Had we relied on WAL, we would have got the next-LSN as 107. This is just an example case to show that there can be problems generating the next-LSN from pages; however, it doesn't prove that generating it from WAL will always be correct. Please correct my understanding if I am wrong.

With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
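For reference, Tom's scan-all-pages idea amounts to something like this (a sketch only; it assumes 9.2-era 8 kB pages whose header begins with pd_lsn stored as two 32-bit halves, and it ignores segment files, torn pages and the like):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

/* Return the highest page LSN observed in one relation file. */
static uint64_t
max_lsn_in_file(FILE *f)
{
    unsigned char page[BLCKSZ];
    uint64_t    max_lsn = 0;

    while (fread(page, BLCKSZ, 1, f) == 1)
    {
        uint32_t    xlogid, xrecoff;
        uint64_t    lsn;

        /* pd_lsn is the first field of the page header */
        memcpy(&xlogid, page, 4);
        memcpy(&xrecoff, page + 4, 4);
        lsn = ((uint64_t) xlogid << 32) | xrecoff;
        if (lsn > max_lsn)
            max_lsn = lsn;
    }
    return max_lsn;
}

Amit's scenario above is exactly the case this misses: the maximum on-disk LSN (109) can be newer than the oldest WAL that still needs replaying (107).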
[HACKERS] [PATCH] Add some more documentation for array indexes/operators
I had trouble finding what operators arrays support, or which ones have index support, or even determining that arrays could be indexed, from the documentation for the array data type. So, patch.

-Ryan Kelly

diff --git a/doc/src/sgml/array.sgml b/doc/src/sgml/array.sgml
index 3508ba3..51d996d 100644
--- a/doc/src/sgml/array.sgml
+++ b/doc/src/sgml/array.sgml
@@ -573,6 +573,33 @@ SELECT * FROM
   This function is described in <xref linkend="functions-srf-subscripts">.
  </para>

+ <para>
+  You can also search with arrays using the <literal>@&gt;</literal>,
+  <literal>&lt;@</literal>, and <literal>&amp;&amp;</literal> operators, which check
+  whether the left operand contains, is contained by, or overlaps with
+  the right operand, respectively.  For instance:
+
+<programlisting>
+SELECT * FROM sal_emp WHERE pay_by_quarter &amp;&amp; ARRAY[1];
+</programlisting>
+
+  will find all rows where the <structfield>pay_by_quarter</structfield>
+  overlaps with an array containing <literal>1</literal>, that is, any
+  employee with a quarterly salary of <literal>1</literal> will be matched.
+ </para>
+
+ <para>
+  See <xref linkend="arrays"> for more details about array operator
+  behavior.
+ </para>
+
+ <para>
+  Additionally, a number of operators support indexed operations.  This means
+  that the example above could utilize an index to efficiently locate those
+  employees with the desired quarterly salary.  See <xref linkend="indexes-types">
+  for more details about which operators support indexed operations.
+ </para>
+
  <tip>
   <para>
    Arrays are not sets; searching for specific array elements
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index cd374ac..2cef8e4 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10282,7 +10282,8 @@ SELECT NULLIF(value, '(none)') ...
  <para>
   See <xref linkend="arrays"> for more details about array operator
-  behavior.
+  behavior.  See <xref linkend="indexes-types"> for more details about
+  which operators support indexed operations.
  </para>

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
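A hypothetical companion example of the index support the patch points readers at (standard PostgreSQL syntax; sal_emp comes from the docs' examples, the index name is invented):

-- The default GIN operator class for arrays supports &&, @>, <@ and =,
-- so the documented search can become an index scan:
CREATE INDEX sal_emp_pay_by_quarter_idx ON sal_emp USING gin (pay_by_quarter);

SELECT * FROM sal_emp WHERE pay_by_quarter && ARRAY[10000];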
Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)
On Wed, Jun 20, 2012 at 9:12 AM, Andres Freund and...@2ndquadrant.com wrote: Uh, I don't want to just go around and replace anything randomly. Actually I don't want to change anything for now except what's necessary to get the patch in. The point I tried to make was just that the relatively widespread usage of similar structures makes it likely that it can be used in more places in the future. Well, the question is, for anywhere you might be thinking of using this: why not just use List? We do that in a lot of other places, and there's not much reason to invent something new unless there is a problem with what we already have. I assume this is related to logical replication somehow, but it's not clear to me exactly what problem you hit doing this in the obvious way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 21:19, Robert Haas robertmh...@gmail.com wrote: On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote: The idea that logical rep is some kind of useful end goal in itself is slightly misleading. If the thought is to block multi-master completely on that basis, that would be a shame. Logical rep is the mechanism for implementing multi-master. If you're saying that single-master logical replication isn't useful, I disagree. Of course, having both single-master and multi-master replication together is even more useful. But I think getting even single-master logical replication working well in a single release cycle is going to be a job and a half. OK, so your estimate is 1.5 people to do that. And if we have more people, should they sit around doing nothing? Thinking that we're going to get MMR in one release is not realistic. If you block it, then the above becomes true, whether or not it starts true. You may not want MMR, but others do. I see no reason to prevent people from having it, which is what you suggest. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)
On Wednesday, June 20, 2012 03:24:58 PM Robert Haas wrote: On Wed, Jun 20, 2012 at 9:12 AM, Andres Freund and...@2ndquadrant.com wrote: Uh, I don't want to just go around and replace anything randomly. Actually I don't want to change anything for now except what's necessary to get the patch in. The point I tried to make was just that the relatively widespread usage of similar structures makes it likely that it can be used in more places in the future. Well, the question is, for anywhere you might be thinking of using this: why not just use List? We do that in a lot of other places, and there's not much reason to invent something new unless there is a problem with what we already have. I assume this is related to logical replication somehow, but it's not clear to me exactly what problem you hit doing this in the obvious way.

It incurs a rather high performance overhead due to added memory allocations and added pointer indirections. That's fine for most of the current users of the List interface, but certainly not for all. In other places you cannot even have memory allocations, because the list lives in shared memory.

E.g. in the ApplyCache, where I use the submitted ilist.h stuff, when reconstructing transactions you add to a potentially really long linked list of individual changes for every interesting WAL record. Before I prevented memory allocations in that path, it took about 12-14% of the time when applying changes in the same backend. Afterwards it wasn't visible in the profile anymore.

Several of the pieces of code I pointed out in a previous email use an open-coded list implementation exactly to prevent those problems. If you look at the parsing, planning and execution of trivial statements you will also notice the overhead of memory allocations. A good bit of those is caused by list manipulation. Check Stephen Frost's pre-alloc ListCell's optimization for workarounds.

Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata
On Wed, Jun 20, 2012 at 9:21 AM, Amit Kapila amit.kap...@huawei.com wrote: Example Scenario - Now assume we have data files and WAL files intact and only the control file is lost. Just so I understand correctly, the aim of this is to fix the situation where, out of the thousands of files and 100s of GB of data in my pg directory, the *only* corruption is that a single file, pg_control, is missing? a. -- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote: On 20 June 2012 21:19, Robert Haas robertmh...@gmail.com wrote: On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote: The idea that logical rep is some kind of useful end goal in itself is slightly misleading. If the thought is to block multi-master completely on that basis, that would be a shame. Logical rep is the mechanism for implementing multi-master. If you're saying that single-master logical replication isn't useful, I disagree. Of course, having both single-master and multi-master replication together is even more useful. But I think getting even single-master logical replication working well in a single release cycle is going to be a job and a half. OK, so your estimate is 1.5 people to do that. And if we have more people, should they sit around doing nothing? Oh, give me a break. You're willfully missing my point. And to quote Fred Brooks, nine women can't make a baby in one month. Thinking that we're going to get MMR in one release is not realistic. If you block it, then the above becomes true, whether or not it starts true. If I had no rational basis for my objections, that would be true. You've got four people objecting to this patch now, all of whom happen to be committers. Whether or not MMR goes into core, who knows, but it doesn't seem that this patch is going to fly. My main point in bringing this up is that if you pick a project that is too large, you will fail. As I would rather see this project succeed, I recommend that you don't do that. Both you and Andres seem to believe that MMR is a reasonable first target to shoot at, but I don't think anyone else - including the Slony developers who have commented on this issue - endorses that position. At PGCon, you were talking about getting a new set of features into PG over the next 3-5 years. Now, it seems like you want to compress that timeline to a year. I don't think that's going to work. You also requested that people tell you sooner when large patches were in danger of not making the release. Now I'm doing that, VERY early, and you're apparently angry about it. If the only satisfactory outcome of this conversation is that everyone agrees with the design pattern you've already decided on, then you haven't left yourself very much room to walk away satisfied. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 03:02:28 PM Robert Haas wrote: On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund and...@2ndquadrant.com wrote: One bit is fine if you have only very simple replication topologies. Once you think about globally distributed databases it's a bit different. You describe some of that below, but just to reiterate: Imagine having 6 nodes, 3 on each of two continents (ABC in North America, DEF in Europe). You may only want to have full intercontinental interconnect between two of those (say A and D). If you only have one bit to represent the origin that's not going to work, because you won't be able to discern the changes from B and C on A from the changes originating on DEF.

I don't see the problem. A certainly knows via which link the LCRs arrived. So: change happens on A. A sends the change to B, C, and D. B and C apply the change. One bit is enough to keep them from regenerating new LCRs that get sent back to A. So they're fine. D also receives the changes (from A) and applies them, but it also does not need to regenerate LCRs. Instead, it can take the LCRs that it has already got (from A) and send those to E and F. Or: change happens on B. B sends the changes to A. Since A knows the network topology, it sends the changes to C and D. D sends them to E and F. Nobody except B needs to *generate* LCRs. All any other node needs to do is suppress *redundant* LCR generation. Another topology which is interesting is circular replication (i.e. changes get shipped A-B, B-C, C-A), which is a sensible topology if you only have a low change rate and a relatively high number of nodes, because you don't need the full combinatorial number of connections. I think this one is OK too. You just generate LCRs on the origin node and then pass them around the ring at every step. When the next hop would be the origin node then you're done. I think you may be imagining that A generates LCRs and sends them to B. B applies them, and then from the WAL just generated, it produces new LCRs which then get sent to C.

Yes, that's what I am proposing.

If you do that, then, yes, everything that you need to disentangle various network topologies must be present in WAL. But what I'm saying is: don't do it like that. Generate the LCRs just ONCE, at the origin node, and then pass them around the network, applying them at every node. Then, the information that is needed in WAL is confined to one bit: the knowledge of whether or not a particular transaction is local (and thus LCRs should be generated) or non-local (and thus they shouldn't, because the origin already generated them and thus we're just handing them around to apply everywhere).

Sure, you can do it that way, but I don't think it's a good idea. If you do it my way you *guarantee* that when replaying changes from node B on node C, you have replayed changes from A at least as far as B has. That's a really nice property for MM. You *can* get the same with your solution, but it starts to get complicated rather fast, while my/our proposed solution is trivial to implement.

Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 20:37, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Heikki Linnakangas wrote: I don't like the idea of adding the origin id to the record header. It's only required in some occasions, and on some record types. Right. Wrong, as explained. And I'm worried it might not even be enough in more complicated scenarios. Perhaps we need a more generic WAL record annotation system, where a plugin can tack arbitrary information to WAL records. The extra information could be stored in the WAL record after the rmgr payload, similar to how backup blocks are stored. WAL replay could just ignore the annotations, but a replication system could use it to store the origin id or whatever extra information it needs. Not only would that handle absolute versus relative updates and origin id, but application frameworks could take advantage of such a system for passing transaction metadata. I've held back on one concern so far that I'll bring up now because this suggestion would address it nicely. Our current trigger-driven logical replication includes a summary which includes transaction run time, commit time, the transaction type identifier, the source code line from which that transaction was invoked, the user ID with which the user connected to the application (which isn't the same as the database login), etc. Being able to decorate a database transaction with arbitrary (from the DBMS POV) metadata would be very valuable. In fact, our shop can't maintain the current level of capabilities without *some* way to associate such information with a transaction. I think that using up the only unused space in the fixed header to capture one piece of the transaction metadata needed for logical replication, and that only in some configurations, is short-sighted. If we solve the general problem of transaction metadata, this one specific case will fall out of that. The proposal now includes flag bits that would allow the addition of a variable-length header, should that ever become necessary. So the unused space in the fixed header is not being used up as you say. In any case, the fixed header still has 4 wasted bytes on 64-bit systems even after the patch is applied. So this claim of short-sightedness is just plain wrong. It isn't true that this is needed only for some configurations of multi-master, per discussion. This is not transaction metadata, it is WAL record metadata required for multi-master replication, see later point. We need to add information to every WAL record that is used as the source for generating LCRs. It is also possible to add this to HEAP and HEAP2 records, but doing that *will* bloat the WAL stream, whereas using the *currently wasted* bytes on a WAL record header does *not* bloat the WAL stream. I think removing origin ID from this patch and submitting a separate patch for a generalized transaction metadata system is the sensible way to go. We already have a very flexible WAL system for recording data of interest to various resource managers. If you wish to annotate a transaction, you can either generate a new kind of WAL record or you can enhance a commit record. There are already unused flag bits on commit records for just such a purpose. XLOG_NOOP records can already be generated by your application if you wish to inject additional metadata to the WAL stream. So no changes are required for you to implement the generalised transaction metadata scheme you say you require. Not sure how or why that relates to requirements for multi-master.
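For reference, the fixed header under discussion looks roughly like the sketch below (field layout loosely paraphrased from the contemporary XLogRecord; this is an illustration, not the patch). The point is that the origin id would sit in the two alignment-padding bytes after xl_rmid, which is why the record size stays the same:

    #include <stdint.h>

    typedef struct XLogRecordSketch
    {
        uint32_t    xl_crc;       /* CRC for this record */
        uint64_t    xl_prev;      /* pointer to previous record (simplified;
                                   * the real type was a two-part struct) */
        uint32_t    xl_xid;       /* transaction id */
        uint32_t    xl_tot_len;   /* total length of entire record */
        uint32_t    xl_len;       /* length of rmgr data */
        uint8_t     xl_info;      /* flag bits */
        uint8_t     xl_rmid;      /* resource manager id */
        uint16_t    xl_origin;    /* proposed: origin node id, occupying the
                                   * two bytes that were previously padding */
    } XLogRecordSketch;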
Please note that I've suggested review changes to Andres' work myself. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 06/19/2012 01:47 AM, Christopher Browne wrote: That numbering scheme gets pretty anti-intuitive fairly quickly, from whence we took the approach of having a couple digits indicating data centre followed by a digit indicating which node in that data centre. If that all sounds incoherent, well, the more nodes you have around, the more difficult it becomes to make sure you *do* have a coherent picture of your cluster. I recall the Slony-II project having a notion of attaching a permanent UUID-based node ID to each node. As long as there is somewhere decent to find a symbolically significant node name, I like the idea of the ID *not* being in a tiny range, and being UUID/OID-like... Just as a sidenote, MySQL's new global transaction ids use a UUID for the serverid bit of it. [1] - Hannu Valtonen [1] http://d2-systems.blogspot.fi/2012/04/global-transaction-identifiers-are-in.html -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund and...@2ndquadrant.com wrote: If you do that, then, yes, everything that you need to disentangle various network topologies must be present in WAL. But what I'm saying is: don't do it like that. Generate the LCRs just ONCE, at the origin node, and then pass them around the network, applying them at every node. Then, the information that is needed in WAL is confined to one bit: the knowledge of whether or not a particular transaction is local (and thus LCRs should be generated) or non-local (and thus they shouldn't, because the origin already generated them and thus we're just handing them around to apply everywhere). Sure, you can do it that way, but I don't think it's a good idea. If you do it my way you *guarantee* that when replaying changes from node B on node C you have replayed changes from A at least as far as B has. That's a really nice property for MM. You *can* get the same with your solution, but it starts to get complicated rather fast, while my/our proposed solution is trivial to implement. That's an interesting point. I agree that's a useful property, and might be a reason not to just use a single-bit flag, but I still think you'd be better off handling that requirement via some other method, like logging the node ID once per transaction or something. That lets you have as much metadata as you end up needing, which is a lot more flexible than a 16-bit field, as Kevin, Heikki, and Tom have also said. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
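A hedged sketch of that once-per-transaction alternative; the record layout and wal_append() are stand-ins rather than real PostgreSQL API:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct xl_xact_origin
    {
        uint32_t    xid;            /* transaction being annotated */
        uint16_t    origin_node;    /* node where it first executed */
        uint16_t    flags;          /* room for more metadata later */
    } xl_xact_origin;

    /* stand-in for XLogInsert(); a real version would build XLogRecData */
    static void
    wal_append(uint8_t rmid, const void *data, size_t len)
    {
        (void) rmid;
        (void) data;
        (void) len;
    }

    /* Emit one annotation record per transaction instead of widening
     * every record header. */
    static void
    log_transaction_origin(uint32_t xid, uint16_t origin_node)
    {
        xl_xact_origin rec;

        memset(&rec, 0, sizeof(rec));
        rec.xid = xid;
        rec.origin_node = origin_node;
        wal_append(0, &rec, sizeof(rec));   /* rmgr id is hypothetical */
    }

The trade-off Andres raises downthread is that a per-transaction record can only be matched up with its changes after the transaction has been reassembled, whereas a per-record id allows filtering in a streaming fashion.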
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 03:42:39 PM Robert Haas wrote: On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote: On 20 June 2012 21:19, Robert Haas robertmh...@gmail.com wrote: On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote: The idea that logical rep is some kind of useful end goal in itself is slightly misleading. If the thought is to block multi-master completely on that basis, that would be a shame. Logical rep is the mechanism for implementing multi-master. If you're saying that single-master logical replication isn't useful, I disagree. Of course, having both single-master and multi-master replication together is even more useful. But I think getting even single-master logical replication working well in a single release cycle is going to be a job and a half. Thinking that we're going to get MMR in one release is not realistic. If you block it, then the above becomes true, whether or not it starts true. My main point in bringing this up is that if you pick a project that is too large, you will fail. We're not the only ones here performing scope creep, though... I think just about everyone who has posted in the whole thread, except maybe Tom and Marko, is guilty of doing so. I still think it's rather sensible to focus on exactly duplicated schemas in a very first version, just because that leaves out some of the complexity while paving the road for other nice things. You've got four people objecting to this patch now, all of whom happen to be committers. Whether or not MMR goes into core, who knows, but it doesn't seem that this patch is going to fly. I find that a bit too early to say. Sure, it won't fly exactly as proposed, but hell, who cares? What I want to get in is a solution to the specific problem the patch targets. At least you have - not sure about others - accepted that the problem needs a solution. We do not agree yet on what that solution should look like, but that's not exactly surprising, as we started discussing the problem only a good day ago. If people agree that your proposed way of just one flag bit is the way to go, we will have to live with that. But that's different from saying the whole thing is dead. As I would rather see this project succeed, I recommend that you don't do that. Both you and Andres seem to believe that MMR is a reasonable first target to shoot at, but I don't think anyone else - including the Slony developers who have commented on this issue - endorses that position. I don't think we get full MMR into 9.3. What I am proposing is that we build in the few pieces that are required to implement MMR *on top* of what's hopefully in 9.3. And I think that's a realistic goal. At PGCon, you were talking about getting a new set of features into PG over the next 3-5 years. Now, it seems like you want to compress that timeline to a year. Well, I obviously would like everything to be done in one release, but I also would like to go hiking for a year, have a restored sailing boat, and some more. That doesn't make it reality... To make it absolutely clear: I definitely don't think it's realistic to have everything in 9.3. And I don't think Simon does either. What I want is to have the basic building blocks in 9.3. The difference probably is just what's counted as a building block.
Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 03:54:43 PM Robert Haas wrote: On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund and...@2ndquadrant.com wrote: If you do that, then, yes, everything that you need to disentangle various network topologies must be present in WAL. But what I'm saying is: don't do it like that. Generate the LCRs just ONCE, at the origin node, and then pass them around the network, applying them at every node. Then, the information that is needed in WAL is confined to one bit: the knowledge of whether or not a particular transaction is local (and thus LCRs should be generated) or non-local (and thus they shouldn't, because the origin already generated them and thus we're just handing them around to apply everywhere). Sure, you can do it that way, but I don't think it's a good idea. If you do it my way you *guarantee* that when replaying changes from node B on node C you have replayed changes from A at least as far as B has. That's a really nice property for MM. You *can* get the same with your solution, but it starts to get complicated rather fast, while my/our proposed solution is trivial to implement. That's an interesting point. I agree that's a useful property, and might be a reason not to just use a single-bit flag, but I still think you'd be better off handling that requirement via some other method, like logging the node ID once per transaction or something. That lets you have as much metadata as you end up needing, which is a lot more flexible than a 16-bit field, as Kevin, Heikki, and Tom have also said. If it comes down to that, I can definitely live with it. I still think that making the filtering trivial, so it can be done without any logic at a low level, is a very desirable property, but if not, so be it. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 21:42, Robert Haas robertmh...@gmail.com wrote: On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote: On 20 June 2012 21:19, Robert Haas robertmh...@gmail.com wrote: On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote: The idea that logical rep is some kind of useful end goal in itself is slightly misleading. If the thought is to block multi-master completely on that basis, that would be a shame. Logical rep is the mechanism for implementing multi-master. If you're saying that single-master logical replication isn't useful, I disagree. Of course, having both single-master and multi-master replication together is even more useful. But I think getting even single-master logical replication working well in a single release cycle is going to be a job and a half. OK, so your estimate is 1.5 people to do that. And if we have more people, should they sit around doing nothing? Oh, give me a break. You're willfully missing my point. And to quote Fred Brooks, nine women can't make a baby in one month. No, I'm not. The question is not how quickly can N people achieve a single thing, but how long will it take a few skilled people working on carefully selected tasks that have few dependencies between them to achieve something. We have significantly more preparation, development time, and resources than any other project previously undertaken for PostgreSQL that I am aware of. Stating that it is impossible to perform a task in a certain period of time without even considering those points is clearly rubbish. I've arrived at my thinking based upon detailed project planning of what was possible in the time. How exactly did you arrive at your conclusion? Why is yours right and mine wrong? Thinking that we're going to get MMR in one release is not realistic. If you block it, then the above becomes true, whether or not it starts true. If I had no rational basis for my objections, that would be true. You've got four people objecting to this patch now, all of whom happen to be committers. Whether or not MMR goes into core, who knows, but it doesn't seem that this patch is going to fly. No, I have four people who had initial objections and who have not commented on the fact that the points made are regrettably incorrect. I don't expect everybody commenting on the design to have perfect knowledge of the whole design, so I expect people to make errors in their comments. I also expect people to take note of what has been said before making further objections or drawing conclusions. Since at least 3 of the people making such comments did not attend the full briefing meeting in Ottawa, I am not particularly surprised. However, I do expect people that didn't come to the meeting to recognise that they are likely to be missing information and to listen closely, as I listen to them. When the facts change, I change my mind. What do you do, sir? My main point in bringing this up is that if you pick a project that is too large, you will fail. As I would rather see this project succeed, I recommend that you don't do that. Both you and Andres seem to believe that MMR is a reasonable first target to shoot at, but I don't think anyone else - including the Slony developers who have commented on this issue - endorses that position. At PGCon, you were talking about getting a new set of features into PG over the next 3-5 years. Now, it seems like you want to compress that timeline to a year. I don't think that's going to work.
You also requested that people tell you sooner when large patches were in danger of not making the release. Now I'm doing that, VERY early, and you're apparently angry about it. If the only satisfactory outcome of this conversation is that everyone agrees with the design pattern you've already decided on, then you haven't left yourself very much room to walk away satisfied. Note that I have already myself given review feedback to Andres and that change has visibly occurred during this thread via public debate. Claiming that I only stick to what has already been decided is patently false, with me at least. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sortsupport for text
On Sun, Jun 17, 2012 at 9:26 PM, Tom Lane t...@sss.pgh.pa.us wrote: The trick for hashing such datatypes is to be able to guarantee that equal values hash to the same hash code, which is typically possible as long as you know the equality rules well enough. We could possibly do that for text with pure-strcoll equality if we knew all the details of what strcoll would consider equal, but we do not. It occurs to me that strxfrm would answer this question. If we made the hash function hash the result of strxfrm then we could make equality use strcoll and not fall back to strcmp. I suspect that in a green field that's what we would do, though the CPU cost might be enough to think hard about it. I'm not sure it's worth considering switching though. The cases where it matters to users, incidentally, are when you have a multi-column sort order and have values that are supposed to sort equal in the first column but print differently. Given that there seems to be some controversy in the locale definitions -- most locales seem to use insignificant factors like accents or ligatures as tie-breakers and avoid claiming different sequences are equal even when the language usually treats them as equivalent -- it doesn't seem super important to maintain the property for the few locales that fall the other way. Unless my impression is wrong and there's a good principled reason why some locales treat nearly equivalent strings one way and some treat them the other. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
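A toy illustration of the strxfrm()-based hashing Greg describes, assuming the current locale collates the two sample strings equal; djb2 stands in for the real hash function and error handling is omitted:

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static unsigned long
    collation_aware_hash(const char *s)
    {
        size_t      need = strxfrm(NULL, s, 0) + 1;
        char       *buf = malloc(need);
        unsigned long h = 5381;

        strxfrm(buf, s, need);
        for (char *p = buf; *p != '\0'; p++)
            h = h * 33 + (unsigned char) *p;
        free(buf);
        return h;
    }

    int
    main(void)
    {
        const char *a = "strasse";
        const char *b = "straße";

        setlocale(LC_COLLATE, "");  /* e.g. some de_DE locale */
        if (strcoll(a, b) == 0)     /* equal under strcoll ... */
            printf("hashes: %lu %lu\n",
                   collation_aware_hash(a), collation_aware_hash(b));
        return 0;
    }

Because strcmp() on strxfrm() images is defined to agree with strcoll(), strings that collate equal have identical images and therefore identical hashes.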
Re: [HACKERS] sortsupport for text
On 20 June 2012 15:10, Greg Stark st...@mit.edu wrote: On Sun, Jun 17, 2012 at 9:26 PM, Tom Lane t...@sss.pgh.pa.us wrote: The trick for hashing such datatypes is to be able to guarantee that equal values hash to the same hash code, which is typically possible as long as you know the equality rules well enough. We could possibly do that for text with pure-strcoll equality if we knew all the details of what strcoll would consider equal, but we do not. It occurs to me that strxfrm would answer this question. If we made the hash function hash the result of strxfrm then we could make equality use strcoll and not fall back to strcmp. What about per-column collations? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] foreign key locks
On Thu, Jun 14, 2012 at 11:41 AM, Alvaro Herrera alvhe...@alvh.no-ip.org wrote: Hi, This is v12 of the foreign key locks patch. Hi Álvaro, Just noticed that this patch needs a rebase because of the refactoring Tom did in ri_triggers.c -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 16:23, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 20.06.2012 11:17, Simon Riggs wrote: On 20 June 2012 15:45, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 20.06.2012 10:32, Simon Riggs wrote: On 20 June 2012 14:40, Heikki Linnakangas And I'm worried it might not even be enough in more complicated scenarios. It is not the only required conflict mechanism, and has never been claimed to be so. It is simply one piece of information needed, at various times. So, if the origin id is not sufficient for some conflict resolution mechanisms, what extra information do you need for those, and where do you put it? As explained elsewhere, wal_level = logical (or similar) would be used to provide any additional logical information required. Update and Delete WAL records already need to be different in that mode, so additional info would be placed there, if there were any. In the case of reflexive updates you raised, a typical response in other DBMSs would be to represent the query UPDATE SET counter = counter + 1 by sending just the +1 part, not the current value of counter, as would be the case with the non-reflexive update UPDATE SET counter = 1 Handling such things in Postgres would require some subtlety, which would not be resolved in the first release but is pretty certain not to require any changes to the WAL record header as a way of resolving it. Having already thought about it, I'd estimate that is a very long discussion and not relevant to the OT, but if you wish to have it here, I won't stop you. Yeah, I'd like to hear briefly how you would handle that without any further changes to the WAL record header. I already did: Update and Delete WAL records already need to be different in that mode, so additional info would be placed there, if there were any. The case you mentioned relates to UPDATEs only, so I would suggest that we add that information to a new form of update record only. That has nothing to do with the WAL record header. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
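As an illustration of the distinction (names and types invented; nothing here reflects actual WAL formats): a delta update accumulates on every node, while an absolute assignment resolves newest-wins:

    typedef enum
    {
        CHANGE_DELTA,               /* UPDATE ... SET counter = counter + 1 */
        CHANGE_ABSOLUTE             /* UPDATE ... SET counter = 10 */
    } change_kind;

    typedef struct counter_change
    {
        change_kind kind;
        long        value;          /* +1 for the delta, 10 for the reset */
        long        commit_time;    /* stand-in for a real timestamp */
    } counter_change;

    static void
    apply_counter_change(long *counter, long *last_reset_time,
                         const counter_change *c)
    {
        if (c->kind == CHANGE_DELTA)
            *counter += c->value;   /* accumulates across all origins */
        else if (c->commit_time > *last_reset_time)
        {
            *counter = c->value;    /* newest absolute assignment wins */
            *last_reset_time = c->commit_time;
        }
    }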
Re: [HACKERS] sortsupport for text
On Wed, Jun 20, 2012 at 3:19 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: It occurs to me that strxfrm would answer this question. If we made the hash function hash the result of strxfrm then we could make equality use strcoll and not fall back to strcmp. What about per-column collations? Well, collations aren't really per-column, they're specific to the comparison in the expression at hand. The per-column collation is just the default for comparisons against that column. So for a hash join, for example, you would build the hash table using strxfrm for the collation being used for the join expression. For hash indexes the index would only be valid for a specific collation, just like a btree index is. But this all seems like a lot of work for a case that most locales seem to think isn't worth worrying about. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ADMIN] PANIC while doing failover (streaming replication)
Hi, Any news on this issue? My slave server is not trusted; all the warnings below are related to indexes of the main tables:

2012-06-14 11:46:23 BRT [18654]: [34603-1] user=,db= LOG: recovery restart point at 435/4E899CE8
2012-06-14 11:46:23 BRT [18654]: [34604-1] user=,db= DETAIL: last completed transaction was at log time 2012-06-14 11:46:22.770348-03
2012-06-14 11:48:53 BRT [18654]: [34605-1] user=,db= LOG: restartpoint starting: time
2012-06-14 11:51:09 BRT [18653]: [18-1] user=,db= LOG: received promote request
2012-06-14 11:51:09 BRT [11879]: [2-1] user=,db= FATAL: terminating walreceiver process due to administrator command
2012-06-14 11:51:09 BRT [18653]: [19-1] user=,db= LOG: invalid record length at 435/5079CAE8
2012-06-14 11:51:09 BRT [18653]: [20-1] user=,db= LOG: redo done at 435/5079CAA8
2012-06-14 11:51:09 BRT [18653]: [21-1] user=,db= LOG: last completed transaction was at log time 2012-06-14 11:51:08.969771-03
2012-06-14 11:51:09 BRT [18653]: [22-1] user=,db= LOG: selected new timeline ID: 2
2012-06-14 11:51:09 BRT [18653]: [23-1] user=,db= LOG: archive recovery complete
2012-06-14 11:51:09 BRT [18653]: [24-1] user=,db= WARNING: page 193569 of relation base/16407/80416524 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [25-1] user=,db= WARNING: page 129229 of relation base/16407/38334886 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [26-1] user=,db= WARNING: page 134146 of relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [27-1] user=,db= WARNING: page 318013 of relation base/16407/38334887 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [28-1] user=,db= WARNING: page 134143 of relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [29-1] user=,db= WARNING: page 156203 of relation base/16407/38334883 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [30-1] user=,db= WARNING: page 318009 of relation base/16407/38334887 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [31-1] user=,db= WARNING: page 370150 of relation base/16407/38334888 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [32-1] user=,db= WARNING: page 133811 of relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [33-1] user=,db= WARNING: page 317963 of relation base/16407/38334884 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [34-1] user=,db= WARNING: page 133808 of relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [35-1] user=,db= WARNING: page 133809 of relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [36-1] user=,db= WARNING: page 133810 of relation base/16407/38334879 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [37-1] user=,db= WARNING: page 129231 of relation base/16407/38334886 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [38-1] user=,db= WARNING: page 14329 of relation base/16407/80430266 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [39-1] user=,db= WARNING: page 134145 of relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [40-1] user=,db= WARNING: page 134144 of relation base/16407/38334880 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [41-1] user=,db= WARNING: page 317957 of relation base/16407/38334884 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [42-1] user=,db= WARNING: page 129230 of relation base/16407/38334886 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [43-1] user=,db= WARNING: page 156201 of relation base/16407/38334883 was uninitialized
2012-06-14 11:51:09 BRT [18653]: [44-1] user=,db= PANIC: WAL contains references to invalid pages
2012-06-14 11:51:09 BRT [18651]: [4-1] user=,db= LOG: startup process (PID 18653) was terminated by signal 6: Aborted
2012-06-14 11:51:09 BRT [18651]: [5-1] user=,db= LOG: terminating any other active server processes

PostgreSQL version 9.1.3. Thanks in advance for any suggestions. -Matheus

2011/7/1 Fujii Masao masao.fu...@gmail.com On Fri, Jul 1, 2011 at 2:18 PM, Mikko Partio mpar...@gmail.com wrote: Hello list I have a two-machine cluster with PostgreSQL 9.0.4 and streaming replication. In a normal situation I did a failover -- touched the trigger file in standby to promote it to production mode. I have done this previously without any complications but now the to-be-production-database had a PANIC and shut itself down. From the logs:

postgres[10751]: [2-1] 2011-06-30 17:25:24 EEST [10751]: [1-1] user=,db= LOG: streaming replication successfully connected to primary
postgres[10736]: [10-1] 2011-07-01 07:50:29 EEST [10736]: [10-1] user=,db= LOG: trigger file found: /postgresql/data/finish_replication.trigger
postgres[10751]: [3-1] 2011-07-01 07:50:29 EEST [10751]: [2-1] user=,db= FATAL: terminating walreceiver process due to administrator command
postgres[10736]: [11-1] 2011-07-01 07:50:30 EEST [10736]: [11-1] user=,db= LOG: redo done at AE/8B1855C8
postgres[10736]:
Re: [HACKERS] sortsupport for text
Peter Geoghegan pe...@2ndquadrant.com writes: I think that this change may have made the difference between the Hungarians getting away with it and not getting away with it. Might it have been that for text, they were using some operator that wasn't '=' (perhaps one which has no fastpath, and thus correctly made a representation about equivalency) rather than texteq prior to this commit? Uh, no. There aren't any magic variants of equality now, and there were not back then either. I'm inclined to think that the Hungarian problem did exist long before we fixed it. So, you're going to have an extra strcoll()/strxfrm() + strcmp() here, as part of a not-equal-but-maybe-equivalent test, which is bad. However, if that means that we can cache a text constant as a strxfrm() blob, and compare in a strxfrm()-wise fashion, that will more than pay for itself, even for btree traversal alone. Um ... are you proposing to replace text btree index entries with strxfrm values? Because if you aren't, I don't see that this is likely to win anything. And if you are, it loses on the grounds of (a) index bloat and (b) loss of ability to do index-only scans. It would be nice to hear what others thought of these ideas before I actually start writing a patch that both fixes these problems (our behaviour is incorrect for some locales according to the Unicode standard), facilitates a strxfrm() optimisation, and actually adds a strxfrm() optimisation. Personally I think this is not a direction we want to go in. I think that it's going to end up a significant performance loss in many cases, break backwards compatibility in numerous ways, and provide a useful behavioral improvement to only a small minority of users. If you check the archives, we get far more complaints from people who think their locale definition is wrong than from people who are unhappy because we don't hew exactly to the corner cases of their locale. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER
Folks, A co-worker filed a bug against file_fdw where the columns in a FOREIGN TABLE were scrambled on SELECT. It turned out that this comes from the (yes, it's documented, but since it's documented in a place not obviously linked to the bug, it's pretty useless) feature of COPY CSV HEADER whereby the header line is totally ignored in COPY OUT. Rather than being totally ignored in the COPY OUT (CSV HEADER) case, the header line should be parsed to establish which columns are where, with the output rearranged if needed. I'm proposing to make the code change here: http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436 and a suitable doc change that talks about reading the header only for the purpose of matching column names to columns, and throwing away the output as before. What say? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
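A rough sketch of the header-matching behaviour proposed above; real CSV quoting, case-folding rules, and error reporting are omitted, and attnames stands for the table's declared column order:

    #include <string.h>

    /*
     * Map each file column named in the CSV header to the table attribute
     * with the same name.  Returns the number of file columns, or -1 if
     * the header names a column the table does not have.  Modifies
     * "header" in place via strtok().
     */
    static int
    build_header_map(char *header, char *attnames[], int natts, int map[])
    {
        int         fileidx = 0;

        for (char *tok = strtok(header, ","); tok; tok = strtok(NULL, ","))
        {
            int         att;

            for (att = 0; att < natts; att++)
            {
                if (strcmp(tok, attnames[att]) == 0)
                {
                    map[fileidx] = att;   /* file column feeds attribute att */
                    break;
                }
            }
            if (att == natts)
                return -1;
            fileidx++;
        }
        return fileidx;
    }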
Re: [HACKERS] libpq compression
Excerpts from Florian Pflug's message of Wed Jun 20 06:35:29 -0400 2012: On Jun19, 2012, at 17:36 , Robert Haas wrote: On Mon, Jun 18, 2012 at 1:42 PM, Martijn van Oosterhout klep...@svana.org wrote: On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote: The fly in the ointment with any of these ideas is that the configure list is not a list of exact cipher names, as per Magnus' comment that the current default includes tests like !aNULL. I am not sure that we know how to evaluate such conditions if we are applying an after-the-fact check on the selected cipher. Does OpenSSL expose any API for evaluating whether a selected cipher meets such a test? I'm not sure whether there's an API for it, but you can certainly check manually with openssl ciphers -v, for example: $ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP' NULL-SHA SSLv3 Kx=RSA Au=RSA Enc=None Mac=SHA1 NULL-MD5 SSLv3 Kx=RSA Au=RSA Enc=None Mac=MD5 ...etc... So unless openssl includes the code twice, there must be a way to extract the list from the library. There doubtless is, but I'd be willing to wager that you won't be able to figure out the exact method without reading the source code for 'openssl ciphers' to see how it was done there, and most likely you'll find that at least one of the functions they use has no man page. Documentation isn't their strong point. Yes, unfortunately. I looked at the code (apps/ciphers.c) and it looks pretty easy to obtain the list of ciphers starting from the stringified configuration parameter and iterate on them. The problem is figuring out whether any given cipher meets some criteria; all the stuff that the command prints after the cipher name comes from a "get cipher description" API call, and it doesn't look like there's any simple way of getting the individual bits in some better form (assuming we don't want to parse the description string). Now if the cipher name is enough for whatever it is that we want, then that looks easy. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
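Following the pattern in apps/ciphers.c, expanding a spec string into the concrete cipher names it selects looks roughly like this (error handling pared down; the method and initialization calls vary across OpenSSL versions):

    #include <stdio.h>
    #include <openssl/ssl.h>

    int
    main(void)
    {
        SSL_CTX    *ctx;
        SSL        *ssl;
        STACK_OF(SSL_CIPHER) *sk;

        SSL_library_init();             /* implicit in newer OpenSSL */
        ctx = SSL_CTX_new(SSLv23_method());
        if (!SSL_CTX_set_cipher_list(ctx, "ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM"))
            return 1;

        ssl = SSL_new(ctx);
        sk = SSL_get_ciphers(ssl);      /* the expanded list */
        for (int i = 0; i < sk_SSL_CIPHER_num(sk); i++)
            printf("%s\n", SSL_CIPHER_get_name(sk_SSL_CIPHER_value(sk, i)));

        SSL_free(ssl);
        SSL_CTX_free(ctx);
        return 0;
    }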
Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata
Amit Kapila amit.kap...@huawei.com writes: I'm almost inclined to suggest that we not get next-LSN from WAL, but by scanning all the pages in the main data store and computing the max observed LSN. This is clearly not very attractive from a performance standpoint, but it would avoid the obvious failure mode where you lost some recent WAL segments along with pg_control. According to my analysis, this will have some problems. I think you're missing the point. There is no possible way to guarantee database consistency after applying pg_resetxlog, unless the database had been cleanly shut down beforehand. The reset will lose the xlog information that was needed to restore consistency. So arguing from examples that demonstrate this is rather pointless. Rather, the value of pg_resetxlog is to be able to start the database at all so that info can be extracted from it. What we are looking for is not perfection, because that's impossible, but just to not make a bad situation worse. The reason I'm concerned about selecting a next-LSN that's certainly beyond every LSN in the database is that not doing so could result in introducing further corruption, which would be entirely avoidable with more care in choosing the next-LSN. Pg_resetxlog - it will generate the next-LSN point as 109, which when used for recovery will generate an inconsistent database. However, if we had relied on WAL, it would have got the next-LSN as 107. Umm ... the entire point of pg_resetxlog is to throw away WAL. Not to rely on it. It's conceivable that there would be some use in a tool that searches the available WAL files for the latest checkpoint record and recreates a pg_control file pointing at that checkpoint, without zapping the WAL files. This would be much different in purpose and usage from pg_resetxlog, though. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
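To illustrate the scan-everything idea: every heap or index page begins with its LSN (pd_lsn), so a crude max-LSN pass over a relation segment file could look like the sketch below. Page validity checks, endianness, and the era's two-part LSN representation are all glossed over:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLCKSZ 8192

    static uint64_t
    max_lsn_in_segment(FILE *fp)
    {
        char        page[BLCKSZ];
        uint64_t    max_lsn = 0;

        while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
        {
            uint64_t    lsn;

            memcpy(&lsn, page, sizeof(lsn));   /* pd_lsn leads the header */
            if (lsn > max_lsn)
                max_lsn = lsn;
        }
        return max_lsn;
    }

A real tool would repeat this over every segment of every relation in the cluster and then pick a next-LSN safely beyond the maximum observed.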
Re: [HACKERS] Release versioning inconsistency
Peter Eisentraut pete...@gmx.net writes: On ons, 2012-06-20 at 13:26 +0200, Magnus Hagander wrote: That might actually be a good idea. We can't really change the way we named the betas, but it's not too late to consider naming the actual release as 9.2.0... The final release was always going to be called 9.2.0, but naming the beta 9.2.0betaX is wrong. There was a previous discussion about that particular point. Yes. There is no reason to change the naming scheme we have been using for years now (at least since version_stamp.pl was invented for 7.4). The only problem is that somebody got the name of the directory wrong on the FTP server. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pg default's verbosity?
On tis, 2012-06-19 at 02:15 -0400, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: There might be something to the idea of demoting a few of the things we've traditionally had as NOTICEs, though. IME, the following two messages account for a huge percentage of the chatter: NOTICE: CREATE TABLE will create implicit sequence foo_a_seq for serial column foo.a NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index foo_pkey for table foo Personally, I'd have no problem with flat-out dropping (not demoting) both of those two specific messages. I seem to recall that Bruce has lobbied for them heavily in the past, though. I don't like these messages any more than the next guy, but why drop only those, and not any of the other NOTICE-level messages? The meaning of NOTICE is pretty much, if this is the first time you're using PostgreSQL, let me tell you a little bit about how we're doing things here. If you've run your SQL script more than 3 times, you won't need them anymore. So set your client_min_messages to WARNING then. That should be pretty much standard for running SQL scripts, in addition to all the other stuff listed here: http://petereisentraut.blogspot.fi/2010/03/running-sql-scripts-with-psql.html -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] libpq compression
Florian Pflug f...@phlo.org writes: I wonder though if shouldn't restrict the allowed ciphers list to being a simple list of supported ciphers. If our goal is to support multiple SSL libraries transparently then surely having openssl-specific syntax in the config file isn't exactly great anyway... No, we don't want to go there, because then we'd have to worry about keeping the default list in sync with what's supported by the particular version of the particular library we chance to be using. That's about as far from transparent as you can get. A notation like DEFAULT is really quite ideal for our purposes in that respect. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Simon Riggs si...@2ndquadrant.com wrote: Kevin Grittner kevin.gritt...@wicourts.gov wrote: Heikki Linnakangas wrote: I don't like the idea of adding the origin id to the record header. It's only required in some occasions, and on some record types. Right. Wrong, as explained. The point is not wrong; you are simply not responding to what is being said. You have not explained why an origin ID is required when there is no replication, or if there is master/slave logical replication, or there are multiple masters with non-overlapping primary keys replicating to a single table in a consolidated database, or each master replicates to all other masters directly, or any of various other scenarios raised on this thread. You've only explained why it's necessary for certain configurations of multi-master replication where all rows in a table can be updated on any of the masters. I understand that this is the configuration you find most interesting, at least for initial implementation. That does not mean that the other situations don't exist as use cases or should not be considered in the overall design. I don't think there is anyone here who would not love to see this effort succeed, all the way to multi-master replication in the configuration you are emphasizing. What is happening is that people are expressing concerns about parts of the design which they feel are problematic, and brainstorming about possible alternatives. As I'm sure you know, fixing a design problem at this stage in development is a lot less expensive than letting the problem slide and trying to deal with it later. It isn't true that this is needed only for some configurations of multi-master, per discussion. I didn't get that out of the discussion; I saw a lot of cases mentioned as not needing it to which you simply did not respond. This is not transaction metadata, it is WAL record metadata required for multi-master replication, see later point. We need to add information to every WAL record that is used as the source for generating LCRs. If the origin ID of a transaction doesn't count as transaction metadata (i.e., data about the transaction), what does? It may be a metadata element about which you have special concerns, but it is transaction metadata. You don't plan on supporting individual WAL records within a transaction containing different values for origin ID, do you? If not, why is it something to store in every WAL record rather than once per transaction? That's not intended to be a rhetorical question. I think it's because you're still thinking of the WAL stream as *the medium* for logical replication data rather than *the source* of logical replication data. As long as the WAL stream is the medium, options are very constrained. You can code a very fast engine to handle a single type of configuration that way, and perhaps that should be a supported feature, but it's not a configuration I've needed yet. (Well, on reflection, if it had been available and easy to use, I can think of *one* time I *might* have used it for a pair of nodes.) It seems to me that you are so focused on this one use case that you are not considering how design choices which facilitate fast development of that use case paint us into a corner in terms of expanding to other use cases. I think removing origin ID from this patch and submitting a separate patch for a generalized transaction metadata system is the sensible way to go. We already have a very flexible WAL system for recording data of interest to various resource managers.
If you wish to annotate a transaction, you can either generate a new kind of WAL record or you can enhance a commit record. Right. Like many of us are suggesting should be done for origin ID. XLOG_NOOP records can already be generated by your application if you wish to inject additional metadata to the WAL stream. So no changes are required for you to implement the generalised transaction metadata scheme you say you require. I'm glad it's that easy. Are there SQL functions for that yet? Not sure how or why that relates to requirements for multi-master. That depends on whether you want to leave the door open to other logical replication than the one use case on which you are currently focused. I even consider some of those other cases multi-master, especially when multiple databases are replicating to a single table on another server. I'm not clear on your definition -- it seems to be rather more narrow. Maybe we need to define some terms somewhere to facilitate discussion. Is there a Wiki page where that would make sense? -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund and...@2ndquadrant.com wrote: We're not the only ones here performing scope creep, though... I think just about everyone who has posted in the whole thread, except maybe Tom and Marko, is guilty of doing so. I still think it's rather sensible to focus on exactly duplicated schemas in a very first version, just because that leaves out some of the complexity while paving the road for other nice things. Well, I guess what I want to know is: what does focusing on exactly duplicated schemas mean? If it means we'll disable DDL for tables when we turn on replication, that's basically the Slony approach: when you want to make a DDL change, you have to quiesce replication, do it, and then resume replication. I would possibly be OK with that approach. If it means that we'll hope that the schemas are duplicated and start spewing garbage data when they're not, then I'm definitely not OK with that approach. If it means using event triggers to keep the catalogs synchronized, then I don't think that's adequately robust. The user could add more event triggers that run before or after the ones the replication system adds, and then you are back to garbage decoding (or crashes). They could also modify the catalogs directly, although it's possible we don't care quite as much about that case (but on the other hand people do sometimes need to do it to solve real problems). Although I am 100% OK with paring back the initial feature set - indeed, I strongly recommend it - I think that robustness is not a feature which can be left out in v1 and added in later. All the robustness has to be designed in at the start, or we will never have it. On the whole, I think we're spending far too much time talking about code and far too little time talking about what the overall design should look like. We are having a discussion about whether or not MMR should be supported by sticking a 16-bit node ID into every WAL record without having first decided whether we should support MMR, whether that requires node IDs, whether they should be integers, whether those integers should be 16 bits in size, whether they should be present in WAL, and whether or not the record header is the right place to put them. There's a right order in which to resolve those questions, and this isn't it. More generally, I think there is a ton of complexity that we're probably overlooking here in focusing in on specific coding details. I think the most interesting comment made to date is Steve Singer's observation that very little of Slony is concerned with changeset extraction or apply. Now, on the flip side, all of these patches seem to be concerned with changeset extraction and apply. That suggests that we're missing some pretty significant pieces somewhere in this design. I think those pieces are things like error recovery, fault tolerance, user interface design, and control logic. Slony has spent years trying to get those things right. Whether or not they actually have gotten them right is of course an arguable point, but we're unlikely to do better by ignoring all of those issues and implementing whatever is most technically expedient. You've got four people objecting to this patch now, all of whom happen to be committers. Whether or not MMR goes into core, who knows, but it doesn't seem that this patch is going to fly. I find that a bit too early to say. Sure, it won't fly exactly as proposed, but hell, who cares?
What I want to get in is a solution to the specific problem the patch targets. At least you have - not sure about others - accepted that the problem needs a solution. We do not agree yet on what that solution should look like, but that's not exactly surprising, as we started discussing the problem only a good day ago. Oh, no argument with any of that. I strongly object to the idea of shoving this patch through as-is, but I don't object to solving the problem in some other, more appropriate way. I think that won't look much like this patch, though; it will be some new patch. If people agree that your proposed way of just one flag bit is the way to go, we will have to live with that. But that's different from saying the whole thing is dead. I think you've convinced me that a single flag-bit is not enough, but I don't think you've convinced anyone that it belongs in the record header. As I would rather see this project succeed, I recommend that you don't do that. Both you and Andres seem to believe that MMR is a reasonable first target to shoot at, but I don't think anyone else - including the Slony developers who have commented on this issue - endorses that position. I don't think we get full MMR into 9.3. What I am proposing is that we build in the few pieces that are required to implement MMR *on top* of what's hopefully in 9.3. And I think that's a realistic goal. I can't quite follow that sentence, but my general sense is that, while you're
Re: [HACKERS] libpq compression
On Jun20, 2012, at 17:34 , Tom Lane wrote: Florian Pflug f...@phlo.org writes: I wonder though if shouldn't restrict the allowed ciphers list to being a simple list of supported ciphers. If our goal is to support multiple SSL libraries transparently then surely having openssl-specific syntax in the config file isn't exactly great anyway... No, we don't want to go there, because then we'd have to worry about keeping the default list in sync with what's supported by the particular version of the particular library we chance to be using. That's about as far from transparent as you can get. A notation like DEFAULT is really quite ideal for our purposes in that respect. No argument with that, but does that mean we have to allow the full syntax supported by OpenSSL (i.e., those +,-,! prefixes)? Maybe we could map an empty list to DEFAULT and otherwise interpret it as a list of ciphers? It'd make the whole NULL-cipher business easy, because once we know that the cipher specified doesn't contain !NULL (which removes NULL *permanently*), we can simply append NULL to allow all these ciphers plus NULL. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
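A toy version of that mapping: an empty setting becomes DEFAULT, and NULL is re-appended unless the list bans it outright. Buffer sizing and validation are hand-waved:

    #include <stdio.h>
    #include <string.h>

    static void
    build_cipher_spec(const char *conf, int allow_null,
                      char *out, size_t outlen)
    {
        if (conf == NULL || *conf == '\0')
            conf = "DEFAULT";           /* empty setting -> library default */
        if (allow_null && strstr(conf, "!NULL") == NULL)
            snprintf(out, outlen, "%s:NULL", conf);   /* re-allow NULL */
        else
            snprintf(out, outlen, "%s", conf);
    }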
Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER
David Fetter da...@fetter.org writes: Rather than being totally ignored in the COPY OUT (CSV HEADER) case, the header line in should be parsed to establish which columns are where and rearranging the output if needed. This is not fix a POLA violation. This is a non-backwards-compatible new feature, which would have to have a switch to turn it on. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER
On 06/20/2012 11:02 AM, David Fetter wrote: Folks, A co-worker filed a bug against file_fdw where the columns in a FOREIGN TABLE were scrambled on SELECT. It turned out that this comes from the (yes, it's documented, but since it's documented in a place not obviously linked to the bug, it's pretty useless) feature of COPY CSV HEADER whereby the header line is totally ignored in COPY OUT. Rather than being totally ignored in the COPY OUT (CSV HEADER) case, the header line in should be parsed to establish which columns are where and rearranging the output if needed. I'm proposing to make the code change here: http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436 and a suitable doc change that talks about reading the header only for the purpose of matching column names to columns, and throwing away the output as before. What say? First you are talking about COPY IN, not COPY OUT, surely. This is not a bug, it is documented in exactly the place that all other COPY options are documented. The file_fdw page refers the reader to the COPY docs for details. Unless you want us to duplicate the entire COPY docs in the file_fdw page this seems entirely reasonable. The current behaviour was discussed at some length back when we implemented the HEADER feature, IIRC, and is quite intentional. I don't think we should alter the current behaviour, as plenty of people rely on it, some to my certain knowledge. I do see a reasonable case for adding a new behaviour which takes notice of the header line, although it's likely to have plenty of wrinkles. Reordering columns like you suggest might well have a significant impact on COPY performance, BTW. Also note that I created the file_text_array FDW precisely for people who want to be able to cherry pick and reorder columns. See https://github.com/adunstan/file_text_array_fdw cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] libpq compression
Alvaro Herrera alvhe...@commandprompt.com writes: I looked at the code (apps/ciphers.c) and it looks pretty easy to obtain the list of ciphers starting from the stringified configuration parameter and iterate on them. Do you mean that it will produce an expansion of the set of ciphers meeting criteria like !aNULL? If so, I think we are set; we can easily check to see if the active cipher is in that list, no? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
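For illustration, checking the negotiated cipher of an established connection against such an expanded list could look like this sketch (ssl is assumed to be a connected handle, and the stack comes from SSL_get_ciphers() as in the snippet upthread):

    #include <string.h>
    #include <openssl/ssl.h>

    static int
    cipher_is_allowed(SSL *ssl, STACK_OF(SSL_CIPHER) *allowed)
    {
        const char *name = SSL_get_cipher_name(ssl);   /* negotiated cipher */

        for (int i = 0; i < sk_SSL_CIPHER_num(allowed); i++)
        {
            if (strcmp(name,
                       SSL_CIPHER_get_name(sk_SSL_CIPHER_value(allowed, i))) == 0)
                return 1;
        }
        return 0;
    }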
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote: Simon Riggs si...@2ndquadrant.com wrote: This is not transaction metadata, it is WAL record metadata required for multi-master replication, see later point. We need to add information to every WAL record that is used as the source for generating LCRs. If the origin ID of a transaction doesn't count as transaction metadata (i.e., data about the transaction), what does? It may be a metadata element about which you have special concerns, but it is transaction metadata. You don't plan on supporting individual WAL records within a transaction containing different values for origin ID, do you? If not, why is it something to store in every WAL record rather than once per transaction? That's not intended to be a rhetorical question. It's definitely possible to store it per transaction (see the discussion around http://archives.postgresql.org/message-id/201206201605.43634.and...@2ndquadrant.com); it just makes filtering by the originating node considerably more complex. With our proposal you can do it without any complexity involved, at a low level. Storing it per transaction means you can only stream out the data to other nodes *after* fully reassembling the transaction. That's a pity, especially if we go for a design where the decoding happens in a proxy instance. Other metadata will not be needed on such a low level. I also have to admit that I am very hesitant to start developing some generic transaction metadata framework atm. That seems to be a good way to spend a good part of the time in discussion and disagreement. Imo that's something for later. I think it's because you're still thinking of the WAL stream as *the medium* for logical replication data rather than *the source* of logical replication data. I don't think that's true. See the above-referenced subthread for reasons why I think the origin id is important. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs si...@2ndquadrant.com wrote: But I think getting even single-master logical replication working well in a single release cycle is going to be a job and a half. OK, so your estimate is 1.5 people to do that. And if we have more people, should they sit around doing nothing? Oh, give me a break. You're willfully missing my point. And to quote Fred Brooks, nine women can't make a baby in one month. No, I'm not. The question is not how quickly can N people achieve a single thing, but how long will it take a few skilled people working on carefully selected tasks that have few dependencies between them to achieve something. The bottleneck is getting the design right, not writing the code. Selecting tasks for people to work on without an agreement on the design will not advance the process, unless we just accept whatever code you choose to write based on whatever design you happen to pick. How exactly did you arrive at your conclusion? Why is yours right and mine wrong? I estimated the amount of work that would be required to do this right and compared it to other large projects that have been successfully done in the past. I think you are looking at something on the order of magnitude of the Windows port, which took about four releases to become stable, or the SE-Linux project, which still isn't feature-complete. Even if it's only a HS-sized project, that took two releases, as did streaming replication. SSI got committed within one release cycle, but there were several years of design and advocacy work before any code was written, so that, too, was really a multi-year project. I'll confine my comments on the second part of the question to the observation that it is a bit early to know who is right and who is wrong, but the question could just as easily be turned on its head. No, I have four people who had initial objections and who have not commented on the fact that the points made are regrettably incorrect. I think Kevin addressed this point better than I can. Asserting something doesn't make it true, and you haven't offered any rational argument against the points that have been made, probably because there isn't one. We *cannot* justify stealing 100% of the available bit space for a feature that many people won't use and may not be enough to address the real requirement anyway. Since at least 3 of the people making such comments did not attend the full briefing meeting in Ottawa, I am not particularly surprised. However, I do expect people that didn't come to the meeting to recognise that they are likely to be missing information and to listen closely, as I listen to them. Participation in the community development process is not contingent on having flown to Ottawa in May, or on having decided to spend that evening at your briefing meeting. Attributing to ignorance what is adequately explained by honest disagreement is impolite. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sortsupport for text
On 20 June 2012 15:55, Tom Lane t...@sss.pgh.pa.us wrote: Peter Geoghegan pe...@2ndquadrant.com writes: I think that this change may have made the difference between the Hungarians getting away with it and not getting away with it. Might it have been that for text, they were using some operator that wasn't '=' (perhaps one which has no fastpath, and thus correctly made a representation about equivalency) rather than texteq prior to this commit? Uh, no. There aren't any magic variants of equality now, and there were not back then either. I'm inclined to think that the Hungarian problem did exist long before we fixed it. I was suggesting that an operator other than equality may have been used for some reason. It probably isn't a significant issue though. Um ... are you proposing to replace text btree index entries with strxfrm values? Because if you aren't, I don't see that this is likely to win anything. And if you are, it loses on the grounds of (a) index bloat and (b) loss of ability to do index-only scans. No, I'm suggesting it would probably be at least a bit of a win here to cache the constant, and only have to do a strxfrm() + strcmp() per comparison. Not enough to justify all the added complexity of course, but I wouldn't seek to justify this on the basis of improvements to the speed of btree traversal. It would obviously be much more valuable for tuple sorting. Personally I think this is not a direction we want to go in. I think that it's going to end up a significant performance loss in many cases, break backwards compatibility in numerous ways, and provide a useful behavioral improvement to only a small minority of users. I may have over-emphasized the improvement in correctness that this would bring, which I personally consider to be very much a secondary benefit. The fact is that this is likely to be a fairly significant performance win, because strxfrm() is quite simply the way you're supposed to do collation-aware sorting, and is documented as such. For that reason, C standard library implementations should not be expected to emphasize its performance - they assume that you're using strxfrm() + their highly optimised strcmp() (as I've said, the glibc implementation is written in ASM, and makes use of hand-written SSSE3 instructions on x86_64 for example). The only performance downside that I can think of is the added check for equivalence for each tuple within _bt_checkkeys() - perhaps you were thinking of something else that hasn't occurred to me though. Were you? The added overhead is only going to be paid once per index scan, not once per tuple returned by an index scan. Since equality implies equivalence for us, there is no need to check equivalence until something is unequal, and when that happens, and equivalence doesn't hold, we're done. Meanwhile, that entire index scan just got measurably faster from caching the constant's strxfrm() blob. Now, maybe it is a bit funky that this equivalence test has to happen even though the vast majority of locales don't care. If the only alternative is to bloat up the strxfrm() blobs with the original string, I'd judge that the funkiness is well worth it. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata
On Wed, Jun 20, 2012 at 12:40 PM, Amit Kapila amit.kap...@huawei.com wrote: I believe if the WAL files are proper, as mentioned in Alvaro's mail, the proposed logic should generate correct values. Do you see any problem in the logic proposed in my original mail? Can I resume my work on this feature? Maybe I'm missing your point, but... why don't you just use PITR to recover from the corruption of pg_control? AFAIK PITR can be used in a scenario where there is a base backup and we have archived the WAL files after that; it can then apply the WAL files to the base backup. Yes. If you want to recover the database from a media crash like the corruption of the pg_control file, you basically should take a base backup and set up continuous archiving. In this scenario we don't know the point from which to start the next replay. So I believe it will be difficult to use PITR in this scenario. You can find out the point from the complete pg_control file which was restored from the backup. If pg_control is corrupted, we can easily imagine that other database files would also be corrupted. I wonder how often only the pg_control file gets corrupted. In that case, pg_resetxlog is no help at all. You need to use PITR instead. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER
On Wed, Jun 20, 2012 at 11:47:14AM -0400, Tom Lane wrote: David Fetter da...@fetter.org writes: Rather than being totally ignored in the COPY OUT (CSV HEADER) case, the header line should be parsed to establish which columns are where, and the output rearranged if needed. This is not a fix for a POLA violation. This is a non-backwards-compatible new feature, which would have to have a switch to turn it on. OK, new proposal: COPY FROM (Thanks, Andrew! Must not post while asleep...) should have an option which requires that HEADER be enabled and mandates that the column names in the header match the columns coming in. Has someone got a better name for this option than KEEP_HEADER_COLUMN_NAMES? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER
David Fetter da...@fetter.org writes: OK, new proposal: COPY FROM (Thanks, Andrew! Must not post while asleep...) should have an option which requires that HEADER be enabled and mandates that the column names in the header match the columns coming in. Has someone got a better name for this option than KEEP_HEADER_COLUMN_NAMES? Well, if it's just checking that the list matches, then maybe CHECK_HEADER would do. In your original proposal, I was rather wondering what should happen if the incoming file didn't have the same set of columns called out in the COPY command's column list. (That is, while *rearranging* the columns might be thought non-astonishing, I'm less convinced about a copy operation that inserts or defaults columns differently from what the command said should happen.) If we're just checking for a match, that question goes away. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
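For concreteness, here is a minimal C sketch of the check a CHECK_HEADER-style option implies: an exact, ordered match between the parsed header fields and the COPY column list. The names are invented for illustration and come from no patch:

    #include <stdbool.h>
    #include <string.h>

    static bool
    header_matches(char **header, int nheader, char **cols, int ncols)
    {
        int i;

        if (nheader != ncols)
            return false;               /* extra or missing columns */
        for (i = 0; i < ncols; i++)
        {
            if (strcmp(header[i], cols[i]) != 0)
                return false;           /* name mismatch at position i */
        }
        return true;
    }

Under this definition the questions about inserting or defaulting columns indeed go away: any deviation from the declared list is simply an error.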
[HACKERS] proposal and patch : support INSERT INTO...RETURNING with partitioned table using rule
--- Problem I'm trying to solve: For partitioned tables, make it possible to use a RETURNING clause on INSERT INTO together with a DO INSTEAD rule [ Note - wherever I say INSERT I also mean UPDATE and DELETE ] --- Current behaviour : An INSERT which has a RETURNING clause and which is to be rewritten based on a rule will be accepted if the rule is an unconditional DO INSTEAD. In general I believe unconditional means no WHERE clause, but in practice if the rule is of the form CREATE RULE insert_part_history as ON INSERT to history \ DO INSTEAD SELECT history_insert_partitioned(NEW) returning NEW.id this is treated as conditional and the query is rejected. Testcase: A table T is partitioned and has two or more columns, one of which is an id column declared as id bigint DEFAULT nextval('history_id_seq'::regclass) NOT NULL and the application issues INSERT into history (column-list which excludes id) values () RETURNING id I can get the redirection of the INSERT *without* RETURNING to work using either a trigger or a rule, in which the trigger/rule invokes a procedure, but whichever way I do it, I could not get the RETURNING clause to work. For a trigger, the INSERT ... RETURNING was accepted but returned no rows (as I would expect), and for the rule, the INSERT ... RETURNING was rejected with : ERROR: cannot perform INSERT RETURNING on relation history HINT: You need an unconditional ON INSERT DO INSTEAD rule with a RETURNING clause. but this hint was not much help, since for a rule CREATE RULE insert_part_history as ON INSERT to history \ DO INSTEAD SELECT history_insert_partitioned(NEW) returning NEW.id ERROR: syntax error at or near returning LINE 1: ...DO INSTEAD SELECT history_insert_partitioned(NEW) returning ... Here the function history_insert_partitioned is something like CREATE FUNCTION history_insert_partitioned( NEW public.history) RETURNS BIGINT AS $$ DECLARE ... BEGIN ... access NEW fields e.g. timestamp construct partitioned table name EXECUTE 'INSERT INTO ' partitioned table ... RETURN history_id; END; $$ LANGUAGE plpgsql --- Some references to discussion of this requirement : . http://wiki.postgresql.org/wiki/Todo item Make it possible to use RETURNING together with conditional DO INSTEAD rules, such as for partitioning setups . http://archives.postgresql.org/pgsql-general/2012-06/msg00377.php . http://archives.postgresql.org/pgsql-general/2010-12/msg00542.php . http://acodapella.blogspot.it/2011/06/hibernate-postgresql-table-partitioning.html --- Proposal: . I propose to extend the rule system to recognize cases where the INSERT query specifies RETURNING and the rule promises to return a row, and to then permit this query to run and return the expected row. In effect, to widen the definition of unconditional to handle cases such as my testcase. . One comment is that all the infrastructure for returning one row from the re-written query is already present in the code, and the non-trivial question is how to ensure the new design is safe in preventing any rewrite that actually would not return a row. . In this patch, I have chosen to make use of the LIMIT clause - I add a side-effect implication to a LIMIT clause when it occurs in a rewrite of an INSERT, to mean this rule will return a row. So, with my patch, the same testcase, the same function history_insert_partitioned, and the new rule CREATE RULE insert_part_history as ON INSERT to history \ DO INSTEAD SELECT history_insert_partitioned(NEW) LIMIT 1 the INSERT is accepted and returns the id.
This use of the LIMIT clause is probably contentious, but I wished to avoid introducing new SQL syntax, and the LIMIT clause does have a connotation of returning rows. --- I attach a patch based on a clone of postgresql.git as of yesterday (120619-145751 EST). I have tested the patch with INSERT and UPDATE (not tested with DELETE, but it should work). The patch is not expected to be final but just to show how I did it. John

--- src/backend/optimizer/plan/planner.c.orig	2012-06-19 14:59:21.264574275 -0400
+++ src/backend/optimizer/plan/planner.c	2012-06-19 15:10:54.776590758 -0400
@@ -226,7 +226,8 @@ standard_planner(Query *parse, int curso
 	result->commandType = parse->commandType;
 	result->queryId = parse->queryId;
-	result->hasReturning = (parse->returningList !=
Re: [HACKERS] libpq compression
Excerpts from Tom Lane's message of mié jun 20 11:49:51 -0400 2012: Alvaro Herrera alvhe...@commandprompt.com writes: I looked at the code (apps/ciphers.c) and it looks pretty easy to obtain the list of ciphers starting from the stringified configuration parameter and iterate on them. Do you mean that it will produce an expansion of the set of ciphers meeting criteria like !aNULL? Attached is a simple program that does that. You pass 'ALL:!aNULL' as its first arg and it produces such a list. If so, I think we are set; we can easily check to see if the active cipher is in that list, no? Great. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/err.h>
#include <openssl/ssl.h>

int
main(int argc, char *argv[])
{
	const SSL_METHOD *method = TLSv1_client_method();
	SSL_CTX    *ctx;
	SSL        *ssl = NULL;
	char       *ciphers;
	int         i;

	if (argc < 2)
	{
		fprintf(stderr, "ciphers not specified\n");
		exit(1);
	}
	ciphers = argv[1];

	SSL_library_init();

	ctx = SSL_CTX_new(method);
	if (!ctx)
	{
		fprintf(stderr, "something went wrong\n");
		exit(1);
	}

	/* feed the cipher-spec string (e.g. 'ALL:!aNULL') to OpenSSL */
	if (!SSL_CTX_set_cipher_list(ctx, ciphers))
	{
		fprintf(stderr, "unable to set cipher list\n");
		exit(1);
	}

	ssl = SSL_new(ctx);
	if (!ssl)
	{
		fprintf(stderr, "unable to create the SSL object\n");
		exit(1);
	}

	/* walk the expanded cipher list until OpenSSL runs out of entries */
	for (i = 0;; i++)
	{
		const char *cipher;

		cipher = SSL_get_cipher_list(ssl, i);
		if (cipher == NULL)
		{
			fprintf(stderr, "end of cipher list?\n");
			break;
		}
		printf("cipher: %s\n", cipher);
	}

	SSL_CTX_free(ctx);
	SSL_free(ssl);
	return 0;
}

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
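For anyone wanting to reproduce Alvaro's result: assuming the OpenSSL development headers are installed, the program above should build with something like cc -o listciphers listciphers.c -lssl -lcrypto (the file name here is invented, not Alvaro's), and running ./listciphers 'ALL:!aNULL' prints one line per cipher in the expanded set.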
Re: [HACKERS] sortsupport for text
Peter Geoghegan pe...@2ndquadrant.com writes: No, I'm suggesting it would probably be at least a bit of a win here to cache the constant, and only have to do a strxfrm() + strcmp() per comparison. Um, have you got any hard evidence to support that notion? The traditional advice is that strcoll is faster than using strxfrm unless the same strings are to be compared repeatedly. I'm not convinced that saving the strxfrm on just one side will swing the balance. The fact is that this is likely to be a fairly significant performance win, because strxfrm() is quite simply the way you're supposed to do collation-aware sorting, and is documented as such. For that reason, C standard library implementations should not be expected to emphasize its performance - they assume that you're using strxfrm() + their highly optimised strcmp() Have you got any evidence in support of this claim, or is it just wishful thinking about what's likely to be inside libc? I'd also note that any comparisons you may have seen about this are certainly not accounting for the effects of data bloat from strxfrm (ie, possible spill to disk, more merge passes, etc). In any case, if you have to redefine the meaning of equality in order to justify a performance patch, I'm prepared to walk away at the start. The range of likely performance costs/benefits across different locales and different implementations is so wide that if you can't show it to be a win even with the strcmp tiebreaker, it's not likely to be a reliable win without that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] libpq compression
On Wed, Jun 20, 2012 at 12:35 PM, Florian Pflug f...@phlo.org wrote: On Jun19, 2012, at 17:36 , Robert Haas wrote: On Mon, Jun 18, 2012 at 1:42 PM, Martijn van Oosterhout klep...@svana.org wrote: On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote: The fly in the ointment with any of these ideas is that the configure list is not a list of exact cipher names, as per Magnus' comment that the current default includes tests like !aNULL. I am not sure that we know how to evaluate such conditions if we are applying an after-the-fact check on the selected cipher. Does OpenSSL expose any API for evaluating whether a selected cipher meets such a test? I'm not sure whether there's an API for it, but you can certainly check manually with openssl ciphers -v, for example: $ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP' NULL-SHA SSLv3 Kx=RSA Au=RSA Enc=None Mac=SHA1 NULL-MD5 SSLv3 Kx=RSA Au=RSA Enc=None Mac=MD5 ...etc... So unless openssl includes the code twice there must be a way to extract the list from the library. There doubtless is, but I'd be willing to wager that you won't be able to figure out the exact method without reading the source code for 'openssl ciphers' to see how it was done there, and most likely you'll find that at least one of the functions they use has no man page. Documentation isn't their strong point. Yes, unfortunately. I wonder though if we shouldn't restrict the allowed ciphers list to being a simple list of supported ciphers. If our goal is to support multiple SSL libraries transparently then surely having openssl-specific syntax in the config file isn't exactly great anyway... That is a very good point. Before we design *another* feature that relies on it, we should verify whether the syntax is compatible in the other libraries that would be interesting (gnutls and NSS primarily), and if it's not, that at least the *functionality* exists in a compatible way. So we don't put ourselves in a position where we can't proceed. And yes, that's vapourware so far. But I know at least Claes (added to CC) has said he wants to work on it during this summer, and I've promised to help him out and review as well, if/when he gets that far. But even without that, we should try to keep the door to these other library implementations as open as possible. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER
On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote: David Fetter da...@fetter.org writes: OK, new proposal: COPY FROM (Thanks, Andrew! Must not post while asleep...) should have an option which requires that HEADER be enabled and mandates that the column names in the header match the columns coming in. Has someone got a better name for this option than KEEP_HEADER_COLUMN_NAMES? Well, if it's just checking that the list matches, then maybe CHECK_HEADER would do. OK In your original proposal, I was rather wondering what should happen if the incoming file didn't have the same set of columns called out in the COPY command's column list. (That is, while *rearranging* the columns might be thought non-astonishing, I'm less convinced about a copy operation that inserts or defaults columns differently from what the command said should happen.) If we're just checking for a match, that question goes away. Let's imagine we have a table foo(id serial, t text, n numeric) and a CSV file foo.csv with headers (n, t). Just to emphasize, the column ordering is different in the places where they match. Would COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER, CHECK_HEADER) now just work (possibly with some performance penalty), and COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER) only "work" (i.e., import gobbledygook) in the case where every t entry in foo.csv happened to be castable to NUMERIC? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER
Excerpts from David Fetter's message of mié jun 20 12:43:31 -0400 2012: On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote: In your original proposal, I was rather wondering what should happen if the incoming file didn't have the same set of columns called out in the COPY command's column list. (That is, while *rearranging* the columns might be thought non-astonishing, I'm less convinced about a copy operation that inserts or defaults columns differently from what the command said should happen.) If we're just checking for a match, that question goes away. Let's imagine we have a table foo(id serial, t text, n numeric) and a CSV file foo.csv with headers (n, t). Just to emphasize, the column ordering is different in the places where they match. For the record, IIRC we had one person trying to do this on the Spanish list not long ago. Another related case: you have a file with headers and columns (n, t, x, y, z) but your table only has n and t. How would you tell COPY to discard the junk columns? Currently it just complains that they are there. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi, On Wednesday, June 20, 2012 05:44:09 PM Robert Haas wrote: On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund and...@2ndquadrant.com wrote: We're not the only ones here performing scope creep though... I think about all the people who have posted in the whole thread, except maybe Tom and Marko, are guilty of doing so. I still think it's rather sensible to focus on exactly duplicated schemas in a very first version, just because that leaves out some of the complexity while paving the road for other nice things. Well, I guess what I want to know is: what does focusing on exactly duplicated schemas mean? If it means we'll disable DDL for tables when we turn on replication, that's basically the Slony approach: when you want to make a DDL change, you have to quiesce replication, do it, and then resume replication. I would possibly be OK with that approach. If it means that we'll hope that the schemas are duplicated and start spewing garbage data when they're not, then I'm definitely not OK with that approach. If it means using event triggers to keep the catalogs synchronized, then I don't think that's adequately robust. The user could add more event triggers that run before or after the ones the replication system adds, and then you are back to garbage decoding (or crashes). I would prefer the event trigger way because that seems to be the most extensible/reusable. It would allow fully replicated databases and catalog-only instances. I think we need to design event triggers in a way that you cannot simply circumvent them. We already have the case that if users try to screw around with system triggers we give back wrong answers, with the planner relying on foreign keys, btw. If the problem is having user triggers after system triggers: let's make that impossible. Forbidding DDL on the other instances once we have that isn't that hard. Perhaps all that will get simpler if we can make reading the catalog via custom-built snapshots work, as you proposed elsewhere in this thread. That would make checking errors much easier even if you just want to apply to a database with exactly the same schema. That's the next thing I plan to work on. They could also modify the catalogs directly, although it's possible we don't care quite as much about that case (but on the other hand people do sometimes need to do it to solve real problems). With that you can already crash the database perfectly well today. I think trying to care for that is a waste of time. Although I am 100% OK with paring back the initial feature set - indeed, I strongly recommend it - I think that robustness is not a feature which can be left out in v1 and added in later. All the robustness has to be designed in at the start, or we will never have it. I definitely don't intend to cut down on robustness. On the whole, I think we're spending far too much time talking about code and far too little time talking about what the overall design should look like. Agreed. We are having a discussion about whether or not MMR should be supported by sticking a 16-bit node ID into every WAL record without having first decided whether we should support MMR, whether that requires node IDs, whether they should be integers, whether those integers should be 16 bits in size, whether they should be present in WAL, and whether or not the record header is the right place to put them. There's a right order in which to resolve those questions, and this isn't it.
More generally, I think there is a ton of complexity that we're probably overlooking here in focusing in on specific coding details. I think the most interesting comment made to date is Steve Singer's observation that very little of Slony is concerned with changeset extraction or apply. Now, on the flip side, all of these patches seem to be concerned with changeset extraction and apply. That suggests that we're missing some pretty significant pieces somewhere in this design. I think those pieces are things like error recovery, fault tolerance, user interface design, and control logic. Slony has spent years trying to get those things right. Whether or not they actually have gotten them right is of course an arguable point, but we're unlikely to do better by ignoring all of those issues and implementing whatever is most technically expedient. I agree that the focus isn't 100% optimal and that there are *loads* of issues we haven't even started to look at. But you need a place to start, and extraction/apply seems to be a good one because you can actually test it without the other issues solved, which is not really the case the other way round. Also it's possible to plug the newly built changeset extraction into existing solutions to make them more efficient while retaining most of their respective frameworks. So I disagree that that's the wrong part to start with. You've got four people objecting to this patch now,
Re: [HACKERS] foreign key locks
Jaime Casanova ja...@2ndquadrant.com writes: On Thu, Jun 14, 2012 at 11:41 AM, Alvaro Herrera alvhe...@alvh.no-ip.org wrote: This is v12 of the foreign key locks patch. Just noticed that this patch needs a rebase because of the refactoring Tom did in ri_triggers.c Hold on a bit before you work on that code --- I've got one more bit of hacking I want to try before walking away from it. I did some oprofile work on Dean's example from caezatcwm8m00ra814o4dc2cd_aj44gqlb0tddxmha312qg7...@mail.gmail.com and noticed that it looks like ri_FetchConstraintInfo is eating a noticeable fraction of the runtime, which is happening because it is called to deconstruct the relevant pg_constraint row for each tuple we consider firing the trigger for (and then again, if we do fire the trigger). I'm thinking it'd be worth maintaining a backend-local cache for the deconstructed data, and am going to go try that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
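The shape of such a backend-local cache is easy to sketch. Everything below is invented for illustration (the real thing would presumably key a dynahash off the constraint OID and hook into catalog invalidation); it is only meant to show where the saving comes from:

    typedef struct RIConstraintInfoSketch
    {
        unsigned int conoid;    /* pg_constraint OID; 0 means empty slot */
        /* ... deconstructed pg_constraint fields would live here ... */
    } RIConstraintInfoSketch;

    /* stand-in for the expensive catalog fetch and row deforming */
    static void
    deconstruct_constraint_row(unsigned int conoid, RIConstraintInfoSketch *info)
    {
        info->conoid = conoid;
    }

    #define RI_CACHE_SIZE 64
    static RIConstraintInfoSketch ri_cache[RI_CACHE_SIZE];

    static const RIConstraintInfoSketch *
    ri_fetch_cached(unsigned int conoid)
    {
        RIConstraintInfoSketch *slot = &ri_cache[conoid % RI_CACHE_SIZE];

        if (slot->conoid != conoid)
            deconstruct_constraint_row(conoid, slot);   /* miss: pay once, reuse after */
        return slot;
    }

The hard part, of course, is not the lookup but invalidation: entries must be flushed when the underlying pg_constraint row changes.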
Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER
On 06/20/2012 12:18 PM, Tom Lane wrote: David Fetterda...@fetter.org writes: OK, new proposal: COPY FROM (Thanks, Andrew! Must not post while asleep...) should have an option which requires that HEADER be enabled and mandates that the column names in the header match the columns coming in. Has someone got a better name for this option than KEEP_HEADER_COLUMN_NAMES? Well, if it's just checking that the list matches, then maybe CHECK_HEADER would do. In your original proposal, I was rather wondering what should happen if the incoming file didn't have the same set of columns called out in the COPY command's column list. (That is, while *rearranging* the columns might be thought non-astonishing, I'm less convinced about a copy operation that inserts or defaults columns differently from what the command said should happen.) If we're just checking for a match, that question goes away. In the past people have asked me to have copy use the CSV header line in place of supplying a column list in the COPY command. I can certainly see some utility in that, and I think it might achieve what David wants. Using that scenario it would be an error to supply an explicit column list and also use the header line. But then I don't think CHECK_HEADER would be the right name for the option. In any case, specifying a name before we settle on an exact behaviour seems wrong :-) cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER
On 06/20/2012 12:50 PM, Alvaro Herrera wrote: Another related case: you have a file with headers and columns (n, t, x, y, z) but your table only has n and t. How would you tell COPY to discard the junk columns? Currently it just complains that they are there. That's one of the main use cases for file_text_array_fdw. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 23:34, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Simon Riggs si...@2ndquadrant.com wrote: Kevin Grittner kevin.gritt...@wicourts.gov wrote: Heikki Linnakangas wrote: I don't like the idea of adding the origin id to the record header. It's only required in some occasions, and on some record types. Right. Wrong, as explained. The point is not wrong; you are simply not responding to what is being said. Heikki said that the origin ID was not required for all MMR configs/scenarios. IMHO that is wrong, with explanation given. By agreeing with him, I assumed you were sharing that assertion, rather than saying something else. You have not explained why an origin ID is required when there is no replication, or if there is master/slave logical replication, or ... You're right; I never claimed it was needed. Origin id is only needed for multi-master replication, and that is the only context in which I've discussed it. This is not transaction metadata, it is WAL record metadata required for multi-master replication, see later point. We need to add information to every WAL record that is used as the source for generating LCRs. If the origin ID of a transaction doesn't count as transaction metadata (i.e., data about the transaction), what does? It may be a metadata element about which you have special concerns, but it is transaction metadata. You don't plan on supporting individual WAL records within a transaction containing different values for origin ID, do you? If not, why is it something to store in every WAL record rather than once per transaction? That's not intended to be a rhetorical question. I think it's because you're still thinking of the WAL stream as *the medium* for logical replication data rather than *the source* of logical replication data. As long as the WAL stream is the medium, options are very constrained. You can code a very fast engine to handle a single type of configuration that way, and perhaps that should be a supported feature, but it's not a configuration I've needed yet. (Well, on reflection, if it had been available and easy to use, I can think of *one* time I *might* have used it for a pair of nodes.) It seems to me that you are so focused on this one use case that you are not considering how design choices which facilitate fast development of that use case paint us into a corner in terms of expanding to other use cases. I think removing origin ID from this patch and submitting a separate patch for a generalized transaction metadata system is the sensible way to go. We already have a very flexible WAL system for recording data of interest to various resource managers. If you wish to annotate a transaction, you can either generate a new kind of WAL record or you can enhance a commit record. Right. Like many of us are suggesting should be done for origin ID. XLOG_NOOP records can already be generated by your application if you wish to inject additional metadata into the WAL stream. So no changes are required for you to implement the generalised transaction metadata scheme you say you require. I'm glad it's that easy. Are there SQL functions for that yet? Yes, another possible design is to generate a new kind of WAL record for the origin id. Doing it that way will slow down multi-master by a measurable amount, and slightly bloat the WAL stream. The proposed way uses space that is currently wasted and likely to remain so. Only 2 of the 6 available bytes are proposed for use, with a flag design that allows future extension if required.
When MMR is not in use, the WAL records would look completely identical to the way they look now, in size, settings and speed of writing them. Putting the origin id onto each WAL record allows very fast and simple stateless filtering. I suggest using it because those bytes have been sitting there unused for close to 10 years now and no better use springs to mind. The proposed design is the fastest way of implementing MMR, without any loss for non-users. As I noted before, slowing down MMR by a small amount causes geometric losses in performance across the whole cluster. Not sure how or why that relates to requirements for multi-master. That depends on whether you want to leave the door open to other logical replication than the one use case on which you are currently focused. I even consider some of those other cases multi-master, especially when multiple databases are replicating to a single table on another server. I'm not clear on your definition -- it seems to be rather more narrow. Maybe we need to define some terms somewhere to facilitate discussion. Is there a Wiki page where that would make sense? The project is called BiDirectional Replication to ensure that people understood this is not just multi-master. But that doesn't mean that multi-master can't have its own specific requirements. Adding originid is also useful for the use case
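To make "fast and simple stateless filtering" concrete, here is a minimal C sketch. The struct layout and field name are invented for illustration; the real record-header layout is exactly what is under discussion:

    typedef struct XLogRecordSketch
    {
        unsigned short xl_origin_id;    /* hypothetical 16-bit origin id, 2 of the 6 spare bytes */
        /* ... the rest of the record header and payload ... */
    } XLogRecordSketch;

    /* Decide record by record, with no buffering or transaction
     * reassembly, whether to stream a record to a given peer. */
    static int
    should_forward(const XLogRecordSketch *rec, unsigned short peer_id)
    {
        /* skip records that originated at the peer we stream to,
         * breaking the apply loop described earlier in the thread */
        return rec->xl_origin_id != peer_id;
    }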
Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER
In the past people have asked me to have copy use the CSV header line in place of supplying a column list in the COPY command. I can certainly see some utility in that, and I think it might achieve what David wants. Using that scenario it would be an error to supply an explicit column list and also use the header line. But then I don't think CHECK_HEADER would be the right name for the option. In any case, specifying a name before we settle on an exact behaviour seems wrong :-) Actually, I can see three valid and valuable-to-users behaviors here: 1) current behavior, header line is ignored. 2) CHECK HEADER: a column list is supplied, but a check header flag is set. If the column list and header list don't match *exactly*, you get an error. 3) USE HEADER: no column list is supplied, but a use header flag is set. A column list is created to match the columns from the CSV header. Of necessity, this will consist of all or some of the columns in the table. If columns are supplied which are not in the table, then you get an error (as well as if columns are missing which are NOT NULL, as you get at present). (2) is more valuable to people who want to check data integrity rigorously, and test for unexpected API changes. (3) is more valuable for regular users who want CSV import to just work. (1) is valuable for backwards compatibility, and for cases where the CSV header was generated by another program (Excel?) so the column names don't match. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
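Josh's three behaviours line up as a simple dispatch over the parsed header line. A hedged C sketch follows; the enum and function names are invented, and header_matches() is the exact-match check sketched earlier in this thread:

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { HEADER_IGNORE, HEADER_CHECK, HEADER_USE } HeaderMode;

    /* the exact-match check sketched earlier in the thread */
    extern bool header_matches(char **header, int nheader, char **cols, int ncols);

    static void
    copy_error(const char *msg)
    {
        fprintf(stderr, "COPY: %s\n", msg);
        exit(1);
    }

    /* Returns the effective column list for the COPY. */
    static char **
    resolve_columns(HeaderMode mode, char **header, int nheader,
                    char **collist, int ncols)
    {
        switch (mode)
        {
            case HEADER_IGNORE:         /* (1) today's behaviour: skip the line */
                return collist;
            case HEADER_CHECK:          /* (2) verify names, then proceed */
                if (!header_matches(header, nheader, collist, ncols))
                    copy_error("CSV header does not match column list");
                return collist;
            case HEADER_USE:            /* (3) the header supplies the list */
                return header;
        }
        return NULL;                    /* not reached */
    }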
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 11:50 AM, Andres Freund and...@2ndquadrant.com wrote: On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote: Simon Riggs si...@2ndquadrant.com wrote: This is not transaction metadata, it is WAL record metadata required for multi-master replication, see later point. We need to add information to every WAL record that is used as the source for generating LCRs. If the origin ID of a transaction doesn't count as transaction metadata (i.e., data about the transaction), what does? It may be a metadata element about which you have special concerns, but it is transaction metadata. You don't plan on supporting individual WAL records within a transaction containing different values for origin ID, do you? If not, why is it something to store in every WAL record rather than once per transaction? That's not intended to be a rhetorical question. Its definitely possible to store it per transaction (see the discussion around http://archives.postgresql.org/message-id/201206201605.43634.and...@2ndquadrant.com) it just makes the filtering via the originating node a considerably more complex thing. With our proposal you can do it without any complexity involved, on a low level. Storing it per transaction means you can only stream out the data to other nodes *after* fully reassembling the transaction. Thats a pitty, especially if we go for a design where the decoding happens in a proxy instance. I guess I'm not seeing the purpose of having the origin node id in the WAL stream either. We have it in the Slony sl_log_* stream, however there is a crucial difference, in that sl_log_* is expressly a shared structure. In contrast, WAL isn't directly sharable; you don't mix together multiple WAL streams. It seems as though the point in time at which you need to know the origin ID is the moment at which you're deciding to read data from the WAL files, and knowing which stream you are reading from is an assertion that might be satisfied by looking at configuration that doesn't need to be in the WAL stream itself. It might be *nice* for the WAL stream to be self-identifying, but that doesn't seem to be forcibly necessary. The case where it *would* be needful is if you are in the process of assembling together updates coming in from multiple masters, and need to know: - This INSERT was replicated from node #1, so should be ignored downstream - That INSERT was replicated from node #2, so should be ignored downstream - This UPDATE came from the local node, so needs to be passed to downstream users Or perhaps something else is behind the node id being deeply embedded into the stream that I'm not seeing altogether. Other metadata will not be needed on such a low level. I also have to admit that I am very hesitant to start developing some generic transaction metadata framework atm. That seems to be a good way to spend a good part of time in discussion and disagreeing. Imo thats something for later. Well, I see there being a use in there being at least 3 sorts of LCR records: a) Capturing literal SQL that is to be replayed downstream. This parallels two use cases in existing replication systems: i) In pre-2.2 versions of Slony, statements are replayed literally. So there's a stream of INSERT/UPDATE/DELETE statements. ii) DDL capture and replay. In existing replication systems, DDL isn't captured implicitly, the way Dimitri's Event Triggers are intended to do, but rather is captured explicitly.
There should be a function to allow injecting such SQL explicitly; that is sure to be a useful sort of thing to be able to do. b) Capturing tuple updates in a binary form that can be turned readily into heap updates on a replica. Unfortunately, this form is likely not to play well when replicating across platforms or Postgres versions, so I suspect that this performance optimization should be implemented as a *last* resort, rather than first. Michael Jackson had some rules of optimization that said don't do it, and, for the expert, don't do it YET... c) Capturing tuple data in some reasonably portable and readily re-writable form. Slony 2.2 changes from SQL fragments (of (a)(i) above) to storing updates as an array of text values indicating: - relation name - attribute names - attribute values, serialized into strings I don't know that this provably represents the *BEST* representation, but it definitely will be portable where b) would not be, and lends itself to being able to reuse query plans, where a) requires extraordinary amounts of parsing work, today. So I'm pretty sure it's better than a) and b) for a sizable set of cases. -- When confronted by a difficult problem, solve it by reducing it to the question, How would the Lone Ranger handle this? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
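Representation (c) is easy to picture as a data structure. A sketch with invented names, mirroring the array-of-text layout described for Slony 2.2:

    typedef struct LogicalChange
    {
        char   *relname;        /* qualified relation name */
        int     natts;          /* number of columns captured */
        char  **attnames;       /* column names, natts entries */
        char  **attvalues;      /* values serialized to text; NULL for SQL NULL */
    } LogicalChange;

Being all text, it survives cross-version and cross-platform replay, and the (relname, attnames) pair gives the apply side a natural key for caching one prepared plan per distinct column set.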
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund and...@2ndquadrant.com wrote: I would prefer the event trigger way because that seems to be the most extensible/reusable. It would allow fully replicated databases and catalog-only instances. I think we need to design event triggers in a way that you cannot simply circumvent them. We already have the case that if users try to screw around with system triggers we give back wrong answers, with the planner relying on foreign keys, btw. If the problem is having user triggers after system triggers: let's make that impossible. Forbidding DDL on the other instances once we have that isn't that hard. So, this is interesting. I think something like this could solve the problem, but then why not just make it built-in code that runs from the same place as the event trigger rather than using the trigger mechanism per se? Presumably the trigger code's purpose is going to be to inject additional data into the WAL stream (am I wrong?), which is not something you're going to be able to do from PL/pgsql anyway, so you don't really need a trigger, just a call to some C function - which not only has the advantage of not being bypassable, but is also faster. Perhaps all that will get simpler if we can make reading the catalog via custom-built snapshots work, as you proposed elsewhere in this thread. That would make checking errors much easier even if you just want to apply to a database with exactly the same schema. That's the next thing I plan to work on. I realized a problem with that idea this morning: it might work for reading things, but if anyone attempts to write data you've got big problems. Maybe we could get away with forbidding that, not sure. Would be nice to get some input from other hackers on this. They could also modify the catalogs directly, although it's possible we don't care quite as much about that case (but on the other hand people do sometimes need to do it to solve real problems). With that you can already crash the database perfectly well today. I think trying to care for that is a waste of time. You're probably right. I agree that the focus isn't 100% optimal and that there are *loads* of issues we haven't even started to look at. But you need a place to start, and extraction/apply seems to be a good one because you can actually test it without the other issues solved, which is not really the case the other way round. Also it's possible to plug the newly built changeset extraction into existing solutions to make them more efficient while retaining most of their respective frameworks. So I disagree that that's the wrong part to start with. I think extraction is a very sensible place to start; actually, I think it's the best possible place to start. But this particular thread is about adding origin_ids to WAL, which I think is definitely not the best place to start. I definitely do want to provide code that generates a textual representation of the changes. As you say, even if it's not used for anything else, it's needed for debugging. Not sure if it should be SQL or maybe the new Slony representation. If that's provided and reusable it should make sure that other solutions can be built on top of it. Oh, yeah. If we can get that, I will throw a party. I find your supposition that I/we just want to get MMR without regard for anything else a bit offensive. I wrote at least three times in this thread that I do think it's likely that we will not get more than the minimal basis for implementing MMR into 9.3.
I wrote multiple times that I want to provide the basis for multiple solutions. The prototype - while obviously being incomplete - tried hard to be modular. You cannot blame us for wanting the work we do to be *also* usable for what is one of our major aims. What can I do to convince you/others that I am not planning to do something evil, but that I am trying to reach as many goals at once as possible? Sorry. I don't think you're planning to do something evil, but earlier I thought you said you did NOT want to write the code to extract changes as text or something similar. I think that would be a really bad thing to skip, for all kinds of reasons. I think we need that as a foundational technology before we do much else. Now, once we have that, if we can safely detect cases where it's OK to bypass decoding to text and skip it in just those cases, I think that's great (although possibly difficult to implement correctly). I basically feel that without decode-to-text, this can't possibly be a basis for multiple solutions; it will be a basis only for itself, and extremely difficult to debug, too. No other replication solution can even theoretically have any use for the raw on-disk tuple, at least not without horrible kludgery. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sortsupport for text
On Wed, Jun 20, 2012 at 12:41 PM, Tom Lane t...@sss.pgh.pa.us wrote: The fact is that this is likely to be a fairly significant performance win, because strxfrm() is quite simply the way you're supposed to do collation-aware sorting, and is documented as such. For that reason, C standard library implementations should not be expected to emphasize its performance - they assume that you're using strxfrm() + their highly optimised strcmp() Have you got any evidence in support of this claim, or is it just wishful thinking about what's likely to be inside libc? I'd also note that any comparisons you may have seen about this are certainly not accounting for the effects of data bloat from strxfrm (ie, possible spill to disk, more merge passes, etc). In any case, if you have to redefine the meaning of equality in order to justify a performance patch, I'm prepared to walk away at the start. The range of likely performance costs/benefits across different locales and different implementations is so wide that if you can't show it to be a win even with the strcmp tiebreaker, it's not likely to be a reliable win without that. On the testing I've done, the strcmp() tie-breaking rarely gets run anyway. Unless you're sorting data with only a few distinct values, most comparisons are between values that are distinct under any choice of collation, and therefore strcoll() returns 0 very rarely, and therefore the additional runtime it consumes does not matter very much. Also, it's quite a bit faster than strcoll() anyway, so even when it does run it doesn't add much to the total time. I think the elephant in the room here is that we're relying on the OS to do everything for us, and the OS API we use (strcoll) requires an extra memcpy and is also dreadfully slow. If we could solve that problem, it would save us a lot more than worrying about the extra strcmp(). Of course, solving that problem is hard: we either have to get the glibc and FreeBSD libc folks to provide a better API (that takes lengths for each input instead of relying on trailing nul bytes), or reimplement locales within PG, or store trailing NUL bytes that we don't really need in the index so we can apply strcoll directly, none of which are very appealing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
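The extra memcpy Robert mentions comes from an API mismatch: Postgres text values carry an explicit length and no terminating NUL, while strcoll() wants C strings. A stripped-down sketch of the resulting dance follows; the real comparison code also handles long strings with heap allocation and applies a tie-breaking comparison, both omitted here:

    #include <string.h>

    /* assumes both inputs fit the stack buffers; a sketch only */
    static int
    collate_cmp_sketch(const char *a, int alen, const char *b, int blen)
    {
        char bufa[1024];
        char bufb[1024];

        /* copy each input solely to append the NUL that strcoll() requires */
        memcpy(bufa, a, alen);
        bufa[alen] = '\0';
        memcpy(bufb, b, blen);
        bufb[blen] = '\0';
        return strcoll(bufa, bufb);
    }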
Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER
-Original Message- From: pgsql-hackers-ow...@postgresql.org on behalf of Josh Berkus Sent: Wed 6/20/2012 7:06 To: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER ... (1) is valuable for backwards compatibility, and for cases where the CSV header was generated by another program (Excel?) so the column names don't match. 4) MAP_HEADER ('column 1' -> 'col_1', 'Date' -> 'input_date', ...) to cover the case when column names do not match. my 2 pence, Marc Mamin
Re: [HACKERS] sortsupport for text
On 20 June 2012 17:41, Tom Lane t...@sss.pgh.pa.us wrote: Peter Geoghegan pe...@2ndquadrant.com writes: No, I'm suggesting it would probably be at least a bit of a win here to cache the constant, and only have to do a strxfrm() + strcmp() per comparison. Um, have you got any hard evidence to support that notion? The traditional advice is that strcoll is faster than using strxfrm unless the same strings are to be compared repeatedly. I'm not convinced that saving the strxfrm on just one side will swing the balance. Fair enough, but I'd suggest that the traditional advice assumes that strcoll()'s space-efficiency and general avoidance of dynamic allocation is an important factor, which, for us, it clearly isn't, since we can re-use a single buffer in the manner of Robert's text sortsupport patch for each and every per-tuple strxfrm() blob (when traversing an index). This is a direct quote from glibc's strcoll_l.c: "Perform the first pass over the string and while doing this find and store the weights for each character. Since we want this to be as fast as possible we are using `alloca' to store the temporary values. But since there is no limit on the length of the string we have to use `malloc' if the string is too long. We should be very conservative here." Here, alloca is used to allocate space in a stack frame. I believe that this is an entirely inappropriate trade-off for Postgres to be making. strxfrm(), in contrast, leaves buffer sizing and management up to the caller. That has to be a big part of the problem here. The fact is that this is likely to be a fairly significant performance win, because strxfrm() is quite simply the way you're supposed to do collation-aware sorting, and is documented as such. For that reason, C standard library implementations should not be expected to emphasize its performance - they assume that you're using strxfrm() + their highly optimised strcmp() Have you got any evidence in support of this claim, or is it just wishful thinking about what's likely to be inside libc? According to the Single Unix Specification's strcoll() documentation, "The strxfrm() and strcmp() functions should be used for sorting large lists." If that isn't convincing enough for you, there is the fact that glibc's strcmp() is clearly highly optimised for each and every architecture, and that we are currently throwing away an extra strcmp() in the event of strcoll() equality. I'd also note that any comparisons you may have seen about this are certainly not accounting for the effects of data bloat from strxfrm (ie, possible spill to disk, more merge passes, etc). What about the fact that strcoll() may be repeatedly allocating and freeing memory per comparison? The blobs really aren't that much larger than the strings to be sorted, which are typically quite short. In any case, if you have to redefine the meaning of equality in order to justify a performance patch, I'm prepared to walk away at the start. The advantage of my proposed implementation is precisely that I won't have to redefine the meaning of equality, and that only the text datatype will have to care about equivalency, so you can just skip over an explanation of equivalency for most audiences. If you feel that strongly about it, and I have no possible hope of getting this accepted, I'm glad that I know now rather than after completing a significant amount of work on this. I would like to hear other people's opinions before I drop it though.
The range of likely performance costs/benefits across different locales and different implementations is so wide that if you can't show it to be a win even with the strcmp tiebreaker, it's not likely to be a reliable win without that. glibc is the implementation that really matters. My test case used en_US.UTF-8 as its locale, which has to be one of the least stressful to strcoll() - I probably could have shown a larger improvement just by selecting a locale that was known to have to make more passes, like, say, hu_HU.UTF-8. I expect to have access to a 16-core server next week, which Bull have made available to me. Maybe I'll get some interesting performance numbers from it. The reason I don't want to use the "blob plus original string" hack is that it's ugly, space-inefficient, unnecessary and objectively incorrect, since it forces us to violate conformance requirement C9 of Unicode 3.0, marginal though that may be. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
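The caching scheme being argued over amounts to something like the following C sketch; the names are invented, and buffer growth and error handling are elided:

    #include <stdlib.h>
    #include <string.h>

    /* Transform the scan key (or sort constant) exactly once. */
    static char *
    xfrm_once(const char *key)
    {
        size_t  n = strxfrm(NULL, key, 0);  /* ask for the required size */
        char   *blob = malloc(n + 1);

        strxfrm(blob, key, n + 1);
        return blob;
    }

    /* Per-tuple cost: one strxfrm() into a reusable scratch buffer,
     * then the heavily optimised strcmp(). Returns <0, 0, >0 like strcmp. */
    static int
    cmp_against_cached(const char *cached_blob, const char *tuple_val,
                       char *scratch, size_t scratch_len)
    {
        if (strxfrm(scratch, tuple_val, scratch_len) >= scratch_len)
            abort();    /* sketch only: a real caller would grow scratch and retry */
        return strcmp(scratch, cached_blob);
    }

Whether this beats per-comparison strcoll() is, as Tom says, the thing to measure; the sketch only shows where the claimed saving lives.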
Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER
4) MAP_HEADER ('column 1' -> 'col_1', 'Date' -> 'input_date', ...) to cover the case when column names do not match.

Personally, I think that's going way beyond what we want to do with COPY. At that point, check out the CSV-array FDW. Of course, if someone writes a WIP patch which implements the above, we can evaluate it then.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 07:17:57 PM Robert Haas wrote: On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund and...@2ndquadrant.com wrote:

I would prefer the event trigger way because that seems to be the most extensible/reusable. It would allow fully replicated databases and catalog-only instances. I think we need to design event triggers in a way you cannot simply circumvent them. We already have the case that if users screw around with system triggers we give back wrong answers, with the planner relying on foreign keys, btw. If the problem is having user triggers after system triggers: let's make that impossible. Forbidding DDL on the other instances once we have that isn't that hard.

So, this is interesting. I think something like this could solve the problem, but then why not just make it built-in code that runs from the same place as the event trigger rather than using the trigger mechanism per se? Presumably the trigger code's purpose is going to be to inject additional data into the WAL stream (am I wrong?), which is not something you're going to be able to do from PL/pgSQL anyway, so you don't really need a trigger, just a call to some C function - which not only has the advantage of being non-bypassable, but is also faster.

I would be totally fine with that. As long as event triggers provide the infrastructure, that shouldn't be a big problem.

Perhaps all that will get simpler if we can make reading the catalog via custom-built snapshots work, as you proposed elsewhere in this thread. That would make checking for errors much easier, even if you just want to apply to a database with exactly the same schema. That's the next thing I plan to work on.

I realized a problem with that idea this morning: it might work for reading things, but if anyone attempts to write data you've got big problems. Maybe we could get away with forbidding that, not sure.

Hm, why is writing a problem? You mean I/O conversion routines writing data? Yes, that will be a problem. I am fine with simply forbidding that; we should be able to catch that and provide a sensible error message - since SSI we have the support for that. Would be nice to get some input from other hackers on this.

Oh, yes! I agree that the focus isn't 100% optimal and that there are *loads* of issues we haven't even started to look at. But you need a point to start, and extraction/apply seems to be a good one, because you can actually test it without the other issues solved, which is not really the case the other way round. Also, it's possible to plug the newly built changeset extraction into existing solutions to make them more efficient while retaining most of their respective frameworks. So I disagree that that's the wrong part to start with.

I think extraction is a very sensible place to start; actually, I think it's the best possible place to start. But this particular thread is about adding origin_ids to WAL, which I think is definitely not the best place to start.

Yep. I think the reason everyone started with it is that the patch was actually really simple ;). Note that the WAL enrichment and decoding patches came before the origin_id patch in the patch series ;)

I definitely do want to provide code that generates a textual representation of the changes. As you say, even if it's not used for anything, it's needed for debugging. Not sure if it should be SQL or maybe the new Slony representation. If that's provided and reusable, it should ensure that other solutions can be built on top of it.

Oh, yeah. If we can get that, I will throw a party.
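[A minimal sketch of the non-bypassable "call to some C function" idea above, assuming a hard-wired hook invoked from the same point an event trigger would fire. Every name here (ddl_capture_hook and friends) is invented for illustration and is not an existing PostgreSQL API.]

    #include <stddef.h>

    /* Hypothetical illustration only: none of these names exist in PostgreSQL. */
    typedef void (*ddl_capture_hook_type) (const char *command_tag,
                                           const char *object_identity);

    /* A replication plugin would install this, e.g. from its _PG_init(). */
    static ddl_capture_hook_type ddl_capture_hook = NULL;

    /*
     * Called unconditionally from the spot where an event trigger would
     * otherwise run: user triggers cannot reorder or suppress it, and the
     * hook can do C-level work (such as emitting extra WAL) that PL/pgSQL
     * cannot.
     */
    static void
    run_ddl_capture_hook(const char *command_tag, const char *object_identity)
    {
        if (ddl_capture_hook != NULL)
            ddl_capture_hook(command_tag, object_identity);
    }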
Good ;)

I find your supposition that I/we just want to get MMR without regard for anything else a bit offensive. I wrote at least three times in this thread that I think it's likely we will not get more than the minimal basis for implementing MMR into 9.3. I wrote multiple times that I want to provide the basis for multiple solutions. The prototype - while obviously incomplete - tried hard to be modular. You cannot blame us for wanting the work we do to *also* be usable for one of our major aims. What can I do to convince you/others that I am not planning to do something evil, but that I am trying to reach as many goals at once as possible?

Sorry. I don't think you're planning to do something evil, but earlier I thought you said you did NOT want to write the code to extract changes as text or something similar.

Hm. I might have been a bit ambiguous when saying that I do not want to provide everything for that use-case. Once we have a callpoint that has a correct catalog snapshot for exactly the tuple in question, text conversion is damn near trivial. The point where you get passed all that information (action, tuple, table, snapshot) is the one I think the patch should mainly provide.

I think that would be a really bad thing to skip, for all kinds of reasons. I think we need that as a
Re: [HACKERS] pgbench--new transaction type
On Wed, Jun 20, 2012 at 3:48 AM, Simon Riggs si...@2ndquadrant.com wrote: I'm sure Jeff submitted this because of the need for a standard test, rather than the wish to actually modify pgbench itself. Can I suggest that we include a list of standard scripts with pgbench for this purpose? These can then be copied alongside the binary when we do an install.

I was thinking along similar lines myself. At the least, I think we can't continue to add a short option for every new test type. Instead, maybe we could have --test-type=WHATEVER, and perhaps that then reads whatever.sql from some compiled-in directory. That would allow us to sanely support a moderately large number of tests.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
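[The --test-type idea might resolve scripts along these lines - a rough sketch with invented names; PGBENCH_SCRIPT_DIR is assumed to be a compiled-in install path, not an existing pgbench symbol.]

    #include <stdio.h>

    #ifndef PGBENCH_SCRIPT_DIR          /* hypothetical compiled-in directory */
    #define PGBENCH_SCRIPT_DIR "/usr/local/pgsql/share/pgbench"
    #endif

    /* Map --test-type=NAME to NAME.sql under the compiled-in script directory. */
    static void
    test_type_path(const char *test_type, char *path, size_t pathlen)
    {
        /* e.g. --test-type=select-only resolves to .../pgbench/select-only.sql */
        snprintf(path, pathlen, "%s/%s.sql", PGBENCH_SCRIPT_DIR, test_type);
    }

[Adding a new test would then mean installing one .sql file, rather than teaching pgbench a new option.]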
Re: [HACKERS] WAL format changes
On Wed, Jun 20, 2012 at 8:19 PM, Magnus Hagander mag...@hagander.net wrote: On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:

Well, that was easier than I thought. Attached is a patch to make XLogRecPtr a uint64, on top of my other WAL format patches. I think we should go ahead with this.

+1.

The LSNs on pages are still stored in the old format, to avoid changing the on-disk format and breaking pg_upgrade. The XLogRecPtrs stored in the control file and WAL are changed, however, so an initdb (or at least pg_resetxlog) is required.

Seems fine.

Should we keep the old representation in the replication protocol messages? That would make it simpler to write a client that works with different server versions (like pg_receivexlog). Or, while we're at it, perhaps we should mandate network byte order for all the integer and XLogRecPtr fields in the replication protocol. That would make it easier to write a client that works across different architectures, in >= 9.3. The contents of the WAL would of course be architecture-dependent, but it would be nice if pg_receivexlog and similar tools could nevertheless be architecture-independent.

I share Andres' question about how we're doing this already. I think if we're going to break this, I'd rather do it in 9.3 than 5 years from now. At this point it's just a minor annoyance, but it'll probably get worse as people write more tools that understand WAL.

If we are looking at breaking it, and we are especially concerned about something like pg_receivexlog... Is it something we could/should change in the protocol *now* for 9.2, to make it non-broken in any released version? As in, can we extract just the protocol change and backpatch that to 9.2beta?

pg_receivexlog in 9.2 cannot correctly handle the WAL location FF (which was skipped in 9.2 and before). For example, pg_receivexlog calls XLByteAdvance(), which always skips FF. So even if we change the protocol, ISTM pg_receivexlog in 9.2 cannot work well with a 9.3 server, which might send FF. No?

Regards,

-- Fujii Masao
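[On the byte-order point, a 64-bit XLogRecPtr could be put into network byte order with the usual two-htonl idiom - a sketch only; pg_hton64 is an invented name, not something the 9.2-era sources provide.]

    #include <stdint.h>
    #include <arpa/inet.h>

    /* Convert a 64-bit value to network (big-endian) byte order. */
    static uint64_t
    pg_hton64(uint64_t v)
    {
        if (htonl(1) == 1)
            return v;           /* big-endian host: already network order */

        /* little-endian host: swap the two 32-bit halves and their bytes */
        return ((uint64_t) htonl((uint32_t) v) << 32) |
               htonl((uint32_t) (v >> 32));
    }

[A client like pg_receivexlog would apply the inverse on receipt, making the wire format independent of either end's architecture.]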