Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, Sep 10, 2008 at 12:26 AM, Heikki Linnakangas [EMAIL PROTECTED] wrote:
> If a slave falls behind, how does it catch up? I guess you're saying that it can't fall behind, because the master will block before that happens. Also in asynchronous replication? And what about when the slave is first set up, and needs to catch up with the master?

The mechanism for the slave to catch up with the master should be provided outside of postgres. I think that postgres should provide only WAL streaming, i.e. the master always sends *current* WAL data to the slave.

Of course, the master also has to send the current WAL *file* in the initial transmission just after the slave starts and connects with it, because at that time the current WAL position might be in the middle of a WAL file. Even if the master sends only current WAL data, a slave which doesn't have the corresponding WAL file cannot handle it.

regards

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keeping creation time of objects
On Tue, 9 Sep 2008, David Fetter [EMAIL PROTECTED] writes:
>> AFAICS, PostgreSQL is not keeping info about when a table, database, sequence, etc was created. We cannot get that info even from the OS, since CLUSTER or VACUUM FULL may change the metadata of the corresponding relfilenode.
> When people aren't keeping track of their DDL, that is very strictly a process problem on their end. When people are shooting themselves in the foot, it's a great disservice to market Kevlar shoes to them.

Word. In the company I'm currently working at, we store the database schema in a VCS repository with minor and major version tagging, and there is a current_foo_soft_version() function that returns the revision of the related database schema. If there is no control over database schema changes in a company's working scheme, even the most logging-feature-rich PostgreSQL release will provide insignificant benefit compared to the mess that needs to get fixed.

Regards.
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, 2008-09-10 at 15:15 +0900, Fujii Masao wrote:
> On Wed, Sep 10, 2008 at 12:26 AM, Heikki Linnakangas [EMAIL PROTECTED] wrote:
>> If a slave falls behind, how does it catch up? I guess you're saying that it can't fall behind, because the master will block before that happens. Also in asynchronous replication? And what about when the slave is first set up, and needs to catch up with the master?
> The mechanism for the slave to catch up with the master should be provided on the outside of postgres.

So you mean that we still need to do the initial setup (copy backup files, then ship and replay WAL segments generated during the copy) with external WAL-shipping tools like walmgr.py, and then at some point switch over to internal WAL-shipping, once we are sure that we are within the same WAL file on both master and slave?

> I think that postgres should provide only WAL streaming, i.e. the master always sends *current* WAL data to the slave. Of course, the master has to send also the current WAL *file* in the initial sending just after the slave starts and connects with it.

I think that it needs to send all WAL files which the slave does not yet have, as otherwise the slave will have gaps. On a busy system you will generate several new WAL files in the time it takes to make the master copy, transfer it to the slave, and apply the WAL files generated during the initial setup.

> Because, at the time, current WAL position might be in the middle of WAL file. Even if the master sends only current WAL data, the slave which don't have the corresponding WAL file can not handle it.

I agree that making the initial copy may be outside the scope of Synchronous Log Shipping Replication, but the slave catching up by requesting all missing WAL files and applying them up to the point where it can switch to sync mode should be in. Otherwise we gain very little from this patch.

---
Hannu
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, Sep 10, 2008 at 12:05 PM, Hannu Krosing [EMAIL PROTECTED] wrote:
>> Because, at the time, current WAL position might be in the middle of WAL file. Even if the master sends only current WAL data, the slave which don't have the corresponding WAL file can not handle it.
> I agree, that making initial copy may be outside the scope of Synchronous Log Shipping Replication, but slave catching up by requesting all missing WAL files and applying these up to a point when it can switch to Sync mode should be in. Else we gain very little from this patch.

I agree. We should leave the initial backup acquisition out of scope, at least for the first phase, but provide a mechanism to do the initial catch-up, as it may get messy to do it completely outside of the core. The slave will need to be able to buffer the *current* WAL until it gets the missing WAL files, and then continue. Also, we may not want the master to be stuck while the slave is doing the catch-up.

Thanks, Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
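Pavan's buffering requirement — queue the live WAL while older segments are still in flight, then drain once the gap closes — can be sketched as a toy model. All class and method names here are invented for illustration; this is not the actual implementation, and real WAL positions are byte offsets rather than small segment numbers:

```python
from collections import deque

class StandbyBuffer:
    """Toy model of a standby that buffers live ('current') WAL while
    older segments are still being fetched from the master.
    Hypothetical sketch, not the PostgreSQL implementation."""

    def __init__(self, next_needed_segment):
        self.next_needed = next_needed_segment  # first missing segment number
        self.buffered = deque()                 # live records that arrived early
        self.applied = []                       # records replayed, in WAL order

    def receive_cold_segment(self, segno, records):
        # Older ("cold") segments must arrive in order; apply immediately.
        assert segno == self.next_needed
        self.applied.extend(records)
        self.next_needed += 1

    def receive_live_record(self, segno, record):
        if segno > self.next_needed:
            self.buffered.append((segno, record))  # gap still open: buffer it
        else:
            self.applied.append(record)            # caught up: apply directly

    def drain(self):
        # Once the gap is closed, replay everything buffered so far.
        while self.buffered and self.buffered[0][0] <= self.next_needed:
            self.applied.append(self.buffered.popleft()[1])
```

The point of the model is that the master never blocks: live records are accepted unconditionally and only their application is deferred until the cold segments land.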
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, 2008-09-10 at 13:28 +0900, Fujii Masao wrote:
> On Tue, Sep 9, 2008 at 8:38 PM, Heikki Linnakangas [EMAIL PROTECTED] wrote:
>> There's one thing I haven't figured out in this discussion. Does the write to the disk happen before or after the write to the slave? Can you guarantee that if a transaction is committed in the master, it's also committed in the slave, or vice versa?
> The write happens concurrently and independently on both.

Yes, you wait for both the write *and* send pointers to be flushed before you allow a synch commit with synch replication. (The definition of "flushed" is changeable by parameters.)

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Keeping creation time of objects
Hi,

Devrim GÜNDÜZ wrote:
> On Tue, 2008-09-09 at 15:36 -0400, Tom Lane wrote:
>> why would creation time (as opposed to any other time, eg last schema modification, last data modification, yadda yadda) be especially significant?
> Hmm, those would be cool, too.

Maybe except last data modification. But for audit reasons it's really helpful to see if someone has had hands on objects since they were created. So if it wouldn't cost us an arm and a leg, I'm all for having created/changed timestamps for all objects.

Regards
Tino
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, 2008-09-10 at 12:24 +0530, Pavan Deolasee wrote:
> Also we may not want the master to be stuck while slave is doing the catchup.

No, since it may take hours, not seconds.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Synchronous Log Shipping Replication
Hi,

Simon Riggs wrote:
> 1. Standby contacts primary and says it would like to catch up, but is currently at point X (which is a point at, or after, the first consistent stopping point in WAL after standby has performed its own crash recovery, if any was required).
> 2. primary initiates data transfer of old data to standby, starting at point X
> 3. standby tells primary where it has got to periodically
> 4. at some point primary decides primary and standby are close enough that it can now begin streaming current WAL (which is always the WAL up to wal_buffers behind the current WAL insertion point).

Hm.. wouldn't it be simpler to start streaming right away and cache that on the standby until it can be applied? I.e. a protocol like:

1. - same as above -
2. primary starts streaming live or hot data from its current position Y in the WAL stream, which is certainly after (or probably equal to) X.
3. standby receives the hot stream from point Y on. It now knows it misses 'cold' portions of the WAL from X to Y, and requests those.
4. primary serves the remaining 'cold' WAL chunks from its xlog / archive from between X and Y.
5. standby applies 'cold' WAL until done, then proceeds with the cached WAL segments from 'hot' streaming.

> Bear in mind that unless wal_buffers > 16MB the final catchup will *always* be less than one WAL file, so external file based mechanisms alone could never be enough.

Agreed.

> This also probably means that receipt of WAL data on the standby cannot be achieved by placing it in wal_buffers. So we probably need to write it directly to the WAL files, then rely on the filesystem cache on the standby to buffer the data for use by ReadRecord.

Makes sense, especially in the case of cached WAL as outlined above. Is this a problem in any way?

Regards

Markus Wanner
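The catch-up scheme proposed above (stream hot WAL immediately, backfill the cold range X..Y from the archive) can be illustrated with a small simulation. The function and event names are invented for illustration, and WAL positions are simplified to small integers; this is a sketch of the ordering guarantee, not a wire protocol:

```python
def simulate_catchup(standby_at, primary_at, live_records):
    """Toy simulation of 'stream right away, backfill later': the standby
    caches live WAL starting at position Y = primary_at while it requests
    the cold range [standby_at, primary_at) from the primary's archive.
    Returns the ordered event log. Names/positions are hypothetical."""
    log = []
    # 2. primary starts streaming from its current position Y immediately
    cache = [(primary_at + i, rec) for i, rec in enumerate(live_records)]
    log.append(("stream-start", primary_at))
    # 3./4. standby requests and applies the cold WAL from X up to Y
    for pos in range(standby_at, primary_at):
        log.append(("apply-cold", pos))
    # 5. cold WAL done: proceed with the cached hot stream, still in order
    for pos, _rec in cache:
        log.append(("apply-live", pos))
    return log
```

The invariant worth noticing is that the standby applies positions X, X+1, ..., Y-1, Y, ... strictly in order even though the hot records physically arrived first.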
Re: [HACKERS] Synchronous Log Shipping Replication
Simon Riggs wrote:
> On Wed, 2008-09-10 at 13:28 +0900, Fujii Masao wrote:
>> On Tue, Sep 9, 2008 at 8:38 PM, Heikki Linnakangas [EMAIL PROTECTED] wrote:
>>> There's one thing I haven't figured out in this discussion. Does the write to the disk happen before or after the write to the slave? Can you guarantee that if a transaction is committed in the master, it's also committed in the slave, or vice versa?
>> The write happens concurrently and independently on both.
> Yes, you wait for the write *and* send pointer to be flushed before you allow a synch commit with synch replication. (Definition of flushed is changeable by parameters).

The thing that bothers me is the behavior when the synchronous slave doesn't respond. A timeout has been discussed, after which the master just gives up on sending and starts acting as if there's no slave. How is that different from asynchronous mode, where WAL is sent to the slave concurrently with being flushed to disk, but we don't wait for the send to finish? ISTM that in both cases the only guarantee we can give is that when a transaction is acknowledged as committed, it's committed in the master but not necessarily in the slave.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, Sep 10, 2008 at 1:40 PM, Heikki Linnakangas [EMAIL PROTECTED] wrote:
> The thing that bothers me is the behavior when the synchronous slave doesn't respond. A timeout has been discussed, after which the master just gives up on sending, and starts acting as if there's no slave. How's that different from asynchronous mode where WAL is sent to the server concurrently when it's flushed to disk, but we don't wait for the send to finish? ISTM that in both cases the only guarantee we can give is that when a transaction is acknowledged as committed, it's committed in the master but not necessarily in the slave.

I think there is one difference. Assuming that the timeouts happen infrequently, most of the time the slave is in sync with the master, and that can be reported to the user. Whereas in async mode, the slave will *always* be out of sync.

Thanks, Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, 2008-09-10 at 11:10 +0300, Heikki Linnakangas wrote:
> [...]
> The thing that bothers me is the behavior when the synchronous slave doesn't respond. A timeout has been discussed, after which the master just gives up on sending, and starts acting as if there's no slave. How's that different from asynchronous mode where WAL is sent to the server concurrently when it's flushed to disk, but we don't wait for the send to finish? ISTM that in both cases the only guarantee we can give is that when a transaction is acknowledged as committed, it's committed in the master but not necessarily in the slave.

We should differentiate between what the WALSender does and what the user does in response to a network timeout. Saying "I want to wait for a synchronous commit and I am willing to wait forever to ensure it" leads to long hangs in some cases. I was suggesting that some users may wish to wait up to time X before responding to the commit. The WALSender may keep retrying long after that point, but that doesn't mean all current users need to do that also. The user would need to say whether the response to the timeout was an error, or just accept it and get on with it.
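The separation Simon describes — the committing backend waits a bounded time for the standby's acknowledgement, while the WALSender keeps retrying independently — might look roughly like the sketch below. `SyncCommitWaiter`, `commit_timeout`, and the return strings are all hypothetical names, not actual PostgreSQL parameters or behavior:

```python
import threading

class SyncCommitWaiter:
    """Sketch of a per-commit timeout policy for synchronous replication:
    the backend blocks up to commit_timeout seconds for the standby's ack,
    and the user chooses whether a timeout is an error or is tolerated.
    Hypothetical names; not actual PostgreSQL GUCs or APIs."""

    def __init__(self, commit_timeout):
        self.commit_timeout = commit_timeout
        self.acked = threading.Event()  # set by the WAL sender when standby acks

    def wait_for_standby(self, on_timeout="accept"):
        if self.acked.wait(self.commit_timeout):
            return "committed-synchronously"
        # Standby didn't respond in time. The WAL sender may keep retrying,
        # but this backend must answer the client now, per user policy.
        if on_timeout == "error":
            raise TimeoutError("standby did not acknowledge in time")
        return "committed-locally-only"
```

The design point is that the timeout belongs to the committing session, not to the replication channel: the channel can outlive many individual commit waits.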
--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] WIP patch: Collation support
Radek Strnad wrote:
> Progress so far:
> - created catalogs pg_collation and pg_charset, which are filled with three standard collations
> - initdb changes the rows called DEFAULT in both catalogs during the bki bootstrap phase to the current system LC_COLLATE and LC_CTYPE, or those set on the command line.
> - new collations can be defined with the command CREATE COLLATION <collation name> FOR <character set specification> FROM <existing collation name> [STRCOLFN <fn name>] [<pad characteristic>] [<case sensitive>] [LCCOLLATE <lc_collate>] [LCCTYPE <lc_ctype>]
> - because pg_collation and pg_charset are catalogs individual to each database, if you want to create a database with a collation other than specified, create it in template1 and then create the database

I have to wonder, is all that really necessary? The feature you're trying to implement is to support database-level collation at first, and perhaps column-level collation later. We don't need support for user-defined collations and charsets for that. If we leave all that out of the patch for now, we'll have a much slimmer, and just as useful, patch implementing database-level collation. We can add those catalogs later if we need them, but I don't think there's much point in adding all that infrastructure if they just reflect the locales installed in the operating system.

> - when connecting to a database, it retrieves the locales from pg_database and sets them

This is the real gist of this patch.

> Design functionality changes left:
> - move retrieving collation from pg_database to pg_type

I don't understand this item. What will you move?

> - get case sensitivity and pad characteristic working

I feel we should leave this to the collation implementation.

> - when creating a database with a different collation than the database cluster, the database has to be reindexed. Any idea how to do it? Function ReindexDatabase works only when the database is opened.

That's a tricky one.
One idea is to prohibit choosing a different collation than the one in the template database, unless we know it's safe to do so without reindexing. The problem is that we don't know whether it's safe. A simple but limiting solution would be to require that the template database has the same collation as the database that's being created, except that template0 can always be used as a template. template0 is safe because there are no indexes on text columns there.

Note that we already have the same problem with encodings. If you create a database with LATIN1 encoding, load it with data, and then use that as a template for a database with UTF-8 encoding, the text data will be incorrectly encoded. We should probably fix that too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Synchronous Log Shipping Replication
Simon Riggs wrote:
> Saying "I want to wait for a synchronous commit and I am willing to wait for ever to ensure it" leads to long hangs in some cases.

Sure. That's the fundamental problem with synchronous replication; that's why many people choose asynchronous replication instead. Clearly at some point you'll want to give up and continue without the slave, or kill the master and fail over to the slave. I'm wondering how that's different from the lag between master and slave in asynchronous replication, from the client's point of view.

> I was suggesting that some users may wish to wait up to time X before responding to the commit. The WALSender may keep retrying long after that point, but that doesn't mean all current users need to do that also. The user would need to say whether the response to the timeout was an error, or just accept and get on with it.

I'm not sure I understand that paragraph. Who's the user? Do we need to expose some new information to the client so that it can do something?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Synchronous Log Shipping Replication
On Tue, 2008-09-09 at 20:59 +0200, Zeugswetter Andreas OSB sIT wrote:
> All in all a useful streamer seems like a lot of work.

I mentioned some time ago an alternative idea of having the slave connect through a normal SQL connection and call a function which streams the WAL file from the point requested by the slave... wouldn't that be feasible? All the connection handling would already be there; only the streaming function would need to be implemented. It could even use SSL connections if needed.

Then you would have one normal backend per slave, and they would access either the files directly, or possibly some shared area where the WAL is buffered for this purpose... the streaming function could also take care of signaling the up-to-dateness of the slaves in case of synchronous replication. There could also be some system table infrastructure to track the slaves.

There could also be functions to stream the files of the DB through normal backends, so a slave could be bootstrapped all the way from copying the files through a simple postgres backend connection... that would make for the easiest possible setup of a slave: configure a connection to the master, and hit run... And last but not least, the same interface could be used by a PITR repository client for archiving the WAL stream and occasional file system snapshots.

Cheers, Csaba.
Re: [HACKERS] WIP patch: Collation support
On Wed, Sep 10, 2008 at 11:29:14AM +0300, Heikki Linnakangas wrote:
> Radek Strnad wrote:
>> - because pg_collation and pg_charset are catalogs individual to each database, if you want to create a database with a collation other than specified, create it in template1 and then create the database
> I have to wonder, is all that really necessary? The feature you're trying to implement is to support database-level collation at first, and perhaps column-level collation later. We don't need support for user-defined collations and charsets for that.

Since the set of collations isn't exactly denumerable, we need some way to allow the user to specify the collation they want. The only collation PostgreSQL knows about is the C collation; anything else is user-defined.

>> Design functionality changes left:
>> - move retrieving collation from pg_database to pg_type
> I don't understand this item. What will you move?

Long term, the collation is a property of the type, but I agree, I'm not sure why this patch needs it.

> That's a tricky one. One idea is to prohibit choosing a different collation than the one in the template database, unless we know it's safe to do so without reindexing.

But that puts us back where we started: every database having the same collation. We're trying to move away from that. Just reindex everything and be done with it.

> Note that we already have the same problem with encodings. If you create a database with LATIN1 encoding, load it with data, and then use that as a template for a database with UTF-8 encoding, the text data will be incorrectly encoded. We should probably fix that too.

I'd say forbid more than one encoding in a cluster, but that's just my opinion :)

Have a nice day,
--
Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/
Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines.
Re: [HACKERS] Keeping creation time of objects
On Wed, 2008-09-10 at 09:27 +0300, Volkan YAZICI wrote:
> On Tue, 9 Sep 2008, David Fetter [EMAIL PROTECTED] writes:
>>> AFAICS, PostgreSQL is not keeping info about when a table, database, sequence, etc was created. We cannot get that info even from OS, since CLUSTER or VACUUM FULL may change the metadata of corresponding relfilenode.
>> When people aren't keeping track of their DDL, that is very strictly a process problem on their end. When people are shooting themselves in the foot, it's a great disservice to market Kevlar shoes to them.
> Word. In the company I'm currently working at we store database schema in a VCS repository with minor and major version taggings. And there is a current_foo_soft_version() function that returns the revision of the related database schema. If there is no control over the database schema changes in a company working scheme, the most logging-feature-rich PostgreSQL release will provide an insignificant benefit compared the mess needs to get fixed.

Timestamps should rather be considered a forensic tool. You may have the best VCS system, but if somebody bypasses it, you may still need to find out when it was done. Until we have some enforceable audit facilities for DDL in place _inside_ the database, having at least timestamps often helps.

- Hannu
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, 2008-09-10 at 10:06 +0200, Markus Wanner wrote:
> Simon Riggs wrote:
>> [...]
> Hm.. wouldn't it be simpler, to start streaming right away and cache that on the standby until it can be applied? I.e. a protocol like:

Good idea! This makes everything simpler, as the user has to do only 4 things:

1. start slave in "receive WAL, don't apply" mode
2. start WAL shipping on master
3. copy files from master to slave
4. restart slave in "receive WAL" mode

All else will happen automatically.

---
Hannu
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, 2008-09-10 at 08:15 +0100, Simon Riggs wrote:
> Any working solution needs to work for all required phases. If you did it this way, you'd never catch up at all. When you first make the copy, it will be made at time X. The point of consistency will be sometime later and requires WAL data to make it consistent. So you would need to do a PITR to get it to the point of consistency. While you've been doing that, the primary server has moved on and now there is a gap between primary and standby. You *must* provide a facility to allow the standby to catch up with the primary. Only sending *current* WAL is not a solution, and not acceptable. So there must be mechanisms for sending past *and* current WAL data to the standby, and an exact and careful mechanism for switching between the two modes when the time is right. Replication is only synchronous *after* the change in mode.
>
> So the protocol needs to be something like:
>
> 1. Standby contacts primary and says it would like to catch up, but is currently at point X (which is a point at, or after, the first consistent stopping point in WAL after standby has performed its own crash recovery, if any was required).
> 2. primary initiates data transfer of old data to standby, starting at point X
> 3. standby tells primary where it has got to periodically
> 4. at some point primary decides primary and standby are close enough that it can now begin streaming current WAL (which is always the WAL up to wal_buffers behind the current WAL insertion point).
>
> Bear in mind that unless wal_buffers > 16MB the final catchup will *always* be less than one WAL file, so external file based mechanisms alone could never be enough. So you would need wal_buffers > 2000 to make an external catch up facility even work at all.
>
> This also probably means that receipt of WAL data on the standby cannot be achieved by placing it in wal_buffers. So we probably need to write it directly to the WAL files, then rely on the filesystem cache on the standby to buffer the data for use by ReadRecord.

And this catch-up may need to be done repeatedly, in case of network failure.

I don't think that the slave automatically becoming a master if it detects network failure (as suggested elsewhere in this thread) is an acceptable solution, as it will more often than not result in two masters. A better solution would be:

1. Slave just keeps waiting for new WAL records and confirming receipt, storage to disk, and application.
2. Master is in one of at least two states:
2.1 - Catchup - async mode, where it is sending old logs and WAL records to the slave
2.2 - Sync Replication - sync mode, where COMMIT does not return before confirmation from WALSender.

The initial mode is Catchup, which is promoted to Sync Replication when the delay in WAL application is reasonably small. When the Master detects a network outage (== delay bigger than acceptable), it will either just send a NOTICE to all clients and fall back to Catchup, or raise an ERROR (and still fall back to Catchup). This is the point where external HA / heartbeat etc. solutions would intervene and decide what to do.

- Hannu
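Hannu's two-state master could be modeled as a tiny state machine. The class name, mode strings, and lag threshold are invented for illustration; the real promotion/demotion criteria would be more involved:

```python
class MasterReplicationState:
    """Toy state machine for a two-state master: CATCHUP (async, shipping
    old WAL to the slave) vs SYNC (commit waits for WALSender confirmation).
    Hypothetical sketch; names and thresholds are invented."""

    def __init__(self, max_lag_bytes):
        self.max_lag = max_lag_bytes
        self.mode = "CATCHUP"  # initial mode: slave is still catching up

    def report_standby_lag(self, lag_bytes):
        if self.mode == "CATCHUP" and lag_bytes <= self.max_lag:
            # Delay is reasonably small: promote to sync replication.
            self.mode = "SYNC"
        elif self.mode == "SYNC" and lag_bytes > self.max_lag:
            # Outage or unacceptable delay: notify clients (or error out)
            # and fall back to catch-up; HA tooling decides what happens next.
            self.mode = "CATCHUP"
        return self.mode
```

Note that the slave side needs no states at all in this model — it just keeps receiving, storing, and applying WAL, which is what makes the scheme robust against repeated network failures.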
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, Sep 10, 2008 at 4:15 PM, Simon Riggs [EMAIL PROTECTED] wrote:
> On Wed, 2008-09-10 at 09:35 +0300, Hannu Krosing wrote:
>> [...]
>> I agree, that making initial copy may be outside the scope of Synchronous Log Shipping Replication, but slave catching up by requesting all missing WAL files and applying these up to a point when it can switch to Sync mode should be in. Else we gain very little from this patch.
>
> I agree with Hannu. Any working solution needs to work for all required phases. If you did it this way, you'd never catch up at all. [...] You *must* provide a facility to allow the standby to catch up with the primary. Only sending *current* WAL is not a solution, and not acceptable. So there must be mechanisms for sending past *and* current WAL data to the standby, and an exact and careful mechanism for switching between the two modes when the time is right. Replication is only synchronous *after* the change in mode.
>
> So the protocol needs to be something like:
>
> 1. Standby contacts primary and says it would like to catch up, but is currently at point X (which is a point at, or after the first consistent stopping point in WAL after standby has performed its own crash recovery, if any was required).
> 2. primary initiates data transfer of old data to standby, starting at point X
> 3. standby tells primary where it has got to periodically
> 4. at some point primary decides primary and standby are close enough that it can now begin streaming current WAL (which is always the WAL up to wal_buffers behind the current WAL insertion point).
>
> [...]
>
> --
> Simon Riggs
> www.2ndQuadrant.com
> PostgreSQL Training, Services and Support

Umm.. I disagree with you ;) Here is my initial setup sequence:

1) Start WAL receiver. The current WAL file and subsequent ones will be transmitted by WAL sender and WAL receiver. This transmission does not block the following operations for initial setup, and vice versa. That is, the slave can catch up with the master without blocking the master. I cannot accept that the WAL sender be blocked for the initial setup.
2) Copy the missing history files from the master to the slave.
3) Prepare recovery.conf on the slave. You have to configure pg_standby and set recovery_target_timeline to 'latest' or the current TLI of the master.
4) Start postgres. The startup process and pg_standby start archive recovery. If there are missing WAL files, pg_standby waits for it and WAL replay is
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, 2008-09-10 at 10:06 +0200, Markus Wanner wrote: Hi, Simon Riggs wrote: 1. Standby contacts primary and says it would like to catch up, but is currently at point X (which is a point at, or after, the first consistent stopping point in WAL after the standby has performed its own crash recovery, if any was required). 2. Primary initiates data transfer of old data to standby, starting at point X. 3. Standby tells primary where it has got to, periodically. 4. At some point the primary decides primary and standby are close enough that it can now begin streaming current WAL (which is always the WAL up to wal_buffers behind the current WAL insertion point). Hm.. wouldn't it be simpler to start streaming right away and cache The standby server won't come up until you have: * copied the base backup * sent it to standby server * bring up standby, have it realise it is a replication partner and begin requesting WAL from primary (in some way) There will be a gap (probably) between the initial WAL files and the current tail of wal_buffers by the time all of the above has happened. We will then need to copy more WAL across until we get to a point where the most recent WAL record available on standby is ahead of the tail of wal_buffers on primary so that streaming can start. If we start caching WAL right away we would need to have two receivers. One to receive the missing WAL data and one to receive the current WAL data. We can't apply the WAL until we have the earlier missing WAL data, so caching it seems difficult. On a large server this might be GBs of data. Seems easier to not cache current WAL and to have just a single WALReceiver process that performs a mode change once it has caught up. (And I should say *if* it catches up, since it is possible that it never actually will catch up, in practical terms, since this depends upon the relative power of the servers involved.). 
So there's no need to store more WAL on standby than is required to restart recovery from the last restartpoint, i.e. we stream WAL at all times, not just in normal running mode. Seems easiest to have: * Startup process only reads the next WAL record when ReceivedLogPtr > ReadRecPtr, so it knows nothing of how WAL is received. Startup process reads directly from WAL files in *all* cases. ReceivedLogPtr is in shared memory and accessed via spinlock. Startup process only ever reads this pointer. (Notice that the Startup process is modeless). * WALReceiver reads data from primary and writes it to WAL files, fsyncing (if ever requested to do so). WALReceiver updates ReceivedLogPtr. That is much simpler and more modular. Buffering of the WAL files is handled by filesystem buffering. If the standby crashes, all data is safely written to WAL files and we restart from the correct place. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, 2008-09-10 at 17:57 +0900, Fujii Masao wrote: I cannot accept that WAL sender is blocked for initial setup. Yes, very important point. We definitely agree on that. The primary must be able to continue working while all this setup is happening. No tables are locked, all commits are allowed etc.. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Synchronous Log Shipping Replication
Hi, On Wednesday 10 September 2008, Heikki Linnakangas wrote: Sure. That's the fundamental problem with synchronous replication. That's why many people choose asynchronous replication instead. Clearly at some point you'll want to give up and continue without the slave, or kill the master and fail over to the slave. I'm wondering how that's different than the lag between master and server in asynchronous replication from the client's point of view. As a future user of these new facilities, the difference from the client's POV is simple: in normal mode of operation, we want a strong guarantee that any COMMIT has made it to both the master and the slave at commit time. No lag whatsoever. You're considering lag as an option in case of failure, but I don't see this as acceptable when you need sync commit. In case of network timeout, the cluster is down. So you want to either continue servicing in degraded mode or take the service down while you repair the cluster, but neither of those choices can be transparent to the admins, I'd argue. Of course, the main use case is high availability, which tends to say you do not have the option to stop service, and seems to dictate continuing to service in degraded mode: slave can't keep up (whatever the error domain), master is alone, advertise to monitoring solutions and continue servicing. And provide some way for the slave to rejoin, maybe, too. I'm not sure I understand that paragraph. Who's the user? Do we need to expose some new information to the client so that it can do something? Maybe with some GUCs to set the acceptable timeout for the WAL sync process, and whether reaching the timeout is a warning or an error. With a userset GUC we could even have replication-error-level transactions concurrent with non-critical ones... Now what to do exactly in case of error remains to be decided... HTH, Regards, -- dim
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, 2008-09-10 at 17:57 +0900, Fujii Masao wrote: Umm.. I disagree with you ;) That's no problem and I respect your knowledge. If we disagree, it is very likely because we have misunderstood each other. Much has been written, so I will wait for it to all be read and understood by you and others, and for me to read other posts and replies also. I feel sure that after some thought a clear consensus will emerge, and I feel hopeful that the feature can be done in the time available with simple code changes. So I will stop replying for a few hours to give everyone time (incl me). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Synchronous Log Shipping Replication
Hi, Simon Riggs wrote: The standby server won't come up until you have: * copied the base backup * sent it to standby server * bring up standby, have it realise it is a replication partner and begin requesting WAL from primary (in some way) Right. That was your assumption as well. Required before step 1 in both cases. If we start caching WAL right away we would need to have two receivers. One to receive the missing WAL data and one to receive the current WAL data. We can't apply the WAL until we have the earlier missing WAL data, so caching it seems difficult. You could use the same receiver process and just handle different packets differently. I see no need for two separate receiver processes here. On a large server this might be GBs of data. ..if served from a log archive, correct. Without archiving, we are limited to xlog anyway. Seems easier to not cache current WAL and to have just a single WALReceiver process that performs a mode change once it has caught up. (And I should say *if* it catches up, since it is possible that it never actually will catch up, in practical terms, since this depends upon the relative power of the servers involved.). So there's no need to store more WAL on standby than is required to restart recovery from the last restartpoint, i.e. we stream WAL at all times, not just in normal running mode. Switching between streaming from files and 'live' streaming on the active node seems difficult to me, because you need to make sure there's no gap. That problem could be circumvented by handling this on the standby. If you think switching on the active is simple enough, that's fine. Seems easiest to have: * Startup process only reads the next WAL record when ReceivedLogPtr > ReadRecPtr, so it knows nothing of how WAL is received. Startup process reads directly from WAL files in *all* cases. ReceivedLogPtr is in shared memory and accessed via spinlock. Startup process only ever reads this pointer. (Notice that the Startup process is modeless). 
Well, that's certainly easier for the standby, but requires mode switching on the active. Regards Markus Wanner
Re: [HACKERS] WIP patch: Collation support
Martijn van Oosterhout wrote: On Wed, Sep 10, 2008 at 11:29:14AM +0300, Heikki Linnakangas wrote: Radek Strnad wrote: - because pg_collation and pg_charset are catalogs individual to each database, if you want to create a database with a collation other than the specified one, create it in template1 and then create the database I have to wonder, is all that really necessary? The feature you're trying to implement is to support database-level collation at first, and perhaps column-level collation later. We don't need support for user-defined collations and charsets for that. Since the set of collations isn't exactly denumerable, we need some way to allow the user to specify the collation they want. The only collation PostgreSQL knows about is the C collation. Anything else is user-defined. Let's just use the name of the OS locale, like we do now. Having a pg_collation catalog just moves the problem elsewhere: we'd still need something in pg_collation to tie the collation to the OS locale. Design functionality changes left: - move retrieving collation from pg_database to pg_type I don't understand this item. What will you move? Long term, the collation is a property of the type, ... You might want to provide a default collation for a type as well, but the very finest grade is that you can specify collation for every (text) comparison operator in your query. Of course you don't want to do that for every query, which is why we should provide defaults at different levels: columns, tables, database. And perhaps types as well, but that's not the most interesting case. I'm not sure what the SQL spec says about that, but I believe it provides syntax and rules for all that. That's a tricky one. One idea is to prohibit choosing a different collation than the one in the template database, unless we know it's safe to do so without reindexing. But that puts us back where we started: every database having the same collation. We're trying to move away from that. 
Just reindex everything and be done with it. That's easier said than done, unfortunately. Note that we already have the same problem with encodings. If you create a database with LATIN1 encoding, load it with data, and then use that as a template for a database with UTF-8 encoding, the text data will be incorrectly encoded. We should probably fix that too. I'd say forbid more than one encoding in a cluster, but that's just my opinion :) Yeah, that's pretty useless, at least without support for different locales on different databases. But might as well keep it unless there's a pressing reason to drop it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
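The layered defaults Heikki describes (per-column, per-table, per-database, falling back to C) can be sketched as a simple resolution chain. This is an illustrative sketch only; the lookup order and function name are invented for this example and are not the SQL spec's rules or PostgreSQL's implementation.

```python
# Sketch of collation default resolution: the most specific level that
# defines a collation wins, falling back to the database default, then C.
# Names and order are illustrative, not the SQL spec's actual rules.
def resolve_collation(column=None, table=None, database=None):
    for level in (column, table, database):
        if level is not None:
            return level
    return "C"  # the one collation PostgreSQL itself knows about
```

A comparison operator would then be evaluated under whatever collation this resolves to, so most queries never spell out a collation explicitly.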
Re: [HACKERS] WIP patch: Collation support
Heikki Linnakangas wrote: Design functionality changes left: - move retrieving collation from pg_database to pg_type The problem there is that pg_collation is a local catalog, but pg_database is a global catalog. IIRC, it was discussed during the last commitfest. I think it is a bad idea to make this kind of dependency. It seems better to me to implement something that we will use later, to avoid a useless and confusing dependency. Zdenek
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, 2008-09-10 at 11:07 +0200, Dimitri Fontaine wrote: Hi, On Wednesday 10 September 2008, Heikki Linnakangas wrote: Sure. That's the fundamental problem with synchronous replication. That's why many people choose asynchronous replication instead. Clearly at some point you'll want to give up and continue without the slave, or kill the master and fail over to the slave. I'm wondering how that's different than the lag between master and server in asynchronous replication from the client's point of view. As a future user of these new facilities, the difference from the client's POV is simple: in normal mode of operation, we want a strong guarantee that any COMMIT has made it to both the master and the slave at commit time. No lag whatsoever. Agreed. You're considering lag as an option in case of failure, but I don't see this as acceptable when you need sync commit. In case of network timeout, the cluster is down. So you want to either continue servicing in degraded mode or take the service down while you repair the cluster, but neither of those choices can be transparent to the admins, I'd argue. Of course, the main use case is high availability, which tends to say you do not have the option to stop service, We have a number of choices, at the point of failure: * Does the whole primary server stay up (probably)? * Do we continue to allow new transactions in degraded mode? (which increases the risk of transaction loss if we continue at that time). (The answer sounds like it will be "of course, stupid" but this cluster may be part of an even higher level HA mechanism, so the answer isn't always clear). * For each transaction that is trying to commit: do we want to wait forever? If not, how long? If we stop waiting, do we throw ERROR, or do we say, let's get on with another transaction. If the server is up, yet all connections in a session pool are stuck waiting for their last commits to complete, then most sysadmins would agree that the server is actually down. 
Since no useful work is happening, or can be initiated - even read only. We don't need to address that issue in the same way for all transactions, is all I'm saying. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
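The per-transaction choice Simon describes (wait forever? how long? throw ERROR or continue?) can be sketched as a commit wait with a timeout and a configurable expiry action. This is an illustrative sketch only: the names (`CommitWait`, `timeout_action`) are invented for this example and are not proposed GUCs or PostgreSQL code.

```python
# Sketch of a per-transaction synchronous-commit wait. Each committing
# transaction waits for the standby's acknowledgement up to a timeout;
# what happens on expiry is a per-transaction policy, not server-wide.
import threading

class CommitWait:
    def __init__(self, timeout, timeout_action="error"):
        self.ack = threading.Event()
        self.timeout = timeout                  # seconds to wait for standby ack
        self.timeout_action = timeout_action    # "error" or "continue"

    def standby_acked(self):
        self.ack.set()  # called when the standby confirms the commit record

    def wait_for_standby(self):
        if self.ack.wait(self.timeout):
            return "committed"             # fully synchronous commit
        if self.timeout_action == "error":
            return "error"                 # throw ERROR back to the client
        return "committed-degraded"        # give up waiting, carry on degraded
```

A session pool would then never be stuck indefinitely: each commit either completes synchronously, errors out, or is explicitly allowed to proceed in degraded mode.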
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, 2008-09-10 at 09:35 +0300, Hannu Krosing wrote: On Wed, 2008-09-10 at 15:15 +0900, Fujii Masao wrote: On Wed, Sep 10, 2008 at 12:26 AM, Heikki Linnakangas [EMAIL PROTECTED] wrote: If a slave falls behind, how does it catch up? I guess you're saying that it can't fall behind, because the master will block before that happens. Also in asynchronous replication? And what about when the slave is first set up, and needs to catch up with the master? The mechanism for the slave to catch up with the master should be provided on the outside of postgres. So you mean that we still need to do initial setup (copy backup files and ship and replay WAL segments generated during copy) by external WAL-shipping tools, like walmgr.py, and then at some point switch over to internal WAL-shipping, when we are sure that we are within same WAL file on both master and slave ? I think that postgres should provide only WAL streaming, i.e. the master always sends *current* WAL data to the slave. Of course, the master has to send also the current WAL *file* in the initial sending just after the slave starts and connects with it. I think that it needs to send all WAL files which slave does not yet have, as else the slave will have gaps. On busy system you will generate several new WAL files in the time it takes to make master copy, transfer it to slave and apply WAL files generated during initial setup. Because, at the time, current WAL position might be in the middle of WAL file. Even if the master sends only current WAL data, the slave which don't have the corresponding WAL file can not handle it. I agree, that making initial copy may be outside the scope of Synchronous Log Shipping Replication, but slave catching up by requesting all missing WAL files and applying these up to a point when it can switch to Sync mode should be in. Else we gain very little from this patch. I agree with Hannu. Any working solution needs to work for all required phases. 
If you did it this way, you'd never catch up at all. When you first make the copy, it will be made at time X. The point of consistency will be sometime later and requires WAL data to make it consistent. So you would need to do a PITR to get it to the point of consistency. While you've been doing that, the primary server has moved on and now there is a gap between primary and standby. You *must* provide a facility to allow the standby to catch up with the primary. Only sending *current* WAL is not a solution, and not acceptable. So there must be mechanisms for sending past *and* current WAL data to the standby, and an exact and careful mechanism for switching between the two modes when the time is right. Replication is only synchronous *after* the change in mode. So the protocol needs to be something like: 1. Standby contacts primary and says it would like to catch up, but is currently at point X (which is a point at, or after, the first consistent stopping point in WAL after the standby has performed its own crash recovery, if any was required). 2. Primary initiates data transfer of old data to standby, starting at point X. 3. Standby tells primary where it has got to, periodically. 4. At some point the primary decides primary and standby are close enough that it can now begin streaming current WAL (which is always the WAL up to wal_buffers behind the current WAL insertion point). Bear in mind that unless wal_buffers > 16MB the final catchup will *always* be less than one WAL file, so external file-based mechanisms alone could never be enough. So you would need wal_buffers >= 2000 to make an external catch-up facility even work at all. This also probably means that receipt of WAL data on the standby cannot be achieved by placing it in wal_buffers. So we probably need to write it directly to the WAL files, then rely on the filesystem cache on the standby to buffer the data for use by ReadRecord. 
-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
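The four-step handshake in Simon's message can be sketched as a small state machine. This is an illustrative sketch, not PostgreSQL code: LSNs are modeled as plain integers, and `CATCHUP_THRESHOLD` is an invented stand-in for the "close enough to begin streaming" decision in step 4.

```python
# Sketch of the catch-up protocol: the primary stays in catch-up mode,
# shipping old WAL from point X, until the standby's reported position is
# close enough to the current insertion point to switch to streaming.
CATCHUP = "catchup"
STREAMING = "streaming"

CATCHUP_THRESHOLD = 16  # assumed: max LSN gap before switching to streaming

class Primary:
    def __init__(self, insert_ptr):
        self.insert_ptr = insert_ptr  # current WAL insertion point
        self.mode = CATCHUP
        self.send_from = None

    def handle_connect(self, standby_ptr):
        # Steps 1-2: standby reports point X; primary starts sending old WAL.
        self.send_from = standby_ptr
        self.mode = CATCHUP

    def handle_progress(self, standby_ptr):
        # Step 3: standby periodically reports how far it has got.
        # Step 4: switch to streaming once the gap is small enough.
        if self.insert_ptr - standby_ptr <= CATCHUP_THRESHOLD:
            self.mode = STREAMING
        return self.mode
```

Replication becomes synchronous only after the mode changes to streaming; everything before that is ordinary catch-up shipping and never blocks the primary.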
Re: [HACKERS] WIP patch: Collation support
On Wed, Sep 10, 2008 at 12:51:02PM +0300, Heikki Linnakangas wrote: Since the set of collations isn't exactly denumerable, we need some way to allow the user to specify the collation they want. The only collation PostgreSQL knows about is the C collation. Anything else is user-defined. Let's just use the name of the OS locale, like we do now. Having a pg_collation catalog just moves the problem elsewhere: we'd still need something in pg_collation to tie the collation to the OS locale. There's not a one-to-one mapping between collation and locale name. A locale name includes information about the charset, and a collation may have parameters like case-sensitivity and pad-attribute which are not present in the locale name. You need a mapping anyway, which is what this table is for. The difference in collation between CHAR() and VARCHAR() is the usual example here. Long term, the collation is a property of the type, ... I'm not sure what the SQL spec says about that, but I believe it provides syntax and rules for all that. The spec is quite detailed about it, and I posted code to do it years ago. The point is that we don't need to go that far with this patch. But that puts us back where we started: every database having the same collation. We're trying to move away from that. Just reindex everything and be done with it. That's easier said than done, unfortunately. I don't see an alternative. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines.
[HACKERS] using hash index when BETWEEN is specified
I have played with the new hash index implementation and I tried the following command: postgres=# select * from test where id between 1 and 5; Time: 9651,033 ms postgres=# explain select * from test where id between 1 and 5; QUERY PLAN: Seq Scan on test (cost=0.00..141681.00 rows=1 width=4) Filter: ((id >= 1) AND (id <= 5)) (2 rows) The hash index is created on the id column. However, when I use postgres=# explain select * from test where id in (1,2,3,4,5); QUERY PLAN: Bitmap Heap Scan on test (cost=22.24..332.53 rows=83 width=4) Recheck Cond: (id = ANY ('{1,2,3,4,5}'::integer[])) -> Bitmap Index Scan on test_idx (cost=0.00..22.22 rows=83 width=0) Index Cond: (id = ANY ('{1,2,3,4,5}'::integer[])) (4 rows) Time: 1,352 ms I'm not a planner guru, but it seems to me that a BETWEEN clause could be rewritten as an IN clause for integer data types and small intervals. Zdenek
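The rewrite Zdenek suggests works because, for an integer column, `id BETWEEN lo AND hi` is equivalent to an IN list of every value in the range, and equality probes are exactly what a hash index can serve. Here is an illustrative sketch of such a rewrite; `MAX_REWRITE` is an invented cutoff, standing in for the "define small" question raised later in the thread.

```python
# Sketch of rewriting an integer BETWEEN into an IN list so a hash index
# (which supports only equality) could be used. MAX_REWRITE is an assumed
# cutoff; the real decision would have to be cost-based.
MAX_REWRITE = 100  # assumed upper bound on how many values to expand

def rewrite_between(col, lo, hi):
    """Return an IN-list clause for an integer BETWEEN, or None if too wide."""
    n = hi - lo + 1
    if n <= 0 or n > MAX_REWRITE:
        return None  # keep the original range predicate
    values = ", ".join(str(v) for v in range(lo, hi + 1))
    return f"{col} IN ({values})"
```

Note the equivalence only holds for types whose values in the range can be enumerated exactly (integers, not floats or text), which is why the thread restricts the idea to integer data types.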
Re: [HACKERS] using hash index when BETWEEN is specified
On Wed, Sep 10, 2008 at 1:39 PM, Zdenek Kotala [EMAIL PROTECTED] wrote: I have played with the new hash index implementation and I tried the following command: postgres=# select * from test where id between 1 and 5; Time: 9651,033 ms postgres=# explain select * from test where id between 1 and 5; QUERY PLAN: Seq Scan on test (cost=0.00..141681.00 rows=1 width=4) Filter: ((id >= 1) AND (id <= 5)) (2 rows) The hash index is created on the id column. However, when I use postgres=# explain select * from test where id in (1,2,3,4,5); QUERY PLAN: Bitmap Heap Scan on test (cost=22.24..332.53 rows=83 width=4) Recheck Cond: (id = ANY ('{1,2,3,4,5}'::integer[])) -> Bitmap Index Scan on test_idx (cost=0.00..22.22 rows=83 width=0) Index Cond: (id = ANY ('{1,2,3,4,5}'::integer[])) (4 rows) Time: 1,352 ms I'm not a planner guru, but it seems to me that a BETWEEN clause could be rewritten as an IN clause for integer data types and small intervals. Where should the line be drawn? Define small :) Zdenek
Re: [HACKERS] using hash index when BETWEEN is specified
I'm not a planner guru, but it seems to me that a BETWEEN clause could be rewritten as an IN clause for integer data types and small intervals. Where should the line be drawn? Define small :) When the estimated cost is lower? ...Robert
[HACKERS] Base64 decode/encode performance
Hi, I have been experimenting with some base64 encoding/decoding implementations. I found that the one at http://code.google.com/p/stringencoders is the fastest, obtaining a 1.3x speedup vs the postgres one. Do you think it is worth submitting a patch that replaces the postgres base64 implementation with this one? Regards Gaetano Mendola
Re: [HACKERS] using hash index when BETWEEN is specified
On Wed, 2008-09-10 at 07:13 -0400, Robert Haas wrote: I'm not a planner guru, but it seems to me that a BETWEEN clause could be rewritten as an IN clause for integer data types and small intervals. Where should the line be drawn? Define small :) When the estimated cost is lower? You still need to draw a line for when to even try estimating the cost. Will this be an interval of 10? Or 100? Or 1? Hannu
Re: [HACKERS] Synchronous Log Shipping Replication
* Simon Riggs [EMAIL PROTECTED] [080910 06:18]: We have a number of choices, at the point of failure: * Does the whole primary server stay up (probably)? The only sane choice is the one the admin makes. Any predetermined choice PG makes can (and will) be wrong in some situations. * Do we continue to allow new transactions in degraded mode? (which increases the risk of transaction loss if we continue at that time). (The answer sounds like it will be of course, stupid but this cluster may be part of an even higher level HA mechanism, so the answer isn't always clear). The only sane choice is the one the admin makes. Any predetermined choice PG makes can (and will) be wrong in some situations. * For each transaction that is trying to commit: do we want to wait forever? If not, how long? If we stop waiting, do we throw ERROR, or do we say, lets get on with another transaction. The only sane choice is the one the admin makes. Any predetermined choice PG makes can (and will) be wrong in some situations. If the server is up, yet all connections in a session pool are stuck waiting for their last commits to complete then most sysadmins would agree that the server is actually down. Since no useful work is happening, or can be initiated - even read only. We don't need to address that issue in the same way for all transactions, is all I'm saying. Sorry to sound like a broken record here, but the whole point is to guarantee data safety. You can only start trading ACID for HA if you have the ACID guarantees in the first place (and for replication, this means across the cluster, including slaves) So in that light, I think it's pretty obvious that if a slave is considered part of an active synchronous replication cluster, in the face of network lag, or even network failure, the master *must* pretty much halt all new commits in their tracks until that slave acknowledges the commit. Yes that's going to cause a backup. That's the cost of a synchronous replication. 
But that means the admin has to be able to control whether a slave is part of an active synchronous replication cluster or not. I hope that control eventually is a lot more than a GUC that says "when a slave is X seconds behind, abandon him". I'd dream of a replication interface where I could add new slaves on the fly (and a nice tool that pg_start_backup()/sync/apply WAL to sync then subscribe), get slave status (maybe syncing/active/abandoned), and some average latency (i.e. something like svctm of iostat on your WAL disk) and some way to control the slave degradation from active to abandoned (like the above GUC, or maybe a callout/hook/script that runs when latency > X, etc, or both). And for async replication, you just have a proxy slave which does nothing but subscribe to your master, always acknowledge WAL right away so the master doesn't wait, and keep a local backlog of WAL it's sending out to many clients. This proxy slave doesn't slow down the master, but can feed clients across slow WAN links (that may not have the burst bandwidth to keep up with bursty master writes, but have aggregate bandwidth to keep pretty close to the master), or networks that drop out for a period, etc. -- Aidan Van Dyk Create like a god, [EMAIL PROTECTED] command like a king, http://www.highrise.ca/ work like a slave.
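Aidan's proxy-slave idea can be sketched in a few lines: the relay acknowledges WAL to the master immediately, so the master never waits on slow downstream links, and keeps a backlog that it drains to subscribers at their own pace. This is an illustrative sketch only; the class and method names are invented for this example.

```python
# Sketch of a WAL proxy for async fan-out: ack the master right away,
# buffer a backlog, and drain it to downstream slaves as bandwidth allows.
from collections import deque

class WalProxy:
    def __init__(self):
        self.backlog = deque()   # WAL not yet shipped downstream
        self.subscribers = []    # downstream slaves (modeled as lists)

    def receive(self, wal_chunk):
        # Immediate acknowledgement: the master is never blocked by us.
        self.backlog.append(wal_chunk)
        return "ack"

    def drain(self):
        # Ship the backlog downstream in order; slow links just mean the
        # backlog grows for a while instead of stalling the master.
        while self.backlog:
            chunk = self.backlog.popleft()
            for slave in self.subscribers:
                slave.append(chunk)
```

The trade-off is explicit: downstream slaves can lag arbitrarily (the backlog absorbs bursts and dropouts), which is exactly the asynchronous guarantee, not the synchronous one.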
Re: [HACKERS] Keeping creation time of objects
Hannu Krosing wrote: Timestamps should rather be considered a forensic tool. You may have the best VCS system, but if somebody bypasses it, you may still need to find out when it was done. So you're arguing for modification time, which is not what Devrim is proposing -- he's proposing creation time. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, 2008-09-10 at 09:36 -0400, Aidan Van Dyk wrote: * Simon Riggs [EMAIL PROTECTED] [080910 06:18]: We have a number of choices, at the point of failure: * Does the whole primary server stay up (probably)? The only sane choice is the one the admin makes. Any predetermined choice PG makes can (and will) be wrong in some situations. We are in agreement then. Those questions were listed as arguments in favour of a parameter to let the sysadmin choose. More than that, I was saying this can be selected for individual transactions, not just for the server as a whole (as other vendors do). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Keeping creation time of objects
Hi, On Tue, 2008-09-09 at 23:14 -0400, Andrew Dunstan wrote: Nothing is cost free. Every feature adds to code complexity, and has to be maintained. With full respect to you: I'm only talking about creation time. How much overhead and complexity are you expecting? Cheers, -- Devrim GÜNDÜZ, RHCE devrim~gunduz.org, devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr http://www.gunduz.org
Re: [HACKERS] Synchronous Log Shipping Replication
On Tue, Sep 9, 2008 at 10:55 PM, Markus Wanner [EMAIL PROTECTED] wrote: Hi, ITAGAKI Takahiro wrote: Signals and locking, borrowed from Postgres-R, are now being studied for this purpose in the log shipping, Cool. Let me know if you have any questions WRT this imessages stuff. If you're sure it's all right, I have a trivial question. Which signal should we use for the notification to the backend from the WAL sender? The notable signals are already used. Or, since a backend doesn't need to wait on select(), unlike the WAL sender, ISTM that it's not so inconvenient to use a semaphore for that notification. Your thoughts? regards -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Synchronous Log Shipping Replication
Hi, Fujii Masao wrote: On Tue, Sep 9, 2008 at 10:55 PM, Markus Wanner [EMAIL PROTECTED] wrote: Hi, ITAGAKI Takahiro wrote: Signals and locking, borrowed from Postgres-R, are now being studied for this purpose in the log shipping, Cool. Let me know if you have any questions WRT this imessages stuff. If you're sure it's all right, I have a trivial question. Well, I know it works for me and I think it could work for you, too. That's all I'm saying. Which signal should we use for the notification to the backend from the WAL sender? The notable signals are already used. I'm using SIGUSR1, see src/backend/storage/ipc/imsg.c from Postgres-R, line 232. That isn't in use by backends or the postmaster, AFAIK. Or, since a backend doesn't need to wait on select(), unlike the WAL sender, ISTM that it's not so inconvenient to use a semaphore for that notification. They probably could, but not the WAL sender. What's the benefit of semaphores? It seems pretty ugly to set up a semaphore, lock that on the WAL sender, then claim it on the backend to wait for it, and then release it on the WAL sender to notify the backend. If all you want to do is to signal the backend, why not use signals ;-) But maybe I'm missing something? Regards Markus Wanner
Re: [HACKERS] using hash index when BETWEEN is specified
Hannu Krosing wrote: On Wed, 2008-09-10 at 07:13 -0400, Robert Haas wrote: I'm no planner guru, but it seems to me that a BETWEEN clause could be rewritten as an IN clause for integer data types and a small interval. Where should the line be drawn? Define small :) When the estimated cost is lower? You still need to draw a line for when to even try estimating the cost. Will this be an interval of 10? or 100? or 1? I think it depends on the ratio of distinct integer values in the table to the width of the requested interval, as well as the value distribution and the total number of rows. For example, if you have 10 distinct values and each occurs 100 times, then a full scan is better (for BETWEEN 1 AND 5). But if each value occurs only 10 times, then using a hash index should be effective. Zdenek
Re: [HACKERS] Base64 decode/encode performance
On 9/10/08, Gaetano Mendola [EMAIL PROTECTED] wrote: I have been experimenting with some base64 encoding/decoding implementations. I found that the one at http://code.google.com/p/stringencoders is the best, obtaining a 1.3x speedup vs the postgres one. Do you think it is worth submitting a patch that replaces the postgres base64 implementation with this one? (Note: the b64encode there reads 3 chars at a time, b64decode an int32 at a time.) There are 2 killer problems: - decode does not seem to handle architectures that segfault on unaligned int32 accesses. - decode does not allow whitespace in the input string. If those are fixed, it's a question of whether the 1.3x speedup is worth more complex code with big lookup tables. If you want to optimize, it seems more worthwhile to add an additional loop to the current code that reads 3 or 4 chars at a time, before the current single-char loop. The decode loop may thus even optimize to int32 fetching on x86/64 with a reasonable compiler. Handling whitespace with such code is doable, but will the code be clear enough? -- marko
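The shape Marko suggests -- consume four base64 characters at a time through a 32-bit accumulator, while still tolerating whitespace -- can be sketched as follows. This is an illustrative decoder with invented names, not the PostgreSQL code or the stringencoders code; padding handling is simplified.

```c
/* Sketch of a base64 decoder that packs four 6-bit values into a
 * 32-bit accumulator and emits three bytes at a time, skipping
 * whitespace per character.  Hypothetical helper names. */
#include <stdint.h>
#include <string.h>

static int8_t b64lookup[256];

static void init_b64lookup(void)
{
    const char *alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    memset(b64lookup, -1, sizeof(b64lookup));
    for (int i = 0; i < 64; i++)
        b64lookup[(unsigned char) alphabet[i]] = (int8_t) i;
}

/* Decode src into dst; returns bytes written, or -1 on invalid input. */
static int b64decode_sketch(const char *src, int len, uint8_t *dst)
{
    uint8_t    *d = dst;
    uint32_t    buf = 0;
    int         pos = 0;

    for (int i = 0; i < len; i++)
    {
        unsigned char c = (unsigned char) src[i];

        if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
            continue;           /* whitespace is allowed and skipped */
        if (c == '=')
            break;              /* padding: stop */
        if (b64lookup[c] < 0)
            return -1;
        buf = (buf << 6) | (uint32_t) b64lookup[c];
        if (++pos == 4)
        {
            *d++ = (uint8_t) (buf >> 16);   /* emit 3 bytes per 4 chars */
            *d++ = (uint8_t) (buf >> 8);
            *d++ = (uint8_t) buf;
            pos = 0;
            buf = 0;
        }
    }
    /* tail: 2 or 3 leftover chars encode 1 or 2 bytes */
    if (pos == 2)
        *d++ = (uint8_t) (buf >> 4);
    else if (pos == 3)
    {
        *d++ = (uint8_t) (buf >> 10);
        *d++ = (uint8_t) (buf >> 2);
    }
    else if (pos == 1)
        return -1;              /* a single leftover char is invalid */
    return (int) (d - dst);
}
```

The accumulator loop is the part a compiler can turn into wide fetches on x86/64; the per-character whitespace test is what keeps the code compatible with the current PostgreSQL behavior.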
Re: [HACKERS] [PATCH] Cleanup of GUC units code
Robert Haas wrote: bits...bytes...blocks...m...M I can't imagine that taking away the B is somehow going to be more clear. If clarity is the goal, I'd want the following: a) Verbosely spelling out the units in the default config file: temp_buffers = 16 megabytes or temp_buffers = 16 milliblocks :-) Naive users who favor cut-and-paste will use the verbose words, which should leave little room for confusion. Power users who know the short forms from the docs will presumably have read the descriptions. b) Having SHOW display verbosely spelled-out units: db=# show temp_buffers; temp_buffers -- 1600 bytes (1 row) c) Having SET emit a NOTICE with the verbose word for the units: db=# set temp_buffers = '16mb'; NOTICE: setting temp_buffers to 1600 bytes
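Accepting both the short and the verbose unit spellings is a small parsing exercise. The sketch below is a hypothetical helper (not the actual GUC code, whose unit table and rounding rules differ) showing how a setting string could be converted to bytes while accepting either form:

```c
/* Hypothetical parser accepting both "16MB" and "16 megabytes".
 * Not the real GUC units code; error handling is omitted. */
#include <stdlib.h>
#include <string.h>

static long parse_size_setting(const char *s)
{
    char   *end;
    long    n = strtol(s, &end, 10);

    while (*end == ' ')
        end++;                              /* allow "16 megabytes" */
    if (strcmp(end, "kB") == 0 || strcmp(end, "kilobytes") == 0)
        return n * 1024L;
    if (strcmp(end, "MB") == 0 || strcmp(end, "megabytes") == 0)
        return n * 1024L * 1024L;
    if (strcmp(end, "GB") == 0 || strcmp(end, "gigabytes") == 0)
        return n * 1024L * 1024L * 1024L;
    return n;                               /* bare number: no unit */
}
```

The point of the verbose spellings is exactly that the cut-and-paste case above parses unambiguously; the short forms remain available for users who know them.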
Re: [HACKERS] Base64 decode/encode performance
Marko Kreen wrote: (Note: the b64encode there reads 3 chars at a time, b64decode int32 at a time.) There are 2 killer problems: - decode does not seem to handle architectures that segfault on unaligned int32 accesses. Out of curiosity - does this problem exist on any platform for which PostgreSQL is currently ported and supported? Cheers, mark -- Mark Mielke [EMAIL PROTECTED]
Re: [HACKERS] New FSM patch
Heikki Linnakangas wrote: Here's an updated FSM patch. Changes since last patch: Yesterday, I started reviewing your patch. To begin with, I have some general questions: 1) If I understand correctly, the main goal is to improve the FSM to cover all pages in the file, which is useful for huge databases. 2) Did you perform any benchmarks? Is there any performance improvement or penalty? 3) How does it work when the database has many active parallel connections? Zdenek
Re: [HACKERS] Base64 decode/encode performance
On Wed, Sep 10, 2008 at 10:44:00AM -0400, Mark Mielke wrote: Marko Kreen wrote: [...] - decode does not seem to handle architectures that segfault on unaligned int32 accesses. Out of curiosity - does this problem exist on any platform for which PostgreSQL is currently ported and supported? HP PA is one of them. Besides, some others hide that behind a painful (more than a factor of 1.3 -- we are talking software emulation here) performance hit. More modern architectures (Alpha, Itanium/IA-64) take this route. Regards -- tomás
Re: [HACKERS] Base64 decode/encode performance
On Wed, Sep 10, 2008 at 10:44:00AM -0400, Mark Mielke wrote: There are 2 killer problems: - decode does not seem to handle architectures that segfault on unaligned int32 accesses. Out of curiosity - does this problem exist on any platform for which PostgreSQL is currently ported and supported? It actually exists on most CPUs, like Alpha/SPARC/MIPS -- just not on Intel chips, which is why you don't see the failures very often. Unaligned accesses do take twice as long to execute, though, even on Intel chips. On some OSes the unaligned access is trapped and emulated by the OS, which doesn't do much for performance. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines.
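The standard portable fix for the unaligned-access problem discussed in this subthread is to copy the bytes through memcpy instead of casting an arbitrary char pointer to uint32*. A minimal sketch (the helper name is invented):

```c
/* Reading a 32-bit word from an arbitrarily aligned buffer.  Casting
 * the pointer to uint32* is undefined behavior and a SIGBUS on
 * strict-alignment CPUs (SPARC, HP PA, ...); copying through memcpy
 * is safe everywhere, and x86/x86-64 compilers turn it into a single
 * load anyway. */
#include <stdint.h>
#include <string.h>

static uint32_t fetch32(const char *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));   /* safe at any alignment */
    return v;
}
```

This is why the "int32 at a time" decode loop can be made portable without per-platform #ifdefs: the alignment hazard is confined to one well-defined access pattern.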
Re: [HACKERS] New FSM patch
Zdenek Kotala wrote: Yesterday, I started to reviewing your patch. Thanks! 1) If I understand correctly the main goal is to improve FSM to cover all pages in file which is useful for huge database. That's not a goal per se, though it's true that the new FSM does cover all pages. The goals are to: - eliminate max_fsm_pages and max_fsm_relations GUC variables, so that there's one thing less to configure - make the FSM immediately available and useful after recovery (eg. warm standby) - make it possible to retail update the FSM, which will be needed for partial vacuum 2) Did you perform any benchmark? Is there any performance improvement or penalty? Working on it.. I've benchmarked some bulk-insertion scenarios, and the new FSM is now comparable to the current implementation on those tests. See the o I've also been working on a low level benchmark using a C user-defined function that exercises just the FSM, showing the very raw CPU performance vs. current implementation. More on that later, but ATM it looks like the new implementation can be faster or slower than the current one, depending on the table size. The biggest potential performance issue, however, is the fact that the new FSM implementation is WAL-logged. That shows up dramatically in the raw test where there's no other activity than FSM lookups and updates, but will be much less interesting in real life where FSM lookups are always related to some other updates which are WAL-logged anyway. I also ran some DBT-2 tests without think times, with a small number of warehouses. But the results of that had such a high variability from test to test, that any difference in FSM speed would've been lost in the noise. Do you still have the iGen setup available? Want to give it a shot? 3) How it works when database has many active parallel connections? The new FSM should in principle scale better than the old one. 
However, Simon raised a worry about the WAL-logging: WALInsertLock can already become the bottleneck in OLTP scenarios with very high load and many CPUs. The FSM isn't any worse than other actions that generate WAL, but naturally if you're bottlenecked by the WAL lock or bandwidth, any increase in WAL traffic will show up as an overall performance loss. I'm not too worried about that, myself, because in typical scenarios the extra WAL traffic generated by the FSM should be insignificant in volume compared to all the other WAL traffic. But Simon will probably demand some hard evidence of that ;-). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [PATCH] Cleanup of GUC units code
a) Verbosely spelling out the units in the default config file: temp_buffers = 16 megabytes or temp_buffers = 16 milliblocks :-) Naive users who favor cut-and-paste will use the verbose words, which should leave little room for confusion. Power users who know the short forms from the docs will presumably have read the descriptions. I think it would make a lot of sense to encourage adding the word buffers or blocks when that is the unit in question. This is all religion at this point, but I find it difficult to believe that there is any real need to spell out megabytes. ...Robert
[HACKERS] Interesting glitch in autovacuum
I observed a curious bug in autovac just now. Since plain vacuum avoids calling GetTransactionSnapshot, an autovac worker that happens not to analyze any tables will never call GetTransactionSnapshot at all. This means it will arrive at vac_update_datfrozenxid with RecentGlobalXmin never having been changed from its boot value of FirstNormalTransactionId, which means that it will fail to update the database's datfrozenxid ... or, if the current value of datfrozenxid is past 2 billion, that it will improperly advance datfrozenxid to sometime in the future. Once you get into this state in a reasonably idle database such as template1, autovac is completely dead in the water: if it thinks template1 needs to be vacuumed for wraparound, then every subsequent worker will be launched at template1, every one will fail to advance its datfrozenxid, rinse and repeat. Even before that happens, the DB's datfrozenxid will prevent clog truncation, which might explain some of the recent complaints. I've only directly tested this in HEAD, but I suspect the problem goes back a ways. On reflection I'm not even sure that this is strictly an autovacuum bug. It can be cast more generically as RecentGlobalXmin getting used without ever having been set, and it sure looks to me like the HOT patch may have introduced a few risks of that sort. I'm thinking that maybe an appropriate fix is to insert a GetTransactionSnapshot call at the beginning of InitPostgres' transaction, thus ensuring that every backend has some vaguely sane value for RecentGlobalXmin before it tries to do any database access. Another thought is that even with that, an autovac worker is likely to reach vac_update_datfrozenxid with a RecentGlobalXmin value that was computed at the start of its run, and is thus rather old. I wonder why vac_update_datfrozenxid is using the variable at all rather than doing GetOldestXmin? It's not like that function is so performance-critical that it needs to avoid calling GetOldestXmin. 
Lastly, now that we have the PROC_IN_VACUUM test in GetSnapshotData, is it actually necessary for lazy vacuum to avoid setting a snapshot? It seems like it might be a good idea for it to do so in order to keep its RecentGlobalXmin reasonably current. I've only looked at this in HEAD, but I am thinking that we have a real problem here in both HEAD and 8.3. I'm less sure how bad things are in the older branches. Comments? regards, tom lane
Re: [HACKERS] Keeping creation time of objects
On Tue, Sep 09, 2008 at 10:50:57PM -0400, Robert Haas wrote: When people aren't keeping track of their DDL, that is very strictly a process problem on their end. When people are shooting themselves in the foot, it's a great disservice to market Kevlar shoes to them. I can't believe anyone is going to stop tracking their DDL because, ooh goody, now we have pg_class.creation_time. They will look at it and say either oh, this is nice or oh, this is useless and go on about their business. I can easily believe that a feature like this might help them make the decision not to start, out of a false sense of security. I try pretty hard not to shoot myself in the foot. But if someone comes up to me and offers me some shoes that have the same cost, Not the same. This is extra code, so it will provide both new places for bugs and extra maintenance costs. appearance, Clearly not the same. comfort-level, False comfort is bad. Putting an anesthetic instead of support in a shoe billed as orthopedic may make customers comfortable, but when they continue to damage their foot with it, it's not a feature. and durability as regular shoes, but are slightly more bullet resistant, should I refuse them on principle? See above. Why? See above. Cheers, David. -- David Fetter [EMAIL PROTECTED] http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: [EMAIL PROTECTED] Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] using hash index when BETWEEN is specified
Zdenek Kotala [EMAIL PROTECTED] writes: I think it depends on the ratio of distinct integer values in the table to the width of the requested interval, as well as the value distribution and the total number of rows. For example, if you have 10 distinct values and each occurs 100 times, then a full scan is better (for BETWEEN 1 AND 5). But if each value occurs only 10 times, then using a hash index should be effective. I think this discussion is a complete waste of time. Hash indexes don't win against btrees for single indexscans currently. Even if that ever gets fixed, it's highly unlikely that they'd win for N separate indexscans versus 1 indexscan, which is what a query rewrite of this sort would produce. Remember that the btree will have the desired range of keys stored adjacently, whereas in a hash they are almost certainly in distinct buckets, and likely not even close-together buckets if the hash function is doing its job well. So you really are talking about a factor of N both in indexscan setup overhead and in I/O costs. regards, tom lane
Re: [HACKERS] Common Table Expressions (WITH RECURSIVE) patch
On Tue, 2008-09-09 at 09:47 -0400, Robert Haas wrote: 3. I think this is a must fix because of the point about volatile functions --- changing it later will result in user-visible semantics changes, so we have to get it right the first time. I don't entirely agree with #3. It is user-visible, but only in the sense that someone is depending on undocumented multiple-evaluation behavior. What makes you think it's going to be undocumented? Single versus multiple evaluation is a key aspect of this feature and certainly needs to be documented one way or the other. I can't understand why we would introduce a standard syntax with non-standard behavior, but if we do, it certainly had better be mentioned in the documentation. I meant that -- hypothetically, if we did accept the feature as-is -- the number of evaluations would be documented to be undefined, not N. That would avoid the backwards-compatibility problem. This one point is probably not worth discussing now, because arguments #1 and #2 stand on their own. Regards, Jeff Davis
Re: [HACKERS] Interesting glitch in autovacuum
Tom Lane wrote: I observed a curious bug in autovac just now. Since plain vacuum avoids calling GetTransactionSnapshot, an autovac worker that happens not to analyze any tables will never call GetTransactionSnapshot at all. This means it will arrive at vac_update_datfrozenxid with RecentGlobalXmin never having been changed from its boot value of FirstNormalTransactionId, which means that it will fail to update the database's datfrozenxid ... or, if the current value of datfrozenxid is past 2 billion, that it will improperly advance datfrozenxid to sometime in the future. Ouch :-( I've only directly tested this in HEAD, but I suspect the problem goes back a ways. Well, this logic was introduced in 8.2; I'm not sure if there's a problem in 8.1, but I don't think so. On reflection I'm not even sure that this is strictly an autovacuum bug. It can be cast more generically as RecentGlobalXmin getting used without ever having been set, and it sure looks to me like the HOT patch may have introduced a few risks of that sort. Agreed. Maybe we should boot RecentGlobalXmin with InvalidOid, and ensure where it's going to be used that it's not that. I'm thinking that maybe an appropriate fix is to insert a GetTransactionSnapshot call at the beginning of InitPostgres' transaction, thus ensuring that every backend has some vaguely sane value for RecentGlobalXmin before it tries to do any database access. AFAIR there's an initial transaction in InitPostgres or something like that. Since it goes away quickly, it'd be a good place to ensure the snapshot does not last much longer. Another thought is that even with that, an autovac worker is likely to reach vac_update_datfrozenxid with a RecentGlobalXmin value that was computed at the start of its run, and is thus rather old. I wonder why vac_update_datfrozenxid is using the variable at all rather than doing GetOldestXmin? It's not like that function is so performance-critical that it needs to avoid calling GetOldestXmin. 
The function is called only once per autovacuum iteration, and once in manually-invoked vacuum, so certainly it's not performance-critical. Lastly, now that we have the PROC_IN_VACUUM test in GetSnapshotData, is it actually necessary for lazy vacuum to avoid setting a snapshot? It seems like it might be a good idea for it to do so in order to keep its RecentGlobalXmin reasonably current. Hmm, I think I'd rather be inclined to get a snapshot just when it's going to finish. That way, RecentGlobalXmin will be up to date even if the I've only looked at this in HEAD, but I am thinking that we have a real problem here in both HEAD and 8.3. I'm less sure how bad things are in the older branches. 8.2 does contain the vac_update_datfrozenxid problem at the very least. Older versions do not have that logic, so they are probably safe. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Interesting glitch in autovacuum
Alvaro Herrera [EMAIL PROTECTED] writes: Tom Lane wrote: I observed a curious bug in autovac just now. ... Maybe we should boot RecentGlobalXmin with InvalidOid, and ensure where it's going to be used that it's not that. Good idea --- an Assert at the use sites should be sufficient. Lastly, now that we have the PROC_IN_VACUUM test in GetSnapshotData, is it actually necessary for lazy vacuum to avoid setting a snapshot? It seems like it might be a good idea for it to do so in order to keep its RecentGlobalXmin reasonably current. Hmm, I think I'd rather be inclined to get a snapshot just when it's going to finish. I'm worried about keeping RecentGlobalXmin up to date during the vacuums, not only at the end, because it will be used for HOT page pruning during the vacuums. regards, tom lane
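The sentinel-plus-Assert idea agreed on above -- boot the variable with an invalid value and assert at every use site that it has been set -- can be illustrated with a toy. All names here are invented for the sketch; the real variable is a TransactionId inside the backend, and the real sentinel would be InvalidTransactionId.

```c
/* Toy illustration of use-before-set detection via a sentinel boot
 * value plus an assertion at each use site.  Hypothetical names. */
#include <assert.h>
#include <stdint.h>

#define INVALID_XID  ((uint32_t) 0)

static uint32_t recent_global_xmin = INVALID_XID;   /* boot value */

static void set_recent_global_xmin(uint32_t xmin)
{
    recent_global_xmin = xmin;  /* done when a snapshot is computed */
}

static uint32_t use_recent_global_xmin(void)
{
    /* catches any code path that reads the value before computing it */
    assert(recent_global_xmin != INVALID_XID);
    return recent_global_xmin;
}
```

With the boot value of FirstNormalTransactionId described in the bug report, a use-before-set silently produced a wrong-but-plausible value; with a sentinel, the same mistake trips the Assert in a debug build.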
Re: [HACKERS] Interesting glitch in autovacuum
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Tom Lane wrote: Lastly, now that we have the PROC_IN_VACUUM test in GetSnapshotData, is it actually necessary for lazy vacuum to avoid setting a snapshot? It seems like it might be a good idea for it to do so in order to keep its RecentGlobalXmin reasonably current. Hmm, I think I'd rather be inclined to get a snapshot just when it's going to finish. I'm worried about keeping RecentGlobalXmin up to date during the vacuums, not only at the end, because it will be used for HOT page pruning during the vacuums. Oh, I see. I didn't know we were doing HOT pruning during vacuum; does it make sense? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] Interesting glitch in autovacuum
Alvaro Herrera [EMAIL PROTECTED] writes: Tom Lane wrote: I'm worried about keeping RecentGlobalXmin up to date during the vacuums, not only at the end, because it will be used for HOT page pruning during the vacuums. Oh, I see. I didn't know we were doing HOT pruning during vacuum; does it make sense? Sorry, I got a bit confused there. The vacuum's intentional pruning will use its own OldestXmin variable, which is known current at the start of the vacuum (and I think there were even proposals to advance it more frequently than that). However, a vacuum could require some incidental system catalog fetches, which I think could result in prune operations based on RecentGlobalXmin on the catalog pages (cf index_getnext). So it's probably not too terribly important ... as long as an autovac worker doesn't live long enough that its RecentGlobalXmin threatens to wrap around ... but I'm still interested in it as a code cleanup measure. Skipping the transaction snapshot fetch was a performance kluge, and if we don't need it any more I'd like to get rid of that distinction between full and lazy vacuum behavior. Anyway I think we are on the same page about the rest of the issues. Did you want to work on fixing them, or shall I? regards, tom lane
Re: [HACKERS] Interesting glitch in autovacuum
Alvaro Herrera wrote: I didn't know we were doing HOT pruning during vacuum; does it make sense? Removing HOT-updated, dead tuples is a bit complicated. It needs to be done one HOT-chain at a time, and the root line pointer needs to be updated to the next live tuple in the chain. lazy_scan_heap() calls the regular pruning function heap_page_prune() to deal with those before doing the normal scan of line pointers. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Synchronous Log Shipping Replication
On Wed, 2008-09-10 at 17:57 +0900, Fujii Masao wrote: My sequence covers several cases: * There is no missing WAL file. * There are a lot of missing WAL files. This is the likely case for any medium+ sized database. * There are missing history files. Failover always creates a gap in the history files because the TLI is incremented when archive recovery is completed. Yes, but failover doesn't happen while we are configuring replication, it can only happen after we have configured replication. It would be theoretically possible to take a copy from one server and then try to synchronise with a 3rd copy of the same server, but that seems perverse and bug prone. So I advise that we only allow replication when the timeline of the standby matches the timeline of the master, having it as an explicit check. In your design, doesn't the initial setup block the master? Does your design cover the above-mentioned cases? The way I described it does not block the master. It does defer the point at which we can start using synchronous replication, so perhaps that is your objection. I think it is acceptable: good food takes time to cook. I have thought about the approach you've outlined, though it seems to me now like a performance optimisation rather than something we must have. IMHO it will be confusing to be transferring both old and new data at the same time from master to slave. We will have two different processes sending and two different processes receiving. You'll need to work through about four times as many failure modes, all of which will need testing. Diagnosing problems in it via the log hurts my head just thinking about it. ISTM that will severely impact the initial robustness of the software for this feature. Perhaps in time it is the right way. Anyway, it feels like we're getting close to some good designs. There isn't much difference between what we're discussing here.
-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Planner question
Tom Lane wrote: Tom Raney [EMAIL PROTECTED] writes: Why does the planner consider both input variations of each symmetric merge join? The README says there is not a lot of difference between the two options. When are there any differences? The righthand side needs to support mark/restore, the left doesn't; so depending on plan types one way might need a helper Materialize node that the other way doesn't. Also, duplicated values are a bit cheaper to process on the left than the right. regards, tom lane Thank you for the explanation. On a somewhat related issue, I am a bit stumped on the way path keys function. In the following query and debug data, why would an index scan on a single relation contain a path key from a different relation? optimizer/README says, The PathKeys data structure represents what is known about the sort order of the tuples generated by a particular Path. A path's pathkeys field is a list of PathKey nodes, where the n'th item represents the n'th sort key of the result. Why does the index scan for tenk1 include a path key from onek.unique2? Is it implying an equivalence there? 
-Tom Raney

bench=# explain select * from tenk1 JOIN onek ON tenk1.unique2=onek.unique2;
RELOPTINFO (tenk1): rows=1 width=244
  path list:
    SeqScan(tenk1) rows=1 cost=0.00..434.00
    IdxScan(tenk1) rows=1 cost=0.00..583.25
      pathkeys: ((tenk1.unique2, onek.unique2))
  cheapest startup path: SeqScan(tenk1) rows=1 cost=0.00..434.00
  cheapest total path: SeqScan(tenk1) rows=1 cost=0.00..434.00
RELOPTINFO (onek): rows=1000 width=244
  path list:
    SeqScan(onek) rows=1000 cost=0.00..44.00
    IdxScan(onek) rows=1000 cost=0.00..72.25
      pathkeys: ((tenk1.unique2, onek.unique2))
  cheapest startup path: SeqScan(onek) rows=1000 cost=0.00..44.00
  cheapest total path: SeqScan(onek) rows=1000 cost=0.00..44.00
RELOPTINFO (tenk1/onek): rows=1000 width=488
  path list:
    MergeJoin(tenk1/onek) rows=1000 cost=0.52..144.24
      clauses: tenk1.unique2 = onek.unique2
      IdxScan(tenk1) rows=1 cost=0.00..583.25
        pathkeys: ((tenk1.unique2, onek.unique2))
      IdxScan(onek) rows=1000 cost=0.00..72.25
        pathkeys: ((tenk1.unique2, onek.unique2))
    NestLoop(tenk1/onek) rows=1000 cost=0.00..1756.96
      clauses: tenk1.unique2 = onek.unique2
      SeqScan(onek) rows=1000 cost=0.00..44.00
      IdxScan(tenk1) rows=1 cost=0.00..1.70
  cheapest startup path:
    NestLoop(tenk1/onek) rows=1000 cost=0.00..1756.96
      clauses: tenk1.unique2 = onek.unique2
      SeqScan(onek) rows=1000 cost=0.00..44.00
      IdxScan(tenk1) rows=1 cost=0.00..1.70
  cheapest total path:
    MergeJoin(tenk1/onek) rows=1000 cost=0.52..144.24
      clauses: tenk1.unique2 = onek.unique2
      IdxScan(tenk1) rows=1 cost=0.00..583.25
        pathkeys: ((tenk1.unique2, onek.unique2))
      IdxScan(onek) rows=1000 cost=0.00..72.25
        pathkeys: ((tenk1.unique2, onek.unique2))

                                   QUERY PLAN
------------------------------------------------------------------------------------
 Merge Join  (cost=0.52..144.24 rows=1000 width=488)
   Merge Cond: (tenk1.unique2 = onek.unique2)
   ->  Index Scan using tenk1_unique2 on tenk1  (cost=0.00..583.25 rows=1 width=244)
   ->  Index Scan using onek_unique2 on onek  (cost=0.00..72.25 rows=1000 width=244)
(4 rows)

bench=#
Re: [HACKERS] Common Table Expressions (WITH RECURSIVE) patch
I meant that -- hypothetically if we did accept the feature as-is -- the number of evaluations would be documented to be undefined, not N. That would avoid the backwards-compatibility problem. This one point is probably not worth discussing now, because argument #1 and #2 stand on their own. Agreed. Plus, both Tom and Pavel seem to think this is a relatively solvable problem. ...Robert
Re: [HACKERS] Planner question
Tom Raney [EMAIL PROTECTED] writes: Why does the index scan for tenk1 include a path key from onek.unique2? Is it implying an equivalence there? bench=# explain select * from tenk1 JOIN onek ON tenk1.unique2=onek.unique2; Yes, for an example like that the planner knows that tenk1.unique2 and onek.unique2 will have equal values in any valid join row, so it's okay to suppose that a sort by one is the same as a sort by the other. So the pathkey items actually reference sets of variables {tenk1.unique2, onek.unique2} not just individual variables. RELOPTINFO (tenk1): rows=1 width=244 path list: SeqScan(tenk1) rows=1 cost=0.00..434.00 IdxScan(tenk1) rows=1 cost=0.00..583.25 pathkeys: ((tenk1.unique2, onek.unique2)) --- cheapest startup path: SeqScan(tenk1) rows=1 cost=0.00..434.00 cheapest total path: SeqScan(tenk1) rows=1 cost=0.00..434.00 Hm, I don't recognize this output format, is it coming from some custom code? regards, tom lane
Re: [HACKERS] New FSM patch
Heikki Linnakangas wrote: Zdenek Kotala wrote: Do you still have the iGen setup available? Want to give it a shot? Not sure if I have it configured; I need to check. But I'll try it, or I'll ask Jignesh or Paul if they have free time. They are real benchmark gurus. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
Re: [HACKERS] Interesting glitch in autovacuum
On Wed, 2008-09-10 at 11:52 -0400, Tom Lane wrote: I'm thinking that maybe an appropriate fix is to insert a GetTransactionSnapshot call at the beginning of InitPostgres' transaction, thus ensuring that every backend has some vaguely sane value for RecentGlobalXmin before it tries to do any database access. Can't we just set RecentGlobalXmin without getting a Snapshot? There's no need for a snapshot at that point is there? Just lock ProcArrayLock, read GlobalXmin, drop lock. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Interesting glitch in autovacuum
Simon Riggs [EMAIL PROTECTED] writes: On Wed, 2008-09-10 at 11:52 -0400, Tom Lane wrote: I'm thinking that maybe an appropriate fix is to insert a GetTransactionSnapshot call at the beginning of InitPostgres' transaction, thus ensuring that every backend has some vaguely sane value for RecentGlobalXmin before it tries to do any database access. Can't we just set RecentGlobalXmin without getting a Snapshot? Well, certainly building an MVCC snapshot is more than we (appear to) need, but I'm of the opinion that what we have got here is an unexpected way in which these specialized transactions are unlike all others. I think the safest fix is to make them more like normal transactions, rather than invent still other ways to do things. If there were a serious performance argument against that, then yeah, but I don't see one. Backend startup isn't cheap in any case, and neither is a VACUUM, so the incremental cost involved here seems negligible. There's no need for a snapshot at that point, is there? Just lock ProcArrayLock, read GlobalXmin, drop lock. There's no GlobalXmin variable. We'd have to use GetOldestXmin(), which is cheaper than GetSnapshotData, but not hugely so. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Planner question
Tom Lane wrote: Tom Raney [EMAIL PROTECTED] writes: Why does the index scan for tenk1 include a path key from onek.unique2? Is it implying an equivalence there? bench=# explain select * from tenk1 JOIN onek ON tenk1.unique2=onek.unique2; Yes, for an example like that the planner knows that tenk1.unique2 and onek.unique2 will have equal values in any valid join row, so it's okay to suppose that a sort by one is the same as a sort by the other. So the pathkey items actually reference sets of variables {tenk1.unique2, onek.unique2} not just individual variables. Thanks. RELOPTINFO (tenk1): rows=1 width=244 path list: SeqScan(tenk1) rows=1 cost=0.00..434.00 IdxScan(tenk1) rows=1 cost=0.00..583.25 pathkeys: ((tenk1.unique2, onek.unique2)) --- cheapest startup path: SeqScan(tenk1) rows=1 cost=0.00..434.00 cheapest total path: SeqScan(tenk1) rows=1 cost=0.00..434.00 Hm, I don't recognize this output format, is it coming from some custom code? Yes, it is. I thought it was easier to read the OPTIMIZER_DEBUG output with the relation names instead of the relation indexes. I will post a patch against CVS HEAD if you think it will help others. -Tom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Interesting glitch in autovacuum
Tom Lane wrote: Sorry, I got a bit confused there. The vacuum's intentional pruning will use its own OldestXmin variable, which is known current at the start of the vacuum (and I think there were even proposals to advance it more frequently than that). However, a vacuum could require some incidental system catalog fetches, which I think could result in prune operations based on RecentGlobalXmin on the catalog pages (cf index_getnext). Hmm, right, and what Heikki said too. Anyway I think we are on the same page about the rest of the issues. Did you want to work on fixing them, or shall I? Is this more or less what you had in mind? -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support Index: src/backend/access/heap/heapam.c === RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/access/heap/heapam.c,v retrieving revision 1.262 diff -c -p -r1.262 heapam.c *** src/backend/access/heap/heapam.c 11 Aug 2008 11:05:10 - 1.262 --- src/backend/access/heap/heapam.c 10 Sep 2008 19:50:36 - *** heapgetpage(HeapScanDesc scan, BlockNumb *** 219,224 --- 219,225 /* * Prune and repair fragmentation for the whole page, if possible. 
*/ + Assert(TransactionIdIsValid(RecentGlobalXmin)); heap_page_prune_opt(scan-rs_rd, buffer, RecentGlobalXmin); /* *** heap_hot_search_buffer(ItemPointer tid, *** 1469,1474 --- 1470,1477 if (all_dead) *all_dead = true; + Assert(TransactionIdIsValid(RecentGlobalXmin)); + Assert(ItemPointerGetBlockNumber(tid) == BufferGetBlockNumber(buffer)); offnum = ItemPointerGetOffsetNumber(tid); at_chain_start = true; Index: src/backend/access/index/indexam.c === RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/access/index/indexam.c,v retrieving revision 1.109 diff -c -p -r1.109 indexam.c *** src/backend/access/index/indexam.c 19 Jun 2008 00:46:03 - 1.109 --- src/backend/access/index/indexam.c 10 Sep 2008 19:51:38 - *** index_getnext(IndexScanDesc scan, ScanDi *** 419,424 --- 419,426 SCAN_CHECKS; GET_SCAN_PROCEDURE(amgettuple); + Assert(TransactionIdIsValid(RecentGlobalXmin)); + /* * We always reset xs_hot_dead; if we are here then either we are just * starting the scan, or we previously returned a visible tuple, and in Index: src/backend/commands/vacuum.c === RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/commands/vacuum.c,v retrieving revision 1.376 diff -c -p -r1.376 vacuum.c *** src/backend/commands/vacuum.c 13 Aug 2008 00:07:50 - 1.376 --- src/backend/commands/vacuum.c 10 Sep 2008 20:17:32 - *** vac_update_datfrozenxid(void) *** 790,803 bool dirty = false; /* ! * Initialize the min calculation with RecentGlobalXmin. Any ! * not-yet-committed pg_class entries for new tables must have ! * relfrozenxid at least this high, because any other open xact must have ! * RecentXmin = its PGPROC.xmin = our RecentGlobalXmin; see ! * AddNewRelationTuple(). So we cannot produce a wrong minimum by ! * starting with this. */ ! newFrozenXid = RecentGlobalXmin; /* * We must seqscan pg_class to find the minimum Xid, because there is no --- 790,801 bool dirty = false; /* ! * Initialize the min calculation with GetOldestXmin, which is a ! 
* reasonable approximation to the minimum relfrozenxid for not-yet- ! * committed pg_class entries for new tables; see AddNewRelationTuple(). ! * Se we cannot produce a wrong minimum by starting with this. */ ! newFrozenXid = GetOldestXmin(true, true); /* * We must seqscan pg_class to find the minimum Xid, because there is no *** vacuum_rel(Oid relid, VacuumStmt *vacstm *** 990,1007 /* Begin a transaction for vacuuming this relation */ StartTransactionCommand(); ! if (vacstmt-full) ! { ! /* functions in indexes may want a snapshot set */ ! PushActiveSnapshot(GetTransactionSnapshot()); ! } ! else { /* ! * During a lazy VACUUM we do not run any user-supplied functions, and ! * so it should be safe to not create a transaction snapshot. ! * ! * We can furthermore set the PROC_IN_VACUUM flag, which lets other * concurrent VACUUMs know that they can ignore this one while * determining their OldestXmin. (The reason we don't set it during a * full VACUUM is exactly that we may have to run user- defined --- 988,1003 /* Begin a transaction for vacuuming this relation */ StartTransactionCommand(); ! /* ! * Functions in indexes may want a snapshot set. Also, setting ! * a snapshot ensures that RecentGlobalXmin is kept truly recent. ! */ !
Re: [HACKERS] Potential Join Performance Issue
On Tue, 2008-09-09 at 11:21 -0700, Lawrence, Ramon wrote: Our research group has been using the PostgreSQL code base to test new join algorithms. Sounds cool. I'm sure you'll come up with some good things. You might be interested in this also http://archives.postgresql.org/pgsql-hackers/2007-01/msg01600.php after which Greg Stark and I were investigating using alternate/compressed data structures to avoid the need to switch to multi-batch hash joins. If we knew we were dealing with nearly contiguous ranges of discrete values, we could store the missing values rather than the present values using an HRL encoded bitmap. Other ideas are possible also, I'm sure. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Interesting glitch in autovacuum
On Wed, 2008-09-10 at 16:11 -0400, Tom Lane wrote: If there were a serious performance argument against that, then yeah, but I don't see one. Maybe! Just finishing another patch, then I'll be starting Hot Standby discussions, so we'll see. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Interesting glitch in autovacuum
Alvaro Herrera [EMAIL PROTECTED] writes: Tom Lane wrote: Anyway I think we are on the same page about the rest of the issues. Did you want to work on fixing them, or shall I? Is this more or less what you had in mind? Yeah, this looks like exactly what I had in mind for HEAD. I'm not too sure about the back branches though. I think we could apply all of it to 8.3, but further back is going to require a separate investigation for each branch. Will you take that on? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Interesting glitch in autovacuum
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Tom Lane wrote: Anyway I think we are on the same page about the rest of the issues. Did you want to work on fixing them, or shall I? Is this more or less what you had in mind? Yeah, this looks like exactly what I had in mind for HEAD. I'm not too sure about the back branches though. I think we could apply all of it to 8.3, but further back is going to require a separate investigation for each branch. Will you take that on? Sure. -- Alvaro Herrera  http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Interesting glitch in autovacuum
I wrote: Yeah, this looks like exactly what I had in mind for HEAD. I'm not too sure about the back branches though. I think we could apply all of it to 8.3, but further back is going to require a separate investigation for each branch. Will you take that on? BTW, I did a quick look at all the uses of RecentGlobalXmin in the back branches, and I think we might be all right before 8.2. The older branches do in fact init RecentGlobalXmin to InvalidTransactionId, and the only thing they use it for is "is this tuple dead to everyone" tests. Since InvalidTransactionId compares older than anything else, the only consequence of not having set it is overly-conservative decisions not to mark tuples killed. So unless you see a problem I missed, I think we only have an issue worth fixing in 8.2 and 8.3. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposed patch: make SQL interval-literal syntax work per spec
Tom Lane [EMAIL PROTECTED] wrote: The patch includes regression test changes that illustrate what it does. I am not sure about some of the corner cases --- anyone want to see if their understanding of the spec for interval string is different? The patch seems to support extensions to the standard.

(1) In the spec, an interval value or literal must be either year-month or day-time. (I guess they didn't want to try to deal with the sticky issues of what it means to have an interval of, for example, seven months and three days -- since an interval has no sense of which seven months.)

(2) The interval qualifier is not optional in the spec.

(3) It seems to me that they were requiring that there be a one-to-one match between the fields specified in the interval qualifier and the fields present in the interval string.

(4) I'm not 100% sure on this one, but it seemed to me that they were requiring year to be four digits and other components (except for fractional seconds) to be two digits.

So long as they are documented, there's nothing wrong with extensions. Nothing I saw suggests that legal interval literals would be misinterpreted. There is still some unfinished business if anyone wants to make it really exactly 100% spec compliant. In particular the spec seems to allow a minus sign *outside* the string literal, I agree. They go so far as to point out how it should be interpreted if the minus is present in both allowed locations (both inside and outside the quotes): 5) The sign in a signed numeric literal or an interval literal is a monadic arithmetic operator. The monadic arithmetic operators + and - specify monadic plus and monadic minus, respectively. If neither monadic plus nor monadic minus are specified in a signed numeric literal or an interval literal, or if monadic plus is specified, then the literal is positive. If monadic minus is specified in a signed numeric literal or interval literal, then the literal is negative.
If sign is specified in both possible locations in an interval literal, then the sign of the literal is determined by normal mathematical interpretation of multiple sign operators. and if I'm reading it right, a precision spec in combination with field restrictions ought to look like INTERVAL '...' DAY TO SECOND(3) not INTERVAL(3) '...' DAY TO SECOND. Agreed. However, for these you'll get a syntax error instead of silently wrong answers if you try to use the other syntax, so it's not quite as pernicious as the matters addressed here. Agreed. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Transaction Snapshots and Hot Standby
Transaction snapshots are probably the most difficult problem for Hot Standby to resolve. We *must* have a transaction snapshot to access table data in a consistent way, or we must accept some inconsistency, though that is not simple either. We can derive transaction snapshots

* remotely from the primary node
* locally on the standby node

If we derive a snapshot locally, then we will end up with a situation where the xmin of the local snapshot precedes the xmin of the primary node. When this occurs it will then be possible for WAL records to arrive on the standby that request removal of rows that a transaction might wish to see. Preventing that situation can be done either by deferring WAL apply or by cancelling queries. We can defer WAL apply for particular tables only, but this could significantly complicate the apply process and is not a suggested option for the first release of this feature. We might control query cancellation by tracking which tables have had rows removed that would have been visible to particular queries. Again, this is possible, but it suffers from the problem that tables on which HOT is frequently active would be almost unusable. So it is not a general solution. Simply ignoring WAL removal has been discussed and rejected. http://archives.postgresql.org/pgsql-hackers/2008-05/msg00753.php Explicitly defining the tables a transaction wishes to see has also been discussed and rejected. http://archives.postgresql.org/pgsql-hackers/2008-08/msg00268.php So the most generally applicable and simplest initial solution for generating snapshots is to take them from the remote primary node. The proposal for this follows: Transaction snapshots will be managed by a new process, Snapproc. Snapproc will start only in recovery mode and will exit when normal processing mode begins. The snapshot process will connect to the remote primary node, export snapshot data, and copy it to shared memory on the standby node.
When all standby backends have run UnregisterSnapshot(), the snapshot will be unregistered on the remote primary node also. The standby must not think a transaction is visible until all changes made by it have been applied. So snapshots from the primary cannot be used on the standby until the LSN at which they were taken has been reached by WAL apply on the standby. Snapshots don't normally have an LSN, so we must derive that information as well when we take a snapshot. With asynchronous commits a transaction can be committed and yet not yet written to disk, so we cannot rely on the current WAL write pointer. Taking the WAL insert pointer is the safest thing to do, though it is the most heavily contended. We don't want to hold ProcArrayLock while requesting WALInsertLock, so we will get the LSN of the WAL insert pointer *after* the snapshot has been derived (it doesn't matter that much, as long as it's not before the snapshot). So snapshots from the primary take time before they can be used. The delay is exactly the current processing delay from primary to standby, which is another reason why we do not wish to fall behind. Taking snapshots from the primary has a few disadvantages:

* snapshots take time before they are usable
* requesting new snapshots is via remote request
* snapshots on the primary prevent row removal (but this was also an advantage of this technique!)

If the primary and standby nodes are connected by private Ethernet then the snapshot request time will be ~10ms, though that is probably 100 times slower than current snapshot access. If primary and standby are on opposite sides of an ocean then times could be as high as 500ms. Taking snapshots directly could be annoyingly slow for small read-only statements, so we need to buffer this process in various ways. For larger queries, this may not be a problem at all, but we do not wish to limit or assume the uses of the standby node.
First of all, each backend will have the opportunity to reuse previous snapshots both within transactions and across them. A userset parameter snapshot_reuse_window = 0..6ms will define the time window in which any request for a new snapshot will simply result in being fed the last snapshot again. When the window on a snapshot has expired a newer snapshot will be presented instead. This idea is similar to serializable transactions, which continually reuse the same snapshot. This is a useful parameter for normal processing as well, since it will reduce contention on the ProcArrayLock for many applications. Snapshots can be reused across transactions in recovery mode, since they are held in shared memory. Amount of shared memory dedicated to storing snapshots will be max_connections * max size of snapshots. Since there is a delay between obtaining a new snapshot and it becoming usable the Snapshot process will buffer them until they become mature, like a good Whiskey. Snapshot process will take regular snapshots and pre-age them so that when a backend requests a snapshot it will be
Re: [HACKERS] Proposed patch: make SQL interval-literal syntaxwork per spec
Kevin Grittner [EMAIL PROTECTED] wrote: (4) I'm not 100% sure on this one, but it seemed to me that they were requiring year to be four digits and other components (except for fractional seconds) to be two digits. That can't be right. Maybe I saw that in datetime literal specs. Apologies. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposed patch: make SQL interval-literal syntax work per spec
Kevin Grittner [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] wrote: I am not sure about some of the corner cases --- anyone want to see if their understanding of the spec for interval string is different? The patch seems to support extensions to the standard. Right. All of these were extensions that already existed in PG. (3) It seems to me that they were requiring that there be a one-to-one match between the fields specified in the interval qualifier and the fields present in the interval string. Yeah. I couldn't actually find any such statement in SQL92, but the SQL:2008 draft has this at 5.3 rule 30: 30)Let N be the number of primary datetime fields in the precision of the interval literal, as specified by interval qualifier. The interval literal being defined shall contain N datetime components. and at rule 34: 34)Within the definition of an interval literal that contains a year-month literal, the interval qualifier shall not specify DAY, HOUR, MINUTE, or SECOND. Within the definition of an interval literal that contains a day-time literal, the interval qualifier shall not specify YEAR or MONTH. This seems to be requiring that not only do you give the exact number of components, but the formatting must match the expectation. So anything we accept beyond that is gravy. I think that most of the extension cases were already being accepted in some form, and I'd be hesitant to take them out for fear of breaking existing applications. There is still some unfinished business if anyone wants to make it really exactly 100% spec compliant ... I agree. I committed the patch as presented, and I think I might go take a quick look at the other two issues. What I suspect I'll find is that the minus sign issue isn't fixable without turning INTERVAL into a fully reserved word, which is probably a cure worse than the disease. However it might be fairly easy to get the grammar to allow the precision in either place. 
(We'd want to keep the old way working for backward compatibility.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Interesting glitch in autovacuum
Tom Lane wrote: BTW, I did a quick look at all the uses of RecentGlobalXmin in the back branches, and I think we might be all right before 8.2. The older branches do in fact init RecentGlobalXmin to InvalidTransactionId, and the only thing they use it for is is this tuple dead to everyone tests. Since InvalidTransactionId compares older than anything else, the only consequence of not having set it is overly-conservative decisions not to mark tuples killed. So unless you see a problem I missed, I think we only have an issue-worth-fixing in 8.2 and 8.3. Actually, 8.2 initializes it to InvalidTransactionId too, so it doesn't seem like there's a problem there either. Since the problem stems from using it as initializer for the Min() calculation, there's no actual bug on 8.2 either. The bug only appeared on 8.3 when the initializer was changed. And given that there's no HOT in 8.2, then there's no danger of misusing it in page pruning either. Still, I produced a patch to 8.2, but given that I'm not sure if it's worth applying. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. Index: src/backend/commands/vacuum.c === RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/commands/vacuum.c,v retrieving revision 1.342.2.5 diff -c -p -r1.342.2.5 vacuum.c *** src/backend/commands/vacuum.c 11 Feb 2008 19:14:38 - 1.342.2.5 --- src/backend/commands/vacuum.c 11 Sep 2008 01:08:10 - *** vac_update_datfrozenxid(void) *** 764,777 bool dirty = false; /* ! * Initialize the min calculation with RecentGlobalXmin. Any ! * not-yet-committed pg_class entries for new tables must have ! * relfrozenxid at least this high, because any other open xact must have ! * RecentXmin = its PGPROC.xmin = our RecentGlobalXmin; see ! * AddNewRelationTuple(). So we cannot produce a wrong minimum by ! * starting with this. */ ! 
newFrozenXid = RecentGlobalXmin; /* * We must seqscan pg_class to find the minimum Xid, because there is no --- 764,775 bool dirty = false; /* ! * Initialize the min calculation with GetOldestXmin, which is a ! * reasonable approximation to the minimum relfrozenxid for not-yet- ! * committed pg_class entries for new tables; see AddNewRelationTuple(). ! * Se we cannot produce a wrong minimum by starting with this. */ ! newFrozenXid = GetOldestXmin(true, true); /* * We must seqscan pg_class to find the minimum Xid, because there is no *** vacuum_rel(Oid relid, VacuumStmt *vacstm *** 965,982 /* Begin a transaction for vacuuming this relation */ StartTransactionCommand(); ! if (vacstmt-full) ! { ! /* functions in indexes may want a snapshot set */ ! ActiveSnapshot = CopySnapshot(GetTransactionSnapshot()); ! } ! else { /* ! * During a lazy VACUUM we do not run any user-supplied functions, and ! * so it should be safe to not create a transaction snapshot. ! * ! * We can furthermore set the inVacuum flag, which lets other * concurrent VACUUMs know that they can ignore this one while * determining their OldestXmin. (The reason we don't set inVacuum * during a full VACUUM is exactly that we may have to run user- --- 963,978 /* Begin a transaction for vacuuming this relation */ StartTransactionCommand(); ! /* ! * Functions in indexes may want a snapshot set. Also, setting ! * a snapshot ensures that RecentGlobalXmin is kept truly recent. ! */ ! ActiveSnapshot = CopySnapshot(GetTransactionSnapshot()); ! ! if (!vacstmt-full) { /* ! * During a lazy VACUUM we can set the inVacuum flag, which lets other * concurrent VACUUMs know that they can ignore this one while * determining their OldestXmin. 
(The reason we don't set inVacuum * during a full VACUUM is exactly that we may have to run user- Index: src/backend/utils/init/postinit.c === RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/utils/init/postinit.c,v retrieving revision 1.172 diff -c -p -r1.172 postinit.c *** src/backend/utils/init/postinit.c 5 Nov 2006 22:42:09 - 1.172 --- src/backend/utils/init/postinit.c 10 Sep 2008 23:43:45 - *** *** 42,47 --- 42,48 #include utils/guc.h #include utils/portal.h #include utils/relcache.h + #include utils/tqual.h #include utils/syscache.h #include pgstat.h *** InitPostgres(const char *dbname, const c *** 380,389 on_shmem_exit(ShutdownPostgres, 0); /* ! * Start a new transaction here before first access to db */ if (!bootstrap) StartTransactionCommand(); /* * Now that we have a transaction, we can take locks. Take a writer's --- 381,395 on_shmem_exit(ShutdownPostgres, 0); /* ! * Start a new transaction here
Re: [HACKERS] Proposed patch: make SQL interval-literal syntax work per spec
(1) In the spec, an interval value or literal must be either year-month or day-time. (I guess they didn't want to try to deal with the sticky issues of what it means to have an interval of, for example, seven months and three days -- since an interval has no sense of which seven months.) Note that, for usability reasons, Karel some time ago tri-partitioned our intervals: year-month|day-week|hour-min-sec. --Josh -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Interesting glitch in autovacuum
On Wed, Sep 10, 2008 at 9:22 PM, Tom Lane [EMAIL PROTECTED] wrote: On reflection I'm not even sure that this is strictly an autovacuum bug. It can be cast more generically as RecentGlobalXmin getting used without ever having been set, and it sure looks to me like the HOT patch may have introduced a few risks of that sort. ISTM that HOT may be safe here because even if RecentGlobalXmin is not set and has the boot-time value of FirstNormalTransactionId, heap_page_prune_opt() would just return without doing any real work. This is OK except in VACUUM, where we use OldestXmin anyway. So I don't see a real problem here. But it's good to fix the larger problem as per the suggestion. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Commitfest patches mostly assigned ... status
Hackers, At this point, almost all patches have been assigned to reviewers. If you submitted a patch and don't get feedback by Saturday, take a look at who's in the reviewers column and send them a query. Since I have no way to track when patches are assigned to reviewers, I have no idea if some of them are sitting on their hands. Some patches have not been assigned to reviewers for various reasons. The following weren't assigned because they are complex and really need a high-end hacker or committer to take them on:

libpq events
rmgr hooks and contrib/rmgr_hook
CLUSTER using sort instead of index scan

The following were not assigned because there has already been discussion on this list debating whether they are a good idea. These need consensus on this list before they get assigned to a reviewer:

remove --inputdir and --outputdir from pg_regress
GUC: Case-insensitive units
Allow has_table_privilege(...,'usage') on sequences

I've assigned some reviewers to WIP patches with instructions to report back on their general experience with building, functionality and spec. Note that patches need not have only one reviewer! If you have time, please take on testing some of the more complex patches, especially:

Column-level Permissions
Common Table Expressions
SE-PostgreSQL patches

Also, note that the following patches need performance testing on a variety of platforms. Everyone should help with this:

GSoC Improved Hash Indexing
posix fadvises
operator restrictivity function for text search
CLUSTER using sort instead of index scan

Thanks for your input, and let's close this commitfest in a week! --CommitFest Mom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers