Re: [HACKERS] [BUGS] BUG #5487: dblink failed with 63 bytes connection names
Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
> Hmm, seems that dblink should call truncate_identifier() for the
> truncation, to be consistent with truncation of table names etc.

Hmmm, we need the same routine as truncate_identifier(), but it is hard to use that function here because it modifies the input buffer directly. Since all of the name strings in dblink are const char *, I added a slightly modified version of the function as truncate_identifier_copy() in the attached v2 patch.

I also spotted this in dblink.c:

    /* first gather the server connstr options */
    if (strlen(servername) < NAMEDATALEN)
        foreign_server = GetForeignServerByName(servername, true);

> I think that's wrong. We normally consistently truncate identifiers at
> creation and at use, so that if you create an object with a very long
> name and it's truncated, you can still refer to it with the untruncated
> name because all such references are truncated too.

Absolutely. I re-used the added function for the fix.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center

Attachment: dblink_63bytes-2010602.patch
Re: [HACKERS] Synchronization levels in SR
Heikki Linnakangas wrote:
> The possibilities are endless... Your proposal above covers a pretty
> good set of scenarios, but it's by no means complete. If we try to
> solve everything the configuration will need to be written in a
> Turing-complete Replication Description Language. We'll have to pick a
> useful, easy-to-understand subset that covers the common scenarios. To
> handle the more exotic scenarios, you can write a proxy that sits in
> front of the master, and implements whatever rules you wish, with the
> rules written in C.

I was thinking about this a bit recently. As I see it, there are three fundamental parts of this:

1) We have a transaction that is being committed. The rest of the computations here are all relative to it.

2) There is an (internal?) table that lists the state of each replication target relative to that transaction. It would include the node name, perhaps some metadata ('location' seems the one that's most likely to help with the remote data center issue), and a state code. The codes from http://wiki.postgresql.org/wiki/Streaming_Replication work fine for the last part (which is the only dynamic one--everything else is static data being joined against):

    async = hasn't received yet
    recv  = received, but just in RAM
    fsync = received and synced to disk
    apply = applied to the database

These would need to be enums so they can be ordered from lesser to greater consistency. So in a 3-node case, the internal state table might look like this after a bit of data had been committed:

    node | location | state
    -----+----------+-------
    a    | local    | fsync
    b    | remote   | recv
    c    | remote   | async

This means that the local node has a fully persistent copy, but the best either remote one has done is to receive the data; it's not on disk at all yet at the remote data center. Still working its way through.

3) The decision about whether the data has been committed to enough places to be considered safe by the master is computed by a function that is passed this internal table as something like a SRF, and it returns a boolean. Once that returns true, saying it's satisfied, the transaction closes on the master and continues to percolate out from there. If it's false, we wait for another state change to come in and return to (2).

I would propose that most behaviors someone has expressed as their desired implementation are possible to implement using this scheme (see the sketch just below):

- Semi-sync commit: proceed as soon as somebody else has a copy and hope the copies all become consistent: EXISTS WHERE state=recv

- Don't proceed until there's an fsync'd commit on at least one of the remote nodes: EXISTS WHERE location='remote' AND state=fsync

- Look for a quorum of n commits of fsync quality: CASE WHEN (SELECT COUNT(*) WHERE state=fsync) > n THEN true ELSE false END

Syntax is obviously rough, but I think you can get the drift of what I'm suggesting. While exposing the local state and running this computation isn't free, in situations where there truly are remote nodes being communicated with, the network overhead is going to dwarf that. If there were a fast path for the simplest cases and this complicated one for the rest, I think you could get the fully programmable behavior some people want using simple SQL, rather than having to write a new Replication Description Language or something so ambitious. This data about what's been replicated to where looks an awful lot like a set of rows you can operate on using features already in the database to me.
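[Editor's note: to make the rule examples above concrete, here is a sketch of what such wakeup rules could look like if the per-transaction state table were exposed as a set-returning function. Everything here is hypothetical: replication_state() and its enum ordering exist only in this proposal, not in PostgreSQL.]

    -- Hypothetical SRF exposing the internal state table described above:
    --   replication_state() -> (node text, location text, state sync_level)
    -- where sync_level is an enum ordered async < recv < fsync < apply.

    -- Semi-sync commit: someone, anywhere, has at least received the data.
    SELECT EXISTS (SELECT 1 FROM replication_state() WHERE state >= 'recv');

    -- At least one remote node has fsync'd the commit.
    SELECT EXISTS (SELECT 1 FROM replication_state()
                   WHERE location = 'remote' AND state >= 'fsync');

    -- Quorum: at least n nodes (here 2) have fsync-quality copies.
    SELECT count(*) >= 2 FROM replication_state() WHERE state >= 'fsync';

Each query returns a boolean: true means the commit may be acknowledged, false means wait for the next state change and re-evaluate.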
-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us
[HACKERS] obsolete comments in xlog.c
Hi,

I found some obsolete comments in xlog.c, which are related to the recently-deleted variable fromArchive.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c886571..65675b9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2609,8 +2609,8 @@ XLogFileOpen(uint32 log, uint32 seg)
 /*
  * Open a logfile segment for reading (during recovery).
  *
- * If fromArchive is true, the segment is retrieved from archive, otherwise
- * it's read from pg_xlog.
+ * If source = XLOG_FROM_ARCHIVE, the segment is retrieved from archive.
+ * If source = XLOG_FROM_PG_XLOG, it's read from pg_xlog.
  */
 static int
 XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
@@ -2673,8 +2673,6 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
  * Open a logfile segment for reading (during recovery).
  *
  * This version searches for the segment with any TLI listed in expectedTLIs.
- * If not in StandbyMode and fromArchive is true, the segment is also
- * searched in pg_xlog if not found in archive.
  */
 static int
 XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Synchronization levels in SR
On 02/06/10 10:22, Greg Smith wrote:
> I was thinking about this a bit recently. As I see it, there are three
> fundamental parts of this:
>
> 1) We have a transaction that is being committed. The rest of the
> computations here are all relative to it.

Agreed.

> So in a 3-node case, the internal state table might look like this
> after a bit of data had been committed:
>
>     node | location | state
>     -----+----------+-------
>     a    | local    | fsync
>     b    | remote   | recv
>     c    | remote   | async
>
> This means that the local node has a fully persistent copy, but the
> best either remote one has done is to receive the data; it's not on
> disk at all yet at the remote data center. Still working its way
> through.
>
> 3) The decision about whether the data has been committed to enough
> places to be considered safe by the master is computed by a function
> that is passed this internal table as something like a SRF, and it
> returns a boolean. Once that returns true, saying it's satisfied, the
> transaction closes on the master and continues to percolate out from
> there. If it's false, we wait for another state change to come in and
> return to (2).

You can't implement "wait for X to ack the commit, but if that doesn't happen in Y seconds, time out and return true anyway" with that.

> While exposing the local state and running this computation isn't free,
> in situations where there truly are remote nodes being communicated
> with, the network overhead is going to dwarf that. [...] This data
> about what's been replicated to where looks an awful lot like a set of
> rows you can operate on using features already in the database to me.

Yeah, if we want to provide full control over when a commit is acknowledged to the client, there's certainly no reason we can't expose that using a hook or something. It's pretty scary to call a user-defined function at that point in a transaction, though. Even if we document that you must refrain from doing nasty stuff like modifying tables in that function, it's still scary.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] obsolete comments in xlog.c
On 02/06/10 10:39, Fujii Masao wrote:
> I found some obsolete comments in xlog.c, which are related to the
> recently-deleted variable fromArchive.

Thanks, committed.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Synchronization levels in SR
On Wed, 2010-06-02 at 03:22 -0400, Greg Smith wrote:
> I was thinking about this a bit recently. As I see it, there are three
> fundamental parts of this: [...]
> This data about what's been replicated to where looks an awful lot like
> a set of rows you can operate on using features already in the database
> to me.
I think we're all agreed on the 4 levels: async, recv, fsync, apply.

I also like the concept of a synchronisation/wakeup rule as an abstract concept. It certainly makes things easier to discuss. The inputs to the wakeup rule can be defined in different ways.

Holding per-node state at the local level looks too complex to me. I'm not suggesting that we need both per-node AND per-transaction options interacting at the same time. (That would be a clear argument against per-transaction options, if it was a requirement -- it's not, for me.)

There seems to be a simpler way: a service-oriented model. The transaction requests a minimum level of synchronisation; the standbys together service that request. A simple, clear process:

1. If a transaction requests recv, fsync or apply, the backend sleeps in the appropriate queue.

2. An agent on behalf of the remote standby provides feedback according to the levels of service defined for that standby.

3. The agent calls a wakeup rule to see if the backend can be woken yet.

The most basic rule is "first-standby-wakes", meaning that the first standby to provide feedback that the required synchronisation level has been met by at least one standby will cause the rule to fire. The next most basic thing is that some standbys can be marked as not taking part in the quorum
Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages
On 28/05/10 04:00, Josh Berkus wrote:
>> Consider a table that is regularly written but append-only. Every time
>> autovacuum kicks in, we'll go and remove any dead tuples and then mark
>> the pages PD_ALL_VISIBLE and set the visibility map bits, which will
>> cause subsequent vacuums to ignore the all-visible portions of the
>> table... until anti-wraparound kicks in, at which point we'll vacuum
>> the entire table and freeze everything.
>>
>> If, however, we decree that you can't write a new tuple into a
>> PD_ALL_VISIBLE page without freezing the existing tuples, then you'll
>> still have the small, incremental vacuums but those are pretty cheap,
>
> That only works if those pages were going to be autovacuumed anyway. In
> the case outlined above (which I've seen at 3 different production
> sites this year), they wouldn't be; a table with less than 2% updates
> and deletes does not get vacuumed until max_freeze_age for any reason.
> For that matter, pages which are getting autovacuumed are not a
> problem, period; they're being read and written and freezing them is
> not an issue.
>
> I'm not seeing a way of fixing this common issue short of overhauling
> CLOG, or of creating a freeze_map. Darn.

Don't you get a positive enough effect by adjusting the table's autovacuum_freeze_min_age and autovacuum_freeze_max_age? If you set those numbers small, it appears to me that you would get very quickly to a state where vacuum would examine only the most recent part of the table rather than the whole thing (a sketch follows below). Does that give you enough of a win that it stops the scanning and writing of the whole table, which reduces the performance problem being experienced? It's not a complete solution, but does it go some way?

Regards

Russell
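[Editor's note: as a concrete illustration of Russell's suggestion, PostgreSQL (8.4 and later) lets you set these thresholds per table as storage parameters; the table name and values below are illustrative only, not recommendations from the thread.]

    -- Freeze tuples aggressively on this mostly-append-only table, so each
    -- regular autovacuum freezes the recent portion and the eventual
    -- anti-wraparound scan has little left to rewrite:
    ALTER TABLE append_only_log
        SET (autovacuum_freeze_min_age = 10000,
             autovacuum_freeze_max_age = 100000000);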
Re: [HACKERS] [BUGS] BUG #5487: dblink failed with 63 bytes connection names
On 02/06/10 09:46, Takahiro Itagaki wrote:
> Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
>> Hmm, seems that dblink should call truncate_identifier() for the
>> truncation, to be consistent with truncation of table names etc.
>
> Hmmm, we need the same routine as truncate_identifier(), but it is hard
> to use that function because it modifies the input buffer directly.
> Since all of the name strings in dblink are const char *, I added a
> slightly modified version of the function as truncate_identifier_copy()
> in the attached v2 patch.

Well, looking at the callers, most if not all of them seem to actually pass a palloc'd copy, allocated by text_to_cstring(). Should we throw a NOTICE like vanilla truncate_identifier() does? I feel it would be better to just call truncate_identifier() than roll your own. Assuming we want the same semantics in dblink, we'll otherwise have to remember to update truncate_identifier_copy() with any changes to truncate_identifier().

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
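[Editor's note: for reference, the standard truncation behavior Heikki wants dblink to match is observable from plain SQL. With the default NAMEDATALEN of 64, identifiers longer than 63 bytes are cut down with a NOTICE, consistently at creation and at use.]

    -- A 66-byte identifier: CREATE emits a NOTICE and truncates to 63 bytes.
    CREATE TABLE t_0123456789012345678901234567890123456789012345678901234567890123 (i int);

    -- Referring to it by the full 66-byte name still works, because the
    -- reference is truncated the same way; that is the consistency the
    -- thread is asking dblink's connection names to follow.
    SELECT * FROM t_0123456789012345678901234567890123456789012345678901234567890123;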
Re: [HACKERS] Streaming Replication: Checkpoint_segment and wal_keep_segments on standby
On 02/06/10 06:23, Fujii Masao wrote:
> On Mon, May 31, 2010 at 7:17 PM, Fujii Masao <masao.fu...@gmail.com> wrote:
>> 4) Change it so that checkpoint_segments takes effect in standby mode,
>> but not during recovery otherwise
>
> I revised the patch to achieve 4). This will enable checkpoint_segments
> to trigger a restartpoint like checkpoint_timeout already does, in
> standby mode (i.e., streaming replication or file-based log shipping).

Hmm, XLogCtl->Insert.RedoRecPtr is not updated during recovery, so this doesn't work.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
[HACKERS] proposal - custom format for date, timestamp
Hello

I am thinking about a request for a custom datetime format. My first idea is:

a) add a new datestyle format, "custom"
b) add new GUC variables - custom_date_format, custom_time_format, custom_timestamp_format

There are some questions:

a) what is the right behavior when datestyle = custom, but some necessary custom_*_format is empty?
b) the custom format is used for output; does it have to be used for input too?
c) when the input isn't correct for the custom format, should it be parsed the usual way?

I did some conceptual prototype - see attachment

    postgres=# set datestyle to custom;
    SET
    Time: 0,450 ms
    postgres=# select current_date;
    ERROR:  custom_datestyle_format is empty
    postgres=# set custom_datestyle_format to 'Dy Mon YYYY';
    SET
    Time: 0,409 ms
    postgres=# select current_date;
         date
    --------------
     Wed Jun 2010
    (1 row)

    Time: 0,485 ms

Regards
Pavel Stehule

Attachment: custom_format.diff
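[Editor's note: for comparison, per-query custom formatting is already available via the built-in to_char() function; what the proposal adds is making a custom format the session-wide default output. The same pattern as in the prototype session:]

    -- Existing, non-GUC way to get the same output format:
    SELECT to_char(current_date, 'Dy Mon YYYY');
    --    to_char
    -- --------------
    --  Wed Jun 2010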
Re: [HACKERS] Streaming Replication: Checkpoint_segment and wal_keep_segments on standby
On Wed, Jun 2, 2010 at 8:40 PM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
> On 02/06/10 06:23, Fujii Masao wrote:
>> I revised the patch to achieve 4). This will enable checkpoint_segments
>> to trigger a restartpoint like checkpoint_timeout already does, in
>> standby mode (i.e., streaming replication or file-based log shipping).
>
> Hmm, XLogCtl->Insert.RedoRecPtr is not updated during recovery, so this
> doesn't work.

Oops! I revised the patch, which changes CreateRestartPoint() so that it updates XLogCtl->Insert.RedoRecPtr.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment: checkpoint_segments_during_recovery_v3.patch
Re: [HACKERS] proposal - custom format for date, timestamp
Pavel Stehule <pavel.steh...@gmail.com> writes:
> I am thinking about a request for a custom datetime format. My first
> idea is:
> a) add a new datestyle format, "custom"
> b) add new GUC variables - custom_date_format, custom_time_format,
> custom_timestamp_format

Ick. Why not just put them in one variable:

    datestyle = 'custom mm dd'

Interrelated GUCs are really painful.

			regards, tom lane
Re: [HACKERS] Synchronization levels in SR
Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> writes:
> It's pretty scary to call a user-defined function at that point in a
> transaction.

Not so much "pretty scary" as "zero chance of being accepted". And I do mean zero.

			regards, tom lane
Re: [HACKERS] proposal - custom format for date, timestamp
2010/6/2 Tom Lane <t...@sss.pgh.pa.us>:
> Ick. Why not just put them in one variable:
>
>     datestyle = 'custom mm dd'
>
> Interrelated GUCs are really painful.

I understand, but personally I dislike mixing two different values - the name and the format - together. What's more, we need separate formats for date, time and timestamp. It is different from Oracle here, because Oracle uses only one datatype for all of them.

Regards
Pavel
Re: [HACKERS] Keepalive for max_standby_delay
On Mon, 2010-05-31 at 14:40 -0400, Bruce Momjian wrote:
> Uh, we have three days before we package 9.0beta2. It would be good if
> we could decide on the max_standby_delay issue soon.

I've heard something from Heikki, not from anyone else. Those comments amount to "let's replace max_standby_delay with max_apply_delay".

Got a beta idea?

-- 
Simon Riggs
www.2ndQuadrant.com
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs <si...@2ndquadrant.com> wrote:
> I've heard something from Heikki, not from anyone else. Those comments
> amount to "let's replace max_standby_delay with max_apply_delay".
>
> Got a beta idea?

Given the incessant ticking of the clock, I have a hard time believing we have any real options besides max_standby_delay or a boolean which corresponds to the -1 and 0 settings of max_standby_delay. I think it's pretty clear that there's a use case for the positive values, although there are bound to be some who try it and are surprised by behavior at the transition from idle to active. The whole debate seems to boil down to how important a middle ground is versus how damaging the surprise factor is. (I don't really buy the argument that we won't be able to remove it later if we replace it with something better.)

I know there were initially some technical problems, too; have those been resolved?

-Kevin
Re: [HACKERS] Keepalive for max_standby_delay
Simon Riggs <si...@2ndquadrant.com> writes:
> OK, here's v4.

I've been trying to stay out of this thread, but with beta2 approaching and no resolution in sight, I'm afraid I have to get involved.

This patch seems to me to be going in fundamentally the wrong direction. It's adding complexity and overhead (far more than is needed), and it's failing utterly to resolve the objections that I raised to start with.

In particular, Simon seems to be basically refusing to do anything about the complaint that the code fails unless master and standby clocks are in close sync. I do not believe that this is acceptable, and since he won't fix it, I guess I'll have to.

The other basic problem that I had with the current code, which this does nothing about, is that it believes that timestamps pulled from WAL archive should be treated on the same footing as timestamps of live actions. That might work in certain limited scenarios but it's going to be a disaster in general.

I believe that the motivation for treating archived timestamps as live is, essentially, to force rapid catchup if a slave falls behind so far that it's reading from archive instead of SR. There are certainly use-cases where that's appropriate (though, again, not all); but even when you do want it, it's a pretty inflexible implementation. For realistic values of max_standby_delay the behavior is going to pretty much be "instant kill" whenever we're reading from archive.

I have backed off my original suggestion that we should reduce max_standby_delay to a boolean: that was based on the idea that delays would only occur in response to DDL on the master, but since vacuum and btree page splits can also trigger delays, it seems clear that a "kill queries immediately" policy isn't really very useful in practice. So we need to make max_standby_delay work rather than just get rid of it.

What I think might be a realistic compromise is this:

1. Separate max_standby_delay into two GUCs, say "max_streaming_delay" and "max_archive_delay".

2. When applying WAL that came across SR, use max_streaming_delay and let the time measurement be current time minus time of receipt of the current WAL send chunk.

3. When applying WAL that came from archive, use max_archive_delay and let the time measurement be current time minus time of acquisition of the current WAL segment from the archive.

The current code's behavior in the latter case could effectively be modeled by setting max_archive_delay to zero, but that isn't the only plausible setting. More likely, DBAs would set max_archive_delay to something smaller than max_streaming_delay, but still positive so as to not kill conflicting queries instantly.

An important property of this design is that all relevant timestamps are taken on the slave, so clock skew isn't an issue.

I'm still inclined to apply the part of Simon's patch that adds a transmit timestamp to each SR send chunk. That would actually be completely unused by the slave given my proposal above, but I think that it is an important step to take to future-proof the SR protocol against possible changes in the slave-side timing logic. I don't however see the value of transmitting keepalive records when we'd otherwise not have anything to send. The value of adding timestamps to the SR protocol is really to let the slave determine the current amount of clock skew between it and the master; which is a number that should not change so rapidly that it has to be updated every 100ms even in an idle system.

Comments?
			regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> An important property of this design is that all relevant timestamps
> are taken on the slave, so clock skew isn't an issue.

I agree that this is important, and I do run NTP on all my servers and even monitor it using Nagios. It's still not a cure-all for time skew issues.

> Comments?

I'm not really a huge fan of adding another GUC, to be honest. I'm more inclined to say we treat 'max_archive_delay' as '0', and turn max_streaming_delay into what you've described. If we fall back so far that we have to go back to reading WALs, then we need to hurry up and catch up, and damn the torpedoes.

I'd also prefer that we only wait the delay time once until we're fully caught up again (and have gotten back around to waiting for new data).

	Thanks,

		Stephen
Re: [HACKERS] Synchronization levels in SR
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> writes:
>> It's pretty scary to call a user-defined function at that point in a
>> transaction.
>
> Not so much "pretty scary" as "zero chance of being accepted". And I do
> mean zero.

I swear, you guys are such buzzkills some days. I was suggesting a model for building easy prototypes, and advocating a more formal way to explain, in what could be code form, what someone means when they suggest a particular quorum model or the like. Maybe all that will ever be exposed in a production server are the best of the hand-written implementations, and the scary "try your prototype here" hook only shows up in debug builds, or never gets written at all. I did comment that I expected faster built-in implementations to be the primary way these would be handled.

From what Heikki said, it sounds like the main thing I didn't remember is to include some timestamp information, to allow rules based on that information too.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us
Re: [HACKERS] Keepalive for max_standby_delay
Tom Lane wrote:
> I'm still inclined to apply the part of Simon's patch that adds a
> transmit timestamp to each SR send chunk. That would actually be
> completely unused by the slave given my proposal above, but I think
> that it is an important step to take to future-proof the SR protocol
> against possible changes in the slave-side timing logic.

+1.

From a radically different perspective, I had to do something similar in the buildfarm years ago to protect us from machines reporting with grossly inaccurate timestamps. This was part of the solution: the client adds its current timestamp setting just before transmitting the data to the server.

cheers

andrew
Re: [HACKERS] Keepalive for max_standby_delay
Stephen Frost <sfr...@snowman.net> writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Comments?
>
> I'm not really a huge fan of adding another GUC, to be honest. I'm more
> inclined to say we treat 'max_archive_delay' as '0', and turn
> max_streaming_delay into what you've described. If we fall back so far
> that we have to go back to reading WALs, then we need to hurry up and
> catch up, and damn the torpedoes.

If I thought that 0 were a generally acceptable value, I'd still be pushing the "simplify it to a boolean" agenda ;-). The problem is that that will sometimes kill standby queries even when they are quite short and doing nothing objectionable.

> I'd also prefer that we only wait the delay time once until we're fully
> caught up again (and have gotten back around to waiting for new data).

The delays will be measured from a receipt instant to current time, which means that the longer it takes to apply a WAL segment or WAL send chunk, the less grace period there will be. (Which is the same as what CVS HEAD does --- I'm just arguing about where we get the start time from.) I believe this does what you suggest and more.

			regards, tom lane
Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages
Excerpts from Russell Smith's message of mié jun 02 06:38:35 -0400 2010:
> Don't you get a positive enough effect by adjusting the table's
> autovacuum_freeze_min_age and autovacuum_freeze_max_age? If you set
> those numbers small, it appears to me that you would get very quickly
> to a state where vacuum would examine only the most recent part of the
> table rather than the whole thing.

The problem is that vacuum doesn't know that a certain part of the table is already frozen. It needs to scan it completely anyway. If we had a frozen map, we could mark pages that are completely frozen and thus do not need any vacuuming; but we don't. (I don't recall the reasons for this. Maybe it's just that no one has gotten around to it, or maybe there's something else.)

-- 
Álvaro Herrera <alvhe...@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
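[Editor's note: the wraparound pressure behind this thread can at least be observed from the catalogs today. A query along these lines shows how far each table's oldest unfrozen XID has drifted; the closer age() gets to autovacuum_freeze_max_age, the sooner the full-table anti-wraparound vacuum Alvaro describes will run.]

    -- Distance of each table from its last full freeze:
    SELECT relname, age(relfrozenxid) AS xid_age
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY age(relfrozenxid) DESC
    LIMIT 10;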
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-06-02 at 13:45 -0400, Tom Lane wrote:
> If I thought that 0 were a generally acceptable value, I'd still be
> pushing the "simplify it to a boolean" agenda ;-). The problem is that
> that will sometimes kill standby queries even when they are quite short
> and doing nothing objectionable.

OK, now I understand. I was just thinking the same as Stephen, but now I agree we need a second parameter.

-- 
Simon Riggs
www.2ndQuadrant.com
Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages
Alvaro Herrera <alvhe...@commandprompt.com> writes:
> The problem is that vacuum doesn't know that a certain part of the
> table is already frozen. It needs to scan it completely anyway. If we
> had a frozen map, we could mark pages that are completely frozen and
> thus do not need any vacuuming; but we don't. (I don't recall the
> reasons for this. Maybe it's just that no one has gotten around to it,
> or maybe there's something else.)

Offhand I think the reason is that you'd have to trust the frozen bit to be 100% correct (or at least never set to 1 in error). Currently, both the FSM and visibility forks are just hints, and we won't suffer data corruption if they're wrong; so we don't get too tense about WAL-logging or fsync'ing updates.

I believe Heikki is looking into what it'd take to make the visibility map 100% reliable, in connection with the desire for index-only scans. If we get that and the overhead isn't too terrible, maybe we could build a frozen-status map the same way.

			regards, tom lane
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 2, 2010 at 2:03 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> OK, now I understand. I was just thinking the same as Stephen, but now
> I agree we need a second parameter.

I too was thinking the same as Stephen, but now I also agree we need a second parameter. I think the whole design Tom proposed seems good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages
On Wed, Jun 2, 2010 at 2:05 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Offhand I think the reason is that you'd have to trust the frozen bit
> to be 100% correct (or at least never set to 1 in error). [...]
> I believe Heikki is looking into what it'd take to make the visibility
> map 100% reliable, in connection with the desire for index-only scans.
> If we get that and the overhead isn't too terrible, maybe we could
> build a frozen-status map the same way.

We could, but I think we'd be better off just freezing at the time we mark the page PD_ALL_VISIBLE and then using the visibility map for both purposes. Keeping around the old xmin values after every tuple on the page is visible to every running transaction is useful only for forensics, and building a whole new freeze map just to retain that information longer (and eventually force a massive anti-wraparound vacuum) seems like overkill.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I believe that the motivation for treating archived timestamps as live
> is, essentially, to force rapid catchup if a slave falls behind so far
> that it's reading from archive instead of SR.

Huh, I think this is the first mention of this that I've seen. I always assumed the motivation was just that you wanted to control how much data loss could occur on failover and how long recovery would take. I think separating the two delays is an interesting idea, but I don't see how it counts as a simplification.

This also still allows a slave to become arbitrarily far behind the master. The master might take a long time to fill up a log file, and if the slave is blocked for a few minutes, then requests the next bit of log file and blocks for a few more minutes, then requests the next log file and blocks for a few more minutes, etc. As long as the slave is reading more slowly than the master is filling those segments, it will fall further and further behind, but never be more than a few minutes behind in receiving the logs.

I propose an alternate way out of the problem of syncing two clocks: instead of comparing timestamps, compare time intervals. As it reads xlog records, the slave only ever compares master timestamps with previous master timestamps to determine how much time has elapsed on the master. It compares that elapsed time with the time elapsed on the slave to determine if it's falling behind. I'm assuming it periodically catches up and can throw out its accumulated elapsed time and start fresh. If not, then the accumulated error might be a problem, but hopefully one we can find a solution to.

-- 
greg
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote:
> This patch seems to me to be going in fundamentally the wrong
> direction. It's adding complexity and overhead (far more than is
> needed), and it's failing utterly to resolve the objections that I
> raised to start with.

Having read your proposal, it seems changing from time-on-sender to time-on-receiver is a one-line change to the patch. What else are you thinking of removing, if anything? Adding an extra parameter obviously adds more, and is something I now agree with.

> In particular, Simon seems to be basically refusing to do anything
> about the complaint that the code fails unless master and standby
> clocks are in close sync. I do not believe that this is acceptable, and
> since he won't fix it, I guess I'll have to.

Syncing two servers in replication is common practice, as has been explained here; I'm still surprised people think otherwise. Measuring the time between two servers is the very purpose of the patch, so the synchronisation is not a design flaw, it is its raison d'être. There have been a few spleens emptied on that topic, not all of them mine, and certainly no consensus on it. So I'm not refusing to do anything that's been agreed...

-- 
Simon Riggs
www.2ndQuadrant.com
Re: [HACKERS] Exposing the Xact commit order to the user
d...@csail.mit.edu (Dan Ports) writes:
> I'm not clear on why the total rowcount is useful, but perhaps I'm
> missing something obvious.

It would make it easy to conclude: "This next transaction did 8328194 updates. Maybe we should do some kind of checkpoint (e.g., commit transaction or such) before working on it." versus "This transaction we're thinking of working on had 7 updates. No big deal..."

-- 
(reverse (concatenate 'string ofni.secnanifxunil @ enworbbc))
http://linuxfinances.info/info/finances.html
Rules of the Evil Overlord #189. "I will never tell the hero 'Yes I was the one who did it, but you'll never be able to prove it to that incompetent old fool.' Chances are, that incompetent old fool is standing behind the curtain." http://www.eviloverlord.com/
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 2, 2010 at 2:27 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> Syncing two servers in replication is common practice, as has been
> explained here; I'm still surprised people think otherwise. Measuring
> the time between two servers is the very purpose of the patch, so the
> synchronisation is not a design flaw, it is its raison d'être.

I think the purpose of the patch should not be to measure the time difference between servers, but rather the replication delay. While I don't necessarily agree with Tom's statement that this is must-fix, I do agree that it would be nicer if we could avoid depending on time sync. Yeah, I keep my servers time-synced, too. But, shit happens.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
[HACKERS] caught_up status in walsender
I wrote:
> I'm still inclined to apply the part of Simon's patch that adds a
> transmit timestamp to each SR send chunk. That would actually be
> completely unused by the slave given my proposal above, but I think
> that it is an important step to take to future-proof the SR protocol
> against possible changes in the slave-side timing logic.

On further contemplation, it seems like the protocol needs another field besides that: each record should also carry a boolean indicating whether walsender.c thinks it is currently caught up, ie the record carries all WAL data up to the current end of WAL. If the sender is not caught up, then the receiver should apply max_archive_delay not max_streaming_delay. In this way, WAL chunks that are a little bit behind current time will be treated the same way whether they come across the SR link or via the archive.

In conjunction with that, I think there's a logic bug in XLogSend; it ought to be changed like so:

        /* if we went beyond SendRqstPtr, back off */
        if (XLByteLT(SendRqstPtr, endptr))
    +   {
            endptr = SendRqstPtr;
    +       *caughtup = true;
    +   }

In the current coding, the effect of not setting *caughtup here is just that we uselessly call XLogSend an extra time for each transmission (because the main loop won't ever delay immediately after a transmission). But without this, we'd never send caughtup = true to the slave.

			regards, tom lane
Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages
Excerpts from Robert Haas's message of mié jun 02 14:16:33 -0400 2010:
> We could, but I think we'd be better off just freezing at the time we
> mark the page PD_ALL_VISIBLE and then using the visibility map for both
> purposes. Keeping around the old xmin values after every tuple on the
> page is visible to every running transaction is useful only for
> forensics, and building a whole new freeze map just to retain that
> information longer (and eventually force a massive anti-wraparound
> vacuum) seems like overkill.

Reducing the xid wraparound horizon a bit is reasonable, but moving it all the way forward to OldestXmin is a bit much, methinks. Besides, there's another argument for not freezing tuples immediately: they may be updated shortly thereafter, causing extra churn for no gain.

I'd prefer a setting that would tell the system to freeze all tuples that fall within a safety range whenever any tuple in the page is frozen -- weren't you working on a patch to do this? (Was it Jeff Davis?)

(BTW maybe instead of separate visibility and freeze maps we could have two bits in the visibility map?)

-- 
Álvaro Herrera <alvhe...@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Keepalive for max_standby_delay
Greg Stark <gsst...@mit.edu> writes:
> On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> I believe that the motivation for treating archived timestamps as live
>> is, essentially, to force rapid catchup if a slave falls behind so far
>> that it's reading from archive instead of SR.
>
> Huh, I think this is the first mention of this that I've seen. I always
> assumed the motivation was just that you wanted to control how much
> data loss could occur on failover and how long recovery would take. I
> think separating the two delays is an interesting idea, but I don't see
> how it counts as a simplification.

Well, it isn't a simplification: it's bringing it up to the minimum complication level where it'll actually work sanely. The current implementation doesn't work sanely because it confuses stale timestamps read from WAL with real live time.

> This also still allows a slave to become arbitrarily far behind the
> master.

Indeed, but nothing we do can prevent that, if the slave is just plain slower than the master. You have to assume that the slave is capable of keeping up in the absence of query-caused delays, or you're hosed.

The real reason this is at issue is the fact that the max_standby_delay kill mechanism applies to certain buffer-level locking operations. On the master we just wait, and it's not a problem, because in practice the conflicting queries almost always release these locks pretty quick. On the slave, though, instant kill as a result of a buffer-level lock conflict would result in a very serious degradation in standby query reliability (while also doing practically nothing for the speed of WAL application, most of the time). This morning on the phone Bruce and I were seriously discussing the idea of ripping the max_standby_delay mechanism out of the buffer-level locking paths, and just letting them work like they do on the master, ie, wait forever. If we did that then simplifying max_standby_delay to a boolean would be reasonable again (because it really would only come into play for DDL on the master).

The sticky point is that once in a blue moon you do have a conflicting query sitting on a buffer lock for a long time, or even more likely a series of queries keeping the WAL replay process from obtaining buffer cleanup lock. So it seems that we have to have max_standby_delay-like logic for those locks, and also that a zero grace period before kill isn't a very practical setting. However, there isn't a lot of point in obsessing over exactly how long the grace period ought to be, as long as it's more than a few milliseconds. It *isn't* going to have any real effect on whether the slave can stay caught up. You could make a fairly decent case for just measuring the grace period from when the replay process starts to wait, as I think I proposed awhile back. The value of measuring delay from a receipt time is that if you do happen to have a bunch of delays within a short interval, you'll get more willing to kill queries --- but I really believe that that is a corner case and will have nothing to do with ordinary performance.

> I propose an alternate way out of the problem of syncing two clocks:
> instead of comparing timestamps, compare time intervals. As it reads
> xlog records, the slave only ever compares master timestamps with
> previous master timestamps to determine how much time has elapsed on
> the master. It compares that elapsed time with the time elapsed on the
> slave to determine if it's falling behind.
I think this would just add complexity and uncertainty, to deal with something that won't be much of a problem in practice.

			regards, tom lane
Re: [HACKERS] caught_up status in walsender
On 02/06/10 21:44, Tom Lane wrote:
> In conjunction with that, I think there's a logic bug in XLogSend; it
> ought to be changed like so:
>
>         /* if we went beyond SendRqstPtr, back off */
>         if (XLByteLT(SendRqstPtr, endptr))
>     +   {
>             endptr = SendRqstPtr;
>     +       *caughtup = true;
>     +   }
>
> In the current coding, the effect of not setting *caughtup here is just
> that we uselessly call XLogSend an extra time for each transmission
> (because the main loop won't ever delay immediately after a
> transmission). But without this, we'd never send caughtup = true to the
> slave.

That's intentional. It could take some time for the WAL to be sent, if the network is busy, so by the time XLogSend returns you might well not be caught up anymore.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] caught_up status in walsender
Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> writes:
> That's intentional. It could take some time for the WAL to be sent, if
> the network is busy, so by the time XLogSend returns you might well not
> be caught up anymore.

It may have been intentional, but it's still wrong. If you were able to pull all of WAL into the record-to-be-sent, you should sleep afterwards, not send an extra record containing a few more bytes.

			regards, tom lane
[HACKERS] Allow wal_keep_segments to keep all segments
Bruce Momjian wrote:
> Robert Haas wrote:
>> On Sat, May 8, 2010 at 10:40 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Bruce Momjian <br...@momjian.us> writes:
>>>> Uh, did we decide that 'wal_keep_segments' was the best name for this
>>>> GUC setting? I know we shipped beta1 using that name.
>>>
>>> I thought min_wal_segments was a reasonable proposal, but it wasn't
>>> clear if there was consensus or not.
>>
>> I think most people thought it was another reasonable choice, but I
>> think the consensus position is probably something like "it's about
>> the same" rather than "it's definitely better". We had one or two
>> people with stronger opinions than that on either side, I believe.
>
> Agreed the current name seems OK. However, was there agreement that
> wal_keep_segments = -1 should keep all WAL segments? I can see that as
> useful for cases where you are doing a dump to be transferred to the
> slave, and not using archive_command. This avoids the need for the "set
> a huge value" solution.

The attached patch allows wal_keep_segments = -1 to keep all segments; this is particularly useful for taking a base backup, where you need all the WAL files during startup of the standby. I have documented this usage in the patch as well. I am thinking of applying this after 9.0 beta2 if there is no objection.

-- 
Bruce Momjian  <br...@momjian.us>  http://momjian.us
EnterpriseDB   http://enterprisedb.com
+ None of us is going to be here forever. +

Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.280
diff -c -c -r1.280 config.sgml
*** doc/src/sgml/config.sgml	31 May 2010 15:50:48 -0000	1.280
--- doc/src/sgml/config.sgml	2 Jun 2010 19:19:18 -0000
***************
*** 1887,1893 ****
         Specifies the number of past log file segments kept in the
         <filename>pg_xlog</filename>
         directory, in case a standby server needs to fetch them for streaming
!        replication.  Each segment is normally 16 megabytes.  If a standby
         server connected to the primary falls behind by more than
         <varname>wal_keep_segments</varname> segments, the primary might
         remove a WAL segment still needed by the standby, in which case the
--- 1887,1893 ----
         Specifies the number of past log file segments kept in the
         <filename>pg_xlog</filename>
         directory, in case a standby server needs to fetch them for streaming
!        replication.  Each segment is normally 16 megabytes.  If a standby
         server connected to the primary falls behind by more than
         <varname>wal_keep_segments</varname> segments, the primary might
         remove a WAL segment still needed by the standby, in which case the
***************
*** 1901,1908 ****
         is zero (the default), the system doesn't keep any extra segments
         for standby purposes, and the number of old WAL segments available
         for standbys is determined based only on the location of the previous
!        checkpoint and status of WAL archiving.
!        This parameter can only be set in the <filename>postgresql.conf</filename>
         file or on the server command line.
--- 1901,1909 ----
         is zero (the default), the system doesn't keep any extra segments
         for standby purposes, and the number of old WAL segments available
         for standbys is determined based only on the location of the previous
!        checkpoint and status of WAL archiving.  If <literal>-1</literal> is
!        specified, log file segments are kept indefinitely.  This
!        parameter can only be set in the <filename>postgresql.conf</filename>
         file or on the server command line.
        </para>
       </listitem>

Index: doc/src/sgml/high-availability.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/high-availability.sgml,v
retrieving revision 1.70
diff -c -c -r1.70 high-availability.sgml
*** doc/src/sgml/high-availability.sgml	29 May 2010 09:01:10 -0000	1.70
--- doc/src/sgml/high-availability.sgml	2 Jun 2010 19:19:19 -0000
***************
*** 750,756 ****
     If you use streaming replication without file-based continuous
     archiving, you have to set <varname>wal_keep_segments</varname> in the
     master to a value high enough to ensure that old WAL segments are not recycled
!    too early, while the standby might still need them to catch up. If the
     standby falls behind too much, it needs to be reinitialized from a new
     base backup. If you set up a WAL archive that's accessible from the
     standby, wal_keep_segments is not required as the standby can always
--- 750,760 ----
     If you use streaming replication without file-based continuous
     archiving, you have to set <varname>wal_keep_segments</varname> in the
     master to a value high enough to ensure that old WAL segments are not recycled
!    too early, while the standby
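[Editor's note: a minimal sketch of the base-backup flow this patch is meant to support, assuming the proposed wal_keep_segments = -1 behavior and a 9.0-era setup with wal_level set high enough for backups; the label and copy step are illustrative only.]

    -- In postgresql.conf on the master (the proposed setting), then reload:
    --     wal_keep_segments = -1
    SELECT pg_start_backup('standby_base_backup');
    -- ... copy the data directory to the standby with rsync/tar ...
    SELECT pg_stop_backup();
    -- Every segment written during the copy is still in pg_xlog, so the
    -- standby can catch up without needing a WAL archive.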
Re: [HACKERS] Exposing the Xact commit order to the user
heikki.linnakan...@enterprisedb.com (Heikki Linnakangas) writes: On 24/05/10 19:51, Kevin Grittner wrote: The only thing I'm confused about is what benefit anyone expects to get from looking at data between commits in some way other than our current snapshot mechanism. Can someone explain a use case where what Jan is proposing is better than snapshot isolation? It doesn't provide any additional integrity guarantees that I can see. Right, it doesn't. What it provides is a way to reconstruct a snapshot at any point in time, after the fact. For example, after transactions A, C, D and B have committed in that order, it allows you to reconstruct a snapshot just like you would've gotten immediately after the commit of A, C, D and B respectively. That's useful for replication tools like Slony that need to commit the changes of those transactions in the slave in the same order as they were committed in the master. I don't know enough of Slony et al. to understand why that'd be better than the current heartbeat mechanism they use, taking a snapshot every few seconds, batching commits. I see two advantages:
a) Identifying things on a transaction-by-transaction basis means that the snapshots (SYNCs) don't need to be captured, which is presently an area of fragility. If the slon daemon falls over on Friday evening, and nobody notices until Monday, the first snapshot taken after restart ends up covering all updates between Friday and whenever SYNCs start to be collected again. Exposing commit order eliminates that fragility: SYNCs don't need to be captured anymore, so they can't be missed (which is today's problem).
b) The sequence currently used to control log application ordering is a bottleneck, as it is a single sequence shared across all connections. It could be eliminated in favor of (perhaps) an in-memory variable defined on a per-connection basis. It's not a bottleneck that we hear a lot of complaints about, but it certainly is one.
-- select 'cbbrowne' || '@' || 'cbbrowne.com'; http://cbbrowne.com/info/internet.html "MS apparently now has a team dedicated to tracking problems with Linux and publicizing them. I guess eventually they'll figure out this backfires..." ;) -- William Burrow aa...@delete.fan.nb.ca -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
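[Editor's note: to make the Slony case concrete, here is a purely hypothetical sketch of how a replay daemon might consume an exposed commit order. The function name, its columns, and the :last_applied placeholder are all invented for illustration; nothing like this exists in PostgreSQL at the time of the thread.]

    -- Hypothetical sketch only: pg_get_commit_order() and its columns are
    -- made-up names showing the shape of a consumer query.
    SELECT xid, commit_seq
      FROM pg_get_commit_order()         -- assumed SRF exposing commit order
     WHERE commit_seq > :last_applied    -- resume where the daemon left off
     ORDER BY commit_seq;                -- replay in the master's commit order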
Re: [HACKERS] Allow wal_keep_segments to keep all segments
On Wed, Jun 2, 2010 at 3:20 PM, Bruce Momjian br...@momjian.us wrote: Bruce Momjian wrote: Robert Haas wrote: On Sat, May 8, 2010 at 10:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: Bruce Momjian br...@momjian.us writes: Uh, did we decide that 'wal_keep_segments' was the best name for this GUC setting? I know we shipped beta1 using that name. I thought min_wal_segments was a reasonable proposal, but it wasn't clear if there was consensus or not. I think most people thought it was another reasonable choice, but I think the consensus position is probably something like "it's about the same" rather than "it's definitely better". We had one or two people with stronger opinions than that on either side, I believe. Agreed the current name seems OK. However, was there agreement that wal_keep_segments = -1 should keep all WAL segments? I can see that as useful for cases where you are doing a dump to be transferred to the slave, and not using archive_command. This avoids the need for the "set a huge value" solution. The attached patch allows wal_keep_segments = -1 to keep all segments; this is particularly useful for taking a base backup, where you need all the WAL files during startup of the standby. I have documented this usage in the patch as well. I am thinking of applying this after 9.0 beta2 if there is no objection. +1 for the patch, but why wait until after beta2? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow wal_keep_segments to keep all segments
On Wed, 2010-06-02 at 15:20 -0400, Bruce Momjian wrote: The attached patch allows wal_keep_segments = -1 to keep all segments; this is particularly useful for taking a base backup, where you need all the WAL files during startup of the standby. I have documented this usage in the patch as well. I am thinking of applying this after 9.0 beta2 if there is no objection. It's not clear to me why "keep all files until the server breaks" is a good setting. Surely you would set this parameter to the size of your disk. Why allow it to go higher? -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] caught_up status in walsender
Tom Lane t...@sss.pgh.pa.us writes: I wrote: I'm still inclined to apply the part of Simon's patch that adds a transmit timestamp to each SR send chunk. That would actually be completely unused by the slave given my proposal above, but I think that it is an important step to take to future-proof the SR protocol against possible changes in the slave-side timing logic. On further contemplation, it seems like the protocol needs another field besides that: each record should also carry a boolean indicating whether walsender.c thinks it is currently caught up, ie the record carries all WAL data up to the current end of WAL. If the sender is not caught up, then the receiver should apply max_archive_delay, not max_streaming_delay. In this way, WAL chunks that are a little bit behind current time will be treated the same way whether they come across the SR link or via the archive. Then we'd have max_catchup_delay and max_streaming_delay, wouldn't we? I'm still trying to understand the implications of the proposal, which sounds good but… given just enough load on the slave, wouldn't it be playing catchup all the time? Wouldn't "enough load" mean one read query per write commit on the master? Regards, -- dim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On 02/06/10 20:14, Tom Lane wrote: For realistic values of max_standby_delay ... Hang on right there. What do you consider a realistic value for max_standby_delay? Because I'm not sure I have a grip on that myself. 5 seconds? 5 minutes? 5 hours? I can see use cases for all of those... What I think might be a realistic compromise is this:
1. Separate max_standby_delay into two GUCs, say max_streaming_delay and max_archive_delay.
2. When applying WAL that came across SR, use max_streaming_delay and let the time measurement be current time minus time of receipt of the current WAL send chunk.
3. When applying WAL that came from archive, use max_archive_delay and let the time measurement be current time minus time of acquisition of the current WAL segment from the archive.
The current code's behavior in the latter case could effectively be modeled by setting max_archive_delay to zero, but that isn't the only plausible setting. More likely DBAs would set max_archive_delay to something smaller than max_streaming_delay, but still positive so as to not kill conflicting queries instantly. The problem with defining max_archive_delay that way is again that you can fall behind indefinitely. If you set it to 5 minutes, it means that you'll wait a maximum of 5 minutes *per WAL segment*, even if WAL is being generated faster. I don't understand why you want to use a different delay when you're restoring from archive vs. when you're streaming (what about existing WAL files found in pg_xlog, BTW?). The source of WAL shouldn't make a difference. If it's because you assume that restoring from archive is a sign that you've fallen behind a lot, surely you've exceeded max_standby_delay then and I still don't see a need for a separate GUC. I stand by my suggestion from yesterday: let's define max_standby_delay as the difference between a piece of WAL becoming available in the standby, and applying it. To approximate "piece of WAL becoming available" for SR, we can use the mechanism with send/applyChunks from Simon's latest patch, or go with the simpler scheme of just resetting a "last caughtup" timestamp to current time whenever we have to wait for new WAL to arrive. When restoring from archive, likewise reset the "last caughtup" timestamp whenever restore_command returns non-zero, i.e. we have to wait for the next WAL file to arrive. That works the same for both SR and file-based log shipping, there's only one knob to set, it's simple to implement, and it doesn't require synchronized clocks. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] How to pass around collation information
On fre, 2010-05-28 at 20:59 +0300, Peter Eisentraut wrote: The feature I'm thinking of is what people might call per-column locale, and the SQL standard defines that. It would look like this: CREATE TABLE test ( a varchar COLLATE de, b varchar COLLATE fr ); SELECT * FROM test WHERE a > 'baz' ORDER BY b; Perhaps it's also worth pointing out there could be use cases other than supporting multiple natural languages. For example, it is frequently requested to be able to sort in ways that don't ignore special characters, to sort binary, or perhaps to use a special file name sort that treats '/' specially in some way. So it could be quite useful to be able to say CREATE TABLE something ( description text COLLATE en, code char(6) COLLATE binary, file text COLLATE filename_sort ); or even something like CREATE DOMAIN filename AS text COLLATE filename_sort; These are examples where having the collation attached to the column would appear to make more sense than having it attached only to operations. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] caught_up status in walsender
Dimitri Fontaine dfonta...@hi-media.com writes: I'm still trying to understand the implications of the proposal, which sounds good but… given just enough load on the slave, wouldn't it be playing catchup all the time? Uh, if the slave is overloaded, *any* implementation will be playing catchup all the time. Not sure about your point here. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: The problem with defining max_archive_delay that way is again that you can fall behind indefinitely. See my response to Greg Stark --- I don't think this is really an issue. It's certainly far less of an issue than the current situation, in which query grace periods go to zero under not-at-all-extreme conditions. I don't understand why you want to use a different delay when you're restoring from archive vs. when you're streaming (what about existing WAL files found in pg_xlog, BTW?). You're missing the point. I want the DBA to be able to control what happens in those two cases. In the current implementation he has no control over what happens while restoring from archive: it's going to effectively act like max_archive_delay = 0 all the time. If you're of the opinion that's good, you can set the parameter that way and be happy. I'm of the opinion you'll soon find out it isn't so good, because it'll kill standby queries too easily. I stand by my suggestion from yesterday: Let's define max_standby_delay as the difference between a piece of WAL becoming available in the standby, and applying it. My proposal is essentially the same as yours, plus allowing the DBA to choose different max delays for the caught-up and not-caught-up cases. Maybe everybody will end up setting the two delays the same, but I think we don't have enough experience to decide that for them now. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
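[Editor's note: for illustration, a postgresql.conf fragment under Tom's two-knob proposal might look like the sketch below. Neither GUC exists at this point in the thread; the names come from the proposal and the values are invented.]

    # Hypothetical sketch only; these GUCs are proposed, not implemented:
    max_streaming_delay = 30s   # grace period while applying WAL received over SR
    max_archive_delay = 5s      # grace period while replaying WAL from the archive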
Re: [HACKERS] caught_up status in walsender
Tom Lane t...@sss.pgh.pa.us writes: Uh, if the slave is overloaded, *any* implementation will be playing catchup all the time. Not sure about your point here. Well, sorry, I missed the part where the catchup is measured on the walsender side of things. Now that means that an uninterrupted flow of queries quicker than the walsender delay (100ms, right?) on the slave would certainly leave enough room for it to stay current. Well, that depends also on the time it takes to replay the WALs. I'm trying to decide if exposing this 100ms magic setup (or something else) would allow for a lot more control with respect to what "being overloaded" means. Say you set the 100ms delay to any value that you know (from testing and log_min_duration_statement, say) is just a little higher than the slowest query you typically run on the slave. If that reduces the chances of ever playing catch-up (as soon as there's no unexpectedly slow query there), that would be great. If you can manage to make sense of this, I'm interested in hearing how far it is from reality. Regards, -- dim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be
On Mon, May 17, 2010 at 4:33 AM, Fujii Masao masao.fu...@gmail.com wrote: On Sat, May 15, 2010 at 3:20 AM, Robert Haas robertmh...@gmail.com wrote: Hmm, OK, I think that makes sense. Would you care to propose a patch? Yep. Here is the patch. This patch distinguishes normal shutdown from unexpected exit, while the server is in recovery. That is, when smart or fast shutdown is requested during recovery, the bgwriter sets the ControlFile->state to the newly-introduced DB_SHUTDOWNED_IN_RECOVERY state. When recovery starts from the DB_SHUTDOWNED_IN_RECOVERY state, the startup process emits
LOG: database system was shut down in recovery at 2010-05-12 20:35:24 EDT
instead of
LOG: database system was interrupted while in recovery at log time 2010-05-12 20:35:24 EDT
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
Heikki and I discussed this over IM today and came away with two questions. First, is it appropriate to set the control file state to DB_SHUTDOWNED_IN_RECOVERY even when we're in crash recovery (as opposed to archive recovery/SR)? My vote is no, but Heikki thought it might be OK. Second, one of the places where this patch updates the control file immediately follows a call to UpdateMinRecoveryPoint(). That can lead to fsync-ing the control file twice in a row. Should we worry about this or just let it go? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] caught_up status in walsender
On Wed, Jun 2, 2010 at 2:44 PM, Tom Lane t...@sss.pgh.pa.us wrote: I wrote: I'm still inclined to apply the part of Simon's patch that adds a transmit timestamp to each SR send chunk. That would actually be completely unused by the slave given my proposal above, but I think that it is an important step to take to future-proof the SR protocol against possible changes in the slave-side timing logic. On further contemplation, it seems like the protocol needs another field besides that: each record should also carry a boolean indicating whether walsender.c thinks it is currently caught up, ie the record carries all WAL data up to the current end of WAL. If the sender is not caught up, then the receiver should apply max_archive_delay, not max_streaming_delay. In this way, WAL chunks that are a little bit behind current time will be treated the same way whether they come across the SR link or via the archive. I'm not sure that makes sense. I thought the point of separating the archive and streaming cases was that the same timeout wouldn't necessarily be correct for a 16MB WAL file as it would for a 16-byte WAL record you've just received. IOW, you might want max_archive_delay > max_streaming_delay. The original proposal also seems somewhat easier to understand, to me at least. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] How to pass around collation information
On Wed, Jun 2, 2010 at 3:46 PM, Peter Eisentraut pete...@gmx.net wrote: On fre, 2010-05-28 at 20:59 +0300, Peter Eisentraut wrote: The feature I'm thinking of is what people might call per-column locale, and the SQL standard defines that. It would look like this: CREATE TABLE test ( a varchar COLLATE de, b varchar COLLATE fr ); SELECT * FROM test WHERE a > 'baz' ORDER BY b; Perhaps it's also worth pointing out there could be use cases other than supporting multiple natural languages. For example, it is frequently requested to be able to sort in ways that don't ignore special characters, to sort binary, or perhaps to use a special file name sort that treats '/' specially in some way. So it could be quite useful to be able to say CREATE TABLE something ( description text COLLATE en, code char(6) COLLATE binary, file text COLLATE filename_sort ); or even something like CREATE DOMAIN filename AS text COLLATE filename_sort; These are examples where having the collation attached to the column would appear to make more sense than having it attached only to operations. But in the end the only purpose of setting it on a column is to set which one will be used for operations on that column. And the user might still override it for a particular query. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
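[Editor's note: the per-query override Robert mentions is the SQL standard's collate clause. A sketch of what that would look like, reusing the test table from the thread; this is proposed syntax, not something PostgreSQL accepts at the time of this discussion, and the collation names are illustrative:]

    SELECT * FROM test ORDER BY b COLLATE de;        -- override b's declared fr collation
    SELECT * FROM test WHERE a > 'baz' COLLATE fr;   -- override a's declared de collation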
Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be
On 02/06/10 23:50, Robert Haas wrote: First, is it appropriate to set the control file state to DB_SHUTDOWNED_IN_RECOVERY even when we're in crash recovery (as opposed to archive recovery/SR)? My vote is no, but Heikki thought it might be OK. My logic on that is: if the database is known to be in good shape, i.e. not corrupt, after a shutdown during crash recovery, then we should not print the warning at restart saying "This probably means that some data is corrupted". There's no reason to believe the database is corrupt if it's a controlled shutdown, so setting the control file state to DB_SHUTDOWNED_IN_RECOVERY is OK. But if it's not OK for some reason, then we really shouldn't allow the shutdown in the first place until we hit the end of WAL. So the option "allow shutdown, but warn at restart that your data is probably corrupt" does not make sense in any case. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be
On Wed, Jun 2, 2010 at 5:39 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 02/06/10 23:50, Robert Haas wrote: First, is it appropriate to set the control file state to DB_SHUTDOWNED_IN_RECOVERY even when we're in crash recovery (as opposed to archive recovery/SR)? My vote is no, but Heikki thought it might be OK. My logic on that is: if the database is known to be in good shape, i.e. not corrupt, after a shutdown during crash recovery, then we should not print the warning at restart saying "This probably means that some data is corrupted". There's no reason to believe the database is corrupt if it's a controlled shutdown, so setting the control file state to DB_SHUTDOWNED_IN_RECOVERY is OK. But if it's not OK for some reason, then we really shouldn't allow the shutdown in the first place until we hit the end of WAL. So the option "allow shutdown, but warn at restart that your data is probably corrupt" does not make sense in any case. Well, the point is, we emit that message every time we go to recover from a crash. Presumably the message is as valid after a restart of crash recovery as it was the first time around. <thinks> But maybe the message isn't right the first time either. After all, the point of having a write-ahead log in the first place is that we should be able to prevent corruption in the event of an unexpected shutdown. Maybe the right thing to do is to forget about adding a new state and just remove or change the errhint from these messages:

    ereport(LOG,
            (errmsg("database system was interrupted while in recovery at %s",
                    str_time(ControlFile->time)),
             errhint("This probably means that some data is corrupted and"
                     " you will have to use the last backup for recovery.")));

    ereport(LOG,
            (errmsg("database system was interrupted while in recovery at log time %s",
                    str_time(ControlFile->checkPointCopy.time)),
             errhint("If this has occurred more than once some data might be corrupted"
                     " and you might need to choose an earlier recovery target.")));

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
On Wed, Jun 2, 2010 at 8:14 PM, Tom Lane t...@sss.pgh.pa.us wrote: Indeed, but nothing we do can prevent that, if the slave is just plain slower than the master. You have to assume that the slave is capable of keeping up in the absence of query-caused delays, or you're hosed. I was assuming the walreceiver only requests more WAL in relatively small chunks and only when replay has caught up and needs more data. I haven't actually read this code so if that assumption is wrong then I'm off-base. But if my assumption is right then it's not merely the master running faster than the slave that can cause you to fall arbitrarily far behind. The receive time is delayed by how long the slave waited on a previous lock, so if we wait for a few minutes, then proceed and find we need more WAL data, we'll read data from the master that it could have generated those few minutes ago. The sticky point is that once in a blue moon you do have a conflicting query sitting on a buffer lock for a long time, or even more likely a series of queries keeping the WAL replay process from obtaining a buffer cleanup lock. So I think this isn't necessarily such a blue moon event. As I understand it, all it would take is a single long-running report and a vacuum or HOT cleanup occurring on the master. If I want to set max_standby_delay to 60min to allow reports to run for up to an hour, then any random HOT cleanup on the master will propagate to the slave and cause a WAL stall until the transactions which have that xid in their snapshot end. Even the buffer locks could potentially be blocked for a long time if you happen to run the right kind of query (say a nested loop with the buffer in question on the outside and a large table on the inside). That's a rarer event though; is that what you were thinking of? -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
Greg Stark gsst...@mit.edu writes: I was assuming the walreceiver only requests more wal in relatively small chunks and only when replay has caught up and needs more data. I haven't actually read this code so if that assumption is wrong then I'm off-base. It is off-base. The receiver does not request data, the sender is what determines how much WAL is sent when. So I think this isn't necessarily such a blue moon event. As I understand it, all it would take is a single long-running report and a vacuum or HOT cleanup occurring on the master. I think this is mostly FUD too. How often do you see vacuum blocked for an hour now? It probably can happen, which is why we need to be able to kick queries off the locks with max_standby_delay, but it's far from common. What we're concerned about here is how long a buffer lock on a single page is held, not how long heavyweight locks are held. The normal hold times are measured in microseconds. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages
Alvaro Herrera wrote: Excerpts from Robert Haas's message of mié jun 02 14:16:33 -0400 2010: We could, but I think we'd be better off just freezing at the time we mark the page PD_ALL_VISIBLE and then using the visibility map for both purposes. Keeping around the old xmin values after every tuple on the page is visible to every running transaction is useful only for forensics, and building a whole new freeze map just to retain that information longer (and eventually force a massive anti-wraparound vacuum) seems like overkill. Reducing the xid wraparound horizon a bit is reasonable, but moving it all the way forward to OldestXmin is a bit much, methinks. Besides, there's another argument for not freezing tuples immediately: they may be updated shortly thereafter, causing extra churn for no gain. I'd prefer a setting that would tell the system to freeze all tuples that fall within a safety range whenever any tuple in the page is frozen -- weren't you working on a patch to do this? (was it Jeff Davis?) (BTW maybe instead of separate visibility and freeze maps we could have two bits in the visibility map?) Yeah, the two-bits idea was suggested during the conversation core had about the issue. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
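[Editor's note: for reference, the closest existing knob to the eager-freezing behavior being debated is vacuum_freeze_min_age, which already lets an administrator trade away the forensic xmin information in exchange for fewer anti-wraparound rewrites. A session-level sketch; the table name is illustrative:]

    -- These commands and GUCs exist today; only the table name is made up.
    SET vacuum_freeze_min_age = 0;   -- freeze every tuple this vacuum can see
    VACUUM accounts;                 -- equivalent in effect to VACUUM FREEZE accounts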
Re: [HACKERS] Exposing the Xact commit order to the user
On Wed, Jun 2, 2010 at 6:45 PM, Chris Browne cbbro...@acm.org wrote: It would make it easy to conclude: "This next transaction did 8328194 updates. Maybe we should do some kind of checkpoint (e.g., commit the transaction or such) before working on it." versus "This transaction we're thinking of working on had 7 updates. No big deal..." I'm puzzled how you would define this value. How do you add 7 inserts, 7 deletes, and 7 updates? Is that 21 rows modified? Why are the 7 inserts and 7 deletes worth twice as much as the 7 updates when they're basically the same thing? What if the inserts fired triggers which inserted 7 more rows, is that 14? What if the 7 updates modified 2 TB of TOAST data but the 8328194 updates were all to the same record and they were all HOT updates so all it did was change 8kB? In any case you'll have all the actual data from your triggers or hooks or whatever, so what value does having the system keep track of this add? -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow wal_keep_segments to keep all segments
Simon Riggs wrote: On Wed, 2010-06-02 at 15:20 -0400, Bruce Momjian wrote: The attached patch allows wal_keep_segments = -1 to keep all segments; this is particularly useful for taking a base backup, where you need all the WAL files during startup of the standby. I have documented this usage in the patch as well. I am thinking of applying this after 9.0 beta2 if there is no objection. It's not clear to me why "keep all files until the server breaks" is a good setting. Surely you would set this parameter to the size of your disk. Why allow it to go higher? Well, the -1 allows them to set it temporarily without having to compute their free disk space. Frankly, because the disk usage varies, it is impossible to know exactly how much free space the disk will have at the time it would fill up. I think the normal computation would be: 1) how long are my file system backup and the restore to the standby going to take, and 2) how often do I generate a 16MB WAL file? You would do some computation to figure that out, then maybe multiply it by 10x and set that for wal_keep_segments. I figured allowing a simple -1 would be easier. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
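[Editor's note: to make that computation concrete with illustrative numbers only: if the backup plus restore takes about two hours and the master emits one 16 MB segment every 30 seconds, that is 7200 / 30 = 240 segments; multiplying by the 10x safety factor gives wal_keep_segments = 2400, which budgets roughly 2400 * 16 MB, i.e. about 37.5 GB of pg_xlog.]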
Re: [HACKERS] Allow wal_keep_segments to keep all segments
Robert Haas wrote: The attached patch allows wal_keep_segments = -1 to keep all segments; this is particularly useful for taking a base backup, where you need all the WAL files during startup of the standby. I have documented this usage in the patch as well. I am thinking of applying this after 9.0 beta2 if there is no objection. +1 for the patch, but why wait until after beta2? I wanted to give people enough time to review/discuss it. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Keepalive for max_standby_delay
* Tom Lane (t...@sss.pgh.pa.us) wrote: Greg Stark gsst...@mit.edu writes: So I think this isn't necessarily such a blue moon event. As I understand it, all it would take is a single long-running report and a vacuum or HOT cleanup occurring on the master. I think this is mostly FUD too. How often do you see vacuum blocked for an hour now? It probably can happen, which is why we need to be able to kick queries off the locks with max_standby_delay, but it's far from common. What we're concerned about here is how long a buffer lock on a single page is held, not how long heavyweight locks are held. The normal hold times are measured in microseconds. Maybe I'm missing something, but I think Greg's point was that if you have a long-running query running against the standby/slave/whatever, which is holding locks on various relations to implement that report, and then a vacuum or HOT update happens on the master, the long-running report query will get killed off unless you bump max_streaming_delay up pretty high (e.g. 60 minutes). That being said, I'm not sure that there's really another solution. Yes, in this case, the slave can end up being an hour behind, but that's the trade-off you have to make if you want to run an hour-long query on the slave. The other answer is to make the master not update those tuples, etc., which might be possible by starting a transaction on the master which grabs enough things to prevent the vacuum/HOT/etc. update from happening. That may be possible manually, but it's not fun and it certainly isn't something we'll have built-in support for in 9.0. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] ALTER TABLE .... make constraint DEFERRABLE
Simon Riggs wrote: Deferrable unique constraints seem an interesting feature, though I have either some questions or some issues, not sure which. I don't seem to be able to find any way to do an ALTER TABLE that adds this new capability to an existing table. I was able to do it:

    test=# create table test (x int unique DEFERRABLE INITIALLY DEFERRED);
    NOTICE:  CREATE TABLE / UNIQUE will create implicit index "test_x_key" for table "test"
    CREATE TABLE
    test=# alter table test add column y int;
    ALTER TABLE
    test=# alter table test add unique (y) DEFERRABLE INITIALLY DEFERRED;
    NOTICE:  ALTER TABLE / ADD UNIQUE will create implicit index "test_y_key" for table "test"
    ALTER TABLE

Is that what you were asking? There is no way to add a constraint via a CREATE TABLE AS SELECT, so that means there is no way to use the feature at all in that case. Uh, CREATE TABLE AS SELECT seems to be very limited, but I have not heard any complaints about it before. Also, foreign keys can't be defined that refer to a deferrable primary key. That isn't mentioned at all in the manual with regard to the DEFERRABLE clause, though it is mentioned in the FK section. You get this error message: ERROR: cannot use a deferrable unique constraint for referenced table The use case for this feature looks a little narrow at present. Can we do something about usability? Not sure why that was a limitation. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
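[Editor's note: a minimal sketch of why the deferred check matters, reusing the test table and the implicit test_y_key constraint from the session above. With an ordinary unique constraint the UPDATE below can fail on a transient duplicate, depending on row order; with the deferrable one the check waits for commit:]

    BEGIN;
    SET CONSTRAINTS test_y_key DEFERRED;  -- redundant here (INITIALLY DEFERRED), shown for clarity
    UPDATE test SET y = y + 1;            -- transient duplicates mid-statement are tolerated
    COMMIT;                               -- uniqueness is checked only at commit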
Re: [HACKERS] [BUGS] BUG #5487: dblink failed with 63 bytes connection names
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Well, looking at the callers, most if not all of them seem to actually pass a palloc'd copy, allocated by text_to_cstring(). Should we throw a NOTICE like vanilla truncate_identifier() does? I feel it would be better to just call truncate_identifier() than roll your own. Assuming we want the same semantics in dblink, we'll otherwise have to remember to update truncate_identifier_copy() with any changes to truncate_identifier(). That makes sense. Now I use truncate_identifier(warn=true) for the fix. Regards, --- Takahiro Itagaki NTT Open Source Software Center dblink_63bytes-2010-603.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be
On Jun 3, 2010, at 0:58 , Robert Haas wrote: But maybe the message isn't right the first time either. After all, the point of having a write-ahead log in the first place is that we should be able to prevent corruption in the event of an unexpected shutdown. Maybe the right thing to do is to forget about adding a new state and just remove or change the errhint from these messages: You've fallen prey to a (very common) misinterpretation of this message. It is not about corruption *caused* by a crash during recovery, it's about corruption *causing* the crash. I'm not in favor of getting rid of that message entirely, since it produces a worthwhile hint if the crash was really caused by corrupt data. But it desperately needs a better wording that makes cause and effect perfectly clear. That even you misread it conclusively proves that. How about "If this has happened repeatedly and without manual intervention, it was probably caused by corrupted data and you may need to restore from backup" for the crash recovery case and "If this has happened repeatedly and without manual intervention, it was probably caused by corrupted data and you may need to choose an earlier recovery target" for the PITR case. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be
On Wed, Jun 2, 2010 at 9:07 PM, Florian Pflug f...@phlo.org wrote: On Jun 3, 2010, at 0:58 , Robert Haas wrote: But maybe the message isn't right the first time either. After all, the point of having a write-ahead log in the first place is that we should be able to prevent corruption in the event of an unexpected shutdown. Maybe the right thing to do is to forget about adding a new state and just remove or change the errhint from these messages: You've fallen prey to a (very common) misinterpretation of this message. It is not about corruption *caused* by a crash during recovery, it's about corruption *causing* the crash. I'm not in favor of getting rid of that message entirely, since it produces a worthwhile hint if the crash was really caused by corrupt data. But it desperately needs a better wording that makes cause and effect perfectly clear. That even you misread it conclusively proves that. How about "If this has happened repeatedly and without manual intervention, it was probably caused by corrupted data and you may need to restore from backup" for the crash recovery case and "If this has happened repeatedly and without manual intervention, it was probably caused by corrupted data and you may need to choose an earlier recovery target" for the PITR case. Oh. Well, if that's the case, then I guess I lean toward applying the patch as-is. Then there's no need for the caveat "and without manual intervention". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] current value support
Hi all, Please see attached a patch that finishes the support for currval. All the structure was already in place in GTM, but there was still one missing call in sequence.c when calling the function. Now it is possible to get current values for sequences in the whole cluster. Regards, -- Michael Paquier NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center

diff --git a/src/backend/access/transam/gtm.c b/src/backend/access/transam/gtm.c
index c3cb72b..3be8d20 100644
--- a/src/backend/access/transam/gtm.c
+++ b/src/backend/access/transam/gtm.c
@@ -172,7 +172,7 @@ GetSnapshotGTM(GlobalTransactionId gxid, bool canbe_grouped)
 }
 
-/**
+/*
  * Create a sequence on the GTM.
  *
  *
@@ -189,7 +189,30 @@ CreateSequenceGTM(char *seqname, GTM_Sequence increment, GTM_Sequence minval,
 	return conn ? open_sequence(conn, seqkey, increment, minval, maxval, startval, cycle) : 0;
 }
 
-/**
+/*
+ * get the current sequence value
+ */
+
+GTM_Sequence
+GetCurrentValGTM(char *seqname)
+{
+	GTM_Sequence ret = -1;
+	GTM_SequenceKeyData seqkey;
+	CheckConnection();
+	seqkey.gsk_keylen = strlen(seqname);
+	seqkey.gsk_key = seqname;
+
+	if (conn)
+		ret = get_current(conn, seqkey);
+	if (ret < 0)
+	{
+		CloseGTM();
+		InitGTM();
+	}
+	return ret;
+}
+
+/*
  * Get the next sequence value
  */
 GTM_Sequence
@@ -211,7 +234,7 @@ GetNextValGTM(char *seqname)
 	return ret;
 }
 
-/**
+/*
  * Drop the sequence
  */
 int
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index ba9a932..5df50a3 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -533,8 +533,8 @@ nextval_internal(Oid relid)
 		result = (int64) GetNextValGTM(RelationGetRelationName(seqrel));
 		if (result < 0)
 			ereport(ERROR,
-				(errcode(ERRCODE_CONNECTION_FAILURE),
-				 errmsg("GTM error, could not obtain sequence value")));
+					(errcode(ERRCODE_CONNECTION_FAILURE),
+					 errmsg("GTM error, could not obtain sequence value")));
 	}
 	else
 	{
@@ -714,6 +714,19 @@ currval_oid(PG_FUNCTION_ARGS)
 	/* open and AccessShareLock sequence */
 	init_sequence(relid, &elm, &seqrel);
 
+#ifdef PGXC
+	if (IS_PGXC_COORDINATOR)
+	{
+		result = (int64) GetCurrentValGTM(RelationGetRelationName(seqrel));
+		if (result < 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_CONNECTION_FAILURE),
+					 errmsg("GTM error, could not obtain sequence value")));
+	}
+	else
+	{
+#endif
+
 	if (pg_class_aclcheck(elm->relid, GetUserId(), ACL_SELECT) != ACLCHECK_OK &&
 		pg_class_aclcheck(elm->relid, GetUserId(), ACL_USAGE) != ACLCHECK_OK)
 		ereport(ERROR,
@@ -729,6 +742,10 @@ currval_oid(PG_FUNCTION_ARGS)
 
 	result = elm->last;
 
+#ifdef PGXC
+	}
+#endif
+
 	relation_close(seqrel, NoLock);
 
 	PG_RETURN_INT64(result);
diff --git a/src/include/access/gtm.h b/src/include/access/gtm.h
index 66ca3f1..88de4d8 100644
--- a/src/include/access/gtm.h
+++ b/src/include/access/gtm.h
@@ -25,6 +25,9 @@ extern GlobalTransactionId BeginTranAutovacuumGTM(void);
 extern int CommitTranGTM(GlobalTransactionId gxid);
 extern int RollbackTranGTM(GlobalTransactionId gxid);
 extern GTM_Snapshot GetSnapshotGTM(GlobalTransactionId gxid, bool canbe_grouped);
+
+/* Sequence interface APIs with GTM */
+extern GTM_Sequence GetCurrentValGTM(char *seqname);
 extern GTM_Sequence GetNextValGTM(char *seqname);
 extern int CreateSequenceGTM(char *seqname, GTM_Sequence increment,
 			GTM_Sequence minval, GTM_Sequence maxval, GTM_Sequence startval,

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
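[Editor's note: for what it's worth, a minimal usage sketch of what the patch enables on a Postgres-XC coordinator; the sequence name is made up:]

    CREATE SEQUENCE seq1;
    SELECT nextval('seq1');  -- obtained from the GTM, as before
    SELECT currval('seq1');  -- with the patch, now also answered by the GTM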
Re: [HACKERS] Comments on Exclusion Constraints and related datatypes
David Fetter wrote: On Mon, Mar 22, 2010 at 04:04:16PM -0300, Alvaro Herrera wrote: David Fetter wrote:

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 9881ff4..9313112 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -7134,7 +7134,7 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
       </row>
       <row>
        <entry> <literal>&amp;&amp;</literal> </entry>
-       <entry>Overlaps?</entry>
+       <entry>Overlaps? One point in common makes this true.</entry>
        <entry><literal>box '((0,0),(1,1))' &amp;&amp; box '((0,0),(2,2))'</literal></entry>
       </row>
       <row>

Hmm, how does this look in horizontal space? (The <row> makes me think it's a table.) Looks OK to me. The entry above, "Closest point to first operand on second operand", is actually wider. Patch applied. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages
On Wed, Jun 2, 2010 at 3:10 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Robert Haas's message of mié jun 02 14:16:33 -0400 2010: We could, but I think we'd be better off just freezing at the time we mark the page PD_ALL_VISIBLE and then using the visibility map for both purposes. Keeping around the old xmin values after every tuple on the page is visible to every running transaction is useful only for forensics, and building a whole new freeze map just to retain that information longer (and eventually force a massive anti-wraparound vacuum) seems like overkill. Reducing the xid wraparound horizon a bit is reasonable, but moving it all the way forward to OldestXmin is a bit much, methinks. Why? If it's just for forensics, those are some pretty expensive forensics - it eventually costs you an additional complete rewrite of every page. Besides, there's another argument for not freezing tuples immediately: they may be updated shortly thereafter, causing extra churn for no gain. But if you were going to update PD_ALL_VISIBLE, then you were going to write the page anyway. You might as well freeze everything at the same time so you don't have to come back. Alternatively, you could do what I suggested upthread and just believe PD_ALL_VISIBLE over the individual tuple xmins. Then you don't have to freeze the page until it's next written, but you still get to keep your forensic info. I'd prefer a setting that would tell the system to freeze all tuples that fall within a safety range whenever any tuple in the page is frozen -- weren't you working on a patch to do this? (was it Jeff Davis?) Not me. I don't think that's going to help a whole lot, though. In many of the painful scenarios, every tuple on the page will have the same XID, and therefore they'll all be frozen at the same time anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Comments on Exclusion Constraints and related datatypes
Greg Stark wrote: On Mon, Mar 22, 2010 at 1:15 PM, Simon Riggs si...@2ndquadrant.com wrote: * Circles, Boxes and other geometric datatypes defined overlaps to include touching shapes. So * inet datatypes don't have a commutative operator on which a unique index can be built. There is no overlaps equivalent, which again is a shame because that stops them being used with the new feature. I think our unusual data types are one of the strong points of Postgres but they're missing a lot of operators and opclasses to make them really useful. There's no reason we couldn't have separate overlaps and overlaps-internally operators just like we have <=, >= and <, >. And it would be nice to flesh out the network data type more fully, perhaps merging in as much of ip4r as makes sense. Added to TODO: Add "overlaps" geometric operators that ignore point overlaps * http://archives.postgresql.org/pgsql-hackers/2010-03/msg00861.php -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
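[Editor's note: a quick illustration of the point about touching shapes, using operators that exist today:]

    SELECT box '((0,0),(1,1))' && box '((1,1),(2,2))';  -- true: a single shared corner counts as overlap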
Re: [HACKERS] Comments on Exclusion Constraints and related datatypes
Robert Haas wrote: * inet datatypes don't have a commutative operator on which a unique index can be built. There is no overlaps equivalent, which again is a shame because that stops them being used with the new feature. This would be a nice thing to fix, and I was thinking about doing it, but I just ran out of time. I think it can be left for 9.1. I have not infrequently wanted to build an IP allocation database, and this would be perfect for that. Added to TODO: Add INET overlaps operator, for use by exclusion constraints * http://archives.postgresql.org/pgsql-hackers/2010-03/msg00845.php -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
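[Editor's note: a sketch of what the TODO would enable for Robert's IP allocation database. Everything below is hypothetical: inet/cidr has no && (overlaps) operator or GiST opclass at this point, which is exactly what the TODO item asks for:]

    -- Hypothetical: assumes a future cidr && (overlaps) operator with GiST support.
    CREATE TABLE ip_alloc (
        net cidr,
        EXCLUDE USING gist (net WITH &&)   -- no two allocations may overlap
    );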
Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be
On Jun 3, 2010, at 3:31 , Robert Haas wrote: On Wed, Jun 2, 2010 at 9:07 PM, Florian Pflug f...@phlo.org wrote: On Jun 3, 2010, at 0:58 , Robert Haas wrote: But maybe the message isn't right the first time either. After all, the point of having a write-ahead log in the first place is that we should be able to prevent corruption in the event of an unexpected shutdown. Maybe the right thing to do is to forget about adding a new state and just remove or change the errhint from these messages: You've fallen prey to a (very common) misinterpretation of this message. It is not about corruption *caused* by a crash during recovery, it's about corruption *causing* the crash. I'm not in favor of getting rid of that message entirely, since it produces a worthwhile hint if the crash was really caused by corrupt data. But it desperately needs a better wording that makes cause and effect perfectly clear. That even you misread it conclusively proves that. How about "If this has happened repeatedly and without manual intervention, it was probably caused by corrupted data and you may need to restore from backup" for the crash recovery case and "If this has happened repeatedly and without manual intervention, it was probably caused by corrupted data and you may need to choose an earlier recovery target" for the PITR case. Oh. Well, if that's the case, then I guess I lean toward applying the patch as-is. Then there's no need for the caveat "and without manual intervention". That still leaves the messages awfully ambiguous concerning the cause (data corruption) and the effect (crash during recovery). How about "If this has occurred more than once, it is probably caused by corrupt data and you have to use the latest backup for recovery" for the crash recovery case and "If this has occurred more than once, it is probably caused by corrupt data and you have to choose an earlier recovery target" for the PITR case. I don't see why currently only the PITR case includes the "more than once" clause. It's probably supposed to prevent unnecessarily alarming the user if the crash was in fact a stray SIGKILL or an out-of-memory condition, which seems equally likely in both cases. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] A tackle to the leaky VIEWs for RLS
(2010/06/02 12:02), KaiGai Kohei wrote: Here's another thought. If we're leaning toward explicit syntax to designate security views (and I do mean IF, since only one person has signed on to that, even if it is Tom Lane!), then maybe we should think about ripping out the logic that causes regular views to be evaluated using the credentials of the view owner rather than the person selecting from it. A security view would still use that logic, plus whatever additional stuff we come up with to prevent leakage. Perhaps this would be viewed as a nasty backward compatibility break, but the upside is that we'd then be being absolutely clear that a non-security view isn't and can never be trusted to be a security barrier. Right now we're shipping something that purports to act as a barrier but really doesn't. Sorry, but should we clarify the purpose of the explicit syntax for security views being discussed now? In my understanding, for a security view, the planner checks whether the person selecting from it has privileges to reference the underlying tables, without any ereport; if the check fails, the planner prevents user-given quals (perhaps FuncExpr only?) from being pushed down inside the join scan. For regular views, the planner works as it does now. Right? I'd like to examine the problem in detail. We can sort views into three types, as follows:
(1) a view which cannot be pulled up;
(2) a view which can be pulled up, but does not contain any joins;
(3) a view which can be pulled up, and contains joins.
For type (1), we don't need to handle anything special, because no external quals can be pushed into it. For type (2), we also don't need to handle anything special except for the evaluation-order issue in ExecQual(), because it is impossible to distribute external quals to only part of the relations. So type (3) is all we have to consider here. Right? Then, where should we distinguish a security view from the others? At least, it should be decided at pull_up_subqueries() or earlier, because that is what merges subqueries into the upper query. In subquery_planner(), pull_up_subqueries() is called just after inline_set_returning_functions(), which flattens RTE_FUNCTION entries where possible. It does not seem like a strange place to check privileges on the underlying relations in pull_up_subqueries(), because inline_set_returning_functions() also checks ACL_EXECUTE there on the functions to be flattened. Then, once we can identify what is a security view and what is not, it must be marked somewhere. Maybe on the FromExpr of the pulled-up subquery? In my current idea, when a qual node that contains a FuncExpr tries to reference only part of the relations within a security view, its referenced relations will be expanded to the whole of the security view in distribute_qual_to_rels(). That will prevent a user-defined function from being pushed down inside a security view. At least, it seems reasonable to implement. Shall I start developing a proof-of-concept patch? Thanks, -- KaiGai Kohei kai...@ak.jp.nec.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
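[Editor's note: for anyone following along, a minimal sketch of the leak under discussion; the table, view, and function names are made up. A cheap user-defined function can be evaluated below the view's security qual and observe rows the view is supposed to hide:]

    CREATE TABLE customer (name text, credit_card text, mine bool);
    CREATE VIEW my_customer AS SELECT * FROM customer WHERE mine;
    CREATE FUNCTION leak(text) RETURNS bool AS $$
    BEGIN
        RAISE NOTICE 'leaked: %', $1;  -- side channel: prints every row it sees
        RETURN true;
    END;
    $$ LANGUAGE plpgsql COST 0.0000001;
    -- The planner may evaluate leak() before the view's qual, exposing
    -- credit_card values from rows where mine is false:
    SELECT * FROM my_customer WHERE leak(credit_card);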
Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be
On Wed, Jun 2, 2010 at 10:34 PM, Florian Pflug f...@phlo.org wrote: Oh. Well, if that's the case, then I guess I lean toward applying the patch as-is. Then there's no need for the caveat "and without manual intervention". That still leaves the messages awfully ambiguous concerning the cause (data corruption) and the effect (crash during recovery). How about "If this has occurred more than once, it is probably caused by corrupt data and you have to use the latest backup for recovery" for the crash recovery case and "If this has occurred more than once, it is probably caused by corrupt data and you have to choose an earlier recovery target" for the PITR case. I don't see why currently only the PITR case includes the "more than once" clause. It's probably supposed to prevent unnecessarily alarming the user if the crash was in fact a stray SIGKILL or an out-of-memory condition, which seems equally likely in both cases. I've applied the patch for now - we can fix the wording of the other messages with a follow-on patch if we agree on what they should say. I don't like the use of the phrase "you have to", particularly... I would tend to leave the archive recovery message alone and change the crash recovery message to be more like it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] current value support
Sorry about my previous email; I sent a patch to the wrong mailing list. Once again, my apologies about that. Regards, Michael P.
Re: [HACKERS] Allow wal_keep_segments to keep all segments
On Wed, 2010-06-02 at 20:28 -0400, Bruce Momjian wrote: Simon Riggs wrote: On Wed, 2010-06-02 at 15:20 -0400, Bruce Momjian wrote: The attached patch allows wal_keep_segments = -1 to keep all segments; this is particularly useful for taking a base backup, where you need all the WAL files during startup of the standby. I have documented this usage in the patch as well. I am thinking of applying this after 9.0 beta2 if there is no objection. It's not clear to me why "keep all files until the server breaks" is a good setting. Surely you would set this parameter to the size of your disk. Why allow it to go higher? Well, the -1 allows them to set it temporarily without having to compute their free disk space. Frankly, because the disk usage varies, it is impossible to know exactly how much free space the disk will have at the time it would fill up. I think the normal computation would be: 1) how long are my file system backup and the restore to the standby going to take, and 2) how often do I generate a 16MB WAL file? You would do some computation to figure that out, then maybe multiply it by 10x and set that for wal_keep_segments. I figured allowing a simple -1 would be easier. I think it's much easier to find out your free disk space than it is to calculate how much WAL might be generated during the backup. Disk space doesn't vary significantly on a production database. If we encourage that laziness then we will get reports that replication doesn't work and Postgres crashes. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers