Re: [HACKERS] error: could not find pg_class tuple for index 2662
[adding back hackers so the thread shows the resolution]

On Sun, Aug 14, 2011 at 07:02:55PM -0400, Tom Lane wrote:
> Sounds good. Based on my own testing so far, I think that patch will
> probably make things measurably better for you, though it won't
> resolve every corner case.

The most recent catalog vacuums did vacuum full pg_class in 34 databases on that instance with no new failures. So it looks like the patch fixes it. This is not conclusive, but it looks very good so far. I'll send an update if I see any new errors during the week.

Thanks for your help on this. It looks like it has sent you on a merry search through all the catcache-related program activities. I'm assuming this patch, or some improvement on it, will show up in a point release. Meanwhile, if this works as is for a couple more days, we will resume upgrading the rest of the hosts to 9.0 using this patch.

-dg

--
David Gould   da...@sonic.net   510 536 1443   510 282 0869
If simplicity worked, the world would be overrun with insects.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] synchronized snapshots
On Mon, Aug 15, 2011 at 2:31 AM, Joachim Wieland j...@mcknight.de wrote:
> In short, this is how it works:
>
> SELECT pg_export_snapshot();
>  pg_export_snapshot
> --------------------
>  03A1-1
> (1 row)
>
> (and then in a different session)
>
> BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '03A1-1');

I don't see the need to change the BEGIN command, which is SQL Standard. We don't normally do that.

If we have pg_export_snapshot(), why not pg_import_snapshot() as well?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] synchronized snapshots
On 15.08.2011 04:31, Joachim Wieland wrote:
> The one thing that it does not implement is leaving the transaction in
> an aborted state if the BEGIN TRANSACTION command failed for an
> invalid snapshot identifier.

So what happens if the snapshot is invalid, the SNAPSHOT clause is silently ignored? That sounds really bad.

> I can certainly see that this would be useful but I am not sure if it
> justifies introducing this inconsistency. We would have a BEGIN
> TRANSACTION command that left the session in a different state
> depending on why it failed...

I don't understand what inconsistency you're talking about. What else can cause BEGIN TRANSACTION to fail? Is there currently any failure mode that doesn't leave the transaction in an aborted state?

> I am wondering if pg_export_snapshot() is still the right name, since
> the snapshot is no longer exported to the user. It is exported to a
> file but that's an implementation detail.

It's still exporting the snapshot to other sessions, so that name still seems appropriate to me.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] synchronized snapshots
On 15.08.2011 10:40, Simon Riggs wrote:
> On Mon, Aug 15, 2011 at 2:31 AM, Joachim Wieland j...@mcknight.de wrote:
>> In short, this is how it works:
>>
>> SELECT pg_export_snapshot();
>>  pg_export_snapshot
>> --------------------
>>  03A1-1
>> (1 row)
>>
>> (and then in a different session)
>>
>> BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '03A1-1');
>
> I don't see the need to change the BEGIN command, which is SQL
> Standard. We don't normally do that.
>
> If we have pg_export_snapshot() why not pg_import_snapshot() as well?

It would be a nice symmetry, but you'd need a limitation that pg_import_snapshot() must be the first thing you do in the session. And it might be hard to enforce that, as once you get control into the function, you've already acquired another snapshot in the transaction to run the SELECT pg_import_snapshot() query with.

Specifying the snapshot in the BEGIN command makes sense.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] synchronized snapshots
On Monday, August 15, 2011 08:40:34 Simon Riggs wrote:
> On Mon, Aug 15, 2011 at 2:31 AM, Joachim Wieland j...@mcknight.de wrote:
>> In short, this is how it works:
>>
>> SELECT pg_export_snapshot();
>>  pg_export_snapshot
>> --------------------
>>  03A1-1
>> (1 row)
>>
>> (and then in a different session)
>>
>> BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '03A1-1');
>
> I don't see the need to change the BEGIN command, which is SQL
> Standard. We don't normally do that.

Uhm. There already are several extensions to BEGIN TRANSACTION. Like the just-added DEFERRABLE.

> If we have pg_export_snapshot() why not pg_import_snapshot() as well?

Using BEGIN has the advantage of making it explicit that it cannot be used inside an existing transaction. Which I do find advantageous.

Andres
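[For readers following this thread in the archive: the two-session flow under debate looks roughly like the transcript below. The snapshot identifier format and the SNAPSHOT clause come from Joachim's proposed patch as quoted above; this is a sketch of the proposal, not released syntax.]

```sql
-- Session 1: open a transaction and export its snapshot.
-- The exporting transaction must stay open while others import.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- returns an identifier such as '03A1-1'

-- Session 2: start a transaction on exactly the same snapshot
-- (proposed syntax; the alternative under discussion is a
-- pg_import_snapshot('03A1-1') function call instead).
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '03A1-1');
-- ... both sessions now see identical data, e.g. for parallel pg_dump ...
COMMIT;
```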
Re: [HACKERS] Online base backup from the hot-standby
>> * Not correspond yet
>>   * full_page_write = off
>>     - If the primary is full_page_write = off, archive recovery may
>>       not act normally. Therefore the standby may need to check
>>       whether full_page_write = off to WAL.
>
> Isn't having a standby make the full_page_write = on in all cases
> (bypassing configuration)?

What's the meaning?

----------------------------------
Jun Ishizuka
NTT Software Corporation
TEL: 045-317-7018
E-Mail: ishizuka@po.ntts.co.jp
----------------------------------
Re: [HACKERS] synchronized snapshots
On Aug 15, 2011, at 9:40 AM, Simon Riggs wrote:
> On Mon, Aug 15, 2011 at 2:31 AM, Joachim Wieland j...@mcknight.de wrote:
>> In short, this is how it works:
>>
>> SELECT pg_export_snapshot();
>>  pg_export_snapshot
>> --------------------
>>  03A1-1
>> (1 row)
>>
>> (and then in a different session)
>>
>> BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '03A1-1');
>
> I don't see the need to change the BEGIN command, which is SQL
> Standard. We don't normally do that.
>
> If we have pg_export_snapshot() why not pg_import_snapshot() as well?

i would definitely argue for a syntax like the one proposed by Joachim.. it could stay the same if this is turned into some sort of flashback implementation some day.

regards,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
Re: [HACKERS] synchronized snapshots
* Simon Riggs:
> I don't see the need to change the BEGIN command, which is SQL
> Standard. We don't normally do that.

Some language bindings treat BEGIN specially, so it might be difficult to use this feature.

--
Florian Weimer                fwei...@bfk.de
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
[HACKERS] pg_stat_replication vs StandbyReplyMessage
The pg_stat_replication view exposes all the fields in StandbyReplyMessage *except* for the timestamp when the message was generated.

On an active system this is not all that interesting, but on a mostly idle system it allows the monitoring to react faster than the timeout that actually kicks the other end off - and it could be useful in manual debugging scenarios.

Any particular reason why this was not exposed as its own column?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
[HACKERS] walprotocol.h vs frontends
I'm trying to make my streaming log receiver work properly with 9.1, and have come across a couple of things.

The first one that's causing trouble is that the definition of the protocol is currently in walprotocol.h, which is not includable in a frontend application. AFAICT, this is because it includes utils/timestamp.h, which doesn't work. AFAICT, this means that anybody other than our own backend who wants to talk our replication protocol has to copy the specific struct defines they want into their own code. This seems like a really bad idea. (In my case, it's the StandbyReplyMessage that I need, so I can make my client not get killed by the default settings for the timeout.)

The basic reason for this is that we're putting TimestampTz fields in the protocol. This also means that the protocol actually changes definition depending on whether the server is compiled with integer or float timestamps. While the replication itself breaks if these are different, this seems like a bad thing to expose in the protocol. It also makes life a lot harder for third-party tools.

Any ideas on what to do with this, other than me copying the struct defs and changing them to int64 (which does work in my lab, but seems like a horrible kludge)?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] synchronized snapshots
On Mon, Aug 15, 2011 at 3:47 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> On 15.08.2011 04:31, Joachim Wieland wrote:
>> The one thing that it does not implement is leaving the transaction
>> in an aborted state if the BEGIN TRANSACTION command failed for an
>> invalid snapshot identifier.
>
> So what if the snapshot is invalid, the SNAPSHOT clause silently
> ignored? That sounds really bad.

No, the command would fail, but since it fails, it doesn't change the transaction state. What was proposed originally was to start a transaction but throw an error that leaves the transaction in the aborted state. But then the command had some effect because it started a transaction block, even though it failed.

>> I can certainly see that this would be useful but I am not sure if it
>> justifies introducing this inconsistency. We would have a BEGIN
>> TRANSACTION command that left the session in a different state
>> depending on why it failed...
>
> I don't understand what inconsistency you're talking about. What else
> can cause BEGIN TRANSACTION to fail? Is there currently any failure
> mode that doesn't leave the transaction in aborted state?

Granted, it might only fail for parse errors so far, but that would include for example sending BEGIN DEFERRABLE to a pre-9.1 server. It wouldn't start a transaction and leave it in an aborted state; it would just fail.

>> I am wondering if pg_export_snapshot() is still the right name, since
>> the snapshot is no longer exported to the user. It is exported to a
>> file but that's an implementation detail.
>
> It's still exporting the snapshot to other sessions, that name still
> seems appropriate to me.

ok.

Joachim
Re: [HACKERS] pg_stat_replication vs StandbyReplyMessage
On Mon, Aug 15, 2011 at 7:03 AM, Magnus Hagander mag...@hagander.net wrote:
> The pg_stat_replication view exposes all the fields in
> StandbyReplyMessage *except* for the timestamp when the message was
> generated.
>
> On an active system this is not all that interesting, but on a mostly
> idle system that allows the monitoring to react faster than the
> timeout that actually kicks the other end off - and could be useful in
> manual debugging scenarios.
>
> Any particular reason why this was not exposed as it's own column?

I wondered the same thing. Sounds like a good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Online base backup from the hot-standby
2011/8/15 Jun Ishiduka ishizuka@po.ntts.co.jp:
>>> * Not correspond yet
>>>   * full_page_write = off
>>>     - If the primary is full_page_write = off, archive recovery may
>>>       not act normally. Therefore the standby may need to check
>>>       whether full_page_write = off to WAL.
>>
>> Isn't having a standby make the full_page_write = on in all case
>> (bypass configuration)?
>
> what's the meaning?

Yeah. full_page_writes is a WAL generation parameter. Standbys don't generate WAL. I think you just have to insist that the master has it on.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] synchronized snapshots
On Mon, Aug 15, 2011 at 6:41 AM, Florian Weimer fwei...@bfk.de wrote:
> * Simon Riggs:
>> I don't see the need to change the BEGIN command, which is SQL
>> Standard. We don't normally do that.
>
> Some language bindings treat BEGIN specially, so it might be difficult
> to use this feature.

It's true, the command might require explicit support from language bindings. However, I used some Perl test scripts, where you can also send a START TRANSACTION command in a $dbh->do(...).

The intended use case of this feature is still pg_dump, btw...

Joachim
Re: [HACKERS] SSI error messages
On Sat, 2011-07-16 at 21:55 +0300, Heikki Linnakangas wrote:
> I think I would prefer something like this:
>
> ERROR: could not serialize access due to read/write dependencies among transactions
> DETAIL: Reason code: %s
> HINT: The transaction might succeed if retried.

I've done it this way now, except that "Reason code" is not marked for translation, for the implementation reasons discussed downthread.

I'm looking forward to seeing how many of these we'll see in practice and how users will react to them.
Re: [HACKERS] synchronized snapshots
On Mon, Aug 15, 2011 at 3:51 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> It would be a nice symmetry, but you'd need a limitation that
> pg_import_snapshot() must be the first thing you do in the session.
> And it might be hard to enforce that, as once you get control into the
> function, you've already acquired another snapshot in the transaction
> to run the SELECT pg_import_snapshot() query with.
>
> Specifying the snapshot in the BEGIN command makes sense.

+1.

Also, I am pretty sure that there are drivers out there, and connection poolers, that keep track of the transaction state by watching commands go by. Right now you can tell by the first word of the command whether it's something that might change the transaction state; I wouldn't like to make that harder.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] VACUUM FULL versus TOAST
On Sun, Aug 14, 2011 at 7:43 PM, Greg Stark st...@mit.edu wrote:
> On Sun, Aug 14, 2011 at 5:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> There would be some merit in your suggestion if we knew that all/most
>> toasted columns would actually get fetched out of the catcache entry
>> at some point. Then we'd only be moving the cost around, and might
>> even save something on repeated accesses. But I don't think we know
>> that. In the specific example at hand (pg_statistic entries) it's
>> entirely plausible that the planner would only need the histogram, or
>> only need the MCV list, depending on the sorts of queries it was
>> coping with.
>
> Fwiw detoasting statistics entries sounds like a fine idea to me. I've
> often seen queries that are unexpectedly slow to plan and chalked it
> up to statistics entries getting toasted. If it's ok to read either
> the histogram or MCV list from disk every time we plan a query then
> why are we bothering with an in-memory cache of the statistics at all?
>
> The only thing that gives me pause is that it's possible these entries
> are *really* large. If you have a decent number of tables that are all
> a few megabytes of histograms then things could go poorly. But I don't
> think having to read in these entries from pg_toast every time you
> plan a query is going to go much better for you either.

Yep; in fact, I've previously submitted test results showing that repeatedly decompressing TOAST entries can significantly slow down query planning. That having been said, Tom's fix seems safer to back-patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] walprotocol.h vs frontends
On 15 August 2011 12:22, Magnus Hagander mag...@hagander.net wrote:
> The basic reason for this is that we're putting TimestampTz fields in
> the protocol. This also means that the protocol actually changes
> definition depending on if the server is compiled with integer or
> float timestamps.

Without commenting on what should be done in your specific case, I wonder whether it's time to fully retire the deprecated double representation of timestamps. Is anyone actually expected to rely on their availability when 9.2 is released?

This also caused difficulties for Johann Oskarsson recently, during work on PL/Java.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] walprotocol.h vs frontends
Magnus Hagander mag...@hagander.net writes:
> I'm trying to make my streaming log receiver work properly with 9.1,
> and have come across a couple of things.
>
> The first one that's causing trouble is that the definition of the
> protocol is currently in walprotocol.h, which is not include:able in a
> frontend application. AFAICT, this is because it includes
> utils/timestamp.h, which doesn't work. AFAICT, this means that anybody
> other than our own backend who wants to talk our replication protocol
> has to copy the specific struct defines they want in their own code.
> This seems like a really bad idea. (In my case, it's the
> StandbyReplyMessage that I need, so I can make my client not get
> killed by the default settings for timeout)
>
> The basic reason for this is that we're putting TimestampTz fields in
> the protocol. This also means that the protocol actually changes
> definition depending on if the server is compiled with integer or
> float timestamps. While the replication itself breaks if these are
> different, this seems like a bad thing to expose in the protocol. It
> also makes life a lot harder on third party tools.

I don't really see why it matters, given that the data to be shipped is also dependent on the timestamp format?

However, for a narrow fix, I could see moving the data type definition to someplace with fewer dependencies. Perhaps split it into a separate file timestamp_type.h, or something like that.

regards, tom lane
Re: [HACKERS] walprotocol.h vs frontends
On Mon, Aug 15, 2011 at 16:20, Tom Lane t...@sss.pgh.pa.us wrote:
> Magnus Hagander mag...@hagander.net writes:
>> I'm trying to make my streaming log receiver work properly with 9.1,
>> and have come across a couple of things.
>>
>> The first one that's causing trouble is that the definition of the
>> protocol is currently in walprotocol.h, which is not include:able in
>> a frontend application. AFAICT, this is because it includes
>> utils/timestamp.h, which doesn't work. AFAICT, this means that
>> anybody other than our own backend who wants to talk our replication
>> protocol has to copy the specific struct defines they want in their
>> own code. This seems like a really bad idea. (In my case, it's the
>> StandbyReplyMessage that I need, so I can make my client not get
>> killed by the default settings for timeout)
>>
>> The basic reason for this is that we're putting TimestampTz fields in
>> the protocol. This also means that the protocol actually changes
>> definition depending on if the server is compiled with integer or
>> float timestamps. While the replication itself breaks if these are
>> different, this seems like a bad thing to expose in the protocol. It
>> also makes life a lot harder on third party tools.
>
> I don't really see why it matters, given that the data to be shipped
> is also dependent on the timestamp format?

As an example, the stream receiver program needs to be able to construct the status message and send it back - thus it needs to be able to construct a TimestampTz. It never needs to look into the actual WAL data - it just writes it to a file, considering it to be a stream of pure binary data.

> However, for a narrow fix, I could see moving the data type definition
> to someplace with fewer dependencies. Perhaps split it into a separate
> file timestamp_type.h, or something like that.

Yes, that seems to fix the problem of timestamptz. See the attached patch - seems ok?

I also ran into a similar problem with some WAL macro definitions that are in xlog_internal.h. I've moved them to xlogdefs.h in the attached xlog.diff file. Does that seem ok as well, or should I move them somewhere else? (I don't need all those macros, but I moved the complete block of macros wherever I needed one of them, because otherwise it would be completely impossible to track.) This is just a simple move of the definitions, zero new logic added and zero removed.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
index 9414667..cdba21b 100644
--- a/src/include/replication/walprotocol.h
+++ b/src/include/replication/walprotocol.h
@@ -13,7 +13,7 @@
 #define _WALPROTOCOL_H
 
 #include "access/xlogdefs.h"
-#include "utils/timestamp.h"
+#include "utils/timestamp_defs.h"
 
 /*
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 9e51b58..1a5ea0d 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -19,86 +19,7 @@
 #include "fmgr.h"
 #include "pgtime.h"
 
-#ifdef HAVE_INT64_TIMESTAMP
-#include "utils/int8.h"
-#endif
-
-/*
- * Timestamp represents absolute time.
- *
- * Interval represents delta time. Keep track of months (and years), days,
- * and hours/minutes/seconds separately since the elapsed time spanned is
- * unknown until instantiated relative to an absolute time.
- *
- * Note that Postgres uses "time interval" to mean a bounded interval,
- * consisting of a beginning and ending time, not a time span - thomas 97/03/20
- *
- * We have two implementations, one that uses int64 values with units of
- * microseconds, and one that uses double values with units of seconds.
- *
- * TimeOffset and fsec_t are convenience typedefs for temporary variables
- * that are of different types in the two cases. Do not use fsec_t in values
- * stored on-disk, since it is not the same size in both implementations.
- * Also, fsec_t is only meant for *fractional* seconds; beware of overflow
- * if the value you need to store could be many seconds.
- */
-
-#ifdef HAVE_INT64_TIMESTAMP
-
-typedef int64 Timestamp;
-typedef int64 TimestampTz;
-typedef int64 TimeOffset;
-typedef int32 fsec_t;			/* fractional seconds (in microseconds) */
-#else
-
-typedef double Timestamp;
-typedef double TimestampTz;
-typedef double TimeOffset;
-typedef double fsec_t;			/* fractional seconds (in seconds) */
-#endif
-
-typedef struct
-{
-	TimeOffset	time;			/* all time units other than days, months and
-								 * years */
-	int32		day;			/* days, after time for alignment */
-	int32		month;			/* months and years, after time for alignment */
-} Interval;
-
-
-#define MAX_TIMESTAMP_PRECISION 6
-#define MAX_INTERVAL_PRECISION 6
-
-/* in both timestamp.h and ecpg/dt.h */
-#define DAYS_PER_YEAR	365.25	/* assumes leap year every four years */
-#define MONTHS_PER_YEAR 12
-/*
- * DAYS_PER_MONTH is very imprecise. The more accurate value is
- * 365.2425/12 = 30.436875, or '30 days 10:29:06'. Right now we only
- * return an
Re: [HACKERS] walprotocol.h vs frontends
Magnus Hagander mag...@hagander.net writes:
> On Mon, Aug 15, 2011 at 16:20, Tom Lane t...@sss.pgh.pa.us wrote:
>> However, for a narrow fix, I could see moving the data type
>> definition to someplace with fewer dependencies. Perhaps split it
>> into a separate file timestamp_type.h, or something like that.
>
> Yes, that seems to fix the problem of timestamptz. See the attached
> patch - seems ok?

Don't think you should expose fsec_t, nor most of those macros. The foo_per_bar values are just namespace clutter.

> I also ran into a similar problem with some WAL macro definitions that
> are in xlog_internal.h. I've moved them to xlogdefs.h in the attached
> xlog.diff file. Does that seem ok as well, or should I move them
> somewhere else?

I don't like the idea of exposing those to frontends, either. What do you actually *need* out of that, and why?

regards, tom lane
Re: [HACKERS] synchronized snapshots
Simon Riggs si...@2ndquadrant.com wrote:
> Joachim Wieland j...@mcknight.de wrote:
>> BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '03A1-1');
>
> I don't see the need to change the BEGIN command, which is SQL
> Standard.

No, it's not standard. To quote from our docs at http://www.postgresql.org/docs/9.0/interactive/sql-begin.html#AEN58214 :

| BEGIN is a PostgreSQL language extension. It is equivalent to the
| SQL-standard command START TRANSACTION, whose reference page
| contains additional compatibility information.
|
| Incidentally, the BEGIN key word is used for a different purpose
| in embedded SQL. You are advised to be careful about the
| transaction semantics when porting database applications.

In checking the most recent standards draft I have available, it appears that besides embedded SQL, this keyword is also used in the standard trigger declaration syntax. Using BEGIN to start a transaction is a PostgreSQL extension to the standard.

That said, if we support a feature on the nonstandard BEGIN statement, we typically add it as an extension to the standard START TRANSACTION and SET TRANSACTION statements. Through 9.0 that consisted of having a nonstandard default for isolation level and the ability to omit commas required by the standard. In 9.1 we added another optional transaction property which defaults to standard behavior: DEFERRABLE.

If we're talking about a property of a transaction, like the transaction snapshot, it seems to me to be best to support it using the same statements we use for other transaction properties.

-Kevin
Re: [HACKERS] walprotocol.h vs frontends
On Mon, Aug 15, 2011 at 16:53, Tom Lane t...@sss.pgh.pa.us wrote:
> Magnus Hagander mag...@hagander.net writes:
>> On Mon, Aug 15, 2011 at 16:20, Tom Lane t...@sss.pgh.pa.us wrote:
>>> However, for a narrow fix, I could see moving the data type
>>> definition to someplace with fewer dependencies. Perhaps split it
>>> into a separate file timestamp_type.h, or something like that.
>
>> Yes, that seems to fix the problem of timestamptz. See the attached
>> patch - seems ok?
>
> Don't think you should expose fsec_t, nor most of those macros. The
> foo_per_bar values are just namespace clutter.

Hmm, ok. I just went for what seemed like a reasonable subset. I do also need SECS_PER_DAY and those constants in order to be able to convert the timestamp value. That led me to include the other foo_per_bar constants so they are all in one place - it seems wrong to have them split up.

>> I also ran into a similar problem with some WAL macro definitions
>> that are in xlog_internal.h. I've moved them to xlogdefs.h in the
>> attached xlog.diff file. Does that seem ok as well, or should I move
>> them somewhere else?
>
> I don't like the idea of exposing those to frontends, either. What do
> you actually *need* out of that, and why?

PrevLogSeg - which also brings in XLogSegsPerFile.
XLogFileName
XLogFileSize - which brings in XLogSegsPerFile and XLogSegSize

PrevLogSeg should be self-explanatory, as should XLogFileName. XLogFileSize is required by XLByteAdvance(), which is already in xlogdefs.h - so xlogdefs.h actually has a hidden dependency on xlog_internal.h here today.

I can certainly separate those out, but it seemed cleaner to move the whole block they were in.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] walprotocol.h vs frontends
On 11-08-15 10:00 AM, Peter Geoghegan wrote:
> Without commenting on what should be done in your specific case, I
> wonder whether it's time to fully retire the deprecated double
> representation of timestamps. Is anyone actually expected to rely on
> their availability when 9.2 is released?
>
> This also caused difficulties for Johann Oskarsson recently, during
> work on PL/Java.

This would mean that anyone using the floating-point timestamps today won't be able to use pg_upgrade to upgrade to whichever version we remove them from. 8.3 had float-based timestamps as the default, and I suspect many installations with the default 8.3 settings have been upgraded via pg_upgrade to 9.0 built with the old timestamp representation.

Steve
Re: [HACKERS] walprotocol.h vs frontends
On 15.08.2011 18:54, Magnus Hagander wrote:
> On Mon, Aug 15, 2011 at 16:53, Tom Lane t...@sss.pgh.pa.us wrote:
>> Magnus Hagander mag...@hagander.net writes:
>>> I also ran into a similar problem with some WAL macro definitions
>>> that are in xlog_internal.h. I've moved them to xlogdefs.h in the
>>> attached xlog.diff file. Does that seem ok as well, or should I move
>>> them somewhere else?
>>
>> I don't like the idea of exposing those to frontends, either. What do
>> you actually *need* out of that, and why?
>
> PrevLogSeg - which also brings in XLogSegsPerFile.
> XLogFileName
> XLogFileSize - which brings in XLogSegsPerFile and XLogSegSize
>
> PrevLogSeg should be self explaining, as should xlogfilename.
> XLogFileSize is required by XLByteAdvance() which is already in
> xlogdefs.h - so xlogdefs.h actually has a hidden dependency on
> xlog_internal.h here today.
>
> I can certainly separate those out, but it seemed more clean to move
> the whole block they were in.

Perhaps we should change the protocol so that it explicitly says which file the streamed piece of WAL belongs to. That way a client could write it to the correct file without knowing about all those macros.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] walprotocol.h vs frontends
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Perhaps we should change the protocol so that it explicitly says which file the streamed piece of WAL belongs to. That way a client could write it to the correct file without knowing about all those macros. Yeah, maybe. Otherwise, all that logic is not only implicitly part of the protocol, but essentially impossible to change because it'll be hard-coded into clients. I wasn't too worried about this before because I supposed that only walsender and walreceiver really needed to know anything about it. If Magnus is busy coding something else that speaks that protocol, we have a problem. (Assuming we believe his use-case is sane, that is.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] walprotocol.h vs frontends
On Mon, Aug 15, 2011 at 18:12, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Perhaps we should change the protocol so that it explicitly says which file the streamed piece of WAL belongs to. That way a client could write it to the correct file without knowing about all those macros. Yeah, maybe. Otherwise, all that logic is not only implicitly part of the protocol, but essentially impossible to change because it'll be hard-coded into clients. I wasn't too worried about this before because I supposed that only walsender and walreceiver really needed to know Wouldn't doing that cause significant overhead? Every single packet would have to contain the filename, since the WAL stream doesn't depend on the filenames for where it cuts off. Also, the way it is now a single packet on the replication stream can span multiple WAL files... (This is the bug in my previous version that I've been able to fix now) anything about it. If Magnus is busy coding something else that speaks that protocol, we have a problem. (Assuming we believe his use-case is sane, that is.) I am. And said program has existed for almost a year, IIRC. It has even been on a commitfest for 9.1, but had a bug and I didn't have time to fix it (it still worked in a lot of cases, but had corner cases). And a lot of different people have requested said functionality, so at least they believe it's a use case (the use case being able to stream your backups at a smaller interval than 16MB). -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] walprotocol.h vs frontends
On 15 August 2011 16:56, Steve Singer ssin...@ca.afilias.info wrote: This would mean that anyone using the floating point timestamps today won't be able to use pg_upgrade to upgrade to whichever version we remove them from. 8.3 had float based timestamps as the default and I suspect many installations with the default 8.3 settings have been upgraded via pg_upgrade to 9.0 built with the old timestamp representation. Really? I find that slightly surprising, considering that a quick look at master's timestamp.c suggests that the choice to use the int64 representation over the double representation is entirely a case of a compile-time either/or. There is no apparent fall-back to the double representation available to binaries built without --disable-integer-datetimes. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] walprotocol.h vs frontends
Magnus Hagander mag...@hagander.net writes: On Mon, Aug 15, 2011 at 16:53, Tom Lane t...@sss.pgh.pa.us wrote: Don't think you should expose fsec_t, nor most of those macros. The foo_per_bar values are just namespace clutter. Hmm, ok. I just went for what seemed like a reasonable subset. I do also need the SECS_PER_DAY and those constants in order to be able to convert the timestamp value. Uh ... convert? That seems to require a whole lot more knowledge about timestamps than what is exposed by the proposed new header. I thought you were just hoping to compile a struct containing a TimestampTz field, not actually do anything with that field. I don't like the idea of exposing those to frontends, either. What do you actually *need* out of that, and why? PrevLogSeg - which also brings in XLogSegsPerFile. XLogFileName XLogFileSize - which brings in XLogSegsPerFile and XLogSegSize PrevLogSeg should be self explaining, as should xlogfilename. XLogFileSize is required by XLByteAdvance() which is already in xlogdefs.h - so xlogdefs.h actually has a hidden dependency on xlog_internal.h here today. Yeah, and the question remains why is frontend code using any of that? I'm inclined to think that there is zero chance of this code being frontend in the sense of being at any real arm's length from backend implementation details, and so you should just forget these header maneuvers and do what pg_resetxlog.c does, ie /* * We have to use postgres.h not postgres_fe.h here, because there's so much * backend-only stuff in the XLOG include files we need. But we need a * frontend-ish environment otherwise. Hence this ugly hack. */ #define FRONTEND 1 #include postgres.h regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] walprotocol.h vs frontends
On Mon, Aug 15, 2011 at 18:40, Tom Lane t...@sss.pgh.pa.us wrote: Magnus Hagander mag...@hagander.net writes: On Mon, Aug 15, 2011 at 16:53, Tom Lane t...@sss.pgh.pa.us wrote: Don't think you should expose fsec_t, nor most of those macros. The foo_per_bar values are just namespace clutter. Hmm, ok. I just went for what seemed like a reasonable subset. I do also need the SECS_PER_DAY and those constants in order to be able to convert the timestamp value. Uh ... convert? That seems to require a whole lot more knowledge about timestamps than what is exposed by the proposed new header. I thought you were just hoping to compile a struct containing a TimestampTz field, not actually do anything with that field. Um, bad choice of words. I don't need to convert, really, but I do need to stick the current timestamp into a TimestampTz field. I don't like the idea of exposing those to frontends, either. What do you actually *need* out of that, and why? PrevLogSeg - which also brings in XLogSegsPerFile. XLogFileName XLogFileSize - which brings in XLogSegsPerFile and XLogSegSize PrevLogSeg should be self explaining, as should XLogFileName. XLogFileSize is required by XLByteAdvance() which is already in xlogdefs.h - so xlogdefs.h actually has a hidden dependency on xlog_internal.h here today. Yeah, and the question remains why is frontend code using any of that? I'm inclined to think that there is zero chance of this code being frontend in the sense of being at any real arm's length from backend implementation details, and so you should just forget these header maneuvers and do what pg_resetxlog.c does, ie /* * We have to use postgres.h not postgres_fe.h here, because there's so much * backend-only stuff in the XLOG include files we need. But we need a * frontend-ish environment otherwise. Hence this ugly hack. */ #define FRONTEND 1 #include postgres.h That does seem to work, yes. I completely forgot about that hack. And I agree that's probably the way to do it...
Now I just need to figure out how to get GetCurrentTimestamp() to be accessible - or just copy it over under a different name... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] our buffer replacement strategy is kind of lame
Greg Smith g...@2ndquadrant.com wrote: Anyway, I think every idea thrown out here so far needs about an order of magnitude more types of benchmarking test cases before it can be evaluated at all. Right. I'm very excited about all the optimizations going in, and can't see where the ones committed so far are very likely to cause problems for workloads other than those tested, but here we're heading into territory where the performance farm is very desperately needed. That said, I'll throw out one more idea, as something that worked very well for me in a similar situation, and played well with external caching which knew lots about the storage media but nothing about the database side. It's very similar to the clock sweep algorithm, but uses a weighted average rather than a simple count. It didn't tend to leave as many buffers clustered at one end or the other, and the gradation did seem to be useful. I started out this post trying to describe it more generally, but it all came out sounding too vague and hand-wavy; so I'll give a description with more particulars which could, of course, be tweaked without tossing the concept. This particular description modifies the techniques I've used to try to better fit the particulars of PostgreSQL. It may fall down on contention issues, but since it benchmarked better for me than the alternatives, I thought it might at least help spin off other useful ideas. (1) Each buffer would have a one byte count, incremented on access, similar to the current count. Of course, 255 would not wrap on access, but be left unchanged. (2) We would have 256 int counters for how many buffers were available with each access count. These would be maintained when the access counts changed. (3) While sweeping, we wouldn't decrement the access counters for buffers we couldn't use; rather, there would be a boolean flag to say whether to divide the access counts for non-chosen buffers by two. 
(4) The reduce flag would be set when any access count goes to 255, and cleared after one complete sweep through the buffers. (So, obviously, we note the location when we set the flag.) (5) To find the next buffer to evict, we would find out what is the lowest count in use, and sweep forward until we found one. (6) We would have a background process doing this sweeping and count reduction which would try to keep a few evicted buffers in a queue for quick pickup. This queue would be the only source for getting a buffer. Getting a buffer would always be a very lightweight operation if there's something in the queue. The background task would try very hard to keep the queue non-empty, only coming to rest when the queue was full -- whatever that was chosen to mean (possibly based on the sort of adaptive calculations currently used by the background writer). There are a lot of interesting ways this could interact with the background writer. One of the more intriguing to me would be for the sweep process to estimate what current access count levels will need to be used on its next sweep and queue up any at those levels which are dirty for the background writer. This is vaguely conceptually similar to the wash marker used in some LRU lists. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
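The numbered points above can be condensed into a small toy model. This is an illustrative Python sketch of points (1) through (5) only, with hypothetical names, and with none of the locking, background process, or free-buffer queue of point (6) that a real implementation would need:

```python
# Toy model of the weighted clock sweep described in points (1)-(6) above.
# All names are hypothetical; this illustrates the idea, not PostgreSQL code.

class WeightedClockSweep:
    def __init__(self, nbuffers):
        self.counts = [0] * nbuffers   # (1) one access count per buffer, capped at 255
        self.hist = [0] * 256          # (2) how many buffers hold each count value
        self.hist[0] = nbuffers
        self.hand = 0
        self.reduce = False            # (4) halve counts during the next sweep

    def _set(self, buf, newc):
        # Keep the per-count histogram in sync with the buffer's count.
        self.hist[self.counts[buf]] -= 1
        self.hist[newc] += 1
        self.counts[buf] = newc

    def access(self, buf):
        if self.counts[buf] < 255:     # (1) 255 does not wrap; it stays put
            self._set(buf, self.counts[buf] + 1)
            if self.counts[buf] == 255:
                self.reduce = True     # (4) a count saturated; schedule reduction

    def evict(self):
        # (5) find the lowest count currently in use, then sweep to a buffer at it
        target = next(c for c in range(256) if self.hist[c] > 0)
        scanned = 0
        while True:
            buf = self.hand
            self.hand = (self.hand + 1) % len(self.counts)
            if self.counts[buf] <= target:
                self._set(buf, 0)      # the evicted buffer starts over at count 0
                return buf
            if self.reduce:
                # (3) non-chosen buffers are halved rather than decremented
                self._set(buf, self.counts[buf] // 2)
            scanned += 1
            if scanned >= len(self.counts):
                self.reduce = False    # (4) flag cleared after one complete sweep
```

In point (6) a background process would run evict() ahead of demand and keep the results in a queue, so foreground backends never pay for the sweep; that part is omitted here.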
Re: [HACKERS] walprotocol.h vs frontends
On 11-08-15 12:33 PM, Peter Geoghegan wrote: On 15 August 2011 16:56, Steve Singer ssin...@ca.afilias.info wrote: This would mean that anyone using the floating point timestamps today won't be able to use pg_upgrade to upgrade to whichever version we remove them from. 8.3 had float based timestamps as the default and I suspect many installations with the default 8.3 settings have been upgraded via pg_upgrade to 9.0 built with the old timestamp representation. Really? I find that slightly surprising, considering that a quick look at master's timestamp.c suggests that the choice to use the int64 representation over the double representation is entirely a case of compile time either/or. There is no apparent fall-back to the double representation available to binaries built without --disable-integer-datetimes. I *think* the default on 8.3 was float based timestamps. If you want to upgrade a system running 8.3 (that uses float based timestamps) using pg_upgrade you must compile 9.0 (or 8.4 or 9.1) with --disable-integer-datetimes. If at some point in the future you then want to upgrade to 9.2 with pg_upgrade you will again need to build 9.2 with --disable-integer-datetimes. If we remove the code for floating point representations of datetime then you won't be able to do that. Steve -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] walprotocol.h vs frontends
Steve Singer ssin...@ca.afilias.info writes: On 11-08-15 12:33 PM, Peter Geoghegan wrote: On 15 August 2011 16:56, Steve Singer ssin...@ca.afilias.info wrote: This would mean that anyone using the floating point timestamps today won't be able to use pg_upgrade to upgrade to whichever version we remove them from. 8.3 had float based timestamps as the default and I suspect many installations with the default 8.3 settings have been upgraded via pg_upgrade to 9.0 built with the old timestamp representation. Really? I *think* the default on 8.3 was float based timestamps. Yeah, it was. Also, the fact that the source distribution switched defaults at 8.4 has not got a lot to do with what individual binary distributions did --- some switched earlier, some may still not have done so. And of course individual users building from source were and still are free to make their own choice. It's going to be quite a long time before we can contemplate removing float timestamp support altogether. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] walprotocol.h vs frontends
On 15 August 2011 18:09, Steve Singer ssin...@ca.afilias.info wrote: Really? I find that slightly surprising, considering that a quick look at master's timestamp.c suggests that the choice to use the int64 representation over the double representation is entirely a case of compile time either/or. There is no apparent fall-back to the double representation available to binaries built without --disable-integer-datetimes. I *think* the default on 8.3 was float based timestamps. Yes, it was. If you want to upgrade a system running 8.3 (that uses float based timestamps) using pg_upgrade you must compile 9.0 (or 8.4 or 9.1) with --disable-integer-datetimes. If at some point in the future you then want to upgrade to 9.2 with pg_upgrade you will again need to build 9.2 with --disable-integer-datetimes. If we remove the code for floating point representations of datetime then you won't be able to do that. I'm pretty surprised that pg_upgrade pushes that onus onto its users - for many users, the need to build their own binaries is a greater barrier to upgrading than doing a logical restore. Maybe that's simply considered a matter for package managers to worry about, but that doesn't sit well with me. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] walprotocol.h vs frontends
On Mon, Aug 15, 2011 at 5:15 PM, Magnus Hagander mag...@hagander.net wrote: On Mon, Aug 15, 2011 at 18:12, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Perhaps we should change the protocol so that it explicitly says which file the streamed piece of WAL belongs to. That way a client could write it to the correct file without knowing about all those macros. Yeah, maybe. Otherwise, all that logic is not only implicitly part of the protocol, but essentially impossible to change because it'll be hard-coded into clients. I wasn't too worried about this before because I supposed that only walsender and walreceiver really needed to know Wouldn't doing that cause significant overhead? Every single packet would have to contain the filename, since the WAL stream doesn't depend on the filenames for where it cuts off. Also, the way it is now a single packet on the replication stream can span multiple WAL files... (This is the bug in my previous version that I've been able to fix now) Why not have a specific protocol message to indicate a change of filename? I don't mean a WAL message, I mean a streaming protocol message. At present the WALSender only sends from one file at a time, so sending a message when we open a new file would be straightforward. Magnus needs some things to make this work for 9.0/9.1, but we don't need to change 9.2 to allow that, we just need to copy the definitions for those now-fixed releases. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Should we have an optional limit on the recursion depth of recursive CTEs?
Should we have an optional, disabled-by-default limit on the recursion/iteration depth of recursive CTEs to guard against stupid queries that loop ad infinitum? I've looked at other database systems that support WITH RECURSIVE queries, and this idea crops up there. For example, Firebird, the only other RDBMS that I cared to look at for reasons you can perhaps guess, has a hard limit of 1024 (though you could argue that that's a limitation of their implementation, and I'd agree). Maybe the proprietary databases like SQL server have similar, perhaps even optional/adjustable limits - I don't know because I didn't check. I'd suggest that an appropriate interface would be an int GUC with a GucContext of PGC_SUSET, so that DBAs can impose system-wide limits. A possible use of such a GUC is to zero in on the actual recursion depth of the rCTE with the greatest depth in a given query, by performing a git bisect style binary search, setting the GUC dynamically at each step. It's probably not worth having a proper interface to do that with, but I can imagine that being a useful trick in certain narrow situations. We could also add a similar GUC that can be separately set by unprivileged users, that independently limits the recursion depth per session. This could be used as a sort of assertion of the maximum recursion depth of a given query. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
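For illustration, here is a toy Python model of the SQL standard's working-table evaluation of WITH RECURSIVE, with the kind of cap proposed above bolted on. `max_recursion_depth` is a hypothetical name for such a GUC, not an existing setting:

```python
# Sketch of rCTE evaluation with an optional iteration cap.
# "max_recursion_depth" is a hypothetical GUC name; 0 means "no limit",
# mirroring the proposed disabled-by-default behaviour.

def eval_recursive_cte(initial_rows, recursive_step, max_recursion_depth=0):
    """Union the non-recursive term with repeated applications of the
    recursive term, stopping when the working table comes up empty."""
    result = list(initial_rows)      # non-recursive term
    working = list(initial_rows)     # the "working table"
    depth = 0
    while working:
        depth += 1
        if max_recursion_depth and depth > max_recursion_depth:
            raise RuntimeError("recursion depth limit exceeded")
        # Apply the recursive term to every row in the working table.
        working = [r for w in working for r in recursive_step(w)]
        result.extend(working)
    return result

# A terminating query: the classic "count from 1 to 5".
rows = eval_recursive_cte([1], lambda n: [n + 1] if n < 5 else [])

# A query that loops ad infinitum (each step re-emits the same row) is cut
# off by the limit instead of running forever:
#   eval_recursive_cte([1], lambda n: [n], max_recursion_depth=100)
```

The binary-search trick described above corresponds to calling this with successively bisected `max_recursion_depth` values and observing which ones raise.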
Re: [HACKERS] Compressing the AFTER TRIGGER queue
On Aug 4, 2011, at 8:40 AM, Kevin Grittner wrote: Robert Haas robertmh...@gmail.com wrote: On Wed, Aug 3, 2011 at 6:05 PM, Jim Nasby j...@nasby.net wrote: Not sure how much this relates to this discussion, but I have often wished we had AFTER FOR EACH STATEMENT triggers that provided OLD and NEW recordsets you could make use of. Sometimes it's very valuable to be able to look at *all* the rows that changed in a transaction in one shot. Yeah, that would be awesome. I think some of our competitors provide exactly that feature... If I remember correctly, MS SQL Server and Sybase ASE provide INSERTED and DELETED relations in triggers instead of NEW and OLD records. In a FOR EACH ROW trigger the relation contains only one row. This is related to the thread on BEFORE triggers, in that these products require that you UPDATE the row in the base table to modify it (normally by joining to the INSERTED relation), making the latest values available to other trigger code, and providing a clear distinction between the values coming in to the trigger and the latest values in the database. Yeah, that sounds like how DB2 does it (though the relations were by default named NEW and OLD). If there is agreement on having a NEW recordset that contains all the new records (or new values on updates) and an OLD recordset that contains all the old records, are there any other API issues that need to be resolved? How would this actually be implemented? -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
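As a sketch of the API shape under discussion (not of any PostgreSQL internals), here is a toy Python model of a statement-level trigger that receives OLD and NEW recordsets, fired once per statement rather than once per row, in the style of SQL Server's DELETED/INSERTED relations:

```python
# Toy model of AFTER ... FOR EACH STATEMENT triggers with OLD/NEW recordsets.
# Everything here (Table, statement_triggers) is hypothetical and exists
# only to illustrate the trigger-facing API being discussed.

class Table:
    def __init__(self, rows=None):
        self.rows = list(rows or [])
        self.statement_triggers = []   # callables taking (old_rows, new_rows)

    def update(self, predicate, apply_change):
        old_rows, new_rows = [], []
        for i, row in enumerate(self.rows):
            if predicate(row):
                old_rows.append(row)
                self.rows[i] = apply_change(row)
                new_rows.append(self.rows[i])
        # Fire once per statement with both recordsets, not once per row.
        for trig in self.statement_triggers:
            trig(old_rows, new_rows)

# A trigger that audits every statement's full before/after row sets.
audit = []
t = Table([{"id": 1, "qty": 2}, {"id": 2, "qty": 5}])
t.statement_triggers.append(lambda old, new: audit.append((old, new)))
t.update(lambda r: r["qty"] > 1, lambda r: {**r, "qty": r["qty"] - 1})
```

A FOR EACH ROW trigger then becomes the degenerate case where each recordset contains exactly one row, matching the SQL Server behaviour described above.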
Re: [HACKERS] Should we have an optional limit on the recursion depth of recursive CTEs?
Peter Geoghegan pe...@2ndquadrant.com writes: Should we have an optional, disabled-by-default limit on the recursion/iteration depth of recursive CTEs to guard against stupid queries that loop ad infinitum? I think not ... I'd suggest that an appropriate interface would be an int GUC with a GucContext of PGC_SUSET, so that DBAs can impose system-wide limits. ... and that would be a seriously bad API. There are not SUSET restrictions on other resources such as work_mem. Why do we need one for this? By and large, this sounds like a solution looking for a problem. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] walprotocol.h vs frontends
On Mon, Aug 15, 2011 at 21:12, Simon Riggs si...@2ndquadrant.com wrote: On Mon, Aug 15, 2011 at 5:15 PM, Magnus Hagander mag...@hagander.net wrote: On Mon, Aug 15, 2011 at 18:12, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Perhaps we should change the protocol so that it explicitly says which file the streamed piece of WAL belongs to. That way a client could write it to the correct file without knowing about all those macros. Yeah, maybe. Otherwise, all that logic is not only implicitly part of the protocol, but essentially impossible to change because it'll be hard-coded into clients. I wasn't too worried about this before because I supposed that only walsender and walreceiver really needed to know Wouldn't doing that cause significant overhead? Every single packet would have to contain the filename, since the WAL stream doesn't depend on the filenames for where it cuts off. Also, the way it is now a single packet on the replication stream can span multiple WAL files... (This is the bug in my previous version that I've been able to fix now) Why not have a specific protocol message to indicate a change of filename? I don't mean a WAL message, I mean a streaming protocol message. That we could have, and it would work as long as it's always sent as the first packet in any stream. If we do that, we should probably make it a general metadata package - including things like segment size as well... At present the WALSender only sends from one file at a time, so sending a message when we open a new file would be straightforward. Are you sure? We can receive a single message spanning multiple files... Magnus needs some things to make this work for 9.0/9.1, but we don't need to change 9.2 to allow that, we just need to copy the definitions for those now-fixed releases. The hack from pg_resetxlog with a manual #define FRONTEND 1 will work for current releases, I think.
And it will work for future ones as well - but you'll be in a position where using the log streaming tool built with different parameters (like xlog seg size) than the server will not work - I'm not sure that's a problem big enough to worry about, though... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Should we have an optional limit on the recursion depth of recursive CTEs?
On Mon, Aug 15, 2011 at 9:31 PM, Tom Lane t...@sss.pgh.pa.us wrote: ... and that would be a seriously bad API. There are not SUSET restrictions on other resources such as work_mem. Why do we need one for this? I think a better analogy would be imposing a maximum number of rows a query can output. That might be a sane thing to have for some circumstances but it's not useful in general. Consider for instance my favourite recursive query application, displaying the lock dependency graph for pg_locks. What arbitrary maximum number of locks would you like to impose at which the query should error out? There is a situation, though, that I think is motivating this, where it would be nice to detect a problem: when the query is such that it *can't* produce a record because there's an infinite loop before the first record. Ideally you want some way to detect that you've recursed and haven't changed anything that could lead to a change in the recursion condition. But that seems like a pretty hard thing to detect, probably impossible. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] walprotocol.h vs frontends
Magnus Hagander mag...@hagander.net writes: Why not have a specific protocol message to indicate a change of filename? I don't mean a WAL message, I mean a streaming protocol message. That we could have, and it would work as long as it's always sent as the first packet in any stream. If we do that, we should probably make it a general metadata package - including things like segment size as well... I think this could be made an opt-in: the streaming client would have to ask for it when connecting to the walsender (or maybe whenever in the connection stream, I don't know). Then it can even be backported, and it won't add anything to core clients streaming data. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Aug 13, 2011, at 4:31 PM, Heikki Linnakangas wrote: The example is much more realistic if the query is a fetch of N latest rows from a table. Very common use case, and the whole relation's visibility statistics are completely wrong for that query. That is somewhat compensated by the fact that tuples that are accessed more often are also more likely to be in cache. Fetching the heap tuple to check visibility is very cheap when the tuple is in cache. I'm not sure how far that compensates it, though. I'm sure there's typically nevertheless a fairly wide range of pages that have been modified since the last vacuum, but not in cache anymore. http://xkcd.org/937/ :) Could something be added to pg_stats that tracks visibility map usefulness on a per-attribute basis? Perhaps another set of stats buckets that show visibility percentages for each stats bucket? -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] walprotocol.h vs frontends
On Mon, Aug 15, 2011 at 10:32 PM, Magnus Hagander mag...@hagander.net wrote: At present the WALSender only sends from one file at a time, so sending a message when we open a new file would be straightforward. Are you sure? We can receive a single message spanning multiple files... You're right. That was the way the original code ran but I thought we had stopped that when we introduced MAX_SEND_SIZE. The comment says don't cross a logfile boundary within one message. What that actually does is prevent us incrementing a logid value, which happens every 255 files. I read that as meaning WAL file which is not what it means at all. So right now what we do is allow a single packet to span multiple files, but since MAX_SEND_SIZE is 128KB it will always be smaller than a single file, so we can only ever span two files at most. That is all just a little bizarre, especially since libpq sends data in 8KB chunks anyway. So I suggest we change XLogSend() so that it only ever sends up to the end of a file. That way all w messages will relate to just one file, and we can have another message to initiate a new file. And then, as you say, give full metadata for the new file. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
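The change suggested above amounts to clamping each outgoing message at the next segment-file boundary as well as at the send-size limit. A minimal Python sketch of that chunking rule, using illustrative constants rather than the real XLogSegSize/MAX_SEND_SIZE definitions:

```python
# Sketch of splitting a WAL byte range into messages that never cross a
# segment-file boundary. Sizes are illustrative, not PostgreSQL's actual
# values, and positions are plain byte offsets rather than XLogRecPtrs.

SEG_SIZE = 16 * 1024 * 1024    # one WAL segment file (16MB)
MAX_SEND = 128 * 1024          # cap on a single message

def chunk_wal(start, end):
    """Yield (chunk_start, chunk_end) messages covering [start, end),
    each at most MAX_SEND bytes and each within a single segment file,
    so a 'new file' metadata message can precede every file's data."""
    pos = start
    while pos < end:
        seg_end = (pos // SEG_SIZE + 1) * SEG_SIZE   # end of the current file
        nxt = min(pos + MAX_SEND, end, seg_end)
        yield pos, nxt
        pos = nxt
```

With this rule a range straddling a boundary is split into two messages at exactly the file edge, which is what lets every message relate to just one file.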
Re: [HACKERS] synchronized snapshots
On Aug 15, 2011, at 6:23 AM, Joachim Wieland wrote: On Mon, Aug 15, 2011 at 3:47 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 15.08.2011 04:31, Joachim Wieland wrote: The one thing that it does not implement is leaving the transaction in an aborted state if the BEGIN TRANSACTION command failed for an invalid snapshot identifier. So what if the snapshot is invalid, the SNAPSHOT clause silently ignored? That sounds really bad. No, the command would fail, but since it fails, it doesn't change the transaction state. What was proposed originally was to start a transaction but throw an error that leaves the transaction in the aborted state. But then the command had some effect because it started a transaction block, even though it failed. It certainly seems safer to me to set the transaction to an aborted state; you were expecting a set of commands to run with one snapshot, but if we don't abort the transaction they'll end up running anyway and doing so with the *wrong* snapshot. That could certainly lead to data corruption. I suspect that all the other cases of BEGIN failing would be syntax errors, so you would immediately know in testing that something was wrong. A missing file is definitely not a syntax error, so we can't really depend on user testing to ensure this is handled correctly. IMO, that makes it critical that that error puts us in an aborted transaction. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 7:21 PM, Robert Haas robertmh...@gmail.com wrote: I'm kind of interested by the result, actually, as I had feared that the spinlock protecting ProcArrayLock was going to be a bigger problem sooner. I think this depends on how many connections you have. If you try to scale up your benchmark by having hundreds of connections then you get an O(n^2) increase in the time spent with the procarray locked. It sounds like they pinned the number of connections at the number of cores they had. That makes sense if they're intentionally driving a cpu-bound benchmark but it means they won't run into this problem. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should we have an optional limit on the recursion depth of recursive CTEs?
On 15 August 2011 21:31, Tom Lane t...@sss.pgh.pa.us wrote: I'd suggest that an appropriate interface would be an int GUC with a GucContext of PGC_SUSET, so that DBAs can impose system-wide limits. ... and that would be a seriously bad API. There are not SUSET restrictions on other resources such as work_mem. Why do we need one for this? I think that there perhaps /should/ be optional SUSET restrictions on those resources, particularly work_mem (though I'd suggest a more sophisticated interface there) - I haven't argued for that because, respectfully, I already know that to do so would be pretty close to futile. I have argued for this because I think that an important distinction can be drawn that might convince those who'd reject the idea of poor man's admission control. The distinction is that the only way that we'll ever be able to guard against this sort of failure is with an approach that is essentially equivalent to my proposal - stop trying after some arbitrary number of some unit of work. I'm sure that you don't need me to tell you that it has already been proven that solving the halting problem is impossible. What you may not be aware of is the fact that a proof exists for PG rCTE's Turing completeness. Consequently, I think that solving the halting problem is the barrier to coming up with something fundamentally better. I don't think that your scepticism about the general need to have such protection is justified; I believe that there is plenty of anecdotal evidence out there that this is useful in larger commercial contexts, and I've already named some places where a person might look for such anecdotal evidence. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] synchronized snapshots
On Mon, Aug 15, 2011 at 6:09 PM, Jim Nasby j...@nasby.net wrote: I suspect that all the other cases of BEGIN failing would be syntax errors, so you would immediately know in testing that something was wrong. A missing file is definitely not a syntax error, so we can't really depend on user testing to ensure this is handled correctly. IMO, that makes it critical that that error puts us in an aborted transaction. Why can we not just require the user to verify if his BEGIN query failed or succeeded? Is that really too much to ask for? Also see what Robert wrote about proxies in between that keep track of the transaction state. Consider they see a BEGIN query that fails. How would they know if the session is now in an aborted transaction or not in a transaction at all? Joachim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Aug 13, 2011, at 3:40 PM, Greg Stark wrote: It does kind of seem like your numbers indicate we're missing part of the picture though. The idea with the clock sweep algorithm is that you keep approximately 1/nth of the buffers with each of the n values. If we're allowing nearly all the buffers to reach a reference count of 5 then you're right that we've lost any information about which buffers have been referenced most recently. One possible missing piece here is that OS clock-sweeps depend on the clock hand to both increment and decrement the usage count. The hardware sets a bit any time a page is accessed; as the clock sweeps, it increases the usage count if the bit is set and decreases it if it's clear. I believe someone else in the thread suggested this, and I definitely think it's worth an experiment. Presumably this would also ease some lock contention issues. There is another piece that might be relevant... many (most?) OSes keep multiple lists of pages. FreeBSD for example maintains these page lists (http://www.freebsd.org/doc/en/articles/vm-design/article.html). Full description follows, but I think the biggest take-away is that there is a difference in how pages are handled once they are no longer active, based on whether the page is dirty or not. Active: These pages are actively in use and are not currently under consideration for eviction. This is roughly equivalent to all of our buffers with a usage count of 5. When an active page's usage count drops to its minimum value, it will get unmapped from process space and moved to one of two queues: Inactive: DIRTY pages that are eligible for eviction once they've been written out. Cache: CLEAN pages that may be immediately reclaimed Free: A small set of pages that are basically the tail of the Cache list. The OS *must* maintain some pages on this list to support memory needed during interrupt handling. 
The size of this list is typically kept very small, and I'm not sure whether non-interrupt processing will pull from this list. It's important to note that the OS can pull a page out of the Inactive and Cache lists back into Active very cheaply. I think there are two interesting points here. First: after a page has been determined to no longer be in active use, it goes into Inactive or Cache based on whether it's dirty. ISTM that allows for much better scheduling of the flushing of dirty pages. That said, I'm not sure how much that would help us due to checkpoint requirements. Second: AFAIK only the Active list has a clock sweep. I believe the others are LRU (the mentioned URL refers to them as queues). I believe this works well because if a page faults it just needs to be removed from whichever queue it is in, added to the Active queue, and mapped back into process space. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On 08/11/2011 04:06 PM, Robert Haas wrote: On my laptop, the first query executes in about 555 ms, while the second one takes about 1125 ms...I expect that you could get an even larger benefit from this type of query if you could avoid actual disk I/O, rather than just buffer cache thrashing, but I haven't come up with suitable test cases for that yet (volunteers?). That part I can help with, using a Linux test that kills almost every cache. I get somewhat faster times on my desktop here running the cached version like you were doing (albeit with debugging options on, so I wouldn't read too much into this set of numbers): select sum(aid) from sample_data a1 where exists (select * from pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567); sum -------------- 250279412983 (1 row) Time: 472.778 ms QUERY PLAN -------------- Aggregate (cost=133325.00..133325.01 rows=1 width=4) -> Nested Loop Semi Join (cost=0.00..133075.00 rows=10 width=4) -> Seq Scan on sample_data a1 (cost=0.00..15286.00 rows=10 width=4) -> Index Only Scan using pgbench_accounts_pkey on pgbench_accounts a (cost=0.00..1.17 rows=1 width=4) Index Cond: (aid = a1.aid) Filter: (aid <> 1234567) select sum(aid) from sample_data a1 where exists (select * from pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567); sum -------------- 250279412983 Time: 677.902 ms explain select sum(aid) from sample_data a1 where exists (select * from pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567); QUERY PLAN -------------- Aggregate (cost=133325.00..133325.01 rows=1 width=4) -> Nested Loop Semi Join (cost=0.00..133075.00 rows=10 width=4) -> Seq Scan on sample_data a1 (cost=0.00..15286.00 rows=10 width=4) -> Index Scan using pgbench_accounts_pkey on pgbench_accounts a (cost=0.00..1.17 rows=1 width=4) Index Cond: (aid = a1.aid) Filter: (bid <> 1234567) If I set up my gsmith account to be able to start and stop the server with pg_ctl and a valid PGDATA, and drop these two scripts in that home directory: == index-only-1.sql == \timing select sum(aid) from sample_data a1 where exists (select * from pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567); explain select sum(aid) from sample_data a1 where exists (select * from pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567); == index-only-2.sql == \timing select sum(aid) from sample_data a1 where exists (select * from pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567); explain select sum(aid) from sample_data a1 where exists (select * from pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567); I can then run this script as root: #!/bin/bash ME=gsmith su - $ME -c "pg_ctl stop -w" echo 3 > /proc/sys/vm/drop_caches su - $ME -c "pg_ctl start -w" su - $ME -c "psql -ef index-only-1.sql" su - $ME -c "pg_ctl stop -w" echo 3 > /proc/sys/vm/drop_caches su - $ME -c "pg_ctl start -w" su - $ME -c "psql -ef index-only-2.sql" And get results that start with zero information cached in RAM, showing a much more dramatic difference. Including some snippets of interesting vmstat too, the index-only one gets faster as it runs while the regular one is pretty flat: select sum(aid) from sample_data a1 where exists (select * from pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567); Time: 31677.683 ms $ vmstat 5 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ 0 1 0 15807288 4388 12644000 4681 118 1407 2432 1 1 89 10 1 1 0 15779388 4396 15444800 358717 1135 2058 1 0 86 13 0 1 0 15739956 4396 19367200 5800 0 1195 2056 1 0 87 12 0 1 0 15691844 4396 24183200 7053 3 1299 2044 1 0 86 13 0 1 0 15629736 4396 30409600 799537 1391 2053 1 0 87 12 0 1 0 15519244 4400 41426800 1163914 1448 2189 1 0 87 12 select sum(aid) from sample_data a1 where exists (select * from pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567); Time: 172381.235 ms $ vmstat 5 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ 0 1 0 15736500 4848 19644400 314222 1092 1989 1 0 86 13 0 1 0 15711948 4852 22107200 3411 1 1039 1943 0 0 88 12 0 1 0 15685412 4852 24749600 361834 1997 0 0 86 13 [this is the middle part, rate doesn't vary too much] That's 5.4X as fast; not too shabby! Kind of interesting how different the I/O pattern is on the index-only version. I
Re: [HACKERS] index-only scans
On Mon, Aug 15, 2011 at 7:37 PM, Greg Smith g...@2ndquadrant.com wrote: That's 5.4X as fast; not too shabby! Awesome! And with the large difference in response time, things appear to be working as hoped even in this situation. If you try this on your laptop, where drive cache size and random I/O are likely to be even slower, you might see an even larger difference. Hah! Just in case a 5.4X performance improvement isn't good enough? :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] VACUUM FULL versus relcache init files
This might be the last bug from my concurrent-vacuum-full testing --- at least, I have no remaining unexplained events from about two full days of running the tests. The ones that are left involve backends randomly failing like this: psql: FATAL: could not read block 0 in file base/130532080/130545668: read only 0 of 8192 bytes usually during startup, though I have one example of a backend being repeatedly unable to access pg_proc due to similar errors. I believe what this traces to is stale relfilenode information taken from relcache init files, which contain precomputed relcache data intended to speed up backend startup. There is a curious little dance between write_relcache_init_file and RelationCacheInitFileInvalidate that is intended to ensure that when a process creates a new relcache init file that is already stale (because someone else invalidated the information concurrently), the bad file will get unlinked and not used. I think I invented that logic, so it's my fault that it doesn't work. It works fine as long as you consider only the two processes directly involved; but a third process can get fooled into using stale data. The scenario requires two successive invalidations, as for example from vacuum full on two system catalogs in a row, plus a stream of incoming new backends. It goes like this: 1. Process A vacuums a system catalog, unlinks init file, sends sinval messages, does second unlink (which does nothing). 2. Process B starts, observes lack of init file, begins to construct a new one. It gets to the end of AcceptInvalidationMessages in write_relcache_init_file. Since it started after A sent the first sinvals, it sees no incoming sinval messages and has no reason to think its new init file isn't good. 3. Meanwhile, Process A vacuums another system catalog, unlinks init file (doing nothing), and finally sends its sinval messages just after B looked for them. Now it will block trying to get RelCacheInitLock, which B holds. 4. 
Process B renames its already-stale init file into place, then releases RelCacheInitLock. 5. Process A gets the lock and removes the stale init file. Now process B is okay, because it will see A's second sinvals before it tries to make any use of the relcache data it has. And the stale init file is definitely gone after step 5. However, between steps 4 and 5 there is a window for Process C to start, read the stale init file, and attempt to use it. Since C started after A's second set of sinval messages, it doesn't see them and doesn't know it has stale data. As far as I can see at the moment, the only way to make this bulletproof is to turn both creation and deletion of the init file into atomic operations that include sinval messaging. What I have in mind is Creator: must take RelCacheInitLock, check for incoming invals, rename the new file into place if none, release RelCacheInitLock. (This is the same as what it does now.) Destroyer: must take RelCacheInitLock, unlink the init file, send its sinvals, release RelCacheInitLock. This guarantees that we serialize the sending of the sinval messages so that anyone who sees a bad init file in place *must* see the sinval messages afterwards, so long as they join the sinval messaging ring before looking for the init file (which they do). I don't think it's any worse than the current scheme from a parallelism point of view: the destroyer is holding RelCacheInitLock a bit longer than before, but that should not be a performance critical situation. Anybody see a hole in that? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enforcing that all WAL has been replayed after restoring from backup
On Thu, Aug 11, 2011 at 1:34 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Hmm, that's not possible for the 'tar' output, but would work for 'dir' output. Another similar idea would be to withhold the control file in memory until the end of backup, and append it to the output as last. The backup can't be restored until the control file is written out. That won't protect from more complicated scenarios, like if you take the backup without the -x flag, and copy some but not all of the required WAL files manually to the pg_xlog directory. But it'd be much better than nothing for 9.1. We need to skip checking whether we've reached the end-of-backup location only when the server crashes while a base backup taken with pg_start_backup() is in progress. Right? We can do this by *not* initializing ControlFile->backupStartPoint if the server is doing crash recovery and backupEndRequired is false. What about the attached patch? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 11035e6..d0d68d4 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6329,11 +6329,8 @@ StartupXLOG(void) /* * set backupStartPoint if we're starting recovery from a base backup */ - if (haveBackupLabel) - { + if ((InArchiveRecovery && haveBackupLabel) || backupEndRequired) ControlFile->backupStartPoint = checkPoint.redo; - ControlFile->backupEndRequired = backupEndRequired; - } ControlFile->time = (pg_time_t) time(NULL); /* No need to hold ControlFileLock yet, we aren't up far enough */ UpdateControlFile(); @@ -6703,20 +6700,13 @@ StartupXLOG(void) * crashes while an online backup is in progress. We must not treat * that as an error, or the database will refuse to start up. 
*/ - if (InArchiveRecovery || ControlFile->backupEndRequired) - { - if (ControlFile->backupEndRequired) -ereport(FATAL, - (errmsg("WAL ends before end of online backup"), - errhint("All WAL generated while online backup was taken must be available at recovery."))); - else if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint)) -ereport(FATAL, - (errmsg("WAL ends before end of online backup"), - errhint("Online backup started with pg_start_backup() must be ended with pg_stop_backup(), and all WAL up to that point must be available at recovery."))); - else -ereport(FATAL, - (errmsg("WAL ends before consistent recovery point"))); - } + if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint)) + ereport(FATAL, + (errmsg("WAL ends before end of online backup"), + errhint("Online backup started with pg_start_backup() must be ended with pg_stop_backup(), and all WAL up to that point must be available at recovery."))); + else + ereport(FATAL, + (errmsg("WAL ends before consistent recovery point"))); } /* @@ -8540,7 +8530,6 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record) if (XLByteLT(ControlFile->minRecoveryPoint, lsn)) ControlFile->minRecoveryPoint = lsn; MemSet(ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr)); - ControlFile->backupEndRequired = false; UpdateControlFile(); LWLockRelease(ControlFileLock); @@ -9826,8 +9815,8 @@ read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired) (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE))); /* - * BACKUP METHOD line is new in 9.0. Don't complain if it doesn't exist, - * in case you're restoring from a backup taken with an 9.0 beta version + * BACKUP METHOD line is new in 9.1. Don't complain if it doesn't exist, - * in case you're restoring from a backup taken with an 9.1 beta version * that didn't emit it. 
*/ if (fscanf(lfp, "BACKUP METHOD: %19s", backuptype) == 1) diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h index 6688c19..9600b50 100644 --- a/src/include/catalog/pg_control.h +++ b/src/include/catalog/pg_control.h @@ -137,16 +137,9 @@ typedef struct ControlFileData * we use the redo pointer as a cross-check when we see an end-of-backup * record, to make sure the end-of-backup record corresponds the base * backup we're recovering from. - * - * If backupEndRequired is true, we know for sure that we're restoring - * from a backup, and must see a backup-end record before we can safely - * start up. If it's false, but backupStartPoint is set, a backup_label - * file was found at startup but it may have been a leftover from a stray - * pg_start_backup() call, not accompanied by pg_stop_backup(). */ XLogRecPtr minRecoveryPoint; XLogRecPtr backupStartPoint; - bool backupEndRequired; /* * Parameter settings that determine if the WAL can be used for archival -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription:
Re: [HACKERS] Possible Bug in pg_upgrade
Applied. --------------------------------------------------------------------------- Bruce Momjian wrote: Dave Byrne wrote: Beginning with commit 002c105a0706bd1c1e939fe0f47ecdceeae6c52d pg_upgrade will fail if there are orphaned temp tables in the current database with the message 'old and new databases postgres have a different number of relations' On line 41 of pg_upgrade/info.c pg_upgrade checks that the number of relations is the same but includes orphaned temp tables in the comparison. Is this expected behavior? If so, is there a more detailed error message that can be added to explain the cause of the failure? It wasn't evident until I modified pg_upgrade to show the missing OIDs and recognized them as temp tables with oid2name. Thanks for your report and patch. Let me give some background on pg_upgrade to explain what is happening. Pg_upgrade uses two C arrays to store information about tables and indexes for the old and new clusters. It is not possible to store this information in a database because both clusters are down when pg_upgrade needs to use this information. In pre-9.1 pg_upgrade, pg_upgrade did a sequential scan of the arrays looking for a match between old and new cluster objects. This was reported as slow for databases with many objects, and I could reproduce the slowness. I added some caching in 9.0, but the real solution for 9.1 was to assume a one-to-one mapping between the old and new C arrays, i.e. the 5th entry in the old cluster array is the same as the 5th element in the new cluster array. I knew this was risky but it was the right solution, so it doesn't surprise me you found a failure. pg_upgrade checks that the size of the two arrays is the same and also checks that each element matches --- the former is what generated your error. Now, about the cause. I had not anticipated that orphaned temp objects could exist in either cluster. In fact, this case would have generated an error in 9.0 as well, but with a different error message. 
Looking further, pg_upgrade has to use the same object filter as pg_dump, and pg_dump uses this C test: pg_dump.c: else if (strncmp(nsinfo->dobj.name, "pg_", 3) == 0 || pg_dumpall uses this filter: WHERE spcname !~ '^pg_' The problem is that the filter used by pg_upgrade only excluded pg_catalog, not pg_temp* as well. I have developed the two attached patches, one for 9.0, and the second for 9.1 and 9.2, which will make pg_upgrade match the pg_dump filtering and produce proper results for orphaned temp tables by filtering them. As far as unlogged tables, those are dumped by 9.1/9.2, so there is no need to check relpersistence in this patch. pg_dump doesn't check relistemp either. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + [ text/x-diff is unsupported, treating like TEXT/PLAIN ] diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c new file mode 100644 index 567c64e..ca357e7 *** a/contrib/pg_upgrade/info.c --- b/contrib/pg_upgrade/info.c *** get_rel_infos(migratorContext *ctx, cons *** 326,332 ON c.relnamespace = n.oid LEFT OUTER JOIN pg_catalog.pg_tablespace t ON c.reltablespace = t.oid ! WHERE (( n.nspname NOT IN ('pg_catalog', 'information_schema') AND c.oid = %u ) OR ( n.nspname = 'pg_catalog' --- 326,335 ON c.relnamespace = n.oid LEFT OUTER JOIN pg_catalog.pg_tablespace t ON c.reltablespace = t.oid ! WHERE (( ! /* exclude pg_catalog and pg_temp_ (could be orphaned tables) */ ! n.nspname !~ '^pg_' ! AND n.nspname != 'information_schema' AND c.oid = %u ) OR ( n.nspname = 'pg_catalog' diff --git a/contrib/pg_upgrade/version_old_8_3.c b/contrib/pg_upgrade/version_old_8_3.c new file mode 100644 index 6ca266c..930f76d *** a/contrib/pg_upgrade/version_old_8_3.c --- b/contrib/pg_upgrade/version_old_8_3.c *** old_8_3_check_for_name_data_type_usage(m *** 61,67 NOT a.attisdropped AND a.atttypid = 'pg_catalog.name'::pg_catalog.regtype AND c.relnamespace = n.oid AND !
Re: [HACKERS] mosbench revisited
On Mon, Aug 15, 2011 at 6:22 PM, Greg Stark st...@mit.edu wrote: On Wed, Aug 3, 2011 at 7:21 PM, Robert Haas robertmh...@gmail.com wrote: I'm kind of interested by the result, actually, as I had feared that the spinlock protecting ProcArrayLock was going to be a bigger problem sooner. I think this depends on how many connections you have. If you try to scale up your benchmark by having hundreds of connections then you get an O(n^2) increase in the time spent with the procarray locked. It sounds like they pinned the number of connections at the number of cores they had. That makes sense if they're intentionally driving a cpu-bound benchmark, but it means they won't run into this problem. The experiments that I've done so far seem to indicate that there are actually a couple of different ways for ProcArrayLock to become a problem. First, as you say, if you have a busy read-write workload, you just get plain old LWLock contention. Everyone who wants a snapshot needs the lock in shared mode; everyone who wants to commit needs it in exclusive mode. So the people who are committing block both snapshot-taking and other commits, while the snapshot-takers block commits. If you switch to a read-only workload, ProcArrayLock becomes uncontended, since a transaction without an XID does not need to acquire ProcArrayLock to commit. But if you have enough CPUs (64 will do it), the spinlock protecting ProcArrayLock becomes a bottleneck. However, you can only observe this problem if the working set fits in shared buffers; otherwise, BufFreelistLock contention slows the whole system to a crawl, and the spinlock contention never materializes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] src/backend/storage/ipc/README
$SUBJECT is wildly out-of-date. Is there any point in keeping this, given the large (and actually correct) comment block near the top of sinvaladt.c? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] src/backend/storage/ipc/README
Robert Haas robertmh...@gmail.com writes: $SUBJECT is wildly out-of-date. Is there any point in keeping this, given the large (and actually correct) comment block near the top of sinvaladt.c? Huh, I never noticed that file before. Yeah, it seems useless as it stands. I wonder however if we could repurpose it as a slightly higher-level summary? If you just look at sinval(adt).c, you'll find out all about how inval messages get transmitted, but nothing about why we need them in the first place. There's some material about that over in src/backend/utils/cache/inval.c, which is not the easiest place to find if you're looking at storage/ipc/. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers