Re: [HACKERS] error: could not find pg_class tuple for index 2662

2011-08-15 Thread daveg
[adding back hackers so the thread shows the resolution]

On Sun, Aug 14, 2011 at 07:02:55PM -0400, Tom Lane wrote:
 Sounds good.  Based on my own testing so far, I think that patch will
 probably make things measurably better for you, though it won't resolve
 every corner case.

The most recent catalog vacuums did vacuum full pg_class in 34 databases on
that instance with no new failures. So it looks like the patch fixes it. This
is not conclusive, but it looks very good so far. I'll send an update if I
see any new errors during the week.

Thanks for your help on this. It looks like it has sent you on a merry
search through all the catcache related program activities. I'm assuming
this patch or some improvement on it will show up in a point release.
Meanwhile, if this works as is for a couple more days, we will resume upgrading
the rest of the hosts to 9.0 using this patch.

-dg

-- 
David Gould   da...@sonic.net   510 536 1443   510 282 0869
If simplicity worked, the world would be overrun with insects.



Re: [HACKERS] synchronized snapshots

2011-08-15 Thread Simon Riggs
On Mon, Aug 15, 2011 at 2:31 AM, Joachim Wieland j...@mcknight.de wrote:

 In short, this is how it works:

 SELECT pg_export_snapshot();
  pg_export_snapshot
 --------------------
  03A1-1
 (1 row)


 (and then in a different session)

 BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '03A1-1');

I don't see the need to change the BEGIN command, which is SQL
Standard. We don't normally do that.

If we have pg_export_snapshot() why not pg_import_snapshot() as well?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] synchronized snapshots

2011-08-15 Thread Heikki Linnakangas

On 15.08.2011 04:31, Joachim Wieland wrote:

The one thing that it does not implement is leaving the transaction in
an aborted state if the BEGIN TRANSACTION command failed for an
invalid snapshot identifier.


So what happens if the snapshot is invalid? Is the SNAPSHOT clause
silently ignored? That sounds really bad.



I can certainly see that this would be
useful but I am not sure if it justifies introducing this
inconsistency. We would have a BEGIN TRANSACTION command that left the
session in a different state depending on why it failed...


I don't understand what inconsistency you're talking about. What else 
can cause BEGIN TRANSACTION to fail? Is there currently any failure mode 
that doesn't leave the transaction in aborted state?



I am wondering if pg_export_snapshot() is still the right name, since
the snapshot is no longer exported to the user. It is exported to a
file but that's an implementation detail.


It's still exporting the snapshot to other sessions; that name still
seems appropriate to me.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] synchronized snapshots

2011-08-15 Thread Heikki Linnakangas

On 15.08.2011 10:40, Simon Riggs wrote:

On Mon, Aug 15, 2011 at 2:31 AM, Joachim Wieland j...@mcknight.de wrote:


In short, this is how it works:

SELECT pg_export_snapshot();
  pg_export_snapshot
--------------------
  03A1-1
(1 row)


(and then in a different session)

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '03A1-1');


I don't see the need to change the BEGIN command, which is SQL
Standard. We don't normally do that.

If we have pg_export_snapshot() why not pg_import_snapshot() as well?


It would be a nice symmetry, but you'd need a limitation that
pg_import_snapshot() must be the first thing you do in the session. And 
it might be hard to enforce that, as once you get control into the 
function, you've already acquired another snapshot in the transaction to 
run the SELECT pg_import_snapshot() query with. Specifying the 
snapshot in the BEGIN command makes sense.
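
To make the ordering problem concrete, a sketch using the hypothetical
function name from Simon's mail:

    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    -- This SELECT is itself a query, so by the time the function gets
    -- control the session has already taken a snapshot of its own:
    SELECT pg_import_snapshot('03A1-1');   -- hypothetical, and too late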


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] synchronized snapshots

2011-08-15 Thread Andres Freund
On Monday, August 15, 2011 08:40:34 Simon Riggs wrote:
 On Mon, Aug 15, 2011 at 2:31 AM, Joachim Wieland j...@mcknight.de wrote:
  In short, this is how it works:
  
  SELECT pg_export_snapshot();
   pg_export_snapshot
  --------------------
   03A1-1
  (1 row)
  
  
  (and then in a different session)
  
  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT =
  '03A1-1');
 I don't see the need to change the BEGIN command, which is SQL
 Standard. We don't normally do that.
Uhm. There already are several extensions to BEGIN TRANSACTION, like the
just-added DEFERRABLE.

 If we have pg_export_snapshot() why not pg_import_snapshot() as well?
Using BEGIN has the advantage of making it explicit that it cannot be used 
inside an existing transaction. Which I do find advantageous.

Andres



Re: [HACKERS] Online base backup from the hot-standby

2011-08-15 Thread Jun Ishiduka

  * Not correspond yet

    * full_page_write = off
      - If the primary is full_page_write = off, archive recovery may
        not act normally. Therefore the standby may need to check
        whether full_page_write = off to WAL.

 Isn't having a standby make the full_page_write = on in all case
 (bypass configuration) ?

What's the meaning?



Jun Ishizuka
NTT Software Corporation
TEL:045-317-7018
E-Mail: ishizuka@po.ntts.co.jp






Re: [HACKERS] synchronized snapshots

2011-08-15 Thread PostgreSQL - Hans-Jürgen Schönig

On Aug 15, 2011, at 9:40 AM, Simon Riggs wrote:

 On Mon, Aug 15, 2011 at 2:31 AM, Joachim Wieland j...@mcknight.de wrote:
 
 In short, this is how it works:
 
 SELECT pg_export_snapshot();
  pg_export_snapshot
 --------------------
  03A1-1
 (1 row)
 
 
 (and then in a different session)
 
 BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT = '03A1-1');
 
 I don't see the need to change the BEGIN command, which is SQL
 Standard. We don't normally do that.
 
 If we have pg_export_snapshot() why not pg_import_snapshot() as well?
 
 -- 
  Simon Riggs   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



I would definitely argue for a syntax like the one proposed by Joachim. It
could stay the same if this is turned into some sort of flashback
implementation some day.

regards,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de




Re: [HACKERS] synchronized snapshots

2011-08-15 Thread Florian Weimer
* Simon Riggs:

 I don't see the need to change the BEGIN command, which is SQL
 Standard. We don't normally do that.

Some language bindings treat BEGIN specially, so it might be difficult
to use this feature.

-- 
Florian Weimer   fwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99



[HACKERS] pg_stat_replication vs StandbyReplyMessage

2011-08-15 Thread Magnus Hagander
The pg_stat_replication view exposes all the fields in
StandbyReplyMessage *except* for the timestamp when the message was
generated. On an active system this is not all that interesting, but
on a mostly idle system that allows the monitoring to react faster
than the timeout that actually kicks the other end off - and could be
useful in manual debugging scenarios. Any particular reason why this
was not exposed as its own column?
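
For instance, if it were exposed as a (hypothetical) reply_time column,
a monitoring query could flag stale standbys well before the keepalive
timeout fires:

    SELECT application_name, client_addr,
           now() - reply_time AS reply_age   -- reply_time is hypothetical
      FROM pg_stat_replication
     WHERE now() - reply_time > interval '10 seconds';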

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



[HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Magnus Hagander
I'm trying to make my streaming log receiver work properly with 9.1,
and have come across a couple of things. The first one that's causing
trouble is that the definition of the protocol is currently in
walprotocol.h, which is not include:able in a frontend application.
AFAICT, this is because it includes utils/timestamp.h, which doesn't
work. AFAICT, this means that anybody other than our own backend who
wants to talk our replication protocol has to copy the specific struct
defines they want in their own code. This seems like a really bad
idea. (In my case, it's the StandbyReplyMessage that I need, so I can
make my client not get killed by the default settings for timeout)

The basic reason for this is that we're putting TimestampTz fields in
the protocol. This also means that the protocol actually changes
definition depending on if the server is compiled with integer or
float timestamps. While the replication itself breaks if these are
different, this seems like a bad thing to expose in the protocol. It
also makes life a lot harder on third party tools.

Any ideas on what to do about this, other than me copying the struct
defs and changing them to int64 (which does work in my lab, but seems
like a horrible kludge)?
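
For reference, the kludge amounts to something like the sketch below,
assuming the 9.1 walprotocol.h field layout and hard-wiring TimestampTz
to int64; it would only work against a server built with integer
datetimes:

    #include <stdint.h>

    typedef struct
    {
        uint32_t xlogid;        /* log file #, high half of the position */
        uint32_t xrecoff;       /* byte offset within the log file */
    } XLogRecPtr;

    typedef struct
    {
        XLogRecPtr write;       /* last WAL position written to disk */
        XLogRecPtr flush;       /* last WAL position fsynced to disk */
        XLogRecPtr apply;       /* last WAL position replayed */
        int64_t    sendTime;    /* TimestampTz, hard-wired to int64:
                                 * microseconds since 2000-01-01 */
    } StandbyReplyMessage;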

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] synchronized snapshots

2011-08-15 Thread Joachim Wieland
On Mon, Aug 15, 2011 at 3:47 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 15.08.2011 04:31, Joachim Wieland wrote:

 The one thing that it does not implement is leaving the transaction in
 an aborted state if the BEGIN TRANSACTION command failed for an
 invalid snapshot identifier.

 So what happens if the snapshot is invalid? Is the SNAPSHOT clause silently
 ignored? That sounds really bad.

No, the command would fail, but since it fails, it doesn't change the
transaction state.

What was proposed originally was to start a transaction but throw an
error that leaves the transaction in the aborted state. But then the
command had some effect because it started a transaction block, even
though it failed.


 I can certainly see that this would be
 useful but I am not sure if it justifies introducing this
 inconsistency. We would have a BEGIN TRANSACTION command that left the
 session in a different state depending on why it failed...

 I don't understand what inconsistency you're talking about. What else can
 cause BEGIN TRANSACTION to fail? Is there currently any failure mode that
 doesn't leave the transaction in aborted state?

Granted, it might only fail for parse errors so far, but that would
include for example sending BEGIN DEFERRABLE to a pre-9.1 server. It
wouldn't start a transaction and leave it in an aborted state, but it
would just fail.
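
For instance, the exchange on a pre-9.1 server would look roughly like
this (a sketch of the pre-9.1 behavior, error text approximate):

    BEGIN DEFERRABLE;
    ERROR:  syntax error at or near "DEFERRABLE"
    COMMIT;
    WARNING:  there is no transaction in progress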


 I am wondering if pg_export_snapshot() is still the right name, since
 the snapshot is no longer exported to the user. It is exported to a
 file but that's an implementation detail.

 It's still exporting the snapshot to other sessions, that name still seems
 appropriate to me.

ok.


Joachim



Re: [HACKERS] pg_stat_replication vs StandbyReplyMessage

2011-08-15 Thread Robert Haas
On Mon, Aug 15, 2011 at 7:03 AM, Magnus Hagander mag...@hagander.net wrote:
 The pg_stat_replication view exposes all the fields in
 StandbyReplyMessage *except* for the timestamp when the message was
 generated. On an active system this is not all that interesting, but
 on a mostly idle system that allows the monitoring to react faster
 than the timeout that actually kicks the other end off - and could be
 useful in manual debugging scenarios. Any particular reason why this
 was not exposed as its own column?

I wondered the same thing.  Sounds like a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Online base backup from the hot-standby

2011-08-15 Thread Robert Haas
2011/8/15 Jun Ishiduka ishizuka@po.ntts.co.jp:
  * Not correspond yet

    * full_page_write = off
      - If the primary is full_page_write = off, archive recovery may
        not act normally. Therefore the standby may need to check
        whether full_page_write = off to WAL.

 Isn't having a standby make the full_page_write = on in all case
 (bypass configuration) ?

 what's the meaning?

Yeah.  full_page_writes is a WAL generation parameter.  Standbys don't
generate WAL.  I think you just have to insist that the master has it
on.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] synchronized snapshots

2011-08-15 Thread Joachim Wieland
On Mon, Aug 15, 2011 at 6:41 AM, Florian Weimer fwei...@bfk.de wrote:
 * Simon Riggs:

 I don't see the need to change the BEGIN command, which is SQL
 Standard. We don't normally do that.

 Some language bindings treat BEGIN specially, so it might be difficult
 to use this feature.

It's true, the command might require explicit support from language
bindings. However, I used some Perl test scripts, where you can also
send a START TRANSACTION command in a $dbh->do(...).

The intended use case of this feature is still pg_dump btw...


Joachim



Re: [HACKERS] SSI error messages

2011-08-15 Thread Peter Eisentraut
On lör, 2011-07-16 at 21:55 +0300, Heikki Linnakangas wrote:
 I think I would prefer something like this:
 
 ERROR:  could not serialize access due to read/write dependencies
 among transactions
 DETAIL:  Reason code: %s
 HINT:  The transaction might succeed if retried.

I've done it this way now, except that "Reason code" is not marked for
translation, for the implementation reasons discussed downthread.

I'm looking forward to seeing how many of these we'll see in practice and
how users will react to them.




Re: [HACKERS] synchronized snapshots

2011-08-15 Thread Robert Haas
On Mon, Aug 15, 2011 at 3:51 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 It would be a nice symmetry, but you'd need a limitation that
 pg_import_snapshot() must be the first thing you do in the session. And it
 might be hard to enforce that, as once you get control into the function,
 you've already acquired another snapshot in the transaction to run the
 SELECT pg_import_snapshot() query with. Specifying the snapshot in the
 BEGIN command makes sense.

+1.  Also, I am pretty sure that there are drivers out there, and
connection poolers, that keep track of the transaction state by
watching commands go by.  Right now you can tell by the first word of
the command whether it's something that might change the transaction
state; I wouldn't like to make that harder.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] VACUUM FULL versus TOAST

2011-08-15 Thread Robert Haas
On Sun, Aug 14, 2011 at 7:43 PM, Greg Stark st...@mit.edu wrote:
 On Sun, Aug 14, 2011 at 5:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 There would be some merit in your suggestion if we knew that all/most
 toasted columns would actually get fetched out of the catcache entry
 at some point.  Then we'd only be moving the cost around, and might even
 save something on repeated accesses.  But I don't think we know that.
 In the specific example at hand (pg_statistic entries) it's entirely
 plausible that the planner would only need the histogram, or only need
 the MCV list, depending on the sorts of queries it was coping with.

 Fwiw detoasting statistics entries sounds like a fine idea to me. I've
 often seen queries that are unexpectedly slow to plan and chalked it
 up to statistics entries getting toasted. If it's ok to read either
 the histogram or MCV list from disk every time we plan a query then
 why are we bothering with an in-memory cache of the statistics at all?

 The only thing that gives me pause is that it's possible these entries
 are *really* large. If you have a decent number of tables that are all
 a few megabytes of histograms then things could go poorly. But I don't
 think having to read in these entries from pg_toast every time you
 plan a query is going to go much better for you either.

Yep; in fact, I've previously submitted test results showing that
repeatedly decompressing TOAST entries can significantly slow down
query planning.

That having been said, Tom's fix seems safer to back-patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Peter Geoghegan
On 15 August 2011 12:22, Magnus Hagander mag...@hagander.net wrote:
 The basic reason for this is that we're putting TimestampTz fields in
 the protocol. This also means that the protocol actually changes
 definition depending on if the server is compiled with integer or
 float timestamps.

Without commenting on what should be done in your specific case, I
wonder whether it's time to fully retire the deprecated double
representation of timestamps. Is anyone actually expected to rely on
their availability when 9.2 is released? This also caused difficulties
for Johann Oskarsson recently, during work on PL/Java.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 I'm trying to make my streaming log receiver work properly with 9.1,
 and have come across a couple of things. The first one that's causing
 trouble is that the definition of the protocol is currently in
 walprotocol.h, which is not include:able in a frontend application.
 AFAICT, this is because it includes utils/timestamp.h, which doesn't
 work. AFAICT, this means that anybody other than our own backend who
 wants to talk our replication protocol has to copy the specific struct
 defines they want in their own code. This seems like a really bad
 idea. (In my case, it's the StandbyReplyMessage that I need, so I can
 make my client not get killed by the default settings for timeout)

 The basic reason for this is that we're putting TimestampTz fields in
 the protocol. This also means that the protocol actually changes
 definition depending on if the server is compiled with integer or
 float timestamps. While the replication itself breaks if these are
 different, this seems like a bad thing to expose in the protocol. It
 also makes life a lot harder on third party tools.

I don't really see why it matters, given that the data to be shipped is
also dependent on the timestamp format?

However, for a narrow fix, I could see moving the data type definition
to someplace with fewer dependencies.  Perhaps split it into a separate
file timestamp_type.h, or something like that.

regards, tom lane



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Magnus Hagander
On Mon, Aug 15, 2011 at 16:20, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 I'm trying to make my streaming log receiver work properly with 9.1,
 and have come across a couple of things. The first one that's causing
 trouble is that the definition of the protocol is currently in
 walprotocol.h, which is not include:able in a frontend application.
 AFAICT, this is because it includes utils/timestamp.h, which doesn't
 work. AFAICT, this means that anybody other than our own backend who
 wants to talk our replication protocol has to copy the specific struct
 defines they want in their own code. This seems like a really bad
 idea. (In my case, it's the StandbyReplyMessage that I need, so I can
 make my client not get killed by the default settings for timeout)

 The basic reason for this is that we're putting TimestampTz fields in
 the protocol. This also means that the protocol actually changes
 definition depending on if the server is compiled with integer or
 float timestamps. While the replication itself breaks if these are
 different, this seems like a bad thing to expose in the protocol. It
 also makes life a lot harder on third party tools.

 I don't really see why it matters, given that the data to be shipped is
 also dependent on the timestamp format?

As an example, the stream receiver program needs to be able to
construct the status message and send it back - thus it needs to be
able to construct a TimestampTz. It never needs to look into the actual WAL
data - it just writes it to a file considering it to be a stream of
pure binary data.


 However, for a narrow fix, I could see moving the data type definition
 to someplace with fewer dependencies.  Perhaps split it into a separate
 file timestamp_type.h, or something like that.

Yes, that seems to fix the problem of timestamptz. See the attached
patch - seems ok?


I also ran into a similar problem with some WAL macro definitions that
are in xlog_internal.h. I've moved them to xlogdefs.h in the attached
xlog.diff file. Does that seem ok as well, or should I move them
somewhere else? (I don't need all those macros, but I moved the
complete block of macros wherever I needed one of them, because
otherwise it would be completely impossible to track). This is just a
simple move of the definitions, zero new logic added and zero removed.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
index 9414667..cdba21b 100644
--- a/src/include/replication/walprotocol.h
+++ b/src/include/replication/walprotocol.h
@@ -13,7 +13,7 @@
 #define _WALPROTOCOL_H
 
 #include "access/xlogdefs.h"
-#include "utils/timestamp.h"
+#include "utils/timestamp_defs.h"
 
 
 /*
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 9e51b58..1a5ea0d 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -19,86 +19,7 @@
 
 #include "fmgr.h"
 #include "pgtime.h"
-#ifdef HAVE_INT64_TIMESTAMP
-#include "utils/int8.h"
-#endif
-
-/*
- * Timestamp represents absolute time.
- *
- * Interval represents delta time. Keep track of months (and years), days,
- * and hours/minutes/seconds separately since the elapsed time spanned is
- * unknown until instantiated relative to an absolute time.
- *
- * Note that Postgres uses "time interval" to mean a bounded interval,
- * consisting of a beginning and ending time, not a time span - thomas 97/03/20
- *
- * We have two implementations, one that uses int64 values with units of
- * microseconds, and one that uses double values with units of seconds.
- *
- * TimeOffset and fsec_t are convenience typedefs for temporary variables
- * that are of different types in the two cases.  Do not use fsec_t in values
- * stored on-disk, since it is not the same size in both implementations.
- * Also, fsec_t is only meant for *fractional* seconds; beware of overflow
- * if the value you need to store could be many seconds.
- */
-
-#ifdef HAVE_INT64_TIMESTAMP
-
-typedef int64 Timestamp;
-typedef int64 TimestampTz;
-typedef int64 TimeOffset;
-typedef int32 fsec_t;			/* fractional seconds (in microseconds) */
-#else
-
-typedef double Timestamp;
-typedef double TimestampTz;
-typedef double TimeOffset;
-typedef double fsec_t;			/* fractional seconds (in seconds) */
-#endif
-
-typedef struct
-{
-	TimeOffset	time;			/* all time units other than days, months and
- * years */
-	int32		day;			/* days, after time for alignment */
-	int32		month;			/* months and years, after time for alignment */
-} Interval;
-
-
-#define MAX_TIMESTAMP_PRECISION 6
-#define MAX_INTERVAL_PRECISION 6
-
-/* in both timestamp.h and ecpg/dt.h */
-#define DAYS_PER_YEAR	365.25	/* assumes leap year every four years */
-#define MONTHS_PER_YEAR 12
-/*
- *	DAYS_PER_MONTH is very imprecise.  The more accurate value is
- *	365.2425/12 = 30.436875, or '30 days 10:29:06'.  Right now we only
- *	return an 

Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 On Mon, Aug 15, 2011 at 16:20, Tom Lane t...@sss.pgh.pa.us wrote:
 However, for a narrow fix, I could see moving the data type definition
 to someplace with fewer dependencies.  Perhaps split it into a separate
 file timestamp_type.h, or something like that.

 Yes, that seems to fix the problem of timestamptz. See the attached
 patch - seems ok?

Don't think you should expose fsec_t, nor most of those macros.  The
foo_per_bar values are just namespace clutter.

 I also ran into a similar problem with some WAL macro definitions that
 are in xlog_internal.h. I've moved them to xlogdefs.h in the attached
 xlog.diff file. Does that seem ok as well, or should I move them
 somewhere else?

I don't like the idea of exposing those to frontends, either.  What do
you actually *need* out of that, and why?

regards, tom lane



Re: [HACKERS] synchronized snapshots

2011-08-15 Thread Kevin Grittner
Simon Riggs si...@2ndquadrant.com wrote:
 Joachim Wieland j...@mcknight.de wrote:
 
 BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (SNAPSHOT =
 '03A1-1');
 
 I don't see the need to change the BEGIN command, which is SQL
 Standard.
 
No, it's not standard.
 
To quote from our docs at:
 
http://www.postgresql.org/docs/9.0/interactive/sql-begin.html#AEN58214
 
| BEGIN is a PostgreSQL language extension. It is equivalent to the
| SQL-standard command START TRANSACTION, whose reference page
| contains additional compatibility information.
| 
| Incidentally, the BEGIN key word is used for a different purpose
| in embedded SQL. You are advised to be careful about the
| transaction semantics when porting database applications. 
 
In checking the most recent standards draft I have available, it
appears that besides embedded SQL, this keyword is also used in the
standard trigger declaration syntax.  Using BEGIN to start a
transaction is a PostgreSQL extension to the standard.  That said,
if we support a feature on the nonstandard BEGIN statement, we
typically add it as an extension to the standard START TRANSACTION
and SET TRANSACTION statements.  Through 9.0 that consisted of
having a non-standard default for isolation level and the ability to
omit commas required by the standard.  In 9.1 we added another
optional transaction property which defaults to standard behavior:
DEFERRABLE.
 
If we're talking about a property of a transaction, like the
transaction snapshot, it seems to me to be best to support it using
the same statements we use for other transaction properties.
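 
For illustration, with hypothetical syntax, that would look like:
 
    START TRANSACTION ISOLATION LEVEL REPEATABLE READ, SNAPSHOT '03A1-1';
    -- or, before the first query of an already-started transaction:
    SET TRANSACTION SNAPSHOT '03A1-1';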
 
-Kevin



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Magnus Hagander
On Mon, Aug 15, 2011 at 16:53, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 On Mon, Aug 15, 2011 at 16:20, Tom Lane t...@sss.pgh.pa.us wrote:
 However, for a narrow fix, I could see moving the data type definition
 to someplace with fewer dependencies.  Perhaps split it into a separate
 file timestamp_type.h, or something like that.

 Yes, that seems to fix the problem of timestamptz. See the attached
 patch - seems ok?

 Don't think you should expose fsec_t, nor most of those macros.  The
 foo_per_bar values are just namespace clutter.

Hmm, ok. I just went for what seemed like a reasonable subset. I do
also need the SECS_PER_DAY and those constants in order to be able to
convert the timestamp value. That led me to include the other
foo_per_bar so they are all in one place - seems wrong to have them
split up.



 I also ran into a similar problem with some WAL macro definitions that
 are in xlog_internal.h. I've moved them to xlogdefs.h in the attached
 xlog.diff file. Does that seem ok as well, or should I move them
 somewhere else?

 I don't like the idea of exposing those to frontends, either.  What do
 you actually *need* out of that, and why?

PrevLogSeg - which also brings in XLogSegsPerFile.
XLogFileName
XLogFileSize - which brings in XLogSegsPerFile and XLogSegSize

PrevLogSeg should be self explaining, as should xlogfilename.

XLogFileSize is required by XLByteAdvance() which is already in
xlogdefs.h - so xlogdefs.h actually has a hidden dependency on
xlog_internal.h here today.

I can certainly separate those out, but it seemed cleaner to move
the whole block they were in.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Steve Singer

On 11-08-15 10:00 AM, Peter Geoghegan wrote:


Without commenting on what should be done in your specific case, I
wonder whether it's time to fully retire the deprecated double
representation of timestamps. Is anyone actually expected to rely on
their availability when 9.2 is released? This also caused difficulties
for Johann Oskarsson recently, during work on PL/Java.


This would mean that anyone using the floating point timestamps today 
won't be able to use pg_upgrade to upgrade to whichever version we 
remove them from.  8.3 had float based timestamps as the default and I 
suspect many installations with the default 8.3 settings have been 
upgraded via pg_upgrade to a 9.0 built with the old timestamp representation.


Steve




Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Heikki Linnakangas

On 15.08.2011 18:54, Magnus Hagander wrote:

On Mon, Aug 15, 2011 at 16:53, Tom Lane t...@sss.pgh.pa.us wrote:

Magnus Hagander mag...@hagander.net writes:

I also ran into a similar problem with some WAL macro definitions that
are in xlog_internal.h. I've moved them to xlogdefs.h in the attached
xlog.diff file. Does that seem ok as well, or should I move them
somewhere else?


I don't like the idea of exposing those to frontends, either.  What do
you actually *need* out of that, and why?


PrevLogSeg - which also brings in XLogSegsPerFile.
XLogFileName
XLogFileSize - which brings in XLogSegsPerFile and XLogSegSize

PrevLogSeg should be self explaining, as should xlogfilename.

XLogFileSize is required by XLByteAdvance() which is already in
xlogdefs.h - so xlogdefs.h actually has a hidden dependency on
xlog_internal.h here today.

I can certainly separate those out, but it seemed more clean to move
the whole block they were in.


Perhaps we should change the protocol so that it explicitly says which 
file the streamed piece of WAL belongs to. That way a client could write 
it to the correct file without knowing about all those macros.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Perhaps we should change the protocol so that it explicitly says which 
 file the streamed piece of WAL belongs to. That way a client could write 
 it to the correct file without knowing about all those macros.

Yeah, maybe.  Otherwise, all that logic is not only implicitly part of
the protocol, but essentially impossible to change because it'll be
hard-coded into clients.  I wasn't too worried about this before because
I supposed that only walsender and walreceiver really needed to know
anything about it.  If Magnus is busy coding something else that speaks
that protocol, we have a problem.  (Assuming we believe his use-case is
sane, that is.)

regards, tom lane



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Magnus Hagander
On Mon, Aug 15, 2011 at 18:12, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Perhaps we should change the protocol so that it explicitly says which
 file the streamed piece of WAL belongs to. That way a client could write
 it to the correct file without knowing about all those macros.

 Yeah, maybe.  Otherwise, all that logic is not only implicitly part of
 the protocol, but essentially impossible to change because it'll be
 hard-coded into clients.  I wasn't too worried about this before because
 I supposed that only walsender and walreceiver really needed to know

Wouldn't doing that cause significant overhead? Every single packet
would have to contain the filename, since the WAL stream doesn't
depend on the filenames for where it cuts off. Also, the way it is
now a single packet on the replication stream can span multiple WAL
files... (This is the bug in my previous version that I've been able
to fix now)

 anything about it.  If Magnus is busy coding something else that speaks
 that protocol, we have a problem.  (Assuming we believe his use-case is
 sane, that is.)

I am. And said program has existed for almost a year, IIRC. It has
even been on a commitfest for 9.1, but had a bug and I didn't have
time to fix it (it still worked in a lot of cases, but had corner
cases). And a lot of different people have requested said
functionality, so at least they believe it's a use case (the use case
being able to stream your backups at a smaller interval than 16MB).


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Peter Geoghegan
On 15 August 2011 16:56, Steve Singer ssin...@ca.afilias.info wrote:
 This would mean that anyone using the floating point timestamps today won't
 be able to use pg_upgrade to upgrade to whichever version we remove them
 from.  8.3 had float based timestamps as the default and I suspect many
 installations with the default 8.3 settings have been upgraded via
 pg_upgrade to 9.0 built the old timestamps representation.

Really? I find that slightly surprising, considering that a quick look
at master's timestamp.c suggests that the choice to use the int64
representation over the double representation is entirely a case of
compile time either/or. There is no apparent fall-back to the double
representation available to binaries built without
--disable-integer-datetimes.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 On Mon, Aug 15, 2011 at 16:53, Tom Lane t...@sss.pgh.pa.us wrote:
 Don't think you should expose fsec_t, nor most of those macros.  The
 foo_per_bar values are just namespace clutter.

 Hmm, ok. I just went for what seemed like a reasonable subset. I do
 also need the SECS_PER_DAY and those constants in order to be able to
 convert the timestamp value.

Uh ... convert?  That seems to require a whole lot more knowledge
about timestamps than what is exposed by the proposed new header.
I thought you were just hoping to compile a struct containing a
TimestampTz field, not actually do anything with that field.

 I don't like the idea of exposing those to frontends, either.  What do
 you actually *need* out of that, and why?

 PrevLogSeg - which also brings in XLogSegsPerFile.
 XLogFileName
 XLogFileSize - which brings in XLogSegsPerFile and XLogSegSize

 PrevLogSeg should be self explaining, as should xlogfilename.

 XLogFileSize is required by XLByteAdvance() which is already in
 xlogdefs.h - so xlogdefs.h actually has a hidden dependency on
 xlog_internal.h here today.

Yeah, and the question remains why is frontend code using any of that?

I'm inclined to think that there is zero chance of this code being
"frontend" in the sense of being at any real arm's length from backend
implementation details, and so you should just forget these header
maneuvers and do what pg_resetxlog.c does, ie

/*
 * We have to use postgres.h not postgres_fe.h here, because there's so much
 * backend-only stuff in the XLOG include files we need.  But we need a
 * frontend-ish environment otherwise.  Hence this ugly hack.
 */
#define FRONTEND 1

#include "postgres.h"


regards, tom lane



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Magnus Hagander
On Mon, Aug 15, 2011 at 18:40, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 On Mon, Aug 15, 2011 at 16:53, Tom Lane t...@sss.pgh.pa.us wrote:
 Don't think you should expose fsec_t, nor most of those macros.  The
 foo_per_bar values are just namespace clutter.

 Hmm, ok. I just went for what seemed like a reasonable subset. I do
 also need the SECS_PER_DAY and those constants in order to be able to
 convert the timestamp value.

 Uh ... convert?  That seems to require a whole lot more knowledge
 about timestamps than what is exposed by the proposed new header.
 I thought you were just hoping to compile a struct containing a
 TimestampTz field, not actually do anything with that field.

Um, bad choice of words. I don't need to convert, really, but I do
need to stick the current timestamp into a TimestampTz field.

 I don't like the idea of exposing those to frontends, either.  What do
 you actually *need* out of that, and why?

 PrevLogSeg - which also brings in XLogSegsPerFile.
 XLogFileName
 XLogFileSize - which brings in XLogSegsPerFile and XLogSegSize

 PrevLogSeg should be self explaining, as should xlogfilename.

 XLogFileSize is required by XLByteAdvance() which is already in
 xlogdefs.h - so xlogdefs.h actually has a hidden dependency on
 xlog_internal.h here today.

 Yeah, and the question remains why is frontend code using any of that?

 I'm inclined to think that there is zero chance of this code being
 frontend in the sense of being at any real arm's length from backend
 implementation details, and so you should just forget these header
 maneuvers and do what pg_resetxlog.c does, ie

 /*
  * We have to use postgres.h not postgres_fe.h here, because there's so much
  * backend-only stuff in the XLOG include files we need.  But we need a
  * frontend-ish environment otherwise.  Hence this ugly hack.
  */
 #define FRONTEND 1

 #include "postgres.h"

That does seem to work, yes. I completely forgot about that hack. And
I agree that's probably the way to do it...

Now I just need to figure out how to get GetCurrentTimestamp() to
be accessible - or just copy it over under a different name...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-15 Thread Kevin Grittner
Greg Smith g...@2ndquadrant.com wrote:
 
 Anyway, I think every idea thrown out here so far needs about an
 order of magnitude more types of benchmarking test cases before it
 can be evaluated at all.
 
Right.  I'm very excited about all the optimizations going in, and
can't see where the ones committed so far are very likely to cause
problems for workloads other than those tested, but here we're
heading into territory where the performance farm is very
desperately needed.
 
That said, I'll throw out one more idea, as something that worked
very well for me in a similar situation, and played well with
external caching which knew lots about the storage media but nothing
about the database side.  It's very similar to the clock sweep
algorithm, but uses a weighted average rather than a simple count. 
It didn't tend to leave as many buffers clustered at one end or the
other, and the gradation did seem to be useful.
 
I started out this post trying to describe it more generally, but it
all came out sounding too vague and hand-wavy; so I'll give a
description with more particulars which could, of course, be tweaked
without tossing the concept.  This particular description modifies
the techniques I've used to try to better fit the particulars of
PostgreSQL.  It may fall down on contention issues, but since it
benchmarked better for me than the alternatives, I thought it might
at least help spin off other useful ideas.
 
(1)  Each buffer would have a one byte count, incremented on access,
similar to the current count.  Of course, 255 would not wrap on
access, but be left unchanged.
 
(2)  We would have 256 int counters for how many buffers were
available with each access count.  These would be maintained when
the access counts changed.
 
(3)  While sweeping, we wouldn't decrement the access counters for
buffers we couldn't use; rather, there would be a boolean flag to
say whether to divide the access counts for non-chosen buffers by
two.
 
(4)  The reduce flag would be set when any access count goes to
255, and cleared after one complete sweep through the buffers.  (So,
obviously, we note the location when we set the flag.)
 
(5)  To find the next buffer to evict, we would find out what is the
lowest count in use, and sweep forward until we found one.
 
(6)  We would have a background process doing this sweeping and
count reduction which would try to keep a few evicted buffers in a
queue for quick pickup.  This queue would be the only source for
getting a buffer.  Getting a buffer would always be a very
lightweight operation if there's something in the queue.  The
background task would try very hard to keep the queue non-empty,
only coming to rest when the queue was full -- whatever that was
chosen to mean (possibly based on the sort of adaptive calculations
currently used by the background writer).
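 
Steps (1) through (5) compress into a fairly small amount of C. A
sketch, ignoring locking, buffer pins, and the step-(6) queue; all
names are illustrative, and it assumes at least one buffer is in use:
 
    #include <stdint.h>
    #include <stdbool.h>
 
    #define NBUFFERS 16384
 
    static uint8_t access_count[NBUFFERS];  /* (1) per-buffer usage count */
    static int     nwith_count[256];        /* (2) buffers per count value */
    static bool    reduce = false;          /* (3) halve counts this sweep? */
    static int     reduce_start = 0;        /* where the reduce sweep began */
    static int     hand = 0;                /* clock-sweep position */
 
    /* Move a buffer from one count bucket to another. */
    static void
    set_count(int buf, uint8_t newcount)
    {
        nwith_count[access_count[buf]]--;
        nwith_count[newcount]++;
        access_count[buf] = newcount;
    }
 
    /* (1): bump the count on access, saturating at 255. */
    static void
    buffer_accessed(int buf)
    {
        if (access_count[buf] < 255)
        {
            set_count(buf, access_count[buf] + 1);
            if (access_count[buf] == 255 && !reduce)
            {
                reduce = true;          /* (4): set the reduce flag ... */
                reduce_start = hand;    /* ... noting where the sweep is */
            }
        }
    }
 
    /* (5): sweep forward to the next buffer holding the lowest count in
     * use; the caller resets the victim's count when it refills the
     * buffer. */
    static int
    choose_victim(void)
    {
        int lowest = 0;
 
        while (nwith_count[lowest] == 0)
            lowest++;
 
        for (;;)
        {
            int buf = hand;
 
            hand = (hand + 1) % NBUFFERS;
            if (access_count[buf] == lowest)
                return buf;
            if (reduce)
            {
                /* (3): halve, rather than decrement, non-chosen buffers */
                set_count(buf, access_count[buf] / 2);
                if (hand == reduce_start)
                    reduce = false;     /* (4): one complete sweep done */
            }
        }
    }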
 
There are a lot of interesting ways this could interact with the
background writer.  One of the more intriguing to me would be for
the sweep process to estimate what current access count levels
will need to be used on its next sweep and queue up any at those
levels which are dirty for the background writer.  This is vaguely
conceptually similar to the wash marker used in some LRU lists.
 
-Kevin



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Steve Singer

On 11-08-15 12:33 PM, Peter Geoghegan wrote:

On 15 August 2011 16:56, Steve Singer ssin...@ca.afilias.info wrote:

This would mean that anyone using the floating point timestamps today won't
be able to use pg_upgrade to upgrade to whichever version we remove them
from.  8.3 had float based timestamps as the default and I suspect many
installations with the default 8.3 settings have been upgraded via
pg_upgrade to a 9.0 built with the old timestamp representation.


Really? I find that slightly surprising, considering that a quick look
at master's timestamp.c suggests that the choice to use the int64
representation over the double representation is entirely a case of
compile time either/or. There is no apparent fall-back to the double
representation available to binaries built without
--disable-integer-datetimes.



I *think* the default on 8.3 was float based timestamps.  If you want
to upgrade a system running 8.3 (that uses float based timestamps)
using pg_upgrade, you must compile 9.0 (or 8.4 or 9.1) with
--disable-integer-datetimes.  If at some point in the future you then
want to upgrade to 9.2 with pg_upgrade, you will again need to build
9.2 with --disable-integer-datetimes.  If we remove the code for
floating point representations of datetime then you won't be able to
do that.



Steve



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Tom Lane
Steve Singer ssin...@ca.afilias.info writes:
 On 11-08-15 12:33 PM, Peter Geoghegan wrote:
 On 15 August 2011 16:56, Steve Singer ssin...@ca.afilias.info wrote:
 This would mean that anyone using the floating point timestamps today won't
 be able to use pg_upgrade to upgrade to whichever version we remove them
 from.  8.3 had float based timestamps as the default and I suspect many
 installations with the default 8.3 settings have been upgraded via
 pg_upgrade to a 9.0 built with the old timestamp representation.

 Really?

 I *think* the default on 8.3 was float based timestamps.

Yeah, it was.  Also, the fact that the source distribution switched
defaults at 8.4 has not got a lot to do with what individual binary
distributions did --- some switched earlier, some may still not have
done so.  And of course individual users building from source were
and still are free to make their own choice.

It's going to be quite a long time before we can contemplate removing
float timestamp support altogether.

regards, tom lane



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Peter Geoghegan
On 15 August 2011 18:09, Steve Singer ssin...@ca.afilias.info wrote:
 Really? I find that slightly surprising, considering that a quick look
 at master's timestamp.c suggests that the choice to use the int64
 representation over the double representation is entirely a case of
 compile time either/or. There is no apparent fall-back to the double
 representation available to binaries built without
 --disable-integer-datetimes.


 I *think* the default on 8.3 was float based timestamps.

Yes, it was.

 If you want to upgrade a system running 8.3 (that uses float based 
 timestamps) using
 pg_upgrade you must compile 9.0 (or 8.4 or 9.1) with
 --disable-integer-datetimes.  If at some point in the future you then want
 to upgrade to 9.2 with pg_upgrade you will again need to build 9.2 with
 --disable-integer-datetimes.    If we remove the code for floating point
 representations of datetime then you won't be able to do that.

I'm pretty surprised that pg_upgrade pushes that onus onto its users -
for many users, the need to build their own binaries is a greater
barrier to upgrading than doing a logical restore. Maybe that's simply
considered a matter for package managers to worry about, but that
doesn't sit well with me.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Simon Riggs
On Mon, Aug 15, 2011 at 5:15 PM, Magnus Hagander mag...@hagander.net wrote:
 On Mon, Aug 15, 2011 at 18:12, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Perhaps we should change the protocol so that it explicitly says which
 file the streamed piece of WAL belongs to. That way a client could write
 it to the correct file without knowing about all those macros.

 Yeah, maybe.  Otherwise, all that logic is not only implicitly part of
 the protocol, but essentially impossible to change because it'll be
 hard-coded into clients.  I wasn't too worried about this before because
 I supposed that only walsender and walreceiver really needed to know

 Wouldn't doing that cause significant overhead? Every single packet
 would have to contain the filename, since the wal stream isn't
 depending on the filenames for where it cuts off. Also, the way it is
 now a single packet on the replication stream can span multiple WAL
 files... (This is the bug in my previous version that I've been able
 to fix now)

Why not have a specific protocol message to indicate a change of filename?

I don't mean a WAL message, I mean a streaming protocol message.

At present the WALSender only sends from one file at a time, so
sending a message when we open a new file would be straightforward.

Magnus needs some things to make this work for 9.0/9.1, but we don't
need to change 9.2 to allow that, we just need to copy the definitions
for those now-fixed releases.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



[HACKERS] Should we have an optional limit on the recursion depth of recursive CTEs?

2011-08-15 Thread Peter Geoghegan
Should we have an optional, disabled-by-default limit on the
recursion/iteration depth of recursive CTEs to guard against stupid
queries that loop ad infinitum?

I've looked at other database systems that support WITH RECURSIVE
queries, and this idea crops up there. For example, Firebird, the only
other RDBMS that I cared to look at for reasons you can perhaps guess,
has a hard limit of 1024 (though you could argue that that's a
limitation of their implementation, and I'd agree). Maybe the
proprietary databases like SQL server have similar, perhaps even
optional/adjustable limits - I don't know because I didn't check.

I'd suggest that an appropriate interface would be an int GUC with a
GucContext of PGC_SUSET, so that DBAs can impose system-wide limits.

A possible use of such a GUC is to zero in on the actual recursion
depth of the rCTE with the greatest depth in a given query, by
performing a git bisect style binary search, setting the GUC
dynamically at each step. It's probably not worth having a proper
interface to do that with, but I can imagine that being a useful trick
in certain narrow situations.

We could also add a similar GUC that can be separately set by
unprivileged users, that independently limits the recursion depth per
session. This could be used as a sort of assertion of the maximum
recursion depth of a given query.
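
For illustration, the failure mode in question, followed by the manual
guard that is already possible per-query today:

    -- Loops ad infinitum: every iteration emits another row.
    WITH RECURSIVE r(n) AS (
        SELECT 1
        UNION ALL
        SELECT n FROM r
    )
    SELECT * FROM r;

    -- The existing workaround: carry an explicit depth column.
    WITH RECURSIVE r(n, depth) AS (
        SELECT 1, 0
        UNION ALL
        SELECT n, depth + 1 FROM r WHERE depth < 1024
    )
    SELECT * FROM r;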

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Compressing the AFTER TRIGGER queue

2011-08-15 Thread Jim Nasby
On Aug 4, 2011, at 8:40 AM, Kevin Grittner wrote:
 Robert Haas robertmh...@gmail.com wrote:
 On Wed, Aug 3, 2011 at 6:05 PM, Jim Nasby j...@nasby.net wrote:
 Not sure how much this relates to this discussion, but I have
 often wished we had AFTER FOR EACH STATEMENT triggers that
 provided OLD and NEW recordsets you could make use of. Sometimes
 it's very valuable to be able to look at *all* the rows that
 changed in a transaction in one shot.
 
 Yeah, that would be awesome.  I think some of our competitors
 provide exactly that feature...
 
 If I remember correctly, MS SQL Server and Sybase ASE provide
 INSERTED and DELETED relations in triggers instead of NEW and OLD
 records.  In a FOR EACH ROW trigger the relation contains only one
 row.
 
 This is related to the thread on BEFORE triggers, in that these
 products require that you UPDATE the row in the base table to modify
 it (normally by joining to the INSERTED relation), making the latest
 values available to other trigger code, and providing a clear
 distinction between the values coming in to the trigger and the
 latest values in the database.

Yeah, that sounds like how DB2 does it (though the relations were by default 
named NEW and OLD).

If there is agreement on having a NEW recordset that contains all the new
records (or new values on updates) and an OLD recordset that contains all the
old records, are there any other API issues that need to be resolved? How would
this actually be implemented?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Should we have an optional limit on the recursion depth of recursive CTEs?

2011-08-15 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 Should we have an optional, disabled-by-default limit on the
 recursion/iteration depth of recursive CTEs to guard against stupid
 queries that loop ad infinitum?

I think not ...

 I'd suggest that an appropriate interface would be an int GUC with a
 GucContext of PGC_SUSET, so that DBAs can impose system-wide limits.

... and that would be a seriously bad API.  There are not SUSET
restrictions on other resources such as work_mem.  Why do we need
one for this?

By and large, this sounds like a solution looking for a problem.

regards, tom lane



Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Magnus Hagander
On Mon, Aug 15, 2011 at 21:12, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, Aug 15, 2011 at 5:15 PM, Magnus Hagander mag...@hagander.net wrote:
 On Mon, Aug 15, 2011 at 18:12, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Perhaps we should change the protocol so that it explicitly says which
 file the streamed piece of WAL belongs to. That way a client could write
 it to the correct file without knowing about all those macros.

 Yeah, maybe.  Otherwise, all that logic is not only implicitly part of
 the protocol, but essentially impossible to change because it'll be
 hard-coded into clients.  I wasn't too worried about this before because
 I supposed that only walsender and walreceiver really needed to know

 Wouldn't doing that cause significant overhead? Every single packet
 would have to contain the filename, since the WAL stream doesn't
 depend on the filenames for where it cuts off. Also, the way it is
 now a single packet on the replication stream can span multiple WAL
 files... (This is the bug in my previous version that I've been able
 to fix now)

 Why not have a specific protocol message to indicate a change of filename?

 I don't mean a WAL message, I mean a streaming protocol message.

That we could have, and it would work as long as it's always sent as
the first packet in any stream.

If we do that, we should probably make it a general metadata packet -
including things like segment size as well...


 At present the WALSender only sends from one file at a time, so
 sending a message when we open a new file would be straightforward.

Are you sure? We can receive a single message spanning multiple files...


 Magnus needs some things to make this work for 9.0/9.1, but we don't
 need to change 9.2 to allow that, we just need to copy the definitions
 for those now-fixed releases.

The hack from pg_resetxlog with a manual #define FRONTEND 1 will work
for current releases, I think. And it will work for future ones as
well - but you'll be in a position where a log streaming tool built
with different parameters (like the xlog segment size) than the server
will not work - I'm not sure that's a problem big enough to worry
about, though...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Should we have an optional limit on the recursion depth of recursive CTEs?

2011-08-15 Thread Greg Stark
On Mon, Aug 15, 2011 at 9:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ... and that would be a seriously bad API.  There are not SUSET
 restrictions on other resources such as work_mem.  Why do we need
 one for this?

I think a better analogy would be imposing a maximum number of rows a
query can output. That might be a sane thing to have for some
circumstances but it's not useful in general.

Consider for instance my favourite recursive query application,
displaying the lock dependency graph for pg_locks. What arbitrary
maximum number of locks would you like to impose at which the query
should error out?
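
(For reference, a minimal sketch of that sort of query; the lock matching
here is abbreviated, a real version would compare all of pg_locks'
lock-identity columns:

WITH RECURSIVE edges AS (
    -- waiter -> holder pairs contending for the same lock
    SELECT w.pid AS waiter, h.pid AS holder
      FROM pg_locks w
      JOIN pg_locks h
        ON h.granted AND NOT w.granted
       AND h.locktype = w.locktype
       AND h.relation IS NOT DISTINCT FROM w.relation
       AND h.transactionid IS NOT DISTINCT FROM w.transactionid
), chains(waiter, holder, depth) AS (
    SELECT waiter, holder, 1 FROM edges
    UNION ALL
    SELECT c.waiter, e.holder, c.depth + 1
      FROM chains c
      JOIN edges e ON e.waiter = c.holder   -- follow the wait-for graph
)
SELECT * FROM chains;

Note that a wait-for cycle would make this query itself recurse without
bound, which rather proves the point.)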

There is a situation, though, that I think is motivating this, where
it would be nice to detect a problem: when the query is such
that it *can't* produce a record because there's an infinite loop
before the first record. Ideally you want some way to detect that
you've recursed and haven't changed anything that could lead to a
change in the recursion condition. But that seems like a pretty hard
thing to detect, probably impossible.
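
A made-up minimal case of that shape: the working table never changes from
one iteration to the next, so the recursion never terminates, and the
aggregate on top of it waits forever for a first row:

WITH RECURSIVE r(n) AS (
    SELECT 1
    UNION ALL
    SELECT n FROM r    -- nothing ever changes, so no convergence
)
SELECT count(*) FROM r;   -- never produces its single row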


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Dimitri Fontaine
Magnus Hagander mag...@hagander.net writes:
 Why not have a specific protocol message to indicate a change of filename?
 I don't mean a WAL message, I mean a streaming protocol message.

 That we could have, and it would work as long as it's always sent as
 the first packet in any stream.

 If we do that, we should probably make it a general metadata packet -
 including things like segment size as well...

I think this could be made an opt-in: the streaming client would have to
ask for it when connecting to the walsender (or maybe at any point in
the connection stream, I don't know).

Then it can even be backported, and it won't add anything to core
clients streaming data.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only scans

2011-08-15 Thread Jim Nasby
On Aug 13, 2011, at 4:31 PM, Heikki Linnakangas wrote:
 The example is much more realistic if the query is a fetch of N latest rows 
 from a table. Very common use case, and the whole relation's visibility 
 statistics are completely wrong for that query.
 
 That is somewhat compensated by the fact that tuples that are accessed more 
 often are also more likely to be in cache. Fetching the heap tuple to check 
 visibility is very cheap when the tuple is in cache.
 
 I'm not sure how far that compensates it, though. I'm sure there's typically 
 nevertheless a fairly wide range of pages that have been modified since the 
 last vacuum, but not in cache anymore.

http://xkcd.org/937/ :)

Could something be added to pg_stats that tracks visibility map usefulness on a 
per-attribute basis? Perhaps another set of stats buckets that show visibility 
percentages for each stats bucket?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] walprotocol.h vs frontends

2011-08-15 Thread Simon Riggs
On Mon, Aug 15, 2011 at 10:32 PM, Magnus Hagander mag...@hagander.net wrote:

 At present the WALSender only sends from one file at a time, so
 sending a message when we open a new file would be straightforward.

 Are you sure? We can receive a single message spanning multiple files...

You're right. That was the way the original code ran but I thought we
had stopped that when we introduced MAX_SEND_SIZE. The comment says
"don't cross a logfile boundary within one message". What that
actually does is prevent us incrementing a logid value, which happens
every 255 files. I read that as meaning "WAL file", which is not what
it means at all.

So right now what we do is allow a single packet to span multiple
files, but since MAX_SEND_SIZE is 128KB it will always be smaller than
a single file, so we can only ever span two files at most.

That is all just a little bizarre, especially since libpq sends data
in 8KB chunks anyway.

So I suggest we change XLogSend() so that it only ever sends up to the
end of a file. That way all 'w' messages will relate to just one file,
and we can have another message to initiate a new file. And then, as
you say, give full metadata for the new file.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] synchronized snapshots

2011-08-15 Thread Jim Nasby
On Aug 15, 2011, at 6:23 AM, Joachim Wieland wrote:
 On Mon, Aug 15, 2011 at 3:47 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 On 15.08.2011 04:31, Joachim Wieland wrote:
 
 The one thing that it does not implement is leaving the transaction in
 an aborted state if the BEGIN TRANSACTION command failed for an
 invalid snapshot identifier.
 
 So what if the snapshot is invalid, the SNAPSHOT clause silently ignored?
 That sounds really bad.
 
 No, the command would fail, but since it fails, it doesn't change the
 transaction state.
 
 What was proposed originally was to start a transaction but throw an
 error that leaves the transaction in the aborted state. But then the
 command had some effect because it started a transaction block, even
 though it failed.

It certainly seems safer to me to set the transaction to an aborted state; you 
were expecting a set of commands to run with one snapshot, but if we don't 
abort the transaction they'll end up running anyway and doing so with the 
*wrong* snapshot. That could certainly lead to data corruption.

I suspect that all the other cases of BEGIN failing would be syntax errors, so 
you would immediately know in testing that something was wrong. A missing file 
is definitely not a syntax error, so we can't really depend on user testing to 
ensure this is handled correctly. IMO, that makes it critical that that error 
puts us in an aborted transaction.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mosbench revisited

2011-08-15 Thread Greg Stark
On Wed, Aug 3, 2011 at 7:21 PM, Robert Haas robertmh...@gmail.com wrote:
  I'm kind of interested by the
 result, actually, as I had feared that the spinlock protecting
 ProcArrayLock was going to be a bigger problem sooner.

I think this depends on how many connections you have. If you try to
scale up your benchmark by having hundreds of connections then you get
an O(n^2) increase in the time spent with the procarray locked. It sounds
like they pinned the number of connections at the number of cores they
had. That makes sense if they're intentionally driving a cpu-bound
benchmark but it means they won't run into this problem.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should we have an optional limit on the recursion depth of recursive CTEs?

2011-08-15 Thread Peter Geoghegan
On 15 August 2011 21:31, Tom Lane t...@sss.pgh.pa.us wrote:
 I'd suggest that an appropriate interface would be an int GUC with a
 GucContext of PGC_SUSET, so that DBAs can impose system-wide limits.

 ... and that would be a seriously bad API.  There are not SUSET
 restrictions on other resources such as work_mem.  Why do we need
 one for this?

I think that there perhaps /should/ be optional SUSET restrictions on
those resources, particularly work_mem (though I'd suggest a more
sophisticated interface there) - I haven't argued for that because,
respectfully, I already know that to do so would be pretty close to
futile. I have argued for this because I think that an important
distinction can be drawn that might convince those who'd reject the
idea of poor man's admission control.

The distinction is that the only way that we'll ever be able to guard
against this sort of failure is with an approach that is essentially
equivalent to my proposal - stop trying after some arbitrary number of
some unit of work. I'm sure that you don't need me to tell you that it
has already been proven that solving the halting problem is
impossible. What you may not be aware of is the fact that a proof
exists for PG rCTE's Turing completeness. Consequently, I think that
solving the halting problem is the barrier to coming up with
something fundamentally better.
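
(Today that kind of guard has to be hand-rolled into each query, e.g. a
sketch like:

WITH RECURSIVE t(n, depth) AS (
    SELECT 1, 1
    UNION ALL
    SELECT n, depth + 1 FROM t
     WHERE depth < 10000    -- explicit per-query recursion limit
)
SELECT count(*) FROM t;

which is exactly the "arbitrary number of units of work" idea, just with no
way for a DBA to impose it globally.)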

I don't think that your scepticism about the general need to have such
protection is justified; I believe that there is plenty of anecdotal
evidence out there that this is useful in larger commercial contexts,
and I've already named some places where a person might look for such
anecdotal evidence.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] synchronized snapshots

2011-08-15 Thread Joachim Wieland
On Mon, Aug 15, 2011 at 6:09 PM, Jim Nasby j...@nasby.net wrote:
 I suspect that all the other cases of BEGIN failing would be syntax errors, so
 you would immediately know in testing that something was wrong. A missing file
 is definitely not a syntax error, so we can't really depend on user testing 
 to ensure
 this is handled correctly. IMO, that makes it critical that that error puts 
 us in an
 aborted transaction.

Why can we not just require the user to verify if his BEGIN query
failed or succeeded? Is that really too much to ask for?

Also see what Robert wrote about proxies in between that keep track of
the transaction state. Consider they see a BEGIN query that fails. How
would they know if the session is now in an aborted transaction or not
in a transaction at all?
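
To sketch the proxy's dilemma (the error text here is made up):

BEGIN;            -- with the proposed snapshot clause attached
ERROR:  invalid snapshot identifier
-- If BEGIN had no effect, the next statement runs as its own
-- transaction with a fresh snapshot.
-- If BEGIN opened an aborted transaction block, every statement
-- fails until a ROLLBACK is issued.

The proxy sees the same failed BEGIN either way.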


Joachim

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-15 Thread Jim Nasby
On Aug 13, 2011, at 3:40 PM, Greg Stark wrote:
 It does kind of seem like your numbers indicate we're missing part of
 the picture though. The idea with the clock sweep algorithm is that
 you keep approximately 1/nth of the buffers with each of the n values.
 If we're allowing nearly all the buffers to reach a reference count of
 5 then you're right that we've lost any information about which
 buffers have been referenced most recently.

One possible missing piece here is that OS clock-sweeps depend on the clock 
hand to both increment and decrement the usage count. The hardware sets a bit 
any time a page is accessed; as the clock sweeps it increases usage count if 
the bit is set and decreases it if it's clear. I believe someone else in the 
thread suggested this, and I definitely think it's worth an experiment. 
Presumably this would also ease some lock contention issues.

There is another piece that might be relevant... many (most?) OSes keep 
multiple lists of pages. FreeBSD for example contains these page lists 
(http://www.freebsd.org/doc/en/articles/vm-design/article.html). Full 
description follows, but I think the biggest take-away is that there is a 
difference in how pages are handled once they are no longer active based on 
whether the page is dirty or not.

Active: These pages are actively in use and are not currently under 
consideration for eviction. This is roughly equivalent to all of our buffers 
with a usage count of 5.

When an active page's usage count drops to its minimum value, it will get 
unmapped from process space and moved to one of two queues:

Inactive: DIRTY pages that are eligible for eviction once they've been written 
out.

Cache: CLEAN pages that may be immediately reclaimed

Free: A small set of pages that are basically the tail of the Cache list. The 
OS *must* maintain some pages on this list to support memory needed during 
interrupt handling. The size of this list is typically kept very small, and I'm 
not sure if non-interrupt processing will pull from this list.

It's important to note that the OS can pull a page back out of the Inactive and 
Cache lists back into Active very cheaply.

I think there are two interesting points here. First: after a page has been 
determined to no longer be in active use it goes into inactive or cache based 
on whether it's dirty. ISTM that allows for much better scheduling of the 
flushing of dirty pages. That said; I'm not sure how much that would help us 
due to checkpoint requirements.

Second: AFAIK only the Active list has a clock sweep. I believe the others are 
LRU (the mentioned URL refers to them as queues). I believe this works well 
because if a page faults it just needs to be removed from whichever queue it is 
in, added to the Active queue, and mapped back into process space.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only scans

2011-08-15 Thread Greg Smith

On 08/11/2011 04:06 PM, Robert Haas wrote:

On my laptop, the first query executes in about 555 ms, while the
second one takes about 1125 ms...I expect that you could get
an even larger benefit from this type of query if you could avoid
actual disk I/O, rather than just buffer cache thrashing, but I
haven't come up with suitable test cases for that yet (volunteers?).


That part I can help with, using a Linux test that kills almost every 
cache. I get somewhat faster times on my desktop here running the cached 
version like you were doing (albeit with debugging options on, so I 
wouldn't read too much into this set of numbers):


select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
     sum
--------------
 250279412983
(1 row)

Time: 472.778 ms

                                 QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=133325.00..133325.01 rows=1 width=4)
   ->  Nested Loop Semi Join  (cost=0.00..133075.00 rows=10 width=4)
         ->  Seq Scan on sample_data a1  (cost=0.00..15286.00 rows=10 width=4)
         ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts a  (cost=0.00..1.17 rows=1 width=4)
               Index Cond: (aid = a1.aid)
               Filter: (aid <> 1234567)

select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
     sum
--------------
 250279412983
(1 row)

Time: 677.902 ms
explain select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
                                 QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=133325.00..133325.01 rows=1 width=4)
   ->  Nested Loop Semi Join  (cost=0.00..133075.00 rows=10 width=4)
         ->  Seq Scan on sample_data a1  (cost=0.00..15286.00 rows=10 width=4)
         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts a  (cost=0.00..1.17 rows=1 width=4)
               Index Cond: (aid = a1.aid)
               Filter: (bid <> 1234567)

If I set up my gsmith account to be able to start and stop the server 
with pg_ctl and a valid PGDATA, and drop these two scripts in that home 
directory:


== index-only-1.sql ==

\timing
select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);

explain select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);

== index-only-2.sql ==

\timing
select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);

explain select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);

I can then run this script as root:

#!/bin/bash
ME=gsmith
su - $ME -c "pg_ctl stop -w"
echo 3 > /proc/sys/vm/drop_caches
su - $ME -c "pg_ctl start -w"
su - $ME -c "psql -ef index-only-1.sql"
su - $ME -c "pg_ctl stop -w"
echo 3 > /proc/sys/vm/drop_caches
su - $ME -c "pg_ctl start -w"
su - $ME -c "psql -ef index-only-2.sql"

And get results that start with zero information cached in RAM, showing 
a much more dramatic difference.  I'm including some snippets of interesting 
vmstat output too; the index-only one gets faster as it runs while the regular 
one is pretty flat:


select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
Time: 31677.683 ms

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy id wa
 0  1    0 15807288   4388  126440    0    0  4681   118 1407 2432  1  1 89 10
 1  1    0 15779388   4396  154448    0    0  3587    17 1135 2058  1  0 86 13
 0  1    0 15739956   4396  193672    0    0  5800     0 1195 2056  1  0 87 12
 0  1    0 15691844   4396  241832    0    0  7053     3 1299 2044  1  0 86 13
 0  1    0 15629736   4396  304096    0    0  7995    37 1391 2053  1  0 87 12
 0  1    0 15519244   4400  414268    0    0 11639    14 1448 2189  1  0 87 12


select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
Time: 172381.235 ms

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy id wa
 0  1    0 15736500   4848  196444    0    0  3142    22 1092 1989  1  0 86 13
 0  1    0 15711948   4852  221072    0    0  3411     1 1039 1943  0  0 88 12
 0  1    0 15685412   4852  247496    0    0  3618    34      1997  0  0 86 13

[this is the middle part, rate doesn't vary too much]

That's 5.4X as fast; not too shabby!  Kind of interesting how 
different the I/O pattern is on the index-only version.

Re: [HACKERS] index-only scans

2011-08-15 Thread Robert Haas
On Mon, Aug 15, 2011 at 7:37 PM, Greg Smith g...@2ndquadrant.com wrote:
 That's 5.4X as fast; not too shabby!

Awesome!

 And with the large difference in response time, things appear to be working
 as hoped even in this situation.  If you try this on your laptop, where
 drive cache size and random I/O are likely to be even slower, you might see
 an even larger difference.

Hah!  Just in case a 5.4X performance improvement isn't good enough?  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] VACUUM FULL versus relcache init files

2011-08-15 Thread Tom Lane
This might be the last bug from my concurrent-vacuum-full testing --- at
least, I have no remaining unexplained events from about two full days
of running the tests.  The ones that are left involve backends randomly
failing like this:

psql: FATAL:  could not read block 0 in file "base/130532080/130545668": read 
only 0 of 8192 bytes

usually during startup, though I have one example of a backend being
repeatedly unable to access pg_proc due to similar errors.

I believe what this traces to is stale relfilenode information taken
from relcache init files, which contain precomputed relcache data
intended to speed up backend startup.  There is a curious little dance
between write_relcache_init_file and RelationCacheInitFileInvalidate
that is intended to ensure that when a process creates a new relcache
init file that is already stale (because someone else invalidated the
information concurrently), the bad file will get unlinked and not used.
I think I invented that logic, so it's my fault that it doesn't work.

It works fine as long as you consider only the two processes directly
involved; but a third process can get fooled into using stale data.
The scenario requires two successive invalidations, as for example from
vacuum full on two system catalogs in a row, plus a stream of incoming
new backends.  It goes like this:

1. Process A vacuums a system catalog, unlinks init file, sends sinval
messages, does second unlink (which does nothing).

2. Process B starts, observes lack of init file, begins to construct
a new one.  It gets to the end of AcceptInvalidationMessages in
write_relcache_init_file.  Since it started after A sent the first
sinvals, it sees no incoming sinval messages and has no reason to think
its new init file isn't good.

3. Meanwhile, Process A vacuums another system catalog, unlinks init
file (doing nothing), and finally sends its sinval messages just after
B looked for them.  Now it will block trying to get RelCacheInitLock,
which B holds.

4. Process B renames its already-stale init file into place, then
releases RelCacheInitLock.

5. Process A gets the lock and removes the stale init file.

Now process B is okay, because it will see A's second sinvals before it
tries to make any use of the relcache data it has.  And the stale init
file is definitely gone after step 5.

However, between steps 4 and 5 there is a window for Process C to start,
read the stale init file, and attempt to use it.  Since C started after
A's second set of sinval messages, it doesn't see them and doesn't know
it has stale data.

As far as I can see at the moment, the only way to make this bulletproof
is to turn both creation and deletion of the init file into atomic
operations that include sinval messaging.  What I have in mind is

Creator: must take RelCacheInitLock, check for incoming invals, rename
the new file into place if none, release RelCacheInitLock.  (This is
the same as what it does now.)

Destroyer: must take RelCacheInitLock, unlink the init file, send its
sinvals, release RelCacheInitLock.

This guarantees that we serialize the sending of the sinval messages
so that anyone who sees a bad init file in place *must* see the sinval
messages afterwards, so long as they join the sinval messaging ring
before looking for the init file (which they do).  I don't think it's
any worse than the current scheme from a parallelism point of view: the
destroyer is holding RelCacheInitLock a bit longer than before, but that
should not be a performance critical situation.

Anybody see a hole in that?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enforcing that all WAL has been replayed after restoring from backup

2011-08-15 Thread Fujii Masao
On Thu, Aug 11, 2011 at 1:34 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Hmm, that's not possible for the 'tar' output, but would work for 'dir'
 output. Another similar idea would be to withhold the control file in memory
 until the end of backup, and append it to the output as last. The backup
 can't be restored until the control file is written out.

 That won't protect from more complicated scenarios, like if you take the
 backup without the -x flag, and copy some but not all of the required WAL
 files manually to the pg_xlog directory. But it'd be much better than
 nothing for 9.1.

We need to skip checking whether we've reached the end-of-backup location
only when the server crashes while a base backup started with
pg_start_backup() is in progress. Right?
We can do this by *not* initializing ControlFile-backupStartPoint if the server
is doing crash recovery and backupEndRequired is false. What about the attached
patch?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 11035e6..d0d68d4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6329,11 +6329,8 @@ StartupXLOG(void)
 		/*
 		 * set backupStartPoint if we're starting recovery from a base backup
 		 */
-		if (haveBackupLabel)
-		{
+		if ((InArchiveRecovery && haveBackupLabel) || backupEndRequired)
			ControlFile->backupStartPoint = checkPoint.redo;
-			ControlFile->backupEndRequired = backupEndRequired;
-		}
		ControlFile->time = (pg_time_t) time(NULL);
 		/* No need to hold ControlFileLock yet, we aren't up far enough */
 		UpdateControlFile();
@@ -6703,20 +6700,13 @@ StartupXLOG(void)
 		 * crashes while an online backup is in progress. We must not treat
 		 * that as an error, or the database will refuse to start up.
 		 */
-		if (InArchiveRecovery || ControlFile->backupEndRequired)
-		{
-			if (ControlFile->backupEndRequired)
-				ereport(FATAL,
-						(errmsg("WAL ends before end of online backup"),
-						 errhint("All WAL generated while online backup was taken must be available at recovery.")));
-			else if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
-				ereport(FATAL,
-						(errmsg("WAL ends before end of online backup"),
-						 errhint("Online backup started with pg_start_backup() must be ended with pg_stop_backup(), and all WAL up to that point must be available at recovery.")));
-			else
-				ereport(FATAL,
-					  (errmsg("WAL ends before consistent recovery point")));
-		}
+		if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
+			ereport(FATAL,
+					(errmsg("WAL ends before end of online backup"),
+					 errhint("Online backup started with pg_start_backup() must be ended with pg_stop_backup(), and all WAL up to that point must be available at recovery.")));
+		else
+			ereport(FATAL,
+					(errmsg("WAL ends before consistent recovery point")));
 	}
 
 	/*
@@ -8540,7 +8530,6 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
			if (XLByteLT(ControlFile->minRecoveryPoint, lsn))
				ControlFile->minRecoveryPoint = lsn;
			MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
-			ControlFile->backupEndRequired = false;
 			UpdateControlFile();
 
 			LWLockRelease(ControlFileLock);
@@ -9826,8 +9815,8 @@ read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
								 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
 	/*
-	 * BACKUP METHOD line is new in 9.0. Don't complain if it doesn't exist,
-	 * in case you're restoring from a backup taken with an 9.0 beta version
+	 * BACKUP METHOD line is new in 9.1. Don't complain if it doesn't exist,
+	 * in case you're restoring from a backup taken with an 9.1 beta version
 	 * that didn't emit it.
 	 */
	if (fscanf(lfp, "BACKUP METHOD: %19s", backuptype) == 1)
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 6688c19..9600b50 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -137,16 +137,9 @@ typedef struct ControlFileData
 	 * we use the redo pointer as a cross-check when we see an end-of-backup
 	 * record, to make sure the end-of-backup record corresponds the base
 	 * backup we're recovering from.
-	 *
-	 * If backupEndRequired is true, we know for sure that we're restoring
-	 * from a backup, and must see a backup-end record before we can safely
-	 * start up. If it's false, but backupStartPoint is set, a backup_label
-	 * file was found at startup but it may have been a leftover from a stray
-	 * pg_start_backup() call, not accompanied by pg_stop_backup().
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	XLogRecPtr	backupStartPoint;
-	bool		backupEndRequired;
 
 	/*
 	 * Parameter settings that determine if the WAL can be used for archival

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Re: [HACKERS] Possible Bug in pg_upgrade

2011-08-15 Thread Bruce Momjian

Applied.

---

Bruce Momjian wrote:
 Dave Byrne wrote:
  Beginning with commit 002c105a0706bd1c1e939fe0f47ecdceeae6c52d
  pg_upgrade will fail if there are orphaned temp tables in the current
  database with the message 'old and new databases postgres have a
  different number of relations'
  
  On line 41 of pg_upgrade/info.c pg_upgrade checks that the number of
  relations is the same but includes orphaned temp tables in the comparison.
  
  Is this expected behavior?  If so, is there a more detailed error message
  that can be added to explain the cause of the failure? It wasn't evident
  until I modified pg_upgrade to show the missing oids and recognized them
  as temp tables with oid2name.
 
 Thanks for your report and patch.
 
 Let me give some background on pg_upgrade to explain what is happening. 
 Pg_upgrade uses two C arrays to store information about tables and
 indexes for the old and new clusters.  It is not possible to store this
 information in a database because both clusters are down when pg_upgrade
 needs to use this information.
 
 In pre-9.1 pg_upgrade, pg_upgrade did a sequential scan of the arrays
 looking for a match between old and new cluster objects.  This was
 reported as slow for databases with many objects, and I could reproduce
 the slowness.  I added some caching in 9.0 but the real solution for 9.1
 was to assume a one-to-one mapping between the old and new C arrays,
 i.e. the 5th entry in the old cluster array is the same as the 5th
 element in the new cluster array.
 
 I knew this was risky but was the right solution so it doesn't surprise
 me you found a failure.  pg_upgrade checks that the size of the two
 arrays is the same and also checks that each element matches --- the
 former is what generated your error.
 
 Now, about the cause.  I had not anticipated that orphaned temp objects
 could exist in either cluster.  In fact, this case would have generated
 an error in 9.0 as well, but with a different error message.
 
 Looking futher, pg_upgrade has to use the same object filter as
 pg_dump, and pg_dump uses this C test:
 
   pg_dump.c:  else if (strncmp(nsinfo->dobj.name, "pg_", 3) == 0 ||
 
 pg_dumpall uses this filter:
 
 WHERE spcname !~ '^pg_' 
 
 The problem is that the filter used by pg_upgrade only excluded
 pg_catalog, not pg_temp* as well.
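 
 (For anyone hitting this, a quick way to check a cluster for such orphans
 before upgrading -- a sketch, not part of the patch:
 
     SELECT n.nspname, c.relname
       FROM pg_catalog.pg_class c
       JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
      WHERE n.nspname ~ '^pg_temp_';
 
 Any rows returned while no other sessions are connected are relations
 living in leftover, orphaned temp schemas.)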
 
 I have developed the attached two patches, one for 9.0, and the second
 for 9.1 and 9.2 which will make pg_upgrade now match the pg_dump
 filtering and produce proper results for orphaned temp tables by
 filtering them.
 
 As far as unlogged tables go, those are dumped by 9.1/9.2, so there is no
 need to check relpersistence in this patch.  pg_dump doesn't check
 relistemp either.
 
 -- 
   Bruce Momjian  br...@momjian.ushttp://momjian.us
   EnterpriseDB http://enterprisedb.com
 
   + It's impossible for everything to be true. +


 diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c
 new file mode 100644
 index 567c64e..ca357e7
 *** a/contrib/pg_upgrade/info.c
 --- b/contrib/pg_upgrade/info.c
 *************** get_rel_infos(migratorContext *ctx, cons
 *** 326,332 ****
   			 "		ON c.relnamespace = n.oid "
   			 "	LEFT OUTER JOIN pg_catalog.pg_tablespace t "
   			 "		ON c.reltablespace = t.oid "
 ! 			 "	WHERE	(( n.nspname NOT IN ('pg_catalog', 'information_schema') "
   			 "	AND c.oid >= %u "
   			 "	) OR ( "
   			 "	n.nspname = 'pg_catalog' "
 --- 326,335 ----
   			 "		ON c.relnamespace = n.oid "
   			 "	LEFT OUTER JOIN pg_catalog.pg_tablespace t "
   			 "		ON c.reltablespace = t.oid "
 ! 			 "	WHERE	(( "
 ! 			 /* exclude pg_catalog and pg_temp_ (could be orphaned tables) */
 ! 			 "	  n.nspname !~ '^pg_' "
 ! 			 "	  AND n.nspname != 'information_schema' "
   			 "	AND c.oid >= %u "
   			 "	) OR ( "
   			 "	n.nspname = 'pg_catalog' "
 diff --git a/contrib/pg_upgrade/version_old_8_3.c 
 b/contrib/pg_upgrade/version_old_8_3.c
 new file mode 100644
 index 6ca266c..930f76d
 *** a/contrib/pg_upgrade/version_old_8_3.c
 --- b/contrib/pg_upgrade/version_old_8_3.c
 *************** old_8_3_check_for_name_data_type_usage(m
 *** 61,67 ****
   				"		NOT a.attisdropped AND "
   				"		a.atttypid = 'pg_catalog.name'::pg_catalog.regtype AND "
   				"		c.relnamespace = n.oid AND "
 ! 

Re: [HACKERS] mosbench revisited

2011-08-15 Thread Robert Haas
On Mon, Aug 15, 2011 at 6:22 PM, Greg Stark st...@mit.edu wrote:
 On Wed, Aug 3, 2011 at 7:21 PM, Robert Haas robertmh...@gmail.com wrote:
  I'm kind of interested by the
 result, actually, as I had feared that the spinlock protecting
 ProcArrayLock was going to be a bigger problem sooner.

 I think this depends on how many connections you have. If you try to
 scale up your benchmark by having hundreds of connections then you get
 an O(n^2) increase in the time spent with the procarray locked. It sounds
 like they pinned the number of connections at the number of cores they
 had. That makes sense if they're intentionally driving a cpu-bound
 benchmark but it means they won't run into this problem.

The experiments that I've done so far seem to indicate that there are
actually a couple of different ways for ProcArrayLock to become a
problem.

First, as you say, if you have a busy read-write workload, you just
get plain old LWLock contention.  Everyone who wants a snapshot needs
the lock in shared mode; everyone who wants to commit needs it in
exclusive mode.  So, the people who are committing block both
snapshot-taking and other commits; while the snapshot-takers block
commit.

If you switch to a read-only workload, ProcArrayLock becomes
uncontended, since a transaction without an XID does not need to
acquire ProcArrayLock to commit.  But if you have enough CPUs (64 will
do it), the spinlock protecting ProcArrayLock becomes a bottleneck.
However, you can only observe this problem if the working set fits in
shared buffers; otherwise, BufFreelistLock contention slows the whole
system to a crawl, and the spinlock contention never materializes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] src/backend/storage/ipc/README

2011-08-15 Thread Robert Haas
$SUBJECT is wildly out-of-date.  Is there any point in keeping this,
given the large (and actually correct) comment block near the top of
sinvaladt.c?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/backend/storage/ipc/README

2011-08-15 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 $SUBJECT is wildly out-of-date.  Is there any point in keeping this,
 given the large (and actually correct) comment block near the top of
 sinvaladt.c?

Huh, I never noticed that file before.  Yeah, it seems useless as it
stands.  I wonder however if we could repurpose it as a slightly
higher-level summary?  If you just look at sinval(adt).c, you'll find
out all about how inval messages get transmitted, but nothing about why
we need them in the first place.  There's some material about that over
in src/backend/utils/cache/inval.c, which is not the easiest place to
find if you're looking at storage/ipc/.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers