Re: [HACKERS] [BUGS] BUG #5487: dblink failed with 63 bytes connection names

2010-06-02 Thread Takahiro Itagaki

Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:

 Hmm, seems that dblink should call truncate_identifier() for the 
 truncation, to be consistent with truncation of table names etc.

Hmmm, we need the same routine as truncate_identifier(), but it's hard
to use that function directly because it modifies the input buffer in place.
Since all of the name strings in dblink are const char *, I added
a slightly modified version of the function, truncate_identifier_copy(),
in the attached v2 patch.
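
For illustration, the shape I have in mind is roughly this (a sketch only,
using the existing scansup/mb helpers; the actual patch may differ in
details such as the NOTICE wording):

static char *
truncate_identifier_copy(const char *ident, bool warn)
{
    int         len = strlen(ident);
    char       *result;

    if (len >= NAMEDATALEN)
    {
        len = pg_mbcliplen(ident, len, NAMEDATALEN - 1);
        if (warn)
            ereport(NOTICE,
                    (errcode(ERRCODE_NAME_TOO_LONG),
                     errmsg("identifier \"%s\" will be truncated to \"%.*s\"",
                            ident, len, ident)));
    }

    /* return a palloc'd, possibly-truncated copy; the input is untouched */
    result = palloc(len + 1);
    memcpy(result, ident, len);
    result[len] = '\0';
    return result;
}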

 I also spotted this in dblink.c:
 
  /* first gather the server connstr options */
  if (strlen(servername) < NAMEDATALEN)
  foreign_server = GetForeignServerByName(servername, true);
 
 I think that's wrong. We normally consistently truncate identifiers at 
 creation and at use, so that if you create an object with a very long 
 name and it's truncated, you can still refer to it with the untruncated 
 name because all such references are truncated too.

Absolutely. I re-use the added function for the fix.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center



dblink_63bytes-2010602.patch
Description: Binary data



Re: [HACKERS] Synchronization levels in SR

2010-06-02 Thread Greg Smith

Heikki Linnakangas wrote:
The possibilities are endless... Your proposal above covers a pretty 
good set of scenarios, but it's by no means complete. If we try to 
solve everything the configuration will need to be written in a 
Turing-complete Replication Description Language. We'll have to pick a 
useful, easy-to-understand subset that covers the common scenarios. To 
handle the more exotic scenarios, you can write a proxy that sits in 
front of the master, and implements whatever rules you wish, with the 
rules written in C.


I was thinking about this a bit recently.  As I see it, there are three 
fundamental parts of this:


1) We have a transaction that is being committed.  The rest of the 
computations here are all relative to it.


2) There is an (internal?) table that lists the state of each 
replication target relative to that transaction.  It would include the 
node name, perhaps some metadata ('location' seems the one that's most 
likely to help with the remote data center issue), and a state code.  
The codes from http://wiki.postgresql.org/wiki/Streaming_Replication 
work fine for the last part (which is the only dynamic one--everything 
else is static data being joined against):


async=hasn't received yet
recv=been received but just in RAM
fsync=received and synced to disk
apply=applied to the database

These would need to be enums so they can be ordered from lesser to 
greater consistency.
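
In C terms that might look something like this (names invented here just to
illustrate the ordering; not actual code):

typedef enum StandbySyncState
{
    SYNC_ASYNC = 0,             /* hasn't received yet */
    SYNC_RECV,                  /* received, but only in RAM */
    SYNC_FSYNC,                 /* received and synced to disk */
    SYNC_APPLY                  /* applied to the database */
} StandbySyncState;

/* "at least fsync quality" is then a simple ordered comparison */
#define SyncStateAtLeast(state, level)  ((state) >= (level))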


So in a 3 node case, the internal state table might look like this after 
a bit of data had been committed:


node | location | state
-----------------------
a    | local    | fsync
b    | remote   | recv
c    | remote   | async

This means that the local node has a fully persistent copy, but the best 
either remote one has done is received the data, it's not on disk at all 
yet at the remote data center.  Still working its way through.


3) The decision about whether the data has been committed to enough 
places to be considered safe by the master is computed by a function 
that is passed this internal table as something like a SRF, and it 
returns a boolean.  Once that returns true, saying it's satisfied, the 
transaction closes on the master and continues to percolate out from 
there.  If it's false, we wait for another state change to come in and 
return to (2).


I would propose that most behaviors someone has expressed as being their 
desired implementation are possible to implement using this scheme. 

-Semi-sync commit:  proceed as soon as somebody else has a copy and hope 
the copies all become consistent:  EXISTS WHERE state>=recv
-Don't proceed until there's a fsync'd commit on at least one of the 
remote nodes:  EXISTS WHERE location='remote' AND state>=fsync
-Look for a quorum of n commits of fsync quality:  CASE WHEN (SELECT 
COUNT(*) WHERE state>=fsync) > n THEN true ELSE false END;


Syntax is obviously rough but I think you can get the drift of what I'm 
suggesting.


While exposing the local state and running this computation isn't free, 
in situations where there truly are remote nodes in here being 
communicated with, the network overhead is going to dwarf that.  If there 
were a fast path for the simplest cases and this complicated one for the 
rest, I think you could get the fully programmable behavior some people 
want using simple SQL, rather than having to write a new Replication 
Description Language or something so ambitious.  This data about what's 
been replicated to where looks an awful lot like a set of rows you can 
operate on using features already in the database to me.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




[HACKERS] obsolete comments in xlog.c

2010-06-02 Thread Fujii Masao
Hi,

I found some obsolete comments in xlog.c, which are related to
recently-deleted variable fromArchive.


diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index c886571..65675b9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2609,8 +2609,8 @@ XLogFileOpen(uint32 log, uint32 seg)
 /*
  * Open a logfile segment for reading (during recovery).
  *
- * If fromArchive is true, the segment is retrieved from archive, otherwise
- * it's read from pg_xlog.
+ * If source = XLOG_FROM_ARCHIVE, the segment is retrieved from archive.
+ * If source = XLOG_FROM_PG_XLOG, it's read from pg_xlog.
  */
 static int
 XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
@@ -2673,8 +2673,6 @@ XLogFileRead(uint32 log, uint32 seg, int emode,
TimeLineID tli,
  * Open a logfile segment for reading (during recovery).
  *
  * This version searches for the segment with any TLI listed in expectedTLIs.
- * If not in StandbyMode and fromArchive is true, the segment is also
- * searched in pg_xlog if not found in archive.
  */
 static int
 XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)


Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Synchronization levels in SR

2010-06-02 Thread Heikki Linnakangas

On 02/06/10 10:22, Greg Smith wrote:

Heikki Linnakangas wrote:

The possibilities are endless... Your proposal above covers a pretty
good set of scenarios, but it's by no means complete. If we try to
solve everything the configuration will need to be written in a
Turing-complete Replication Description Language. We'll have to pick a
useful, easy-to-understand subset that covers the common scenarios. To
handle the more exotic scenarios, you can write a proxy that sits in
front of the master, and implements whatever rules you wish, with the
rules written in C.


I was thinking about this a bit recently. As I see it, there are three
fundamental parts of this:

1) We have a transaction that is being committed. The rest of the
computations here are all relative to it.


Agreed.


So in a 3 node case, the internal state table might look like this after
a bit of data had been committed:

node | location | state
-----------------------
a    | local    | fsync
b    | remote   | recv
c    | remote   | async

This means that the local node has a fully persistent copy, but the best
either remote one has done is received the data, it's not on disk at all
yet at the remote data center. Still working its way through.

3) The decision about whether the data has been committed to enough
places to be considered safe by the master is computed by a function
that is passed this internal table as something like a SRF, and it
returns a boolean. Once that returns true, saying it's satisfied, the
transaction closes on the master and continues to percolate out from
there. If it's false, we wait for another state change to come in and
return to (2).


You can't implement "wait for X to ack the commit, but if that doesn't 
happen in Y seconds, time out and return true anyway" with that.



While exposing the local state and running this computation isn't free,
in situations where there truly are remote nodes in here being
communicated with, the network overhead is going to dwarf that. If there
were a fast path for the simplest cases and this complicated one for the
rest, I think you could get the fully programmable behavior some people
want using simple SQL, rather than having to write a new Replication
Description Language or something so ambitious. This data about what's
been replicated to where looks an awful lot like a set of rows you can
operate on using features already in the database to me.


Yeah, if we want to provide full control over when a commit is 
acknowledged to the client, there's certainly no reason we can't expose 
that using a hook or something.


It's pretty scary to call a user-defined function at that point in 
transaction. Even if we document that you must refrain from doing nasty 
stuff like modifying tables in that function, it's still scary.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] obsolete comments in xlog.c

2010-06-02 Thread Heikki Linnakangas

On 02/06/10 10:39, Fujii Masao wrote:

I found some obsolete comments in xlog.c, which are related to
recently-deleted variable fromArchive.


Thanks, committed.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Synchronization levels in SR

2010-06-02 Thread Simon Riggs
On Wed, 2010-06-02 at 03:22 -0400, Greg Smith wrote:
 Heikki Linnakangas wrote:
  The possibilities are endless... Your proposal above covers a pretty 
  good set of scenarios, but it's by no means complete. If we try to 
  solve everything the configuration will need to be written in a 
  Turing-complete Replication Description Language. We'll have to pick a 
  useful, easy-to-understand subset that covers the common scenarios. To 
  handle the more exotic scenarios, you can write a proxy that sits in 
  front of the master, and implements whatever rules you wish, with the 
  rules written in C.
 
 I was thinking about this a bit recently.  As I see it, there are three 
 fundamental parts of this:
 
 1) We have a transaction that is being committed.  The rest of the 
 computations here are all relative to it.
 
 2) There is an (internal?) table that lists the state of each 
 replication target relative to that transaction.  It would include the 
 node name, perhaps some metadata ('location' seems the one that's most 
 likely to help with the remote data center issue), and a state code.  
 The codes from http://wiki.postgresql.org/wiki/Streaming_Replication 
 work fine for the last part (which is the only dynamic one--everything 
 else is static data being joined against):
 
 async=hasn't received yet
 recv=been received but just in RAM
 fsync=received and synced to disk
 apply=applied to the database
 
 These would need to be enums so they can be ordered from lesser to 
 greater consistency.
 
 So in a 3 node case, the internal state table might look like this after 
 a bit of data had been committed:
 
 node | location | state
 -----------------------
 a    | local    | fsync
 b    | remote   | recv
 c    | remote   | async
 
 This means that the local node has a fully persistent copy, but the best 
 either remote one has done is received the data, it's not on disk at all 
 yet at the remote data center.  Still working its way through.
 
 3) The decision about whether the data has been committed to enough 
 places to be considered safe by the master is computed by a function 
 that is passed this internal table as something like a SRF, and it 
 returns a boolean.  Once that returns true, saying it's satisfied, the 
 transaction closes on the master and continues to percolate out from 
 there.  If it's false, we wait for another state change to come in and 
 return to (2).
 
 I would propose that most behaviors someone has expressed as being their 
 desired implementation are possible to implement using this scheme. 
 
 -Semi-sync commit:  proceed as soon as somebody else has a copy and hope 
 the copies all become consistent:  EXISTS WHERE state>=recv
 -Don't proceed until there's a fsync'd commit on at least one of the 
 remote nodes:  EXISTS WHERE location='remote' AND state>=fsync
 -Look for a quorum of n commits of fsync quality:  CASE WHEN (SELECT 
 COUNT(*) WHERE state>=fsync) > n THEN true ELSE false END;
 
 Syntax is obviously rough but I think you can get the drift of what I'm 
 suggesting.
 
 While exposing the local state and running this computation isn't free, 
 in situations where there truly are remote nodes in here being 
 communicated with, the network overhead is going to dwarf that.  If there 
 were a fast path for the simplest cases and this complicated one for the 
 rest, I think you could get the fully programmable behavior some people 
 want using simple SQL, rather than having to write a new Replication 
 Description Language or something so ambitious.  This data about what's 
 been replicated to where looks an awful lot like a set of rows you can 
 operate on using features already in the database to me.

I think we're all agreed on the 4 levels: async, recv, fsync, apply.

I also like the concept of a synchronisation/wakeup rule as an abstract
concept. Certainly makes things easier to discuss.

The inputs to the wakeup rule can be defined in different ways. Holding
per-node state at local level looks too complex to me. I'm not
suggesting that we need both per-node AND per-transaction options
interacting at the same time. (That would be a clear argument against
per-transaction options, if that was a requirement - it's not, for me).

There seems to be a simpler way: a service oriented model. The
transaction requests a minimum level of synchronisation, the standbys
together service that request. A simple, clear process:

1. If transaction requests recv, fsync or apply, backend sleeps in the
appropriate queue

2. An agent on behalf of the remote standby provides feedback according
to the levels of service defined for that standby.

3. The agent calls a wakeup-rule to see if the backend can be woken yet

The most basic rule is first-standby-wakes meaning that the first
standby to provide feedback that the required synchronisation level has
been met by at least one standby will cause the rule to fire.

The next most basic thing is that some standbys can be marked as not
taking part in the quorum 

Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages

2010-06-02 Thread Russell Smith
On 28/05/10 04:00, Josh Berkus wrote:
  Consider a table that is
 regularly written but append-only.  Every time autovacuum kicks in,
 we'll go and remove any dead tuples and then mark the pages
 PD_ALL_VISIBLE and set the visibility map bits, which will cause
 subsequent vacuums to ignore the all-visible portions of the table...
 until anti-wraparound kicks in, at which point we'll vacuum the entire
 table and freeze everything.

 If, however, we decree that you can't write a new tuple into a
 PD_ALL_VISIBLE page without freezing the existing tuples, then you'll
 still have the small, incremental vacuums but those are pretty cheap,
 
 That only works if those pages were going to be autovacuumed anyway.  In
 the case outlined above (which I've seen at 3 different production sites
 this year), they wouldn't be; a table with less than 2% updates and
 deletes does not get vacuumed until max_freeze_age for any reason.  For
 that matter, pages which are getting autovacuumed are not a problem,
 period; they're being read and written and freezing them is not an issue.

 I'm not seeing a way of fixing this common issue short of overhauling
 CLOG, or of creating a freeze_map.  Darn.
   
Don't you get a positive enough effect by adjusting the table's
autovacuum_freeze_min_age and autovacuum_freeze_max_age?  If you set
those numbers small, it appears to me that you would get very quickly to
a state where the vacuum would examine only the most recent part of the
table rather than the whole thing.  Does that give you enough of a win
that it stops the scanning and writing of the whole table, which reduces
the performance problem being experienced?  It's not a complete
solution, but does it go some way?

Regards

Russell




Re: [HACKERS] [BUGS] BUG #5487: dblink failed with 63 bytes connection names

2010-06-02 Thread Heikki Linnakangas

On 02/06/10 09:46, Takahiro Itagaki wrote:


Heikki Linnakangas heikki.linnakan...@enterprisedb.com  wrote:


Hmm, seems that dblink should call truncate_identifier() for the
truncation, to be consistent with truncation of table names etc.


Hmmm, we need the same routine as truncate_identifier(), but it's hard
to use that function directly because it modifies the input buffer in place.
Since all of the name strings in dblink are const char *, I added
a slightly modified version of the function, truncate_identifier_copy(),
in the attached v2 patch.


Well, looking at the callers, most if not all of them seem to actually 
pass a palloc'd copy, allocated by text_to_cstring().


Should we throw a NOTICE like vanilla truncate_identifier() does?

I feel it would be better to just call truncate_identifier() than roll 
your own. Assuming we want the same semantics in dblink, we'll otherwise 
have to remember to update truncate_identifier_copy() with any changes 
to truncate_identifier().
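
In other words, a dblink caller could just do something like this on the
palloc'd copy it already has (variable names made up for illustration):

    char       *conname = text_to_cstring(conname_text);

    truncate_identifier(conname, strlen(conname), true);

That keeps the truncation rules and the NOTICE behavior in exactly one place.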



--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication: Checkpoint_segment and wal_keep_segments on standby

2010-06-02 Thread Heikki Linnakangas

On 02/06/10 06:23, Fujii Masao wrote:

On Mon, May 31, 2010 at 7:17 PM, Fujii Masao masao.fu...@gmail.com  wrote:

4) Change it so that checkpoint_segments takes effect in standby mode,
but not during recovery otherwise


I revised the patch to achieve 4). This will enable checkpoint_segments
to trigger a restartpoint like checkpoint_timeout already does, in
standby mode (i.e., streaming replication or file-based log shipping).


Hmm, XLogCtl->Insert.RedoRecPtr is not updated during recovery, so this 
doesn't work.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



[HACKERS] proposal - custom format for date, timestamp

2010-06-02 Thread Pavel Stehule
Hello

I'm thinking about the request for a custom datetime format. My first idea is:

a) add new datestyle format custom
b) add new GUC variables - custom_date_format, custom_time_format,
custom_timestamp_format

There are some questions:
a) what is the right behavior when datestyle = custom, but some necessary
custom_*_format is empty?
b) the custom format is used for output; does it have to be used for input too?
c) when the input doesn't match the custom format, should it be parsed the usual way?

I did a conceptual prototype - see the attachment

postgres=# set datestyle to custom;
SET
Time: 0,450 ms
postgres=# select current_date;
ERROR:  custom_datestyle_format is empty
postgres=# set custom_datestyle_format to 'Dy Mon YYYY';
SET
Time: 0,409 ms
postgres=# select current_date;
 date
--------------
 Wed Jun 2010
(1 row)

Time: 0,485 ms

Regards
Pavel Stehule


custom_format.diff
Description: Binary data



Re: [HACKERS] Streaming Replication: Checkpoint_segment and wal_keep_segments on standby

2010-06-02 Thread Fujii Masao
On Wed, Jun 2, 2010 at 8:40 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 02/06/10 06:23, Fujii Masao wrote:

 On Mon, May 31, 2010 at 7:17 PM, Fujii Masao masao.fu...@gmail.com
  wrote:

 4) Change it so that checkpoint_segments takes effect in standby mode,
 but not during recovery otherwise

 I revised the patch to achieve 4). This will enable checkpoint_segments
 to trigger a restartpoint like checkpoint_timeout already does, in
 standby mode (i.e., streaming replication or file-based log shipping).

 Hmm, XLogCtl->Insert.RedoRecPtr is not updated during recovery, so this
 doesn't work.

Oops! I revised the patch, which changes CreateRestartPoint() so that
it updates XLogCtl->Insert.RedoRecPtr.
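
For illustration, the shape of the change is something like this sketch
(not the exact patch text; the locking detail is an assumption on my part):

    /*
     * Sketch: inside CreateRestartPoint(), once lastCheckPoint has been
     * read, publish its redo pointer so that checkpoint_segments
     * accounting during recovery sees how far replay has progressed.
     */
    {
        /* use volatile pointer to prevent code rearrangement */
        volatile XLogCtlData *xlogctl = XLogCtl;

        LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
        xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
        LWLockRelease(WALInsertLock);
    }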

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


checkpoint_segments_during_recovery_v3.patch
Description: Binary data



Re: [HACKERS] proposal - custom format for date, timestamp

2010-06-02 Thread Tom Lane
Pavel Stehule pavel.steh...@gmail.com writes:
 I'm thinking about the request for a custom datetime format. My first idea is:

 a) add new datestyle format custom
 b) add new GUC variables - custom_date_format, custom_time_format,
 custom_timestamp_format

Ick.  Why not just put them in one variable.
datestyle = 'custom  mm dd'
Interrelated GUCs are really painful.

regards, tom lane



Re: [HACKERS] Synchronization levels in SR

2010-06-02 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 It's pretty scary to call a user-defined function at that point in 
 transaction.

Not so much "pretty scary" as "zero chance of being accepted".
And I do mean zero.

regards, tom lane



Re: [HACKERS] proposal - custom format for date, timestamp

2010-06-02 Thread Pavel Stehule
2010/6/2 Tom Lane t...@sss.pgh.pa.us:
 Pavel Stehule pavel.steh...@gmail.com writes:
  I'm thinking about the request for a custom datetime format. My first idea is:

 a) add new datestyle format custom
 b) add new GUC variables - custom_date_format, custom_time_format,
 custom_timestamp_format

 Ick.  Why not just put them in one variable.
        datestyle = 'custom  mm dd'
 Interrelated GUCs are really painful.

I understand, but personally I dislike mixing two different values - a name
and a format - together. What's more, we need separate formats for
date, time and timestamp. It is different from Oracle, because Oracle
uses only one datatype for all of them.

Regards
Pavel

                        regards, tom lane




Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Simon Riggs
On Mon, 2010-05-31 at 14:40 -0400, Bruce Momjian wrote:

 Uh, we have three days before we package 9.0beta2.  It would be good if
 we could decide on the max_standby_delay issue soon.

I've heard something from Heikki, not from anyone else. Those comments
amount to "let's replace max_standby_delay with max_apply_delay".

Got a beta idea?

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Kevin Grittner
Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, 2010-05-31 at 14:40 -0400, Bruce Momjian wrote:
 
 Uh, we have three days before we package 9.0beta2.  It would be
 good if we could decide on the max_standby_delay issue soon.
 
 I've heard something from Heikki, not from anyone else. Those
 comments amount to lets replace max_standby_delay with
 max_apply_delay.
 
 Got a beta idea?
 
Given the incessant ticking of the clock, I have a hard time
believing we have any real options besides max_standby_delay or a
boolean which corresponds to the -1 and 0 settings of
max_standby_delay.  I think it's pretty clear that there's a use
case for the positive values, although there are bound to be some
who try it and are surprised by behavior at transition from idle to
active.  The whole debate seems to boil down to how important a
middle ground is versus how damaging the surprise factor is.  (I
don't really buy the argument that we won't be able to remove it
later if we replace it with something better.)
 
I know there were initially some technical problems, too; have
those been resolved?
 
-Kevin



Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 OK, here's v4.

I've been trying to stay out of this thread, but with beta2 approaching
and no resolution in sight, I'm afraid I have to get involved.

This patch seems to me to be going in fundamentally the wrong direction.
It's adding complexity and overhead (far more than is needed), and it's
failing utterly to resolve the objections that I raised to start with.

In particular, Simon seems to be basically refusing to do anything about
the complaint that the code fails unless master and standby clocks are
in close sync.  I do not believe that this is acceptable, and since he
won't fix it, I guess I'll have to.

The other basic problem that I had with the current code, which this
does nothing about, is that it believes that timestamps pulled from WAL
archive should be treated on the same footing as timestamps of live
actions.  That might work in certain limited scenarios but it's going to
be a disaster in general.

I believe that the motivation for treating archived timestamps as live
is, essentially, to force rapid catchup if a slave falls behind so far
that it's reading from archive instead of SR.  There are certainly
use-cases where that's appropriate (though, again, not all); but even
when you do want it it's a pretty inflexible implementation.  For
realistic values of max_standby_delay the behavior is going to pretty
much be instant kill whenever we're reading from archive.

I have backed off my original suggestion that we should reduce
max_standby_delay to a boolean: that was based on the idea that delays
would only occur in response to DDL on the master, but since vacuum and
btree page splits can also trigger delays, it seems clear that a kill
queries immediately policy isn't really very useful in practice.  So we
need to make max_standby_delay work rather than just get rid of it.

What I think might be a realistic compromise is this:

1. Separate max_standby_delay into two GUCs, say max_streaming_delay
and max_archive_delay.

2. When applying WAL that came across SR, use max_streaming_delay and
let the time measurement be current time minus time of receipt of the
current WAL send chunk.

3. When applying WAL that came from archive, use max_archive_delay and
let the time measurement be current time minus time of acquisition of
the current WAL segment from the archive.

The current code's behavior in the latter case could effectively be
modeled by setting max_archive_delay to zero, but that isn't the only
plausible setting.  More likely DBAs would set max_archive_delay to
something smaller than max_streaming_delay, but still positive so as to
not kill conflicting queries instantly.

An important property of this design is that all relevant timestamps
are taken on the slave, so clock skew isn't an issue.
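
To illustrate, the standby-side test would be roughly the sketch below
(max_streaming_delay and max_archive_delay are the GUCs proposed above;
the function name and the millisecond units are invented for the example):

static bool
StandbyDelayExceeded(bool wal_is_from_archive, TimestampTz receipt_time)
{
    /* pick the limit based on where the WAL being applied came from */
    int     max_delay_ms = wal_is_from_archive ? max_archive_delay
                                               : max_streaming_delay;

    if (max_delay_ms < 0)
        return false;           /* negative means "wait forever" */

    /* receipt_time is when this segment/send chunk arrived on the standby */
    return TimestampDifferenceExceeds(receipt_time,
                                      GetCurrentTimestamp(),
                                      max_delay_ms);
}

Both timestamps come from the standby's own clock, which is the point.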

I'm still inclined to apply the part of Simon's patch that adds a
transmit timestamp to each SR send chunk.  That would actually be
completely unused by the slave given my proposal above, but I think that
it is an important step to take to future-proof the SR protocol against
possible changes in the slave-side timing logic.  I don't however see
the value of transmitting keepalive records when we'd otherwise not
have anything to send.  The value of adding timestamps to the SR
protocol is really to let the slave determine the current amount of
clock skew between it and the master; which is a number that should not
change so rapidly that it has to be updated every 100ms even in an idle
system.

Comments?

regards, tom lane



Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 An important property of this design is that all relevant timestamps
 are taken on the slave, so clock skew isn't an issue.

I agree that this is important, and I do run NTP on all my servers and
even monitor it using Nagios.

It's still not a cure-all for time skew issues.

 Comments?

I'm not really a huge fan of adding another GUC, to be honest.  I'm more
inclined to say we treat 'max_archive_delay' as '0', and turn
max_streaming_delay into what you've described.  If we fall back so far
that we have to go back to reading WALs, then we need to hurry up and
catch up and damn the torpedoes.  I'd also prefer that we only wait the
delay time once until we're fully caught up again (and have gotten
back around to waiting for new data).  

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Synchronization levels in SR

2010-06-02 Thread Greg Smith

Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
  
It's pretty scary to call a user-defined function at that point in 
transaction.



Not so much "pretty scary" as "zero chance of being accepted".
And I do mean zero.
  


I swear, you guys are such buzzkills some days.  I was suggesting a 
model for building easy prototypes, and advocating a more formal way to 
explain, in what could be code form, what someone means when they 
suggest a particular quorum model or the like.  Maybe all that will ever 
be exposed into a production server are the best of the hand-written 
implementations, and the scary try your prototype here hook only shows 
up in debug builds, or never gets written at all.  I did comment that I 
expected faster built-in implementations to be the primary way these 
would be handled.


From what Heikki said, it sounds like the main thing I didn't 
remember is to include some timestamp information to allow rules based 
on that information too.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Andrew Dunstan



Tom Lane wrote:

I'm still inclined to apply the part of Simon's patch that adds a
transmit timestamp to each SR send chunk.  That would actually be
completely unused by the slave given my proposal above, but I think that
it is an important step to take to future-proof the SR protocol against
possible changes in the slave-side timing logic.  
  


+1.

From a radically different perspective, I had to do something similar 
in the buildfarm years ago to protect us from machines reporting with 
grossly inaccurate timestamps. This was part of the solution. The client 
adds its current timestamp setting just before transmitting the data to 
the server.


cheers

andrew



Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 * Tom Lane (t...@sss.pgh.pa.us) wrote:
 Comments?

 I'm not really a huge fan of adding another GUC, to be honest.  I'm more
 inclined to say we treat 'max_archive_delay' as '0', and turn
 max_streaming_delay into what you've described.  If we fall back so far
 that we have to go back to reading WALs, then we need to hurry up and
 catch-up and damn the torpedos.

If I thought that 0 were a generally acceptable value, I'd still be
pushing the simplify it to a boolean agenda ;-).  The problem is that
that will sometimes kill standby queries even when they are quite short
and doing nothing objectionable.

 I'd also prefer that we only wait the
 delay time once until we're fully caught up again (and have gotten
 back around to waiting for new data).  

The delays will be measured from a receipt instant to current time,
which means that the longer it takes to apply a WAL segment or WAL
send chunk, the less grace period there will be.  (Which is the
same as what CVS HEAD does --- I'm just arguing about where we get
the start time from.)  I believe this does what you suggest and more.

regards, tom lane



Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages

2010-06-02 Thread Alvaro Herrera
Excerpts from Russell Smith's message of Wed Jun 02 06:38:35 -0400 2010:

 Don't you get a positive enough effect by adjusting the table's
 autovacuum_freeze_min_age and autovacuum_freeze_max_age?  If you set
 those numbers small, it appears to me that you would get very quickly to
 a state where the vacuum would examine only the most recent part of the
 table rather than the whole thing.

The problem is that vacuum doesn't know that a certain part of the table
is already frozen.  It needs to scan it completely anyways.  If we had a
frozen map, we could mark pages that are completely frozen and thus do
not need any vacuuming; but we don't (I don't recall the reasons for
this.  Maybe it's just that no one has gotten around to it, or maybe
there's something else).

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Simon Riggs
On Wed, 2010-06-02 at 13:45 -0400, Tom Lane wrote:
 Stephen Frost sfr...@snowman.net writes:
  * Tom Lane (t...@sss.pgh.pa.us) wrote:
  Comments?
 
  I'm not really a huge fan of adding another GUC, to be honest.  I'm more
  inclined to say we treat 'max_archive_delay' as '0', and turn
  max_streaming_delay into what you've described.  If we fall back so far
  that we have to go back to reading WALs, then we need to hurry up and
  catch-up and damn the torpedos.
 
 If I thought that 0 were a generally acceptable value, I'd still be
 pushing the simplify it to a boolean agenda ;-).  The problem is that
 that will sometimes kill standby queries even when they are quite short
 and doing nothing objectionable.

OK, now I understand. I was just thinking the same as Stephen, but now I
agree we need a second parameter.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages

2010-06-02 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 The problem is that vacuum doesn't know that a certain part of the table
 is already frozen.  It needs to scan it completely anyways.  If we had a
 frozen map, we could mark pages that are completely frozen and thus do
 not need any vacuuming; but we don't (I don't recall the reasons for
 this.  Maybe it's just that no one has gotten around to it, or maybe
 there's something else).

Offhand I think the reason is that you'd have to trust the frozen bit
to be 100% correct (or at least never set to 1 in error).  Currently,
both the FSM and visibility forks are just hints, and we won't suffer
data corruption if they're wrong; so we don't get too tense about WAL
logging or fsync'ing updates.  I believe Heikki is looking into what
it'd take to make the visibility map 100% reliable, in connection with
the desire for index-only scans.  If we get that and the overhead isn't
too terrible maybe we could build a frozen-status map the same way.

regards, tom lane



Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Robert Haas
On Wed, Jun 2, 2010 at 2:03 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, 2010-06-02 at 13:45 -0400, Tom Lane wrote:
 Stephen Frost sfr...@snowman.net writes:
  * Tom Lane (t...@sss.pgh.pa.us) wrote:
  Comments?

  I'm not really a huge fan of adding another GUC, to be honest.  I'm more
  inclined to say we treat 'max_archive_delay' as '0', and turn
  max_streaming_delay into what you've described.  If we fall back so far
  that we have to go back to reading WALs, then we need to hurry up and
  catch-up and damn the torpedos.

 If I thought that 0 were a generally acceptable value, I'd still be
 pushing the simplify it to a boolean agenda ;-).  The problem is that
 that will sometimes kill standby queries even when they are quite short
 and doing nothing objectionable.

 OK, now I understand. I was just thinking the same as Stephen, but now I
 agree we need a second parameter.

I too was thinking the same as Stephen, but now I also agree we need a
second parameter.  I think the whole design Tom proposed seems good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages

2010-06-02 Thread Robert Haas
On Wed, Jun 2, 2010 at 2:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
 The problem is that vacuum doesn't know that a certain part of the table
 is already frozen.  It needs to scan it completely anyways.  If we had a
 frozen map, we could mark pages that are completely frozen and thus do
 not need any vacuuming; but we don't (I don't recall the reasons for
 this.  Maybe it's just that no one has gotten around to it, or maybe
 there's something else).

 Offhand I think the reason is that you'd have to trust the frozen bit
 to be 100% correct (or at least never set to 1 in error).  Currently,
 both the FSM and visibility forks are just hints, and we won't suffer
 data corruption if they're wrong; so we don't get too tense about WAL
 logging or fsync'ing updates.  I believe Heikki is looking into what
 it'd take to make the visibility map 100% reliable, in connection with
 the desire for index-only scans.  If we get that and the overhead isn't
 too terrible maybe we could build a frozen-status map the same way.

We could, but I think we'd be better off just freezing at the time we
mark the page PD_ALL_VISIBLE and then using the visibility map for
both purposes.  Keeping around the old xmin values after every tuple
on the page is visible to every running transaction is useful only for
forensics, and building a whole new freeze map just to retain that
information longer (and eventually force a massive anti-wraparound
vacuum) seems like overkill.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Greg Stark
On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 I believe that the motivation for treating archived timestamps as live
 is, essentially, to force rapid catchup if a slave falls behind so far
 that it's reading from archive instead of SR.


Huh, I think this is the first mention of this that I've seen. I
always assumed the motivation was just that you wanted to control how
much data loss could occur on failover and how long recovery would
take. I think separating the two delays is an interesting idea but I
don't see how it counts as a simplification.

This also still allows a slave to become arbitrarily far behind the
master. The master might take a long time to fill up a log file, and
if the slave is blocked for a few minutes, then requests the next bit
of log file and blocks for a few more minutes, then requests the next
log file and blocks for a few more minutes, etc. As long as the slave
is reading more slowly than the master is filling those segments it
will fall further and further behind but never be more than a few
minutes behind receiving the logs.

I propose an alternate way out of the problem of syncing two clocks.
Instead of comparing timestamps compare time intervals. So as it reads
xlog records it only ever compares the master timestamps with previous
master timestamps to determine how much time has elapsed on the
master. It compares that time elapsed with the time elapsed on the
slave to determine if it's falling behind. I'm assuming it
periodically catches up and can throw out its accumulated time elapsed
and start fresh. If not then the accumulated error might be a problem,
but hopefully one we can find a solution to.
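
Roughly like this, I think (a sketch with made-up names, just to pin the
idea down):

typedef struct
{
    TimestampTz master_base;    /* master timestamp at last catch-up point */
    TimestampTz standby_base;   /* standby clock at the same moment */
} LagTracker;

static bool
StandbyFallingBehind(LagTracker *lag, TimestampTz latest_master_ts,
                     int max_delay_ms)
{
    long    secs;
    int     usecs;
    int64   master_ms;
    int64   standby_ms;

    /* time elapsed on the master, taken purely from WAL record timestamps */
    TimestampDifference(lag->master_base, latest_master_ts, &secs, &usecs);
    master_ms = (int64) secs * 1000 + usecs / 1000;

    /* wall-clock time elapsed locally since the same catch-up point */
    TimestampDifference(lag->standby_base, GetCurrentTimestamp(),
                        &secs, &usecs);
    standby_ms = (int64) secs * 1000 + usecs / 1000;

    /* behind if replay has used more local time than elapsed on the master */
    return (standby_ms - master_ms) > max_delay_ms;
}
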

-- 
greg



Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Simon Riggs
On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote:

 This patch seems to me to be going in fundamentally the wrong direction.
 It's adding complexity and overhead (far more than is needed), and it's
 failing utterly to resolve the objections that I raised to start with.

Having read your proposal, it seems changing from time-on-sender to
time-on-receiver is a one line change to the patch.

What else are you thinking of removing, if anything?

Adding an extra parameter obviously adds more, but it is something I now
agree with.

 In particular, Simon seems to be basically refusing to do anything about
 the complaint that the code fails unless master and standby clocks are
 in close sync.  I do not believe that this is acceptable, and since he
 won't fix it, I guess I'll have to.

Syncing two servers in replication is common practice, as has been
explained here; I'm still surprised people think otherwise. Measuring
the time between two servers is the very purpose of the patch, so the
synchronisation is not a design flaw, it is its raison d'etre. There's
been a few spleens emptied on that topic, not all of them mine, and
certainly no consensus on that. So I'm not refusing to do anything
that's been agreed... 

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Exposing the Xact commit order to the user

2010-06-02 Thread Chris Browne
d...@csail.mit.edu (Dan Ports) writes:
 I'm not clear on why the total rowcount is useful, but perhaps I'm
 missing something obvious.

It would make it easy to conclude:

   This next transaction did 8328194 updates.  Maybe we should do
   some kind of checkpoint (e.g. - commit transaction or such) before
   working on it.

versus

   This transaction we're thinking of working on had 7 updates.  No
   big deal...
-- 
(reverse (concatenate 'string ofni.secnanifxunil @ enworbbc))
http://linuxfinances.info/info/finances.html
Rules of  the Evil Overlord #189. I  will never tell the  hero Yes I
was the one who  did it, but you'll never be able  to prove it to that
incompetent  old fool.  Chances  are, that  incompetent  old fool  is
standing behind the curtain.  http://www.eviloverlord.com/



Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Robert Haas
On Wed, Jun 2, 2010 at 2:27 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Syncing two servers in replication is common practice, as has been
 explained here; I'm still surprised people think otherwise. Measuring
 the time between two servers is the very purpose of the patch, so the
 synchronisation is not a design flaw, it is its raison d'etre.

I think the purpose of the patch should not be to measure the time
difference between servers, but rather the replication delay.  While I
don't necessarily agree with Tom's statement that this is must-fix, I
do agree that it would be nicer if we could avoid depending on time
sync.  Yeah, I keep my servers time synced, too.  But, shit happens.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



[HACKERS] caught_up status in walsender

2010-06-02 Thread Tom Lane
I wrote:
 I'm still inclined to apply the part of Simon's patch that adds a
 transmit timestamp to each SR send chunk.  That would actually be
 completely unused by the slave given my proposal above, but I think that
 it is an important step to take to future-proof the SR protocol against
 possible changes in the slave-side timing logic.

On further contemplation, it seems like the protocol needs another field
besides that: each record should also carry a boolean indicating whether
walsender.c thinks it is currently caught up, ie the record carries
all WAL data up to the current end of WAL.  If the sender is not caught
up, then the receiver should apply max_archive_delay not
max_streaming_delay.  In this way, WAL chunks that are a little bit
behind current time will be treated the same way whether they come
across the SR link or via the archive.

In conjunction with that, I think there's a logic bug in XLogSend;
it ought to be changed like so:

/* if we went beyond SendRqstPtr, back off */
if (XLByteLT(SendRqstPtr, endptr))
+   {
endptr = SendRqstPtr;
+   *caughtup = true;
+   }

In the current coding, the effect of not setting *caughtup here is just
that we uselessly call XLogSend an extra time for each transmission
(because the main loop won't ever delay immediately after a
transmission).  But without this, we'd never send caughtup = true
to the slave.

regards, tom lane



Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages

2010-06-02 Thread Alvaro Herrera
Excerpts from Robert Haas's message of Wed Jun 02 14:16:33 -0400 2010:

 We could, but I think we'd be better off just freezing at the time we
 mark the page PD_ALL_VISIBLE and then using the visibility map for
 both purposes.  Keeping around the old xmin values after every tuple
 on the page is visible to every running transaction is useful only for
 forensics, and building a whole new freeze map just to retain that
 information longer (and eventually force a massive anti-wraparound
 vacuum) seems like overkill.

Reducing the xid wraparound horizon a bit is reasonable, but moving it
all the way forward to OldestXmin is a bit much, methinks.

Besides, there's another argument for not freezing tuples immediately:
they may be updated shortly thereafter, causing extra churn for no gain.

I'd prefer a setting that would tell the system to freeze all tuples
that fall within a safety range whenever any tuple in the page is frozen
-- weren't you working on a patch to do this?  (was it Jeff Davis?)

(BTW maybe instead of separate visibility and freeze maps we could have
two bits in the visibility map?)

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I believe that the motivation for treating archived timestamps as live
 is, essentially, to force rapid catchup if a slave falls behind so far
 that it's reading from archive instead of SR.

 Huh, I think this is the first mention of this that I've seen. I
 always assumed the motivation was just that you wanted to control how
 much data loss could occur on failover and how long recovery would
 take. I think separating the two delays is an interesting idea but I
 don't see how it counts as a simplification.

Well, it isn't a simplification: it's bringing it up to the minimum
complication level where it'll actually work sanely.  The current
implementation doesn't work sanely because it confuses stale timestamps
read from WAL with real live time.

 This also still allows a slave to become arbitrarily far behind the
 master.

Indeed, but nothing we do can prevent that, if the slave is just plain
slower than the master.  You have to assume that the slave is capable of
keeping up in the absence of query-caused delays, or you're hosed.

The real reason this is at issue is the fact that the max_standby_delay
kill mechanism applies to certain buffer-level locking operations.
On the master we just wait, and it's not a problem, because in practice
the conflicting queries almost always release these locks pretty quick.
On the slave, though, instant kill as a result of a buffer-level lock
conflict would result in a very serious degradation in standby query
reliability (while also doing practically nothing for the speed of WAL
application, most of the time).  This morning on the phone Bruce and I
were seriously discussing the idea of ripping the max_standby_delay
mechanism out of the buffer-level locking paths, and just let them work
like they do on the master, ie, wait forever.  If we did that then
simplifying max_standby_delay to a boolean would be reasonable again
(because it really would only come into play for DDL on the master).
The sticky point is that once in a blue moon you do have a conflicting
query sitting on a buffer lock for a long time, or even more likely a
series of queries keeping the WAL replay process from obtaining buffer
cleanup lock.

So it seems that we have to have max_standby_delay-like logic for those
locks, and also that zero grace period before kill isn't a very practical
setting.  However, there isn't a lot of point in obsessing over exactly
how long the grace period ought to be, as long as it's more than a few
milliseconds.  It *isn't* going to have any real effect on whether the
slave can stay caught up.  You could make a fairly decent case for just
measuring the grace period from when the replay process starts to wait,
as I think I proposed awhile back.  The value of measuring delay from a
receipt time is that if you do happen to have a bunch of delays within
a short interval you'll get more willing to kill queries --- but I
really believe that that is a corner case and will have nothing to do
with ordinary performance.

 I propose an alternate way out of the problem of syncing two clocks.
 Instead of comparing timestamps compare time intervals. So as it reads
 xlog records it only ever compares the master timestamps with previous
 master timestamps to determine how much time has elapsed on the
 master. It compares that time elapsed with the time elapsed on the
 slave to determine if it's falling behind.

I think this would just add complexity and uncertainty, to deal with
something that won't be much of a problem in practice.

regards, tom lane



Re: [HACKERS] caught_up status in walsender

2010-06-02 Thread Heikki Linnakangas

On 02/06/10 21:44, Tom Lane wrote:

In conjunction with that, I think there's a logic bug in XLogSend;
it ought to be changed like so:

/* if we went beyond SendRqstPtr, back off */
if (XLByteLT(SendRqstPtr, endptr))
+   {
endptr = SendRqstPtr;
+   *caughtup = true;
+   }

In the current coding, the effect of not setting *caughtup here is just
that we uselessly call XLogSend an extra time for each transmission
(because the main loop won't ever delay immediately after a
transmission).  But without this, we'd never send caughtup = true
to the slave.


That's intentional. It could take some time for the WAL to be sent, if 
the network is busy, so by the time XLogSend returns you might well not 
be caught up anymore.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] caught_up status in walsender

2010-06-02 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 02/06/10 21:44, Tom Lane wrote:
 In the current coding, the effect of not setting *caughtup here is just
 that we uselessly call XLogSend an extra time for each transmission
 (because the main loop won't ever delay immediately after a
 transmission).  But without this, we'd never send caughtup = true
 to the slave.

 That's intentional. It could take some time for the WAL to be sent, if 
 the network is busy, so by the time XLogSend returns you might well not 
 be caught up anymore.

It may have been intentional, but it's still wrong.  If you were able to
pull all of WAL into the record-to-be-sent, you should sleep afterwards,
not send an extra record containing a few more bytes.

regards, tom lane



[HACKERS] Allow wal_keep_segments to keep all segments

2010-06-02 Thread Bruce Momjian
Bruce Momjian wrote:
 Robert Haas wrote:
  On Sat, May 8, 2010 at 10:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
   Bruce Momjian br...@momjian.us writes:
   Uh, did we decide that 'wal_keep_segments' was the best name for this
   GUC setting?  I know we shipped beta1 using that name.
  
   I thought min_wal_segments was a reasonable proposal, but it wasn't
   clear if there was consensus or not.
  
  I think most people thought it was another reasonable choice, but I
  think the consensus position is probably something like "it's about
  the same" rather than "it's definitely better".  We had one or two
  people with stronger opinions than that on either side, I believe.
 
 Agreed the current name seems OK.  However, was there agreement that
 wal_keep_segments = -1 should keep all WAL segments?  I can see that as
 useful for cases where you are doing a dump to be transferred to the
 slave, and not using archive_command.  This avoids the need for the "set
 a huge value" solution.

The attached patch allows wal_keep_segments = -1 to keep all segments; 
this is particularly useful for taking a base backup, where you need all
the WAL files during startup of the standby.  I have documented this
usage in the patch as well.

I am thinking of applying this after 9.0 beta2 if there is no objection.
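
For illustration only, assuming the patch as posted, the intended usage
around a base backup would look roughly like this in postgresql.conf
(reloading the server after each change):

    # while taking the base backup and letting the standby catch up:
    wal_keep_segments = -1      # keep every WAL segment in pg_xlog

    # once the standby is streaming normally, back to a finite ceiling:
    wal_keep_segments = 32

The value 32 is just a placeholder; the point is that -1 removes the need
to guess a "large enough" number during the backup window.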

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +
Index: doc/src/sgml/config.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.280
diff -c -c -r1.280 config.sgml
*** doc/src/sgml/config.sgml	31 May 2010 15:50:48 -	1.280
--- doc/src/sgml/config.sgml	2 Jun 2010 19:19:18 -
***
*** 1887,1893 
  Specifies the number of past log file segments kept in the
  filenamepg_xlog/
  directory, in case a standby server needs to fetch them for streaming
! replication. Each segment is normally 16 megabytes. If a standby
  server connected to the primary falls behind by more than
  varnamewal_keep_segments/ segments, the primary might remove
  a WAL segment still needed by the standby, in which case the
--- 1887,1893 
  Specifies the number of past log file segments kept in the
  filenamepg_xlog/
  directory, in case a standby server needs to fetch them for streaming
! replication.  Each segment is normally 16 megabytes. If a standby
  server connected to the primary falls behind by more than
  varnamewal_keep_segments/ segments, the primary might remove
  a WAL segment still needed by the standby, in which case the
***
*** 1901,1908 
  is zero (the default), the system doesn't keep any extra segments
  for standby purposes, and the number of old WAL segments available
  for standbys is determined based only on the location of the previous
! checkpoint and status of WAL archiving.
! This parameter can only be set in the filenamepostgresql.conf/
  file or on the server command line.
 /para
 /listitem
--- 1901,1909 
  is zero (the default), the system doesn't keep any extra segments
  for standby purposes, and the number of old WAL segments available
  for standbys is determined based only on the location of the previous
! checkpoint and status of WAL archiving.  If literal-1/ is
! specified, log file segments are kept indefinitely. This
! parameter can only be set in the filenamepostgresql.conf/
  file or on the server command line.
 /para
 /listitem
Index: doc/src/sgml/high-availability.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/high-availability.sgml,v
retrieving revision 1.70
diff -c -c -r1.70 high-availability.sgml
*** doc/src/sgml/high-availability.sgml	29 May 2010 09:01:10 -	1.70
--- doc/src/sgml/high-availability.sgml	2 Jun 2010 19:19:19 -
***
*** 750,756 
  If you use streaming replication without file-based continuous
  archiving, you have to set varnamewal_keep_segments/ in the master
  to a value high enough to ensure that old WAL segments are not recycled
! too early, while the standby might still need them to catch up. If the
  standby falls behind too much, it needs to be reinitialized from a new
  base backup. If you set up a WAL archive that's accessible from the
  standby, wal_keep_segments is not required as the standby can always
--- 750,760 
  If you use streaming replication without file-based continuous
  archiving, you have to set varnamewal_keep_segments/ in the master
  to a value high enough to ensure that old WAL segments are not recycled
! too early, while the standby 

Re: [HACKERS] Exposing the Xact commit order to the user

2010-06-02 Thread Chris Browne
heikki.linnakan...@enterprisedb.com (Heikki Linnakangas) writes:
 On 24/05/10 19:51, Kevin Grittner wrote:
 The only thing I'm confused about is what benefit anyone expects to
 get from looking at data between commits in some way other than our
 current snapshot mechanism.  Can someone explain a use case where
 what Jan is proposing is better than snapshot isolation?  It doesn't
 provide any additional integrity guarantees that I can see.

 Right, it doesn't. What it provides is a way to reconstruct a snapshot
 at any point in time, after the fact. For example, after transactions
 A, C, D and B have committed in that order, it allows you to
 reconstruct a snapshot just like you would've gotten immediately after
 the commit of A, C, D and B respectively. That's useful for replication
 tools like Slony that need to commit the changes of those
 transactions in the slave in the same order as they were committed in
 the master.

 I don't know enough of Slony et al. to understand why that'd be better
 than the current heartbeat mechanism they use, taking a snapshot every
 few seconds, batching commits.

I see two advantages:

 a) Identifying things on a transaction-by-transaction basis means that
the snapshots (syncs) don't need to be captured, which is
presently an area of fragility.  If the slon daemon falls over on
Friday evening, and nobody notices until Monday, the snapshot
reverts to being all updates between Friday and whenever SYNCs
start to be collected again.

Exposing commit orders eliminates that fragility.  SYNCs don't
need to be captured anymore, so they can't be missed (which is
today's problem).

 b) The sequence currently used to control log application ordering is
a bottleneck, as it is a single sequence shared across all
connections.

It could be eliminated in favor of (perhaps) an in-memory variable
defined on a per-connection basis.

It's not a bottleneck that we hear a lot of complaints about, but
the sequence certainly is a bottleneck.
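
To make that concrete, a purely hypothetical sketch (pg_commit_order and
its columns do not exist today; sl_log_1/log_xid/log_actionseq are
Slony's): the daemon would pull log rows in commit order rather than in
per-SYNC batches, e.g.

    -- hypothetical catalog exposing (xid, commit_seq) in commit order
    SELECT l.*
      FROM sl_log_1 l
      JOIN pg_commit_order c ON c.xid = l.log_xid
     WHERE c.commit_seq > :last_applied_seq
     ORDER BY c.commit_seq, l.log_actionseq;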

-- 
select 'cbbrowne' || '@' || 'cbbrowne.com';
http://cbbrowne.com/info/internet.html
MS  apparently now  has a  team dedicated  to tracking  problems with
Linux  and publicizing them.   I guess  eventually they'll  figure out
this back fires... ;) -- William Burrow aa...@delete.fan.nb.ca

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow wal_keep_segments to keep all segments

2010-06-02 Thread Robert Haas
On Wed, Jun 2, 2010 at 3:20 PM, Bruce Momjian br...@momjian.us wrote:
 Bruce Momjian wrote:
 Robert Haas wrote:
  On Sat, May 8, 2010 at 10:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
   Bruce Momjian br...@momjian.us writes:
   Uh, did we decide that 'wal_keep_segments' was the best name for this
   GUC setting?  I know we shipped beta1 using that name.
  
   I thought min_wal_segments was a reasonable proposal, but it wasn't
   clear if there was consensus or not.
 
  I think most people thought it was another reasonable choice, but I
  think the consensus position is probably something like it's about
  the same rather than it's definitely better.  We had one or two
  people with stronger opinions than that on either side, I believe.

 Agreed the current name seems OK.  However, was there agreement that
 wal_keep_segments = -1 should keep all WAL segments?  I can see that as
 useful for cases where you are doing a dump to be transferred to the
 slave, and not using archive_command.  This avoids the need for the "set
 a huge value" solution.

 The attached patch allows wal_keep_segments = -1 to keep all segments;
 this is particularly useful for taking a base backup, where you need all
 the WAL files during startup of the standby.  I have documented this
 usage in the patch as well.

 I am thinking of applying this after 9.0 beta2 if there is no objection.

+1 for the patch, but why wait until after beta2?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow wal_keep_segments to keep all segments

2010-06-02 Thread Simon Riggs
On Wed, 2010-06-02 at 15:20 -0400, Bruce Momjian wrote:

 The attached patch allows wal_keep_segments = -1 to keep all segments; 
 this is particularly useful for taking a base backup, where you need all
 the WAL files during startup of the standby.  I have documented this
 usage in the patch as well.
 
 I am thinking of applying this after 9.0 beta2 if there is no objection.

It's not clear to me why "keep all files until server breaks" is a good
setting. Surely you would set this parameter to the size of your disk.
Why allow it to go higher?

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] caught_up status in walsender

2010-06-02 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:

 I wrote:
 I'm still inclined to apply the part of Simon's patch that adds a
 transmit timestamp to each SR send chunk.  That would actually be
 completely unused by the slave given my proposal above, but I think that
 it is an important step to take to future-proof the SR protocol against
 possible changes in the slave-side timing logic.

 On further contemplation, it seems like the protocol needs another field
 besides that: each record should also carry a boolean indicating whether
 walsender.c thinks it is currently caught up, ie the record carries
 all WAL data up to the current end of WAL.  If the sender is not caught
 up, then the receiver should apply max_archive_delay not
 max_streaming_delay.  In this way, WAL chunks that are a little bit
 behind current time will be treated the same way whether they come
 across the SR link or via the archive.

Then we'd have max_catchup_delay and max_streaming_delay, wouldn't we?

I'm still trying to understand the implications of the proposal, which
sounds good but… given just enough load on the slave, wouldn't it be
playing catchup all the time? Wouldn't enough load mean one read query
per write commit on the master?

Regards,
-- 
dim

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Heikki Linnakangas

On 02/06/10 20:14, Tom Lane wrote:

 For realistic values of max_standby_delay ...


Hang on right there. What do you consider a realistic value for 
max_standby_delay? Because I'm not sure I have a grip on that myself. 5 
seconds? 5 minutes? 5 hours? I can see use cases for all of those...



What I think might be a realistic compromise is this:

1. Separate max_standby_delay into two GUCs, say max_streaming_delay
and max_archive_delay.

2. When applying WAL that came across SR, use max_streaming_delay and
let the time measurement be current time minus time of receipt of the
current WAL send chunk.

3. When applying WAL that came from archive, use max_archive_delay and
let the time measurement be current time minus time of acquisition of
the current WAL segment from the archive.

The current code's behavior in the latter case could effectively be
modeled by setting max_archive_delay to zero, but that isn't the only
plausible setting.  More likely DBAs would set max_archive_delay to
something smaller than max_streaming_delay, but still positive so as to
not kill conflicting queries instantly.


The problem with defining max_archive_delay that way is again that you 
can fall behind indefinitely. If you set it to 5 minutes, it means that 
you'll wait a maximum of 5 minutes *per WAL segment*, even if WAL is 
being generated faster.


I don't understand why you want to use a different delay when you're 
restoring from archive vs. when you're streaming (what about existing 
WAL files found in pg_xlog, BTW?). The source of WAL shouldn't make a 
difference. If it's because you assume that restoring from archive is a 
sign that you've fallen behind a lot, surely you've exceeded 
max_standby_delay then and I still don't see a need for a separate GUC.


I stand by my suggestion from yesterday: Let's define max_standby_delay 
as the difference between a piece of WAL becoming available in the 
standby, and applying it. To approximate "piece of WAL becoming 
available" for SR, we can use the mechanism with send/applyChunks from 
Simon's latest patch, or go with the simpler scheme of just resetting a 
"last caughtup" timestamp to current time whenever we have to wait for 
new WAL to arrive. When restoring from archive, likewise reset the "last 
caughtup" timestamp whenever restore_command returns non-zero, i.e. we have 
to wait for the next WAL file to arrive.


That works the same for both SR and file-based log shipping, there's 
only one knob to set, it is simple to implement, and it doesn't require 
synchronized clocks.
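
A minimal stand-alone sketch of that bookkeeping (not the actual recovery
code; note_wal_wait and standby_delay_exceeded are made-up names):

    #include <time.h>

    static time_t last_caughtup;    /* reset whenever we had to wait for WAL */

    /* call whenever streaming or restore_command makes us wait for new WAL */
    static void
    note_wal_wait(void)
    {
        last_caughtup = time(NULL);
    }

    /* call before applying a record: has the grace period expired? */
    static int
    standby_delay_exceeded(int max_standby_delay)
    {
        return difftime(time(NULL), last_caughtup) > max_standby_delay;
    }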


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How to pass around collation information

2010-06-02 Thread Peter Eisentraut
On fre, 2010-05-28 at 20:59 +0300, Peter Eisentraut wrote:
 The feature I'm thinking of is what
 people might call per-column locale, and the SQL standard defines
 that.  It would look like this:
 
 CREATE TABLE test (
 a varchar COLLATE de,
 b varchar COLLATE fr
 );
 
 SELECT * FROM test WHERE a  'baz' ORDER BY b;

Perhaps it's also worth pointing out there could be use cases other than
supporting multiple natural languages.  For example, it is frequently
requested to be able to sort in ways that don't ignore special
characters, binary sort, or perhaps a special file name sort that treats
'/' specially in some way.  So it could be quite useful to be able to say

CREATE TABLE something (
description text COLLATE en,
code char(6) COLLATE binary,
file text COLLATE filename_sort
);

or even something like

CREATE DOMAIN filename AS text COLLATE filename_sort;

These are examples where having the collation attached to the column
would appear to make more sense than having it attached only to
operations.
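
For completeness: since the column collation would only act as a default,
the standard also allows overriding it per expression (none of this is
implemented yet; the syntax is as in the SQL standard):

    SELECT * FROM something ORDER BY file;                  -- sorts with filename_sort
    SELECT * FROM something ORDER BY file COLLATE binary;   -- per-query override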



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] caught_up status in walsender

2010-06-02 Thread Tom Lane
Dimitri Fontaine dfonta...@hi-media.com writes:
 I'm still trying to understand the implications of the proposal, which
 sounds good but… given just enough load on the slave, wouldn't it be
 playing catchup all the time?

Uh, if the slave is overloaded, *any* implementation will be playing
catchup all the time.  Not sure about your point here.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 The problem with defining max_archive_delay that way is again that you 
 can fall behind indefinitely.

See my response to Greg Stark --- I don't think this is really an
issue.  It's certainly far less of an issue than the current situation
that query grace periods go to zero under not-at-all-extreme conditions.

 I don't understand why you want to use a different delay when you're 
 restoring from archive vs. when you're streaming (what about existing 
 WAL files found in pg_xlog, BTW?).

You're missing the point.  I want the DBA to be able to control what
happens in those two cases.  In the current implementation he has no
control over what happens while restoring from archive: it's going to
effectively act like max_archive_delay = 0 all the time.  If you're of
the opinion that's good, you can set the parameter that way and be
happy.  I'm of the opinion you'll soon find out it isn't so good,
because it'll kill standby queries too easily.

 I stand by my suggestion from yesterday: Let's define max_standby_delay 
 as the difference between a piece of WAL becoming available in the 
 standby, and applying it.

My proposal is essentially the same as yours plus allowing the DBA to
choose different max delays for the caught-up and not-caught-up cases.
Maybe everybody will end up setting the two delays the same, but I think
we don't have enough experience to decide that for them now.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] caught_up status in walsender

2010-06-02 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 Uh, if the slave is overloaded, *any* implementation will be playing
 catchup all the time.  Not sure about your point here.

Well, sorry, I missed the part where the catchup is measured on the
walsender side of things. Now that means that an uninterrupted flow of
queries quicker than the walsender delay (100ms, right?) on the slave
would certainly leave enough room for it to stay current.

Well, that also depends on the time it takes to replay the WAL.

I'm trying to decide if exposing this 100ms magic setup (or something
else) would allow for a lot more control with respect to what "being
overloaded" means.

Say you set the 100ms delay to any value that you know (from testing and
log_min_duration_statement, say) just a little higher than the slowest
query you typically run on the slave. If that reduces the chances of
ever playing catch-up (as soon as there's no unexpected slow query
there), that would be great.

If you can manage to make sense of this, I'm interested in hearing how
far it is from reality.

Regards,
-- 
dim

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be

2010-06-02 Thread Robert Haas
On Mon, May 17, 2010 at 4:33 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sat, May 15, 2010 at 3:20 AM, Robert Haas robertmh...@gmail.com wrote:
 Hmm, OK, I think that makes sense.  Would you care to propose a patch?

 Yep. Here is the patch.

 This patch distinguishes normal shutdown from unexpected exit, while the
 server is in recovery. That is, when smart or fast shutdown is requested
 during recovery, the bgwriter sets ControlFile->state to the newly-introduced
 DB_SHUTDOWNED_IN_RECOVERY state.

 When recovery starts from the DB_SHUTDOWNED_IN_RECOVERY state, the startup
 process emits

    LOG:  database system was shut down in recovery at 2010-05-12 20:35:24 EDT

 instead of

    LOG:  database system was interrupted while in recovery at log
 time 2010-05-12 20:35:24 EDT
    HINT:  If this has occurred more than once some data might be
 corrupted and you might need to choose an earlier recovery target.

Heikki and I discussed this over IM today and came away with two questions.

First, is it appropriate to set the control file state to
DB_SHUTDOWNED_IN_RECOVERY even when we're in crash recovery (as
opposed to archive recovery/SR)?  My vote is no, but Heikki thought it
might be OK.

Second, one of the places where this patch updates the control file
immediately follows a call to UpdateMinRecoveryPoint().  That can lead
to fsync-ing the control file twice in a row.  Should we worry about
this or just let it go?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] caught_up status in walsender

2010-06-02 Thread Robert Haas
On Wed, Jun 2, 2010 at 2:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wrote:
 I'm still inclined to apply the part of Simon's patch that adds a
 transmit timestamp to each SR send chunk.  That would actually be
 completely unused by the slave given my proposal above, but I think that
 it is an important step to take to future-proof the SR protocol against
 possible changes in the slave-side timing logic.

 On further contemplation, it seems like the protocol needs another field
 besides that: each record should also carry a boolean indicating whether
 walsender.c thinks it is currently caught up, ie the record carries
 all WAL data up to the current end of WAL.  If the sender is not caught
 up, then the receiver should apply max_archive_delay not
 max_streaming_delay.  In this way, WAL chunks that are a little bit
 behind current time will be treated the same way whether they come
 across the SR link or via the archive.

I'm not sure that makes sense.  I thought the point of separating the
archive and streaming cases was that the same timeout wouldn't
necessarily be correct for a 16MB WAL file as it would for a 16-byte
WAL record you've just received.  IOW, you might want
max_archive_delay > max_streaming_delay.  The original proposal also
seems somewhat easier to understand, to me at least.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How to pass around collation information

2010-06-02 Thread Robert Haas
On Wed, Jun 2, 2010 at 3:46 PM, Peter Eisentraut pete...@gmx.net wrote:
 On fre, 2010-05-28 at 20:59 +0300, Peter Eisentraut wrote:
 The feature I'm thinking of is what
 people might call per-column locale, and the SQL standard defines
 that.  It would look like this:

 CREATE TABLE test (
     a varchar COLLATE de,
     b varchar COLLATE fr
 );

 SELECT * FROM test WHERE a  'baz' ORDER BY b;

 Perhaps it's also worth pointing out there could be use cases other than
 supporting multiple natural languages.  For example, it is frequently
 requested to be able to sort in ways that don't ignore special
 characters, binary sort, or perhaps a special file name sort that treats
 '/' specially in some way.  So it could be quite useful to be able to say

 CREATE TABLE something (
    description text COLLATE en,
    code char(6) COLLATE binary,
    file text COLLATE filename_sort
 );

 or even something like

 CREATE DOMAIN filename AS text COLLATE filename_sort;

 These are examples where having the collation attached to the column
 would appear to make more sense than having it attached only to
 operations.

But in the end the only purpose of setting it on a column is to set
which one will be used for operations on that column.  And the user
might still override it for a particular query.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be

2010-06-02 Thread Heikki Linnakangas

On 02/06/10 23:50, Robert Haas wrote:

First, is it appropriate to set the control file state to
DB_SHUTDOWNED_IN_RECOVERY even when we're in crash recovery (as
opposed to archive recovery/SR)?  My vote is no, but Heikki thought it
might be OK.


My logic on that is:

If the database is known to be in good shape, i.e. not corrupt, after 
shutdown during crash recovery, then we should not print the warning at 
restart saying "This probably means that some data is corrupted". 
There's no reason to believe the database is corrupt if it's a 
controlled shutdown, so setting control file state to 
DB_SHUTDOWNED_IN_RECOVERY is OK. But if it's not OK for some reason, 
then we really shouldn't allow the shutdown in the first place until we 
hit the end of WAL.


So the option "allow shutdown, but warn at restart that your data is 
probably corrupt" does not make sense in any case.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be

2010-06-02 Thread Robert Haas
On Wed, Jun 2, 2010 at 5:39 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 02/06/10 23:50, Robert Haas wrote:

 First, is it appropriate to set the control file state to
 DB_SHUTDOWNED_IN_RECOVERY even when we're in crash recovery (as
 opposed to archive recovery/SR)?  My vote is no, but Heikki thought it
 might be OK.

 My logic on that is:

 If the database is known to be in good shape, i.e not corrupt, after
 shutdown during crash recovery, then we should not print the warning at
 restart saying This probably means that some data is corrupted. There's no
 reason to believe the database is corrupt if it's a controlled shutdown, so
 setting control file state to DB_SHUTDOWNED_IN_RECOVERY is OK. But if it's
 not OK for some reason, then we really shouldn't allow the shut down in the
 first place until we hit the end of WAL.

 So the option allow shutdown, but warn at restart that your data is
 probably corrupt does not make sense in any case.

Well, the point is, we emit that message every time we go to recover
from a crash.  Presumably the message is as valid after a restart of
crash recovery as it was the first time around.

thinks

But maybe the message isn't right the first time either.  After all
the point of having a write-ahead log in the first place is that we
should be able to prevent corruption in the event of an unexpected
shutdown.  Maybe the right thing to do is to forget about adding a new
state and just remove or change the errhint from these messages:

ereport(LOG,
        (errmsg("database system was interrupted while in recovery at %s",
                str_time(ControlFile->time)),
         errhint("This probably means that some data is corrupted and"
                 " you will have to use the last backup for recovery.")));

ereport(LOG,
        (errmsg("database system was interrupted while in recovery at log time %s",
                str_time(ControlFile->checkPointCopy.time)),
         errhint("If this has occurred more than once some data might be"
                 " corrupted and you might need to choose an earlier"
                 " recovery target.")));

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Greg Stark
On Wed, Jun 2, 2010 at 8:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Indeed, but nothing we do can prevent that, if the slave is just plain
 slower than the master.  You have to assume that the slave is capable of
 keeping up in the absence of query-caused delays, or you're hosed.

I was assuming the walreceiver only requests more wal in relatively
small chunks and only when replay has caught up and needs more data. I
haven't actually read this code so if that assumption is wrong then
I'm off-base. But if my assumption is right then it's not merely the
master running faster than the slave that can cause you to fall
arbitrarily far behind. The receive time is delayed by how long the
slave waited on a previous lock, so if we wait for a few minutes, then
proceed and find we need more WAL data, we'll read data from the master
that it could have generated those few minutes ago.


 The sticky point is that once in a blue moon you do have a conflicting
 query sitting on a buffer lock for a long time, or even more likely a
 series of queries keeping the WAL replay process from obtaining buffer
 cleanup lock.

So I think this isn't necessarily such a blue moon event. As I
understand it, all it would take is a single long-running report and a
vacuum or HOT cleanup occurring on the master. If I want to set
max_standby_delay to 60min to allow reports to run for up to an hour
then any random HOT cleanup on the master will propagate to the slave
and cause a WAL stall until the transactions which have that xid in
their snapshot end.

Even the buffer locks could potentially be blocked for a long time
if you happen to run the right kind of query (say a nested loop with
the buffer in question on the outside and a large table on the
inside). That's a rarer event though; is that what you were thinking
of?


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 I was assuming the walreceiver only requests more wal in relatively
 small chunks and only when replay has caught up and needs more data. I
 haven't actually read this code so if that assumption is wrong then
 I'm off-base.

It is off-base.  The receiver does not request data, the sender is
what determines how much WAL is sent when.

 So I think this isn't necessarily such a blue moon event. As I
 understand it, all it would take is a single long-running report and a
 vacuum or HOT cleanup occurring on the master.

I think this is mostly FUD too.  How often do you see vacuum blocked for
an hour now?  It probably can happen, which is why we need to be able to
kick queries off the locks with max_standby_delay, but it's far from
common.  What we're concerned about here is how long a buffer lock on a
single page is held, not how long heavyweight locks are held.  The
normal hold times are measured in microseconds.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages

2010-06-02 Thread Bruce Momjian
Alvaro Herrera wrote:
 Excerpts from Robert Haas's message of mié jun 02 14:16:33 -0400 2010:
 
  We could, but I think we'd be better off just freezing at the time we
  mark the page PD_ALL_VISIBLE and then using the visibility map for
  both purposes.  Keeping around the old xmin values after every tuple
  on the page is visible to every running transaction is useful only for
  forensics, and building a whole new freeze map just to retain that
  information longer (and eventually force a massive anti-wraparound
  vacuum) seems like overkill.
 
 Reducing the xid wraparound horizon a bit is reasonable, but moving it
 all the way forward to OldestXmin is a bit much, methinks.
 
 Besides, there's another argument for not freezing tuples immediately:
 they may be updated shortly thereafter, causing extra churn for no gain.
 
 I'd prefer a setting that would tell the system to freeze all tuples
 that fall within a safety range whenever any tuple in the page is frozen
 -- weren't you working on a patch to do this?  (was it Jeff Davis?)
 
 (BTW maybe instead of separate visibility and freeze maps we could have
 two bits in the visibility map?)

Yeah, the two-bits idea was suggested during the conversation core had
about the issue.
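
If we went that way, the encoding could be as simple as (names purely
illustrative, nothing like this exists yet):

    /* hypothetical: two bits per heap page in the visibility map */
    #define VM_ALL_VISIBLE  0x01    /* every tuple visible to all snapshots */
    #define VM_ALL_FROZEN   0x02    /* every tuple also frozen; skippable for wraparound */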

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Exposing the Xact commit order to the user

2010-06-02 Thread Greg Stark
On Wed, Jun 2, 2010 at 6:45 PM, Chris Browne cbbro...@acm.org wrote:
 It would make it easy to conclude:

   This next transaction did 8328194 updates.  Maybe we should do
   some kind of checkpoint (e.g. - commit transaction or such) before
   working on it.

    versus

   This transaction we're thinking of working on had 7 updates.  No
   big deal...

I'm puzzled how you would define this value. How do you add 7 inserts,
7 deletes, and 7 updates? Is that 21 rows modified? Why are the 7
inserts and 7 deletes worth twice as much as the 7 updates when
they're basically the same thing? What if the inserts fired triggers
which inserted 7 more rows, is that 14? What if the 7 updates modified
2 TB of TOAST data but the 8238194 updates were all to the same record
and they were all HOT updates so all it did was change 8kB?

In any case you'll have all the actual data from your triggers or
hooks or whatever so what value does having the system keep track of
this add?



-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow wal_keep_segments to keep all segments

2010-06-02 Thread Bruce Momjian
Simon Riggs wrote:
 On Wed, 2010-06-02 at 15:20 -0400, Bruce Momjian wrote:
 
   The attached patch allows wal_keep_segments = -1 to keep all segments; 
  this is particularly useful for taking a base backup, where you need all
  the WAL files during startup of the standby.  I have documented this
  usage in the patch as well.
  
  I am thinking of applying this after 9.0 beta2 if there is no objection.
 
 It's not clear to me why keep all files until server breaks is a good
 setting. Surely you would set this parameter to the size of your disk.
 Why allow it to go higher?

Well, the -1 allows them to set it temporarily without having to compute
their free disk space.  Frankly, because the free disk space varies, it is
impossible to know exactly how much room will be left at the time it would
fill up.

I think the normal computation would be:

1) How long is my file system backup and restore to standby
   going to take
2) How often do I generate a 16MB WAL file

You would do some computation to figure that out, then maybe multiply it
by 10x and set that for wal_keep_segments.  I figured allowing a simple
-1 would be easier.
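
(As a rough worked example of that computation: if the backup plus restore
window is 30 minutes and the master produces one 16MB segment every 10
seconds, that is about 180 segments; padding by 10x gives
wal_keep_segments = 1800, i.e. roughly 28GB of pg_xlog, which is exactly
the sort of guesswork the -1 setting avoids.)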

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow wal_keep_segments to keep all segments

2010-06-02 Thread Bruce Momjian
Robert Haas wrote:
  The attached patch allows wal_keep_segments = -1 to keep all segments;
  this is particularly useful for taking a base backup, where you need all
  the WAL files during startup of the standby.  I have documented this
  usage in the patch as well.
 
  I am thinking of applying this after 9.0 beta2 if there is no objection.
 
 +1 for the patch, but why wait until after beta2?

I wanted to give people enough time to review/discuss it.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Greg Stark gsst...@mit.edu writes:
  So I think this isn't necessarily such a blue moon event. As I
  understand it, all it would take is a single long-running report and a
  vacuum or HOT cleanup occurring on the master.
 
 I think this is mostly FUD too.  How often do you see vacuum blocked for
 an hour now?  It probably can happen, which is why we need to be able to
 kick queries off the locks with max_standby_delay, but it's far from
 common.  What we're concerned about here is how long a buffer lock on a
 single page is held, not how long heavyweight locks are held.  The
 normal hold times are measured in microseconds.

Maybe I'm missing something, but I think Greg's point was that if you
have a long-running query running against the standby/slave/whatever,
which is holding locks on various relations to implement that report,
and then a vacuum or HOT update happens on the master, the long-running
report query will get killed off unless you bump max_streaming_delay up
pretty high (eg: 60 mins).

That being said, I'm not sure that there's really another solution.
Yes, in this case, the slave can end up being an hour behind, but that's
the trade-off you have to make if you want to run an hour-long query on
the slave.  The other answer is to make the master not update those
tuples, etc, which might be possible by starting a transaction on the
master which grabs things enough to prevent the vacuum/hot/etc update
from happening.  That may be possible manually, but it's not fun and it
certainly isn't something we'll have built-in support for in 9.0.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] ALTER TABLE .... make constraint DEFERRABLE

2010-06-02 Thread Bruce Momjian
Simon Riggs wrote:
 
 Deferrable unique constraints seem an interesting feature, though I have
 either some questions or some issues, not sure which.
 
 I don't seem to be able to find any way to do an ALTER TABLE that adds
 this new capability to an existing table.

I was able to do it:

test=> create table test (x int unique DEFERRABLE INITIALLY DEFERRED);
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "test_x_key"
for table "test"
CREATE TABLE

test=> alter table test add column y int;
ALTER TABLE

test=> alter table test add unique (y) DEFERRABLE INITIALLY DEFERRED;
NOTICE:  ALTER TABLE / ADD UNIQUE will create implicit index
"test_y_key" for table "test"
ALTER TABLE

Is that what you were asking?
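
For anyone following along, the point of deferring the check is that
transient duplicates within a transaction are allowed; the classic example
is swapping two values:

    test=> BEGIN;
    test=> INSERT INTO test (y) VALUES (1), (2);
    test=> UPDATE test SET y = 3 - y;   -- duplicates may exist mid-statement
    test=> COMMIT;                      -- the deferred unique check fires here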

 There is no way to add a constraint via a CREATE TABLE AS SELECT, so
 that means there is no way to use the feature at all in that case.

Uh, CREATE TABLE AS SELECT seems to be very limited, but I have not
heard any complaints about it before.

 Also, foreign keys can't be defined that refer to a deferrable primary
 key. That isn't mentioned at all in the manual with regard to the
 DEFERRABLE clause, though it is mentioned in the FK section. You get
 this error message
 ERROR:  cannot use a deferrable unique constraint for referenced table
 
 The use case for this feature looks a little narrow at present. Can we
 do something about usability?

Not sure why that was a limitation.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #5487: dblink failed with 63 bytes connection names

2010-06-02 Thread Takahiro Itagaki

Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:

 Well, looking at the callers, most if not all of them seem to actually 
 pass a palloc'd copy, allocated by text_to_cstring().
 
 Should we throw a NOTICE like vanilla truncate_identifier() does?
 
 I feel it would be better to just call truncate_identifier() than roll 
 your own. Assuming we want the same semantics in dblink, we'll otherwise 
 have to remember to update truncate_identifier_copy() with any changes 
 to truncate_identifier().

That makes sense. Now I use truncate_identifier(warn=true) for the fix.
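
For reference, the call ends up looking roughly like this (conname_text
stands for whatever text argument dblink received; the exact call sites in
the attached patch may differ):

    /* the copy from text_to_cstring() is palloc'd, so in-place truncation is fine */
    char   *conname = text_to_cstring(conname_text);

    truncate_identifier(conname, strlen(conname), true);   /* warn => NOTICE */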

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center



dblink_63bytes-2010-603.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be

2010-06-02 Thread Florian Pflug
On Jun 3, 2010, at 0:58 , Robert Haas wrote:
 But maybe the message isn't right the first time either.  After all
 the point of having a write-ahead log in the first place is that we
 should be able to prevent corruption in the event of an unexpected
 shutdown.  Maybe the right thing to do is to forget about adding a new
 state and just remove or change the errhint from these messages:

You've fallen prey to a (very common) misinterpretation of this message. It is 
not about corruption *caused* by a crash during recovery, it's about corruption 
*causing* the crash.

I'm not in favor of getting rid of that message entirely, since it produces a 
worthwhile hint if the crash was really caused by corrupt data. But it 
desperately needs a better wording that makes cause and effect perfectly clear. 
That even you misread it conclusively proves that.

How about
"If this has happened repeatedly and without manual intervention, it was 
probably caused by corrupted data and you may need to restore from backup"
for the crash recovery case and
"If this has happened repeatedly and without manual intervention, it was 
probably caused by corrupted data and you may need to choose an earlier 
recovery target"
for the PITR case.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be

2010-06-02 Thread Robert Haas
On Wed, Jun 2, 2010 at 9:07 PM, Florian Pflug f...@phlo.org wrote:
 On Jun 3, 2010, at 0:58 , Robert Haas wrote:
 But maybe the message isn't right the first time either.  After all
 the point of having a write-ahead log in the first place is that we
 should be able to prevent corruption in the event of an unexpected
 shutdown.  Maybe the right thing to do is to forget about adding a new
 state and just remove or change the errhint from these messages:

 You've fallen prey to a (very common) misinterpretation of this message. It 
 is not about corruption *caused* by a crash during recovery, it's about 
 corruption *causing* the crash.

 I'm not in favor of getting rid of that message entirely, since it produces a 
 worthwhile hint if the crash was really caused by corrupt data. But it 
 desperately needs a better wording that makes cause and effect perfectly 
 clear. That even you misread it conclusively proves that.

 How about
 If this has happened repeatedly and without manual intervention, it was 
 probably caused by corrupted data and you may need to restore from backup
 for the crash recovery case and
 If this has happened repeatedly and without manual intervention, it was 
 probably caused by corrupted data and you may need to choose an earlier 
 recovery target
 for the PITR case.

Oh.  Well, if that's the case, then I guess I lean toward applying the
patch as-is.  Then there's no need for the "and without manual
intervention" caveat.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] current value support

2010-06-02 Thread Michael Paquier
Hi all,

Please find attached a patch that finishes the support for currval.
All the structure was in place in GTM but there was still one missing call
in sequence.c when calling the function.

Now it is possible to get current values for sequences in the whole cluster.

Regards,

-- 
Michael Paquier
NIPPON TELEGRAPH AND
TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/src/backend/access/transam/gtm.c b/src/backend/access/transam/gtm.c
index c3cb72b..3be8d20 100644
--- a/src/backend/access/transam/gtm.c
+++ b/src/backend/access/transam/gtm.c
@@ -172,7 +172,7 @@ GetSnapshotGTM(GlobalTransactionId gxid, bool canbe_grouped)
 }
 
 
-/**
+/*
  * Create a sequence on the GTM.
  *
  * 
@@ -189,7 +189,30 @@ CreateSequenceGTM(char *seqname, GTM_Sequence increment, GTM_Sequence minval,
 	return conn ? open_sequence(conn, seqkey, increment, minval, maxval, startval, cycle) : 0;
 }
 
-/**
+/*
+ * get the current sequence value
+ */
+
+GTM_Sequence
+GetCurrentValGTM(char *seqname)
+{
+	GTM_Sequence ret = -1;
+	GTM_SequenceKeyData seqkey;
+	CheckConnection();
+	seqkey.gsk_keylen = strlen(seqname);
+	seqkey.gsk_key = seqname;
+
+	if (conn)
+		ret =  get_current(conn, seqkey);
+	if (ret  0)
+	{
+		CloseGTM();
+		InitGTM();
+	}
+	return ret;
+}
+
+/*
  * Get the next sequence value
  */
 GTM_Sequence
@@ -211,7 +234,7 @@ GetNextValGTM(char *seqname)
 	return ret;
 }
 
-/**
+/*
  * Drop the sequence
  */
 int
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index ba9a932..5df50a3 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -533,8 +533,8 @@ nextval_internal(Oid relid)
 		result = (int64) GetNextValGTM(RelationGetRelationName(seqrel));
 		if (result  0)
 			ereport(ERROR,
-(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg(GTM error, could not obtain sequence value)));
+	(errcode(ERRCODE_CONNECTION_FAILURE),
+	 errmsg(GTM error, could not obtain sequence value)));
 	} else
 	{
 #endif
@@ -714,6 +714,19 @@ currval_oid(PG_FUNCTION_ARGS)
 	/* open and AccessShareLock sequence */
 	init_sequence(relid, elm, seqrel);
 
+#ifdef PGXC
+	if (IS_PGXC_COORDINATOR)
+	{
+		result = (int64) GetCurrentValGTM(RelationGetRelationName(seqrel));
+		if (result  0)
+			ereport(ERROR,
+	(errcode(ERRCODE_CONNECTION_FAILURE),
+	 errmsg(GTM error, could not obtain sequence value)));
+	}
+	else
+	{
+#endif
+
 	if (pg_class_aclcheck(elm-relid, GetUserId(), ACL_SELECT) != ACLCHECK_OK 
 		pg_class_aclcheck(elm-relid, GetUserId(), ACL_USAGE) != ACLCHECK_OK)
 		ereport(ERROR,
@@ -729,6 +742,10 @@ currval_oid(PG_FUNCTION_ARGS)
 
 	result = elm-last;
 
+#ifdef PGXC
+	}
+#endif
+
 	relation_close(seqrel, NoLock);
 
 	PG_RETURN_INT64(result);
diff --git a/src/include/access/gtm.h b/src/include/access/gtm.h
index 66ca3f1..88de4d8 100644
--- a/src/include/access/gtm.h
+++ b/src/include/access/gtm.h
@@ -25,6 +25,9 @@ extern GlobalTransactionId BeginTranAutovacuumGTM(void);
 extern int CommitTranGTM(GlobalTransactionId gxid);
 extern int RollbackTranGTM(GlobalTransactionId gxid);
 extern GTM_Snapshot GetSnapshotGTM(GlobalTransactionId gxid, bool canbe_grouped);
+
+/* Sequence interface APIs with GTM */
+extern GTM_Sequence GetCurrentValGTM(char *seqname);
 extern GTM_Sequence GetNextValGTM(char *seqname);
 extern int CreateSequenceGTM(char *seqname, GTM_Sequence increment, 
 		GTM_Sequence minval, GTM_Sequence maxval, GTM_Sequence startval,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Comments on Exclusion Constraints and related datatypes

2010-06-02 Thread Bruce Momjian
David Fetter wrote:
 On Mon, Mar 22, 2010 at 04:04:16PM -0300, Alvaro Herrera wrote:
  David Fetter wrote:
  
   diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
   index 9881ff4..9313112 100644
   --- a/doc/src/sgml/func.sgml
   +++ b/doc/src/sgml/func.sgml
   @@ -7134,7 +7134,7 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 
   'yellow', 'green', 'blue', 'purple
   /row
   row
entry literalamp;amp;/literal /entry
   -entryOverlaps?/entry
   +entryOverlaps?  One point in common makes this true./entry
entryliteralbox '((0,0),(1,1))' amp;amp; box 
   '((0,0),(2,2))'/literal/entry
   /row
   row
  
  Hmm, how does this look in horizontal space?  (The row makes me think
  it's a table.)
 
 Looks OK to me.  The entry above, Closest point to first operand on
 second operand is actually wider.

Patch applied.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages

2010-06-02 Thread Robert Haas
On Wed, Jun 2, 2010 at 3:10 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Robert Haas's message of mié jun 02 14:16:33 -0400 2010:

 We could, but I think we'd be better off just freezing at the time we
 mark the page PD_ALL_VISIBLE and then using the visibility map for
 both purposes.  Keeping around the old xmin values after every tuple
 on the page is visible to every running transaction is useful only for
 forensics, and building a whole new freeze map just to retain that
 information longer (and eventually force a massive anti-wraparound
 vacuum) seems like overkill.

 Reducing the xid wraparound horizon a bit is reasonable, but moving it
 all the way forward to OldestXmin is a bit much, methinks.

Why?  If it's just for forensics, those are some pretty expensive
forensics - it eventually costs you an additional complete rewrite of
every page.

 Besides, there's another argument for not freezing tuples immediately:
 they may be updated shortly thereafter, causing extra churn for no gain.

But if you were going to update PD_ALL_VISIBLE, then you were going to
write the page anyway.  You might as well freeze everything at the
same time so you don't have to come back.

Alternatively, you could do what I suggested upthread and just believe
PD_ALL_VISIBLE over the individual tuple xmins.  Then you don't have
to freeze the page until it's next written, but you still get to keep
your forensic info.
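
A sketch of the "believe PD_ALL_VISIBLE" idea, hand-waving over the real
HeapTupleSatisfies* machinery:

    /* hypothetical short-circuit in a visibility check */
    Page    page = BufferGetPage(buffer);

    if (PageIsAllVisible(page))
        return true;    /* the page-level bit wins; stale xmin values are ignored */

    /* otherwise fall through to the usual per-tuple xmin/xmax checks */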

 I'd prefer a setting that would tell the system to freeze all tuples
 that fall within a safety range whenever any tuple in the page is frozen
 -- weren't you working on a patch to do this?  (was it Jeff Davis?)

Not me.  I don't think that's going to help a whole lot, though.  In
many of the painful scenarios, every tuple on the page will have the
same XID, and therefore they'll all be frozen at the same time anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Comments on Exclusion Constraints and related datatypes

2010-06-02 Thread Bruce Momjian
Greg Stark wrote:
 On Mon, Mar 22, 2010 at 1:15 PM, Simon Riggs si...@2ndquadrant.com wrote:
 
  * Circles, Boxes and other geometric datatypes defined overlaps to
  include touching shapes. So
 
  * inet datatypes don't have a commutative operator on which a unique
  index can be built. There is no overlaps equivalent, which again is a
  shame because that stops them being used with the new feature.
 
 I think our unusual data types are one of the strong points of
 Postgres but they're missing a lot of operators and opclasses to make
 them really useful.
 
 There's no reason we couldn't have separate overlaps and
overlaps-internally operators just like we have <=,>= and <,>. And it
 would be nice to flesh out the network data type more fully, perhaps
 merging in as much of ip4r as makes sense.

Added to TODO:

Add overlaps geometric operators that ignore point overlaps

* http://archives.postgresql.org/pgsql-hackers/2010-03/msg00861.php 

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Comments on Exclusion Constraints and related datatypes

2010-06-02 Thread Bruce Momjian
Robert Haas wrote:
  * inet datatypes don't have a commutative operator on which a unique
  index can be built. There is no overlaps equivalent, which again is a
  shame because that stops them being used with the new feature.
 
 This would be a nice thing to fix, and I was thinking about doing it,
 but I just ran out of time.  I think it can be left for 9.1.  I have
 not infrequently wanted to build an IP allocation database, and this
 would be perfect for that.

Added to TODO:

Add INET overlaps operator, for use by exclusion constraints

* http://archives.postgresql.org/pgsql-hackers/2010-03/msg00845.php 

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be

2010-06-02 Thread Florian Pflug
On Jun 3, 2010, at 3:31 , Robert Haas wrote:
 On Wed, Jun 2, 2010 at 9:07 PM, Florian Pflug f...@phlo.org wrote:
 On Jun 3, 2010, at 0:58 , Robert Haas wrote:
 But maybe the message isn't right the first time either.  After all
 the point of having a write-ahead log in the first place is that we
 should be able to prevent corruption in the event of an unexpected
 shutdown.  Maybe the right thing to do is to forget about adding a new
 state and just remove or change the errhint from these messages:
 
 You've fallen prey to a (very common) misinterpretation of this message. It 
 is not about corruption *caused* by a crash during recovery, it's about 
 corruption *causing* the crash.
 
 I'm not in favor of getting rid of that message entirely, since it produces a 
 worthwhile hint if the crash was really caused by corrupt data. But it 
 desperately needs a better wording that makes cause and effect perfectly 
 clear. That even you misread it conclusively proves that.
 
 How about
 If this has happened repeatedly and without manual intervention, it was 
 probably caused by corrupted data and you may need to restore from backup
 for the crash recovery case and
 If this has happened repeatedly and without manual intervention, it was 
 probably caused by corrupted data and you may need to choose an earlier 
 recovery target
 for the PITR case.
 
 Oh.  Well, if that's the case, then I guess I lean toward applying the
 patch as-is.  Then there's no need for the caveat and without manual
 intervention.

That still leaves the messages awfully ambiguous concerning the cause (data 
corruption) and the effect (crash during recovery).

How about
"If this has occurred more than once, it is probably caused by corrupt data and 
you have to use the latest backup for recovery"
for the crash recovery case and
"If this has occurred more than once, it is probably caused by corrupt data and 
you have to choose an earlier recovery target"
for the PITR case.

I don't see why currently only the PITR case includes the "more than once" 
clause. It's probably supposed to prevent unnecessarily alarming the user if the 
crash was in fact a stray SIGKILL or an out-of-memory condition, which seems 
equally likely in both cases.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] A tackle to the leaky VIEWs for RLS

2010-06-02 Thread KaiGai Kohei
(2010/06/02 12:02), KaiGai Kohei wrote:
 Here's another thought.  If we're leaning toward explicit syntax to
 designate security views (and I do mean IF, since only one person has
 signed on to that, even if it is Tom Lane!), then maybe we should
 think about ripping out the logic that causes regular views to be
 evaluated using the credentials of the view owner rather than the
 person selecting from it.  A security view would still use that logic,
 plus whatever additional stuff we come up with to prevent leakage.
 Perhaps this would be viewed as a nasty backward compatibility break,
 but the upside is that we'd then be being absolutely clear that a
 non-security view isn't and can never be trusted to be a security
 barrier.  Right now we're shipping something that purports to act as a
 barrier but really doesn't.

 
 Sorry, should we clarify the purpose of the explicit syntax for security
 views being discussed now?
 In my understanding, for security views, the planner tries to check the
 privileges of the person selecting it to reference the underlying tables
 without any ereport. If violated, the planner prevents user-given
 quals (perhaps FuncExpr only?) from getting inside the join scan.
 Otherwise, for regular views, the planner works as is. Right?
 

I'd like to check up the problem in details.

We can sort out a view into three types, as follows:
(1) A view which cannot be pulled up
(2) A view which can be pulled up, but does not contain any joins
(3) A view which can be pulled up, and contains joins.

For the type (1), we don't need to handle anything special, because
no external quals can come in.

For the type (2), we also don't need to handle anything special
except for the evaluation order issue in ExecQual(), because it is
impossible to distribute external quals to just a part of the relations.

So, the type (3) is all we have to consider here. Right?


Then, where should we distinguish a security view from others?
At least, it should be decided at pull_up_subqueries() or earlier,
because it merges subqueries into the upper query.
In subquery_planner(), pull_up_subqueries() is called just after
inline_set_returning_functions(), which flattens RTE_FUNCTION entries
where possible. It does not seem strange to check privileges
on the underlying relations in pull_up_subqueries(), because
inline_set_returning_functions() also checks ACL_EXECUTE here on the
functions to be flattened.


Then, once we can identify which views are security views and which are not,
that fact needs to be marked somewhere. Maybe on the FromExpr of the
pulled-up subquery?


My current idea is that when a qual node containing a FuncExpr references
only a subset of the relations within a security view, its set of referenced
relations will be expanded to the whole security view at
distribute_qual_to_rels().
That will prevent a user-defined function from being pushed down into the
inside of a security view.


At least, it seems reasonable to me to implement.
Shall I start developing a proof-of-concept patch?

Thanks,
-- 
KaiGai Kohei kai...@ak.jp.nec.com



Re: [HACKERS] recovery getting interrupted is not so unusual as it used to be

2010-06-02 Thread Robert Haas
On Wed, Jun 2, 2010 at 10:34 PM, Florian Pflug f...@phlo.org wrote:
 Oh.  Well, if that's the case, then I guess I lean toward applying the
 patch as-is.  Then there's no need for the caveat "and without manual
 intervention".

 That still leaves the messages awfully ambiguous concerning the cause (data 
 corruption) and the effect (crash during recovery).

 How about
 "If this has occurred more than once, it is probably caused by corrupt data
 and you have to use the latest backup for recovery"
 for the crash recovery case and
 "If this has occurred more than once, it is probably caused by corrupt data
 and you have to choose an earlier recovery target"
 for the PITR case.

 I don't see why currently only the PITR case includes the "more than once"
 clause. It's probably supposed to prevent unnecessarily alarming the user if
 the crash was in fact a stray SIGKILL or an out-of-memory condition, which
 seems equally likely in both cases.

I've applied the patch for now - we can fix the wording of the other
messages with a follow-on patch if we agree on what they should say.
I don't like the use of the phrase "you have to", particularly...  I
would tend to leave the archive recovery message alone and change the
crash recovery message to be more like it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] current value support

2010-06-02 Thread Michael Paquier
Sorry about my previous email,
I sent a patch to the wrong mailing list.

Once again, my apologies for that.
Regards,

Michael P.


Re: [HACKERS] Allow wal_keep_segments to keep all segments

2010-06-02 Thread Simon Riggs
On Wed, 2010-06-02 at 20:28 -0400, Bruce Momjian wrote:
 Simon Riggs wrote:
  On Wed, 2010-06-02 at 15:20 -0400, Bruce Momjian wrote:
  
   The attached patch allows wal_keep_segments = -1 to keep all segments;
   this is particularly useful for taking a base backup, where you need all
   the WAL files during startup of the standby.  I have documented this
   usage in the patch as well.
   
   I am thinking of applying this after 9.0 beta2 if there is no objection.
  
  It's not clear to me why "keep all files until the server breaks" is a good
  setting. Surely you would set this parameter to the size of your disk.
  Why allow it to go higher?
 
 Well, the -1 allows them to set it temporarily without having to compute
 their free disk space.  Frankly, because free disk space varies, it is
 impossible to know exactly how much space will be available at the time it
 would fill up.
 
 I think the normal computation would be:
 
   1) How long is my file system backup and restore to standby
  going to take
   2) How often do I generate a 16MB WAL file
 
 You would do some computation to figure that out, then maybe multiply it
 by 10x and set that for wal_keep_segments.  I figured allowing a simple
 -1 would be easier.
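
(As a back-of-the-envelope illustration of that computation; every number
below is invented purely for the example, not taken from the thread:

  -- ~120 minutes for base backup + restore, one 16MB segment every ~30
  -- seconds, times a 10x safety margin => 2400 segments (about 37.5GB)
  SELECT ceil((120 * 60) / 30.0) * 10 AS wal_keep_segments;
)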

I think it's much easier to find out your free disk space than it is to
calculate how much WAL might be generated during backup. Disk space
doesn't vary significantly on a production database.

If we encourage that laziness then we will get reports that replication
doesn't work and Postgres crashes.

-- 
 Simon Riggs   www.2ndQuadrant.com

