Re: [HACKERS] Online base backup from the hot-standby

2011-08-16 Thread Jun Ishiduka

   * Not addressed yet
  
    * full_page_write = off
      - If the primary is full_page_write = off, archive recovery may
        not act normally. Therefore the standby may need to check from
        the WAL whether full_page_write = off.
 
  Isn't having a standby enough to make full_page_write = on in all
  cases (bypassing the configuration)?
 
  What's the meaning?
 
 Yeah.  full_page_writes is a WAL generation parameter.  Standbys don't
 generate WAL.  I think you just have to insist that the master has it
 on.

Thanks. 

This has the following two problems.
 * pg_start_backup() must set full_page_writes to 'on' on the master that
   actually writes the WAL, not on the standby.
 * The standby is not necessarily connected to the master that actually
   writes the WAL.
   (Ex. Standby2 in cascading replication: Master - Standby1 - Standby2)

I'm not sure how to solve these problems.

Regards.




Jun Ishizuka
NTT Software Corporation
TEL:045-317-7018
E-Mail: ishizuka@po.ntts.co.jp






Re: [HACKERS] Backup's from standby

2011-08-16 Thread senthilnathan
Thanks for your reply.

What will happen if you issue *checkpoint* at the STANDBY? I presume that
it will flush the data to disk.

Will there be any conflict with the master's WAL (like the checkpoint
location...)?

Senthil





Re: [HACKERS] WIP: Fast GiST index build

2011-08-16 Thread Alexander Korotkov
I found that I forgot to remove levelstep and buffersize from reloptions.c.
Updated patch is attached.

--
With best regards,
Alexander Korotkov.


gist_fast_build-0.14.1.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] some missing internationalization in pg_basebackup

2011-08-16 Thread Peter Eisentraut
On ons, 2011-08-10 at 11:39 +0200, Magnus Hagander wrote:
 On Tue, Aug 9, 2011 at 13:38, Peter Eisentraut pete...@gmx.net wrote:
  I noticed that the progress reporting code in pg_basebackup does not
  allow for translation.  This would normally be easy to fix, but this
  code has a number of tricky issues, including the INT64_FORMAT, possibly
  some plural concerns, and some space alignment issues that are hidden
  in some of those hardcoded numbers.

 Maybe it can/should be rewritten not to try to do all those things in
 one step, thus making it easier somehow? I'm afraid I don't know
 enough about the translation system to know exactly what would go all
 the way there, though..

I've fixed this now.

During testing, I noticed that the final kB number always overshoots the
projected total by 1 kB, e.g.,

52763/52762 kB (100%), 1/1 tablespace

This happens even in a trivial test (base backup of the regression test
database, for example), so is it possible that there is some counting
bug in the code?  To the user, it could look as though postgres can't count.





Re: [HACKERS] walprotocol.h vs frontends

2011-08-16 Thread Magnus Hagander
On Tue, Aug 16, 2011 at 00:05, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, Aug 15, 2011 at 10:32 PM, Magnus Hagander mag...@hagander.net wrote:

 At present the WALSender only sends from one file at a time, so
 sending a message when we open a new file would be straightforward.

 Are you sure? We can receive a single message spanning multiple files...

 You're right. That was the way the original code ran, but I thought we
 had stopped that when we introduced MAX_SEND_SIZE. The comment says
 "don't cross a logfile boundary within one message". What that
 actually does is prevent us from incrementing the logid value, which
 happens every 255 files. I read that as meaning "WAL file", which is
 not what it means at all.

That has bitten me many times. Log file != WAL file... It's not exactly
intuitive.


 So right now what we do is allow a single packet to span multiple
 files, but since MAX_SEND_SIZE is 128KB it will always be smaller than
 a single file, so we can only ever span two files at most.

Unless someone has tweaked their xlog segment size to insane values.

I ended up writing a loop that can deal with it spanning >2 files as
well - but that also turned out to be less code than the special case
for 2 files, so I'll be keeping that :-)


 That is all just a little bizarre, especially since libpq sends data
 in 8KB chunks anyway.

 So I suggest we change XLogSend() so that it only ever sends up to the
 end of a file. That way all 'w' messages will relate to just one file,
 and we can have another message to initiate a new file. And then, as
 you say, give full metadata for the new file.

That could be an idea in itself. It is not necessary for the log
receiver, since it can use the #define FRONTEND hack (which only works
if you build inside the source tree of course, not if it's an
extension) but it might make the protocol simpler to deal with for
third parties who want to do something similar.
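
For illustration, the cap itself would be tiny - something like this
(a sketch only: the variable names are invented and the WAL position is
treated as a plain byte offset, which glosses over the real XLogSend()
bookkeeping):

	/*
	 * Sketch: clamp one send so a single 'w' message never crosses a
	 * WAL segment boundary.
	 */
	uint64		seg_left = XLogSegSize - (sendptr % XLogSegSize);
	uint64		nbytes = Min(MAX_SEND_SIZE, seg_left);

	/*
	 * Send nbytes starting at sendptr; whenever seg_left was the
	 * limiting factor, the next message starts exactly on a segment
	 * boundary, where the new-file metadata message could go.
	 */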


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] some missing internationalization in pg_basebackup

2011-08-16 Thread Magnus Hagander
On Tue, Aug 16, 2011 at 10:33, Peter Eisentraut pete...@gmx.net wrote:
 On ons, 2011-08-10 at 11:39 +0200, Magnus Hagander wrote:
 On Tue, Aug 9, 2011 at 13:38, Peter Eisentraut pete...@gmx.net wrote:
  I noticed that the progress reporting code in pg_basebackup does not
  allow for translation.  This would normally be easy to fix, but this
  code has a number of tricky issues, including the INT64_FORMAT, possibly
  some plural concerns, and some space alignment issues that are hidden
  in some of those hardcoded numbers.

 Maybe it can/should be rewritten not to try to do all those things in
 one step, thus making it easier somehow? I'm afraid I don't know
 enough about the translation system to know exactly what would go all
 the way there, though..

 I've fixed this now.

 During testing, I noticed that the final kB number always overshoots the
 projected total by 1 kB, e.g.,

 52763/52762 kB (100%), 1/1 tablespace

 This happens even in a trivial test (base backup of the regression test
 database, for example), so is it possible that there is some counting
 bug in the code?  To the user, it could look as though postgres can't count.

Hmm. I bet that's the backup label file. It doesn't exist in the data
directory anymore when we do these backups, but it is synthesized into
the stream. Thus it's not counted.

I wonder if we should bother counting it, or just adjust pg_basebackup
so that it always ends up at exactly 100% by the numbers as well in
cases like this.

Note that the progress indicator will *always* count wrong when you
choose to include WAL, since we just don't know how much WAL there
should be. I guess in this case we could just advance the "end"
counter as well as we go, making sure it doesn't shoot above 100%.
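
A sketch of that clamping (the variable names here are invented, not
the actual pg_basebackup code):

	/*
	 * Sketch: if streamed WAL pushes us past the estimated total,
	 * move the estimate, so the percentage tops out at exactly 100%.
	 */
	if (totaldone > totalsize)
		totalsize = totaldone;
	percent = (int) ((totaldone * 100) / totalsize);	/* totalsize > 0 */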

Does that seem reasonable?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] walprotocol.h vs frontends

2011-08-16 Thread Peter Eisentraut
On mån, 2011-08-15 at 18:39 +0100, Peter Geoghegan wrote:
  If you want to upgrade a system running 8.3 (that uses float-based
  timestamps) using pg_upgrade, you must compile 9.0 (or 8.4 or 9.1)
  with --disable-integer-datetimes.  If at some point in the future you
  then want to upgrade to 9.2 with pg_upgrade, you will again need to
  build 9.2 with --disable-integer-datetimes.  If we remove the code
  for floating point representations of datetime then you won't be
  able to do that.
 
 I'm pretty surprised that pg_upgrade pushes that onus onto its users -
 for many users, the need to build their own binaries is a greater
 barrier to upgrading than doing a logical restore. Maybe that's simply
 considered a matter for package managers to worry about, but that
 doesn't sit well with me.

Well, pg_upgrade only moves the heap files, it doesn't look into them or
change them.

Possibly, this sort of issue could be better handled in the future by
making this a cluster, database, or table flag instead of a compile-time
option.  That way, at least newly created things could move to the new
recommended behavior.  The way it is set up now, we will possibly never
get rid of the legacy behavior, unless we break pg_upgrade at some
point.





[HACKERS] Some problems about cascading replication

2011-08-16 Thread Fujii Masao
Hi,

When I tested the PITR on git master with max_wal_senders > 0,
I found that the following inappropriate log message was always
output even though cascading replication is not in progress. Attached
patch fixes this problem.

LOG:  terminating all walsender processes to force cascaded
standby(s) to update timeline and reconnect

When making the patch, I found another problem about cascading
replication; When promoting a cascading standby, postmaster sends
SIGUSR2 to any cascading walsenders to kill them. But there is a
corner-case where such a walsender fails to receive SIGUSR2 and
survives a standby promotion unexpectedly. This happens when
postmaster sends SIGUSR2 before the walsender marks itself as
a WAL sender, because postmaster sends SIGUSR2 to only the
processes marked as a WAL sender.

To avoid the corner-case, I changed walsender so that it checks
whether recovery is in progress or not again after marking itself
as a WAL sender. If recovery is not in progress even though the
walsender is a cascading one, it does the same thing as SIGUSR2
signal handler does, and then exits later. Attached patch also includes
this fix.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index c0a32a3..2ec39dd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2328,7 +2328,7 @@ reaper(SIGNAL_ARGS)
 			 * XXX should avoid the need for disconnection. When we do,
 			 * am_cascading_walsender should be replaced with RecoveryInProgress()
 			 */
-			if (max_wal_senders > 0)
+			if (max_wal_senders > 0 && CountChildren(BACKEND_TYPE_WALSND) > 0)
 			{
 ereport(LOG,
 		(errmsg("terminating all walsender processes to force cascaded standby(s) to update timeline and reconnect")));
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0eadf64..ef6894c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -368,6 +368,22 @@ StartReplication(StartReplicationCmd *cmd)
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
 
 	/*
+	 * When promoting a cascading standby, postmaster sends SIGUSR2 to
+	 * any cascading walsenders to kill them. But there is a corner-case where
+	 * such walsender fails to receive SIGUSR2 and survives a standby promotion
+	 * unexpectedly. This happens when postmaster sends SIGUSR2 before
+	 * the walsender marks itself as a WAL sender, because postmaster sends
+	 * SIGUSR2 to only the processes marked as a WAL sender.
+	 *
+	 * To avoid this corner-case, if recovery is NOT in progress even though
+	 * the walsender is a cascading one, we do the same thing as the SIGUSR2
+	 * signal handler does, i.e., set walsender_ready_to_stop to true, which
+	 * causes the walsender to end later.
+	 */
+	if (am_cascading_walsender && !RecoveryInProgress())
+		walsender_ready_to_stop = true;
+
+	/*
 	 * We assume here that we're logging enough information in the WAL for
 	 * log-shipping, since this is checked in PostmasterMain().
 	 *



Re: [HACKERS] Re: Should we have an optional limit on the recursion depth of recursive CTEs?

2011-08-16 Thread Magnus Hagander
On Mon, Aug 15, 2011 at 23:49, Greg Stark st...@mit.edu wrote:
 On Mon, Aug 15, 2011 at 9:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ... and that would be a seriously bad API.  There are not SUSET
 restrictions on other resources such as work_mem.  Why do we need
 one for this?

 I think a better analogy would be imposing a maximum number of rows a
 query can output. That might be a sane thing to have for some
 circumstances but it's not useful in general.

Uh. You mean like LIMIT, which we already have?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



[HACKERS] Allowing same cursor name in nested levels

2011-08-16 Thread Jeevan Chalke
Hi Tom,

While going through a few test cases, I found that we cannot have two
open cursors with the same name even though they are in two different
functions. Here is what I mean:

1. I have two functions func1 and func2.
2. func1 calls func2
3. Both have a cursor with the same name, say mycursor
4. Somehow I forgot closing it
5. Executing func1 throws an error: cursor "mycursor" already in use


Is this expected behavior?

I dug around in the code and appended a cursor count to the cursor name
to allow the same cursor name at different nesting levels. I have run
make check and didn't find any side-effects other than one expected
output change.

PFA, patch for the same.

Please let me know your views / suggestions.

Thanks

Test-case:
===

CREATE OR REPLACE FUNCTION func1() RETURNS void AS $$
DECLARE
 res int;
 mycursor CURSOR FOR SELECT 'foo';
BEGIN
 OPEN mycursor;
 SELECT func2() INTO res;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION func2() RETURNS int AS $$
DECLARE
 mycursor CURSOR FOR SELECT 'bar';
BEGIN
 OPEN mycursor;
 return 1;
END;
$$ LANGUAGE plpgsql;

SELECT func1();
ERROR:  cursor "mycursor" already in use
CONTEXT:  PL/pgSQL function "func2" line 5 at OPEN
SQL statement "SELECT func2()"
PL/pgSQL function "func1" line 7 at SQL statement




-- 
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the
individual or entity to whom it is addressed. This message contains
information from EnterpriseDB Corporation that may be privileged,
confidential, or exempt from disclosure under applicable law. If you are not
the intended recipient or authorized to receive this for the intended
recipient, any use, dissemination, distribution, retention, archiving, or
copying of this communication is strictly prohibited. If you have received
this e-mail in error, please notify the sender immediately by reply e-mail
and delete this message.
diff --git a/src/pl/plpgsql/src/gram.y b/src/pl/plpgsql/src/gram.y
index 92b54dd..46e7442 100644
--- a/src/pl/plpgsql/src/gram.y
+++ b/src/pl/plpgsql/src/gram.y
@@ -509,6 +509,7 @@ decl_statement	: decl_varname decl_const decl_datatype decl_collate decl_notnull
 		char		buf[1024];
 		char		*cp1;
 		char		*cp2;
+		static unsigned int cursor_count = 0;
 
 		/* pop local namespace for cursor args */
 		plpgsql_ns_pop();
@@ -539,7 +540,13 @@ decl_statement	: decl_varname decl_const decl_datatype decl_collate decl_notnull
 *cp2++ = *cp1;
 			*cp2++ = *cp1++;
 		}
-		strcpy(cp2, "'::pg_catalog.refcursor");
+
+		/*
+		 * Append a cursor count after the cursor name to avoid
+		 * conflicts between same-named cursors at different
+		 * nesting levels.
+		 */
+		snprintf(cp2, sizeof(buf) - (cp2 - buf), ":%u'::pg_catalog.refcursor", cursor_count++);
 		curname_def-query = pstrdup(buf);
 		new-default_val = curname_def;
 
diff --git a/src/test/regress/expected/plpgsql.out b/src/test/regress/expected/plpgsql.out
index bed34c8..4bdccf3 100644
--- a/src/test/regress/expected/plpgsql.out
+++ b/src/test/regress/expected/plpgsql.out
@@ -3284,15 +3284,15 @@ begin
 end;
 $$ language plpgsql;
 select forc01();
-NOTICE:  5 from c
-NOTICE:  6 from c
-NOTICE:  7 from c
-NOTICE:  9 from c
-NOTICE:  10 from c
-NOTICE:  41 from c2
-NOTICE:  42 from c2
-NOTICE:  43 from c2
-NOTICE:  after loop, c2 = c2
+NOTICE:  5 from c:5
+NOTICE:  6 from c:5
+NOTICE:  7 from c:5
+NOTICE:  9 from c:5
+NOTICE:  10 from c:5
+NOTICE:  41 from c2:6
+NOTICE:  42 from c2:6
+NOTICE:  43 from c2:6
+NOTICE:  after loop, c2 = c2:6
 NOTICE:  41 from special_name
 NOTICE:  42 from special_name
 NOTICE:  43 from special_name



Re: [HACKERS] some missing internationalization in pg_basebackup

2011-08-16 Thread Dimitri Fontaine
Magnus Hagander mag...@hagander.net writes:
 Note that the progress indicator will *always* count wrong when you
 choose to include WAL, since we just don't know how much WAL there
 should be. I guess in this case we could just advance the end
 counter as well as we go, making sure it doesn't shoot above 100%.

 Does that seem reasonable?

Yes.  If you're interested in the progress being made and the goal is
moving, it's better to know about that.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] pg_stat_replication vs StandbyReplyMessage

2011-08-16 Thread Magnus Hagander
On Mon, Aug 15, 2011 at 13:50, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Aug 15, 2011 at 7:03 AM, Magnus Hagander mag...@hagander.net wrote:
 The pg_stat_replication view exposes all the fields in
 StandbyReplyMessage *except* for the timestamp when the message was
 generated. On an active system this is not all that interesting, but
 on a mostly idle system that allows the monitoring to react faster
 than the timeout that actually kicks the other end off - and could be
 useful in manual debugging scenarios. Any particular reason why this
 was not exposed as its own column?

 I wondered the same thing.  Sounds like a good idea.

I can go do that. Care to argue^Wbikeshed for a specific name?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] index-only scans

2011-08-16 Thread Anssi Kääriäinen

On 08/14/2011 12:31 AM, Heikki Linnakangas wrote:

The same idea could of course be used to calculate the effective cache hit 
ratio for each table. Cache hit ratio would have the problem of feedback loops, 
though.

Yeah, I'm not excited about making the planner and statistics more
dynamic. Feedback loops and plan instability are not fun.

I might be a little out of my league here... But I was thinking about
the cache hit ratio and feedback loops. I understand automatic tuning 
would be hard. But making automatic tuning easier (by using pg_tune for 
example) would be a big plus for most use cases.


To make it easier to tune the page read costs automatically, it would be
nice if there were four variables instead of the current two:
  - random_page_cost is the cost of reading a random page from storage.
Currently it is not; it is the cost of accessing a random page, taking
into account that it might be in memory.

  - seq_page_cost is the cost of reading pages sequentially from storage
  - memory_page_cost is the cost of reading a page in memory
  - cache_hit_ratio is the expected cache hit ratio

memory_page_cost would be server global, random and seq page costs 
tablespace specific, and cache_hit_ratio relation specific. You would 
get the current behavior by tuning *_page_costs realistically, and 
setting cache_hit_ratio globally so that the expected random_page_cost / 
seq_page_cost stays the same as now.
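
To make that concrete, the blended cost the planner would effectively use
could be computed along these lines (a sketch with invented variable
names, not actual planner code):

	/*
	 * Sketch: blend storage and memory page costs by the expected
	 * cache hit ratio of the relation being scanned.
	 */
	double		effective_random_cost =
		cache_hit_ratio * memory_page_cost +
		(1.0 - cache_hit_ratio) * random_page_cost;
	double		effective_seq_cost =
		cache_hit_ratio * memory_page_cost +
		(1.0 - cache_hit_ratio) * seq_page_cost;

With the *_page_costs set to measured hardware values and cache_hit_ratio
set globally, the effective random/seq ratio can be made to match what
people tune today, which is the equivalence described above.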


The biggest advantage of this would be that the correct values are much
easier to detect automatically compared to the current situation. This can
be done using pg_statio_* views and IO speed testing. They should not be 
tuned automatically by PostgreSQL, at least not the cache_hit_ratio, as 
that leads to the possibility of feedback loops and plan instability. 
The variables would also be much easier to understand.


There is the question of whether one should be allowed to tune the
*_page_costs at all. If I am not missing something, it is possible to
detect the correct values programmatically, and they do not change if
you do not change the hardware. The cache hit ratio is the real reason
why they are currently so important for tuning.


An example of why the current random_page_cost and seq_page_cost tuning
is not adequate is that you can only set random_page_cost per tablespace.
That would make perfect sense if random_page_cost were the cost of
accessing a page in storage. But it is not; it is a combination of that
and caching effects, so it actually varies per relation (and over
time). How do you set it correctly for a query where one relation is
fully cached and another one not?


Another problem is that if you use random_page_cost == seq_page_cost,
you are effectively saying that everything is in cache. But if
everything is in cache, the cost of page access relative to the
cpu_*_costs is way off. The more random_page_cost and seq_page_cost
differ, the more they represent storage access costs. When they are the
same, they represent the memory page cost. There can be an order of
magnitude of difference between a storage page cost and a memory page
cost, so it is hard to tune the cpu_*_costs realistically for cases
where sometimes data is in cache and sometimes not.


Ok, enough hand waving for one post :) Sorry if this is all obvious /
discussed before. My googling didn't turn up anything directly related,
although these have some similarity:
 - Per-table random_page_cost for tables that we know are always cached
[http://archives.postgresql.org/pgsql-hackers/2008-04/msg01503.php]

 - Script to compute random page cost
[http://archives.postgresql.org/pgsql-hackers/2002-09/msg00503.php]

 - The science of optimization in practical terms?
[http://archives.postgresql.org/pgsql-hackers/2009-02/msg00718.php],
getting really interesting starting from here:
[http://archives.postgresql.org/pgsql-hackers/2009-02/msg00787.php]

 - Anssi




Re: [HACKERS] Re: Should we have an optional limit on the recursion depth of recursive CTEs?

2011-08-16 Thread Andrew Dunstan



On 08/16/2011 04:56 AM, Magnus Hagander wrote:

On Mon, Aug 15, 2011 at 23:49, Greg Stark st...@mit.edu wrote:

On Mon, Aug 15, 2011 at 9:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:

... and that would be a seriously bad API.  There are not SUSET
restrictions on other resources such as work_mem.  Why do we need
one for this?

I think a better analogy would be imposing a maximum number of rows a
query can output. That might be a sane thing to have for some
circumstances but it's not useful in general.

Uh. You mean like LIMIT, which we already have?


There is no LIMIT imposed on a query by a server setting, which would be 
the right analogy here.


cheers

andrew



Re: [HACKERS] WIP: Fast GiST index build

2011-08-16 Thread Heikki Linnakangas

Looking at the calculation of levelStep:


+   /*
+    * Calculate levelStep by the available amount of memory. We should
+    * be able to load into main memory one page of each underlying node
+    * buffer (which are levelStep levels below). That gives a constraint
+    * on maintenance_work_mem. Also we should be able to have the
+    * subtree of levelStep levels in cache. That gives a constraint on
+    * effective_cache_size.
+    *
+    * The i'th underlying level of a sub-tree can consist of at most
+    * maxIndexTuplesPerPage^i pages. So, a subtree of levelStep levels
+    * can't be greater than 2 * maxIndexTuplesPerPage ^ levelStep pages.
+    * We keep some more in reserve because we probably can't take the
+    * whole effective cache, and use the formula
+    * 4 * maxIndexTuplesPerPage ^ levelStep <= effectiveCache. We use
+    * similar logic with maintenance_work_mem: we should be able to
+    * store at least the last pages of all buffers that we are emptying
+    * the current buffer to.
+    */
+   effectiveMemory = Min(maintenance_work_mem * 1024 / BLCKSZ,
+ effective_cache_size);
+   levelStep = (int) log((double) effectiveMemory / 4.0) /
+   log((double) maxIndexTuplesPerPage);
+


I can see that that's equal to the formula given in the paper, 
log_B(M/4B), but I couldn't see any explanation for that formula in the 
paper. Your explanation makes sense, but where did it come from?


It seems a bit pessimistic: while it's true that a subtree can't be
larger than 2 * maxIndexTuplesPerPage ^ levelStep, you can put a tighter 
upper bound on it. The max size of a subtree of depth n can be 
calculated as the geometric series:


r^0 + r^1 + r^2 + ... + r^n = (1 - r^(n + 1)) / (1 - r)

where r = maxIndexTuplesPerPage. For r=2 those formulas are equal, but 
for a large r and small n (which is typical), the 2 * 
maxIndexTuplesPerPage^levelStep formula gives a value that's almost 
twice as large as the real max size of a subtree.
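
To put numbers on that: picking r = 256 for illustration and levelStep
n = 2, the loose bound gives 2 * 256^2 = 131072 pages, while the
geometric sum gives (256^3 - 1) / 255 = 65793 pages - the factor-of-two
overestimate described above.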


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



[HACKERS] non-ipv6 vs hostnames

2011-08-16 Thread Magnus Hagander
Accidentally specifying an IPv6 address in pg_hba.conf on a system
that doesn't have ipv6 support gives the following error:

LOG:  specifying both host name and CIDR mask is invalid: "::1/128"


Which is obviously wrong, because I didn't do that. Do we need to
detect and special-case ipv6 addresses in this case?

FWIW, the line was simply:
host    replication    all    ::1/128    trust

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Backup's from standby

2011-08-16 Thread Robert Haas
On Tue, Aug 16, 2011 at 2:34 AM, senthilnathan
senthilnatha...@gmail.com wrote:
 Thanks for your reply.

 What will happen if you issue *checkpoint* at the STANDBY? I presume that
 it will flush the data to disk.

It will perform a restartpoint.

http://www.postgresql.org/docs/9.0/static/wal-configuration.html

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Backup's from standby

2011-08-16 Thread Simon Riggs
On Tue, Aug 16, 2011 at 7:34 AM, senthilnathan
senthilnatha...@gmail.com wrote:
 Thanks for your reply.

 What will happen if you issue *checkpoint* at the STANDBY? I presume that
 it will flush the data to disk.

Yes. On the standby that is known as a restartpoint.

 Will there be any conflict with the master's WAL (like the checkpoint location...)?

No, there is no effect on the master. The restartpoint happens as a
background process on the standby, so replay will continue while we
restart.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] pg_stat_replication vs StandbyReplyMessage

2011-08-16 Thread Simon Riggs
On Tue, Aug 16, 2011 at 10:51 AM, Magnus Hagander mag...@hagander.net wrote:
 On Mon, Aug 15, 2011 at 13:50, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Aug 15, 2011 at 7:03 AM, Magnus Hagander mag...@hagander.net wrote:
 The pg_stat_replication view exposes all the fields in
 StandbyReplyMessage *except* for the timestamp when the message was
 generated. On an active system this is not all that interesting, but
 on a mostly idle system that allows the monitoring to react faster
 than the timeout that actually kicks the other end off - and could be
 useful in manual debugging scenarios. Any particular reason why this
 was not exposed as its own column?

 I wondered the same thing.  Sounds like a good idea.

 I can go do that. Care to argue^Wbikeshed for a specific name?

reply_timestamp

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] walprotocol.h vs frontends

2011-08-16 Thread Simon Riggs
On Tue, Aug 16, 2011 at 9:35 AM, Magnus Hagander mag...@hagander.net wrote:

 So right now what we do is allow a single packet to span multiple
 files, but since MAX_SEND_SIZE is 128KB it will always be smaller than
 a single file, so we can only ever span two files at most.

 Unless someone has tweaked their xlog segment size to insane values.

We should limit MAX_SEND_SIZE to be the same as or less than WAL_SEGMENT_SIZE.

We gain nothing by continuing to allow the existing code to work in
the way it does.
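
The clamp itself is a one-liner (a sketch; a compile-time assertion
would arguably be better than silently shrinking the value):

	/* Sketch: never send more than one WAL segment per message */
	#define MAX_SEND_SIZE	Min(XLOG_SEG_SIZE, 128 * 1024)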

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] WIP: Fast GiST index build

2011-08-16 Thread Alexander Korotkov
On Tue, Aug 16, 2011 at 4:04 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 I can see that that's equal to the formula given in the paper, log_B(M/4B),
 but I couldn't see any explanation for that formula in the paper. Your
 explanation makes sense, but where did it come from?

I didn't find it either. But it has to reserve memory for both the
sub-tree and the active buffers. We're reserving memory for the sub-tree
against effective_cache_size, and memory for the last pages of the
buffers against maintenance_work_mem.


 It seems a bit pessimistic: while it's true that a subtree can't be
 larger than 2 * maxIndexTuplesPerPage ^ levelStep, you can put a tighter
 upper bound on it. The max size of a subtree of depth n can be calculated as
 the geometric series:

 r^0 + r^1 + r^2 + ... + r^n = (1 - r^(n + 1)) / (1 - r)

 where r = maxIndexTuplesPerPage. For r=2 those formulas are equal, but for
 a large r and small n (which is typical), the
 2 * maxIndexTuplesPerPage ^ levelStep
 formula gives a value that's almost twice as large as the real max size of a
 subtree.

Thus, we can calculate:
levelstep = min(log_r(1 + effective_cache_size_in_pages * (r - 1)) - 1,
log_r(maintenance_work_mem_in_pages - 1))
and get a more precise result. But we also need at least a very rough
estimate of the memory occupied by the node buffer hash table and path
items.
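
Putting both constraints together, the calculation could look like this
(a sketch: the function and variable names are invented, and log_r(x) is
computed as log(x)/log(r)):

	#include <math.h>

	static int
	calc_level_step(double cache_pages, double maint_mem_pages, double r)
	{
		/*
		 * Largest n such that the worst-case subtree of
		 * (r^(n+1) - 1) / (r - 1) pages still fits in cache:
		 * n <= log_r(1 + cache_pages * (r - 1)) - 1.
		 */
		double		by_cache = log(1.0 + cache_pages * (r - 1.0)) / log(r) - 1.0;

		/*
		 * One last page per buffer one level-step down must fit in
		 * maintenance_work_mem: r^n <= maint_mem_pages - 1.
		 */
		double		by_mem = log(maint_mem_pages - 1.0) / log(r);

		return (int) floor(fmin(by_cache, by_mem));
	}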

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] src/backend/storage/ipc/README

2011-08-16 Thread Robert Haas
On Tue, Aug 16, 2011 at 12:37 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 $SUBJECT is wildly out-of-date.  Is there any point in keeping this,
 given the large (and actually correct) comment block near the top of
 sinvaladt.c?

 Huh, I never noticed that file before.  Yeah, it seems useless as it
 stands.  I wonder however if we could repurpose it as a slightly
 higher-level summary?  If you just look at sinval(adt).c, you'll find
 out all about how inval messages get transmitted, but nothing about why
 we need them in the first place.  There's some material about that over
 in src/backend/utils/cache/inval.c, which is not the easiest place to
 find if you're looking at storage/ipc/.

That's for sure.  IMHO, the fact that this functionality is broken up
between three source files in two completely separate parts of the
source tree is a mistake to begin with.  It appears to me to be an
attempt to create proper layers of abstraction, but in reality I think
it's just an obstacle to understanding and improving the code.  I
would suggest merging inval.c into sinval.c so that it's all in
src/backend/storage/ipc.

That having been said, some more documentation of what sinval is for
and what its restrictions are would be great.  I wonder if it
shouldn't go into a file called README.sinval or somesuch, though.
The functionality in that directory is fairly diverse and you could
conceivably end up with README.procarray, README.shmem, etc.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Some problems about cascading replication

2011-08-16 Thread Simon Riggs
On Tue, Aug 16, 2011 at 9:55 AM, Fujii Masao masao.fu...@gmail.com wrote:

 When I tested the PITR on git master with max_wal_senders > 0,
 I found that the following inappropriate log message was always
 output even though cascading replication is not in progress. Attached
 patch fixes this problem.

    LOG:  terminating all walsender processes to force cascaded
 standby(s) to update timeline and reconnect

 When making the patch, I found another problem about cascading
 replication; When promoting a cascading standby, postmaster sends
 SIGUSR2 to any cascading walsenders to kill them. But there is a
 corner-case where such a walsender fails to receive SIGUSR2 and
 survives a standby promotion unexpectedly. This happens when
 postmaster sends SIGUSR2 before the walsender marks itself as
 a WAL sender, because postmaster sends SIGUSR2 to only the
 processes marked as a WAL sender.

 To avoid the corner-case, I changed walsender so that it checks
 whether recovery is in progress or not again after marking itself
 as a WAL sender. If recovery is not in progress even though the
 walsender is a cascading one, it does the same thing as SIGUSR2
 signal handler does, and then exits later. Attached patch also includes
 this fix.

Looks like valid problems and appropriate fixes to me. Will commit.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] pg_stat_replication vs StandbyReplyMessage

2011-08-16 Thread Robert Haas
On Tue, Aug 16, 2011 at 8:54 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Aug 16, 2011 at 10:51 AM, Magnus Hagander mag...@hagander.net wrote:
 On Mon, Aug 15, 2011 at 13:50, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Aug 15, 2011 at 7:03 AM, Magnus Hagander mag...@hagander.net 
 wrote:
 The pg_stat_replication view exposes all the fields in
 StandbyReplyMessage *except* for the timestamp when the message was
 generated. On an active system this is not all that interesting, but
 on a mostly idle system that allows the monitoring to react faster
 than the timeout that actually kicks the other end off - and could be
 useful in manual debugging scenarios. Any particular reason why this
 was not exposed as its own column?

 I wondered the same thing.  Sounds like a good idea.

 I can go do that. Care to argue^Wbikeshed for a specific name?

 reply_timestamp

Works for me.  I'd suggest that we rename it that way in
StandbyReplyMessage, so that the name in the struct and the name in
the system view match.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Re: Should we have an optional limit on the recursion depth of recursive CTEs?

2011-08-16 Thread Robert Haas
On Tue, Aug 16, 2011 at 7:23 AM, Andrew Dunstan and...@dunslane.net wrote:
 There is no LIMIT imposed on a query by a server setting, which would be the
 right analogy here.

I am not sure I understand any of these analogies.  I think Peter's
point is that it's not very difficult to write (perhaps accidentally)
a CTE that goes into infinite recursion.  In general, we can't detect
that situation, because it's equivalent to the halting problem.  But
there's an old joke about a Turing test (where a computer program must
try to fool a human into believing that it is also human) where the
person asks the computer:

What would the following program do?
10 PRINT "HELLO"
20 GOTO 10

And gets back an infinite stream of HELLO HELLO HELLO HELLO HELLO

I don't think it's going to be feasible to implement a security
restriction that keeps untrusted users from hosing the machine with a
long running CTE; there are nearly infinitely many ways for an
untrusted user who can run queries to hose the machine, and plugging
one of them imperfectly is going to get us pretty much nowhere.  On
the other hand, there is perhaps a reasonable argument to be made that
we should cut off CTE processing at some point to prevent
*inadvertent* exhaustion of system resources.  Or even query
processing more generally.

In fact, we already have some things sort of like this: you can use
statement_timeout to kill queries that run for too long, and we just
recently added temp_file_limit to kill those that eat too much temp
file space.   I can see a good case for memory_limit and
query_cpu_limit and maybe some others.  cte_recursion_depth_limit
wouldn't be all that high on my personal list, I guess, but the
concept doesn't seem completely insane.
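
For what it's worth, the mechanics of such a limit would be tiny -
something like this hypothetical check in the recursive union's
iteration code (both the GUC and the counter field are invented here,
just to show the shape):

	/* Hypothetical sketch, not actual nodeRecursiveunion.c code */
	if (max_cte_recursion_depth > 0 &&
		++node->iteration_count > max_cte_recursion_depth)
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("CTE recursion depth limit exceeded")));

The policy question of whether that buys anything over statement_timeout
is the harder part.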

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Allowing same cursor name in nested levels

2011-08-16 Thread Tom Lane
Jeevan Chalke jeevan.cha...@enterprisedb.com writes:
 1. I have two functions func1 and func2.
 2. func1 calls func2
 3. Both has cursor with same name, say mycursor
 4. Somehow I forgot closing it
 5. executing func1 throws an error 'cursor mycursor already in use'

 Is this expected behavior???

Yes ... or at least, it's always been like that.

 I just mingled around the code and later appended a cursor count to the
 cursor name to allow same cursor name in nested levels.

That would break code that expects the cursor name to be what it said
it should be.  It is documented that you can refer to cursors by name
across multiple functions.

regards, tom lane



Re: [HACKERS] synchronized snapshots

2011-08-16 Thread Robert Haas
On Mon, Aug 15, 2011 at 6:46 PM, Joachim Wieland j...@mcknight.de wrote:
 On Mon, Aug 15, 2011 at 6:09 PM, Jim Nasby j...@nasby.net wrote:
 I suspect that all the other cases of BEGIN failing would be syntax
 errors, so you would immediately know in testing that something was
 wrong. A missing file is definitely not a syntax error, so we can't
 really depend on user testing to ensure this is handled correctly.
 IMO, that makes it critical that that error puts us in an aborted
 transaction.

 Why can we not just require the user to verify if his BEGIN query
 failed or succeeded?
 Is that really too much to ask for?

 Also see what Robert wrote about proxies in between that keep track of
 the transaction state. Consider they see a BEGIN query that fails. How
 would they know if the session is now in an aborted transaction or not
 in a transaction at all?

I think the point here is that we should be consistent.  Currently,
you can make BEGIN fail by doing it on the standby, and asking for
READ WRITE mode:

rhaas=# begin transaction read write;
ERROR:  cannot set transaction read-write mode during recovery

After doing that, you are NOT in a transaction context:

rhaas=# select 1;
 ?column? 
----------
        1
(1 row)

So whatever this does should be consistent with that, at least IMHO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_stat_replication vs StandbyReplyMessage

2011-08-16 Thread Simon Riggs
On Tue, Aug 16, 2011 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Aug 16, 2011 at 8:54 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Aug 16, 2011 at 10:51 AM, Magnus Hagander mag...@hagander.net 
 wrote:
 On Mon, Aug 15, 2011 at 13:50, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Aug 15, 2011 at 7:03 AM, Magnus Hagander mag...@hagander.net 
 wrote:
 The pg_stat_replication view exposes all the fields in
 StandbyReplyMessage *except* for the timestamp when the message was
 generated. On an active system this is not all that interesting, but
 on a mostly idle system that allows the monitoring to react faster
 than the timeout that actually kicks the other end off - and could be
 useful in manual debugging scenarios. Any particular reason why this
 was not exposed as its own column?

 I wondered the same thing.  Sounds like a good idea.

 I can go do that. Care to argue^Wbikeshed for a specific name?

 reply_timestamp

 Works for me.

 I'd suggest that we rename it that way in
 StandbyReplyMessage, so that the name in the struct and the name in
 the system view match.

-1

The field is named the same as the equivalent field in other messages.

The field on the view is a summary of all previous messages, which is
a different thing. Perhaps we should call it last_reply_timestamp to
make that clearer, though long titles are annoying.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] pg_stat_replication vs StandbyReplyMessage

2011-08-16 Thread Magnus Hagander
On Tue, Aug 16, 2011 at 16:00, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Aug 16, 2011 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Aug 16, 2011 at 8:54 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Aug 16, 2011 at 10:51 AM, Magnus Hagander mag...@hagander.net 
 wrote:
 On Mon, Aug 15, 2011 at 13:50, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Aug 15, 2011 at 7:03 AM, Magnus Hagander mag...@hagander.net 
 wrote:
 The pg_stat_replication view exposes all the fields in
 StandbyReplyMessage *except* for the timestamp when the message was
 generated. On an active system this is not all that interesting, but
 on a mostly idle system that allows the monitoring to react faster
 than the timeout that actually kicks the other end off - and could be
 useful in manual debugging scenarios. Any particular reason why this
 was not exposed as its own column?

 I wondered the same thing.  Sounds like a good idea.

 I can go do that. Care to argue^Wbikeshed for a specific name?

 reply_timestamp

 Works for me.

 I'd suggest that we rename it that way in
 StandbyReplyMessage, so that the name in the struct and the name in
 the system view match.

 -1

 The field is named the same as the equivalent field in other messages.

 The field on the view is a summary of all previous messages, which is
 a different thing. Perhaps we should call it last_reply_timestamp to
 make that clearer, though long titles are annoying.

We don't say last_replay_location either, we just say replay_location.
Adding the last_ part is just annoying.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] non-ipv6 vs hostnames

2011-08-16 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 Accidentally specifying an IPv6 address in pg_hba.conf on a system
 that doesn't have ipv6 support gives the following error:

  LOG:  specifying both host name and CIDR mask is invalid: "::1/128"

 Which is obviously wrong, because I didn't do that. Do we need to
 detect and special-case ipv6 addresses in this case?

Doesn't really seem worth going out of our way for that.  Systems with
no IPv6 support are a dying breed, and will be more so by the time 9.2
gets deployed.

regards, tom lane



Re: [HACKERS] non-ipv6 vs hostnames

2011-08-16 Thread Magnus Hagander
On Tue, Aug 16, 2011 at 16:12, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 Accidentally specifying an IPv6 address in pg_hba.conf on a system
 that doesn't have ipv6 support gives the following error:

  LOG:  specifying both host name and CIDR mask is invalid: "::1/128"

 Which is obviously wrong, because I didn't do that. Do we need to
 detect and special-case ipv6 addresses in this case?

 Doesn't really seem worth going out of our way for that.  Systems with
 no IPv6 support are a dying breed, and will be more so by the time 9.2
 gets deployed.

Well, I got this on a win64 build. It's *supposed* to have ipv6. I
wonder if it breaks on windows just because there is no ipv6 address
on the machine...

Unfortunately I shut the machine down and won't have time to test more
right now, but I'll try to figure that out later unless beaten to
it...


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Re: Should we have an optional limit on the recursion depth of recursive CTEs?

2011-08-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I don't think it's going to be feasible to implement a security
 restriction that keeps untrusted users from hosing the machine with a
 long running CTE; there are nearly infinitely many ways for an
 untrusted user who can run queries to hose the machine, and plugging
 one of them imperfectly is going to get us pretty much nowhere.  On
 the other hand, there is perhaps a reasonable argument to be made that
 we should cut off CTE processing at some point to prevent
 *inadvertent* exhaustion of system resources.  Or even query
 processing more generally.

Indeed: the real question here is why a recursive CTE is any worse
than, say, an accidentally unconstrained join (or three or four...).

However, we already have a perfectly suitable general mechanism for
that; it's called statement_timeout.

I think we've already had the discussion about whether there should be
a system-wide SUSET maximum statement_timeout, and rejected it on the
grounds that there was not a very clear need for it.

 In fact, we already have some things sort of like this: you can use
 statement_timeout to kill queries that run for too long, and we just
 recently added temp_file_limit to kill those that eat too much temp
 file space.   I can see a good case for memory_limit and
 query_cpu_limit and maybe some others.

temp_file_limit got accepted because it was constraining a resource not
closely related to run time.  I don't think that it provides a precedent
in support of any of these other ideas.

regards, tom lane



Re: [HACKERS] Re: Should we have an optional limit on the recursion depth of recursive CTEs?

2011-08-16 Thread Robert Haas
On Tue, Aug 16, 2011 at 10:26 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 In fact, we already have some things sort of like this: you can use
 statement_timeout to kill queries that run for too long, and we just
 recently added temp_file_limit to kill those that eat too much temp
 file space.   I can see a good case for memory_limit and
 query_cpu_limit and maybe some others.

 temp_file_limit got accepted because it was constraining a resource not
 closely related to run time.  I don't think that it provides a precedent
 in support of any of these other ideas.

Well, CPU usage might be somewhat closely related to query runtime,
but memory usage sure isn't.

But we digress from $SUBJECT...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] synchronized snapshots

2011-08-16 Thread Alvaro Herrera
Excerpts from Robert Haas's message of mar ago 16 09:59:04 -0400 2011:
 On Mon, Aug 15, 2011 at 6:46 PM, Joachim Wieland j...@mcknight.de wrote:

  Also see what Robert wrote about proxies in between that keep track
  of the transaction state. Consider they see a BEGIN query that
  fails. How would they know if the session is now in an aborted
  transaction or not in a transaction at all?
 
 I think the point here is that we should be consistent.  Currently,
 you can make BEGIN fail by doing it on the standby, and asking for
 READ WRITE mode:
 
 rhaas=# begin transaction read write;
 ERROR:  cannot set transaction read-write mode during recovery
 
 After doing that, you are NOT in a transaction context:
 
 rhaas=# select 1;
  ?column? 
 ----------
         1
 (1 row)
 
 So whatever this does should be consistent with that, at least IMHO.

I think we argued about a very similar problem years ago and the outcome
was that you should be left in an aborted transaction block; otherwise
running a dumb SQL script (which has no way to abort if it fails)
could wreak serious havoc (?).  I think this failure to behave in that
fashion on the standby is something to be fixed, not imitated.

What this says is that a driver or app seeing BEGIN fail should issue
ROLLBACK before going further -- which seems the intuitive way to behave
to me.  No?
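
In libpq terms the driver-side rule is cheap (a sketch, not code from
any particular driver):

	/*
	 * Sketch: after a failed BEGIN, roll back defensively. ROLLBACK
	 * outside a transaction only raises a WARNING, so this is safe
	 * under either server behavior.
	 */
	PGresult   *res = PQexec(conn, "BEGIN");

	if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
		PQclear(res);
		PQclear(PQexec(conn, "ROLLBACK"));
	}
	else
		PQclear(res);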

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] synchronized snapshots

2011-08-16 Thread Robert Haas
On Tue, Aug 16, 2011 at 10:43 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Robert Haas's message of mar ago 16 09:59:04 -0400 2011:
 On Mon, Aug 15, 2011 at 6:46 PM, Joachim Wieland j...@mcknight.de wrote:

  Also see what Robert wrote about proxies in between that keep track
  of the transaction state. Consider they see a BEGIN query that
  fails. How would they know if the session is now in an aborted
  transaction or not in a transaction at all?

 I think the point here is that we should be consistent.  Currently,
 you can make BEGIN fail by doing it on the standby, and asking for
 READ WRITE mode:

 rhaas=# begin transaction read write;
 ERROR:  cannot set transaction read-write mode during recovery

 After doing that, you are NOT in a transaction context:

 rhaas=# select 1;
  ?column?
 ----------
         1
 (1 row)

 So whatever this does should be consistent with that, at least IMHO.

 I think we argued about a very similar problem years ago and the outcome
 was that you should be left in an aborted transaction block; otherwise
 running a dumb SQL script (which has no way to abort if it fails)
 could wreak serious havoc (?).  I think this failure to behave in that
 fashion on the standby is something to be fixed, not imitated.

 What this says is that a driver or app seeing BEGIN fail should issue
 ROLLBACK before going further -- which seems the intuitive way to behave
 to me.  No?

Maybe.  But if we're going to change the behavior of BEGIN, then (1)
we need to think about backward compatibility and (2) we should change
it across the board.  It's not for this patch to go invent something
that's inconsistent with what we're already doing elsewhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_basebackup and wal streaming

2011-08-16 Thread Magnus Hagander
On Sat, Feb 26, 2011 at 20:59, Magnus Hagander mag...@hagander.net wrote:
 On Sat, Feb 26, 2011 at 20:48, Yeb Havinga yebhavi...@gmail.com wrote:
 On 2011-02-26 18:19, Magnus Hagander wrote:

 Attached is an updated version of the patch that includes these
 changes, as well as Windows support and an initial cut at a ref page
 for pg_receivexlog (needs some more detail still).

 I'm testing a bit more (with the previous version, sorry) and got the
 following while doing a stream backup from a cluster that was at that moment
 doing a pgbench run with 1 synchronous standby.

 mgrid@mg79:~$ pg_basebackup --xlog=stream -D /data -vP -h mg73 -U repuser
 Password:
 xlog start point: 15/72C8
 pg_basebackup: starting background WAL receiver
 pg_basebackup: got WAL data offset 14744, expected 16791960424
 5148915/5148026 kB (100%) 1/1 tablespaces
 xlog end point: 15/80568878
 pg_basebackup: waiting for background process to finish streaming...
 pg_basebackup: child process exited with error 1

 Hmm, strange. What platform are you on?

 I saw something similar *once* on Windows, but it then passed my tests
 a lot of times in a row so I figured it was just a "didn't clean
 properly" thing. Clearly there's a bug around.

 What's the size of the latest WAL file that it did work on? Is it
 16791960424 bytes? That's way, way too large, but perhaps it's not
 switching segments properly? (that value is supposedly the current
 write position in the file..)


 I'm in total monkey test mode here, so I don't even know if I'm not supposed
 to do the streaming variant while other stuff is going on.

 Oh yes, that's one of the main reasons to use it, so you should
 definitely be able to do that!


I've posted a new version of this patch at
http://archives.postgresql.org/pgsql-hackers/2011-08/msg00776.php -
forgot there was an open thread on this, sorry.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] plpython crash

2011-08-16 Thread Jean-Baptiste Quenot
After backporting plpython.c from HEAD, this is the error message I get:

ERROR:  key pg.dropped.6 not found in mapping
HINT:  To return null in a column, add the value None to the mapping
with the key named after the column.
CONTEXT:  while creating return value
PL/Python function myfunc

What does it mean?
-- 
Jean-Baptiste Quenot



Re: [HACKERS] Some problems about cascading replication

2011-08-16 Thread Heikki Linnakangas

On 16.08.2011 16:25, Simon Riggs wrote:

On Tue, Aug 16, 2011 at 9:55 AM, Fujii Masao masao.fu...@gmail.com wrote:


When I tested PITR on git master with max_wal_senders > 0,
I found that the following inappropriate log message was always
output even though cascading replication is not in progress. The
attached patch fixes this problem.

LOG:  terminating all walsender processes to force cascaded
standby(s) to update timeline and reconnect

When making the patch, I found another problem about cascading
replication; When promoting a cascading standby, postmaster sends
SIGUSR2 to any cascading walsenders to kill them. But there is a
orner-case where such walsender fails to receive SIGUSR2 and
survives a standby promotion unexpectedly. This happens when
postmaster sends SIGUSR2 before the walsender marks itself as
a WAL sender, because postmaster sends SIGUSR2 to only the
processes marked as a WAL sender.

To avoid the corner case, I changed walsender so that it re-checks
whether recovery is in progress after marking itself as a WAL sender.
If recovery is not in progress even though the walsender is a
cascading one, it does the same thing as the SIGUSR2 signal handler
does, and then exits later. The attached patch also includes
this fix.


Looks like valid problems and appropriate fixes to me. Will commit.


I think there's a race condition here. If a walsender is just starting 
up, it might not have registered itself as a walsender yet. That race 
was actually there even before this patch to suppress the log message.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] some missing internationalization in pg_basebackup

2011-08-16 Thread Magnus Hagander
On Tue, Aug 16, 2011 at 11:35, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 Magnus Hagander mag...@hagander.net writes:
 Note that the progress indicator will *always* count wrong when you
 choose to include WAL, since we just don't know how much WAL there
 should be. I guess in this case we could just advance the end
 counter as well as we go, making sure it doesn't shoot above 100%.

 Does that seem reasonable?

 Yes.  If you're interested into the progress being made and the goal is
 moving, it's better to know about that.

Done and applied.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] plpython crash

2011-08-16 Thread Jan Urbański
On 16/08/11 16:52, Jean-Baptiste Quenot wrote:
 After backporting plpython.c from HEAD, this is the error message I get:
 
 ERROR:  key pg.dropped.6 not found in mapping
 HINT:  To return null in a column, add the value None to the mapping
 with the key named after the column.
 CONTEXT:  while creating return value
 PL/Python function myfunc
 
 What does it mean?

Ah, interesting, I think that this means that you are returning a table
type and that table has a dropped column. The code should skip over
dropped columns, but apparently it does not and tries to find a value
for that column in the mapping you are returning.

I'll try to reproduce it here.

Jan



Re: [HACKERS] Re: Should we have an optional limit on the recursion depth of recursive CTEs?

2011-08-16 Thread Peter Geoghegan
On 16 August 2011 14:43, Robert Haas robertmh...@gmail.com wrote:
 What would the following program do?
 10 PRINT HELLO
 20 GOTO 10

 And gets back an infinite stream of HELLO HELLO HELLO HELLO HELLO

heh, that's pretty funny. It also complements my view, because the
Turing test is only failed because the human eventually thinks hmm,
he's taking way too long to get to the '...and so on, ad infinitum' bit.

 I don't think it's going to be feasible to implement a security
 restriction that keeps untrusted users from hosing the machine with a
 long running CTE; there are nearly infinitely many ways for an
 untrusted user who can run queries to hose the machine, and plugging
 one of them imperfectly is going to get us pretty much nowhere.

Unless that happens to be the exact area that is a problem for you,
due perhaps to a poorly written application. We're protecting against
Murphy, not Machiavelli - if your users are malicious, or are
motivated by seeing if they can somehow hose the machine for kicks,
clearly all bets are off. This mindset happens to pretty well meet the
needs of industry, IMHO. That said, I admit the case for making a
separate SUSET GUC is the least compelling one I've made on this
thread, if only because of the glaring inconsistency with other areas.

 On the other hand, there is perhaps a reasonable argument to be made that
 we should cut off CTE processing at some point to prevent
 *inadvertent* exhaustion of system resources.  Or even query
 processing more generally.

 In fact, we already have some things sort of like this: you can use
 statement_timeout to kill queries that run for too long, and we just
 recently added temp_file_limit to kill those that eat too much temp
 file space.

statement_timeout is far too blunt an instrument to deal with this
problem. For one thing, it may vary based on many external factors,
whereas number of iterations is a consistent, useful metric for the
WITH query in isolation. For another, it prevents the DBA from
managing known problems with deployed apps per database - maybe they
have a reporting query that is expected to take a really long time.
Sure, they can increase statement_timeout when that is run, but that's
another thing to remember.

 I can see a good case for memory_limit and
 query_cpu_limit and maybe some others.  cte_recursion_depth_limit
 wouldn't be all that high on my personal list, I guess, but the
 concept doesn't seem completely insane.

I agree that those things would be much better than this. This is
still a useful, easy-to-implement feature though.

On 16 August 2011 15:26, Tom Lane t...@sss.pgh.pa.us wrote:
 Indeed: the real question here is why a recursive CTE is any worse
 than, say, an accidentally unconstrained join (or three or four...).

It's much worse because an unconstrained join query will not
all of a sudden fail to have a terminating condition. It will, for the
most part, take forever or practically forever predictably and
consistently, even as the contents of tables change over time.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Online base backup from the hot-standby

2011-08-16 Thread Steve Singer
On 11-08-16 02:09 AM, Jun Ishiduka wrote:

 Thanks. 

 This has the following two problems.
  * pg_start_backup() must set 'on' to full_page_writes of the master that 
is actual writing of the WAL, but not the standby.
Is there any way to tell from the WAL segments whether they contain the
full page data? If so, could you verify this on the second slave when it
is brought up? Or could you track this on the first slave and produce an
error in either pg_start_backup() or pg_stop_backup()?
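
(Note that simply checking the GUC on the standby would not be enough:
SHOW reports only the standby's own local setting, not what the server
actually generating the WAL uses. For example,

postgres=# SHOW full_page_writes;
 full_page_writes
------------------
 on
(1 row)

can report 'on' even while the master runs with full_page_writes = off.)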

I see XLR_BKP_REMOVABLE in xlog.h; the comment above it says that this
flag indicates that the archiver can compress the full-page blocks to
non-full-page blocks. I am not familiar with where in the code
this actually happens, but will this cause issues if the first standby
is processing WAL files from the archive?


  * The standby doesn't need to connect to the master that's actual writing 
WAL.
(Ex. Standby2 in Cascade Replication: Master - Standby1 - Standby2)

 I'm worried how I should clear these problems.

 Regards.

 
 Jun Ishizuka
 NTT Software Corporation
 TEL:045-317-7018
 E-Mail: ishizuka@po.ntts.co.jp
 







Re: [HACKERS] plpython crash

2011-08-16 Thread Jan Urbański
On 16/08/11 17:06, Jan Urbański wrote:
 On 16/08/11 16:52, Jean-Baptiste Quenot wrote:
 After backporting plpython.c from HEAD, this is the error message I get:

 ERROR:  key pg.dropped.6 not found in mapping
 HINT:  To return null in a column, add the value None to the mapping
 with the key named after the column.
 CONTEXT:  while creating return value
 PL/Python function myfunc

 What does it mean?

I did a couple of simple tests and can't see how the code could fail to
skip dropped columns.

It seems like you're missing this commit:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=41282111e6cc73aca4b63dffe950ba7a63e4bd8a

Could you try running this query? (assuming your function is called
'myfunc')

select proname, relname, attname, attisdropped from pg_proc join
pg_class on (prorettype = reltype) join pg_attribute on (attrelid =
pg_class.oid) where proname = 'myfunc';

Cheers,
Jan



Re: [HACKERS] non-ipv6 vs hostnames

2011-08-16 Thread Peter Eisentraut
On tis, 2011-08-16 at 16:17 +0200, Magnus Hagander wrote:
 Well, I got this on a win64 build. It's *supposed* to have ipv6. I
 wonder if it breaks on windows just because there is no ipv6 address
 on the machine... 

It would mean that getaddrinfo() of ::1 failed.  That seems weird.




Re: [HACKERS] plpython crash

2011-08-16 Thread Jean-Baptiste Quenot
Dear Jan,

Sorry, I typed the wrong git commands.  With the latest plpython from
the master branch I get the same gdb backtrace as reported before.  I
managed to wrap up a testcase that fails 100% of the time on my setup:
https://gist.github.com/1149512

Hope it crashes on your side too :-)

This is the result on PG 9.0.4:
https://gist.github.com/1149543

This is the result on PG 9.0.4 with plpython.c backported from HEAD:
https://gist.github.com/1149558

Cheers,
-- 
Jean-Baptiste Quenot



Re: [HACKERS] plpython crash

2011-08-16 Thread Jan Urbański
On 16/08/11 19:07, Jean-Baptiste Quenot wrote:
 Dear Jan,
 
 Sorry I typed the wrong git commands.  With latest plpython from
 branch master I got the same gdb backtrace as reported before.  I
 managed to wrap up a testcase that fails 100% of times on my setup:
 https://gist.github.com/1149512
 
 Hope it crashes on your side too :-)

Awesome, it segfaults for me with HEAD ;)

Now it's just a simple matter of programming... I'll take a look at it
this evening.

Jan



Re: [HACKERS] WIP: Fast GiST index build

2011-08-16 Thread Heikki Linnakangas
Why is there ever a buffer on the root node? It seems like a waste of 
time to load N tuples from the heap into the root buffer, only to empty 
the buffer after it fills up. You might as well pull tuples directly 
from the heap.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] WIP: Fast GiST index build

2011-08-16 Thread Alexander Korotkov
On Tue, Aug 16, 2011 at 9:43 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 Why is there ever a buffer on the root node? It seems like a waste of time
 to load N tuples from the heap into the root buffer, only to empty the
 buffer after it fills up. You might as well pull tuples directly from the
 heap.

Yes, seems reasonable. The buffer on the root node was in the paper, but
now I don't see the need for it either.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: Fast GiST index build

2011-08-16 Thread Heikki Linnakangas

On 16.08.2011 21:46, Alexander Korotkov wrote:

On Tue, Aug 16, 2011 at 9:43 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


Why is there ever a buffer on the root node? It seems like a waste of time
to load N tuples from the heap into the root buffer, only to empty the
buffer after it fills up. You might as well pull tuples directly from the
heap.


Yes, seems reasonable. The buffer on the root node was in the paper, but
now I don't see the need for it either.


Here's an version of the patch with a bunch of minor changes:

* No more buffer on root node. Aside from the root buffer being 
pointless, this simplifies gistRelocateBuildBuffersOnSplit slightly as 
it doesn't need the special case for root block anymore.


* Moved the code to create new root item from gistplacetopage() to 
gistRelocateBuildBuffersOnSplit(). Seems better to keep the 
buffering-related code away from the normal codepath, for the sake of 
readability.


* Changed the levelStep calculation to use the more accurate upper bound 
on subtree size that we discussed.


* Changed the levelStep calculation so that it doesn't just take 
min(maintenance_work_mem, effective_cache_size) and calculate the 
levelStep from that. maintenance_work_mem determines the max number 
of page buffers that can be held in memory at a time, while 
effective_cache_size determines the max size of the subtree. Those are 
subtly different things.


* Renamed NodeBuffer to GISTNodeBuffer, to avoid cluttering the namespace

* Plus misc comment, whitespace, formatting and naming changes.

I think this patch is in pretty good shape now. Could you re-run the 
performance tests you have on the wiki page, please, to make sure the 
performance hasn't regressed? It would also be nice to get some testing 
of the levelStep and pagesPerBuffer estimates, and of the point where we 
switch to the buffering method. I'm particularly interested to know 
whether there are any corner cases, with very skewed data distributions 
or strange GUC settings, where the estimates fail badly.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] VACUUM FULL versus relcache init files

2011-08-16 Thread Robert Haas
On Mon, Aug 15, 2011 at 9:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 This might be the last bug from my concurrent-vacuum-full testing --- at
 least, I have no remaining unexplained events from about two full days
 of running the tests.  The ones that are left involve backends randomly
 failing like this:

 psql: FATAL:  could not read block 0 in file base/130532080/130545668: read 
 only 0 of 8192 bytes

 usually during startup, though I have one example of a backend being
 repeatedly unable to access pg_proc due to similar errors.

 I believe what this traces to is stale relfilenode information taken
 from relcache init files, which contain precomputed relcache data
 intended to speed up backend startup.  There is a curious little dance
 between write_relcache_init_file and RelationCacheInitFileInvalidate
 that is intended to ensure that when a process creates a new relcache
 init file that is already stale (because someone else invalidated the
 information concurrently), the bad file will get unlinked and not used.
 I think I invented that logic, so it's my fault that it doesn't work.

 It works fine as long as you consider only the two processes directly
 involved; but a third process can get fooled into using stale data.
 The scenario requires two successive invalidations, as for example from
 vacuum full on two system catalogs in a row, plus a stream of incoming
 new backends.  It goes like this:

 1. Process A vacuums a system catalog, unlinks init file, sends sinval
 messages, does second unlink (which does nothing).

 2. Process B starts, observes lack of init file, begins to construct
 a new one.  It gets to the end of AcceptInvalidationMessages in
 write_relcache_init_file.  Since it started after A sent the first
 sinvals, it sees no incoming sinval messages and has no reason to think
 its new init file isn't good.

 3. Meanwhile, Process A vacuums another system catalog, unlinks init
 file (doing nothing), and finally sends its sinval messages just after
 B looked for them.  Now it will block trying to get RelCacheInitLock,
 which B holds.

 4. Process B renames its already-stale init file into place, then
 releases RelCacheInitLock.

 5. Process A gets the lock and removes the stale init file.

 Now process B is okay, because it will see A's second sinvals before it
 tries to make any use of the relcache data it has.  And the stale init
 file is definitely gone after step 5.

 However, between steps 4 and 5 there is a window for Process C to start,
 read the stale init file, and attempt to use it.  Since C started after
 A's second set of sinval messages, it doesn't see them and doesn't know
 it has stale data.

 As far as I can see at the moment, the only way to make this bulletproof
 is to turn both creation and deletion of the init file into atomic
 operations that include sinval messaging.  What I have in mind is

 Creator: must take RelCacheInitLock, check for incoming invals, rename
 the new file into place if none, release RelCacheInitLock.  (This is
 the same as what it does now.)

 Destroyer: must take RelCacheInitLock, unlink the init file, send its
 sinvals, release RelCacheInitLock.

 This guarantees that we serialize the sending of the sinval messages
 so that anyone who sees a bad init file in place *must* see the sinval
 messages afterwards, so long as they join the sinval messaging ring
 before looking for the init file (which they do).  I don't think it's
 any worse than the current scheme from a parallelism point of view: the
 destroyer is holding RelCacheInitLock a bit longer than before, but that
 should not be a performance critical situation.

 Anybody see a hole in that?

I don't.  Seems more robust than the old way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] error: could not find pg_class tuple for index 2662

2011-08-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Aug 11, 2011 at 5:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 What's bothering me at the moment is that the CLOBBER_CACHE_ALWAYS hack,
 which was meant to expose exactly this sort of problem, failed to do so
 --- buildfarm member jaguar has been running with that flag for ages and
 never showed this problem.  I'm thinking that we should take out the
 hack in AcceptInvalidationMessages and instead put in #ifdeffed code
 that causes ReceiveSharedInvalidMessages to forcibly always call the
 reset function.  Any thoughts about that?

 I'm OK with that, but while we're whacking this around, is there any
 chance that we could reduce the number of layers of function calls
 that are happening in here?

I'm not excited about that.  We have got three layers that each have
significant complexities of their own to deal with.  The only way to
reduce the layer count is to make some piece of code deal with more of
those complexities at once.  I don't want to go there, especially not
in the name of very marginal cycle-shaving.

 It would be nice to move the short-circuit test I recently inserted at
 the top of SIGetDataEntries() somewhere higher up in the call stack,
 but right now the layers of abstraction are so thick that it's not
 exactly clear how to do that.

Glad you didn't try, because it would be wrong.  The fact that there
are no outstanding messages in sinvaladt.c doesn't prove there are none
yet unprocessed inside ReceiveSharedInvalidMessages.


Anyway, to get back to the original point: I'm getting less excited
about redoing the CLOBBER_CACHE_ALWAYS code at a different level.
The only thing that can really promise is that we find places outside
the cache code that are mistakenly holding onto cache entry pointers.
It can't do very much for the sort of problems I've been finding over
the past week, first because you need some other process actively
changing the underlying information (else the use of a stale cache entry
isn't a problem), and second because when you're looking for bugs
involving use of stale cache entries, over-enthusiastic flushing of the
cache can actually mask the bugs.

I'd still like to think of a better test methodology, but I don't think
force clobbers inside ReceiveSharedInvalidMessages is it.

regards, tom lane



Re: [HACKERS] synchronized snapshots

2011-08-16 Thread Jim Nasby
On Aug 15, 2011, at 5:46 PM, Joachim Wieland wrote:
 On Mon, Aug 15, 2011 at 6:09 PM, Jim Nasby j...@nasby.net wrote:
 I suspect that all the other cases of BEGIN failing would be syntax errors,
 so you would immediately know in testing that something was wrong. A missing
 file is definitely not a syntax error, so we can't really depend on user
 testing to ensure this is handled correctly. IMO, that makes it critical
 that that error puts us in an aborted transaction.
 
 Why can we not just require the user to verify if his BEGIN query
 failed or succeeded?
 Is that really too much to ask for?

It's something else that you have to remember to get right. psql, for example, 
will blindly continue on unless you remembered to tell it to exit on an error.

Also, an invalid transaction seems to be the result of least surprise... if you 
cared enough to begin a transaction, you're going to expect that either 
everything between that and the COMMIT succeeds or fails, not something 
in-between.

 Also see what Robert wrote about proxies in between that keep track of
 the transaction state. Consider they see a BEGIN query that fails. How
 would they know if the session is now in an aborted transaction or not
 in a transaction at all?

AFAIK a proxy can tell if a transaction is in progress or not via libpq. 
Worst-case, it just needs to send an extra ROLLBACK.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] error: could not find pg_class tuple for index 2662

2011-08-16 Thread Robert Haas
On Tue, Aug 16, 2011 at 3:45 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 It would be nice to move the short-circuit test I recently inserted at
 the top of SIGetDataEntries() somewhere higher up in the call stack,
 but right now the layers of abstraction are so thick that it's not
 exactly clear how to do that.

 Glad you didn't try, because it would be wrong.  The fact that there
 are no outstanding messages in sinvaladt.c doesn't prove there are none
 yet unprocessed inside ReceiveSharedInvalidMessages.

Oh.  Good point.

 Anyway, to get back to the original point: I'm getting less excited
 about redoing the CLOBBER_CACHE_ALWAYS code at a different level.
 The only thing that can really promise is that we find places outside
 the cache code that are mistakenly holding onto cache entry pointers.
 It can't do very much for the sort of problems I've been finding over
 the past week, first because you need some other process actively
 changing the underlying information (else the use of a stale cache entry
 isn't a problem), and second because when you're looking for bugs
 involving use of stale cache entries, over-enthusiastic flushing of the
 cache can actually mask the bugs.

 I'd still like to think of a better test methodology, but I don't think
 force clobbers inside ReceiveSharedInvalidMessages is it.

Makes sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] A note about hash-based catcache invalidations

2011-08-16 Thread Tom Lane
I'm looking into the idea I mentioned earlier:

 All is not entirely lost, however: there's still some possible
 performance benefit to be gained here, if we go to the scheme of
 identifying victim catcache entries by hashvalue only.  Currently,
 each heap_update in a cached catalog has to issue two sinval messages
 (per cache!): one against the old TID and one against the new TID.
 We'd be able to reduce that to one message in the common case where the
 hashvalue remains the same because the cache key columns didn't change.

Removing the tuple ID from sinval messages turns out to have slightly
wider impact than I thought at first, because the current coding passes
those to callback functions registered with
CacheRegisterSyscacheCallback, and there are a lot of 'em.  However,
only one of them seems to be doing anything with the tuplePtr argument,
namely PlanCacheFuncCallback.  We can make it work with the hash value
instead, which will be about as good as what we're doing now.

Any objections to that plan?

regards, tom lane



Re: [HACKERS] synchronized snapshots

2011-08-16 Thread Jeff Davis
On Tue, 2011-08-16 at 11:01 -0500, Jim Nasby wrote:
 Also, an invalid transaction seems to be the result of least
 surprise... if you cared enough to begin a transaction, you're going
 to expect that either everything between that and the COMMIT succeeds
 or fails, not something in-between.

Agreed.

Perhaps we need a new utility command to set the snapshot to make the
error handling a little more obvious?

Regards,
Jeff Davis




Re: [HACKERS] synchronized snapshots

2011-08-16 Thread Jim Nasby
On Aug 16, 2011, at 5:40 PM, Jeff Davis wrote:
 On Tue, 2011-08-16 at 11:01 -0500, Jim Nasby wrote:
 Also, an invalid transaction seems to be the result of least
 surprise... if you cared enough to begin a transaction, you're going
 to expect that either everything between that and the COMMIT succeeds
 or fails, not something in-between.
 
 Agreed.
 
 Perhaps we need a new utility command to set the snapshot to make the
 error handling a little more obvious?

Well, it appears we have a larger problem, as Robert pointed out that trying to 
start a writable transaction on a hot standby leaves you not in a transaction 
(which I feel is a problem).

So IMHO the right thing to do here is make it so that runtime errors in BEGIN 
leave you in an invalid transaction. Then we can decide on the API for 
synchronized snapshots that makes sense instead of working around the behavior 
of BEGIN.

I guess the big question to answer now is: what's the backwards compatibility 
impact of changing how BEGIN deals with runtime errors?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





[HACKERS] Finding tables dropped by DROP TABLE CASCADE

2011-08-16 Thread Tatsuo Ishii
Hi,

I'm working on implementing a query cache for pgpool-II. The query cache
must be deleted if related tables are dropped. Finding table oids
from DROP TABLE t1, t2, t3... is easy. The problem is DROP TABLE
CASCADE. It seems there's no easy way to find the table oids which will
be deleted by DROP TABLE CASCADE. Any ideas?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: [HACKERS] Finding tables dropped by DROP TABLE CASCADE

2011-08-16 Thread Michael Paquier
Just a suggestion, but...
Why not use an external wrapper function around reportDependentObjects in
dependency.c to find the list of OIDs for a cascade deletion based on a list
of objects?
Wouldn't that be possible?
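
Short of a server-side function, a rough approximation is also possible
from plain SQL by walking pg_depend. A sketch, for a hypothetical table
t1 (it ignores classid matching and dependency types, which the real
traversal in dependency.c does honor):

WITH RECURSIVE deps AS (
    SELECT objid FROM pg_depend
     WHERE refclassid = 'pg_class'::regclass
       AND refobjid = 't1'::regclass
    UNION
    SELECT d.objid
      FROM pg_depend d JOIN deps ON (d.refobjid = deps.objid)
)
SELECT objid FROM deps;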

Regards,
-- 
Michael Paquier
http://michael.otacoo.com


Re: [HACKERS] synchronized snapshots

2011-08-16 Thread Tom Lane
Jim Nasby j...@nasby.net writes:
 Well, it appears we have a larger problem, as Robert pointed out that trying 
 to start a writable transaction on a hot standby leaves you not in a 
 transaction (which I feel is a problem).

 So IMHO the right thing to do here is make it so that runtime errors in BEGIN 
 leave you in an invalid transaction. Then we can decide on the API for 
 synchronized snapshots that makes sense instead of working around the 
 behavior of BEGIN.

I'm not convinced by the above argument, because it requires that
you pretend there's a significant difference between syntax errors and
run time errors (whatever those are).  Syntax errors in a BEGIN
command are not going to leave you in an aborted transaction, because
the backend is not going to recognize the command as a BEGIN at all.
This means that frontends *must* be capable of dealing with the case
that a failed BEGIN didn't start a transaction.  (Either that, or
they just assume their commands are always syntactically perfect,
which seems like pretty fragile programming to me; and the more weird
nonstandard options we load onto BEGIN, the less tenable the position
becomes.  For example, if you feed BEGIN option-foo to a server that's
a bit older than you thought it was, you will get a syntax error.)
If we have some failure cases that start a transaction and some that do
not, we just have a mess, IMO.

I think we'd be far better off to maintain the position that a failed
BEGIN does not start a transaction, under any circumstances.  To do
that, we cannot have this new option attached to the BEGIN, which is a
good thing anyway IMO from a standards compatibility point of view.
It'd be better to make it a separate utility statement.  There is no
logical problem in doing that, and we already have a precedent for
utility statements that do something special before the transaction
snapshot is taken: see LOCK.

In fact, now that I think about it, setting the transaction snapshot
from a utility statement would be functionally useful because then you
could take locks beforehand.
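
(To spell out the LOCK precedent: in a REPEATABLE READ transaction the
snapshot is only taken at the first regular query, so, assuming some
table foo, one can already do

BEGIN ISOLATION LEVEL REPEATABLE READ;
LOCK TABLE foo IN SHARE MODE;  -- utility command, runs before any snapshot is set
SELECT count(*) FROM foo;      -- first regular query: the snapshot is taken here
COMMIT;

and a snapshot-setting utility statement could slot in the same way.)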

And as a bonus, we don't have a backwards compatibility problem to solve.

regards, tom lane



Re: [HACKERS] synchronized snapshots

2011-08-16 Thread Robert Haas
On Tue, Aug 16, 2011 at 8:35 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jim Nasby j...@nasby.net writes:
 Well, it appears we have a larger problem, as Robert pointed out that trying 
 to start a writable transaction on a hot standby leaves you not in a 
 transaction (which I feel is a problem).

 So IMHO the right thing to do here is make it so that runtime errors in 
 BEGIN leave you in an invalid transaction. Then we can decide on the API for 
 synchronized snapshots that makes sense instead of working around the 
 behavior of BEGIN.

 I'm not convinced by the above argument, because it requires that
 you pretend there's a significant difference between syntax errors and
 run time errors (whatever those are).  Syntax errors in a BEGIN
 command are not going to leave you in an aborted transaction, because
 the backend is not going to recognize the command as a BEGIN at all.
 This means that frontends *must* be capable of dealing with the case
 that a failed BEGIN didn't start a transaction.  (Either that, or
 they just assume their commands are always syntactically perfect,
 which seems like pretty fragile programming to me; and the more weird
 nonstandard options we load onto BEGIN, the less tenable the position
 becomes.  For example, if you feed BEGIN option-foo to a server that's
 a bit older than you thought it was, you will get a syntax error.)
 If we have some failure cases that start a transaction and some that do
 not, we just have a mess, IMO.

More or less agreed.

 I think we'd be far better off to maintain the position that a failed
 BEGIN does not start a transaction, under any circumstances.

Also agreed.

 To do
 that, we cannot have this new option attached to the BEGIN, ...

Eh, why not?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Finding tables dropped by DROP TABLE CASCADE

2011-08-16 Thread Robert Haas
On Tue, Aug 16, 2011 at 8:10 PM, Tatsuo Ishii is...@postgresql.org wrote:
 I'm working on implemeting query cache for pgpool-II. The query cache
 must be deleted if related tables are dropped. Finding tables oids
 from DROP TABLE t1, t2, t3... is easy. Problem is DROP TABLE
 CASCADE. It seems there's no easy way to find table oids which will be
 deleted by DROP TABLE CASCADE. Any idea?

Presumably it would also need to be invalidated if someone did ALTER
TABLE (which might recurse into unspecified children).

It sort of seems like what you want to do is snoop the sinval traffic...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Finding tables dropped by DROP TABLE CASCADE

2011-08-16 Thread Tatsuo Ishii
 Presumably it would also need to be invalidated if someone did ALTER
 TABLE (which might recurse into unspecified children).

Good point. For DROP TABLE/ALTER TABLE, I need to take care of their children.

 It sort of seems like what you want to do is snoop the sinval traffic...

It's hard for pgpool-II since there's no API in PostgreSQL for
that. Maybe I will look into the system catalog to find the
children. I'm not sure if I can deal with CASCADE by the same method
though.
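
For the inheritance part, a recursive query over pg_inherits should find
all the children. A sketch, assuming a parent table t1 (this covers
inheritance only, not CASCADE dependencies):

WITH RECURSIVE children AS (
    SELECT inhrelid FROM pg_inherits
     WHERE inhparent = 't1'::regclass
    UNION
    SELECT i.inhrelid
      FROM pg_inherits i JOIN children c ON (i.inhparent = c.inhrelid)
)
SELECT inhrelid::regclass FROM children;
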
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: [HACKERS] synchronized snapshots

2011-08-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Aug 16, 2011 at 8:35 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I think we'd be far better off to maintain the position that a failed
 BEGIN does not start a transaction, under any circumstances.

 Also agreed.

 To do
 that, we cannot have this new option attached to the BEGIN, ...

 Eh, why not?

Maybe I wasn't paying close enough attention to the thread, but I had
the idea that there was some implementation reason why not.  If not,
we could still load the option onto BEGIN ... but I still find myself
liking the idea of a separate command better, because of the locking
issue.

regards, tom lane



Re: [HACKERS] synchronized snapshots

2011-08-16 Thread Robert Haas
On Tue, Aug 16, 2011 at 8:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Aug 16, 2011 at 8:35 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I think we'd be far better off to maintain the position that a failed
 BEGIN does not start a transaction, under any circumstances.

 Also agreed.

 To do
 that, we cannot have this new option attached to the BEGIN, ...

 Eh, why not?

 Maybe I wasn't paying close enough attention to the thread, but I had
 the idea that there was some implementation reason why not.  If not,
 we could still load the option onto BEGIN ... but I still find myself
 liking the idea of a separate command better, because of the locking
 issue.

Why does it matter whether you take the locks before or after the snapshot?

If you're concerned with minimizing the race, what you should do is
take all relevant locks in the parent before exporting the snapshot.

I am not wild about adding another toplevel command for this.  It
seems a rather narrow use case, and attaching it to BEGIN feels
natural to me.  There may be some small benefit also in terms of
minimizing the amount of sanity checking that must be done - for
example, at BEGIN time, you don't have to check for the case where a
snapshot has already been set.

If we did add another toplevel command, what would we call it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Finding tables dropped by DROP TABLE CASCADE

2011-08-16 Thread Robert Haas
On Tue, Aug 16, 2011 at 8:52 PM, Tatsuo Ishii is...@postgresql.org wrote:
 Presumably it would also need to be invalidated if someone did ALTER
 TABLE (which might recurse into unspecified children).

 Good point. For DROP TABLE/ALTER TABLE, I need to take care of their children.

 It sort of seems like what you want to do is snoop the sinval traffic...

 It's hard for pgpool-II since there's no API in PostgreSQL for
 that. Maybe I will look into the system catalog to find the
 children. I'm not sure if I can deal with CASCADE by the same method
 though.

It's possible, but not too easy.

Maybe we should have a special LISTEN channel that plays back (some
subset of? some decoded version of?) the sinval messaging.  I bet the
pgAdmin guys would like an automated way of knowing when tables had
been created/dropped, too...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] synchronized snapshots

2011-08-16 Thread Jeff Davis
On Tue, 2011-08-16 at 20:35 -0400, Tom Lane wrote:
 I'm not convinced by the above argument, because it requires that
 you pretend there's a significant difference between syntax errors and
 run time errors (whatever those are).

After a syntax error like COMMMIT the session will remain inside
the failed transaction block, but an error during COMMIT (e.g. a deferred
constraint check failure) will exit the transaction block.
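
Concretely, sketching with a hypothetical table t:

CREATE TABLE t (i int UNIQUE DEFERRABLE INITIALLY DEFERRED);
INSERT INTO t VALUES (1);

BEGIN;
COMMMIT;                   -- syntax error: session stays in the failed block
SELECT 1;                  -- ERROR:  current transaction is aborted ...
ROLLBACK;

BEGIN;
INSERT INTO t VALUES (1);  -- duplicate, but the check is deferred
COMMIT;                    -- the error is raised here, yet the block is exited
SELECT 1;                  -- succeeds: no longer in a transaction block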

 I think we'd be far better off to maintain the position that a failed
 BEGIN does not start a transaction, under any circumstances.  To do
 that, we cannot have this new option attached to the BEGIN, which is a
 good thing anyway IMO from a standards compatibility point of view.
 It'd be better to make it a separate utility statement.

+1 for a utility statement. Much clearer from the user's standpoint what
kind of errors they might expect, and whether the session will remain in
a transaction block.

Regards,
Jeff Davis




Re: [HACKERS] Finding tables dropped by DROP TABLE CASCADE

2011-08-16 Thread Joe Abbate
On 08/16/2011 08:52 PM, Tatsuo Ishii wrote:
 Presumably it would also need to be invalidated if someone did ALTER
 TABLE (which might recurse into unspecified children).
 
 Good point. For DROP TABLE/ALTER TABLE, I need to take care of their children.
 
 It sort of seems like what you want to do is snoop the sinval traffic...
 
 It's hard for pgpool-II since there's no API in PostgreSQL for
 that. Maybe I will look into the system catalog to find the
 children. I'm not sure if I can deal with CASCADE by the same method
 though.

Not sure how much it will help, but I have implemented logic in Pyrseas
to drop dependent tables and other objects based on catalog info.  The
relevant code is in

https://github.com/jmafc/Pyrseas/blob/master/pyrseas/dbobject/table.py

In particular, _from_catalog() at line 375 fetches the information using
the query and inhquery SELECTs just above it.  Then in diff_map()
starting at line 641 it issues the SQL to drop the various dependent
objects and tables.

If it's mostly inherited tables that you're concerned about, the inhquery
and the code dealing with inhstack should be helpful.

Joe



Re: [HACKERS] synchronized snapshots

2011-08-16 Thread Jeff Davis
On Tue, 2011-08-16 at 21:08 -0400, Robert Haas wrote: 
 attaching it to BEGIN feels natural to me.

My only objection is that people have different expectations about
whether the session will remain in a transaction block when they
encounter an error. So, it's hard to make this work without surprising
about half the users.

And there are some fairly significant consequences to users who guess
that they will remain in a transaction block in case of an error; or who
are just careless and don't consider that an error may occur.

 If we did add another toplevel command, what would we call it?

SET TRANSACTION SNAPSHOT perhaps?
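
Usage could then look something like this (hypothetical syntax and
snapshot token, and assuming the patch's exporting function is called
pg_export_snapshot()):

-- exporting session:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- returns a token, say '000003A1-1'

-- importing session:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '000003A1-1';
-- both sessions now see the same data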

Regards,
Jeff Davis




Re: [HACKERS] synchronized snapshots

2011-08-16 Thread Robert Haas
On Tue, Aug 16, 2011 at 9:54 PM, Jeff Davis pg...@j-davis.com wrote:
 If we did add another toplevel command, what would we call it?

 SET TRANSACTION SNAPSHOT perhaps?

Hmm, that's not bad, but I think we'd have to partially reserve
TRANSACTION to make it work, which doesn't seem worth the pain it
would cause.

We could do something like TRANSACTION SNAPSHOT IS 'xyz', but that's a
bit awkward.

I still like BEGIN SNAPSHOT 'xyz' -- or even adding a generic options
list like we use for some other commands, i.e. BEGIN (snapshot 'xyz'),
which would leave the door open to arbitrary amounts of future
nonstandard fiddling without the need for any more keywords.  I
understand the point about the results of a BEGIN failure leaving you
outside a transaction, but that really only matters if you're doing
psql < dumb_script.sql, which is impractical for this feature
anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] Re: [COMMITTERS] pgsql: Adjust total size in pg_basebackup progress report when reality

2011-08-16 Thread Fujii Masao
On Wed, Aug 17, 2011 at 12:00 AM, Magnus Hagander mag...@hagander.net wrote:
 Adjust total size in pg_basebackup progress report when reality changes

 When streaming including WAL, the size estimate will always be incorrect,
 since we don't know how much WAL is included. To make sure the output doesn't
 look completely unreasonable, this patch increases the total size whenever we
 go past the estimate, to make sure we never go above 100%.

http://developer.postgresql.org/pgdocs/postgres/app-pgbasebackup.html

Enables progress reporting. Turning this on will deliver an approximate progress
report during the backup. Since the database may change during the backup,
this is only an approximation and may not end at exactly 100%. In particular,
when WAL log is included in the backup, the total amount of data cannot be
estimated in advance, and in this case the progress report will only count
towards the total amount of data without WAL.


ISTM that the last sentence needs to be changed slightly because this commit
changed the total amount so that it includes the amount of WAL.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Some problems about cascading replication

2011-08-16 Thread Fujii Masao
On Tue, Aug 16, 2011 at 11:56 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I think there's a race condition here. If a walsender is just starting up,
 it might not have registered itself as a walsender yet. That race was
 actually there even before this patch to suppress the log message.

Right. To address this problem, I changed the patch so that a dead-end
walsender (i.e., a cascading walsender running even though recovery is
not in progress) always emits the log message. This change would cause
duplicate log messages if a standby promotion is requested while
multiple walsenders, including a dead-end one, are running. But since
this is unlikely to happen, I don't think it's worth writing code to
suppress those duplicate log messages. Comments?

I attached the updated version of the patch.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index c0a32a3..90dad2c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2328,10 +2328,11 @@ reaper(SIGNAL_ARGS)
 			 * XXX should avoid the need for disconnection. When we do,
 			 * am_cascading_walsender should be replaced with RecoveryInProgress()
 			 */
-			if (max_wal_senders > 0)
+			if (max_wal_senders > 0 && CountChildren(BACKEND_TYPE_WALSND) > 0)
 			{
 				ereport(LOG,
-						(errmsg("terminating all walsender processes to force cascaded standby(s) to update timeline and reconnect")));
+						(errmsg("terminating all walsender processes to force cascaded "
+								"standby(s) to update timeline and reconnect")));
 				SignalSomeChildren(SIGUSR2, BACKEND_TYPE_WALSND);
 			}
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0eadf64..813c998 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -368,6 +368,35 @@ StartReplication(StartReplicationCmd *cmd)
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
 
 	/*
+	 * When promoting a cascading standby, postmaster sends SIGUSR2 to
+	 * any cascading walsenders to kill them. But there is a corner-case where
+	 * such a walsender fails to receive SIGUSR2 and survives a standby promotion
+	 * unexpectedly. This happens when postmaster sends SIGUSR2 before
+	 * the walsender marks itself as a WAL sender, because postmaster sends
+	 * SIGUSR2 to only the processes marked as a WAL sender.
+	 *
+	 * To avoid this corner-case, if recovery is NOT in progress even though
+	 * the walsender is a cascading one, we do the same thing as the SIGUSR2
+	 * signal handler does, i.e., set walsender_ready_to_stop to true, which
+	 * causes the walsender to exit later.
+	 *
+	 * When terminating cascading walsenders, usually postmaster writes
+	 * the log message announcing the terminations. But there is a race condition
+	 * here. If there is no walsender except this process before reaching here,
+	 * postmaster thinks that there is no walsender and suppresses that
+	 * log message. To handle this case, we always emit that log message here.
+	 * This might cause duplicate log messages, but that is unlikely to happen,
+	 * so it's not worth writing some code to suppress them.
+	 */
+	if (am_cascading_walsender && !RecoveryInProgress())
+	{
+		ereport(LOG,
+				(errmsg("terminating walsender process to force cascaded standby "
+						"to update timeline and reconnect")));
+		walsender_ready_to_stop = true;
+	}
+
+	/*
 	 * We assume here that we're logging enough information in the WAL for
 	 * log-shipping, since this is checked in PostmasterMain().
 	 *



[HACKERS] Re: [COMMITTERS] pgsql: Fix bogus comment that claimed that the new BACKUP METHOD line i

2011-08-16 Thread Fujii Masao
On Tue, Aug 16, 2011 at 6:24 PM, Heikki Linnakangas
heikki.linnakan...@iki.fi wrote:
 Fix bogus comment that claimed that the new BACKUP METHOD line in
 backup_label was new in 9.0. Spotted by Fujii Masao.

Thanks for fixing that.

What about the remainder of the patch I submitted? Unless I'm missing
something, the patch enables us to correctly handle the case where recovery
starts from a backup taken with pg_basebackup, without adding a new field
like backupEndRequired to pg_control. Since it doesn't change the contents
of pg_control, it's applicable to v9.1, I think.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
