[HACKERS] DROP FUNCTION of multiple functions

2016-10-31 Thread Peter Eisentraut
Here is a patch series that implements several changes in the internal
grammar and node representation of function signatures.  They are not
necessarily meant to be applied separately, but they explain the
progression of the changes nicely, so I left them like that for review.

The end goal is to make some user-visible changes in DROP FUNCTION and
possibly other commands that refer to functions.

With these patches, it is now possible to use DROP FUNCTION to drop
multiple functions at once: DROP FUNCTION func1(), func2(), func3().
Other DROP commands already supported that, but DROP FUNCTION didn't
because the internal representation was complicated and couldn't handle it.

The next step after this would be to allow referring to functions
without having to supply the arguments, if the name is unique.  This is
an SQL-standard feature and would be very useful for dealing with
"business logic" functions with 10+ arguments.  The details of that are
still to be worked out, but with the help of the present changes it would
be a fairly localized change, because the grammar representation is well
encapsulated.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 0a5d27cf9ab91bb03daa956d5167c5c993dfb857 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 15 Sep 2016 12:00:00 -0500
Subject: [PATCH 1/7] Use grammar symbol function_with_argtypes consistently

Instead of sometimes referring to a function signature like func_name
func_args, use the existing function_with_argtypes symbol, which
combines the two.
---
 src/backend/parser/gram.y | 44 ++--
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 5547fc8..755b387 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -5366,21 +5366,21 @@ opclass_item:
 	n->order_family = $5;
 	$$ = (Node *) n;
 }
-			| FUNCTION Iconst func_name func_args
+			| FUNCTION Iconst function_with_argtypes
 {
 	CreateOpClassItem *n = makeNode(CreateOpClassItem);
 	n->itemtype = OPCLASS_ITEM_FUNCTION;
-	n->name = $3;
-	n->args = extractArgTypes($4);
+	n->name = $3->funcname;
+	n->args = $3->funcargs;
 	n->number = $2;
 	$$ = (Node *) n;
 }
-			| FUNCTION Iconst '(' type_list ')' func_name func_args
+			| FUNCTION Iconst '(' type_list ')' function_with_argtypes
 {
 	CreateOpClassItem *n = makeNode(CreateOpClassItem);
 	n->itemtype = OPCLASS_ITEM_FUNCTION;
-	n->name = $6;
-	n->args = extractArgTypes($7);
+	n->name = $6->funcname;
+	n->args = $6->funcargs;
 	n->number = $2;
 	n->class_args = $4;
 	$$ = (Node *) n;
@@ -5780,13 +5780,13 @@ CommentStmt:
 	n->comment = $7;
 	$$ = (Node *) n;
 }
-			| COMMENT ON FUNCTION func_name func_args IS comment_text
+			| COMMENT ON FUNCTION function_with_argtypes IS comment_text
 {
 	CommentStmt *n = makeNode(CommentStmt);
 	n->objtype = OBJECT_FUNCTION;
-	n->objname = $4;
-	n->objargs = extractArgTypes($5);
-	n->comment = $7;
+	n->objname = $4->funcname;
+	n->objargs = $4->funcargs;
+	n->comment = $6;
 	$$ = (Node *) n;
 }
 			| COMMENT ON OPERATOR any_operator oper_argtypes IS comment_text
@@ -5998,15 +5998,15 @@ SecLabelStmt:
 	n->label = $9;
 	$$ = (Node *) n;
 }
-			| SECURITY LABEL opt_provider ON FUNCTION func_name func_args
+			| SECURITY LABEL opt_provider ON FUNCTION function_with_argtypes
 			  IS security_label
 {
 	SecLabelStmt *n = makeNode(SecLabelStmt);
 	n->provider = $3;
 	n->objtype = OBJECT_FUNCTION;
-	n->objname = $6;
-	n->objargs = extractArgTypes($7);
-	n->label = $9;
+	n->objname = $6->funcname;
+	n->objargs = $6->funcargs;
+	n->label = $8;
 	$$ = (Node *) n;
 }
 			| SECURITY LABEL opt_provider ON LARGE_P OBJECT_P NumericOnly
@@ -7236,24 +7236,24 @@ opt_restrict:
  */
 
 RemoveFuncStmt:
-			DROP FUNCTION func_name func_args opt_drop_behavior
+			DROP FUNCTION function_with_argtypes opt_drop_behavior
 {
 	DropStmt *n = makeNode(DropStmt);
 	n->removeType = OBJECT_FUNCTION;
-	n->objects = list_make1($3);
-	n->arguments = list_make1(extractArgTypes($4));
-	n->behavior = $5;
+	n->objects = list_make1($3->funcname);
+	n->arguments = list_make1($3->funcargs);
+	n->behavior = $4;
 	n->missing_ok = false;
 	n->concurrent = false;
 	$$ = (Node *)n;
 }
-			| DROP FUNCTION IF_P EXISTS func_name func_args opt_drop_behavior
+			| DROP FUNCTION IF_P EXISTS function_with_argtypes opt_drop_behavior
 {
 	DropStmt *n = makeNode(DropStmt);
 	n->removeType = OBJECT_FUNCTION;
-	n->objects = list_make1($5);
-	n->arguments = list_make1(extractArgTypes($6));
-	n->behavior 

Re: [HACKERS] Indirect indexes

2016-10-31 Thread Alvaro Herrera
Alvaro Herrera wrote:
> I propose we introduce the concept of "indirect indexes".

This is a WIP non-functional patch for indirect indexes.  I've been
distracted from working on it for some time already and will be off-line
for half this month yet, but since this was discussed and seems to be
considered a welcome idea, I am posting it for those who want to have a
look at what I'm doing.  This can write values to indirect indexes (only
btrees), but it cannot read values from them yet, so don't expect this
to work at all unless you are hoping to read index files using
pageinspect.  (If you do test this, be aware that "VACUUM FULL pg_class"
is a case that I know needs to be fixed.)

I think the most interesting change here is how
HeapSatisfiesHOTandKeyUpdate() has accommodated some additional code to
return a bitmapset of columns that are not modified by an update.

This implements a new command
  CREATE INDIRECT INDEX
which instructs the AM to create an index that stores primary key values
instead of CTID values.  I have not tried yet to remove the limitation
of only six bytes in the PK value.  The part of the patch I'm not
submitting just yet adds a flag to IndexInfo used by IndexPath, so that
when the index is selected by the planner, an IndirectIndexScan node is
created instead of a plain IndexScan.  This node knows how to invoke the
AM so that the PK values are extracted in a first step and the CTIDs are
extracted from the PK in a second step (IndirectIndexScan carries two
IndexScanDesc structs and two index RelationDescs, so it keeps both the
indirect index and the PK index open).

The part that generated the most discussion was vacuuming.  As I said
earlier, I think that instead of trying to shoehorn an index cleanup into
regular vacuum (and cause a terrible degradation of maintenance_work_mem
consumption, which so much work has gone into optimizing), these indexes
would instead rely on desultory cleanup through the "killtuple"
interface, which causes index tuples to be removed during scans.  Timely
cleanup is not as critical as it is with regular (direct) indexes, given
that CTIDs are not stored and thus tuple movement does not affect this
type of index.

This patch is considerably smaller than the toy patch I had, which
introduced a separate AM for "ibtree" -- that was duplicating a lot
of code and adding more code complexity, which becomes much simpler with
the approach in the current code; one thing I didn't like at all was the
fact that the ibtree routines were calling the "internal" btree
routines, which was not nice.  (Also, it would have meant having
duplicate operator family/class rows.)

I hope to be back home in time to participate in the commitfest on Nov
14th.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1b45a4c..9f899c7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -92,6 +92,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amstorage = true;
amroutine->amclusterable = false;
amroutine->ampredlocks = false;
+   amroutine->amcanindirect = false;
amroutine->amkeytype = InvalidOid;
 
amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index f07eedc..1bc91d2 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -49,6 +49,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->amstorage = true;
amroutine->amclusterable = false;
amroutine->ampredlocks = false;
+   amroutine->amcanindirect = false;
amroutine->amkeytype = InvalidOid;
 
amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b8aa9bc..4ec34d5 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -69,6 +69,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amstorage = true;
amroutine->amclusterable = true;
amroutine->ampredlocks = false;
+   amroutine->amcanindirect = false;
amroutine->amkeytype = InvalidOid;
 
amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..3fa3d71 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -66,6 +66,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amstorage = false;
amroutine->amclusterable = false;
amroutine->ampredlocks = false;
+   amroutine->amcanindirect = false;
amroutine->amkeytype = INT4OID;
 
amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..9e91b41 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -96,10 +96,10 @@ static XLogRecPtr log_heap_update(Relation 

Re: [HACKERS] ECPG BUlk insert support using arrays

2016-10-31 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Haribabu Kommi
> I didn't find any relevant thread about the discussion of Bulk insert support
> in ECPG using arrays. Oracle supports the same and details are available
> in [1].
> 
> 
> I see some performance benefits in supporting the same in ECPG also.
> Does any one worked on this area before? Are there any problems in preparing
> a patch to support the same?

Please see "batch/pipelining support for libpq" by Craig.  I said I'll use his 
API to implement the array insert for ECPG, but you can feel free to do it.

Regards
Takayuki Tsunakawa




Re: [HACKERS] WIP: Barriers

2016-10-31 Thread Thomas Munro
On Thu, Aug 18, 2016 at 1:55 PM, Thomas Munro
 wrote:
> On Tue, Aug 16, 2016 at 1:55 AM, Robert Haas  wrote:
>> If we're going to remove barrier.h, I think that should be a separate
>> commit from creating a new barrier.h.
>
> OK.  Here's a patch to remove the old header, and the v2 barrier patch
> which adds phases and attach/detach.  As before, it depends on
> condition-variable-v2.patch.

Here's a new version which is rebased and adds support for passing
wait_event through to pg_stat_activity.

-- 
Thomas Munro
http://www.enterprisedb.com


remove-useless-barrier-header-v2.patch
Description: Binary data


barrier-v3.patch
Description: Binary data



Re: [HACKERS] identity columns

2016-10-31 Thread Peter Eisentraut
New patch.

On 9/9/16 11:45 PM, Vitaly Burovoy wrote:
> 1. The standard requires "... ALTER COLUMN ... SET GENERATED { ALWAYS
> | BY DEFAULT }" (9075-2:2011 subcl 11.20), but the patch implements it
> as "... ALTER COLUMN ... ADD GENERATED { ALWAYS | BY DEFAULT } AS
> IDENTITY"

Has both now.  They do different things, as documented.

> 2. The standard requires not more than one identity column, the patch
> does not follow that requirement, but it does not mentioned in the
> doc.

fixed

> 3. Changes in the table "information_schema.columns" is not full.

fixed

> 4. "" is not fully implemented
> because "" is implemented
> whereas "" is not.

done

> 5. According to 9075-2:2011 subcl 14.11 Syntax Rule 11)c) for a column
> with an indication that values are generated by default the only
> possible "" is "OVERRIDING USER VALUE".
> Implementation allows to use "OVERRIDING SYSTEM VALUE" (as "do
> nothing"), but it should be mentioned in "Compatibility" part in the
> doc.

done (documented)

> 6. "CREATE TABLE ... (LIKE ... INCLUDING ALL)" fails Assertion  at
> src/backend/commands/tablecmds.c:631

fixed

> 7. Changing default is allowed but a column is still "identity":

fixed

> 8. Changing a column to be "identity" raises "duplicate key" exception:

fixed

> 9. Changing type of a column deletes linked sequence but leaves
> "default" and "identity" marks:

fixed

> 10. "identity" modifier is lost when the table inherits another one:

fixed, but I invite more testing of inheritance-related things

> 11. The documentation says "OVERRIDING ... VALUE" can be placed even
> before "DEFAULT VALUES", but it is against SQL spec and the
> implementation:

fixed

> 12. Dump/restore is broken for some cases:

fixed

> 13. doc/src/sgml/ref/create_table.sgml (5th chunk) has "TODO". Why?

fixed

> 14. It would be fine if psql has support of new clauses.

done

> 15. Initializing attidentity in most places is ' ' but makefuncs.c has
> "n->identity = 0;". Is it correct?

fixed

> 16. I think it is a good idea to not raise exceptions for "SET
> GENERATED/DROP IDENTITY" if a column has the same type of identity/not
> an identity. To be consistent with "SET/DROP NOT NULL".

The present behavior is per SQL standard.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 29738b0..027c73e 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1095,6 +1095,17 @@ pg_attribute Columns
  
 
  
+  attidentity
+  char
+  
+  
+   If a space character, then not an identity column.  Otherwise,
+   a = generated always, d =
+   generated by default.
+  
+ 
+
+ 
   attisdropped
   bool
   
diff --git a/doc/src/sgml/information_schema.sgml b/doc/src/sgml/information_schema.sgml
index c43e325..8ece439 100644
--- a/doc/src/sgml/information_schema.sgml
+++ b/doc/src/sgml/information_schema.sgml
@@ -1583,13 +1583,20 @@ columns Columns
  
   is_identity
   yes_or_no
-  Applies to a feature not available in PostgreSQL
+  
+   If the column is an identity column, then YES,
+   else NO.
+  
  
 
  
   identity_generation
   character_data
-  Applies to a feature not available in PostgreSQL
+  
+   If the column is an identity column, then ALWAYS
+   or BY DEFAULT, reflecting the definition of the
+   column.
+  
  
 
  
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index e48ccf2..b272633 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -42,6 +42,9 @@
 ALTER [ COLUMN ] column_name SET DEFAULT expression
 ALTER [ COLUMN ] column_name DROP DEFAULT
 ALTER [ COLUMN ] column_name { SET | DROP } NOT NULL
+ALTER [ COLUMN ] column_name ADD GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ]
+ALTER [ COLUMN ] column_name { SET GENERATED { ALWAYS | BY DEFAULT } | SET sequence_option | RESET } [...]
+ALTER [ COLUMN ] column_name DROP IDENTITY
 ALTER [ COLUMN ] column_name SET STATISTICS integer
 ALTER [ COLUMN ] column_name SET ( attribute_option = value [, ... ] )
 ALTER [ COLUMN ] column_name RESET ( attribute_option [, ... ] )
@@ -170,6 +173,32 @@ Description

 

+ADD GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY
+SET GENERATED { ALWAYS | BY DEFAULT }
+DROP IDENTITY
+
+ 
+  These forms change whether a column is an identity column or change the
+  generation attribute of an existing identity column.
+  See  for details.
+ 
+
+   
+
+   
+SET sequence_option
+RESET
+
+ 
+  These forms alter the sequence that underlies an existing identity
+  column.  sequence_option is an option
+  supported by  such
+  as INCREMENT BY.
+ 
+
+   
+

[HACKERS] ECPG BUlk insert support using arrays

2016-10-31 Thread Haribabu Kommi
Hi All,


I didn't find any relevant thread discussing bulk insert support in ECPG
using arrays. Oracle supports this, and details are available in [1].


I see some performance benefits in supporting the same in ECPG also.
Has anyone worked on this area before? Are there any problems with
preparing a patch to support it?


[1] - https://docs.oracle.com/cd/B28359_01/appdev.111/b28427/pc_08arr.htm
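
To illustrate, the kind of program I have in mind, borrowing the
host-array idiom from [1], would look roughly like the sketch below.  This
shows the proposed behaviour (one INSERT sending one row per array
element); ECPG does not do this today, and the table and variables are
made up for the example.

#include <stdio.h>

EXEC SQL INCLUDE sqlca;

int
main(void)
{
	EXEC SQL BEGIN DECLARE SECTION;
	int		emp_number[3] = {7369, 7499, 7521};
	char	emp_name[3][11] = {"SMITH", "ALLEN", "WARD"};
	EXEC SQL END DECLARE SECTION;

	EXEC SQL CONNECT TO testdb;

	/* One statement would insert one row per array element. */
	EXEC SQL INSERT INTO emp (empno, ename)
		VALUES (:emp_number, :emp_name);

	EXEC SQL COMMIT;
	EXEC SQL DISCONNECT ALL;
	return 0;
}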


Regards,
Hari Babu
Fujitsu Australia


[HACKERS] background sessions

2016-10-31 Thread Peter Eisentraut
Here is a patch for the background sessions C API and PL/Python support.
This was previously submitted as "autonomous transactions", which
proved controversial, and there have been several suggestions for a new
name.

I have renamed everything, removed all the incomplete PL/pgSQL stuff,
refined the PL/Python interfaces, and added resource owner management so
that you can preserve session handles across transactions.  That allows
pg_background-like behavior to be implemented in a PL function.  I have
also added documentation, so reviewers could start there.  It is probably
not all done yet, but I think it contains a lot of useful pieces that we
could make into something nice.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 661c7fe769982e3f0c71f4ad57a768d7eb55f6e2 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 31 Oct 2016 12:00:00 -0400
Subject: [PATCH] Add background sessions

This adds a C API to run SQL statements in a background worker,
communicating by FE/BE protocol over a DSM message queue.  This can be
used to execute statements and transactions separate from the main
foreground session.

Also included is a PL/Python interface to this functionality.
---
 doc/src/sgml/bgsession.sgml | 236 +++
 doc/src/sgml/filelist.sgml  |   1 +
 doc/src/sgml/plpython.sgml  | 102 +++
 doc/src/sgml/postgres.sgml  |   1 +
 src/backend/commands/variable.c |   5 +
 src/backend/libpq/pqmq.c|  23 +
 src/backend/tcop/Makefile   |   2 +-
 src/backend/tcop/bgsession.c| 890 
 src/backend/tcop/postgres.c |  24 +-
 src/include/commands/variable.h |   1 +
 src/include/libpq/pqmq.h|   1 +
 src/include/tcop/bgsession.h|  26 +
 src/include/tcop/tcopprot.h |   9 +
 src/pl/plpython/Makefile|   2 +
 src/pl/plpython/expected/plpython_bgsession.out | 188 +
 src/pl/plpython/expected/plpython_test.out  |   7 +-
 src/pl/plpython/plpy_bgsession.c| 454 
 src/pl/plpython/plpy_bgsession.h|  18 +
 src/pl/plpython/plpy_main.h |   3 +
 src/pl/plpython/plpy_planobject.c   |   1 +
 src/pl/plpython/plpy_planobject.h   |   2 +
 src/pl/plpython/plpy_plpymodule.c   |   5 +
 src/pl/plpython/plpy_spi.c  |   7 +-
 src/pl/plpython/plpy_spi.h  |   3 +
 src/pl/plpython/sql/plpython_bgsession.sql  | 148 
 25 files changed, 2139 insertions(+), 20 deletions(-)
 create mode 100644 doc/src/sgml/bgsession.sgml
 create mode 100644 src/backend/tcop/bgsession.c
 create mode 100644 src/include/tcop/bgsession.h
 create mode 100644 src/pl/plpython/expected/plpython_bgsession.out
 create mode 100644 src/pl/plpython/plpy_bgsession.c
 create mode 100644 src/pl/plpython/plpy_bgsession.h
 create mode 100644 src/pl/plpython/sql/plpython_bgsession.sql

diff --git a/doc/src/sgml/bgsession.sgml b/doc/src/sgml/bgsession.sgml
new file mode 100644
index 000..785ee66
--- /dev/null
+++ b/doc/src/sgml/bgsession.sgml
@@ -0,0 +1,236 @@
+
+
+
+ Background Session API
+
+ 
+  The background session API is a C API for creating additional database
+  sessions in the background and running SQL statements in them.  A background
+  session behaves like a normal (foreground) session in that it has session
+  state, transactions, can run SQL statements, and so on.  Unlike a foreground
+  session, it is not connected directly to a client.  Instead the foreground
+  session can use this API to execute SQL statements and retrieve their
+  results.  Higher-level integrations, such as in procedural languages, can
+  make this functionality available to clients.  Background sessions are
+  independent from their foreground sessions in their session and transaction
+  state.  So a background session cannot see uncommitted data in foreground
+  sessions or vice versa, and there is no preferential treatment about
+  locking.  Like all sessions, background sessions are separate processes.
+  Foreground and background sessions communicate over shared memory messages
+  queues instead of the sockets that a client/server connection uses.
+ 
+
+ 
+  Background sessions can be useful in a variety of scenarios when effects
+  that are independent of the foreground session are to be achieved, for
+  example:
+  
+   
+
+ Commit data independent of whether a foreground transaction commits, for
+ example for auditing.  A trigger in the foreground session could effect
+ the necessary writes via a background session.
+
+   
+   
+
+ Large changes can be split up into smaller transactions.  A foreground

Re: [HACKERS] commit fest manager for CF 2016-11?

2016-10-31 Thread Thomas Munro
On Tue, Nov 1, 2016 at 4:15 PM, Peter Eisentraut
 wrote:
> On 10/31/16 9:39 PM, Michael Paquier wrote:
>> We are on the 1st, at least in my timezone. So the CF should start
>> really soon, meaning that no new patches would get in. In a couple of
>> hours that would be fine? I recall that the last times we tried to
>> sync with the US East coast time.
>
> https://en.wikipedia.org/wiki/Anywhere_on_Earth

+1

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] commit fest manager for CF 2016-11?

2016-10-31 Thread Peter Eisentraut
On 10/31/16 9:39 PM, Michael Paquier wrote:
> We are on the 1st, at least in my timezone. So the CF should start
> really soon, meaning that no new patches would get in. In a couple of
> hours that would be fine? I recall that the last times we tried to
> sync with the US East coast time.

https://en.wikipedia.org/wiki/Anywhere_on_Earth

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WAL logging problem in 9.4.3?

2016-10-31 Thread Kyotaro HORIGUCHI
Hi,

At Sun, 2 Oct 2016 21:43:46 +0900, Michael Paquier  
wrote in 
> On Thu, Sep 29, 2016 at 10:02 PM, Kyotaro HORIGUCHI
>  wrote:
> > Hello,
> >
> > At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier 
> >  wrote in 
> > 
> >> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
> >>  wrote:
> >> > Hello, I return to this before my things:)
> >> >
> >> > Though I haven't played with the patch yet..
> >>
> >> Be sure to run the test cases in the patch or base your tests on them then!
> >
> > All items of 006_truncate_opt fail on ed0b228 and they are fixed
> > with the patch.
> >
> >> > Though I don't know how it actually impacts the perfomance, it
> >> > seems to me that we can live with truncated_to and sync_above in
> >> > RelationData and BufferNeedsWAL(rel, buf) instead of
> >> > HeapNeedsWAL(rel, buf).  Anyway up to one entry for one relation
> >> > seems to exist at once in the hash.
> >>
> >> TBH, I still think that the design of this patch as proposed is pretty
> >> cool and easy to follow.
> >
> > It is clean from certain viewpoint but additional hash,
> > especially hash-searching on every HeapNeedsWAL seems to me to be
> > unacceptable. Do you see it as acceptable?
> >
> >
> > The attached patch is quiiiccck-and-dirty-hack of Michael's patch
> > just as a PoC of my proposal quoted above. This also passes the
> > 006 test.  The major changes are the following.
> >
> > - Moved sync_above and truncated_to into RelationData.
> >
> > - Cleaning up is done in AtEOXact_cleanup instead of explicit
> >   calling to smgrDoPendingSyncs().
> >
> > * BufferNeedsWAL (replace of HeapNeedsWAL) no longer requires
> >   hash_search. It just refers to the additional members in the
> >   given Relation.
> >
> > X I feel that I have dropped one of the features of the original
> >   patch during the hack, but I don't recall it clearly now:(
> >
> > X I haven't considered relfilenode replacement, which didn't matter
> >   for the original patch (but there are few places to consider).
> >
> > What do you think about this?
> 
> I have moved this patch to next CF. (I still need to look at your patch.)

Thanks for considering that.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] commit fest manager for CF 2016-11?

2016-10-31 Thread Michael Paquier
On Tue, Nov 1, 2016 at 10:27 AM, Haribabu Kommi
 wrote:
> On Mon, Oct 31, 2016 at 4:31 PM, Michael Paquier 
> wrote:
>>
>> On Mon, Oct 31, 2016 at 1:49 PM, Peter Eisentraut
>>  wrote:
>> > Commit fest CF 2016-11 is supposed to start in about a day.  I don't
>> > think a commit fest manager was chosen yet.  Volunteers?
>>
>> I'd like to pass my turn on this one as CFM, but I can bring in some
>> time for the one of January.
>
> I am willing to act as CFM for this commit fest, unless someone else is
> interested.

Great. Thanks!

> I may need some help from previous commit fest managers as I am
> new to this process.

Sure, feel free to ask anything here if you want. I guess it would be
appropriate for you to have admin rights in the CF app so that you can
switch the status of the CF itself. Magnus, could you make that possible?

We are on the 1st, at least in my timezone. So the CF should start
really soon, meaning that no new patches would get in. In a couple of
hours that would be fine? I recall that the last times we tried to
sync with the US East coast time.
-- 
Michael




Re: [HACKERS] commit fest manager for CF 2016-11?

2016-10-31 Thread Haribabu Kommi
On Mon, Oct 31, 2016 at 4:31 PM, Michael Paquier 
wrote:

> On Mon, Oct 31, 2016 at 1:49 PM, Peter Eisentraut
>  wrote:
> > Commit fest CF 2016-11 is supposed to start in about a day.  I don't
> > think a commit fest manager was chosen yet.  Volunteers?
>
> I'd like to pass my turn on this one as CFM, but I can bring in some
> time for the one of January.


I am willing to act as CFM for this commit fest, unless someone else is
interested.

I may need some help from previous commit fest managers as I am
new to this process.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Logical decoding and walsender timeouts

2016-10-31 Thread Vladimir Gordiychuk
>>> When sending a big message, WalSndWriteData() notices that it's
>>> approaching timeout and tries to send a keepalive request, but the
>>> request just gets buffered behind the remaining output plugin data and
>>> isn't seen by the client until the client has received the rest of the
>>> pending data.
>>
>> Only for individual messages, not the entire transaction though.

>Right.  I initially thought it was the whole tx, but I was mistaken as
>I'd failed to notice that WalSndWriteData() queues a keepalive
>request.

This problem can be resolved by having the client send keepalives
periodically, at an interval shorter than the timeout configured on the
server. For example, if the server is configured with
wal_sender_timeout=60, the client should send a keepalive message roughly
every 60/3 seconds. In that case the server will not send a keepalive with
the reply-required flag set, and will also not disconnect the client,
because the walsender does check for incoming messages while decoding a
huge transaction. I faced a similar problem in pgjdbc and resolved it as
described above.
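
In libpq terms such a keepalive is just a standby status update message
('r') sent on the replication stream with the reply-requested byte set to
0. A rough C sketch of what pgjdbc does in Java (names and error handling
are illustrative):

#include <stdint.h>
#include <time.h>
#include <libpq-fe.h>

/* Microseconds since the PostgreSQL epoch (2000-01-01). */
static int64_t
feedback_now(void)
{
	return ((int64_t) time(NULL) - 946684800) * 1000000;
}

/* Append a big-endian int64 to the buffer. */
static void
put_int64(char *buf, int64_t v)
{
	int			i;

	for (i = 0; i < 8; i++)
		buf[i] = (char) (v >> (56 - i * 8));
}

/*
 * Send a standby status update reporting the written/flushed positions,
 * with the final byte 0 so the server is not asked to reply.  Calling this
 * every wal_sender_timeout/3 seconds keeps the walsender from timing out.
 */
static int
send_keepalive(PGconn *conn, uint64_t written_lsn, uint64_t flushed_lsn)
{
	char		msg[1 + 4 * 8 + 1];
	int			off = 0;

	msg[off] = 'r';				off += 1;
	put_int64(msg + off, (int64_t) written_lsn);	off += 8;
	put_int64(msg + off, (int64_t) flushed_lsn);	off += 8;
	put_int64(msg + off, 0);	off += 8;	/* apply position (unused here) */
	put_int64(msg + off, feedback_now());	off += 8;
	msg[off] = 0;				off += 1;	/* do not request a reply */

	if (PQputCopyData(conn, msg, off) <= 0 || PQflush(conn) != 0)
		return -1;

	return 0;
}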

2016-10-31 16:28 GMT+03:00 Craig Ringer :

> On 31 October 2016 at 16:52, Andres Freund  wrote:
> > Hi,
> >
> > On 2016-10-31 16:34:38 +0800, Craig Ringer wrote:
> >> TL;DR: Logical decoding clients need to generate their own keepalives
> >> and not rely on the server requesting them to prevent timeouts. Or
> >> admins should raise the wal_sender_timeout by a LOT when using logical
> >> decoding on DBs with any big rows.
> >
> > Unconvinced.
>
> Yeah. I've seen enough issues in the wild where we keep timing out and
> restarting over and over until we increase wal_sender_timeout to know
> there's _something_ going on. I am less sure I'm right about what is
> or how to solve it.
>
> >> When sending a big message, WalSndWriteData() notices that it's
> >> approaching timeout and tries to send a keepalive request, but the
> >> request just gets buffered behind the remaining output plugin data and
> >> isn't seen by the client until the client has received the rest of the
> >> pending data.
> >
> > Only for individual messages, not the entire transaction though.
>
> Right.  I initially thought it was the whole tx, but I was mistaken as
> I'd failed to notice that WalSndWriteData() queues a keepalive
> request.
>
> > Are
> > you sure the problem at hand is that we're sending a keepalive, but it's
> > too late?
>
> No, I'm not sure. I'm trying to identify the cause of an issue I've
> seen in the wild, but never under conditions where it's been possible
> to sit around and debug in a leisurely manner.
>
> I'm trying to set up a TAP test to demonstrate that this happens, but
> I don't think it's going to work without some kind of network
> bandwidth limitation simulation or simulated latency. A local unix
> socket is just too fast for Pg's row size limits.
>
> > It might very well be that the actual issue is that we're
> > never sending keepalives, because the network is fast enough / the tcp
> > window is large enough.  IIRC we only send a keepalive if we're blocked
> > on network IO?
>
> Mm, that's a good point. That might better explain the issues I've
> seen in the wild, since I never found strong evidence that individual
> big rows were involved, but hadn't been able to come up with anything
> else yet.
>
> >> So: We could ask output plugins to deal with this for us, by chunking
> >> up their data in small pieces and calling OutputPluginPrepareWrite()
> >> and OutputPluginWrite() more than once per output plugin callback if
> >> they expect to send a big message. But this pushes the complexity of
> >> splitting up and handling big rows, and big Datums, onto each plugin.
> >> It's awkward to do well and hard to avoid splitting things up
> >> unnecessarily.
> >
> > There's decent reason for doing that independently though, namely that
> > it's a lot more efficient from a memory management POV.
>
> Definitely. Though you're always going to be tossing around ridiculous
> chunks of memory when dealing with big external compressed toasted
> data, unless there are ways to access that progressively that I'm
> unaware of. Hopefully there are.
>
> I'd quite like to extend the bdr/pglogical/logicalrep protocol so that
> in-core logical rep, in some later version, can write a field as 'to
> follow', like we currently mark unchanged toasted datums separately.
> Then send it chunked, after the main row, in follow-up messages. That
> way we keep processing keepalives, we don't allocate preposterous
> amounts of memory, etc.
>
> > I don't think the "unrequested keepalive" approach really solves the
> > problem on a fundamental enough level.
>
> Fair. It feels a bit like flailing in the dark, too.
>
> >> (A separate issue is that we can also time out when doing logical
> >> _replication_ if the downstream side blocks on a lock, since it's not
> >> safe to send on a socket from a signal handler ... )
> >
> > That's strictly speaking not true. write() / sendmsg() are 

Re: [HACKERS] Unsafe use of relation->rd_options without checking its type

2016-10-31 Thread neha khatri
>
>
>  ^
>
> The reason for the error is that transformOnConflictArbiter applies
> RelationIsUsedAsCatalogTable() to something that it doesn't know to
> be a plain relation --- it's a view in this case.  And that macro
> blindly assumes that relation->rd_options is a StdRdOptions struct,
> when in this case it's actually a ViewOptions struct.
>
> Now that I've seen this I wonder which other uses of rd_options are
> potentially broken.  RelationIsUsedAsCatalogTable() is hardly the
> only macro that is assuming with little justification that it's
> applied to the right kind of reloptions.
>

Right, there are many macros which blindly assume rd_options to be of a
specific type. Here is the list of such macros:

 RelationGetFillFactor
 RelationIsUsedAsCatalogTable
 RelationGetParallelWorkers
 RelationIsSecurityView
 RelationHasCheckOption
 RelationHasLocalCheckOption
 RelationHasCascadedCheckOption
 BrinGetPagesPerRange
 GinGetUseFastUpdate
 GinGetPendingListCleanupSize

For the macros associated with a specific index type, it might be all right
to assume the type of rd_options, because those have very limited usage.
However, for the rest of the macros the code does not limit their usage,
which can lead to problems as you described above.



> We could band-aid this by having the RelationIsUsedAsCatalogTable()
> macro check relation->relkind, but I'm inclined to think that the
> right fix going forward is to insist that StdRdOptions, ViewOptions,
> and the other in-memory representations of reloptions ought to be
> self-identifying somehow.  We could add a field to them similar to
> nodeTag, but one of the things that was envisioned to begin with
> is relation types that have StdRdOptions as the first part of a
> larger struct.  I'm not sure how to make that work with a tag.
>
> Thoughts?
>

Yes, it seems appropriate to tag all types of rd_options with an
identification and maybe check for that identification within each macro.
Currently, there are the following types of rd_options as far as I could find:

 GiSTOptions
 BrinOptions
 GinOptions
 StdRdOptions
 ViewOptions
 BloomOptions (from contrib)

I am not clear on why an identification field in the above structures would
be a problem for the relation types that have StdRdOptions as the first part
of a larger structure.
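
That said, the relkind band-aid you mention would be a small change by
itself. A rough sketch against the existing definition in rel.h (not a
finished fix, just to illustrate the idea):

/*
 * Sketch: trust rd_options to be StdRdOptions only for relation kinds
 * that actually use StdRdOptions (plain tables and matviews here).
 */
#define RelationIsUsedAsCatalogTable(relation)	\
	((relation)->rd_options &&					\
	 ((relation)->rd_rel->relkind == RELKIND_RELATION ||	\
	  (relation)->rd_rel->relkind == RELKIND_MATVIEW) ?		\
	 ((StdRdOptions *) (relation)->rd_options)->user_catalog_table : false)

The equivalent check would have to be repeated for each of the macros
listed above, which is one more argument for a common, self-identifying
representation.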

Regards,
Neha


[HACKERS] WIP: [[Parallel] Shared] Hash

2016-10-31 Thread Thomas Munro
Hi hackers,

In PostgreSQL 9.6, hash joins can be parallelised under certain
conditions, but a copy of the hash table is built in every
participating backend.  That means that memory and CPU time are
wasted.  In many cases, that's OK: if the hash table contents are
small and cheap to compute, then we don't really care, we're just
happy that the probing can be done in parallel.  But in cases where
the hash table is large and/or expensive to build, we could do much
better.  I am working on that problem.

To recap the situation in 9.6, a hash join can appear below a Gather
node and it looks much the same as a non-parallel hash join except
that it has a partial outer plan:

  ->  Hash Join
        ->  <partial outer plan>
        ->  Hash
              ->  <parallel-safe inner plan>

A partial plan is one that has some kind of 'scatter' operation as its
ultimate source of tuples.  Currently the only kind of scatter
operation is a Parallel Seq Scan (but see also the Parallel Index Scan
and Parallel Bitmap Scan proposals).  The scatter operation enables
parallelism in all the executor nodes above it, as far as the
enclosing 'gather' operation which must appear somewhere above it.
Currently the only kind of gather operation is a Gather node (but see
also the Gather Merge proposal which adds a new one).

The inner plan is built from a non-partial parallel-safe path and will
be run in every worker.

Note that a Hash Join node in 9.6 isn't parallel-aware itself: it's
not doing anything special at execution time to support parallelism.
The planner has determined that correct partial results will be
produced by this plan, but the executor nodes are blissfully unaware
of parallelism.

PROPOSED NEW PLAN VARIANTS

Shortly I will post a patch which introduces two new hash join plan
variants that are parallel-aware:

1.  Parallel Hash Join with Shared Hash

  ->  Parallel Hash Join
        ->  <partial outer plan>
        ->  Shared Hash
              ->  <parallel-safe inner plan>

In this case, there is only one copy of the hash table and only one
participant loads it.  The other participants wait patiently for one
chosen backend to finish building the hash table, and then they all
wake up and probe.

Call the number of participants P, being the number of workers + 1
(for the leader).  Compared to a non-shared hash plan, we avoid
wasting CPU and IO resources running P copies of the inner plan in
parallel (something that is not well captured in our costing model for
parallel query today), and we can allow ourselves to use a hash table
P times larger while sticking to the same overall space target of
work_mem * P.

2.  Parallel Hash Join with Parallel Shared Hash

  ->  Parallel Hash Join
        ->  <partial outer plan>
        ->  Parallel Shared Hash
              ->  <partial inner plan>

In this case, the inner plan is run in parallel by all participants.
We have the advantages of a shared hash table as described above, and
now we can also divide the work of running the inner plan and hashing
the resulting tuples by P participants.  Note that Parallel Shared
Hash is acting as a special kind of gather operation that is the
counterpart to the scatter operation contained in the inner plan.

PERFORMANCE

So far I have been unable to measure any performance degradation
compared with unpatched master for hash joins with non-shared hash.
That's good because it means that I didn't slow existing plans down
when I introduced a bunch of conditional branches to existing hash
join code.

Laptop testing shows greater than 2x speedups on several of the TPC-H
queries with single batches, and no slowdowns.  I will post test
numbers on big rig hardware in the coming weeks when I have the
batching code in more complete and stable shape.

IMPLEMENTATION

I have taken the approach of extending the existing hash join
algorithm, rather than introducing separate hash join executor nodes
or a fundamentally different algorithm.  Here's a short description of
what the patch does:

1.  SHARED HASH TABLE

To share data between participants, the patch uses two other patches I
have proposed:  DSA areas[1], which provide a higher level interface
to DSM segments to make programming with processes a little more like
programming with threads, and in particular a per-parallel-query DSA
area[2] that is made available for any executor node that needs some
shared work space.

The patch uses atomic operations to push tuples into the hash table
buckets while building, rehashing and loading, and then the hash table
is immutable during probing (except for match flags used to implement
outer joins).  The existing memory chunk design is retained for dense
allocation of tuples, which provides a convenient way to rehash the
table when its size changes.

2.  WORK COORDINATION

To coordinate parallel work, this patch uses two other patches:
barriers[3], to implement a 'barrier' or 'phaser' synchronisation
primitive, and those in turn use the condition variables proposed by
Robert Haas.

Barriers provide a way for participants to break work up into 

Re: [HACKERS] WAL consistency check facility

2016-10-31 Thread Michael Paquier
On Mon, Oct 31, 2016 at 9:31 PM, Robert Haas  wrote:
> On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier
>> - Instead of palloc'ing the old and new pages to compare, it would be
>> more performant to keep around two static buffers worth of BLCKSZ and
>> just use that. This way there is no need as well to perform any palloc
>> calls in the masking functions, limiting the risk of errors (those
>> code paths had better avoid errors IMO). It would be also less costly
>> to just pass to the masking function a pointer to a buffer of size
>> BLCKSZ and just do the masking on it.
>
> We always palloc buffers like this so that they will be aligned.  But
> we could arrange not to repeat the palloc every time (see, e.g.,
> BootstrapXLOG()).

Yeah, we could go with that and there is clearly no reason to not do so.
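
Something like this sketch then (names illustrative): allocate the two
BLCKSZ work buffers once in TopMemoryContext and reuse them for every
record, which keeps palloc's alignment guarantees without repeating the
allocation:

#include "utils/memutils.h"

/* Masking work buffers, allocated once and reused for every record. */
static char *old_page_masked = NULL;
static char *new_page_masked = NULL;

static void
ensure_masking_buffers(void)
{
	if (old_page_masked == NULL)
	{
		old_page_masked = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
		new_page_masked = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
	}
}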
-- 
Michael




Re: [HACKERS] Proposal for changes to recovery.conf API

2016-10-31 Thread Abhijit Menon-Sen
At 2016-09-28 13:13:56 -0400, robertmh...@gmail.com wrote:
>
> I hope that the fact that there's been no discussion for the last
> three weeks doesn't mean this effort is dead; I would like very
> much to see it move forward.

Here's an updated patch. Sorry, I got busy elswhere.

I struggled with the handling of recovery_target a little. For example,
one suggested alternative was:

recovery_target_type = xid
recovery_target_value = …

The problem with implementing it this way is that the _value setting
cannot be parsed without already having parsed the _type, and I didn't
want to force that sort of dependency.

What I've done instead is to make this work:

recovery_target = xid|time|name|lsn|immediate
recovery_target_xid = …
recovery_target_time = …
recovery_target_name = …
recovery_target_lsn = …

The recovery_target_xxx values are parsed as they used to be, but the
one that's used is the one that's set in recovery_target. That's easy to
explain, and the patch is much less intrusive, but I'm certainly open to
suggestions to improve this, and I have the time to work on this patch
with a view towards getting it committed in this cycle.

-- Abhijit

P.S. Sorry, I haven't been able to resolve the conflicts between Simon's
earlier recovery_startup_r10_api.v1b patch and the "pg_ctl promote -w"
changes in master. I was distracted by some illness in the family, but
I will post another update very soon.
>From 231ccefbc99c07f00560f4e88a961db186db704e Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen 
Date: Mon, 31 Oct 2016 11:32:51 +0530
Subject: Convert recovery.conf settings to GUCs

Based on unite_recoveryconf_postgresqlconf_v3.patch by Fujii Masao.
---
 contrib/pg_standby/pg_standby.c |   2 +-
 doc/src/sgml/backup.sgml|  31 +-
 doc/src/sgml/config.sgml| 480 ++
 doc/src/sgml/filelist.sgml  |   1 -
 doc/src/sgml/func.sgml  |   2 +-
 doc/src/sgml/high-availability.sgml |  42 +-
 doc/src/sgml/pgstandby.sgml |   4 +-
 doc/src/sgml/postgres.sgml  |   1 -
 doc/src/sgml/recovery-config.sgml   | 508 ---
 doc/src/sgml/ref/pgarchivecleanup.sgml  |   6 +-
 doc/src/sgml/release-9.1.sgml   |   5 +-
 doc/src/sgml/release.sgml   |   2 +-
 src/backend/access/transam/recovery.conf.sample |  29 +-
 src/backend/access/transam/xlog.c   | 520 +++-
 src/backend/access/transam/xlogarchive.c|   8 +-
 src/backend/postmaster/startup.c|   2 +
 src/backend/replication/walreceiver.c   |  24 ++
 src/backend/utils/misc/guc-file.l   |  18 +
 src/backend/utils/misc/guc.c| 430 
 src/backend/utils/misc/postgresql.conf.sample   |  24 ++
 src/bin/pg_archivecleanup/pg_archivecleanup.c   |   2 +-
 src/include/access/xlog.h   |  21 +
 src/include/access/xlog_internal.h  |   2 +
 src/include/utils/guc_tables.h  |   2 +
 24 files changed, 1202 insertions(+), 964 deletions(-)
 delete mode 100644 doc/src/sgml/recovery-config.sgml

diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c
index e4136f9..8fcb85c 100644
--- a/contrib/pg_standby/pg_standby.c
+++ b/contrib/pg_standby/pg_standby.c
@@ -520,7 +520,7 @@ usage(void)
 	printf("  -w MAXWAITTIME max seconds to wait for a file (0=no limit) (default=0)\n");
 	printf("  -?, --help show this help, then exit\n");
 	printf("\n"
-		   "Main intended use as restore_command in recovery.conf:\n"
+		   "Main intended use as restore_command in postgresql.conf:\n"
 		   "  restore_command = 'pg_standby [OPTION]... ARCHIVELOCATION %%f %%p %%r'\n"
 		   "e.g.\n"
 	"  restore_command = 'pg_standby /mnt/server/archiverdir %%f %%p %%r'\n");
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 6eaed1e..2df0dc6 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1190,8 +1190,15 @@ SELECT pg_stop_backup();


 
- Create a recovery command file recovery.conf in the cluster
- data directory (see ). You might
+ Set up recovery parameters in postgresql.conf (see
+  and
+ ).
+
+   
+   
+
+ Create a recovery trigger file recovery.trigger
+ in the cluster data directory. You might
  also want to temporarily modify pg_hba.conf to prevent
  ordinary users from connecting until you are sure the recovery was successful.
 
@@ -1203,7 +1210,7 @@ SELECT pg_stop_backup();
  recovery be terminated because of an external error, the server can
  simply be restarted and it will continue recovery.  Upon completion
  of the recovery process, the server will rename
- recovery.conf to recovery.done (to prevent
+ recovery.trigger to 

[HACKERS] auto_explain vs. parallel query

2016-10-31 Thread Tomas Vondra

Hi,

While debugging something on 9.6, I've noticed that auto_explain handles 
parallel queries in a slightly strange way - both the leader and all the 
workers log their chunk of the query (i.e. the leader logs explain for 
the whole query, while workers only log the parts they've been executing).


So for example for a query with 2 workers, the log look like this:

   2016-10-31 23:10:23.481 CET [12278] LOG:  duration: 32.562 ms  pl ...
   Query Text:   ...
   Parallel Seq Scan on tables  (cost=0.00..15158.25 rows=220 wi ...
 Filter: ((table_datoid < '10'::oid) AND (table_nspo ...
 Rows Removed by Filter: 140614  ...
   2016-10-31 23:10:23.481 CET [12277] LOG:  duration: 32.325 ms  pl ...
   Query Text:   ...
   Parallel Seq Scan on tables  (cost=0.00..15158.25 rows=220 wi ...
 Filter: ((table_datoid < '10'::oid) AND (table_nspo ...
 Rows Removed by Filter: 80948   ...
   2016-10-31 23:10:23.483 CET [12259] LOG:  duration: 38.997 ms  pl ...
   Query Text: explain analyze select * from tables where table_ ...
   Gather  (cost=1000.00..16211.15 rows=529 width=356) (actual t ...
 Workers Planned: 2  ...
 Workers Launched: 2 ...
 ->  Parallel Seq Scan on tables  (cost=0.00..15158.25 rows= ...
   Filter: ((table_datoid < '10'::oid) AND (tabl ...
   Rows Removed by Filter: 105570...

I'd say that's fairly surprising, and I haven't found any discussion 
about auto_explain vs. parallel queries in the archives (within the last 
year), so I assume this may not be entirely intentional.


Not only does this not match the output when running EXPLAIN ANALYZE
manually, it also provides no information about which messages from the
workers belong to which "leader" message.


Another thing is that this interacts with sampling in a rather 
unfortunate way, because each process evaluates the sampling condition 
independently. So for example with auto_explain.sample_rate=0.5 a random 
subset of worker/leader explain plans will be logged.


But this also applies to the conditions in ExecutorStart, which enables 
instrumentation only when the query gets sampled - so when the leader 
gets sampled but not all the workers, the counters in the query logged 
by the leader are incomplete.


I do believe those are bugs that make auto_explain rather unusable with 
parallel queries. Adding IsParallelWorker() to the condition in 
ExecutorEnd should fix the extra logging messages (and log only from the 
leader).


Not sure how to fix the sampling - simply adding IsParallelWorker() to 
ExecutorStart won't do the trick, because we don't want to enable 
instrumentation only for sample queries. So I guess this needs to be 
decided in the leader, and communicated to the workers somehow.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Improvements in psql hooks for variables

2016-10-31 Thread Rahila Syed
Hello,

I have applied this patch on latest HEAD and have done basic testing which
works fine.

Some comments,

>if (current->assign_hook)
>-   (*current->assign_hook) (current->value);
>-   return true;
>+   {
>+   confirmed = (*current->assign_hook) (value);
>+   }
>+   if (confirmed)

Spurious brackets

>static bool
>+generic_boolean_hook(const char *newval, const char* varname, bool *flag)
>+{

Contrary to what the name suggests, this function does not seem to have
other implementations, as a hook would.
Also, this takes care of rejecting a syntactically wrong value only for
boolean variable hooks like autocommit_hook and on_error_stop_hook.
However, there are other variable hooks which call ParseVariableBool,
for instance echo_hidden_hook, which is handled separately in the patch.
Thus there is some duplication of code between generic_boolean_hook and
echo_hidden_hook, and similarly for on_error_rollback_hook.

>-static void
>+static bool
> fetch_count_hook(const char *newval)
> {
>pset.fetch_count = ParseVariableNum(newval, -1, -1, false);
>+   return true;
> }

Shouldn't invalid numeric string assignment for numeric variables be
handled too?

Instead of generic_boolean_hook, can't we have something like the
following, which, like generic_boolean_hook, can be called from the
specific variable assignment hooks:

static bool
ParseVariable(newval, VariableName, )
{
if (VariableName == ‘AUTOCOMMIT’ || ECHO_HIDDEN || other variable with
hooks which call ParseVariableBool )

else if (VariableName == ‘FETCH_COUNT’)
ParseVariableNum();
}
This would help merge into one place the logic that checks for valid
syntax before assigning and returns false on error.

>@@ -260,7 +276,7 @@ SetVariableAssignHook(VariableSpace space, const char
*name, VariableAssignHook
>current->assign_hook = hook;
>current->next = NULL;
>previous->next = current;
>-   (*hook) (NULL);
>+   (void)(*hook) (NULL);   /* ignore return value */

Sorry for my lack of understanding, but can you explain why the above
change is needed?

Thank you,
Rahila Syed








On Tue, Sep 20, 2016 at 11:16 PM, Daniel Verite 
wrote:

> Ashutosh Bapat wrote:
>
> > You may want to add this to the November commitfest
> > https://commitfest.postgresql.org/11/.
>
> Done. It's at https://commitfest.postgresql.org/11/799/
>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>
>
>


Re: [HACKERS] WAL consistency check facility

2016-10-31 Thread Michael Paquier
On Mon, Oct 31, 2016 at 9:31 PM, Robert Haas  wrote:
> On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier
>  wrote:
>> And here we go. Here is a review as well as a large brush-up for this
>> patch. A couple of things:
>> - wal_consistency is using a list of RMGRs, at the cost of being
>> PGC_POSTMASTER. I'd suggest making it PGC_SUSER, and use a boolean (I
>> have been thinking hard about that, and still I don't see the point).
>> It is rather easy to for example default it to false, and enable it to
>> true to check if a certain code path is correctly exercised or not for
>> WAL consistency. Note that this simplification reduces the patch size
>> by 100~150 lines. I know, I know, I'd expect some complains about
>> that
>
> I don't understand how you can fail to see the point of that.  As you
> yourself said, this facility generates a ton of WAL.  If you're
> focusing on one AM, why would you want to be forced to incur the
> overhead for every other AM?  A good deal has been written about this
> upthread already, and just saying "I don't see the point" seems to be
> ignoring the explanations already given.

Hehe, I was expecting you to jump on those lines. While looking at the
patch I have simplified it first to focus on the core engine of the
thing. Adding back this code sounds fine to me as there is a wall of
contestation. I offer to do it myself if the effort is the problem.
-- 
Michael




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-31 Thread Tomas Vondra

On 10/31/2016 02:24 PM, Tomas Vondra wrote:

On 10/31/2016 05:01 AM, Jim Nasby wrote:

On 10/30/16 1:32 PM, Tomas Vondra wrote:


Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's
some sort of CPU / OS scheduling artifact. For example, the system has
36 physical cores, 72 virtual ones (thanks to HT). I find it strange
that the "good" client counts are always multiples of 72, while the
"bad" ones fall in between.

  72 = 72 * 1   (good)
 108 = 72 * 1.5 (bad)
 144 = 72 * 2   (good)
 180 = 72 * 2.5 (bad)
 216 = 72 * 3   (good)
 252 = 72 * 3.5 (bad)
 288 = 72 * 4   (good)

So maybe this has something to do with how OS schedules the tasks, or
maybe some internal heuristics in the CPU, or something like that.


It might be enlightening to run a series of tests that are 72*.1 or *.2
apart (say, 72, 79, 86, ..., 137, 144).


Yeah, I've started a benchmark with a step of 6 clients

36 42 48 54 60 66 72 78 ... 252 258 264 270 276 282 288

instead of just

36 72 108 144 180 216 252 288

which did a test every 36 clients. To compensate for the 6x longer runs,
I'm only running tests for "group-update" and "master", so I should have
the results in ~36h.



So I've been curious and looked at results of the runs executed so far, 
and for the group_update patch it looks like this:


  clients  tps
 -
   36  117663
   42  139791
   48  129331
   54  144970
   60  124174
   66  137227
   72  146064
   78  100267
   84  141538
   90   96607
   96  139290
  102   93976
  108  136421
  114   91848
  120  133563
  126   89801
  132  132607
  138   87912
  144  129688
  150   87221
  156  129608
  162   85403
  168  130193
  174   83863
  180  129337
  186   81968
  192  128571
  198   82053
  204  128020
  210   80768
  216  124153
  222   80493
  228  125503
  234   78950
  240  125670
  246   78418
  252  123532
  258   77623
  264  124366
  270   76726
  276  119054
  282   76960
  288  121819

So, similar saw-like behavior, perfectly periodic. But the really 
strange thing is the peaks/valleys don't match those observed before!


That is, during the previous runs, 72, 144, 216 and 288 were "good" 
while 108, 180 and 252 were "bad". But in these new runs, all of those 
client counts are "good" ...


Honestly, I have no idea what to think about this ...

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-31 Thread Tomas Vondra

On 10/31/2016 08:43 PM, Amit Kapila wrote:

On Mon, Oct 31, 2016 at 7:58 PM, Tomas Vondra
 wrote:

On 10/31/2016 02:51 PM, Amit Kapila wrote:
And moreover, this setup (single device for the whole cluster) is very
common, we can't just neglect it.

But my main point here really is that the trade-off in those cases may not
be really all that great, because you get the best performance at 36/72
clients, and then the tps drops and variability increases. At least not
right now, before tackling contention on the WAL lock (or whatever lock
becomes the bottleneck).



Okay, but does wait event results show increase in contention on some
other locks for pgbench-3000-logged-sync-skip-64?  Can you share wait
events for the runs where there is a fluctuation?



Sure, I do have wait event stats, including a summary for different 
client counts - see this:


http://tvondra.bitbucket.org/by-test/pgbench-3000-logged-sync-skip-64.txt

Looking only at group_update patch for three interesting client counts, 
it looks like this:


   wait_event_type |    wait_event     |    108      144      180
  -----------------+-------------------+--------------------------
   LWLockNamed     | WALWriteLock      | 661284   847057  1006061
                   |                   | 126654   191506   265386
   Client          | ClientRead        |  37273    52791    64799
   LWLockTranche   | wal_insert        |  28394    51893    79932
   LWLockNamed     | CLogControlLock   |   7766    14913    23138
   LWLockNamed     | WALBufMappingLock |   3615     3739     3803
   LWLockNamed     | ProcArrayLock     |    913     1776     2685
   Lock            | extend            |    909     2082     2228
   LWLockNamed     | XidGenLock        |    301      349      675
   LWLockTranche   | clog              |    173      331      607
   LWLockTranche   | buffer_content    |    163      468      737
   LWLockTranche   | lock_manager      |     88      140      145

Compared to master, this shows a significant reduction in contention on 
CLogControlLock (which on master has 20k, 83k and 200k samples), shifting 
the contention to WALWriteLock instead.
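
For reference, a summary like this can be produced by sampling 
pg_stat_activity periodically during the run and aggregating the samples 
afterwards - a minimal sketch only, not the exact tooling used for these runs:

  -- one sample; run repeatedly (e.g. via "\watch 1") and accumulate into a table
  SELECT wait_event_type, wait_event, count(*) AS samples
    FROM pg_stat_activity
   WHERE wait_event IS NOT NULL
   GROUP BY 1, 2
   ORDER BY samples DESC;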


But perhaps you're asking about variability during the benchmark? I 
suppose that could be extracted from the collected data, but I haven't 
done that.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] sequential scan result order vs performance

2016-10-31 Thread Corey Huinker
On Sun, Oct 30, 2016 at 11:37 PM, Jim Nasby 
wrote:

> BTW, I've sometimes wished for a mode where queries would silently have
> result ordering intentionally futzed, to eliminate any possibility of
> dependence on tuple ordering (as well as having sequences start at some
> random value). I guess with the hooks that are in place today it wouldn't
> be hard to stick an ORDER BY random() in if there wasn't already a Sort node
> at the top level...


+1
In Oracle, we sorta had that feature by adding a parallel hint to a query
even if it didn't need it. It came in handy.
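
For illustration, the futzing Jim describes amounts to roughly this (the table
name is made up; a real implementation would inject the sort automatically
rather than via query text):

  -- force a meaningless order so tests cannot rely on physical tuple order
  SELECT * FROM (SELECT * FROM some_table) AS s ORDER BY random();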


Re: [HACKERS] COPY as a set returning function

2016-10-31 Thread Corey Huinker
Attached is a patch that implements copy_srf().

The function signature reflects cstate more than it reflects the COPY
options (filename + is_program instead of FILENAME or PROGRAM, etc.).

CREATE OR REPLACE FUNCTION copy_srf(
   filename    text DEFAULT null,
   is_program  boolean DEFAULT false,
   format      text DEFAULT 'text',  /* accepts text, csv, binary */
   delimiter   text DEFAULT null,
   null_string text DEFAULT E'\\N',
   header      boolean DEFAULT false,
   quote       text DEFAULT null,
   escape      text DEFAULT null,
   encoding    text DEFAULT null)
RETURNS SETOF RECORD

The major utility for this (at least for me) will be in ETLs that currently
make a lot of use of temp tables, and wish to either reduce I/O or avoid
pg_attribute bloat.

I have not yet implemented STDIN mode, but it's a start.

$ psql test
psql (10devel)
Type "help" for help.

# select * from copy_srf('echo 1,2; echo 4,5',true,'csv') as t(x text, y text);
 x | y
---+---
 1 | 2
 4 | 5
(2 rows)

Time: 1.456 ms
# select * from copy_srf('/tmp/small_file.txt'::text,false,'text') as t(x text, y text);
  x  |  y
-+-
 1   | 2
 a   | b
 cde | fgh
(3 rows)
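
As a sketch of the ETL use case mentioned above, the SRF output can feed an
INSERT directly instead of staging the data in a temp table (target table and
file path here are made up; the signature is the proposed one from this patch):

  -- load a CSV straight into the target table, no temp table in between
  INSERT INTO target_table (x, y)
  SELECT x, y
    FROM copy_srf('/tmp/small_file.csv', false, 'csv') AS t(x text, y text);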


On Mon, Oct 17, 2016 at 2:33 PM, Merlin Moncure  wrote:

> On Fri, Sep 30, 2016 at 9:56 PM, Tom Lane  wrote:
> > Craig Ringer  writes:
> >> On 1 Oct. 2016 05:20, "Tom Lane"  wrote:
> >>> I think the last of those suggestions has come up before.  It has the
> >>> large advantage that you don't have to remember a different syntax for
> >>> copy-as-a-function.
> >
> >> That sounds fantastic. It'd help this copy variant retain feature parity
> >> with normal copy. And it'd bring us closer to being able to FETCH in
> >> non-queries.
> >
> > On second thought, though, this couldn't exactly duplicate the existing
> > COPY syntax, because COPY relies heavily on the rowtype of the named
> > target table to tell it what it's copying.  You'd need some new syntax
> > to provide the list of column names and types, which puts a bit of
> > a hole in the "syntax we already know" argument.  A SRF-returning-record
> > would have a leg up on that, because we do have existing syntax for
> > defining the concrete rowtype that any particular call returns.
>
> One big disadvantage of SRF-returning-record syntax is that functions
> are basically unwrappable with generic wrappers sans major gymnastics
> such as dynamically generating the query and executing it.  This is a
> major disadvantage relative to the null::type hack we use in the
> populate_record style functions and perhaps ought to make this
> (SRF-returning-record syntax) style of use discouraged for useful
> library functions.  If there were a way to handle wrapping I'd
> withdraw this minor objection -- this has come up in dblink
> discussions a few times.
>
> merlin
>
diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index ada2142..0876ee1 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1006,6 +1006,21 @@ LANGUAGE INTERNAL
 STRICT IMMUTABLE PARALLEL SAFE
 AS 'jsonb_insert';
 
+CREATE OR REPLACE FUNCTION copy_srf(
+   IN filename text DEFAULT null, 
+   IN is_program boolean DEFAULT false,
+   IN format text DEFAULT 'text',
+   IN delimiter text DEFAULT null,
+   IN null_string text DEFAULT E'\\N',
+   IN header boolean DEFAULT false,
+   IN quote text DEFAULT null,
+   IN escape text DEFAULT null,
+   IN encoding text DEFAULT null)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000 CALLED ON NULL INPUT
+AS 'copy_srf';
+
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
 -- than use explicit 'superuser()' checks in those functions, we use the GRANT
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b4140eb..90ed2c5 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -30,6 +30,7 @@
 #include "commands/defrem.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
+#include "funcapi.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "mb/pg_wchar.h"
@@ -555,7 +556,6 @@ CopyGetData(CopyState cstate, void *databuf, int minread, 
int maxread)
 errmsg("could not read from 
COPY file: %m")));
break;
case COPY_OLD_FE:
-
/*
 * We cannot read more than minread bytes (which in 
practice is 1)
 * because old protocol doesn't have any clear way of 
separating
@@ -4555,3 +4555,377 @@ CreateCopyDestReceiver(void)
 
return (DestReceiver *) self;
 }
+
+Datum
+copy_srf(PG_FUNCTION_ARGS)
+{
+   ReturnSetInfo   *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+  

[HACKERS] Increase pltcl test coverage

2016-10-31 Thread Jim Nasby
This patch increases test coverage for pltcl, from 70% to 83%. Aside 
from that, the work on this uncovered 2 new bugs (the trigger return one 
I just submitted, as well as a bug in the SRF/composite patch).

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
diff --git a/src/pl/tcl/expected/pltcl_queries.out 
b/src/pl/tcl/expected/pltcl_queries.out
index 6cb1fdb..9e3a0dc 100644
--- a/src/pl/tcl/expected/pltcl_queries.out
+++ b/src/pl/tcl/expected/pltcl_queries.out
@@ -1,3 +1,7 @@
+BEGIN;
+SET LOCAL client_min_messages = WARNING;
+CREATE EXTENSION IF NOT EXISTS plpgsql;
+COMMIT;
 -- suppress CONTEXT so that function OIDs aren't in output
 \set VERBOSITY terse
 insert into T_pkey1 values (1, 'key1-1', 'test key');
@@ -185,12 +189,23 @@ select * from T_pkey2 order by key1 using @<, key2 
collate "C";
 
 -- show dump of trigger data
 insert into trigger_test values(1,'insert');
-NOTICE:  NEW: {i: 1, v: insert}
+NOTICE:  NEW: {}
+NOTICE:  OLD: {}
+NOTICE:  TG_level: STATEMENT
+NOTICE:  TG_name: statement_trigger
+NOTICE:  TG_op: INSERT
+NOTICE:  TG_relatts: {{} i v {} test_skip test_return_null test_argisnull}
+NOTICE:  TG_relid: bogus:12345
+NOTICE:  TG_table_name: trigger_test
+NOTICE:  TG_table_schema: public
+NOTICE:  TG_when: BEFORE
+NOTICE:  args: {42 {statement trigger}}
+NOTICE:  NEW: {i: 1, test_argisnull: f, test_return_null: f, test_skip: f, v: 
insert}
 NOTICE:  OLD: {}
 NOTICE:  TG_level: ROW
 NOTICE:  TG_name: show_trigger_data_trig
 NOTICE:  TG_op: INSERT
-NOTICE:  TG_relatts: {{} i v}
+NOTICE:  TG_relatts: {{} i v {} test_skip test_return_null test_argisnull}
 NOTICE:  TG_relid: bogus:12345
 NOTICE:  TG_table_name: trigger_test
 NOTICE:  TG_table_schema: public
@@ -232,13 +247,37 @@ NOTICE:  TG_table_name: trigger_test_view
 NOTICE:  TG_table_schema: public
 NOTICE:  TG_when: {INSTEAD OF}
 NOTICE:  args: {24 {skidoo view}}
+update trigger_test set v = 'update', test_skip=true where i = 1;
+NOTICE:  NEW: {}
+NOTICE:  OLD: {}
+NOTICE:  TG_level: STATEMENT
+NOTICE:  TG_name: statement_trigger
+NOTICE:  TG_op: UPDATE
+NOTICE:  TG_relatts: {{} i v {} test_skip test_return_null test_argisnull}
+NOTICE:  TG_relid: bogus:12345
+NOTICE:  TG_table_name: trigger_test
+NOTICE:  TG_table_schema: public
+NOTICE:  TG_when: BEFORE
+NOTICE:  args: {42 {statement trigger}}
+NOTICE:  SKIPPING OPERATION UPDATE
 update trigger_test set v = 'update' where i = 1;
-NOTICE:  NEW: {i: 1, v: update}
-NOTICE:  OLD: {i: 1, v: insert}
+NOTICE:  NEW: {}
+NOTICE:  OLD: {}
+NOTICE:  TG_level: STATEMENT
+NOTICE:  TG_name: statement_trigger
+NOTICE:  TG_op: UPDATE
+NOTICE:  TG_relatts: {{} i v {} test_skip test_return_null test_argisnull}
+NOTICE:  TG_relid: bogus:12345
+NOTICE:  TG_table_name: trigger_test
+NOTICE:  TG_table_schema: public
+NOTICE:  TG_when: BEFORE
+NOTICE:  args: {42 {statement trigger}}
+NOTICE:  NEW: {i: 1, test_argisnull: f, test_return_null: f, test_skip: f, v: 
update}
+NOTICE:  OLD: {i: 1, test_argisnull: f, test_return_null: f, test_skip: f, v: 
insert}
 NOTICE:  TG_level: ROW
 NOTICE:  TG_name: show_trigger_data_trig
 NOTICE:  TG_op: UPDATE
-NOTICE:  TG_relatts: {{} i v}
+NOTICE:  TG_relatts: {{} i v {} test_skip test_return_null test_argisnull}
 NOTICE:  TG_relid: bogus:12345
 NOTICE:  TG_table_name: trigger_test
 NOTICE:  TG_table_schema: public
@@ -246,16 +285,39 @@ NOTICE:  TG_when: BEFORE
 NOTICE:  args: {23 skidoo}
 delete from trigger_test;
 NOTICE:  NEW: {}
-NOTICE:  OLD: {i: 1, v: update}
+NOTICE:  OLD: {}
+NOTICE:  TG_level: STATEMENT
+NOTICE:  TG_name: statement_trigger
+NOTICE:  TG_op: DELETE
+NOTICE:  TG_relatts: {{} i v {} test_skip test_return_null test_argisnull}
+NOTICE:  TG_relid: bogus:12345
+NOTICE:  TG_table_name: trigger_test
+NOTICE:  TG_table_schema: public
+NOTICE:  TG_when: BEFORE
+NOTICE:  args: {42 {statement trigger}}
+NOTICE:  NEW: {}
+NOTICE:  OLD: {i: 1, test_argisnull: f, test_return_null: f, test_skip: f, v: 
update}
 NOTICE:  TG_level: ROW
 NOTICE:  TG_name: show_trigger_data_trig
 NOTICE:  TG_op: DELETE
-NOTICE:  TG_relatts: {{} i v}
+NOTICE:  TG_relatts: {{} i v {} test_skip test_return_null test_argisnull}
 NOTICE:  TG_relid: bogus:12345
 NOTICE:  TG_table_name: trigger_test
 NOTICE:  TG_table_schema: public
 NOTICE:  TG_when: BEFORE
 NOTICE:  args: {23 skidoo}
+truncate trigger_test;
+NOTICE:  NEW: {}
+NOTICE:  OLD: {}
+NOTICE:  TG_level: STATEMENT
+NOTICE:  TG_name: statement_trigger
+NOTICE:  TG_op: TRUNCATE
+NOTICE:  TG_relatts: {{} i v {} test_skip test_return_null test_argisnull}
+NOTICE:  TG_relid: bogus:12345
+NOTICE:  TG_table_name: trigger_test
+NOTICE:  TG_table_schema: public
+NOTICE:  TG_when: BEFORE
+NOTICE:  args: {42 {statement trigger}}
 -- Test composite-type arguments
 select tcl_composite_arg_ref1(row('tkey', 42, 'ref2'));
  tcl_composite_arg_ref1 
@@ -288,6 +350,20 @@ 

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-31 Thread Amit Kapila
On Mon, Oct 31, 2016 at 7:58 PM, Tomas Vondra
 wrote:
> On 10/31/2016 02:51 PM, Amit Kapila wrote:
> And moreover, this setup (single device for the whole cluster) is very
> common, we can't just neglect it.
>
> But my main point here really is that the trade-off in those cases may not
> be really all that great, because you get the best performance at 36/72
> clients, and then the tps drops and variability increases. At least not
> right now, before tackling contention on the WAL lock (or whatever lock
> becomes the bottleneck).
>

Okay, but do the wait event results show an increase in contention on some
other locks for pgbench-3000-logged-sync-skip-64?  Can you share wait
events for the runs where there is a fluctuation?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] Fix bug in handling of dropped columns in pltcl triggers

2016-10-31 Thread Jim Nasby
While reviewing code coverage in pltcl, I uncovered a bug in trigger 
function return handling. If you returned the munged name of a dropped 
column, that would silently be ignored. It would be unusual to hit this, 
since dropped columns end up with names like 
"...pg.dropped.2...", but as that's still a legitimate column name, 
silently ignoring it seems rather bogus.
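
For anyone curious what the munged name looks like, a quick illustration 
(the table name is made up):

  CREATE TABLE t (a int, dropme int);
  ALTER TABLE t DROP COLUMN dropme;
  SELECT attname, attisdropped
    FROM pg_attribute
   WHERE attrelid = 't'::regclass AND attnum > 0;
  -- the dropped column shows up as "........pg.dropped.2........"
  -- with attisdropped = true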


Work sponsored by FlightAware.

https://github.com/postgres/postgres/compare/master...decibel:tcl_dropped
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
commit 8f88fda52751f88aca4786f45d3d7f16a5343fc0
Author: Jim Nasby 
Date:   Mon Oct 31 14:11:43 2016 -0500

    Fix trigger dropped column handling

    Previously, if a trigger returned a column that had been dropped, using
    the munged column name, no error would be thrown. Since dropping a column
    does forcibly overwrite the column name, it would be very unusual for this
    to happen in practice, but silently ignoring what would otherwise be a
    legitimate column name seemed rather bogus.

diff --git a/src/pl/tcl/expected/pltcl_queries.out 
b/src/pl/tcl/expected/pltcl_queries.out
index 6cb1fdb..c3f33d9 100644
--- a/src/pl/tcl/expected/pltcl_queries.out
+++ b/src/pl/tcl/expected/pltcl_queries.out
@@ -184,6 +184,21 @@ select * from T_pkey2 order by key1 using @<, key2 collate 
"C";
 (4 rows)
 
 -- show dump of trigger data
+insert into test_return values(-1,'1 element array');
+ERROR:  trigger's return list must have even number of elements
+insert into test_return values(-2,'return dropped column');
+ERROR:  unrecognized attribute "pg.dropped.2"
+insert into test_return values(-3,'return ctid column');
+ERROR:  cannot set system attribute "ctid"
+insert into test_return values(-10,'do nothing');
+insert into test_return values(1,'good');
+select * from test_return;
+ i |  t   
+---+--
+   | 
+ 1 | good
+(2 rows)
+
 insert into trigger_test values(1,'insert');
 NOTICE:  NEW: {i: 1, v: insert}
 NOTICE:  OLD: {}
diff --git a/src/pl/tcl/expected/pltcl_setup.out 
b/src/pl/tcl/expected/pltcl_setup.out
index e65e9e3..707d77b 100644
--- a/src/pl/tcl/expected/pltcl_setup.out
+++ b/src/pl/tcl/expected/pltcl_setup.out
@@ -89,6 +89,33 @@ FOR EACH ROW EXECUTE PROCEDURE trigger_data(23,'skidoo');
 CREATE TRIGGER show_trigger_data_view_trig
 INSTEAD OF INSERT OR UPDATE OR DELETE ON trigger_test_view
 FOR EACH ROW EXECUTE PROCEDURE trigger_data(24,'skidoo view');
+-- test handling of return
+CREATE TABLE test_return(i int, dropme int, t text);
+ALTER TABLE test_return DROP dropme;
+CREATE FUNCTION trigger_return() RETURNS trigger LANGUAGE pltcl AS $body$
+switch $NEW(i) {
+-1 {
+   return {"single element"}
+   }
+   -2 {
+   spi_exec "SELECT attname FROM pg_attribute WHERE 
attrelid=$TG_relid AND attisdropped"
+   set a($attname) 1
+   return [array get a]
+   }
+   -3 {
+   return {ctid 1}
+   }
+   -10 {
+   # Will return nothing
+   }
+   default {
+   return OK
+   }
+   }
+$body$;
+CREATE TRIGGER test_trigger_return
+BEFORE INSERT ON test_return
+FOR EACH ROW EXECUTE PROCEDURE trigger_return();
 --
 -- Trigger function on every change to T_pkey1
 --
diff --git a/src/pl/tcl/pltcl.c b/src/pl/tcl/pltcl.c
index d236890..5d9a857 100644
--- a/src/pl/tcl/pltcl.c
+++ b/src/pl/tcl/pltcl.c
@@ -1128,7 +1128,9 @@ pltcl_trigger_handler(PG_FUNCTION_ARGS, bool pltrusted)
 * Get the attribute number
 
/
attnum = SPI_fnumber(tupdesc, ret_name);
-   if (attnum == SPI_ERROR_NOATTRIBUTE)
+   /* Assume system attributes can't be marked as dropped 
*/
+   if (attnum == SPI_ERROR_NOATTRIBUTE ||
+   (attnum > 0 && tupdesc->attrs[attnum - 
1]->attisdropped))
ereport(ERROR,

(errcode(ERRCODE_UNDEFINED_COLUMN),
 errmsg("unrecognized attribute 
\"%s\"",
@@ -1140,12 +1142,6 @@ pltcl_trigger_handler(PG_FUNCTION_ARGS, bool pltrusted)
ret_name)));
 

/
-* Ignore dropped columns
-
/
-   if 

Re: [HACKERS] Unsafe use of relation->rd_options without checking its type

2016-10-31 Thread Peter Geoghegan
On Mon, Oct 31, 2016 at 11:57 AM, Tom Lane  wrote:
> Now that I've seen this I wonder which other uses of rd_options are
> potentially broken.  RelationIsUsedAsCatalogTable() is hardly the
> only macro that is assuming with little justification that it's
> applied to the right kind of reloptions.

It seems worth adding an assertion, at least. I wonder what running
the regression tests with a bunch of similar assertions shows up...


-- 
Peter Geoghegan




[HACKERS] Unsafe use of relation->rd_options without checking its type

2016-10-31 Thread Tom Lane
I looked into bug#14397.  The submitter was discourteous enough not to
provide an in-line example, but here it is:

CREATE TABLE IF NOT EXISTS TEST_1 (
ID   SERIAL PRIMARY KEY,
C1   BYTEA,
C2   TEXT NOT NULL,
C3   BOOLEAN NOT NULL DEFAULT FALSE,
CONSTRAINT TEST_1_unique_idx UNIQUE(C1, C2)
);

create or replace view test as select * from test_1 with cascaded check option;

insert into test (c1, c2, c3) values (decode('MTIzAAE=', 'base64'),
'text', true) on conflict (c1, c2) do update set c3=excluded.c3;

ERROR:  ON CONFLICT is not supported on table "test" used as a catalog table
LINE 1: ...lues (decode('MTIzAAE=', 'base64'), 'text', true) on conflic...
 ^

The reason for the error is that transformOnConflictArbiter applies
RelationIsUsedAsCatalogTable() to something that it doesn't know to
be a plain relation --- it's a view in this case.  And that macro
blindly assumes that relation->rd_options is a StdRdOptions struct,
when in this case it's actually a ViewOptions struct.
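
The mismatch is easy to see from SQL, since both kinds of options land in the
same pg_class column (illustration only, using the objects from the report
above):

  -- the view stores ViewOptions-style options (check_option=cascaded),
  -- while a plain table would store StdRdOptions-style options here
  SELECT relname, relkind, reloptions
    FROM pg_class
   WHERE relname IN ('test_1', 'test');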

Now that I've seen this I wonder which other uses of rd_options are
potentially broken.  RelationIsUsedAsCatalogTable() is hardly the
only macro that is assuming with little justification that it's
applied to the right kind of reloptions.

We could band-aid this by having the RelationIsUsedAsCatalogTable()
macro check relation->relkind, but I'm inclined to think that the
right fix going forward is to insist that StdRdOptions, ViewOptions,
and the other in-memory representations of reloptions ought to be
self-identifying somehow.  We could add a field to them similar to
nodeTag, but one of the things that was envisioned to begin with
is relation types that have StdRdOptions as the first part of a
larger struct.  I'm not sure how to make that work with a tag.

Thoughts?

regards, tom lane




Re: [HACKERS] Query regarding selectDumpableExtension()

2016-10-31 Thread Tom Lane
amul sul  writes:
> Thanks for your explanation. I agree that this is not the only scenario
> where we need special care, but my question still stands: why do we
> really need to dump a built-in extension?

It's not built-in.  It's installed by default, yes, but it's also
droppable.

regards, tom lane




Re: [HACKERS] Query regarding selectDumpableExtension()

2016-10-31 Thread amul sul
On 31 Oct 2016 6:48 pm, "Tom Lane"  wrote:
>
> amul sul  writes:
> > On Fri, Oct 28, 2016 at 6:22 PM, Robert Haas 
wrote:
> >> There's a comment in dumpExtension() that explains it.
>
> > Let me explain the case I'm trying to tackle. I have two old dumps,
> > each of which has a couple of objects that depend on plpgsql. I have
> > restored the first dump and am trying to restore the second one using
> > 'pg_restore -c'; it is failing with the following error:
> > ERROR:  cannot drop extension plpgsql because other objects depend on it
>
> This is hardly specific to extensions.  If you try a restore with -c into
> a database that has other random objects besides what's in the dump, you
> could get errors from
> * dropping tables that are referenced by foreign keys from tables not
>   known in the dump
> * dropping functions that are used in views not known in the dump
> * dropping operators or opclasses used by indexes not known in the dump
> etc etc.
>
> > Works well without the '-c' option, but that's not a general solution,
> > IMHO.
>
> The general solution is either don't restore into a database containing
> unrelated objects, or be prepared to ignore errors from the DROP commands.
> The extension case actually works more smoothly than most of the others.
>

Thanks for your explanation. I agree that this is not the only scenario
where we need special care, but my question still stands: why do we
really need to dump a built-in extension?

Of course you could ask, why not? And my answer will be the same: "to placate
pg_restore, at least in the case I've explained before" :)

Regards,
Amul

Sent from a mobile device. Please excuse brevity and tpyos.


Re: [HACKERS] pg_sequence catalog

2016-10-31 Thread Peter Eisentraut
On 9/10/16 7:17 AM, Peter Eisentraut wrote:
> Let's start with that.  Here is a patch that adds a pg_sequences view in
> the style of pg_tables, pg_indexes, etc.  This seems useful independent
> of anything else, but would give us more freedom to change things around
> behind the scenes.

I have registered the pg_sequences view patch separately in the upcoming
commit fest: https://commitfest.postgresql.org/11/865/ (The posted patch
still applies, so no new patch.)

Attached is an updated patch for the pg_sequence catalog.  I haven't
received an actual technical review in the last CF, so I'll move it
forward.  However, if the pg_sequences view is accepted, the catalog
patch will of course need an update.  So if someone is looking here with
limited time, review the view patch first.

As before, this patch requires the "sequences and pg_upgrade" patch in
order for pg_upgrade to work.

I think together these three changes will make sequence metadata
inspection and modification easier and more robust and will facilitate
future changes in this area.
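
For a rough idea of what this buys us, sequence metadata becomes directly
queryable from the catalog; a minimal sketch using the column names visible in
the patch excerpt below (the seqrelid join column is assumed here):

  -- metadata for all sequences, without SELECTing from each sequence relation
  SELECT c.relname, s.seqstart, s.seqmin, s.seqmax
    FROM pg_class c
    JOIN pg_sequence s ON s.seqrelid = c.oid
   WHERE c.relkind = 'S';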

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From bb8d2260d27ba8685844d6268550c8b6f11f195a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 31 Oct 2016 12:00:00 -0400
Subject: [PATCH v2] Add pg_sequence system catalog

Move sequence metadata (start, increment, etc.) into a proper system
catalog instead of storing it in the sequence heap object.  This
separates the metadata from the sequence data.  Sequence metadata is now
operated on transactionally by DDL commands, whereas previously
rollbacks of sequence-related DDL commands would be ignored.
---
 src/backend/catalog/Makefile  |   2 +-
 src/backend/catalog/dependency.c  |   3 +
 src/backend/catalog/information_schema.sql|  13 +-
 src/backend/commands/sequence.c   | 375 +++---
 src/backend/utils/cache/syscache.c|  12 +
 src/include/catalog/indexing.h|   3 +
 src/include/catalog/pg_sequence.h |  30 +++
 src/include/commands/sequence.h   |  29 +-
 src/include/utils/syscache.h  |   1 +
 src/test/regress/expected/sanity_check.out|   1 +
 src/test/regress/expected/sequence.out|  33 ++-
 src/test/regress/expected/updatable_views.out |  93 +++
 src/test/regress/sql/sequence.sql |   8 +
 src/test/regress/sql/updatable_views.sql  |   2 +-
 14 files changed, 358 insertions(+), 247 deletions(-)
 create mode 100644 src/include/catalog/pg_sequence.h

diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index 1ce7610..cbf0d79 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -41,7 +41,7 @@ POSTGRES_BKI_SRCS = $(addprefix $(top_srcdir)/src/include/catalog/,\
 	pg_foreign_data_wrapper.h pg_foreign_server.h pg_user_mapping.h \
 	pg_foreign_table.h pg_policy.h pg_replication_origin.h \
 	pg_default_acl.h pg_init_privs.h pg_seclabel.h pg_shseclabel.h \
-	pg_collation.h pg_range.h pg_transform.h \
+	pg_collation.h pg_range.h pg_transform.h pg_sequence.h \
 	toasting.h indexing.h \
 )
 
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 04d7840..8e1e1ac 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -66,6 +66,7 @@
 #include "commands/proclang.h"
 #include "commands/schemacmds.h"
 #include "commands/seclabel.h"
+#include "commands/sequence.h"
 #include "commands/trigger.h"
 #include "commands/typecmds.h"
 #include "nodes/nodeFuncs.h"
@@ -1155,6 +1156,8 @@ doDeletion(const ObjectAddress *object, int flags)
 	else
 		heap_drop_with_catalog(object->objectId);
 }
+if (relKind == RELKIND_SEQUENCE)
+	DeleteSequenceTuple(object->objectId);
 break;
 			}
 
diff --git a/src/backend/catalog/information_schema.sql b/src/backend/catalog/information_schema.sql
index 00550eb..182d2d0 100644
--- a/src/backend/catalog/information_schema.sql
+++ b/src/backend/catalog/information_schema.sql
@@ -1535,15 +1535,16 @@ CREATE VIEW sequences AS
CAST(64 AS cardinal_number) AS numeric_precision,
CAST(2 AS cardinal_number) AS numeric_precision_radix,
CAST(0 AS cardinal_number) AS numeric_scale,
-   CAST(p.start_value AS character_data) AS start_value,
-   CAST(p.minimum_value AS character_data) AS minimum_value,
-   CAST(p.maximum_value AS character_data) AS maximum_value,
-   CAST(p.increment AS character_data) AS increment,
-   CAST(CASE WHEN p.cycle_option THEN 'YES' ELSE 'NO' END AS yes_or_no) AS cycle_option
-FROM pg_namespace nc, pg_class c, LATERAL pg_sequence_parameters(c.oid) p
+   CAST(s.seqstart AS character_data) AS start_value,
+   CAST(s.seqmin AS character_data) AS minimum_value,
+   CAST(s.seqmax AS character_data) AS 

Re: [HACKERS] Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)

2016-10-31 Thread Tom Lane
Andres Freund  writes:
> On 2016-09-14 19:28:25 -0400, Tom Lane wrote:
>> So I think we should continue investigating this way of doing things.
>> I'll try to take a look at the executor end of it tomorrow.  However
>> I'm leaving Friday for a week's vacation, and may not have anything to
>> show before that.

> Are you planning to work on the execution side of things? I otherwise
> can take a stab...

My plan is to start on this when I go back into commitfest mode,
but right now I'm trying to produce a draft patch for RLS changes.

regards, tom lane




Re: [HACKERS] Proposal: scan key push down to heap [WIP]

2016-10-31 Thread Andres Freund
On 2016-10-31 09:28:00 -0400, Robert Haas wrote:
> On Fri, Oct 28, 2016 at 2:46 AM, Andres Freund  wrote:
> > Well, that'll also make the feature not particularly useful :(.  My
> > suspicion is that the way to succeed here isn't to rely more on testing
> > as part of the scan, but create a more general fastpath for qual
> > evaluation, which atm is a *LOT* more heavyweight than what
> > HeapKeyTest() does.  But maybe I'm biased since I'm working on the
> > latter...
>
> I think you might be right, but I'm not very clear on what the
> timeline for your work is.

Me neither.  But if we can stomach Dilip's approach of using a
slot in heapam, then I think my concerns are addressed, and this is
probably going to be a win regardless of faster expression evaluation
and/or batching.

> It would be easier to say, sure, let's put
> this on hold if we knew that in a month or two we could come back and
> retest after you've made some progress.  But I don't know whether
> we're talking about months or years.

I sure hope it's months.

Andres




Re: [HACKERS] Radix tree for character conversion

2016-10-31 Thread Daniel Gustafsson
> On 27 Oct 2016, at 09:23, Kyotaro HORIGUCHI  
> wrote:
> 
> Hello, thank you very much for the work. My work became much
> easier with it.
> 
> At Tue, 25 Oct 2016 12:23:48 +0300, Heikki Linnakangas  
> wrote in <08e7892a-d55c-eefe-76e6-7910bc8dd...@iki.fi>
>> 
>> [..]
>> The perl scripts are still quite messy. For example, I lost the checks
>> for duplicate mappings somewhere along the way - that ought to be put
>> back. My Perl skills are limited.
> 
> Perl scripts tend to be messy, I believe. Anyway, the duplicate
> check has been built into the sub print_radix_trees. Maybe the
> same check is needed by some plain map files, but it would just be
> duplication for the maps that have a radix tree.

I took a small stab at doing some cleaning of the Perl scripts, mainly around
using the more modern (well, modern as in 15+ years old) form of open(..),
avoiding global filehandles in favor of passing scalar references, and
enforcing use strict.  Some smaller typos and fixes were also included.  It
seems my Perl has become a bit rusty, so I hope the changes make sense.  The
produced files are identical with these patches applied; they are merely
cleanups as opposed to bugfixes.

The attached patches are against the 0001-0006 patches from Heikki and you in
this series of emails, the separation is intended to make them easier to read.

cheers ./daniel



0007-Fix-filehandle-usage.patch
Description: Binary data


0008-Make-all-scripts-use-strict-and-rearrange-logic.patch
Description: Binary data


0009-Use-my-instead-of-local.patch
Description: Binary data


0010-Various-small-style-nits-and-typos.patch
Description: Binary data


0011-Fix-hash-lookup.patch
Description: Binary data



Re: [HACKERS] Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)

2016-10-31 Thread Andres Freund
Hi Tom,

On 2016-09-14 19:28:25 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2016-09-12 19:35:22 -0400, Tom Lane wrote:
> >> Anyway I'll draft a prototype and then we can compare.
> 
> > Ok, cool.
> 
> Here's a draft patch that is just meant to investigate what the planner
> changes might look like if we do it in the pipelined-result way.
> Accordingly, I didn't touch the executor, but just had it emit regular
> Result nodes for SRF-execution steps.  However, the SRFs are all
> guaranteed to appear at top level of their respective tlists, so that
> those Results could be replaced with something that works like
> nodeFunctionscan.

> So I think we should continue investigating this way of doing things.
> I'll try to take a look at the executor end of it tomorrow.  However
> I'm leaving Friday for a week's vacation, and may not have anything to
> show before that.

Are you planning to work on the execution side of things? I otherwise
can take a stab...

Greetings,

Andres Freund




Re: [HACKERS] emergency outage requiring database restart

2016-10-31 Thread Oskari Saarenmaa

On 27.10.2016 at 21:53, Merlin Moncure wrote:

As noted earlier, I was not able to reproduce the issue with
crashme.sh, which was:

NUM_FORKS=16
do_parallel psql -p 5432  -c"select PushMarketSample('1740')" castaging_test
do_parallel psql -p 5432  -c"select PushMarketSample('4400')" castaging_test
do_parallel psql -p 5432  -c"select PushMarketSample('2160')" castaging_test
do_parallel psql -p 5432  -c"select PushMarketSample('6680')" castaging_test


(do_parallel is a simple wrapper that executes the command in parallel, up
to NUM_FORKS).  This is on the same server and cluster as above.
This kind of suggests that either
A) there is some concurrent activity from another process that is
tripping the issue
or
B) there is something particular to the session invoking the function
that is participating in the problem.  As the application is
structured, a single threaded node.js app is issuing the query that is
high traffic and long lived.  It's still running in fact and I'm kind
of tempted to find some downtime to see if I can still reproduce via
the UI.


Your production system's postgres backends probably have a lot more open 
files associated with them than the simple test case does.  Since 
Postgres likes to keep files open as long as possible and only closes 
them when it needs to free up fds to open new files, it's possible that 
your production backends have used up almost all of their allowed fds by 
the time you execute your pl/sh function.


If that's the case, the sqsh process that gets executed may not have enough 
fds to do what it wants to do, and because of busted error handling it 
could end up writing to fds that were opened by Postgres and point to 
$PGDATA files.
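
One related thing worth checking on the production system is the per-backend 
fd budget Postgres was configured with (just an illustration):

  -- upper bound on the files each backend will try to keep open simultaneously
  SHOW max_files_per_process;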


/ Oskari




Re: [HACKERS] WAL consistency check facility

2016-10-31 Thread Kuntal Ghosh
On Fri, Oct 28, 2016 at 11:35 AM, Michael Paquier
 wrote:
> And here we go. Here is a review as well as a large brush-up for this
> patch. A couple of things:
Thanks for reviewing the patch in detail.

> - Speaking of which using BKPIMAGE_IS_REQUIRED_FOR_REDO stored in the
> block definition is sort of weird because we want to know if
> consistency should be checked at a higher level.
A full page image can be included in a WAL record for any of the following
reasons:
1. It needs to be restored during replay.
2. WAL consistency should be checked for the record.
3. Both of the above.
In your patch, you've included a full page image whenever wal_consistency
is true. So, XLogReadBufferForRedoExtended always restores the image
and returns BLK_RESTORED, which is unacceptable. We can't change
the default WAL replay behaviour. A full image should only be restored if it is
necessary to do so. That said, I agree that BKPIMAGE_IS_REQUIRED_FOR_REDO
doesn't look like a clean way to implement this feature.

> - wal_consistency is using a list of RMGRs, at the cost of being
> PGC_POSTMASTER. I'd suggest making it PGC_SUSET, and using a boolean (I
> have been thinking hard about that, and still I don't see the point).
> It is rather easy to for example default it to false, and enable it to
> true to check if a certain code path is correctly exercised or not for
> WAL consistency. Note that this simplification reduces the patch size
> by 100~150 lines. I know, I know, I'd expect some complaints about
> that
As Robert also said, if I'm focusing on a single AM, I really don't
want to store full images and perform consistency checks for other AMs.

> On top of that, I have done a fair amount of testing, creating
> manually some inconsistencies in the REDO routines to trigger failures
> on standbys. And that was sort of fun to break things intentionally.
I know the feeling. :)

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2016-10-31 Thread Kouhei Kaigai
Hello,

The attached patch implements the suggestion by Amit before.

My motivation is to collect extra run-time statistics specific
to a particular ForeignScan/CustomScan, beyond the standard
Instrumentation; in my case, things like DMA transfer rate or execution
time of GPU kernels.

A per-node DSM toc is one of the best ways to return run-time statistics
to the master backend, because the FDW/CSP can request a region of
arbitrary length according to its needs, and it is quite easy to do so.
However, one problem is that the per-node DSM toc has already been
released by the time ExecEndNode() is called on the child node of Gather.

This patch allows extensions to get control in the master backend's
context once all the worker processes have finished but prior to release
of the DSM segment. If the FDW/CSP keeps its own statistics in the
segment, it can move them to private memory for EXPLAIN output
or some other purpose.

One design consideration is whether the hook should be called from
ExecParallelRetrieveInstrumentation() or ExecParallelFinish().
The former is a function to retrieve the standard Instrumentation
information, so it is only reached under EXPLAIN ANALYZE.
On the other hand, if we put the entrypoint in ExecParallelFinish(),
the extension can get control regardless of EXPLAIN ANALYZE; however,
that also needs an extra planstate_tree_walker().

Right now, we don't assume anything about the requirements of the FDW/CSP.
It may want run-time statistics regardless of EXPLAIN ANALYZE, so the
hook is always invoked when the Gather node has confirmed termination
of the worker processes.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 


> -Original Message-
> From: Amit Kapila [mailto:amit.kapil...@gmail.com]
> Sent: Monday, October 17, 2016 11:22 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; pgsql-hackers
> Subject: ##freemail## Re: [HACKERS] Steps inside ExecEndGather
> 
> On Mon, Oct 17, 2016 at 6:22 AM, Kouhei Kaigai  wrote:
> > Hello,
> >
> > I'm now trying to carry extra performance statistics on CustomScan
> > (like DMA transfer rate, execution time of GPU kernels, etc...)
> > from parallel workers to the leader process using the DSM segment
> > attached by the parallel-context.
> > We can require an arbitrary length of DSM using ExecCustomScanEstimate
> > hook by extension, then it looks leader/worker can share the DSM area.
> > However, we have a problem on this design.
> >
> > Below is the implementation of ExecEndGather().
> >
> >   void
> >   ExecEndGather(GatherState *node)
> >   {
> >   ExecShutdownGather(node);
> >   ExecFreeExprContext(&node->ps);
> >   ExecClearTuple(node->ps.ps_ResultTupleSlot);
> >   ExecEndNode(outerPlanState(node));
> >   }
> >
> > It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> > The DSM segment shall be released on this call, so child node cannot
> > reference the DSM at the time of ExecEndNode().
> >
> 
> Before releasing DSM, we do collect all the statistics or
> instrumentation information of each node.  Refer
> ExecParallelFinish()->ExecParallelRetrieveInstrumentation(), so I am
> wondering why can't you collect the additional information in the same
> way?
> 
> 
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com


parallel-finish-fdw_csp.v1.patch
Description: parallel-finish-fdw_csp.v1.patch



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-31 Thread Tomas Vondra

On 10/31/2016 02:51 PM, Amit Kapila wrote:

On Mon, Oct 31, 2016 at 12:02 AM, Tomas Vondra
 wrote:

Hi,

On 10/27/2016 01:44 PM, Amit Kapila wrote:

I've read that analysis, but I'm not sure I see how it explains the "zig
zag" behavior. I do understand that shifting the contention to some other
(already busy) lock may negatively impact throughput, or that the
group_update may result in updating multiple clog pages, but I don't
understand two things:

(1) Why this should result in the fluctuations we observe in some of the
cases. For example, why should we see 150k tps on, 72 clients, then drop to
92k with 108 clients, then back to 130k on 144 clients, then 84k on 180
clients etc. That seems fairly strange.



I don't think hitting multiple clog pages has much to do with
client-count.  However, we can wait to see your further detailed test
report.


(2) Why this should affect all three patches, when only group_update has to
modify multiple clog pages.



No, all three patches can be affected due to multiple clog pages.
Read second paragraph ("I think one of the probable reasons that could
happen for both the approaches") in same e-mail [1].  It is basically
due to frequent release-and-reacquire of locks.





On logged tables it usually looks like this (i.e. modest increase for
high
client counts at the expense of significantly higher variability):

  http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64



What variability are you referring to in those results?






Good question. What I mean by "variability" is how stable the tps is during
the benchmark (when measured on per-second granularity). For example, let's
run a 10-second benchmark, measuring number of transactions committed each
second.

Then all those runs do 1000 tps on average:

  run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000
  run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500
  run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000



Generally, such behaviours are seen due to writes. Are WAL and DATA
on same disk in your tests?



Yes, there's one RAID device on 10 SSDs, with 4GB of cache on the controller. 
I've done some tests and it easily handles >1.5GB/s in sequential 
writes, and >500MB/s in sustained random writes.


Also, let me point out that most of the tests were done so that the 
whole data set fits into shared_buffers, and with no checkpoints during 
the runs (so no writes to data files should really happen).


For example these tests were done on scale 3000 (45GB data set) with 
64GB shared buffers:


[a] 
http://tvondra.bitbucket.org/index2.html#pgbench-3000-unlogged-sync-noskip-64


[b] 
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-async-noskip-64


and I could show similar cases with scale 300 on 16GB shared buffers.

In those cases, there's very little contention between WAL and the rest 
of the database (in terms of I/O).


And moreover, this setup (single device for the whole cluster) is very 
common, we can't just neglect it.


But my main point here really is that the trade-off in those cases may 
not be really all that great, because you get the best performance at 
36/72 clients, and then the tps drops and variability increases. At 
least not right now, before tackling contention on the WAL lock (or 
whatever lock becomes the bottleneck).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-10-31 Thread Kouhei Kaigai
Hello,

The attached patch is a revised version of the pass-down-bounds feature.
Its functionality is unchanged from the previous version; however, the
implementation was revised according to the discussion at the last CF.

This patch adds a new field (ps_numTuples) to PlanState. This is
a hint for optimization when the parent node needs only the first N tuples.
It is to be set by the parent node prior to ExecProcNode(), after
ExecInitNode() or ExecReScan(); child nodes can then adjust their
execution behavior (e.g., Sort will use top-N heapsort if ps_numTuples
is set) and pass the hint further down to their own children.

As an example, I enhanced postgres_fdw to understand ps_numTuples
if it is set. When a remote ORDER BY is pushed down, the current
code sorts the entire remote table because it does not know how many
rows will actually be needed, which leads to a much larger execution time.
The patched version, on the other hand, runs the remote query with a LIMIT
clause according to ps_numTuples, which is supplied by the Limit
node on top of the ForeignScan node.

* without patch
=
postgres=# explain (analyze,verbose) select * from ft order by x,y limit 10;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Limit  (cost=100.00..100.43 rows=10 width=52) (actual time=2332.548..2332.550 
rows=10 loops=1)
   Output: id, x, y, z
   ->  Foreign Scan on public.ft  (cost=100.00..146.46 rows=1077 width=52) 
(actual time=2332.547..2332.548 rows=10 loops=1)
 Output: id, x, y, z
 Remote SQL: SELECT id, x, y, z FROM public.t ORDER BY x ASC NULLS 
LAST, y ASC NULLS LAST
 Planning time: 0.177 ms
 Execution time: 2445.590 ms
(7 rows)

* with patch
==
postgres=# explain (analyze,verbose) select * from ft order by x,y limit 10;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Limit  (cost=100.00..100.43 rows=10 width=52) (actual time=579.469..579.471 
rows=10 loops=1)
   Output: id, x, y, z
   ->  Foreign Scan on public.ft  (cost=100.00..146.46 rows=1077 width=52) 
(actual time=579.468..579.469 rows=10 loops=1)
 Output: id, x, y, z
 Remote SQL: SELECT id, x, y, z FROM public.t ORDER BY x ASC NULLS 
LAST, y ASC NULLS LAST
 Planning time: 0.123 ms
 Execution time: 579.858 ms
(7 rows)


Right now, I have a few concerns about this patch.
1. Because the LIMIT clause can be an arbitrary expression, not only a
   constant value, we cannot determine the value of ps_numTuples until
   execution time.  So it is not possible to adjust the remote query at
   planning time, and EXPLAIN does not show the exact remote query even
   if a LIMIT clause was actually attached.

2. Where is the best place to put the interface contract to set the
   ps_numTuples field?  It has to be set prior to the first ExecProcNode()
   after ExecInitNode() or ExecReScan().

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 

> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Friday, September 16, 2016 12:39 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jeevan Chalke; pgsql-hackers@postgresql.org; Etsuro Fujita; Andres Freund
> Subject: ##freemail## Re: [HACKERS] PassDownLimitBound for 
> ForeignScan/CustomScan
> 
> On Tue, Sep 13, 2016 at 9:07 PM, Kouhei Kaigai  wrote:
> > In the current implementation calls recompute_limits() on the first
> > invocation of ExecLimit and ExecReScanLimit. Do we expect the
> > ps->numTuples will be also passed down to the child nodes on the same
> > timing?
> 
> Sure, unless we find some reason why that's not good.
> 
> > I also think this new executor contract shall be considered as a hint
> > (but not a requirement) for the child nodes, because it allows the
> > parent nodes to re-distribute the upper limit regardless of the type
> > of the child nodes as long as the parent node can work correctly and
> > has benefit even if the child node returns a part of tuples. It makes
> > the decision whether the upper limit should be passed down much simple.
> > The child node "can" ignore the hint but can utilize for more optimization.
> 
> +1.
> 
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


passdown-limit-fdw.v1.patch
Description: passdown-limit-fdw.v1.patch



Re: [HACKERS] DML and column cound in aggregated subqueries

2016-10-31 Thread Andres Freund
On 2016-10-31 09:35:57 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > this doesn't look right. The ctid shouldn't be in the aggregate output -
> > after all it's pretty much meaningless here.
> 
> I suspect it's being added to support EvalPlanQual row re-fetches.

Hm, that doesn't seem particularly meaningful though? I wonder if it's
not actually more an accident than anything. Looks like
preprocess_rowmarks() adds them pretty unconditionally to everything for DML.


> > Casting a wider net: find_hash_columns() and its subroutines seem like
> > pretty much dead code, including outdated comments (look for "SQL99
> > semantics").
> 
> Hm, maybe.  In principle the planner could do that instead, but it doesn't
> look like it actually does at the moment; I'm not seeing any distinction
> in tlist-building for AGG_HASHED vs other cases in create_agg_plan().

Isn't the important part what's inside Agg->numCols/grpColIdx/grpOperators?
Those should suffice, because those are the ones that are minimal, and it
looks like we do the right thing there already (and if not, we'd be in
trouble anyway, AFAICS).

We already only store columns that find_hash_columns() figures out to be
required in the hashtable, but afaics we can instead just iterate over
grpColIdx[] and just store the ones in there.


The reason I'm looking into this in the first place is that the slot
access inside execGrouping turns out to be pretty expensive, especially
because all the tuples have nulls, so we can't even skip columns and
such. I wrote code to create a new tupledesc containing only the
columns referenced by grpColIdx, and asserted that it's the same set
that find_hash_columns() computes - but that's not always the case,
though AFAICS only spuriously.

Greetings,

Andres Freund




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-31 Thread Amit Kapila
On Mon, Oct 31, 2016 at 7:02 PM, Tomas Vondra
 wrote:
>
> The remaining benchmark with 512 clog buffers completed, and the impact
> roughly matches Dilip's benchmark - that is, increasing the number of clog
> buffers eliminates all positive impact of the patches observed on 128
> buffers. Compare these two reports:
>
> [a] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest
>
> [b] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest-512
>
> With 128 buffers the group_update and granular_locking patches achieve up to
> 50k tps, while master and no_content_lock do ~30k tps. After increasing
> number of clog buffers, we get only ~30k in all cases.
>
> I'm not sure what's causing this, whether we're hitting limits of the simple
> LRU cache used for clog buffers, or something else.
>

I have also seen previously that increasing clog buffers to 256 can
impact performance negatively.  So probably here the gains due to the
group_update patch are negated by the impact of increasing clog
buffers.  I am not sure it is a good idea to evaluate the impact of
increasing clog buffers together with this patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Dumb mistakes in WalSndWriteData()

2016-10-31 Thread Andres Freund
On 2016-10-31 09:44:16 -0400, Robert Haas wrote:
> On Mon, Oct 31, 2016 at 4:59 AM, Andres Freund  wrote:
> > I^Wsomebody appears to have made a number of dumb mistakes in
> > WalSndWriteData(), namely:
> > 1) The timestamp is set way too late, after
> >pq_putmessage_noblock(). That'll sometimes work, sometimes not.  I
> >have no idea how that ended up happening. It's eye-wateringly dumb.
> >
> > 2) We only do WalSndKeepaliveIfNecessary() if we're blocked on socket
> >IO. But on a long-lived connection that might be a lot of data, we
> >should really do that once *before* trying to send the payload in the
> >first place.
> >
> > 3) Similarly to 2) it might be worthwhile checking for interrupts
> >everytime, not just when blocked on network IO.
> >
> > See also:
> > http://archives.postgresql.org/message-id/CAMsr%2BYE2dSfHVr7iEv1GSPZihitWX-PMkD9QALEGcTYa%2Bsdsgg%40mail.gmail.com
> 
> Do you intend to do something about these problems?

At least 1) and 2), yes. I basically wrote this email to have something
to reference in my todo list...

Greetings,

Andres Freund




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-31 Thread Amit Kapila
On Mon, Oct 31, 2016 at 12:02 AM, Tomas Vondra
 wrote:
> Hi,
>
> On 10/27/2016 01:44 PM, Amit Kapila wrote:
>
> I've read that analysis, but I'm not sure I see how it explains the "zig
> zag" behavior. I do understand that shifting the contention to some other
> (already busy) lock may negatively impact throughput, or that the
> group_update may result in updating multiple clog pages, but I don't
> understand two things:
>
> (1) Why this should result in the fluctuations we observe in some of the
> cases. For example, why should we see 150k tps on, 72 clients, then drop to
> 92k with 108 clients, then back to 130k on 144 clients, then 84k on 180
> clients etc. That seems fairly strange.
>

I don't think hitting multiple clog pages has much to do with
client-count.  However, we can wait to see your further detailed test
report.

> (2) Why this should affect all three patches, when only group_update has to
> modify multiple clog pages.
>

No, all three patches can be affected due to multiple clog pages.
Read second paragraph ("I think one of the probable reasons that could
happen for both the approaches") in same e-mail [1].  It is basically
due to frequent release-and-reacquire of locks.

>
>
>>> On logged tables it usually looks like this (i.e. modest increase for
>>> high
>>> client counts at the expense of significantly higher variability):
>>>
>>>   http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64
>>>
>>
>> What variability are you referring to in those results?
>
>>
>
> Good question. What I mean by "variability" is how stable the tps is during
> the benchmark (when measured on per-second granularity). For example, let's
> run a 10-second benchmark, measuring number of transactions committed each
> second.
>
> Then all those runs do 1000 tps on average:
>
>   run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000
>   run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500
>   run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000
>

Generally, such behaviours are seen due to writes.  Are WAL and DATA
on same disk in your tests?


[1] - 
https://www.postgresql.org/message-id/CAA4eK1J9VxJUnpOiQDf0O%3DZ87QUMbw%3DuGcQr4EaGbHSCibx9yA%40mail.gmail.com


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Dumb mistakes in WalSndWriteData()

2016-10-31 Thread Robert Haas
On Mon, Oct 31, 2016 at 4:59 AM, Andres Freund  wrote:
> I^Wsomebody appears to have made a number of dumb mistakes in
> WalSndWriteData(), namely:
> 1) The timestamp is set way too late, after
>pq_putmessage_noblock(). That'll sometimes work, sometimes not.  I
>have no idea how that ended up happening. It's eye-wateringly dumb.
>
> 2) We only do WalSndKeepaliveIfNecessary() if we're blocked on socket
>IO. But on a long-lived connection that might be a lot of data, we
>should really do that once *before* trying to send the payload in the
>first place.
>
> 3) Similarly to 2) it might be worthwhile checking for interrupts
>everytime, not just when blocked on network IO.
>
> See also:
> http://archives.postgresql.org/message-id/CAMsr%2BYE2dSfHVr7iEv1GSPZihitWX-PMkD9QALEGcTYa%2Bsdsgg%40mail.gmail.com

Do you intend to do something about these problems?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-31 Thread Tomas Vondra

On 10/30/2016 07:32 PM, Tomas Vondra wrote:

Hi,

On 10/27/2016 01:44 PM, Amit Kapila wrote:

On Thu, Oct 27, 2016 at 4:15 AM, Tomas Vondra
 wrote:


FWIW I plan to run the same test with logged tables - if it shows
similar
regression, I'll be much more worried, because that's a fairly typical
scenario (logged tables, data set > shared buffers), and we surely can't
just go and break that.



Sure, please do those tests.



OK, so I do have results for those tests - that is, scale 3000 with
shared_buffers=16GB (so continuously writing out dirty buffers). The
following reports show the results slightly differently - all three "tps
charts" next to each other, then the speedup charts and tables.

Overall, the results are surprisingly positive - look at these results
(all ending with "-retest"):

[1] http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest

[2]
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-noskip-retest


[3]
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest


All three show significant improvement, even with fairly low client
counts. For example with 72 clients, the tps improves by 20%, without
significantly affecting the variability of the results (measured as
stddev; more on this later).
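
For concreteness, a tiny self-contained sketch of that per-second measurement
(not from the benchmark scripts): it reads one tps value per line on stdin and
prints the mean and stddev; extracting the tps column from pgbench's
--progress output is assumed to happen beforehand.

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		x, sum = 0.0, sumsq = 0.0;
	long		n = 0;

	while (scanf("%lf", &x) == 1)
	{
		sum += x;
		sumsq += x * x;
		n++;
	}

	if (n > 1)
	{
		double		mean = sum / n;
		double		var = (sumsq - n * mean * mean) / (n - 1);

		printf("mean = %.1f tps, stddev = %.1f tps\n", mean, sqrt(var));
	}
	return 0;
}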

It's however interesting that "no_content_lock" is almost exactly the
same as master, while the other two cases improve significantly.

The other interesting thing is that "pgbench -N" [3] shows no such
improvement, unlike regular pgbench and Dilip's workload. Not sure why,
though - I'd expect to see significant improvement in this case.

I have also repeated those tests with clog buffers increased to 512 (so
4x the current maximum of 128). I only have results for Dilip's workload
and "pgbench -N":

[4]
http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest-512

[5]
http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest-512


The results are somewhat surprising, I guess, because the effect is
wildly different for each workload.

For Dilip's workload increasing clog buffers to 512 pretty much
eliminates all benefits of the patches. For example with 288 clients,
the group_update patch gives ~60k tps on 128 buffers [1] but only 42k
tps on 512 buffers [4].

With "pgbench -N", the effect is exactly the opposite - while with
128 buffers there was pretty much no benefit from any of the patches
[3], with 512 buffers we suddenly get almost 2x the throughput, but
only for group_update and master (while the other two patches show no
improvement at all).



The remaining benchmark with 512 clog buffers completed, and the impact 
roughly matches Dilip's benchmark - that is, increasing the number of 
clog buffers eliminates all positive impact of the patches observed on 
128 buffers. Compare these two reports:


[a] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest

[b] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest-512

With 128 buffers the group_update and granular_locking patches achieve 
up to 50k tps, while master and no_content_lock do ~30k tps. After 
increasing the number of clog buffers, we get only ~30k tps in all cases.


I'm not sure what's causing this, whether we're hitting limits of the 
simple LRU cache used for clog buffers, or something else. But maybe 
there's something in the design of clog buffers that makes them work less 
efficiently with more clog buffers? I'm not sure whether that's 
something we need to fix before eventually committing any of these patches.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DML and column count in aggregated subqueries

2016-10-31 Thread Tom Lane
Andres Freund  writes:
> this doesn't look right. The ctid shouldn't be in the aggregate output -
> after all it's pretty much meaningless here.

I suspect it's being added to support EvalPlanQual row re-fetches.

> Casting a wider net: find_hash_columns() and its subroutines seem like
> pretty much dead code, including outdated comments (look for "SQL99
> semantics").

Hm, maybe.  In principle the planner could do that instead, but it doesn't
look like it actually does at the moment; I'm not seeing any distinction
in tlist-building for AGG_HASHED vs other cases in create_agg_plan().
As an example:

regression=# explain verbose select avg(ten), hundred from tenk1 group by 2;
                               QUERY PLAN
-------------------------------------------------------------------------
 HashAggregate  (cost=508.00..509.25 rows=100 width=36)
   Output: avg(ten), hundred
   Group Key: tenk1.hundred
   ->  Seq Scan on public.tenk1  (cost=0.00..458.00 rows=10000 width=8)
         Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
(5 rows)

If you wanted to forgo find_hash_columns() then it would be important
for the seqscan to output a minimal tlist rather than try to optimize
away its projection step.

I'm not that excited about making such a change in terms of performance:
you'd essentially be skipping a handmade projection step inside nodeAgg
at the cost of probably adding one at the input node, as in this example.
But it might be worth doing anyway just on the grounds that this ought to
be the planner's domain not a custom hack in the executor.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical decoding and walsender timeouts

2016-10-31 Thread Craig Ringer
On 31 October 2016 at 16:52, Andres Freund  wrote:
> Hi,
>
> On 2016-10-31 16:34:38 +0800, Craig Ringer wrote:
>> TL;DR: Logical decoding clients need to generate their own keepalives
>> and not rely on the server requesting them to prevent timeouts. Or
>> admins should raise the wal_sender_timeout by a LOT when using logical
>> decoding on DBs with any big rows.
>
> Unconvinced.

Yeah. I've seen enough issues in the wild where we keep timing out and
restarting over and over until we increase wal_sender_timeout to know
there's _something_ going on. I am less sure I'm right about what it is
or how to solve it.

>> When sending a big message, WalSndWriteData() notices that it's
>> approaching timeout and tries to send a keepalive request, but the
>> request just gets buffered behind the remaining output plugin data and
>> isn't seen by the client until the client has received the rest of the
>> pending data.
>
> Only for individual messages, not the entire transaction though.

Right.  I initially thought it was the whole tx, but I was mistaken as
I'd failed to notice that WalSndWriteData() queues a keepalive
request.

> Are
> you sure the problem at hand is that we're sending a keepalive, but it's
> too late?

No, I'm not sure. I'm trying to identify the cause of an issue I've
seen in the wild, but never under conditions where it's been possible
to sit around and debug in a leisurely manner.

I'm trying to set up a TAP test to demonstrate that this happens, but
I don't think it's going to work without some kind of network
bandwidth limitation simulation or simulated latency. A local unix
socket is just too fast for Pg's row size limits.

> It might very well be that the actual issue is that we're
> never sending keepalives, because the network is fast enough / the tcp
> window is large enough.  IIRC we only send a keepalive if we're blocked
> on network IO?

Mm, that's a good point. That might better explain the issues I've
seen in the wild, since I never found strong evidence that individual
big rows were involved, but hadn't been able to come up with anything
else yet.

>> So: We could ask output plugins to deal with this for us, by chunking
>> up their data in small pieces and calling OutputPluginPrepareWrite()
>> and OutputPluginWrite() more than once per output plugin callback if
>> they expect to send a big message. But this pushes the complexity of
>> splitting up and handling big rows, and big Datums, onto each plugin.
>> It's awkward to do well and hard to avoid splitting things up
>> unnecessarily.
>
> There's decent reason for doing that independently though, namely that
> it's a lot more efficient from a memory management POV.

Definitely. Though you're always going to be tossing around ridiculous
chunks of memory when dealing with big external compressed toasted
data, unless there are ways to access that progressively that I'm
unaware of. Hopefully there are.

I'd quite like to extend the bdr/pglogical/logicalrep protocol so that
in-core logical rep, in some later version, can write a field as 'to
follow', like we currently mark unchanged toasted datums separately.
Then send it chunked, after the main row, in follow-up messages. That
way we keep processing keepalives, we don't allocate preposterous
amounts of memory, etc.

> I don't think the "unrequested keepalive" approach really solves the
> problem on a fundamental enough level.

Fair. It feels a bit like flailing in the dark, too.

>> (A separate issue is that we can also time out when doing logical
>> _replication_ if the downstream side blocks on a lock, since it's not
>> safe to send on a socket from a signal handler ... )
>
> That's strictly speaking not true. write() / sendmsg() are signal-safe
> functions.  There are good reasons not to do that, however, namely that the
> non-signal-handler code might be busy writing data itself.

Huh, ok. And since in pglogical/bdr and as far as I can tell in core
logical rep we don't send anything on the socket while we're calling
in to heap access, the executor, etc, that'd actually be an option. We
could possibly safeguard it with a volatile "socket busy" flag since
we don't do much sending anyway. But I'd need to do my reading on
signal handler safety etc. Still, good to know it's not completely
absurd to do this if the issue comes up.
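
Just to make that idea concrete, a rough sketch of such a guard (the function
names are invented here, and it ignores the race between the handler's check
and the flag updates, which is exactly the sort of thing that reading up on
signal handler safety would need to settle):

#include <signal.h>
#include <stdbool.h>

static volatile sig_atomic_t socket_busy = false;

static void
send_feedback_normal_path(void)
{
	socket_busy = true;
	/* ... the usual non-signal-handler send of a feedback message ... */
	socket_busy = false;
}

static void
keepalive_signal_handler(int signo)
{
	(void) signo;

	if (!socket_busy)
	{
		/*
		 * write()/sendmsg() are async-signal-safe, so a prebuilt keepalive
		 * message could be written directly to the socket here.
		 */
	}
}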

Thanks very much for the input. I saw your post with proposed changes.
Once I can get the issue simulated reliably I'll test the patch and
see if it solves it, but it looks like it's sensible to just apply it
anyway TBH.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: scan key push down to heap [WIP]

2016-10-31 Thread Robert Haas
On Fri, Oct 28, 2016 at 2:46 AM, Andres Freund  wrote:
> Well, that'll also make the feature not particularly useful :(.  My
> suspicion is that the way to succeed here isn't to rely more on testing
> as part of the scan, but create a more general fastpath for qual
> evaluation, which atm is a *LOT* more heavyweight than what
> HeapKeyTest() does.  But maybe I'm biased since I'm working on the
> latter...

I think you might be right, but I'm not very clear on what the
timeline for your work is.  It would be easier to say, sure, let's put
this on hold if we knew that in a month or two we could come back and
retest after you've made some progress.  But I don't know whether
we're talking about months or years.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-31 Thread Tomas Vondra

On 10/31/2016 05:01 AM, Jim Nasby wrote:

On 10/30/16 1:32 PM, Tomas Vondra wrote:


Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's
some sort of CPU / OS scheduling artifact. For example, the system has
36 physical cores, 72 virtual ones (thanks to HT). I find it strange
that the "good" client counts are always multiples of 72, while the
"bad" ones fall in between.

  72 = 72 * 1   (good)
 108 = 72 * 1.5 (bad)
 144 = 72 * 2   (good)
 180 = 72 * 2.5 (bad)
 216 = 72 * 3   (good)
 252 = 72 * 3.5 (bad)
 288 = 72 * 4   (good)

So maybe this has something to do with how OS schedules the tasks, or
maybe some internal heuristics in the CPU, or something like that.


It might be enlightening to run a series of tests that are 72*.1 or *.2
apart (say, 72, 79, 86, ..., 137, 144).


Yeah, I've started a benchmark with a step of 6 clients

36 42 48 54 60 66 72 78 ... 252 258 264 270 276 282 288

instead of just

36 72 108 144 180 216 252 288

which did a test every 36 clients. To compensate for the 6x longer runs, 
I'm only running tests for "group-update" and "master", so I should have 
the results in ~36h.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Overlook in 2bd9e412?

2016-10-31 Thread Robert Haas
On Fri, Oct 28, 2016 at 8:46 AM, Julien Rouhaud
 wrote:
> It looks like pq_putmessage_hook and pq_flush_hook were meant for
> development purpose and not supposed to be committed.
>
> Attached patch remove them.

Thanks.  Those were actually remnants of an earlier design which
didn't survive contact with Andres, and which I failed to clean up
properly.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Query regarding selectDumpableExtension()

2016-10-31 Thread Tom Lane
amul sul  writes:
> On Fri, Oct 28, 2016 at 6:22 PM, Robert Haas  wrote:
>> There's a comment in dumpExtension() that explains it.

> Let me explain the case I'm trying to tackle. I have two old dumps,
> each of which has a couple of objects that depend on plpgsql. I have
> restored the first dump and am trying to restore the second dump using
> 'pg_restore -c'; it fails with the following error:
> ERROR:  cannot drop extension plpgsql because other objects depend on it

This is hardly specific to extensions.  If you try a restore with -c into
a database that has other random objects besides what's in the dump, you
could get errors from
* dropping tables that are referenced by foreign keys from tables not
  known in the dump
* dropping functions that are used in views not known in the dump
* dropping operators or opclasses used by indexes not known in the dump
etc etc.

> It works well without the '-c' option, but that's not a general solution, IMHO.

The general solution is either don't restore into a database containing
unrelated objects, or be prepared to ignore errors from the DROP commands.
The extension case actually works more smoothly than most of the others.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2016-10-31 Thread Robert Haas
On Fri, Oct 28, 2016 at 3:40 AM, Andres Freund  wrote:
>>
>> There is not much common functionality between the two.
>
> I don't really agree. For me manual and automated prewarming are pretty
> closely related. Sure they have their independent usages, and not too
> much code might be shared. But grouping them in the same extension seems
> to make sense, it's confusing to have closely related but independent
> extensions.

I agree that putting them together would be fine.

>> One point that seems to be worth discussing is when should the buffer
>> information be dumped to a file?  Shall we dump at each checkpoint or
>> at regular intervals via auto prewarm worker or at some other time?
>
> Should probably be at some regular interval - not sure if checkpoints
> are the best time (or if it's even realistic to tie a bgworker to
> checkpoints), since checkpoints have a significant impact on the state
> of shared_buffers.

Checkpoints don't cause any buffer replacement, which I think is what
would be relevant here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2016-10-31 Thread Robert Haas
On Fri, Oct 28, 2016 at 3:09 AM, Ashutosh Bapat
 wrote:
> I think there are going to be two kinds of partitioning use-cases.
> First, carefully hand-crafted by DBAs so that every partition is
> different from the others, and so is every join between two partitions.
> There will be a smaller number of partitions, but creating paths for each
> join between partitions will be crucial from a performance point of
> view. Consider, for example, systems which use partitions to
> consolidate results from different sources for analytical purposes or
> sharding. If we consider various points you have listed in [1] as to
> why a partition is equivalent to a table, each join between partitions
> is going to have very different characteristics and thus deserves a
> set of paths of its own. Add to that the possibility of partition pruning
> or certain conditions affecting particular partitions, and the need for
> detailed planning is evident.
>
> The other usage of partitioning is to distribute the data and/or
> quickly eliminate the data by partition pruning. In such a case, all
> partitions of a given table will have very similar properties. There
> is a large chance that we will end up having same plans for every
> partition and for joins between partitions. In such cases, I think it
> suffices to create paths for just one or maybe a handful of partitions
> of the join and repeat that plan for other partitions of the join. But in such
> cases it also makes sense to have a light-weight representation for
> partitions as compared to partitions being full-fledged tables. If
> we have such a light-weight representation, we may not even create
> RelOptInfos representing joins between partitions, and different paths
> for each join between partitions.

I'm not sure I see a real distinction between these two use cases.  I
think that the problem of differing data distribution between
partitions is almost always going to be an issue.  Take the simple
case of an "orders" table which is partitioned by month.  First, the
month that's currently in progress may be much smaller than a typical
completed month.  Second, many businesses are seasonal and may have
many more orders at certain times of year.  For example, in American
retail, many businesses have large spikes in December.  I think some
businesses may do four times as much business in December as any other
month.  So you will have that sort of variation, at
least.

> A typical join tree will be composite: some portion partitioned and
> some portion unpartitioned or different portions partitioned by
> different partition schemes. In such a case, inaccurate costs for
> PartitionJoinPath can affect the plan heavily, causing a suboptimal
> path to be picked. Assuming that partitioning will be useful for large
> sets of data, choosing a suboptimal plan can be more dangerous than
> consuming memory for creating paths.

Well, sure.  But, I mean, every simplifying assumption which the
planner makes to limit resource consumption could have that effect.
join_collapse_limit, for example, can cause horrible plans.  However,
we have it anyway, because the alternative of having planning take far
too long is unpalatable.  Planning is always, at some level,
guesswork.

>> For each
>> partition, we switch to a new memory context, perform planning, copy
>> the best path and its substructure back to the parent context, and
>> then reset the context.
>
> This could be rather tricky. It assumes that all the code that creates
> paths for joins should not allocate any memory which is linked to
> some object in a context that lives longer than the path creation
> context. There is some code like create_join_clause() or
> make_canonical_pathkey(), which carefully chooses which memory context
> to allocate memory in. But can we ensure it always? postgres_fdw for
> example allocates memory for PgFdwRelationInfo in current memory
> context and attaches it in RelOptInfo, which should be in the
> planner's original context. So, if we create a new memory context for
> each partition, fpinfos would be invalidated when those contexts are
> released. Not that we can not enforce some restriction on the memory
> usage while planning; it's just hard to enforce, and bugs arising from it
> may go unnoticed. The GEQO planner might have its own problems with this
> approach. Third party FDWs will pose a problem.

Yep, there are problems.  :-)

> A possible solution would be to keep track of used paths using a
> reference count. Once the paths for given join tree are created, free
> up the unused paths by traversing pathlist in each of the RelOptInfos.
> Attached patch has a prototype implementation for the same. There are
> some paths which are not linked to RelOptInfos, which need a bit
> different treatment, but they can be handled too.

So, if you apply this with your previous patch, how much does it cut
down memory consumption?

>> In that way, peak memory usage only grows by
>> about a factor of 

Re: [HACKERS] sequences and pg_upgrade

2016-10-31 Thread Peter Eisentraut
On 9/30/16 12:50 PM, Anastasia Lubennikova wrote:
> The patches are good, no complaints.
> But again, I have the same question.
> I was confused, why do we always dump sequence data,
> because I'd overlooked the --sequence-data key. I'd rather leave this
> option, because it's quite non-intuitive behaviour...
>   /* dump sequence data even in schema-only mode */

Here are rebased patches.

Regarding your question:  The initial patch had a separate option for
this behavior, which was then used by pg_upgrade.  It was commented that
this option is not useful outside of pg_upgrade, so it doesn't need to
be exposed as a user-facing option.  I agreed with that and removed the
option.  We can always add the option back easily if someone really
wants it, but so far no use case has been presented.  So I suggest we
proceed with this proposal ignoring whether this option is exposed or not.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From d2b98ba5df815018dac1650134398c1bac7164a4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 23 Aug 2016 12:00:00 -0400
Subject: [PATCH v3 1/2] pg_dump: Separate table and sequence data object types

Instead of handling both sequence data and table data internally as
"table data", handle sequences separately under a "sequence set" type.
We already handled materialized view data differently, so it makes the
code somewhat cleaner to handle each relation kind separately at the top
level.

This does not change the output format, since there already was a
separate "SEQUENCE SET" archive entry type.  A noticeable difference is
that SEQUENCE SET entries now always appear after TABLE DATA entries.
And in parallel mode there is less sorting to do, because the sequence
data entries are no longer considered table data.
---
 src/bin/pg_dump/pg_dump.c  | 11 +++
 src/bin/pg_dump/pg_dump.h  |  1 +
 src/bin/pg_dump/pg_dump_sort.c | 28 +---
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 4da297f..3485cab 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -2090,6 +2090,8 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo, bool oids)
 
 	if (tbinfo->relkind == RELKIND_MATVIEW)
 		tdinfo->dobj.objType = DO_REFRESH_MATVIEW;
+	else if (tbinfo->relkind == RELKIND_SEQUENCE)
+		tdinfo->dobj.objType = DO_SEQUENCE_SET;
 	else
 		tdinfo->dobj.objType = DO_TABLE_DATA;
 
@@ -8498,11 +8500,11 @@ dumpDumpableObject(Archive *fout, DumpableObject *dobj)
 		case DO_TRANSFORM:
 			dumpTransform(fout, (TransformInfo *) dobj);
 			break;
+		case DO_SEQUENCE_SET:
+			dumpSequenceData(fout, (TableDataInfo *) dobj);
+			break;
 		case DO_TABLE_DATA:
-			if (((TableDataInfo *) dobj)->tdtable->relkind == RELKIND_SEQUENCE)
-dumpSequenceData(fout, (TableDataInfo *) dobj);
-			else
-dumpTableData(fout, (TableDataInfo *) dobj);
+			dumpTableData(fout, (TableDataInfo *) dobj);
 			break;
 		case DO_DUMMY_TYPE:
 			/* table rowtypes and array types are never dumped separately */
@@ -16226,6 +16228,7 @@ addBoundaryDependencies(DumpableObject **dobjs, int numObjs,
 addObjectDependency(preDataBound, dobj->dumpId);
 break;
 			case DO_TABLE_DATA:
+			case DO_SEQUENCE_SET:
 			case DO_BLOB_DATA:
 /* Data objects: must come between the boundaries */
 addObjectDependency(dobj, preDataBound->dumpId);
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index a60cf95..642c4d5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -63,6 +63,7 @@ typedef enum
 	DO_PROCLANG,
 	DO_CAST,
 	DO_TABLE_DATA,
+	DO_SEQUENCE_SET,
 	DO_DUMMY_TYPE,
 	DO_TSPARSER,
 	DO_TSDICT,
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 195b84a..5b96334 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -47,14 +47,15 @@ static const int dbObjectTypePriority[] =
 	11,			/* DO_CONVERSION */
 	18,			/* DO_TABLE */
 	20,			/* DO_ATTRDEF */
-	27,			/* DO_INDEX */
-	28,			/* DO_RULE */
-	29,			/* DO_TRIGGER */
-	26,			/* DO_CONSTRAINT */
-	30,			/* DO_FK_CONSTRAINT */
+	28,			/* DO_INDEX */
+	29,			/* DO_RULE */
+	30,			/* DO_TRIGGER */
+	27,			/* DO_CONSTRAINT */
+	31,			/* DO_FK_CONSTRAINT */
 	2,			/* DO_PROCLANG */
 	10,			/* DO_CAST */
 	23,			/* DO_TABLE_DATA */
+	24,			/* DO_SEQUENCE_SET */
 	19,			/* DO_DUMMY_TYPE */
 	12,			/* DO_TSPARSER */
 	14,			/* DO_TSDICT */
@@ -62,15 +63,15 @@ static const int dbObjectTypePriority[] =
 	15,			/* DO_TSCONFIG */
 	16,			/* DO_FDW */
 	17,			/* DO_FOREIGN_SERVER */
-	31,			/* DO_DEFAULT_ACL */
+	32,			/* DO_DEFAULT_ACL */
 	3,			/* DO_TRANSFORM */
 	21,			/* DO_BLOB */
-	24,			/* DO_BLOB_DATA */
+	25,			/* DO_BLOB_DATA */
 	22,		

Re: [HACKERS] WAL consistency check facility

2016-10-31 Thread Robert Haas
On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier
 wrote:
> And here we go. Here is a review as well as a large brush-up for this
> patch. A couple of things:
> - wal_consistency is using a list of RMGRs, at the cost of being
> PGC_POSTMASTER. I'd suggest making it PGC_SUSET, and using a boolean (I
> have been thinking hard about that, and still I don't see the point).
> It is rather easy to, for example, default it to false, and enable it to
> true to check if a certain code path is correctly exercised or not for
> WAL consistency. Note that this simplification reduces the patch size
> by 100~150 lines. I know, I know, I'd expect some complaints about
> that

I don't understand how you can fail to see the point of that.  As you
yourself said, this facility generates a ton of WAL.  If you're
focusing on one AM, why would you want to be forced to incur the
overhead for every other AM?  A good deal has been written about this
upthread already, and just saying "I don't see the point" seems to be
ignoring the explanations already given.

> - Looking for wal_consistency at replay has no real value. What if on
> a standby the parameter value is inconsistent with the one on the
> master? This logic adds a whole new level of complications and
> potential bugs. So instead my suggestion is to add a marker at WAL
> record level to check if this record should be checked for consistency
> at replay or not.

Agreed.

> This is also quite flexible if you think about it,
> the standby is made independent of the WAL generated on the master and
> just applies what it receives, checking whatever it sees flagged for checking.

+1.

> The best
> match here is to add a flag to XLogRecord->xl_info and make use of
> one of the low 4 bits; only one of them is used now, for
> XLR_SPECIAL_REL_UPDATE.

Seems reasonable.
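
For illustration, a minimal sketch of what that could look like (the flag name
and the surrounding code are hypothetical, not a committed design;
XLR_SPECIAL_REL_UPDATE already occupies 0x01 of the low four bits):

/* sketch only: a second flag bit in the low 4 bits of xl_info */
#define XLR_CHECK_CONSISTENCY		0x02	/* hypothetical name */

/* on the master, while assembling the record, if the GUC is enabled */
rechdr->xl_info |= XLR_CHECK_CONSISTENCY;

/* on the standby, at redo: only flagged records get masked and compared */
if (XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY)
	verify_page_consistency(record);	/* hypothetical function */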

> - in maskPage, the new rmgr routine, there is no need for the info and
> blkno arguments. info is not used at all to begin with. blkno is used
> for gin pages to detect meta pages but this can be guessed using the
> opaque pointer. For heap pages and speculative inserts, masking the
> blkno would be fine. That's not worth it.

Passing the blkno doesn't cost anything.  If it avoids guessing,
that's entirely worth it.

> - Instead of palloc'ing the old and new pages to compare, it would be
> more performant to keep around two static buffers worth of BLCKSZ and
> just use that. This way there is no need as well to perform any palloc
> calls in the masking functions, limiting the risk of errors (those
> code paths had better avoid errors IMO). It would be also less costly
> to just pass to the masking function a pointer to a buffer of size
> BLCKSZ and just do the masking on it.

We always palloc buffers like this so that they will be aligned.  But
we could arrange not to repeat the palloc every time (see, e.g.,
BootstrapXLOG()).
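
Something along these lines, for instance (a sketch only; the names are made
up here):

#include "postgres.h"
#include "utils/memutils.h"

/* masking buffers allocated once, kept for the life of the process */
static char *masked_old_page = NULL;
static char *masked_new_page = NULL;

static void
ensure_masking_buffers(void)
{
	if (masked_old_page == NULL)
	{
		/* palloc'd chunks are MAXALIGN'ed, which is enough for page images */
		masked_old_page = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
		masked_new_page = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
	}
}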

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] DML and column count in aggregated subqueries

2016-10-31 Thread Andres Freund
Hi,

while hacking up nodeAgg.c to use a dedicated slot/desc for the
hashtable, an assert I placed brought up a weird case:

regression[29535][1]=# EXPLAIN VERBOSE update bar set f2 = f2 + 100 where f1 in 
(select f1 from foo);
                                QUERY PLAN
----------------------------------------------------------------------------------------
 Update on public.bar  (cost=42.75..109.25 rows=1130 width=20)
   ->  Hash Join  (cost=42.75..109.25 rows=1130 width=20)
         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid
         Hash Cond: (bar.f1 = foo.f1)
         ->  Seq Scan on public.bar  (cost=0.00..32.60 rows=2260 width=14)
               Output: bar.f1, bar.f2, bar.ctid
         ->  Hash  (cost=40.25..40.25 rows=200 width=10)
               Output: foo.ctid, foo.f1
               ->  HashAggregate  (cost=38.25..40.25 rows=200 width=10)
                     Output: foo.ctid, foo.f1
                     Group Key: foo.f1
                     ->  Seq Scan on public.foo  (cost=0.00..32.60 rows=2260 width=10)
                           Output: foo.ctid, foo.f1
(13 rows)

this doesn't look right. The ctid shouldn't be in the aggregate output -
after all it's pretty much meaningless here.

Note that in this case find_hash_columns(aggstate) will return two
columns, but node->numCols will be 1 (and thus node->grpOperators only
has one element).

So it seems we're doing something wrong here.

Casting a wider net: find_hash_columns() and its subroutines seem like
pretty much dead code, including outdated comments (look for "SQL99
semantics"). The main explanation says:

/*
 * Create a list of the tuple columns that actually need to be stored in
 * hashtable entries.  The incoming tuples from the child plan node will
 * contain grouping columns, other columns referenced in our targetlist and
 * qual, columns used to compute the aggregate functions, and perhaps just
 * junk columns we don't use at all.  Only columns of the first two types
 * need to be stored in the hashtable, and getting rid of the others can
 * make the table entries significantly smaller.  To avoid messing up Var
 * numbering, we keep the same tuple descriptor for hashtable entries as the
 * incoming tuples have, but set unwanted columns to NULL in the tuples that
 * go into the table.
 */

but that seems bogus - when are we referencing "other columns referenced
in our targetlist and qual" from the representative tuple that aren't also
grouped upon?  This method would select a pretty arbitrary value for
those, since we don't hash them, so it'll just be whatever the first
tuple of the group happened to contain?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve output of BitmapAnd EXPLAIN ANALYZE

2016-10-31 Thread Emre Hasegeli
The BRIN Bitmap Index Scan has the same problem.  I have seen people
confused by this.  I think N/A would clearly improve the situation.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sequential scan result order vs performance

2016-10-31 Thread Dagfinn Ilmari Mannsåker
Jim Nasby  writes:

> BTW, I've sometimes wished for a mode where queries would silently have
> result ordering intentionally futzed, to eliminate any possibility of
> dependence on tuple ordering (as well as having sequences start at some
> random value).

FWIW, SQLite has this, in the form of 'PRAGMA reverse_unordered_selects'.

http://sqlite.org/pragma.html#pragma_reverse_unordered_selects

-- 
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with
 the consequences of."  -- Skud's Meta-Law


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Reload SSL certificates on SIGHUP

2016-10-31 Thread Andreas Karlsson
I have attached a version of the patch rebased on top of the OpenSSL 1.1 
changes.


Andreas

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 668f217..a1b582f 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -77,12 +77,14 @@ static DH  *generate_dh_parameters(int prime_len, int generator);
 static DH  *tmp_dh_cb(SSL *s, int is_export, int keylength);
 static int	verify_cb(int, X509_STORE_CTX *);
 static void info_cb(const SSL *ssl, int type, int args);
-static void initialize_ecdh(void);
+static SSL_CTX *initialize_context(void);
+static bool initialize_ecdh(SSL_CTX *context);
 static const char *SSLerrmessage(unsigned long ecode);
 
 static char *X509_NAME_to_cstring(X509_NAME *name);
 
 static SSL_CTX *SSL_context = NULL;
+static bool SSL_initialized = false;
 
 /*  */
 /*		 Hardcoded values		*/
@@ -157,14 +159,12 @@ KWbuHn491xNO25CQWMtem80uKw+pTnisBRF/454n1Jnhub144YRBoN8CAQI=\n\
 /*
  *	Initialize global SSL context.
  */
-void
+int
 be_tls_init(void)
 {
-	struct stat buf;
+	SSL_CTX *context;
 
-	STACK_OF(X509_NAME) *root_cert_list = NULL;
-
-	if (!SSL_context)
+	if (!SSL_initialized)
 	{
 #ifdef HAVE_OPENSSL_INIT_SSL
 		OPENSSL_init_ssl(OPENSSL_INIT_LOAD_CONFIG, NULL);
@@ -173,177 +173,28 @@ be_tls_init(void)
 		SSL_library_init();
 		SSL_load_error_strings();
 #endif
-
-		/*
-		 * We use SSLv23_method() because it can negotiate use of the highest
-		 * mutually supported protocol version, while alternatives like
-		 * TLSv1_2_method() permit only one specific version.  Note that we
-		 * don't actually allow SSL v2 or v3, only TLS protocols (see below).
-		 */
-		SSL_context = SSL_CTX_new(SSLv23_method());
-		if (!SSL_context)
-			ereport(FATAL,
-	(errmsg("could not create SSL context: %s",
-			SSLerrmessage(ERR_get_error();
-
-		/*
-		 * Disable OpenSSL's moving-write-buffer sanity check, because it
-		 * causes unnecessary failures in nonblocking send cases.
-		 */
-		SSL_CTX_set_mode(SSL_context, SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER);
-
-		/*
-		 * Load and verify server's certificate and private key
-		 */
-		if (SSL_CTX_use_certificate_chain_file(SSL_context,
-			   ssl_cert_file) != 1)
-			ereport(FATAL,
-	(errcode(ERRCODE_CONFIG_FILE_ERROR),
-  errmsg("could not load server certificate file \"%s\": %s",
-		 ssl_cert_file, SSLerrmessage(ERR_get_error();
-
-		if (stat(ssl_key_file, ) != 0)
-			ereport(FATAL,
-	(errcode_for_file_access(),
-	 errmsg("could not access private key file \"%s\": %m",
-			ssl_key_file)));
-
-		if (!S_ISREG(buf.st_mode))
-			ereport(FATAL,
-	(errcode(ERRCODE_CONFIG_FILE_ERROR),
-	 errmsg("private key file \"%s\" is not a regular file",
-			ssl_key_file)));
-
-		/*
-		 * Refuse to load files owned by users other than us or root.
-		 *
-		 * XXX surely we can check this on Windows somehow, too.
-		 */
-#if !defined(WIN32) && !defined(__CYGWIN__)
-		if (buf.st_uid != geteuid() && buf.st_uid != 0)
-			ereport(FATAL,
-	(errcode(ERRCODE_CONFIG_FILE_ERROR),
-	 errmsg("private key file \"%s\" must be owned by the database user or root",
-			ssl_key_file)));
-#endif
-
-		/*
-		 * Require no public access to key file. If the file is owned by us,
-		 * require mode 0600 or less. If owned by root, require 0640 or less
-		 * to allow read access through our gid, or a supplementary gid that
-		 * allows to read system-wide certificates.
-		 *
-		 * XXX temporarily suppress check when on Windows, because there may
-		 * not be proper support for Unix-y file permissions.  Need to think
-		 * of a reasonable check to apply on Windows.  (See also the data
-		 * directory permission check in postmaster.c)
-		 */
-#if !defined(WIN32) && !defined(__CYGWIN__)
-		if ((buf.st_uid == geteuid() && buf.st_mode & (S_IRWXG | S_IRWXO)) ||
-			(buf.st_uid == 0 && buf.st_mode & (S_IWGRP | S_IXGRP | S_IRWXO)))
-			ereport(FATAL,
-	(errcode(ERRCODE_CONFIG_FILE_ERROR),
-  errmsg("private key file \"%s\" has group or world access",
-		 ssl_key_file),
-	 errdetail("File must have permissions u=rw (0600) or less if owned by the database user, or permissions u=rw,g=r (0640) or less if owned by root.")));
-#endif
-
-		if (SSL_CTX_use_PrivateKey_file(SSL_context,
-		ssl_key_file,
-		SSL_FILETYPE_PEM) != 1)
-			ereport(FATAL,
-	(errmsg("could not load private key file \"%s\": %s",
-			ssl_key_file, SSLerrmessage(ERR_get_error();
-
-		if (SSL_CTX_check_private_key(SSL_context) != 1)
-			ereport(FATAL,
-	(errmsg("check of private key failed: %s",
-			SSLerrmessage(ERR_get_error();
-	}
-
-	/* set up ephemeral DH keys, and disallow SSL v2/v3 while at it */
-	SSL_CTX_set_tmp_dh_callback(SSL_context, tmp_dh_cb);
-	SSL_CTX_set_options(SSL_context,
-		SSL_OP_SINGLE_DH_USE |
-		SSL_OP_NO_SSLv2 | SSL_OP_NO_SSLv3);
-
-	/* set 

Re: [HACKERS] Patch to implement pg_current_logfile() function

2016-10-31 Thread Gilles Darold
Le 31/10/2016 à 04:35, Karl O. Pinc a écrit :
> Hi Gilles,
>
> On Sat, 29 Oct 2016 22:00:08 +0200
> Gilles Darold  wrote:
>
>> The attached v10 of the current_logfiles patch 
> Attached is a patch to apply on top of the v10 patch.
>
> It updates current_logfiles only once per log rotation.
> I see no reason to open and write the file twice
> if both csvlog and stderr logging is happening.

I did not take the trouble to do that because log rotation is not
something that occurs very often, and I don't want to change the
conditional "time_based_rotation" code lines in logfile_rotate(), for
readability. My thought was also that under high load, log rotation is
automatically disabled, so logfile_writename() is not called and there
will be no additional I/O. But why not: if committers want to save this
additional I/O, this patch can be applied.
 
> I have 2 more questions.
>
> I don't understand why you're calling 
> logfile_writename(last_file_name, last_csv_file_name);
> in the SIGHUP code.  Presumably you've already
> written the old logfile names to current_logfiles.
> On SIGHUP you want to write the new names, not
> the old ones.  But the point of the SIGHUP code
> is to look for changes in logfile writing and then
> call logfile_rotate() to make those changes.  And
> logfile_rotate() calls logfile_writename() as appropriate.
> Shouldn't the logfile_writename call in the SIGHUP
> code be eliminated?

No - when you change log_destination and reload the configuration, you
have to refresh the content of current_logfiles. In this case no new
rotation has been done, so you have to write last_file_name and/or
last_csv_file_name.

> Second, and I've not paid really close attention here,
> you're calling logfile_writename() with last_file_name
> and last_csv_file_name in a number of places.  Are you
> sure that last_file_name and last_csv_file_name is reset
> to the empty string if stderr/csvlog logging is turned
> off and the config file re-read?  You might be recording
> that logging is going somewhere that's been turned off
> by a config change.  I've not noticed last_file_name and
> (especially) last_csv_file_name getting reset to the empty
> string.
In the patch I do not check whether the values of last_file_name and
last_csv_file_name have been reset, because depending on
log_destination they are either written to current_logfiles or not. When
they are written, they always contain the current log filename, because
the call to logfile_writename() always appears after their assignment to
the new rotated filename.

> FYI, ultimately, in order to re-try writes to current_logfiles
> after ENFILE and EMFILE failure, I'm thinking that I'll probably
> wind up with logfile_writename() taking no arguments.  It will
> always write last_file_name and last_csv_file_name.  Then it's
> a matter of making sure that these variables contain accurate
> values.  It would be helpful to let me know if you have any
> insight regards config file re-read and resetting of these
> variables before I dive into writing code which retries writes to
> current_logfiles.

As explained above, last_file_name and last_csv_file_name always contain
the last log file name, even after repeated logfile_writename()
failures. Those variables might have been updated if a log rotation
occurred before a new call to logfile_writename(). In case of
ENFILE or EMFILE failure, log rotation is disabled and
logfile_writename() is not called, so last_file_name and
last_csv_file_name will still contain the last log file name.


-- 
Gilles Darold
Consultant PostgreSQL
http://dalibo.com - http://dalibo.org



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Dumb mistakes in WalSndWriteData()

2016-10-31 Thread Andres Freund
Hi,

I^Wsomebody appears to have made a number of dumb mistakes in
WalSndWriteData(), namely:
1) The timestamp is set way too late, after
   pq_putmessage_noblock(). That'll sometimes work, sometimes not.  I
   have no idea how that ended up happening. It's eye-wateringly dumb.

2) We only do WalSndKeepaliveIfNecessary() if we're blocked on socket
   IO. But on a long-lived connection that might be a lot of data, we
   should really do that once *before* trying to send the payload in the
   first place.

3) Similarly to 2) it might be worthwhile checking for interrupts
   every time, not just when blocked on network IO.
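
For point 1, the shape of the fix is roughly this (just a sketch against the
current code, not a finished patch): fill the send-timestamp field of the
already-assembled 'w' message before pq_putmessage_noblock() copies the buffer
into libpq's send buffer, instead of afterwards; tmpbuf and ctx are the
variables already used in WalSndWriteData().

	/* set the timestamp while we still own the message buffer ... */
	resetStringInfo(&tmpbuf);
	pq_sendint64(&tmpbuf, GetCurrentIntegerTimestamp());
	memcpy(&ctx->out->data[1 + sizeof(int64) + sizeof(int64)],
		   tmpbuf.data, sizeof(int64));

	/* ... and only then emit the CopyData packet */
	pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);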

See also:
http://archives.postgresql.org/message-id/CAMsr%2BYE2dSfHVr7iEv1GSPZihitWX-PMkD9QALEGcTYa%2Bsdsgg%40mail.gmail.com

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical decoding and walsender timeouts

2016-10-31 Thread Andres Freund
Hi,

On 2016-10-31 16:34:38 +0800, Craig Ringer wrote:
> TL;DR: Logical decoding clients need to generate their own keepalives
> and not rely on the server requesting them to prevent timeouts. Or
> admins should raise the wal_sender_timeout by a LOT when using logical
> decoding on DBs with any big rows.

Unconvinced.


> When sending a big message, WalSndWriteData() notices that it's
> approaching timeout and tries to send a keepalive request, but the
> request just gets buffered behind the remaining output plugin data and
> isn't seen by the client until the client has received the rest of the
> pending data.

Only for individual messages, not the entire transaction though.   Are
you sure the problem at hand is that we're sending a keepalive, but it's
too late? It might very well be that the actual issue is that we're
never sending keepalives, because the network is fast enough / the tcp
window is large enough.  IIRC we only send a keepalive if we're blocked
on network IO?


> So: We could ask output plugins to deal with this for us, by chunking
> up their data in small pieces and calling OutputPluginPrepareWrite()
> and OutputPluginWrite() more than once per output plugin callback if
> they expect to send a big message. But this pushes the complexity of
> splitting up and handling big rows, and big Datums, onto each plugin.
> It's awkward to do well and hard to avoid splitting things up
> unnecessarily.

There's decent reason for doing that independently though, namely that
it's a lot more efficient from a memory management POV.


I don't think the "unrequested keepalive" approach really solves the
problem on a fundamental enough level.


> (A separate issue is that we can also time out when doing logical
> _replication_ if the downstream side blocks on a lock, since it's not
> safe to send on a socket from a signal handler ... )

That's strictly speaking not true. write() / sendmsg() are signal-safe
functions.  There are good reasons not to do that, however, namely that the
non-signal-handler code might be busy writing data itself.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Logical decoding and walsender timeouts

2016-10-31 Thread Craig Ringer
Hi all

I've been investigating some intermittent timeout issues I've seen
with BDR users in the wild and I think I've identified an issue with
how logical decoding sends data and handles feedback.

TL;DR: Logical decoding clients need to generate their own keepalives
and not rely on the server requesting them to prevent timeouts. Or
admins should raise the wal_sender_timeout by a LOT when using logical
decoding on DBs with any big rows.


An output plugin accumulates data in a StringInfo which is then sent
by WalSndWriteData(), which does nonblocking sends, checks for
interrupts, processes client replies, and requests keepalives if
approaching walsender timeout.

A problem arises when the client's buffered data is big enough that it
takes more than wal_sender_timeout to transmit the whole chunk and get
a reply from the client on the link. A slow link, a big row, or some
badly timed packet loss can easily result in timeout.

Unlike in physical replication, a timeout on a logical decoding
session can be expensive:

* We have to restart logical decoding at restart_lsn, which can be a
LONG time ago if there are any long running tx's, and decode up to the
last confirmed xact;

* We then have to resend the whole xact we were working on when we
timed out, which might be quite big.

so we should do our best to avoid timeouts in logical decoding
sessions. Physical replication chunks data into small pieces and can
restart cheaply, but logical decoding will send a 100MB+ row as a
single message and have to re-read hundreds of GB of WAL and resend
gigabytes of data on the wire to get back where it started after a
timeout.

When sending a big message, WalSndWriteData() notices that it's
approaching timeout and tries to send a keepalive request, but the
request just gets buffered behind the remaining output plugin data and
isn't seen by the client until the client has received the rest of the
pending data. The keepalive requests are useful so long as no one
message takes more than wal_sender_timeout to send since it'll defer
timeout and give us enough time to process the next message. But they
won't prevent timeout within a single big message.

How to fix this?

We can't inject keepalive requests in the middle of a CopyData message.

After the cancel-message discussion I know better than to suggest
using out-of-band tcp data and SIGURG.

So: We could ask output plugins to deal with this for us, by chunking
up their data in small pieces and calling OutputPluginPrepareWrite()
and OutputPluginWrite() more than once per output plugin callback if
they expect to send a big message. But this pushes the complexity of
splitting up and handling big rows, and big Datums, onto each plugin.
It's awkward to do well and hard to avoid splitting things up
unnecessarily.
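
To illustrate the shape of that, here is a sketch of a helper an output plugin
could use (the 1MB chunk size and the helper itself are made up;
OutputPluginPrepareWrite()/OutputPluginWrite() are the existing entry points):

#include "postgres.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

#define PLUGIN_CHUNK_SIZE	(1024 * 1024)	/* arbitrary */

/*
 * Emit an already-serialized payload as a series of smallish CopyData
 * messages rather than one huge one, so the walsender gets a chance to
 * handle replies and keepalives between chunks.
 */
static void
send_in_chunks(LogicalDecodingContext *ctx, const char *data, Size len)
{
	Size		offset = 0;

	while (offset < len)
	{
		Size		thischunk = Min(PLUGIN_CHUNK_SIZE, len - offset);
		bool		last = (offset + thischunk >= len);

		OutputPluginPrepareWrite(ctx, last);
		appendBinaryStringInfo(ctx->out, data + offset, thischunk);
		OutputPluginWrite(ctx, last);

		offset += thischunk;
	}
}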

We could extend the logical decoding protocol slightly, so that the
CopyData messages that carry output plugin payloads are limited to
some fixed, reasonable size (1MB? 10MB?) where the CopyData message
overhead is insignificant and most rows won't need to be reassembled
from multiple CopyData messages by the downstream. This gives us gaps
in which we can inject keepalives / status updates during big messages
so we get the client replies and process them. This won't fix existing
releases though, and makes clients reassemble the chunks. It also
means clients can't know the full size of a message based on the
CopyData length anymore so it'll force clients that rely on the
CopyData message length to add their own length fields in the payload.
CopyData doesn't have anything we can use as a "continuation message
follows" flag.

We could permit the walsender to reset the keepalive timer whenever it
confirms that it's sent a reasonable chunk of data such that it's
unlikely to just be lurking in the kernel's socket send buffer or
various routers in transit. We don't care that the client received up
to any particular point, only that it's alive and making progress. We
know the client must be receiving and acking it since we're using TCP.
The problem is knowing how much data sent counts as "enough" to mean the
client must have received some of it. The socket send buffer is easy to check and
bounded, but TCP window scaling means there can be a huge amount of
un-acked data in transit. AFAIK we have no visibility into the network
stack's view of what's acked, at least portably. Linux has TCP_INFO
but it's hard to use and nonportable. So we can't assume that even
successfully sending 1GB of data means the client got data 1GB-ago
unless we clamp the congestion window or do other
possibly-performance-affecting things.
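
For what it's worth, the Linux-only probe would look roughly like this (a
sketch only, and as said it is neither portable nor obviously something to
rely on):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdbool.h>
#include <sys/socket.h>

/* Returns true if the kernel still has unacknowledged segments queued. */
static bool
socket_has_unacked_data(int fd)
{
	struct tcp_info ti;
	socklen_t	len = sizeof(ti);

	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) != 0)
		return true;			/* be conservative if the probe fails */

	return ti.tcpi_unacked > 0;
}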

Or we could expect logical decoding downstream clients to send their
own proactive feedback when receiving and processing large messages.
The walsender is already capable of noticing and processing such
replies, just not requesting them when it should. Again this pushes
complexity onto each client, but less so than requiring output plugins
to chunk up their messages. Also, since each 

Re: [HACKERS] Patch to implement pg_current_logfile() function

2016-10-31 Thread Gilles Darold
Le 30/10/2016 à 08:04, Karl O. Pinc a écrit :
> On Sat, 29 Oct 2016 22:00:08 +0200
> Gilles Darold  wrote:
>
>> The attached v10 of the current_logfiles patch include your last
>> changes on documentation but not the patch on v9 about the
>> user-supplied GUC value. I think the v10 path is ready for committers
>> and that the additional patch to add src/include/utils/guc_values.h
>> to define user GUC values is something that need to be taken outside
>> this one. Imo, thoses GUC values (stderr, csvlog) are not expected to
>> change so often to require a global definition, but why not, if
>> committers think this must be done I can add it to a v11 patch.
> I agree that the guc_values.h patches should be submitted separately
> to the committers, when your patch is submitted.  Creating symbols
> for these values is a matter of coding style they should resolve.
> (IMO it's not whether the values will change, it's whether
> someone reading the code can read the letters "stdout" and know
> to what they refer and where to find related usage elsewhere in 
> the code.)
>
> I'll keep up the guc_values.h patches and have them ready for
> submission along with your patch.
>
> I don't think your patch is quite ready for submission to
> the committers.
>
> Attached is a patch to be applied on top of your v10 patch
> which does basic fixup to logfile_writename().

Attached patch v11 include your patch.

>
> I have some questions about logfile_writename():
>
>  Why does the logfile_open() call fail silently?
>  Why not use ereport() here?  (With a WARNING level.)

logfile_open() "fail silently" in logfile_writename(), like in other
parts of syslogger.c where it is called, because this is a function()
that already report a message when an error occurs ("could not open log
file..."). I think I have already replied to this question.

>  Why do the ereport() invocations all have a LOG level?
>  You're not recovering and the errors are unexpected
>  so I'd think WARNING would be the right level.
>  (I previously, IIRC, suggested LOG level -- only if
>  you are retrying and recovering from an error.)

If you look deeper into syslogger.c, all messages are reported at LOG
level. I guess the reason is to avoid calling the syslogger repeatedly. I think I
have already replied to this question in the thread too.

> Have you given any thought to my proposal to change
> CURRENT_LOG_FILENAME to LOG_METAINFO_FILE?
Yes, I don't think the information logged in this file is any kind of meta
information, and CURRENT_LOG_FILENAME seems obvious.



-- 
Gilles Darold
Consultant PostgreSQL
http://dalibo.com - http://dalibo.org

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..645cbb9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4143,6 +4143,12 @@ SELECT * FROM parent WHERE key = 2400;
  server log
 
 
+When logs are written to the file-system their paths, names, and
+types are recorded in
+the  file.  This provides
+a convenient way to find and access log content without establishing a
+database connection.
+
 
  Where To Log
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2e64cc4..f5bfe59 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15443,6 +15443,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
   
 
   
+   pg_current_logfile()
+   text
+   primary log file name in use by the logging collector
+  
+
+  
+   pg_current_logfile(text)
+   text
+   log file name, of log in the requested format, in use by the
+   logging collector
+  
+
+  
pg_my_temp_schema()
oid
OID of session's temporary schema, or 0 if none
@@ -15660,6 +15673,34 @@ SET search_path TO schema , schema, ..
 the time when the postmaster process re-read the configuration files.)

 
+
+pg_current_logile
+   
+
+   
+Logging
+pg_current_logfile function
+   
+
+   
+pg_current_logfile returns, as text,
+the path of either the csv or stderr log file currently in use by the
+logging collector.  This is a path including the
+ directory and the log file name.
+Log collection must be active or the return value
+is NULL.  When multiple logfiles exist, each in a
+different format, pg_current_logfile called
+without arguments returns the path of the file having the first format
+found in the ordered
+list: stderr, csvlog.
+NULL is returned when no log file has any of these
+formats.  To request a specific file format supply,
+as text, either csvlog
+or stderr as the value of the optional parameter. The
+return value is NULL when the log format requested
+is not a configured .
+   
+

 pg_my_temp_schema

diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index fddb69b..1cfb9da 100644
--- a/doc/src/sgml/storage.sgml
+++ 

Re: [HACKERS] sequential scan result order vs performance

2016-10-31 Thread Albe Laurenz
Jim Nasby wrote:
> On 10/30/16 9:12 AM, Tom Lane wrote:
>> I think there will be a lot of howls.  People expect that creating
>> a table by inserting a bunch of rows, and then reading back those
>> rows, will not change the order.  We already futzed with that guarantee
>> a bit with syncscans, but that only affects quite large tables --- and
>> even there, we were forced to provide a way to turn it off.
> 
> Leaving a 30% performance improvement on the floor because some people
> don't grok how sets work seems insane to me.

+1

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Query regarding selectDumpableExtension()

2016-10-31 Thread amul sul
On Fri, Oct 28, 2016 at 6:22 PM, Robert Haas  wrote:
> On Thu, Oct 27, 2016 at 2:11 AM, amul sul  wrote:
>> selectDumpableExtension() skips the dump of built-in extensions in the
>> binary-upgrade case only - why not in a normal dump?
>> Can't we assume those will already be installed in the target database
>> at restore?
>
Apologies for the delayed response.

> There's a comment in dumpExtension() that explains it.
>
> /*
>  * In a regular dump, we use IF NOT EXISTS so that there isn't a
>  * problem if the extension already exists in the target database;
>  * this is essential for installed-by-default extensions such as
>  * plpgsql.
>  *

Understood.

>  * In binary-upgrade mode, that doesn't work well, so instead we skip
>  * built-in extensions based on their OIDs; see
>  * selectDumpableExtension.
>  */
>

I don't see the necessity of dumping it in normal mode, unless I'm
missing something.

Let me explain the case I'm trying to tackle. I have two old dumps,
each of which has a couple of objects that depend on plpgsql. I have
restored the first dump and am trying to restore the second dump using
'pg_restore -c'; it fails with the following error:

ERROR:  cannot drop extension plpgsql because other objects depend on it

It works well without the '-c' option, but that's not a general solution, IMHO.

Regards,
Amul


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers