[HACKERS] DROP FUNCTION of multiple functions
Here is a patch series that implements several changes in the internal grammar and node representation of function signatures. They are not necessarily meant to be applied separately, but they explain the progression of the changes nicely, so I left them like that for review. The end goal is to make some user-visible changes in DROP FUNCTION and possibly other commands that refer to functions.

With these patches, it is now possible to use DROP FUNCTION to drop multiple functions at once: DROP FUNCTION func1(), func2(), func3(). Other DROP commands already supported that, but DROP FUNCTION didn't, because the internal representation was complicated and couldn't handle it.

The next step after this would be to allow referring to functions without having to supply the arguments, if the name is unique. This is an SQL-standard feature and would be very useful for dealing with "business logic" functions with 10+ arguments. The details of that are to be worked out, but with the help of the present changes, this would be a quite localized change, because the grammar representation is well encapsulated.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From 0a5d27cf9ab91bb03daa956d5167c5c993dfb857 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Thu, 15 Sep 2016 12:00:00 -0500
Subject: [PATCH 1/7] Use grammar symbol function_with_argtypes consistently

Instead of sometimes referring to a function signature like func_name func_args, use the existing function_with_argtypes symbol, which combines the two.
---
 src/backend/parser/gram.y | 44 ++--
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 5547fc8..755b387 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -5366,21 +5366,21 @@ opclass_item:
 					n->order_family = $5;
 					$$ = (Node *) n;
 				}
-			| FUNCTION Iconst func_name func_args
+			| FUNCTION Iconst function_with_argtypes
 				{
 					CreateOpClassItem *n = makeNode(CreateOpClassItem);
 					n->itemtype = OPCLASS_ITEM_FUNCTION;
-					n->name = $3;
-					n->args = extractArgTypes($4);
+					n->name = $3->funcname;
+					n->args = $3->funcargs;
 					n->number = $2;
 					$$ = (Node *) n;
 				}
-			| FUNCTION Iconst '(' type_list ')' func_name func_args
+			| FUNCTION Iconst '(' type_list ')' function_with_argtypes
 				{
 					CreateOpClassItem *n = makeNode(CreateOpClassItem);
 					n->itemtype = OPCLASS_ITEM_FUNCTION;
-					n->name = $6;
-					n->args = extractArgTypes($7);
+					n->name = $6->funcname;
+					n->args = $6->funcargs;
 					n->number = $2;
 					n->class_args = $4;
 					$$ = (Node *) n;
@@ -5780,13 +5780,13 @@ CommentStmt:
 					n->comment = $7;
 					$$ = (Node *) n;
 				}
-			| COMMENT ON FUNCTION func_name func_args IS comment_text
+			| COMMENT ON FUNCTION function_with_argtypes IS comment_text
 				{
 					CommentStmt *n = makeNode(CommentStmt);
 					n->objtype = OBJECT_FUNCTION;
-					n->objname = $4;
-					n->objargs = extractArgTypes($5);
-					n->comment = $7;
+					n->objname = $4->funcname;
+					n->objargs = $4->funcargs;
+					n->comment = $6;
 					$$ = (Node *) n;
 				}
 			| COMMENT ON OPERATOR any_operator oper_argtypes IS comment_text
@@ -5998,15 +5998,15 @@ SecLabelStmt:
 					n->label = $9;
 					$$ = (Node *) n;
 				}
-			| SECURITY LABEL opt_provider ON FUNCTION func_name func_args
+			| SECURITY LABEL opt_provider ON FUNCTION function_with_argtypes
 			  IS security_label
 				{
 					SecLabelStmt *n = makeNode(SecLabelStmt);
 					n->provider = $3;
 					n->objtype = OBJECT_FUNCTION;
-					n->objname = $6;
-					n->objargs = extractArgTypes($7);
-					n->label = $9;
+					n->objname = $6->funcname;
+					n->objargs = $6->funcargs;
+					n->label = $8;
 					$$ = (Node *) n;
 				}
 			| SECURITY LABEL opt_provider ON LARGE_P OBJECT_P NumericOnly
@@ -7236,24 +7236,24 @@ opt_restrict:
  */
 
 RemoveFuncStmt:
-			DROP FUNCTION func_name func_args opt_drop_behavior
+			DROP FUNCTION function_with_argtypes opt_drop_behavior
 				{
 					DropStmt *n = makeNode(DropStmt);
 					n->removeType = OBJECT_FUNCTION;
-					n->objects = list_make1($3);
-					n->arguments = list_make1(extractArgTypes($4));
-					n->behavior = $5;
+					n->objects = list_make1($3->funcname);
+					n->arguments = list_make1($3->funcargs);
+					n->behavior = $4;
 					n->missing_ok = false;
 					n->concurrent = false;
 					$$ = (Node *)n;
 				}
-			| DROP FUNCTION IF_P EXISTS func_name func_args opt_drop_behavior
+			| DROP FUNCTION IF_P EXISTS function_with_argtypes opt_drop_behavior
 				{
 					DropStmt *n = makeNode(DropStmt);
 					n->removeType = OBJECT_FUNCTION;
-					n->objects = list_make1($5);
-					n->arguments = list_make1(extractArgTypes($6));
-					n->behavior
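To illustrate the user-visible change the patch series enables (a sketch; the function names and bodies are invented for the example):

```sql
CREATE FUNCTION func1() RETURNS int LANGUAGE sql AS 'SELECT 1';
CREATE FUNCTION func2(int) RETURNS int LANGUAGE sql AS 'SELECT $1';

-- Previously each function needed its own DROP FUNCTION statement;
-- with the patches, one statement can drop several at once:
DROP FUNCTION func1(), func2(int);
```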
Re: [HACKERS] Indirect indexes
Alvaro Herrera wrote:
> I propose we introduce the concept of "indirect indexes".

This is a WIP non-functional patch for indirect indexes. I've been distracted from working on it for some time already and will be off-line for half this month yet, but since this was discussed and seems to be considered a welcome idea, I am posting it for those who want to have a look at what I'm doing.

This can write values to indirect indexes (only btrees), but it cannot read values from them yet, so don't expect this to work at all unless you are hoping to read index files using pageinspect. (If you do test this, be aware that "VACUUM FULL pg_class" is a case that I know needs to be fixed.)

I think the most interesting change here is how HeapSatisfiesHOTandKeyUpdate() has accommodated some additional code to return a bitmapset of columns that are not modified by an update.

This implements a new command CREATE INDIRECT INDEX which instructs the AM to create an index that stores primary key values instead of CTID values. I have not tried yet to remove the limitation of only six bytes in the PK value.

The part of the patch I'm not submitting just yet adds a flag to IndexInfo used by IndexPath, so that when the index is selected by the planner, an IndirectIndexScan node is created instead of a plain IndexScan. This node knows how to invoke the AM so that the PK values are extracted in a first step and the CTIDs are extracted from the PK in a second step (IndirectIndexScan carries two IndexScanDesc structs and two index RelationDescs, so it keeps both the indirect index and the PK index open).

The part that generated the most discussion was vacuuming.
As I said earlier, I think that instead of trying to shoehorn an index cleanup into regular vacuum (and cause a terrible degradation of maintenance_work_mem consumption, into optimizing which so much work has gone), these indexes would instead rely on opportunistic cleanup through the "killtuple" interface, which causes index tuples to be removed during scans. Timely cleanup is not as critical as it is with regular (direct) indexes, given that CTIDs are not stored and thus tuple movement does not affect this type of index.

This patch is considerably smaller than the toy patch I had, which introduced a separate AM for "ibtree" -- that was duplicating a lot of code and adding more code complexity, which becomes much simpler with the approach in the current code. One thing I didn't like at all was that the ibtree routines were calling the "internal" btree routines, which was not nice. (Also, it would have meant having duplicate operator family/class rows.)

I hope to be back home and able to work with the commitfest on Nov 14th.
-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1b45a4c..9f899c7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -92,6 +92,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanindirect = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index f07eedc..1bc91d2 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -49,6 +49,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanindirect = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b8aa9bc..4ec34d5 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -69,6 +69,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = false;
+	amroutine->amcanindirect = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..3fa3d71 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -66,6 +66,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanindirect = false;
 	amroutine->amkeytype = INT4OID;
 
 	amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..9e91b41 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -96,10 +96,10 @@ static XLogRecPtr log_heap_update(Relation
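For a sense of the intended usage of the CREATE INDIRECT INDEX command described above (a sketch only; the exact syntax is not shown in this excerpt, and the table is invented):

```sql
CREATE TABLE accounts (
    id      integer PRIMARY KEY,   -- PK values are limited to six bytes for now
    owner   text,
    balance numeric
);

-- Stores PK values (id) instead of CTIDs, so updates that leave
-- "owner" unchanged do not need to touch this index:
CREATE INDIRECT INDEX ON accounts (owner);
```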
Re: [HACKERS] ECPG BUlk insert support using arrays
From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Haribabu Kommi
> I didn't find any relevant thread about the discussion of Bulk insert support
> in ECPG using arrays. Oracle supports the same and details are available
> in [1].
>
> I see some performance benefits in supporting the same in ECPG also.
> Does any one worked on this area before? Are there any problems in preparing
> a patch to support the same?

Please see "batch/pipelining support for libpq" by Craig. I said I'll use his API to implement the array insert for ECPG, but you can feel free to do it.

Regards,
Takayuki Tsunakawa

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Barriers
On Thu, Aug 18, 2016 at 1:55 PM, Thomas Munro wrote:
> On Tue, Aug 16, 2016 at 1:55 AM, Robert Haas wrote:
>> If we're going to remove barrier.h, I think that should be a separate
>> commit from creating a new barrier.h.
>
> OK. Here's a patch to remove the old header, and the v2 barrier patch
> which adds phases and attach/detach. As before, it depends on
> condition-variable-v2.patch.

Here's a new version which is rebased and adds support for passing wait_event through to pg_stat_activity.

-- 
Thomas Munro
http://www.enterprisedb.com

remove-useless-barrier-header-v2.patch
Description: Binary data

barrier-v3.patch
Description: Binary data
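For readers unfamiliar with the "phases" and "attach/detach" semantics being discussed, here is a minimal userland sketch in plain pthreads. This is not PostgreSQL's implementation or API (the actual patch builds on condition variables in shared memory); the names and behavior below are simplified assumptions.

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct Barrier
{
    pthread_mutex_t mutex;
    pthread_cond_t  cv;
    int             participants;   /* currently attached */
    int             arrived;        /* arrived in the current phase */
    int             phase;          /* current phase number */
} Barrier;

void barrier_init(Barrier *b)
{
    pthread_mutex_init(&b->mutex, NULL);
    pthread_cond_init(&b->cv, NULL);
    b->participants = 0;
    b->arrived = 0;
    b->phase = 0;
}

/* Join the barrier: later waits must also include this participant. */
void barrier_attach(Barrier *b)
{
    pthread_mutex_lock(&b->mutex);
    b->participants++;
    pthread_mutex_unlock(&b->mutex);
}

/* Wait for all attached participants; returns true in exactly one
 * participant per phase (useful for electing a worker to do setup). */
bool barrier_wait(Barrier *b)
{
    bool elected = false;
    pthread_mutex_lock(&b->mutex);
    int start_phase = b->phase;
    if (++b->arrived == b->participants)
    {
        /* last to arrive: advance the phase and wake everyone */
        elected = true;
        b->arrived = 0;
        b->phase++;
        pthread_cond_broadcast(&b->cv);
    }
    else
    {
        while (b->phase == start_phase)
            pthread_cond_wait(&b->cv, &b->mutex);
    }
    pthread_mutex_unlock(&b->mutex);
    return elected;
}

/* Leave the barrier; if everyone else has already arrived, release them. */
void barrier_detach(Barrier *b)
{
    pthread_mutex_lock(&b->mutex);
    b->participants--;
    if (b->arrived == b->participants && b->participants > 0)
    {
        b->arrived = 0;
        b->phase++;
        pthread_cond_broadcast(&b->cv);
    }
    pthread_mutex_unlock(&b->mutex);
}
```

The phase counter is what distinguishes this from a plain POSIX barrier: a participant can observe which phase it is in and detach without deadlocking the others.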
Re: [HACKERS] identity columns
New patch.

On 9/9/16 11:45 PM, Vitaly Burovoy wrote:
> 1. The standard requires "... ALTER COLUMN ... SET GENERATED { ALWAYS
> | BY DEFAULT }" (9075-2:2011 subcl 11.20), but the patch implements it
> as "... ALTER COLUMN ... ADD GENERATED { ALWAYS | BY DEFAULT } AS
> IDENTITY"

Has both now. They do different things, as documented.

> 2. The standard requires not more than one identity column, the patch
> does not follow that requirement, but it does not mentioned in the
> doc.

fixed

> 3. Changes in the table "information_schema.columns" is not full.

fixed

> 4. "" is not fully implemented
> because "" is implemented
> whereas "" is not.

done

> 5. According to 9075-2:2011 subcl 14.11 Syntax Rule 11)c) for a column
> with an indication that values are generated by default the only
> possible "" is "OVERRIDING USER VALUE".
> Implementation allows to use "OVERRIDING SYSTEM VALUE" (as "do
> nothing"), but it should be mentioned in "Compatibility" part in the
> doc.

done (documented)

> 6. "CREATE TABLE ... (LIKE ... INCLUDING ALL)" fails Assertion at
> src/backend/commands/tablecmds.c:631

fixed

> 7. Changing default is allowed but a column is still "identity":

fixed

> 8. Changing a column to be "identity" raises "duplicate key" exception:

fixed

> 9. Changing type of a column deletes linked sequence but leaves
> "default" and "identity" marks:

fixed

> 10. "identity" modifier is lost when the table inherits another one:

fixed, but I invite more testing of inheritance-related things

> 11. The documentation says "OVERRIDING ... VALUE" can be placed even
> before "DEFAULT VALUES", but it is against SQL spec and the
> implementation:

fixed

> 12. Dump/restore is broken for some cases:

fixed

> 13. doc/src/sgml/ref/create_table.sgml (5th chunk) has "TODO". Why?

fixed

> 14. It would be fine if psql has support of new clauses.

done

> 15. Initializing attidentity in most places is ' ' but makefuncs.c has
> "n->identity = 0;". Is it correct?

fixed

> 16. I think it is a good idea to not raise exceptions for "SET
> GENERATED/DROP IDENTITY" if a column has the same type of identity/not
> an identity. To be consistent with "SET/DROP NOT NULL".

The present behavior is per SQL standard.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 29738b0..027c73e 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1095,6 +1095,17 @@ pg_attribute Columns
+      attidentity
+      char
+
+
+       If a space character, then not an identity column. Otherwise,
+       a = generated always, d =
+       generated by default.
+
+
+
       attisdropped
       bool
diff --git a/doc/src/sgml/information_schema.sgml b/doc/src/sgml/information_schema.sgml
index c43e325..8ece439 100644
--- a/doc/src/sgml/information_schema.sgml
+++ b/doc/src/sgml/information_schema.sgml
@@ -1583,13 +1583,20 @@ columns Columns
       is_identity
       yes_or_no
-      Applies to a feature not available in PostgreSQL
+
+       If the column is an identity column, then YES,
+       else NO.
+
 
       identity_generation
       character_data
-      Applies to a feature not available in PostgreSQL
+
+       If the column is an identity column, then ALWAYS
+       or BY DEFAULT, reflecting the definition of the
+       column.
+
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index e48ccf2..b272633 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -42,6 +42,9 @@
 ALTER [ COLUMN ] column_name SET DEFAULT expression
 ALTER [ COLUMN ] column_name DROP DEFAULT
 ALTER [ COLUMN ] column_name { SET | DROP } NOT NULL
+ALTER [ COLUMN ] column_name ADD GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ]
+ALTER [ COLUMN ] column_name { SET GENERATED { ALWAYS | BY DEFAULT } | SET sequence_option | RESET } [...]
+ALTER [ COLUMN ] column_name DROP IDENTITY
 ALTER [ COLUMN ] column_name SET STATISTICS integer
 ALTER [ COLUMN ] column_name SET ( attribute_option = value [, ... ] )
 ALTER [ COLUMN ] column_name RESET ( attribute_option [, ... ] )
@@ -170,6 +173,32 @@ Description
+    ADD GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY
+    SET GENERATED { ALWAYS | BY DEFAULT }
+    DROP IDENTITY
+
+
+      These forms change whether a column is an identity column or change the
+      generation attribute of an existing identity column.
+      See for details.
+
+
+
+    SET sequence_option
+    RESET
+
+
+      These forms alter the sequence that underlies an existing identity
+      column. sequence_option is an option
+      supported by such
+      as INCREMENT BY.
+
+
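Putting the new ALTER TABLE clauses from the synopsis together, usage might look like this (a sketch; the table and values are invented, and exact behavior should be checked against the patch):

```sql
CREATE TABLE items (
    id   int GENERATED BY DEFAULT AS IDENTITY,
    note text
);

INSERT INTO items (note) VALUES ('a');            -- id taken from the sequence
INSERT INTO items (id, note) VALUES (100, 'b');   -- user value allowed: BY DEFAULT

ALTER TABLE items ALTER COLUMN id SET GENERATED ALWAYS;
ALTER TABLE items ALTER COLUMN id SET INCREMENT BY 10;  -- a sequence_option
ALTER TABLE items ALTER COLUMN id DROP IDENTITY;
```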
[HACKERS] ECPG BUlk insert support using arrays
Hi All,

I didn't find any relevant thread discussing bulk insert support in ECPG using arrays. Oracle supports this, and details are available in [1].

I see some performance benefits in supporting the same in ECPG as well. Has anyone worked on this area before? Are there any problems in preparing a patch to support it?

[1] - https://docs.oracle.com/cd/B28359_01/appdev.111/b28427/pc_08arr.htm

Regards,
Hari Babu
Fujitsu Australia
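For reference, the Oracle Pro*C feature in [1] binds host arrays so one statement inserts many rows. A hypothetical ECPG equivalent might look like the following. This syntax is purely speculative; ECPG does not support it today, which is the point of the proposal:

```c
/* SPECULATIVE sketch of proposed ECPG array insert, modeled on
 * Oracle Pro*C host arrays. None of this exists in ECPG yet. */
EXEC SQL BEGIN DECLARE SECTION;
int  ids[100];
char names[100][20];
EXEC SQL END DECLARE SECTION;

/* ... fill ids[] and names[] ... */

/* One round trip inserts 100 rows instead of issuing 100 INSERTs. */
EXEC SQL FOR 100 INSERT INTO emp (id, name) VALUES (:ids, :names);
```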
[HACKERS] background sessions
Here is a patch for the background sessions C API and PL/Python support. This was previously submitted as "autonomous transactions", which proved controversial, and there have been several suggestions for a new name.

I have renamed everything, removed all the incomplete PL/pgSQL stuff, did some refinement on the PL/Python interfaces, and added resource owner management so that you can preserve session handles across transactions. That allows a pg_background-like behavior to be implemented in a PL function. I have also added documentation, so reviewers could start there.

Probably not quite all done yet, but I think it contains a lot of useful pieces that we could make into something nice.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From 661c7fe769982e3f0c71f4ad57a768d7eb55f6e2 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 31 Oct 2016 12:00:00 -0400
Subject: [PATCH] Add background sessions

This adds a C API to run SQL statements in a background worker, communicating by FE/BE protocol over a DSM message queue. This can be used to execute statements and transactions separate from the main foreground session. Also included is a PL/Python interface to this functionality.
---
 doc/src/sgml/bgsession.sgml                     | 236 +++
 doc/src/sgml/filelist.sgml                      |   1 +
 doc/src/sgml/plpython.sgml                      | 102 +++
 doc/src/sgml/postgres.sgml                      |   1 +
 src/backend/commands/variable.c                 |   5 +
 src/backend/libpq/pqmq.c                        |  23 +
 src/backend/tcop/Makefile                       |   2 +-
 src/backend/tcop/bgsession.c                    | 890 
 src/backend/tcop/postgres.c                     |  24 +-
 src/include/commands/variable.h                 |   1 +
 src/include/libpq/pqmq.h                        |   1 +
 src/include/tcop/bgsession.h                    |  26 +
 src/include/tcop/tcopprot.h                     |   9 +
 src/pl/plpython/Makefile                        |   2 +
 src/pl/plpython/expected/plpython_bgsession.out | 188 +
 src/pl/plpython/expected/plpython_test.out      |   7 +-
 src/pl/plpython/plpy_bgsession.c                | 454 
 src/pl/plpython/plpy_bgsession.h                |  18 +
 src/pl/plpython/plpy_main.h                     |   3 +
 src/pl/plpython/plpy_planobject.c               |   1 +
 src/pl/plpython/plpy_planobject.h               |   2 +
 src/pl/plpython/plpy_plpymodule.c               |   5 +
 src/pl/plpython/plpy_spi.c                      |   7 +-
 src/pl/plpython/plpy_spi.h                      |   3 +
 src/pl/plpython/sql/plpython_bgsession.sql      | 148 
 25 files changed, 2139 insertions(+), 20 deletions(-)
 create mode 100644 doc/src/sgml/bgsession.sgml
 create mode 100644 src/backend/tcop/bgsession.c
 create mode 100644 src/include/tcop/bgsession.h
 create mode 100644 src/pl/plpython/expected/plpython_bgsession.out
 create mode 100644 src/pl/plpython/plpy_bgsession.c
 create mode 100644 src/pl/plpython/plpy_bgsession.h
 create mode 100644 src/pl/plpython/sql/plpython_bgsession.sql

diff --git a/doc/src/sgml/bgsession.sgml b/doc/src/sgml/bgsession.sgml
new file mode 100644
index 000..785ee66
--- /dev/null
+++ b/doc/src/sgml/bgsession.sgml
@@ -0,0 +1,236 @@
+
+
+
+ Background Session API
+
+
+ The background session API is a C API for creating additional database
+ sessions in the background and running SQL statements in them. A background
+ session behaves like a normal (foreground) session in that it has session
+ state, transactions, can run SQL statements, and so on. Unlike a foreground
+ session, it is not connected directly to a client. Instead the foreground
+ session can use this API to execute SQL statements and retrieve their
+ results. Higher-level integrations, such as in procedural languages, can
+ make this functionality available to clients. Background sessions are
+ independent from their foreground sessions in their session and transaction
+ state. So a background session cannot see uncommitted data in foreground
+ sessions or vice versa, and there is no preferential treatment about
+ locking. Like all sessions, background sessions are separate processes.
+ Foreground and background sessions communicate over shared memory message
+ queues instead of the sockets that a client/server connection uses.
+
+
+ Background sessions can be useful in a variety of scenarios when effects
+ that are independent of the foreground session are to be achieved, for
+ example:
+
+
+ Commit data independent of whether a foreground transaction commits, for
+ example for auditing. A trigger in the foreground session could effect
+ the necessary writes via a background session.
+
+
+ Large changes can be split up into smaller transactions. A foreground
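Based on the auditing scenario in the documentation above, PL/Python usage might look roughly like this. This is a sketch only: the excerpt does not show the actual PL/Python API, so the `BackgroundSession` object and its `execute`/`close` methods are assumptions, not the patch's real interface:

```sql
CREATE FUNCTION audit(note text) RETURNS void
LANGUAGE plpythonu
AS $$
# Hypothetical API names; check plpy_bgsession.c in the patch for the
# real interface.
s = plpy.BackgroundSession()
# This INSERT commits in the background session even if the calling
# transaction later rolls back.
plan = s.prepare("INSERT INTO audit_log (note) VALUES ($1)", ["text"])
s.execute(plan, [note])
s.close()
$$;
```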
Re: [HACKERS] commit fest manager for CF 2016-11?
On Tue, Nov 1, 2016 at 4:15 PM, Peter Eisentraut wrote:
> On 10/31/16 9:39 PM, Michael Paquier wrote:
>> We are on the 1st, at least in my timezone. So the CF should start
>> really soon, meaning that no new patches would get in. In a couple of
>> hours that would be fine? I recall that the last times we tried to
>> sync with the US East coast time.
>
> https://en.wikipedia.org/wiki/Anywhere_on_Earth

+1

-- 
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] commit fest manager for CF 2016-11?
On 10/31/16 9:39 PM, Michael Paquier wrote:
> We are on the 1st, at least in my timezone. So the CF should start
> really soon, meaning that no new patches would get in. In a couple of
> hours that would be fine? I recall that the last times we tried to
> sync with the US East coast time.

https://en.wikipedia.org/wiki/Anywhere_on_Earth

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WAL logging problem in 9.4.3?
Hi,

At Sun, 2 Oct 2016 21:43:46 +0900, Michael Paquier wrote in
> On Thu, Sep 29, 2016 at 10:02 PM, Kyotaro HORIGUCHI
> wrote:
> > Hello,
> >
> > At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier
> > wrote in
> >
> >> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
> >> wrote:
> >> > Hello, I return to this before my things:)
> >> >
> >> > Though I haven't played with the patch yet..
> >>
> >> Be sure to run the test cases in the patch or base your tests on them then!
> >
> > All items of 006_truncate_opt fail on ed0b228 and they are fixed
> > with the patch.
> >
> >> > Though I don't know how it actually impacts the perfomance, it
> >> > seems to me that we can live with truncated_to and sync_above in
> >> > RelationData and BufferNeedsWAL(rel, buf) instead of
> >> > HeapNeedsWAL(rel, buf). Anyway up to one entry for one relation
> >> > seems to exist at once in the hash.
> >>
> >> TBH, I still think that the design of this patch as proposed is pretty
> >> cool and easy to follow.
> >
> > It is clean from certain viewpoint but additional hash,
> > especially hash-searching on every HeapNeedsWAL seems to me to be
> > unacceptable. Do you see it accetable?
> >
> > The attached patch is quiiiccck-and-dirty-hack of Michael's patch
> > just as a PoC of my proposal quoted above. This also passes the
> > 006 test. The major changes are the following.
> >
> > - Moved sync_above and truncted_to into RelationData.
> >
> > - Cleaning up is done in AtEOXact_cleanup instead of explicit
> >   calling to smgrDoPendingSyncs().
> >
> > * BufferNeedsWAL (replace of HeapNeedsWAL) no longer requires
> >   hash_search. It just refers to the additional members in the
> >   given Relation.
> >
> > X I feel that I have dropped one of the features of the origitnal
> >   patch during the hack, but I don't recall it clearly now:(
> >
> > X I haven't consider relfilenode replacement, which didn't matter
> >   for the original patch. (but there's few places to consider).
> >
> > What do you think about this?
>
> I have moved this patch to next CF. (I still need to look at your patch.)

Thanks for considering that.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
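To make the quoted proposal concrete, here is a standalone sketch of what a hash-free BufferNeedsWAL test could look like once sync_above and truncated_to live in RelationData. The types and semantics below are my reading of the thread, not the actual PoC patch:

```c
#include <stdbool.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Stand-in for the two fields the PoC moves into RelationData. */
typedef struct RelWalState
{
    BlockNumber sync_above;     /* first block covered by a pending sync at
                                 * commit; InvalidBlockNumber if none */
    BlockNumber truncated_to;   /* low-water mark from a TRUNCATE, if any */
} RelWalState;

/* Unlike a HeapNeedsWAL() that probes a backend-wide hash on every call,
 * this just reads the relation's own fields: blocks at or above
 * sync_above will be fsynced at commit, so they need no WAL. */
static bool
BufferNeedsWAL(const RelWalState *rel, BlockNumber blkno)
{
    if (rel->sync_above == InvalidBlockNumber)
        return true;            /* no pending sync: WAL-log as usual */
    return blkno < rel->sync_above;
}
```

The point of the design is that the check becomes a couple of field reads per call, with cleanup happening for free when the Relation entry goes away at end of transaction.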
Re: [HACKERS] commit fest manager for CF 2016-11?
On Tue, Nov 1, 2016 at 10:27 AM, Haribabu Kommi wrote:
> On Mon, Oct 31, 2016 at 4:31 PM, Michael Paquier
> wrote:
>>
>> On Mon, Oct 31, 2016 at 1:49 PM, Peter Eisentraut
>> wrote:
>> > Commit fest CF 2016-11 is supposed to start in about a day. I don't
>> > think a commit fest manager was chosen yet. Volunteers?
>>
>> I'd like to pass my turn on this one as CFM, but I can bring in some
>> time for the one of January.
>
> I am willing to do as CFM for this commit fest, until unless some one
> else is interested.

Great. Thanks!

> I may need some help from previous commit fest managers as I am
> new to this process.

Sure, feel free to ask anything here if you want. I guess it would be appropriate for you to have admin rights in the CF app so that you can switch the status of the CF itself. Magnus, could it be possible to do so?

We are on the 1st, at least in my timezone. So the CF should start really soon, meaning that no new patches would get in. Starting it in a couple of hours would be fine? I recall that the last few times we tried to sync with the US East coast time.

-- 
Michael
Re: [HACKERS] commit fest manager for CF 2016-11?
On Mon, Oct 31, 2016 at 4:31 PM, Michael Paquier wrote:
> On Mon, Oct 31, 2016 at 1:49 PM, Peter Eisentraut
> wrote:
> > Commit fest CF 2016-11 is supposed to start in about a day. I don't
> > think a commit fest manager was chosen yet. Volunteers?
>
> I'd like to pass my turn on this one as CFM, but I can bring in some
> time for the one of January.

I am willing to act as CFM for this commit fest, unless someone else is interested.

I may need some help from previous commit fest managers as I am new to this process.

Regards,
Hari Babu
Fujitsu Australia
Re: [HACKERS] Logical decoding and walsender timeouts
>>> When sending a big message, WalSndWriteData() notices that it's
>>> approaching timeout and tries to send a keepalive request, but the
>>> request just gets buffered behind the remaining output plugin data and
>>> isn't seen by the client until the client has received the rest of the
>>> pending data.
>>
>> Only for individual messages, not the entire transaction though.
>
> Right. I initially thought it was the whole tx, but I was mistaken as
> I'd failed to notice that WalSndWriteData() queues a keepalive
> request.

This problem can be resolved by having the client send keepalives periodically, with an interval smaller than the timeout configured on the server. For example, if the server is configured with wal_sender_timeout = 60, the client should send keepalive messages at an interval of 60/3. In that case the server will not send a keepalive with the reply-required flag, and it will also not disconnect the client, because the server checks for incoming messages even while decoding a huge transaction. I faced a similar problem in pgjdbc and resolved it as described above.

2016-10-31 16:28 GMT+03:00 Craig Ringer:
> On 31 October 2016 at 16:52, Andres Freund wrote:
> > Hi,
> >
> > On 2016-10-31 16:34:38 +0800, Craig Ringer wrote:
> >> TL;DR: Logical decoding clients need to generate their own keepalives
> >> and not rely on the server requesting them to prevent timeouts. Or
> >> admins should raise the wal_sender_timeout by a LOT when using logical
> >> decoding on DBs with any big rows.
> >
> > Unconvinced.
>
> Yeah. I've seen enough issues in the wild where we keep timing out and
> restarting over and over until we increase wal_sender_timeout to know
> there's _something_ going on. I am less sure I'm right about what is
> or how to solve it.
> >> When sending a big message, WalSndWriteData() notices that it's
> >> approaching timeout and tries to send a keepalive request, but the
> >> request just gets buffered behind the remaining output plugin data and
> >> isn't seen by the client until the client has received the rest of the
> >> pending data.
> >
> > Only for individual messages, not the entire transaction though.
>
> Right. I initially thought it was the whole tx, but I was mistaken as
> I'd failed to notice that WalSndWriteData() queues a keepalive
> request.
>
> > Are you sure the problem at hand is that we're sending a keepalive,
> > but it's too late?
>
> No, I'm not sure. I'm trying to identify the cause of an issue I've
> seen in the wild, but never under conditions where it's been possible
> to sit around and debug in a leisurely manner.
>
> I'm trying to set up a TAP test to demonstrate that this happens, but
> I don't think it's going to work without some kind of network
> bandwidth limitation simulation or simulated latency. A local unix
> socket is just too fast for Pg's row size limits.
>
> > It might very well be that the actual issue is that we're
> > never sending keepalives, because the network is fast enough / the tcp
> > window is large enough. IIRC we only send a keepalive if we're blocked
> > on network IO?
>
> Mm, that's a good point. That might better explain the issues I've
> seen in the wild, since I never found strong evidence that individual
> big rows were involved, but hadn't been able to come up with anything
> else yet.
>
> >> So: We could ask output plugins to deal with this for us, by chunking
> >> up their data in small pieces and calling OutputPluginPrepareWrite()
> >> and OutputPluginWrite() more than once per output plugin callback if
> >> they expect to send a big message. But this pushes the complexity of
> >> splitting up and handling big rows, and big Datums, onto each plugin.
> >> It's awkward to do well and hard to avoid splitting things up
> >> unnecessarily.
> >
> > There's decent reason for doing that independently though, namely that
> > it's a lot more efficient from a memory management POV.
>
> Definitely. Though you're always going to be tossing around ridiculous
> chunks of memory when dealing with big external compressed toasted
> data, unless there are ways to access that progressively that I'm
> unaware of. Hopefully there are.
>
> I'd quite like to extend the bdr/pglogical/logicalrep protocol so that
> in-core logical rep, in some later version, can write a field as 'to
> follow', like we currently mark unchanged toasted datums separately.
> Then send it chunked, after the main row, in follow-up messages. That
> way we keep processing keepalives, we don't allocate preposterous
> amounts of memory, etc.
>
> > I don't think the "unrequested keepalive" approach really solves the
> > problem on a fundamental enough level.
>
> Fair. It feels a bit like flailing in the dark, too.
>
> >> (A separate issue is that we can also time out when doing logical
> >> _replication_ if the downstream side blocks on a lock, since it's not
> >> safe to send on a socket from a signal handler ... )
> >
> > That's strictly speaking not true. write() / sendmsg() are
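The client-side rule of thumb mentioned earlier in the thread (send keepalives at a third of the server's wal_sender_timeout, as pgjdbc does) can be sketched as a single decision function. This is an illustration of the heuristic, not code from pgjdbc or PostgreSQL:

```c
#include <stdbool.h>

/* Sketch: decide whether a logical-replication client should send a
 * standby status update now. Sending whenever a third of the server's
 * wal_sender_timeout has elapsed since the last update means the
 * walsender never has to request a reply and never sees the client go
 * silent for a full timeout. */
bool keepalive_due(long now_ms, long last_sent_ms, long wal_sender_timeout_ms)
{
    return now_ms - last_sent_ms >= wal_sender_timeout_ms / 3;
}
```

A consumer loop would call this between (and during) message reads and emit a status update whenever it returns true.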
Re: [HACKERS] Unsafe use of relation->rd_options without checking its type
> The reason for the error is that transformOnConflictArbiter applies
> RelationIsUsedAsCatalogTable() to something that it doesn't know to
> be a plain relation --- it's a view in this case. And that macro
> blindly assumes that relation->rd_options is a StdRdOptions struct,
> when in this case it's actually a ViewOptions struct.
>
> Now that I've seen this I wonder which other uses of rd_options are
> potentially broken. RelationIsUsedAsCatalogTable() is hardly the
> only macro that is assuming with little justification that it's
> applied to the right kind of reloptions.

Right, there are many macros which blindly assume rd_options to be of a specific type. Here is the list of such macros:

RelationGetFillFactor
RelationIsUsedAsCatalogTable
RelationGetParallelWorkers
RelationIsSecurityView
RelationHasCheckOption
RelationHasLocalCheckOption
RelationHasCascadedCheckOption
BrinGetPagesPerRange
GinGetUseFastUpdate
GinGetPendingListCleanupSize

For the macros associated with a specific index type, it might be all right to assume the type of rd_options, because those have very limited usage. However, for the rest of the macros, the code does not limit their usage. This can lead to problems as you described above.

> We could band-aid this by having the RelationIsUsedAsCatalogTable()
> macro check relation->relkind, but I'm inclined to think that the
> right fix going forward is to insist that StdRdOptions, ViewOptions,
> and the other in-memory representations of reloptions ought to be
> self-identifying somehow. We could add a field to them similar to
> nodeTag, but one of the things that was envisioned to begin with
> is relation types that have StdRdOptions as the first part of a
> larger struct. I'm not sure how to make that work with a tag.
>
> Thoughts?

Yes, it seems appropriate to tag all types of rd_options with an identification and check for that identification within each macro.
Currently, these are the types of rd_options, as far as I could find:

GiSTOptions
BrinOptions
GinOptions
StdRdOptions
ViewOptions
BloomOptions (from contrib)

I am not clear on how an identification field in the above structures would be a problem for relations that have StdRdOptions as the first part of a larger structure.

Regards,
Neha
[HACKERS] WIP: [[Parallel] Shared] Hash
Hi hackers,

In PostgreSQL 9.6, hash joins can be parallelised under certain conditions, but a copy of the hash table is built in every participating backend. That means that memory and CPU time are wasted. In many cases, that's OK: if the hash table contents are small and cheap to compute, then we don't really care; we're just happy that the probing can be done in parallel. But in cases where the hash table is large and/or expensive to build, we could do much better. I am working on that problem.

To recap the situation in 9.6, a hash join can appear below a Gather node and it looks much the same as a non-parallel hash join, except that it has a partial outer plan:

  Hash Join
    -> [partial outer plan]
    -> Hash
         -> [inner plan]

A partial plan is one that has some kind of 'scatter' operation as its ultimate source of tuples. Currently the only kind of scatter operation is a Parallel Seq Scan (but see also the Parallel Index Scan and Parallel Bitmap Scan proposals). The scatter operation enables parallelism in all the executor nodes above it, as far as the enclosing 'gather' operation, which must appear somewhere above it. Currently the only kind of gather operation is a Gather node (but see also the Gather Merge proposal, which adds a new one).

The inner plan is built from a non-partial parallel-safe path and will be run in every worker. Note that a Hash Join node in 9.6 isn't parallel-aware itself: it's not doing anything special at execution time to support parallelism. The planner has determined that correct partial results will be produced by this plan, but the executor nodes are blissfully unaware of parallelism.

PROPOSED NEW PLAN VARIANTS

Shortly I will post a patch which introduces two new hash join plan variants that are parallel-aware:

1. Parallel Hash Join with Shared Hash

  Parallel Hash Join
    -> [partial outer plan]
    -> Shared Hash
         -> [inner plan]

In this case, there is only one copy of the hash table and only one participant loads it.
The other participants wait patiently for one chosen backend to finish building the hash table, and then they all wake up and probe. Call the number of participants P, being the number of workers + 1 (for the leader). Compared to a non-shared hash plan, we avoid wasting CPU and IO resources running P copies of the inner plan in parallel (something that is not well captured in our costing model for parallel query today), and we can allow ourselves to use a hash table P times larger while sticking to the same overall space target of work_mem * P.

2. Parallel Hash Join with Parallel Shared Hash

  Parallel Hash Join
    -> [partial outer plan]
    -> Parallel Shared Hash
         -> [partial inner plan]

In this case, the inner plan is run in parallel by all participants. We have the advantages of a shared hash table as described above, and now we can also divide the work of running the inner plan and hashing the resulting tuples by P participants. Note that Parallel Shared Hash is acting as a special kind of gather operation that is the counterpart to the scatter operation contained in the inner plan.

PERFORMANCE

So far I have been unable to measure any performance degradation compared with unpatched master for hash joins with non-shared hash. That's good, because it means that I didn't slow existing plans down when I introduced a bunch of conditional branches into the existing hash join code.

Laptop testing shows greater than 2x speedups on several of the TPC-H queries with single batches, and no slowdowns. I will post test numbers on big rig hardware in the coming weeks when I have the batching code in more complete and stable shape.

IMPLEMENTATION

I have taken the approach of extending the existing hash join algorithm, rather than introducing separate hash join executor nodes or a fundamentally different algorithm. Here's a short description of what the patch does:

1.
SHARED HASH TABLE

To share data between participants, the patch uses two other patches I have proposed: DSA areas[1], which provide a higher-level interface to DSM segments to make programming with processes a little more like programming with threads, and in particular a per-parallel-query DSA area[2] that is made available for any executor node that needs some shared work space.

The patch uses atomic operations to push tuples into the hash table buckets while building, rehashing and loading, and then the hash table is immutable during probing (except for match flags used to implement outer joins). The existing memory chunk design is retained for dense allocation of tuples, which provides a convenient way to rehash the table when its size changes.

2. WORK COORDINATION

To coordinate parallel work, this patch uses two other patches: barriers[3], to implement a 'barrier' or 'phaser' synchronisation primitive, and those in turn use the condition variables proposed by Robert Haas. Barriers provide a way for participants to break work up into
Re: [HACKERS] WAL consistency check facility
On Mon, Oct 31, 2016 at 9:31 PM, Robert Haas wrote:
> On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier wrote:
>> - Instead of palloc'ing the old and new pages to compare, it would be
>> more performant to keep around two static buffers worth of BLCKSZ and
>> just use that. This way there is no need as well to perform any palloc
>> calls in the masking functions, limiting the risk of errors (those
>> code paths had better avoid errors IMO). It would be also less costly
>> to just pass to the masking function a pointer to a buffer of size
>> BLCKSZ and just do the masking on it.
>
> We always palloc buffers like this so that they will be aligned. But
> we could arrange not to repeat the palloc every time (see, e.g.,
> BootstrapXLOG()).

Yeah, we could go with that, and there is clearly no reason not to do so.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal for changes to recovery.conf API
At 2016-09-28 13:13:56 -0400, robertmh...@gmail.com wrote:
>
> I hope that the fact that there's been no discussion for the last
> three weeks doesn't mean this effort is dead; I would like very
> much to see it move forward.

Here's an updated patch. Sorry, I got busy elsewhere.

I struggled with the handling of recovery_target a little. For example, one suggested alternative was:

recovery_target_type = xid
recovery_target_value = …

The problem with implementing it this way is that the _value setting cannot be parsed without already having parsed the _type, and I didn't want to force that sort of dependency. What I've done instead is to make this work:

recovery_target = xid|time|name|lsn|immediate
recovery_target_xid = …
recovery_target_time = …
recovery_target_name = …
recovery_target_lsn = …

The recovery_target_xxx values are parsed as they used to be, but the one that's used is the one that's set in recovery_target. That's easy to explain, and the patch is much less intrusive, but I'm certainly open to suggestions to improve this, and I have the time to work on this patch with a view towards getting it committed in this cycle.

-- Abhijit

P.S. Sorry, I haven't been able to resolve the conflicts between Simon's earlier recovery_startup_r10_api.v1b patch and the "pg_ctl promote -w" changes in master. I was distracted by some illness in the family, but I will post another update very soon.

>From 231ccefbc99c07f00560f4e88a961db186db704e Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen
Date: Mon, 31 Oct 2016 11:32:51 +0530
Subject: Convert recovery.conf settings to GUCs

Based on unite_recoveryconf_postgresqlconf_v3.patch by Fujii Masao.
--- contrib/pg_standby/pg_standby.c | 2 +- doc/src/sgml/backup.sgml| 31 +- doc/src/sgml/config.sgml| 480 ++ doc/src/sgml/filelist.sgml | 1 - doc/src/sgml/func.sgml | 2 +- doc/src/sgml/high-availability.sgml | 42 +- doc/src/sgml/pgstandby.sgml | 4 +- doc/src/sgml/postgres.sgml | 1 - doc/src/sgml/recovery-config.sgml | 508 --- doc/src/sgml/ref/pgarchivecleanup.sgml | 6 +- doc/src/sgml/release-9.1.sgml | 5 +- doc/src/sgml/release.sgml | 2 +- src/backend/access/transam/recovery.conf.sample | 29 +- src/backend/access/transam/xlog.c | 520 +++- src/backend/access/transam/xlogarchive.c| 8 +- src/backend/postmaster/startup.c| 2 + src/backend/replication/walreceiver.c | 24 ++ src/backend/utils/misc/guc-file.l | 18 + src/backend/utils/misc/guc.c| 430 src/backend/utils/misc/postgresql.conf.sample | 24 ++ src/bin/pg_archivecleanup/pg_archivecleanup.c | 2 +- src/include/access/xlog.h | 21 + src/include/access/xlog_internal.h | 2 + src/include/utils/guc_tables.h | 2 + 24 files changed, 1202 insertions(+), 964 deletions(-) delete mode 100644 doc/src/sgml/recovery-config.sgml diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c index e4136f9..8fcb85c 100644 --- a/contrib/pg_standby/pg_standby.c +++ b/contrib/pg_standby/pg_standby.c @@ -520,7 +520,7 @@ usage(void) printf(" -w MAXWAITTIME max seconds to wait for a file (0=no limit) (default=0)\n"); printf(" -?, --help show this help, then exit\n"); printf("\n" - "Main intended use as restore_command in recovery.conf:\n" + "Main intended use as restore_command in postgresql.conf:\n" " restore_command = 'pg_standby [OPTION]... 
ARCHIVELOCATION %%f %%p %%r'\n" "e.g.\n" " restore_command = 'pg_standby /mnt/server/archiverdir %%f %%p %%r'\n"); diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index 6eaed1e..2df0dc6 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -1190,8 +1190,15 @@ SELECT pg_stop_backup(); - Create a recovery command file recovery.conf in the cluster - data directory (see ). You might + Set up recovery parameters in postgresql.conf (see + and + ). + + + + + Create a recovery trigger file recovery.trigger + in the cluster data directory. You might also want to temporarily modify pg_hba.conf to prevent ordinary users from connecting until you are sure the recovery was successful. @@ -1203,7 +1210,7 @@ SELECT pg_stop_backup(); recovery be terminated because of an external error, the server can simply be restarted and it will continue recovery. Upon completion of the recovery process, the server will rename - recovery.conf to recovery.done (to prevent + recovery.trigger to
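To make the scheme from the mail above concrete: under the proposed API, every recovery_target_* value is parsed independently, and recovery_target alone selects which one takes effect. A hypothetical postgresql.conf fragment (the xid and timestamp values are invented for illustration):

```
# recovery_target picks which of the recovery_target_* settings is used
recovery_target      = 'xid'
recovery_target_xid  = '1234567'                     # hypothetical value; used
recovery_target_time = '2016-10-31 11:32:51+05:30'   # parsed, but ignored here
```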
[HACKERS] auto_explain vs. parallel query
Hi,

While debugging something on 9.6, I've noticed that auto_explain handles parallel queries in a slightly strange way - both the leader and all the workers log their chunk of the query (i.e. the leader logs the explain for the whole query, while the workers only log the parts they've been executing). So for example for a query with 2 workers, the log looks like this:

2016-10-31 23:10:23.481 CET [12278] LOG: duration: 32.562 ms pl ...
  Query Text: ...
  Parallel Seq Scan on tables (cost=0.00..15158.25 rows=220 wi ...
    Filter: ((table_datoid < '10'::oid) AND (table_nspo ...
    Rows Removed by Filter: 140614 ...
2016-10-31 23:10:23.481 CET [12277] LOG: duration: 32.325 ms pl ...
  Query Text: ...
  Parallel Seq Scan on tables (cost=0.00..15158.25 rows=220 wi ...
    Filter: ((table_datoid < '10'::oid) AND (table_nspo ...
    Rows Removed by Filter: 80948 ...
2016-10-31 23:10:23.483 CET [12259] LOG: duration: 38.997 ms pl ...
  Query Text: explain analyze select * from tables where table_ ...
  Gather (cost=1000.00..16211.15 rows=529 width=356) (actual t ...
    Workers Planned: 2 ...
    Workers Launched: 2 ...
    -> Parallel Seq Scan on tables (cost=0.00..15158.25 rows= ...
         Filter: ((table_datoid < '10'::oid) AND (tabl ...
         Rows Removed by Filter: 105570 ...

I'd say that's fairly surprising, and I haven't found any discussion about auto_explain vs. parallel queries in the archives (within the last year), so I assume this may not be entirely intentional. Not only does this not match the output when running EXPLAIN ANALYZE manually, it also provides no information about which messages from workers belong to which "leader" message.

Another thing is that this interacts with sampling in a rather unfortunate way, because each process evaluates the sampling condition independently. So for example with auto_explain.sample_rate=0.5, a random subset of worker/leader explain plans will be logged.
But this also applies to the conditions in ExecutorStart, which enables instrumentation only when the query gets sampled - so when the leader gets sampled but not all the workers, the counters in the query logged by the leader are incomplete.

I do believe those are bugs that make auto_explain rather unusable with parallel queries. Adding IsParallelWorker() to the condition in ExecutorEnd should fix the extra logging messages (and log only from the leader). Not sure how to fix the sampling - simply adding IsParallelWorker() to ExecutorStart won't do the trick, because we don't want to enable instrumentation only for sampled queries. So I guess this needs to be decided in the leader, and communicated to the workers somehow.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Improvements in psql hooks for variables
Hello,

I have applied this patch on latest HEAD and have done basic testing, which works fine. Some comments:

> if (current->assign_hook)
>- (*current->assign_hook) (current->value);
>- return true;
>+ {
>+ confirmed = (*current->assign_hook) (value);
>+ }
>+ if (confirmed)

Spurious brackets

>static bool
>+generic_boolean_hook(const char *newval, const char* varname, bool *flag)
>+{

Contrary to what the name suggests, this function does not seem to have other implementations, as in a hook. Also, this takes care of rejecting a syntactically wrong value only for boolean variable hooks like autocommit_hook and on_error_stop_hook. However, there are other variable hooks which call ParseVariableBool. For instance, echo_hidden_hook, which is handled separately in the patch. Thus there is some duplication of code between generic_boolean_hook and echo_hidden_hook. Similarly for on_error_rollback_hook.

>-static void
>+static bool
> fetch_count_hook(const char *newval)
> {
>    pset.fetch_count = ParseVariableNum(newval, -1, -1, false);
>+   return true;
> }

Shouldn't invalid numeric string assignment for numeric variables be handled too?

Instead of generic_boolean_hook, can't we have something like the following, which, like generic_boolean_hook, can be called from specific variable assignment hooks:

static bool
ParseVariable(newval, VariableName, )
{
    if (VariableName == 'AUTOCOMMIT' || ECHO_HIDDEN || other variables with hooks which call ParseVariableBool)
        ...
    else if (VariableName == 'FETCH_COUNT')
        ParseVariableNum();
}

This will help merge the logic, which is to check for valid syntax before assigning and to return false on error, in one place.

>@@ -260,7 +276,7 @@ SetVariableAssignHook(VariableSpace space, const char *name, VariableAssignHook
>    current->assign_hook = hook;
>    current->next = NULL;
>    previous->next = current;
>-   (*hook) (NULL);
>+   (void)(*hook) (NULL); /* ignore return value */

Sorry for my lack of understanding, can you explain why the above change is needed?
Thank you,
Rahila Syed

On Tue, Sep 20, 2016 at 11:16 PM, Daniel Verite wrote:
> Ashutosh Bapat wrote:
>
> > You may want to add this to the November commitfest
> > https://commitfest.postgresql.org/11/.
>
> Done. It's at https://commitfest.postgresql.org/11/799/
>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
Re: [HACKERS] WAL consistency check facility
On Mon, Oct 31, 2016 at 9:31 PM, Robert Haas wrote:
> On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier wrote:
>> And here we go. Here is a review as well as a large brush-up for this
>> patch. A couple of things:
>> - wal_consistency is using a list of RMGRs, at the cost of being
>> PGC_POSTMASTER. I'd suggest making it PGC_SUSER, and use a boolean (I
>> have been thinking hard about that, and still I don't see the point).
>> It is rather easy to for example default it to false, and enable it to
>> true to check if a certain code path is correctly exercised or not for
>> WAL consistency. Note that this simplification reduces the patch size
>> by 100~150 lines. I know, I know, I'd expect some complaints about
>> that
>
> I don't understand how you can fail to see the point of that. As you
> yourself said, this facility generates a ton of WAL. If you're
> focusing on one AM, why would you want to be forced to incur the
> overhead for every other AM? A good deal has been written about this
> upthread already, and just saying "I don't see the point" seems to be
> ignoring the explanations already given.

Hehe, I was expecting you to jump on those lines. While looking at the patch I simplified it first to focus on the core engine of the thing. Adding back this code sounds fine to me as there is a wall of contestation. I offer to do it myself if the effort is the problem.
--
Michael
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On 10/31/2016 02:24 PM, Tomas Vondra wrote:
On 10/31/2016 05:01 AM, Jim Nasby wrote:
On 10/30/16 1:32 PM, Tomas Vondra wrote:

Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's some sort of CPU / OS scheduling artifact. For example, the system has 36 physical cores, 72 virtual ones (thanks to HT). I find it strange that the "good" client counts are always multiples of 72, while the "bad" ones fall in between.

72 = 72 * 1 (good)
108 = 72 * 1.5 (bad)
144 = 72 * 2 (good)
180 = 72 * 2.5 (bad)
216 = 72 * 3 (good)
252 = 72 * 3.5 (bad)
288 = 72 * 4 (good)

So maybe this has something to do with how the OS schedules the tasks, or maybe some internal heuristics in the CPU, or something like that.

It might be enlightening to run a series of tests that are 72*.1 or *.2 apart (say, 72, 79, 86, ..., 137, 144).

Yeah, I've started a benchmark with a step of 6 clients

36 42 48 54 60 66 72 78 ... 252 258 264 270 276 282 288

instead of just

36 72 108 144 180 216 252 288

which did a test every 36 clients. To compensate for the 6x longer runs, I'm only running tests for "group-update" and "master", so I should have the results in ~36h.

So I've been curious and looked at results of the runs executed so far, and for the group_update patch it looks like this:

 clients     tps
 -------  ------
      36  117663
      42  139791
      48  129331
      54  144970
      60  124174
      66  137227
      72  146064
      78  100267
      84  141538
      90   96607
      96  139290
     102   93976
     108  136421
     114   91848
     120  133563
     126   89801
     132  132607
     138   87912
     144  129688
     150   87221
     156  129608
     162   85403
     168  130193
     174   83863
     180  129337
     186   81968
     192  128571
     198   82053
     204  128020
     210   80768
     216  124153
     222   80493
     228  125503
     234   78950
     240  125670
     246   78418
     252  123532
     258   77623
     264  124366
     270   76726
     276  119054
     282   76960
     288  121819

So, similar saw-like behavior, perfectly periodic. But the really strange thing is the peaks/valleys don't match those observed before! That is, during the previous runs, 72, 144, 216 and 288 were "good" while 108, 180 and 252 were "bad".
But in those runs, all those client counts are "good" ...

Honestly, I have no idea what to think about this ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On 10/31/2016 08:43 PM, Amit Kapila wrote:
On Mon, Oct 31, 2016 at 7:58 PM, Tomas Vondra wrote:
On 10/31/2016 02:51 PM, Amit Kapila wrote:

And moreover, this setup (single device for the whole cluster) is very common, we can't just neglect it. But my main point here really is that the trade-off in those cases may not be really all that great, because you get the best performance at 36/72 clients, and then the tps drops and variability increases. At least not right now, before tackling contention on the WAL lock (or whatever lock becomes the bottleneck).

Okay, but do the wait event results show an increase in contention on some other locks for pgbench-3000-logged-sync-skip-64? Can you share wait events for the runs where there is a fluctuation?

Sure, I do have wait event stats, including a summary for different client counts - see this:

http://tvondra.bitbucket.org/by-test/pgbench-3000-logged-sync-skip-64.txt

Looking only at the group_update patch for three interesting client counts, it looks like this:

 wait_event_type | wait_event        |    108 |    144 |     180
-----------------+-------------------+--------+--------+---------
 LWLockNamed     | WALWriteLock      | 661284 | 847057 | 1006061
                 |                   | 126654 | 191506 |  265386
 Client          | ClientRead        |  37273 |  52791 |   64799
 LWLockTranche   | wal_insert        |  28394 |  51893 |   79932
 LWLockNamed     | CLogControlLock   |   7766 |  14913 |   23138
 LWLockNamed     | WALBufMappingLock |   3615 |   3739 |    3803
 LWLockNamed     | ProcArrayLock     |    913 |   1776 |    2685
 Lock            | extend            |    909 |   2082 |    2228
 LWLockNamed     | XidGenLock        |    301 |    349 |     675
 LWLockTranche   | clog              |    173 |    331 |     607
 LWLockTranche   | buffer_content    |    163 |    468 |     737
 LWLockTranche   | lock_manager      |     88 |    140 |     145

Compared to master, this shows a significant reduction of contention on CLogControlLock (which on master has 20k, 83k and 200k samples), moving the contention to WALWriteLock. But perhaps you're asking about variability during the benchmark? I suppose that could be extracted from the collected data, but I haven't done that.
regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] sequential scan result order vs performance
On Sun, Oct 30, 2016 at 11:37 PM, Jim Nasby wrote:
> BTW, I've sometimes wished for a mode where queries would silently have
> result ordering intentionally futzed, to eliminate any possibility of
> dependence on tuple ordering (as well as having sequences start at some
> random value). I guess with the hooks that are in place today it wouldn't
> be hard to stick an ORDER BY random() in if there wasn't already a Sort node
> at the top level...

+1

In Oracle, we sorta had that feature by adding a parallel hint to a query even if it didn't need it. It came in handy.
Re: [HACKERS] COPY as a set returning function
Attached is a patch that implements copy_srf(). The function signature reflects cstate more than it reflects the COPY options (filename+is_program instead of FILENAME or PROGRAM, etc.):

CREATE OR REPLACE FUNCTION copy_srf(
    filename    text DEFAULT null,
    is_program  boolean DEFAULT false,
    format      text DEFAULT 'text', /* accepts text, csv, binary */
    delimiter   text DEFAULT null,
    null_string text DEFAULT E'\\N',
    header      boolean DEFAULT false,
    quote       text DEFAULT null,
    escape      text DEFAULT null,
    encoding    text DEFAULT null)
RETURNS SETOF RECORD

The major utility for this (at least for me) will be in ETLs that currently make a lot of use of temp tables, and wish to either reduce I/O or avoid pg_attribute bloat. I have not yet implemented STDIN mode, but it's a start.

$ psql test
psql (10devel)
Type "help" for help.

# select * from copy_srf('echo 1,2; echo 4,5', true, 'csv') as t(x text, y text);
 x | y
---+---
 1 | 2
 4 | 5
(2 rows)

Time: 1.456 ms

# select * from copy_srf('/tmp/small_file.txt'::text, false, 'text') as t(x text, y text);
  x  |  y
-----+-----
 1   | 2
 a   | b
 cde | fgh
(3 rows)

On Mon, Oct 17, 2016 at 2:33 PM, Merlin Moncure wrote:
> On Fri, Sep 30, 2016 at 9:56 PM, Tom Lane wrote:
> > Craig Ringer writes:
> >> On 1 Oct. 2016 05:20, "Tom Lane" wrote:
> >>> I think the last of those suggestions has come up before. It has the
> >>> large advantage that you don't have to remember a different syntax for
> >>> copy-as-a-function.
> >
> >> That sounds fantastic. It'd help this copy variant retain feature parity
> >> with normal copy. And it'd bring us closer to being able to FETCH in
> >> non-queries.
> >
> > On second thought, though, this couldn't exactly duplicate the existing
> > COPY syntax, because COPY relies heavily on the rowtype of the named
> > target table to tell it what it's copying. You'd need some new syntax
> > to provide the list of column names and types, which puts a bit of
> > a hole in the "syntax we already know" argument.
A SRF-returning-record > > would have a leg up on that, because we do have existing syntax for > > defining the concrete rowtype that any particular call returns. > > One big disadvantage of SRF-returning-record syntax is that functions > are basically unwrappable with generic wrappers sans major gymnastics > such as dynamically generating the query and executing it. This is a > major disadvantage relative to the null::type hack we use in the > populate_record style functions and perhaps ought to make this > (SRF-returning-record syntax) style of use discouraged for useful > library functions. If there were a way to handle wrapping I'd > withdraw this minor objection -- this has come up in dblink > discussions a few times). > > merlin > diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index ada2142..0876ee1 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1006,6 +1006,21 @@ LANGUAGE INTERNAL STRICT IMMUTABLE PARALLEL SAFE AS 'jsonb_insert'; +CREATE OR REPLACE FUNCTION copy_srf( + IN filename text DEFAULT null, + IN is_program boolean DEFAULT false, + IN format text DEFAULT 'text', + IN delimiter text DEFAULT null, + IN null_string text DEFAULT E'\\N', + IN header boolean DEFAULT false, + IN quote text DEFAULT null, + IN escape text DEFAULT null, + IN encoding text DEFAULT null) +RETURNS SETOF RECORD +LANGUAGE INTERNAL +VOLATILE ROWS 1000 COST 1000 CALLED ON NULL INPUT +AS 'copy_srf'; + -- The default permissions for functions mean that anyone can execute them. 
-- A number of functions shouldn't be executable by just anyone, but rather -- than use explicit 'superuser()' checks in those functions, we use the GRANT diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index b4140eb..90ed2c5 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -30,6 +30,7 @@ #include "commands/defrem.h" #include "commands/trigger.h" #include "executor/executor.h" +#include "funcapi.h" #include "libpq/libpq.h" #include "libpq/pqformat.h" #include "mb/pg_wchar.h" @@ -555,7 +556,6 @@ CopyGetData(CopyState cstate, void *databuf, int minread, int maxread) errmsg("could not read from COPY file: %m"))); break; case COPY_OLD_FE: - /* * We cannot read more than minread bytes (which in practice is 1) * because old protocol doesn't have any clear way of separating @@ -4555,3 +4555,377 @@ CreateCopyDestReceiver(void) return (DestReceiver *) self; } + +Datum +copy_srf(PG_FUNCTION_ARGS) +{ + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; +
[HACKERS] Increase pltcl test coverage
This patch increases test coverage for pltcl, from 70% to 83%. Aside from that, the work on this uncovered 2 new bugs (the trigger return one I just submitted, as well as a bug in the SRF/composite patch). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461 diff --git a/src/pl/tcl/expected/pltcl_queries.out b/src/pl/tcl/expected/pltcl_queries.out index 6cb1fdb..9e3a0dc 100644 --- a/src/pl/tcl/expected/pltcl_queries.out +++ b/src/pl/tcl/expected/pltcl_queries.out @@ -1,3 +1,7 @@ +BEGIN; +SET LOCAL client_min_messages = WARNING; +CREATE EXTENSION IF NOT EXISTS plpgsql; +COMMIT; -- suppress CONTEXT so that function OIDs aren't in output \set VERBOSITY terse insert into T_pkey1 values (1, 'key1-1', 'test key'); @@ -185,12 +189,23 @@ select * from T_pkey2 order by key1 using @<, key2 collate "C"; -- show dump of trigger data insert into trigger_test values(1,'insert'); -NOTICE: NEW: {i: 1, v: insert} +NOTICE: NEW: {} +NOTICE: OLD: {} +NOTICE: TG_level: STATEMENT +NOTICE: TG_name: statement_trigger +NOTICE: TG_op: INSERT +NOTICE: TG_relatts: {{} i v {} test_skip test_return_null test_argisnull} +NOTICE: TG_relid: bogus:12345 +NOTICE: TG_table_name: trigger_test +NOTICE: TG_table_schema: public +NOTICE: TG_when: BEFORE +NOTICE: args: {42 {statement trigger}} +NOTICE: NEW: {i: 1, test_argisnull: f, test_return_null: f, test_skip: f, v: insert} NOTICE: OLD: {} NOTICE: TG_level: ROW NOTICE: TG_name: show_trigger_data_trig NOTICE: TG_op: INSERT -NOTICE: TG_relatts: {{} i v} +NOTICE: TG_relatts: {{} i v {} test_skip test_return_null test_argisnull} NOTICE: TG_relid: bogus:12345 NOTICE: TG_table_name: trigger_test NOTICE: TG_table_schema: public @@ -232,13 +247,37 @@ NOTICE: TG_table_name: trigger_test_view NOTICE: TG_table_schema: public NOTICE: TG_when: {INSTEAD OF} NOTICE: args: {24 {skidoo view}} +update 
trigger_test set v = 'update', test_skip=true where i = 1; +NOTICE: NEW: {} +NOTICE: OLD: {} +NOTICE: TG_level: STATEMENT +NOTICE: TG_name: statement_trigger +NOTICE: TG_op: UPDATE +NOTICE: TG_relatts: {{} i v {} test_skip test_return_null test_argisnull} +NOTICE: TG_relid: bogus:12345 +NOTICE: TG_table_name: trigger_test +NOTICE: TG_table_schema: public +NOTICE: TG_when: BEFORE +NOTICE: args: {42 {statement trigger}} +NOTICE: SKIPPING OPERATION UPDATE update trigger_test set v = 'update' where i = 1; -NOTICE: NEW: {i: 1, v: update} -NOTICE: OLD: {i: 1, v: insert} +NOTICE: NEW: {} +NOTICE: OLD: {} +NOTICE: TG_level: STATEMENT +NOTICE: TG_name: statement_trigger +NOTICE: TG_op: UPDATE +NOTICE: TG_relatts: {{} i v {} test_skip test_return_null test_argisnull} +NOTICE: TG_relid: bogus:12345 +NOTICE: TG_table_name: trigger_test +NOTICE: TG_table_schema: public +NOTICE: TG_when: BEFORE +NOTICE: args: {42 {statement trigger}} +NOTICE: NEW: {i: 1, test_argisnull: f, test_return_null: f, test_skip: f, v: update} +NOTICE: OLD: {i: 1, test_argisnull: f, test_return_null: f, test_skip: f, v: insert} NOTICE: TG_level: ROW NOTICE: TG_name: show_trigger_data_trig NOTICE: TG_op: UPDATE -NOTICE: TG_relatts: {{} i v} +NOTICE: TG_relatts: {{} i v {} test_skip test_return_null test_argisnull} NOTICE: TG_relid: bogus:12345 NOTICE: TG_table_name: trigger_test NOTICE: TG_table_schema: public @@ -246,16 +285,39 @@ NOTICE: TG_when: BEFORE NOTICE: args: {23 skidoo} delete from trigger_test; NOTICE: NEW: {} -NOTICE: OLD: {i: 1, v: update} +NOTICE: OLD: {} +NOTICE: TG_level: STATEMENT +NOTICE: TG_name: statement_trigger +NOTICE: TG_op: DELETE +NOTICE: TG_relatts: {{} i v {} test_skip test_return_null test_argisnull} +NOTICE: TG_relid: bogus:12345 +NOTICE: TG_table_name: trigger_test +NOTICE: TG_table_schema: public +NOTICE: TG_when: BEFORE +NOTICE: args: {42 {statement trigger}} +NOTICE: NEW: {} +NOTICE: OLD: {i: 1, test_argisnull: f, test_return_null: f, test_skip: f, v: update} NOTICE: 
TG_level: ROW NOTICE: TG_name: show_trigger_data_trig NOTICE: TG_op: DELETE -NOTICE: TG_relatts: {{} i v} +NOTICE: TG_relatts: {{} i v {} test_skip test_return_null test_argisnull} NOTICE: TG_relid: bogus:12345 NOTICE: TG_table_name: trigger_test NOTICE: TG_table_schema: public NOTICE: TG_when: BEFORE NOTICE: args: {23 skidoo} +truncate trigger_test; +NOTICE: NEW: {} +NOTICE: OLD: {} +NOTICE: TG_level: STATEMENT +NOTICE: TG_name: statement_trigger +NOTICE: TG_op: TRUNCATE +NOTICE: TG_relatts: {{} i v {} test_skip test_return_null test_argisnull} +NOTICE: TG_relid: bogus:12345 +NOTICE: TG_table_name: trigger_test +NOTICE: TG_table_schema: public +NOTICE: TG_when: BEFORE +NOTICE: args: {42 {statement trigger}} -- Test composite-type arguments select tcl_composite_arg_ref1(row('tkey', 42, 'ref2')); tcl_composite_arg_ref1 @@ -288,6 +350,20 @@
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Mon, Oct 31, 2016 at 7:58 PM, Tomas Vondra wrote: > On 10/31/2016 02:51 PM, Amit Kapila wrote: > And moreover, this setup (single device for the whole cluster) is very > common, we can't just neglect it. > > But my main point here really is that the trade-off in those cases may not > be really all that great, because you get the best performance at 36/72 > clients, and then the tps drops and variability increases. At least not > right now, before tackling contention on the WAL lock (or whatever lock > becomes the bottleneck). > Okay, but do the wait event results show an increase in contention on some other locks for pgbench-3000-logged-sync-skip-64? Can you share wait events for the runs where there is a fluctuation? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Fix bug in handling of dropped columns in pltcl triggers
While reviewing code coverage in pltcl, I uncovered a bug in trigger function return handling. If you returned the munged name of a dropped column, that would silently be ignored. It would be unusual to hit this, since dropped columns end up with names like "...pg.dropped.2...", but since that's still a legitimate name for a column, silently ignoring it seems rather bogus. Work sponsored by FlightAware. https://github.com/postgres/postgres/compare/master...decibel:tcl_dropped -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461 commit 8f88fda52751f88aca4786f45d3d7f16a5343fc0 Author: Jim Nasby Date: Mon Oct 31 14:11:43 2016 -0500 Fix trigger dropped column handling Previously, if a trigger returned a column that had been dropped, using the munged column name, no error would be thrown. Since dropping a column does forcibly overwrite the column name, it would be very unusual for this to happen in practice, but silently ignoring what would otherwise be a legitimate column name seemed rather bogus.
diff --git a/src/pl/tcl/expected/pltcl_queries.out b/src/pl/tcl/expected/pltcl_queries.out index 6cb1fdb..c3f33d9 100644 --- a/src/pl/tcl/expected/pltcl_queries.out +++ b/src/pl/tcl/expected/pltcl_queries.out @@ -184,6 +184,21 @@ select * from T_pkey2 order by key1 using @<, key2 collate "C"; (4 rows) -- show dump of trigger data +insert into test_return values(-1,'1 element array'); +ERROR: trigger's return list must have even number of elements +insert into test_return values(-2,'return dropped column'); +ERROR: unrecognized attribute "pg.dropped.2" +insert into test_return values(-3,'return ctid column'); +ERROR: cannot set system attribute "ctid" +insert into test_return values(-10,'do nothing'); +insert into test_return values(1,'good'); +select * from test_return; + i | t +---+-- + | + 1 | good +(2 rows) + insert into trigger_test values(1,'insert'); NOTICE: NEW: {i: 1, v: insert} NOTICE: OLD: {} diff --git a/src/pl/tcl/expected/pltcl_setup.out b/src/pl/tcl/expected/pltcl_setup.out index e65e9e3..707d77b 100644 --- a/src/pl/tcl/expected/pltcl_setup.out +++ b/src/pl/tcl/expected/pltcl_setup.out @@ -89,6 +89,33 @@ FOR EACH ROW EXECUTE PROCEDURE trigger_data(23,'skidoo'); CREATE TRIGGER show_trigger_data_view_trig INSTEAD OF INSERT OR UPDATE OR DELETE ON trigger_test_view FOR EACH ROW EXECUTE PROCEDURE trigger_data(24,'skidoo view'); +-- test handling of return +CREATE TABLE test_return(i int, dropme int, t text); +ALTER TABLE test_return DROP dropme; +CREATE FUNCTION trigger_return() RETURNS trigger LANGUAGE pltcl AS $body$ +switch $NEW(i) { +-1 { + return {"single element"} + } + -2 { + spi_exec "SELECT attname FROM pg_attribute WHERE attrelid=$TG_relid AND attisdropped" + set a($attname) 1 + return [array get a] + } + -3 { + return {ctid 1} + } + -10 { + # Will return nothing + } + default { + return OK + } + } +$body$; +CREATE TRIGGER test_trigger_return +BEFORE INSERT ON test_return +FOR EACH ROW EXECUTE PROCEDURE trigger_return(); -- -- Trigger function on 
every change to T_pkey1
--
diff --git a/src/pl/tcl/pltcl.c b/src/pl/tcl/pltcl.c
index d236890..5d9a857 100644
--- a/src/pl/tcl/pltcl.c
+++ b/src/pl/tcl/pltcl.c
@@ -1128,7 +1128,9 @@ pltcl_trigger_handler(PG_FUNCTION_ARGS, bool pltrusted)
 		 * Get the attribute number
 		 */
 		attnum = SPI_fnumber(tupdesc, ret_name);
-		if (attnum == SPI_ERROR_NOATTRIBUTE)
+		/* Assume system attributes can't be marked as dropped */
+		if (attnum == SPI_ERROR_NOATTRIBUTE ||
+			(attnum > 0 && tupdesc->attrs[attnum - 1]->attisdropped))
 			ereport(ERROR,
 					(errcode(ERRCODE_UNDEFINED_COLUMN),
 					 errmsg("unrecognized attribute \"%s\"",
@@ -1140,12 +1142,6 @@ pltcl_trigger_handler(PG_FUNCTION_ARGS, bool pltrusted)
 					 ret_name)));
-		/*
-		 * Ignore dropped columns
-		 */
-		if
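The rule the patch adds can be sketched outside of C. The following Python sketch is purely illustrative (the function and data-structure names are invented; the real code uses SPI_fnumber() and tupdesc->attrs[...]->attisdropped): a name returned from the trigger must match a live user column, while dropped columns and system attributes raise errors instead of being silently ignored.

```python
# Hypothetical sketch of the lookup rule added to pltcl_trigger_handler.
# tupdesc is modeled as a list of (attname, attisdropped) pairs for user
# attributes numbered 1..n.

SYSTEM_ATTRS = {"ctid", "xmin", "xmax", "cmin", "cmax", "tableoid"}

def resolve_return_attr(tupdesc, name):
    """Return the 1-based attribute number for `name`, or raise."""
    if name in SYSTEM_ATTRS:
        raise ValueError('cannot set system attribute "%s"' % name)
    for attnum, (attname, attisdropped) in enumerate(tupdesc, start=1):
        if attname == name:
            # A dropped column keeps a munged name like "pg.dropped.2",
            # which is still a legitimate column name -- so it must be
            # rejected explicitly rather than silently ignored.
            if attisdropped:
                raise ValueError('unrecognized attribute "%s"' % name)
            return attnum
    raise ValueError('unrecognized attribute "%s"' % name)
```

The three error branches correspond to the three new test cases in pltcl_queries.out above (odd-length list aside): dropped column, system attribute, and unknown name.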
Re: [HACKERS] Unsafe use of relation->rd_options without checking its type
On Mon, Oct 31, 2016 at 11:57 AM, Tom Lane wrote: > Now that I've seen this I wonder which other uses of rd_options are > potentially broken. RelationIsUsedAsCatalogTable() is hardly the > only macro that is assuming with little justification that it's > applied to the right kind of reloptions. It seems worth adding an assertion, at least. I wonder what running the regression tests with a bunch of similar assertions would turn up... -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Unsafe use of relation->rd_options without checking its type
I looked into bug #14397. The submitter was discourteous enough not to provide an in-line example, but here it is:

CREATE TABLE IF NOT EXISTS TEST_1 (
  ID SERIAL PRIMARY KEY,
  C1 BYTEA,
  C2 TEXT NOT NULL,
  C3 BOOLEAN NOT NULL DEFAULT FALSE,
  CONSTRAINT TEST_1_unique_idx UNIQUE(C1, C2)
);

create or replace view test as select * from test_1 with cascaded check option;

insert into test (c1, c2, c3) values (decode('MTIzAAE=', 'base64'), 'text', true) on conflict (c1, c2) do update set c3=excluded.c3;
ERROR: ON CONFLICT is not supported on table "test" used as a catalog table
LINE 1: ...lues (decode('MTIzAAE=', 'base64'), 'text', true) on conflic...

The reason for the error is that transformOnConflictArbiter applies RelationIsUsedAsCatalogTable() to something that it doesn't know to be a plain relation --- it's a view in this case. And that macro blindly assumes that relation->rd_options is a StdRdOptions struct, when in this case it's actually a ViewOptions struct. Now that I've seen this I wonder which other uses of rd_options are potentially broken. RelationIsUsedAsCatalogTable() is hardly the only macro that is assuming with little justification that it's applied to the right kind of reloptions. We could band-aid this by having the RelationIsUsedAsCatalogTable() macro check relation->relkind, but I'm inclined to think that the right fix going forward is to insist that StdRdOptions, ViewOptions, and the other in-memory representations of reloptions ought to be self-identifying somehow. We could add a field to them similar to nodeTag, but one of the things that was envisioned to begin with is relation types that have StdRdOptions as the first part of a larger struct. I'm not sure how to make that work with a tag. Thoughts? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
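The "self-identifying reloptions" idea can be illustrated with a small sketch. This is not PostgreSQL code -- the dict-based structs and all names here are invented for illustration -- but it shows how a kind tag checked in the accessor turns a wrong-struct access (applying a heap-only macro to a view's options) into a loud failure instead of a silent garbage read:

```python
# Hypothetical tags playing the role of a nodeTag-like discriminator.
RELOPT_KIND_HEAP = 1
RELOPT_KIND_VIEW = 2

def make_std_rd_options(user_catalog_table=False):
    # Stand-in for a parsed StdRdOptions struct.
    return {"kind": RELOPT_KIND_HEAP, "user_catalog_table": user_catalog_table}

def make_view_options(check_option=None):
    # Stand-in for a parsed ViewOptions struct.
    return {"kind": RELOPT_KIND_VIEW, "check_option": check_option}

def relation_is_used_as_catalog_table(rd_options):
    """Tag-checked analogue of RelationIsUsedAsCatalogTable()."""
    if rd_options is None:          # no reloptions set at all
        return False
    # The whole point of the proposal: verify the tag before interpreting
    # the bytes, instead of blindly assuming StdRdOptions.
    assert rd_options["kind"] == RELOPT_KIND_HEAP, "rd_options is not StdRdOptions"
    return rd_options["user_catalog_table"]
```

Under this scheme the bug #14397 scenario (view options passed where heap options are expected) fails the tag check immediately rather than misreading a ViewOptions field.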
Re: [HACKERS] Query regarding selectDumpableExtension()
amul sul writes: > Thanks for your explanation, I agree that this is not a single scenario > where we need special care, but still my question stands there, why do we > really need to dump built-in extension? It's not built-in. It's installed by default, yes, but it's also droppable. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Query regarding selectDumpableExtension()
On 31 Oct 2016 6:48 pm, "Tom Lane" wrote: > > amul sul writes: > > On Fri, Oct 28, 2016 at 6:22 PM, Robert Haas wrote: > >> There's a comment in dumpExtension() that explains it. > > > Let me explain the case I'm trying to tackle. I have two old dumps, > > each of which has a couple of objects that depend on plpgsql. I have > > restored the first dump and am trying to restore the second dump using > > 'pg_restore -c'; it is failing with the following error: > > ERROR: cannot drop extension plpgsql because other objects depend on it > > This is hardly specific to extensions. If you try a restore with -c into > a database that has other random objects besides what's in the dump, you > could get errors from > * dropping tables that are referenced by foreign keys from tables not > known in the dump > * dropping functions that are used in views not known in the dump > * dropping operators or opclasses used by indexes not known in the dump > etc etc. > > > Works well without the '-c' option, but that was not a general solution, IMHO. > > The general solution is either don't restore into a database containing > unrelated objects, or be prepared to ignore errors from the DROP commands. > The extension case actually works more smoothly than most of the others. > Thanks for your explanation, I agree that this is not the only scenario where we need special care, but my question still stands: why do we really need to dump a built-in extension? Of course you could ask, why not? And my answer will be the same, "to placate pg_restore at least in the case I've explained before" :) Regards, Amul Sent from a mobile device. Please excuse brevity and tpyos.
Re: [HACKERS] pg_sequence catalog
On 9/10/16 7:17 AM, Peter Eisentraut wrote: > Let's start with that. Here is a patch that adds a pg_sequences view in > the style of pg_tables, pg_indexes, etc. This seems useful independent > of anything else, but would give us more freedom to change things around > behind the scenes. I have registered the pg_sequences view patch separately in the upcoming commit fest: https://commitfest.postgresql.org/11/865/ (The posted patch still applies, so no new patch.) Attached is an updated patch for the pg_sequence catalog. I haven't received an actual technical review in the last CF, so I'll move it forward. However, if the pg_sequences view is accepted, the catalog patch will of course need an update. So if someone is looking here with limited time, review the view patch first. As before, this patch requires the "sequences and pg_upgrade" patch in order for pg_upgrade to work. I think together these three changes will make sequence metadata inspection and modification easier and more robust and will facilitate future changes in this area. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From bb8d2260d27ba8685844d6268550c8b6f11f195a Mon Sep 17 00:00:00 2001 From: Peter EisentrautDate: Mon, 31 Oct 2016 12:00:00 -0400 Subject: [PATCH v2] Add pg_sequence system catalog Move sequence metadata (start, increment, etc.) into a proper system catalog instead of storing it in the sequence heap object. This separates the metadata from the sequence data. Sequence metadata is now operated on transactionally by DDL commands, whereas previously rollbacks of sequence-related DDL commands would be ignored. 
--- src/backend/catalog/Makefile | 2 +- src/backend/catalog/dependency.c | 3 + src/backend/catalog/information_schema.sql| 13 +- src/backend/commands/sequence.c | 375 +++--- src/backend/utils/cache/syscache.c| 12 + src/include/catalog/indexing.h| 3 + src/include/catalog/pg_sequence.h | 30 +++ src/include/commands/sequence.h | 29 +- src/include/utils/syscache.h | 1 + src/test/regress/expected/sanity_check.out| 1 + src/test/regress/expected/sequence.out| 33 ++- src/test/regress/expected/updatable_views.out | 93 +++ src/test/regress/sql/sequence.sql | 8 + src/test/regress/sql/updatable_views.sql | 2 +- 14 files changed, 358 insertions(+), 247 deletions(-) create mode 100644 src/include/catalog/pg_sequence.h diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile index 1ce7610..cbf0d79 100644 --- a/src/backend/catalog/Makefile +++ b/src/backend/catalog/Makefile @@ -41,7 +41,7 @@ POSTGRES_BKI_SRCS = $(addprefix $(top_srcdir)/src/include/catalog/,\ pg_foreign_data_wrapper.h pg_foreign_server.h pg_user_mapping.h \ pg_foreign_table.h pg_policy.h pg_replication_origin.h \ pg_default_acl.h pg_init_privs.h pg_seclabel.h pg_shseclabel.h \ - pg_collation.h pg_range.h pg_transform.h \ + pg_collation.h pg_range.h pg_transform.h pg_sequence.h \ toasting.h indexing.h \ ) diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index 04d7840..8e1e1ac 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -66,6 +66,7 @@ #include "commands/proclang.h" #include "commands/schemacmds.h" #include "commands/seclabel.h" +#include "commands/sequence.h" #include "commands/trigger.h" #include "commands/typecmds.h" #include "nodes/nodeFuncs.h" @@ -1155,6 +1156,8 @@ doDeletion(const ObjectAddress *object, int flags) else heap_drop_with_catalog(object->objectId); } +if (relKind == RELKIND_SEQUENCE) + DeleteSequenceTuple(object->objectId); break; } diff --git a/src/backend/catalog/information_schema.sql 
b/src/backend/catalog/information_schema.sql index 00550eb..182d2d0 100644 --- a/src/backend/catalog/information_schema.sql +++ b/src/backend/catalog/information_schema.sql @@ -1535,15 +1535,16 @@ CREATE VIEW sequences AS CAST(64 AS cardinal_number) AS numeric_precision, CAST(2 AS cardinal_number) AS numeric_precision_radix, CAST(0 AS cardinal_number) AS numeric_scale, - CAST(p.start_value AS character_data) AS start_value, - CAST(p.minimum_value AS character_data) AS minimum_value, - CAST(p.maximum_value AS character_data) AS maximum_value, - CAST(p.increment AS character_data) AS increment, - CAST(CASE WHEN p.cycle_option THEN 'YES' ELSE 'NO' END AS yes_or_no) AS cycle_option -FROM pg_namespace nc, pg_class c, LATERAL pg_sequence_parameters(c.oid) p + CAST(s.seqstart AS character_data) AS start_value, + CAST(s.seqmin AS character_data) AS minimum_value, + CAST(s.seqmax AS character_data) AS
Re: [HACKERS] Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)
Andres Freund writes: > On 2016-09-14 19:28:25 -0400, Tom Lane wrote: >> So I think we should continue investigating this way of doing things. >> I'll try to take a look at the executor end of it tomorrow. However >> I'm leaving Friday for a week's vacation, and may not have anything to >> show before that. > Are you planning to work on the execution side of things? I otherwise > can take a stab... My plan is to start on this when I go back into commitfest mode, but right now I'm trying to produce a draft patch for RLS changes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: scan key push down to heap [WIP]
On 2016-10-31 09:28:00 -0400, Robert Haas wrote: > On Fri, Oct 28, 2016 at 2:46 AM, Andres Freund wrote: > > Well, that'll also make the feature not particularly useful :(. My > > suspicion is that the way to succeed here isn't to rely more on testing > > as part of the scan, but create a more general fastpath for qual > > evaluation, which atm is a *LOT* more heavyweight than what > > HeapKeyTest() does. But maybe I'm biased since I'm working on the > > latter... > > I think you might be right, but I'm not very clear on what the > timeline for your work is. Me neither. But I think if we can stomach Dilip's approach of using a slot in heapam, then I think my concerns are addressed, and this is probably going to be a win regardless of faster expression evaluation and/or batching. > It would be easier to say, sure, let's put > this on hold if we knew that in a month or two we could come back and > retest after you've made some progress. But I don't know whether > we're talking about months or years. I sure hope it's months. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Radix tree for character conversion
> On 27 Oct 2016, at 09:23, Kyotaro HORIGUCHI wrote: > > Hello, thank you very much for the work. My work became quite > easier with it. > > At Tue, 25 Oct 2016 12:23:48 +0300, Heikki Linnakangas > wrote in <08e7892a-d55c-eefe-76e6-7910bc8dd...@iki.fi> >> >> [..] >> The perl scripts are still quite messy. For example, I lost the checks >> for duplicate mappings somewhere along the way - that ought to be put >> back. My Perl skills are limited. > > Perl scripts are to be messy, I believe. Anyway the duplicate > check has been built into the sub print_radix_trees. Maybe the > same check is needed by some plain map files but it would be just > duplication for the maps having radix tree. I took a small stab at doing some cleaning of the Perl scripts, mainly around using the more modern (well, modern as in +15 years old) form for open(..), avoiding global filehandles in favor of passing scalar references, and enforcing use strict. Some smaller typos and fixes were also included. It seems my Perl has become a bit rusty so I hope the changes make sense. The produced files are identical with these patches applied; they are merely doing cleaning as opposed to bugfixing. The attached patches are against the 0001-0006 patches from Heikki and you in this series of emails; the separation is intended to make them easier to read. cheers ./daniel 0007-Fix-filehandle-usage.patch Description: Binary data 0008-Make-all-scripts-use-strict-and-rearrange-logic.patch Description: Binary data 0009-Use-my-instead-of-local.patch Description: Binary data 0010-Various-small-style-nits-and-typos.patch Description: Binary data 0011-Fix-hash-lookup.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)
Hi Tom, On 2016-09-14 19:28:25 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2016-09-12 19:35:22 -0400, Tom Lane wrote: > >> Anyway I'll draft a prototype and then we can compare. > > > Ok, cool. > > Here's a draft patch that is just meant to investigate what the planner > changes might look like if we do it in the pipelined-result way. > Accordingly, I didn't touch the executor, but just had it emit regular > Result nodes for SRF-execution steps. However, the SRFs are all > guaranteed to appear at top level of their respective tlists, so that > those Results could be replaced with something that works like > nodeFunctionscan. > So I think we should continue investigating this way of doing things. > I'll try to take a look at the executor end of it tomorrow. However > I'm leaving Friday for a week's vacation, and may not have anything to > show before that. Are you planning to work on the execution side of things? I otherwise can take a stab... Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] emergency outage requiring database restart
27.10.2016, 21:53, Merlin Moncure wrote: As noted earlier, I was not able to reproduce the issue with crashme.sh, which was:

NUM_FORKS=16
do_parallel psql -p 5432 -c "select PushMarketSample('1740')" castaging_test
do_parallel psql -p 5432 -c "select PushMarketSample('4400')" castaging_test
do_parallel psql -p 5432 -c "select PushMarketSample('2160')" castaging_test
do_parallel psql -p 5432 -c "select PushMarketSample('6680')" castaging_test

(do_parallel is a simple wrapper that executes the command in parallel, up to NUM_FORKS at a time). This is on the same server and cluster as above. This kind of suggests that either A) there is some concurrent activity from another process that is tripping the issue or B) there is something particular to the session invoking the function that is participating in the problem. As the application is structured, a single-threaded node.js app is issuing the query that is high traffic and long lived. It's still running, in fact, and I'm kind of tempted to find some downtime to see if I can still reproduce via the UI. Your production system's postgres backends probably have a lot more open files associated with them than the simple test case does. Since Postgres likes to keep files open as long as possible and only closes them when you need to free up fds to open new files, it's possible that your production backends have almost all allowed fds used when you execute your pl/sh function. If that's the case, the sqsh process that's executed may not have enough fds to do what it wanted to do, and because of busted error handling could end up writing to fds that were opened by Postgres and point to $PGDATA files. / Oskari -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
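The fd-exhaustion scenario described above is easy to reproduce in isolation. This Python sketch is illustrative only (the real case involves a forked sqsh process, not Python): it lowers the process's soft fd limit and opens files until the kernel refuses, which is exactly the error a well-behaved child process must handle instead of blindly writing to whatever descriptors it inherited.

```python
import errno
import os
import resource

def open_until_exhausted(path="/dev/null", cap=256):
    """Lower the soft fd limit toward `cap`, then open `path` until the
    kernel refuses with EMFILE; return how many opens succeeded."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Don't exceed the hard limit; RLIM_INFINITY means "no hard cap".
    new_soft = cap if (hard == resource.RLIM_INFINITY or hard >= cap) else hard
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    fds = []
    try:
        while True:
            try:
                fds.append(os.open(path, os.O_WRONLY))
            except OSError as e:
                # This is the failure a child with busted error handling
                # would ignore before scribbling on inherited fds.
                assert e.errno in (errno.EMFILE, errno.ENFILE)
                return len(fds)
    finally:
        for fd in fds:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
```

A backend near its limit behaves the same way: the next open() fails, and anything that ignores that failure is one bad write away from corrupting a file it never meant to touch.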
Re: [HACKERS] WAL consistency check facility
On Fri, Oct 28, 2016 at 11:35 AM, Michael Paquier wrote: > And here we go. Here is a review as well as a large brush-up for this > patch. A couple of things: Thanks for reviewing the patch in detail. > - Speaking of which using BKPIMAGE_IS_REQUIRED_FOR_REDO stored in the > block definition is sort of weird because we want to know if > consistency should be checked at a higher level. A full page image can be included in the WAL record for the following reasons:
1. It needs to be restored during replay.
2. WAL consistency should be checked for the record.
3. Both of the above.
In your patch, you've included a full page image whenever wal_consistency is true. So, XLogReadBufferForRedoExtended always restores the image and returns BLK_RESTORED, which is unacceptable. We can't change the default WAL replay behaviour. A full image should only be restored if it is necessary to do so. Although, I agree that BKPIMAGE_IS_REQUIRED_FOR_REDO doesn't look like a clean way to implement this feature. > - wal_consistency is using a list of RMGRs, at the cost of being > PGC_POSTMASTER. I'd suggest making it PGC_SUSET, and use a boolean (I > have been thinking hard about that, and still I don't see the point). > It is rather easy to, for example, default it to false, and enable it to > true to check if a certain code path is correctly exercised or not for > WAL consistency. Note that this simplification reduces the patch size > by 100~150 lines. I know, I know, I'd expect some complaints about > that As Robert also said, if I'm focusing on a single AM, I really don't want to store full images and perform consistency checks for other AMs. > On top of that, I have done a fair amount of testing, creating > manually some inconsistencies in the REDO routines to trigger failures > on standbys. And that was sort of fun to break things intentionally. I know the feeling.
:) -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
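The core of the proposed consistency check -- mask out the fields that may legitimately differ between primary and standby (hint bits and the like), then byte-compare the two versions of a page -- can be sketched as follows. This is illustrative Python with invented names; the real patch operates on 8kB page images inside the redo path.

```python
def mask_page(page, mask_ranges):
    """Zero the byte ranges that may legitimately differ (e.g. hint bits,
    unused space), so they don't produce false inconsistency reports."""
    out = bytearray(page)
    for start, length in mask_ranges:
        out[start:start + length] = b"\x00" * length
    return bytes(out)

def pages_consistent(primary_page, standby_page, mask_ranges):
    # Compare only after masking both sides identically.
    return mask_page(primary_page, mask_ranges) == mask_page(standby_page, mask_ranges)
```

Intentionally corrupting a REDO routine (as described above) then shows up as two masked pages that no longer compare equal.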
ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)
Hello, The attached patch implements Amit's earlier suggestion. My motivation is to collect extra run-time statistics specific to a particular ForeignScan/CustomScan, beyond the standard Instrumentation; in my case, things like DMA transfer rate or execution time of GPU kernels. The per-node DSM toc is one of the best ways to return run-time statistics to the master backend, because the FDW/CSP can request a region of arbitrary length according to its needs, and that is quite easy to do. However, one problem is that the per-node DSM toc has already been released by the time ExecEndNode() is called on the child node of Gather. This patch allows extensions to get control in the master backend's context once all the worker nodes have finished but prior to release of the DSM segment. If the FDW/CSP keeps special statistics in the segment, it can move them to private memory for EXPLAIN output or other purposes. One design consideration is whether the hook should be called from ExecParallelRetrieveInstrumentation() or ExecParallelFinish(). The former is a function to retrieve the standard Instrumentation information, so it is valid only under EXPLAIN ANALYZE. On the other hand, if we put the entry point in ExecParallelFinish(), the extension can get control regardless of EXPLAIN ANALYZE; however, that also needs an extra planstate_tree_walker(). Right now, we don't assume anything about the FDW/CSP's requirements. It may want run-time statistics regardless of EXPLAIN ANALYZE; thus, the hook is always invoked when the Gather node has confirmed termination of the worker processes.
Thanks, -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei> -Original Message- > From: Amit Kapila [mailto:amit.kapil...@gmail.com] > Sent: Monday, October 17, 2016 11:22 AM > To: Kaigai Kouhei(海外 浩平) > Cc: Robert Haas; pgsql-hackers > Subject: ##freemail## Re: [HACKERS] Steps inside ExecEndGather > > On Mon, Oct 17, 2016 at 6:22 AM, Kouhei Kaigai wrote: > > Hello, > > > > I'm now trying to carry extra performance statistics on CustomScan > > (like DMA transfer rate, execution time of GPU kernels, etc...) > > from parallel workers to the leader process using the DSM segment > > attached by the parallel-context. > > We can require an arbitrary length of DSM using ExecCustomScanEstimate > > hook by extension, then it looks leader/worker can share the DSM area. > > However, we have a problem on this design. > > > > Below is the implementation of ExecEndGather(). > > > > void > > ExecEndGather(GatherState *node) > > { > > ExecShutdownGather(node); > > ExecFreeExprContext(>ps); > > ExecClearTuple(node->ps.ps_ResultTupleSlot); > > ExecEndNode(outerPlanState(node)); > > } > > > > It calls ExecShutdownGather() prior to the recursive call of ExecEndNode(). > > The DSM segment shall be released on this call, so child node cannot > > reference the DSM at the time of ExecEndNode(). > > > > Before releasing DSM, we do collect all the statistics or > instrumentation information of each node. Refer > ExecParallelFinish()->ExecParallelRetrieveInstrumentation(), so I am > wondering why can't you collect the additional information in the same > way? > > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com parallel-finish-fdw_csp.v1.patch Description: parallel-finish-fdw_csp.v1.patch -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
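The proposed control flow -- walk the finished plan tree and let each FDW/CSP copy its statistics out of the per-node shared area before that area is released -- can be sketched like this. Illustrative Python with invented names; the real interface would be a C callback invoked via an extra planstate_tree_walker() before the DSM segment is detached.

```python
class Node:
    """Stand-in for a PlanState node; on_parallel_finish plays the role of
    the proposed FDW/CSP hook."""
    def __init__(self, name, children=(), on_parallel_finish=None):
        self.name = name
        self.children = list(children)
        self.on_parallel_finish = on_parallel_finish
        self.private_stats = None

def parallel_finish(root, shared_toc):
    """Depth-first walk after all workers have terminated; the caller may
    release shared_toc only after this returns."""
    if root.on_parallel_finish:
        root.on_parallel_finish(root, shared_toc)
    for child in root.children:
        parallel_finish(child, shared_toc)

def copy_gpu_stats(node, toc):
    # Hypothetical callback: move per-node stats out of shared storage
    # into private memory, e.g. for later EXPLAIN output.
    node.private_stats = dict(toc.get(node.name, {}))
```

Because the walk runs unconditionally (not only under EXPLAIN ANALYZE), the extension gets its statistics in every execution mode, matching the rationale above for anchoring the hook at ExecParallelFinish().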
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On 10/31/2016 02:51 PM, Amit Kapila wrote: On Mon, Oct 31, 2016 at 12:02 AM, Tomas Vondrawrote: Hi, On 10/27/2016 01:44 PM, Amit Kapila wrote: I've read that analysis, but I'm not sure I see how it explains the "zig zag" behavior. I do understand that shifting the contention to some other (already busy) lock may negatively impact throughput, or that the group_update may result in updating multiple clog pages, but I don't understand two things: (1) Why this should result in the fluctuations we observe in some of the cases. For example, why should we see 150k tps on, 72 clients, then drop to 92k with 108 clients, then back to 130k on 144 clients, then 84k on 180 clients etc. That seems fairly strange. I don't think hitting multiple clog pages has much to do with client-count. However, we can wait to see your further detailed test report. (2) Why this should affect all three patches, when only group_update has to modify multiple clog pages. No, all three patches can be affected due to multiple clog pages. Read second paragraph ("I think one of the probable reasons that could happen for both the approaches") in same e-mail [1]. It is basically due to frequent release-and-reacquire of locks. On logged tables it usually looks like this (i.e. modest increase for high client counts at the expense of significantly higher variability): http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64 What variability are you referring to in those results? Good question. What I mean by "variability" is how stable the tps is during the benchmark (when measured on per-second granularity). For example, let's run a 10-second benchmark, measuring number of transactions committed each second. Then all those runs do 1000 tps on average: run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000 run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500 run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000 Generally, such behaviours are seen due to writes. 
Are WAL and DATA on the same disk in your tests? Yes, there's one RAID device on 10 SSDs, with 4GB of write cache on the controller. I've done some tests and it easily handles > 1.5GB/s in sequential writes, and > 500MB/s in sustained random writes. Also, let me point out that most of the tests were done so that the whole data set fits into shared_buffers, and with no checkpoints during the runs (so no writes to data files should really happen). For example, these tests were done on scale 3000 (45GB data set) with 64GB shared buffers: [a] http://tvondra.bitbucket.org/index2.html#pgbench-3000-unlogged-sync-noskip-64 [b] http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-async-noskip-64 and I could show similar cases with scale 300 on 16GB shared buffers. In those cases, there's very little contention between WAL and the rest of the database (in terms of I/O). And moreover, this setup (single device for the whole cluster) is very common; we can't just neglect it. But my main point here really is that the trade-off in those cases may not be really all that great, because you get the best performance at 36/72 clients, and then the tps drops and variability increases. At least not right now, before tackling contention on the WAL lock (or whatever lock becomes the bottleneck). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
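The variability notion described earlier in the thread can be made concrete: the three hypothetical 10-second runs all have the same mean tps, and only the per-second standard deviation separates them. A minimal check in Python (the run data is the invented example from the email, not a real benchmark):

```python
from statistics import mean, pstdev

# Per-second transaction counts for the three hypothetical 10-second runs,
# each averaging 1000 tps.
run1 = [1000] * 10          # perfectly steady
run2 = [500, 1500] * 5      # moderate oscillation
run3 = [0, 2000] * 5        # stalls alternating with bursts

def summarize(run):
    """Return (mean tps, per-second standard deviation)."""
    return mean(run), pstdev(run)
```

Reporting mean tps alone (as a plain pgbench summary does) collapses all three runs to the same number; the stddev is what exposes the fluctuation being discussed.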
[HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]
Hello, The attached patch is a revised version of the pass-down-bounds feature. Its functionality is unchanged from the previous version; however, the implementation was revised according to the discussion at the last CF. This patch adds a new field (ps_numTuples) to PlanState. This is an optimization hint for the case where the parent node needs only the first N tuples. It shall be set by the parent node prior to ExecProcNode(), after ExecInitNode() or ExecReScan(); child nodes can then adjust their execution behavior (e.g., Sort will use a top-N heapsort if ps_numTuples is set) and pass the hint further down to their own children. As an example, I enhanced postgres_fdw to honor ps_numTuples if it is set. When a remote ORDER BY is pushed down, the current code tries to sort the entire remote table because it does not know how many rows will be returned, so execution takes longer. The patched version, on the other hand, runs the remote query with a LIMIT clause according to ps_numTuples, which is supplied by the Limit node on top of the ForeignScan node.
* without patch
postgres=# explain (analyze,verbose) select * from ft order by x,y limit 10;
                              QUERY PLAN
 Limit (cost=100.00..100.43 rows=10 width=52) (actual time=2332.548..2332.550 rows=10 loops=1)
   Output: id, x, y, z
   -> Foreign Scan on public.ft (cost=100.00..146.46 rows=1077 width=52) (actual time=2332.547..2332.548 rows=10 loops=1)
        Output: id, x, y, z
        Remote SQL: SELECT id, x, y, z FROM public.t ORDER BY x ASC NULLS LAST, y ASC NULLS LAST
 Planning time: 0.177 ms
 Execution time: 2445.590 ms
(7 rows)

* with patch
postgres=# explain (analyze,verbose) select * from ft order by x,y limit 10;
                              QUERY PLAN
 Limit (cost=100.00..100.43 rows=10 width=52) (actual time=579.469..579.471 rows=10 loops=1)
   Output: id, x, y, z
   -> Foreign Scan on public.ft (cost=100.00..146.46 rows=1077 width=52) (actual time=579.468..579.469 rows=10 loops=1)
        Output: id, x, y, z
        Remote SQL: SELECT id, x, y, z FROM public.t ORDER BY x ASC NULLS LAST, y ASC NULLS LAST
 Planning time: 0.123 ms
 Execution time: 579.858 ms
(7 rows)

Right now, I have a few concerns about this patch.

1. Because the LIMIT clause can contain an expression, not only a constant value, we cannot determine the value of ps_numTuples until execution time. So it is not possible to adjust the remote query at planning time, and EXPLAIN does not show the exact remote query even if a LIMIT clause was actually attached.

2. Where is the best location to put the interface contract to set the ps_numTuples field? It has to be set prior to the first ExecProcNode() after ExecInitNode() or ExecReScan().
Thanks, -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei> -Original Message- > From: Robert Haas [mailto:robertmh...@gmail.com] > Sent: Friday, September 16, 2016 12:39 AM > To: Kaigai Kouhei(海外 浩平) > Cc: Jeevan Chalke; pgsql-hackers@postgresql.org; Etsuro Fujita; Andres Freund > Subject: ##freemail## Re: [HACKERS] PassDownLimitBound for > ForeignScan/CustomScan > > On Tue, Sep 13, 2016 at 9:07 PM, Kouhei Kaigai wrote: > > In the current implementation calls recompute_limits() on the first > > invocation of ExecLimit and ExecReScanLimit. Do we expect the > > ps->numTuples will be also passed down to the child nodes on the same > > timing? > > Sure, unless we find some reason why that's not good. > > > I also think this new executor contract shall be considered as a hint > > (but not a requirement) for the child nodes, because it allows the > > parent nodes to re-distribute the upper limit regardless of the type > > of the child nodes as long as the parent node can work correctly and > > has benefit even if the child node returns a part of tuples. It makes > > the decision whether the upper limit should be passed down much simple. > > The child node "can" ignore the hint but can utilize for more optimization. > > +1. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company passdown-limit-fdw.v1.patch Description: passdown-limit-fdw.v1.patch -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DML and column cound in aggregated subqueries
On 2016-10-31 09:35:57 -0400, Tom Lane wrote:
> Andres Freund writes:
> > this doesn't look right. The ctid shouldn't be in the aggregate output -
> > after all it's pretty much meaningless here.
>
> I suspect it's being added to support EvalPlanQual row re-fetches.

Hm, that doesn't seem particularly meaningful though? I wonder if it's not actually more an accident than anything. Looks like preprocess_rowmarks() adds them pretty unconditionally to everything for DML.

> > Casting a wider net: find_hash_columns() and its subroutines seem like
> > pretty much dead code, including outdated comments (look for "SQL99
> > semantics").
>
> Hm, maybe. In principle the planner could do that instead, but it doesn't
> look like it actually does at the moment; I'm not seeing any distinction
> in tlist-building for AGG_HASHED vs other cases in create_agg_plan().

Isn't the important part what's inside Agg->numCols/grpColIdx/grpOperators? Those are the ones that are minimal, and it looks like we do the right thing there already (and if not, we'd be in trouble anyway, AFAICS). We currently only store the columns that find_hash_columns() figures out to be required in the hashtable, but AFAICS we could instead just iterate over grpColIdx[] and store only those.

The reason I'm looking into this in the first place is that the slot access inside execGrouping turns out to be pretty expensive, especially because all the tuples have nulls, so we can't even skip columns and such. I wrote code to create a new tupledesc containing only the columns referenced by grpColIdx, and asserted that it's the same set that find_hash_columns() computes - that's not always the case, but AFAICS only spuriously.

Greetings,

Andres Freund
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Mon, Oct 31, 2016 at 7:02 PM, Tomas Vondra wrote:
>
> The remaining benchmark with 512 clog buffers completed, and the impact
> roughly matches Dilip's benchmark - that is, increasing the number of clog
> buffers eliminates all positive impact of the patches observed on 128
> buffers. Compare these two reports:
>
> [a] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest
>
> [b] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest-512
>
> With 128 buffers the group_update and granular_locking patches achieve up to
> 50k tps, while master and no_content_lock do ~30k tps. After increasing
> number of clog buffers, we get only ~30k in all cases.
>
> I'm not sure what's causing this, whether we're hitting limits of the simple
> LRU cache used for clog buffers, or something else.
>

I have also seen previously that increasing clog buffers to 256 can impact performance negatively. So, probably here the gains due to the group_update patch are negated by the impact of increasing clog buffers. I am not sure it is a good idea to evaluate the impact of increasing clog buffers along with this patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Dumb mistakes in WalSndWriteData()
On 2016-10-31 09:44:16 -0400, Robert Haas wrote:
> On Mon, Oct 31, 2016 at 4:59 AM, Andres Freund wrote:
> > I^Wsomebody appears to have made a number of dumb mistakes in
> > WalSndWriteData(), namely:
> > 1) The timestamp is set way too late, after
> >    pq_putmessage_noblock(). That'll sometimes work, sometimes not. I
> >    have no idea how that ended up happening. It's eye-wateringly dumb.
> >
> > 2) We only do WalSndKeepaliveIfNecessary() if we're blocked on socket
> >    IO. But on a long-lived connection that might be a lot of data, we
> >    should really do that once *before* trying to send the payload in the
> >    first place.
> >
> > 3) Similarly to 2) it might be worthwhile checking for interrupts
> >    every time, not just when blocked on network IO.
> >
> > See also:
> > http://archives.postgresql.org/message-id/CAMsr%2BYE2dSfHVr7iEv1GSPZihitWX-PMkD9QALEGcTYa%2Bsdsgg%40mail.gmail.com
>
> Do you intend to do something about these problems?

At least 1) and 2), yes. I basically wrote this email to have something to reference in my todo list...

Greetings,

Andres Freund
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Mon, Oct 31, 2016 at 12:02 AM, Tomas Vondrawrote: > Hi, > > On 10/27/2016 01:44 PM, Amit Kapila wrote: > > I've read that analysis, but I'm not sure I see how it explains the "zig > zag" behavior. I do understand that shifting the contention to some other > (already busy) lock may negatively impact throughput, or that the > group_update may result in updating multiple clog pages, but I don't > understand two things: > > (1) Why this should result in the fluctuations we observe in some of the > cases. For example, why should we see 150k tps on, 72 clients, then drop to > 92k with 108 clients, then back to 130k on 144 clients, then 84k on 180 > clients etc. That seems fairly strange. > I don't think hitting multiple clog pages has much to do with client-count. However, we can wait to see your further detailed test report. > (2) Why this should affect all three patches, when only group_update has to > modify multiple clog pages. > No, all three patches can be affected due to multiple clog pages. Read second paragraph ("I think one of the probable reasons that could happen for both the approaches") in same e-mail [1]. It is basically due to frequent release-and-reacquire of locks. > > >>> On logged tables it usually looks like this (i.e. modest increase for >>> high >>> client counts at the expense of significantly higher variability): >>> >>> http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-skip-64 >>> >> >> What variability are you referring to in those results? > >> > > Good question. What I mean by "variability" is how stable the tps is during > the benchmark (when measured on per-second granularity). For example, let's > run a 10-second benchmark, measuring number of transactions committed each > second. 
> > Then all those runs do 1000 tps on average: > > run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000 > run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500 > run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000 > Generally, such behaviours are seen due to writes. Are WAL and DATA on same disk in your tests? [1] - https://www.postgresql.org/message-id/CAA4eK1J9VxJUnpOiQDf0O%3DZ87QUMbw%3DuGcQr4EaGbHSCibx9yA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Dumb mistakes in WalSndWriteData()
On Mon, Oct 31, 2016 at 4:59 AM, Andres Freundwrote: > I^Wsomebody appears to have made a number of dumb mistakes in > WalSndWriteData(), namely: > 1) The timestamp is set way too late, after >pq_putmessage_noblock(). That'll sometimes work, sometimes not. I >have no idea how that ended up happening. It's eye-wateringly dumb. > > 2) We only do WalSndKeepaliveIfNecessary() if we're blocked on socket >IO. But on a long-lived connection that might be a lot of data, we >should really do that once *before* trying to send the payload in the >first place. > > 3) Similarly to 2) it might be worthwhile checking for interrupts >everytime, not just when blocked on network IO. > > See also: > http://archives.postgresql.org/message-id/CAMsr%2BYE2dSfHVr7iEv1GSPZihitWX-PMkD9QALEGcTYa%2Bsdsgg%40mail.gmail.com Do you intend to do something about these problems? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On 10/30/2016 07:32 PM, Tomas Vondra wrote:

Hi,

On 10/27/2016 01:44 PM, Amit Kapila wrote:

On Thu, Oct 27, 2016 at 4:15 AM, Tomas Vondra wrote:

FWIW I plan to run the same test with logged tables - if it shows similar regression, I'll be much more worried, because that's a fairly typical scenario (logged tables, data set > shared buffers), and we surely can't just go and break that.

Sure, please do those tests.

OK, so I do have results for those tests - that is, scale 3000 with shared_buffers=16GB (so continuously writing out dirty buffers). The following reports show the results slightly differently - all three "tps charts" next to each other, then the speedup charts and tables.

Overall, the results are surprisingly positive - look at these results (all ending with "-retest"):

[1] http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest
[2] http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-noskip-retest
[3] http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest

All three show significant improvement, even with fairly low client counts. For example with 72 clients, the tps improves 20%, without significantly affecting variability of the results (measured as stddev, more on this later).

It's however interesting that "no_content_lock" is almost exactly the same as master, while the other two cases improve significantly.

The other interesting thing is that "pgbench -N" [3] shows no such improvement, unlike regular pgbench and Dilip's workload. Not sure why, though - I'd expect to see significant improvement in this case.

I have also repeated those tests with clog buffers increased to 512 (so 4x the current maximum of 128).
I only have results for Dilip's workload and "pgbench -N":

[4] http://tvondra.bitbucket.org/index2.html#dilip-3000-logged-sync-retest-512
[5] http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-sync-skip-retest-512

The results are somewhat surprising, I guess, because the effect is wildly different for each workload.

For Dilip's workload, increasing clog buffers to 512 pretty much eliminates all benefits of the patches. For example with 288 clients, the group_update patch gives ~60k tps on 128 buffers [1] but only 42k tps on 512 buffers [4].

With "pgbench -N", the effect is exactly the opposite - while with 128 buffers there was pretty much no benefit from any of the patches [3], with 512 buffers we suddenly get almost 2x the throughput, but only for group_update and master (while the other two patches show no improvement at all).

The remaining benchmark with 512 clog buffers completed, and the impact roughly matches Dilip's benchmark - that is, increasing the number of clog buffers eliminates all positive impact of the patches observed on 128 buffers. Compare these two reports:

[a] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest
[b] http://tvondra.bitbucket.org/#pgbench-3000-logged-sync-noskip-retest-512

With 128 buffers the group_update and granular_locking patches achieve up to 50k tps, while master and no_content_lock do ~30k tps. After increasing the number of clog buffers, we get only ~30k in all cases.

I'm not sure what's causing this, whether we're hitting limits of the simple LRU cache used for clog buffers, or something else. But maybe there's something in the design of clog buffers that makes them work less efficiently with more clog buffers? I'm not sure whether that's something we need to fix before eventually committing any of them.
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DML and column cound in aggregated subqueries
Andres Freund writes:
> this doesn't look right. The ctid shouldn't be in the aggregate output -
> after all it's pretty much meaningless here.

I suspect it's being added to support EvalPlanQual row re-fetches.

> Casting a wider net: find_hash_columns() and its subroutines seem like
> pretty much dead code, including outdated comments (look for "SQL99
> semantics").

Hm, maybe. In principle the planner could do that instead, but it doesn't look like it actually does at the moment; I'm not seeing any distinction in tlist-building for AGG_HASHED vs other cases in create_agg_plan(). As an example:

regression=# explain verbose select avg(ten), hundred from tenk1 group by 2;
                              QUERY PLAN
 HashAggregate  (cost=508.00..509.25 rows=100 width=36)
   Output: avg(ten), hundred
   Group Key: tenk1.hundred
   ->  Seq Scan on public.tenk1  (cost=0.00..458.00 rows=1 width=8)
         Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
(5 rows)

If you wanted to forgo find_hash_columns() then it would be important for the seqscan to output a minimal tlist rather than try to optimize away its projection step. I'm not that excited about making such a change in terms of performance: you'd essentially be skipping a handmade projection step inside nodeAgg at the cost of probably adding one at the input node, as in this example. But it might be worth doing anyway just on the grounds that this ought to be the planner's domain, not a custom hack in the executor.

regards, tom lane
Re: [HACKERS] Logical decoding and walsender timeouts
On 31 October 2016 at 16:52, Andres Freundwrote: > Hi, > > On 2016-10-31 16:34:38 +0800, Craig Ringer wrote: >> TL;DR: Logical decoding clients need to generate their own keepalives >> and not rely on the server requesting them to prevent timeouts. Or >> admins should raise the wal_sender_timeout by a LOT when using logical >> decoding on DBs with any big rows. > > Unconvinced. Yeah. I've seen enough issues in the wild where we keep timing out and restarting over and over until we increase wal_sender_timeout to know there's _something_ going on. I am less sure I'm right about what is or how to solve it. >> When sending a big message, WalSndWriteData() notices that it's >> approaching timeout and tries to send a keepalive request, but the >> request just gets buffered behind the remaining output plugin data and >> isn't seen by the client until the client has received the rest of the >> pending data. > > Only for individual messages, not the entire transaction though. Right. I initially thought it was the whole tx, but I was mistaken as I'd failed to notice that WalSndWriteData() queues a keepalive request. > Are > you sure the problem at hand is that we're sending a keepalive, but it's > too late? No, I'm not sure. I'm trying to identify the cause of an issue I've seen in the wild, but never under conditions where it's been possible to sit around and debug in a leisurely manner. I'm trying to set up a TAP test to demonstrate that this happens, but I don't think it's going to work without some kind of network bandwidth limitation simulation or simulated latency. A local unix socket is just too fast for Pg's row size limits. > It might very well be that the actual issue is that we're > never sending keepalives, because the network is fast enough / the tcp > window is large enough. IIRC we only send a keepalive if we're blocked > on network IO? Mm, that's a good point. 
That might better explain the issues I've seen in the wild, since I never found strong evidence that individual big rows were involved, but hadn't been able to come up with anything else yet. >> So: We could ask output plugins to deal with this for us, by chunking >> up their data in small pieces and calling OutputPluginPrepareWrite() >> and OutputPluginWrite() more than once per output plugin callback if >> they expect to send a big message. But this pushes the complexity of >> splitting up and handling big rows, and big Datums, onto each plugin. >> It's awkward to do well and hard to avoid splitting things up >> unnecessarily. > > There's decent reason for doing that independently though, namely that > it's a lot more efficient from a memory management POV. Definitely. Though you're always going to be tossing around ridiculous chunks of memory when dealing with big external compressed toasted data, unless there are ways to access that progressively that I'm unaware of. Hopefully there are. I'd quite like to extend the bdr/pglogical/logicalrep protocol so that in-core logical rep, in some later version, can write a field as 'to follow', like we currently mark unchanged toasted datums separately. Then send it chunked, after the main row, in follow-up messages. That way we keep processing keepalives, we don't allocate preposterous amounts of memory, etc. > I don't think the "unrequested keepalive" approach really solves the > problem on a fundamental enough level. Fair. It feels a bit like flailing in the dark, too. >> (A separate issue is that we can also time out when doing logical >> _replication_ if the downstream side blocks on a lock, since it's not >> safe to send on a socket from a signal handler ... ) > > That's strictly speaking not true. write() / sendmsg() are signal safe > functions. There's good reasons not to do that however, namely that the > non signal handler code might be busy writing data itself. Huh, ok. 
And since in pglogical/bdr and as far as I can tell in core logical rep we don't send anything on the socket while we're calling in to heap access, the executor, etc, that'd actually be an option. We could possibly safeguard it with a volatile "socket busy" flag since we don't do much sending anyway. But I'd need to do my reading on signal handler safety etc. Still, good to know it's not completely absurd to do this if the issue comes up. Thanks very much for the input. I saw your post with proposed changes. Once I can get the issue simulated reliably I'll test the patch and see if it solves it, but it looks like it's sensible to just apply it anyway TBH. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: scan key push down to heap [WIP]
On Fri, Oct 28, 2016 at 2:46 AM, Andres Freund wrote:
> Well, that'll also make the feature not particularly useful :(. My
> suspicion is that the way to succeed here isn't to rely more on testing
> as part of the scan, but create a more general fastpath for qual
> evaluation, which atm is a *LOT* more heavyweight than what
> HeapKeyTest() does. But maybe I'm biased since I'm working on the
> latter...

I think you might be right, but I'm not very clear on what the timeline for your work is. It would be easier to say, sure, let's put this on hold if we knew that in a month or two we could come back and retest after you've made some progress. But I don't know whether we're talking about months or years.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On 10/31/2016 05:01 AM, Jim Nasby wrote:

On 10/30/16 1:32 PM, Tomas Vondra wrote:

Now, maybe this has nothing to do with PostgreSQL itself, but maybe it's some sort of CPU / OS scheduling artifact. For example, the system has 36 physical cores, 72 virtual ones (thanks to HT). I find it strange that the "good" client counts are always multiples of 72, while the "bad" ones fall in between.

72 = 72 * 1 (good)
108 = 72 * 1.5 (bad)
144 = 72 * 2 (good)
180 = 72 * 2.5 (bad)
216 = 72 * 3 (good)
252 = 72 * 3.5 (bad)
288 = 72 * 4 (good)

So maybe this has something to do with how the OS schedules the tasks, or maybe some internal heuristics in the CPU, or something like that.

It might be enlightening to run a series of tests that are 72*.1 or *.2 apart (say, 72, 79, 86, ..., 137, 144).

Yeah, I've started a benchmark with a step of 6 clients

36 42 48 54 60 66 72 78 ... 252 258 264 270 276 282 288

instead of just

36 72 108 144 180 216 252 288

which did a test every 36 clients. To compensate for the 6x longer runs, I'm only running tests for "group-update" and "master", so I should have the results in ~36h.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Overlook in 2bd9e412?
On Fri, Oct 28, 2016 at 8:46 AM, Julien Rouhaud wrote:
> It looks like pq_putmessage_hook and pq_flush_hook were meant for
> development purposes and not supposed to be committed.
>
> Attached patch removes them.

Thanks. Those were actually remnants of an earlier design which didn't survive contact with Andres, and which I failed to clean up properly. Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Query regarding selectDumpableExtension()
amul sul writes:
> On Fri, Oct 28, 2016 at 6:22 PM, Robert Haas wrote:
>> There's a comment in dumpExtension() that explains it.

> Let me explain the case I'm trying to tackle. I have two old dumps,
> each of which has a couple of objects that depend on plpgsql. I
> restored the first dump, and trying to restore the second dump using
> 'pg_restore -c' fails with the following error:
> ERROR: cannot drop extension plpgsql because other objects depend on it

This is hardly specific to extensions. If you try a restore with -c into a database that has other random objects besides what's in the dump, you could get errors from

* dropping tables that are referenced by foreign keys from tables not known in the dump
* dropping functions that are used in views not known in the dump
* dropping operators or opclasses used by indexes not known in the dump

etc etc.

> It works well without the '-c' option, but that is not a general solution, IMHO.

The general solution is either don't restore into a database containing unrelated objects, or be prepared to ignore errors from the DROP commands. The extension case actually works more smoothly than most of the others.

regards, tom lane
Re: [HACKERS] Proposal : For Auto-Prewarm.
On Fri, Oct 28, 2016 at 3:40 AM, Andres Freund wrote:
>>
>> There is not much common functionality between the two.
>
> I don't really agree. For me manual and automated prewarming are pretty
> closely related. Sure they have their independent usages, and not too
> much code might be shared. But grouping them in the same extension seems
> to make sense, it's confusing to have closely related but independent
> extensions.

I agree that putting them together would be fine.

>> One point that seems to be worth discussing is when should the buffer
>> information be dumped to a file? Shall we dump at each checkpoint or
>> at regular intervals via auto prewarm worker or at some other time?
>
> Should probably be at some regular interval - not sure if checkpoints
> are the best time (or if it's even realistic to tie a bgworker to
> checkpoints), since checkpoints have a significant impact on the state
> of shared_buffers.

Checkpoints don't cause any buffer replacement, which I think is what would be relevant here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Oct 28, 2016 at 3:09 AM, Ashutosh Bapatwrote: > I think there are going to be two kinds of partitioning use-cases. > First, carefully hand-crafted by DBAs so that every partition is > different from other and so is every join between two partitions. > There will be lesser number of partitions, but creating paths for each > join between partitions will be crucial from performance point of > view. Consider, for example, systems which use partitions to > consolidate results from different sources for analytical purposes or > sharding. If we consider various points you have listed in [1] as to > why a partition is equivalent to a table, each join between partitions > is going to have very different characteristics and thus deserves a > set of paths for its own. Add to that possibility of partition pruning > or certain conditions affecting particular partitions, the need for > detailed planning evident. > > The other usage of partitioning is to distribute the data and/or > quickly eliminate the data by partition pruning. In such case, all > partitions of a given table will have very similar properties. There > is a large chance that we will end up having same plans for every > partition and for joins between partitions. In such cases, I think it > suffices to create paths for just one or may be a handful partitions > of join and repeat that plan for other partitions of join. But in such > cases it also makes sense to have a light-weight representation for > partitions as compared to partitions being a full-fledged tables. If > we have such a light-weight representation, we may not even create > RelOptInfos representing joins between partitions, and different paths > for each join between partitions. I'm not sure I see a real distinction between these two use cases. I think that the problem of differing data distribution between partitions is almost always going to be an issue. Take the simple case of an "orders" table which is partitioned by month. 
First, the month that's currently in progress may be much smaller than a typical completed month. Second, many businesses are seasonal and may have many more orders at certain times of year. For example, in American retail, many businesses have large spikes in December. I think some businesses may do four times as much business in December as any other month, for example. So you will have that sort of variation, at least. > A typical join tree will be composite: some portion partitioned and > some portion unpartitioned or different portions partitioned by > different partition schemes. In such case, inaccurate costs for > PartitionJoinPath, can affect the plan heavily, causing a suboptimal > path to be picked. Assuming that partitioning will be useful for large > sets of data, choosing a suboptimal plan can be more dangerous than > consuming memory for creating paths. Well, sure. But, I mean, every simplifying assumption which the planner makes to limit resource consumption could have that effect. join_collapse_limit, for example, can cause horrible plans. However, we have it anyway, because the alternative of having planning take far too long is unpalatable. Planning is always, at some level, guesswork. >> For each >> partition, we switch to a new memory context, perform planning, copy >> the best path and its substructure back to the parent context, and >> then reset the context. > > This could be rather tricky. It assumes that all the code that creates > paths for joins, should not allocate any memory which is linked to > some object in a context that lives longer than the path creation > context. There is some code like create_join_clause() or > make_canonical_pathkey(), which carefully chooses which memory context > to allocate memory in. But can we ensure it always? postgres_fdw for > example allocates memory for PgFdwRelationInfo in current memory > context and attaches it in RelOptInfo, which should be in the > planner's original context. 
So, if we create a new memory context for > each partition, fpinfos would be invalidated when those contexts are > released. Not that, we can not enforce some restriction on the memory > usage while planning, it's hard to enforce it and bugs arising from it > may go unnoticed. GEQO planner might have its own problems with this > approach. Third party FDWs will pose a problem. Yep, there are problems. :-) > A possible solution would be to keep the track of used paths using a > reference count. Once the paths for given join tree are created, free > up the unused paths by traversing pathlist in each of the RelOptInfos. > Attached patch has a prototype implementation for the same. There are > some paths which are not linked to RelOptInfos, which need a bit > different treatment, but they can be handled too. So, if you apply this with your previous patch, how much does it cut down memory consumption? >> In that way, peak memory usage only grows by >> about a factor of
Re: [HACKERS] sequences and pg_upgrade
On 9/30/16 12:50 PM, Anastasia Lubennikova wrote: > The patches are good, no complaints. > But again, I have the same question. > I was confused, why do we always dump sequence data, > because I'd overlooked the --sequence-data key. I'd rather leave this > option, > because it's quite non intuitive behaviour... > /* dump sequence data even in schema-only mode */ Here are rebased patches. Regarding your question: The initial patch had a separate option for this behavior, which was then used by pg_upgrade. It was commented that this option is not useful outside of pg_upgrade, so it doesn't need to be exposed as a user-facing option. I agreed with that and removed the option. We can always add the option back easily if someone really wants it, but so far no use case has been presented. So I suggest we proceed with this proposal ignoring whether this option is exposed or not. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From d2b98ba5df815018dac1650134398c1bac7164a4 Mon Sep 17 00:00:00 2001 From: Peter EisentrautDate: Tue, 23 Aug 2016 12:00:00 -0400 Subject: [PATCH v3 1/2] pg_dump: Separate table and sequence data object types Instead of handling both sequence data and table data internally as "table data", handle sequences separately under a "sequence set" type. We already handled materialized view data differently, so it makes the code somewhat cleaner to handle each relation kind separately at the top level. This does not change the output format, since there already was a separate "SEQUENCE SET" archive entry type. A noticeable difference is that SEQUENCE SET entries now always appear after TABLE DATA entries. And in parallel mode there is less sorting to do, because the sequence data entries are no longer considered table data. 
--- src/bin/pg_dump/pg_dump.c | 11 +++ src/bin/pg_dump/pg_dump.h | 1 + src/bin/pg_dump/pg_dump_sort.c | 28 +--- 3 files changed, 25 insertions(+), 15 deletions(-) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 4da297f..3485cab 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -2090,6 +2090,8 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo, bool oids) if (tbinfo->relkind == RELKIND_MATVIEW) tdinfo->dobj.objType = DO_REFRESH_MATVIEW; + else if (tbinfo->relkind == RELKIND_SEQUENCE) + tdinfo->dobj.objType = DO_SEQUENCE_SET; else tdinfo->dobj.objType = DO_TABLE_DATA; @@ -8498,11 +8500,11 @@ dumpDumpableObject(Archive *fout, DumpableObject *dobj) case DO_TRANSFORM: dumpTransform(fout, (TransformInfo *) dobj); break; + case DO_SEQUENCE_SET: + dumpSequenceData(fout, (TableDataInfo *) dobj); + break; case DO_TABLE_DATA: - if (((TableDataInfo *) dobj)->tdtable->relkind == RELKIND_SEQUENCE) -dumpSequenceData(fout, (TableDataInfo *) dobj); - else -dumpTableData(fout, (TableDataInfo *) dobj); + dumpTableData(fout, (TableDataInfo *) dobj); break; case DO_DUMMY_TYPE: /* table rowtypes and array types are never dumped separately */ @@ -16226,6 +16228,7 @@ addBoundaryDependencies(DumpableObject **dobjs, int numObjs, addObjectDependency(preDataBound, dobj->dumpId); break; case DO_TABLE_DATA: + case DO_SEQUENCE_SET: case DO_BLOB_DATA: /* Data objects: must come between the boundaries */ addObjectDependency(dobj, preDataBound->dumpId); diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h index a60cf95..642c4d5 100644 --- a/src/bin/pg_dump/pg_dump.h +++ b/src/bin/pg_dump/pg_dump.h @@ -63,6 +63,7 @@ typedef enum DO_PROCLANG, DO_CAST, DO_TABLE_DATA, + DO_SEQUENCE_SET, DO_DUMMY_TYPE, DO_TSPARSER, DO_TSDICT, diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c index 195b84a..5b96334 100644 --- a/src/bin/pg_dump/pg_dump_sort.c +++ b/src/bin/pg_dump/pg_dump_sort.c @@ -47,14 +47,15 @@ static 
const int dbObjectTypePriority[] = 11, /* DO_CONVERSION */ 18, /* DO_TABLE */ 20, /* DO_ATTRDEF */ - 27, /* DO_INDEX */ - 28, /* DO_RULE */ - 29, /* DO_TRIGGER */ - 26, /* DO_CONSTRAINT */ - 30, /* DO_FK_CONSTRAINT */ + 28, /* DO_INDEX */ + 29, /* DO_RULE */ + 30, /* DO_TRIGGER */ + 27, /* DO_CONSTRAINT */ + 31, /* DO_FK_CONSTRAINT */ 2, /* DO_PROCLANG */ 10, /* DO_CAST */ 23, /* DO_TABLE_DATA */ + 24, /* DO_SEQUENCE_SET */ 19, /* DO_DUMMY_TYPE */ 12, /* DO_TSPARSER */ 14, /* DO_TSDICT */ @@ -62,15 +63,15 @@ static const int dbObjectTypePriority[] = 15, /* DO_TSCONFIG */ 16, /* DO_FDW */ 17, /* DO_FOREIGN_SERVER */ - 31, /* DO_DEFAULT_ACL */ + 32, /* DO_DEFAULT_ACL */ 3, /* DO_TRANSFORM */ 21, /* DO_BLOB */ - 24, /* DO_BLOB_DATA */ + 25, /* DO_BLOB_DATA */ 22,
Re: [HACKERS] WAL consistency check facility
On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier wrote: > And here we go. Here is a review as well as a large brush-up for this > patch. A couple of things: > - wal_consistency is using a list of RMGRs, at the cost of being > PGC_POSTMASTER. I'd suggest making it PGC_SUSET, and use a boolean (I > have been thinking hard about that, and still I don't see the point). > It is rather easy to, for example, default it to false, and enable it to > true to check if a certain code path is correctly exercised or not for > WAL consistency. Note that this simplification reduces the patch size > by 100~150 lines. I know, I know, I'd expect some complaints about > that. I don't understand how you can fail to see the point of that. As you yourself said, this facility generates a ton of WAL. If you're focusing on one AM, why would you want to be forced to incur the overhead for every other AM? A good deal has been written about this upthread already, and just saying "I don't see the point" seems to be ignoring the explanations already given. > - Looking for wal_consistency at replay has no real value. What if on > a standby the parameter value is inconsistent with the one on the > master? This logic adds a whole new level of complications and > potential bugs. So instead my suggestion is to add a marker at WAL > record level to check if this record should be checked for consistency > at replay or not. Agreed. > This is also quite flexible if you think about it, > the standby is made independent of the WAL generated on the master and > just applies, or checks, whatever it sees fit to check. +1. > The best > match here is to add a flag for XLogRecord->xl_info and make use of > one of the low 4 bits; only one is used now for > XLR_SPECIAL_REL_UPDATE. Seems reasonable. > - in maskPage, the new rmgr routine, there is no need for the info and > blkno arguments. info is not used at all to begin with.
blkno is used > for gin pages to detect meta pages but this can be guessed using the > opaque pointer. For heap pages and speculative inserts, masking the > blkno would be fine. That's not worth it. Passing the blkno doesn't cost anything. If it avoids guessing, that's entirely worth it. > - Instead of palloc'ing the old and new pages to compare, it would be > more performant to keep around two static buffers worth of BLCKSZ and > just use that. This way there is no need as well to perform any palloc > calls in the masking functions, limiting the risk of errors (those > code paths had better avoid errors IMO). It would be also less costly > to just pass to the masking function a pointer to a buffer of size > BLCKSZ and just do the masking on it. We always palloc buffers like this so that they will be aligned. But we could arrange not to repeat the palloc every time (see, e.g., BootstrapXLOG()). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] DML and column count in aggregated subqueries
Hi, while hacking up nodeAgg.c to use a dedicated slot/desc for the hashtable, an assert I placed brought up a weird case:

regression[29535][1]=# EXPLAIN VERBOSE update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
┌───┐
│ QUERY PLAN │
├───┤
│ Update on public.bar (cost=42.75..109.25 rows=1130 width=20) │
│ -> Hash Join (cost=42.75..109.25 rows=1130 width=20) │
│ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid │
│ Hash Cond: (bar.f1 = foo.f1) │
│ -> Seq Scan on public.bar (cost=0.00..32.60 rows=2260 width=14) │
│ Output: bar.f1, bar.f2, bar.ctid │
│ -> Hash (cost=40.25..40.25 rows=200 width=10) │
│ Output: foo.ctid, foo.f1 │
│ -> HashAggregate (cost=38.25..40.25 rows=200 width=10) │
│ Output: foo.ctid, foo.f1 │
│ Group Key: foo.f1 │
│ -> Seq Scan on public.foo (cost=0.00..32.60 rows=2260 width=10) │
│ Output: foo.ctid, foo.f1 │
└───┘
(13 rows)

This doesn't look right. The ctid shouldn't be in the aggregate output - after all, it's pretty much meaningless here. Note that in this case find_hash_columns(aggstate) will return two columns, but node->numCols will be 1 (and thus node->grpOperators only has one element). So it seems we're doing something wrong here. Casting a wider net: find_hash_columns() and its subroutines seem like pretty much dead code, including outdated comments (look for "SQL99 semantics"). The main explanation says: /* * Create a list of the tuple columns that actually need to be stored in * hashtable entries. The incoming tuples from the child plan node will * contain grouping columns, other columns referenced in our targetlist and * qual, columns used to compute the aggregate functions, and perhaps just * junk columns we don't use at all. Only columns of the first two types * need to be stored in the hashtable, and getting rid of the others can * make the table entries significantly smaller.
To avoid messing up Var * numbering, we keep the same tuple descriptor for hashtable entries as the * incoming tuples have, but set unwanted columns to NULL in the tuples that * go into the table. but that seems bogus - when are we referencing "other columns referenced in our targetlist and qual" from the representative tuple that aren't also grouped upon? This method would select a pretty arbitrary value for those, since we don't hash them, so it'll just be whichever tuple occurred first? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improve output of BitmapAnd EXPLAIN ANALYZE
The BRIN Bitmap Index Scan has the same problem. I have seen people confused by this. I think N/A would clearly improve the situation. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sequential scan result order vs performance
Jim Nasby writes: > BTW, I've sometimes wished for a mode where queries would silently have > result ordering intentionally futzed, to eliminate any possibility of > dependence on tuple ordering (as well as having sequences start at some > random value). FWIW, SQLite has this, in the form of 'PRAGMA reverse_unordered_selects'. http://sqlite.org/pragma.html#pragma_reverse_unordered_selects -- "The surreality of the universe tends towards a maximum" -- Skud's Law "Never formulate a law or axiom that you're not prepared to live with the consequences of." -- Skud's Meta-Law -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Reload SSL certificates on SIGHUP
I have attached a version of the patch rebased on top of the OpenSSL 1.1 changes. Andreas diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c index 668f217..a1b582f 100644 --- a/src/backend/libpq/be-secure-openssl.c +++ b/src/backend/libpq/be-secure-openssl.c @@ -77,12 +77,14 @@ static DH *generate_dh_parameters(int prime_len, int generator); static DH *tmp_dh_cb(SSL *s, int is_export, int keylength); static int verify_cb(int, X509_STORE_CTX *); static void info_cb(const SSL *ssl, int type, int args); -static void initialize_ecdh(void); +static SSL_CTX *initialize_context(void); +static bool initialize_ecdh(SSL_CTX *context); static const char *SSLerrmessage(unsigned long ecode); static char *X509_NAME_to_cstring(X509_NAME *name); static SSL_CTX *SSL_context = NULL; +static bool SSL_initialized = false; /* */ /* Hardcoded values */ @@ -157,14 +159,12 @@ KWbuHn491xNO25CQWMtem80uKw+pTnisBRF/454n1Jnhub144YRBoN8CAQI=\n\ /* * Initialize global SSL context. */ -void +int be_tls_init(void) { - struct stat buf; + SSL_CTX *context; - STACK_OF(X509_NAME) *root_cert_list = NULL; - - if (!SSL_context) + if (!SSL_initialized) { #ifdef HAVE_OPENSSL_INIT_SSL OPENSSL_init_ssl(OPENSSL_INIT_LOAD_CONFIG, NULL); @@ -173,177 +173,28 @@ be_tls_init(void) SSL_library_init(); SSL_load_error_strings(); #endif - - /* - * We use SSLv23_method() because it can negotiate use of the highest - * mutually supported protocol version, while alternatives like - * TLSv1_2_method() permit only one specific version. Note that we - * don't actually allow SSL v2 or v3, only TLS protocols (see below). - */ - SSL_context = SSL_CTX_new(SSLv23_method()); - if (!SSL_context) - ereport(FATAL, - (errmsg("could not create SSL context: %s", - SSLerrmessage(ERR_get_error(); - - /* - * Disable OpenSSL's moving-write-buffer sanity check, because it - * causes unnecessary failures in nonblocking send cases. 
- */ - SSL_CTX_set_mode(SSL_context, SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER); - - /* - * Load and verify server's certificate and private key - */ - if (SSL_CTX_use_certificate_chain_file(SSL_context, - ssl_cert_file) != 1) - ereport(FATAL, - (errcode(ERRCODE_CONFIG_FILE_ERROR), - errmsg("could not load server certificate file \"%s\": %s", - ssl_cert_file, SSLerrmessage(ERR_get_error(); - - if (stat(ssl_key_file, ) != 0) - ereport(FATAL, - (errcode_for_file_access(), - errmsg("could not access private key file \"%s\": %m", - ssl_key_file))); - - if (!S_ISREG(buf.st_mode)) - ereport(FATAL, - (errcode(ERRCODE_CONFIG_FILE_ERROR), - errmsg("private key file \"%s\" is not a regular file", - ssl_key_file))); - - /* - * Refuse to load files owned by users other than us or root. - * - * XXX surely we can check this on Windows somehow, too. - */ -#if !defined(WIN32) && !defined(__CYGWIN__) - if (buf.st_uid != geteuid() && buf.st_uid != 0) - ereport(FATAL, - (errcode(ERRCODE_CONFIG_FILE_ERROR), - errmsg("private key file \"%s\" must be owned by the database user or root", - ssl_key_file))); -#endif - - /* - * Require no public access to key file. If the file is owned by us, - * require mode 0600 or less. If owned by root, require 0640 or less - * to allow read access through our gid, or a supplementary gid that - * allows to read system-wide certificates. - * - * XXX temporarily suppress check when on Windows, because there may - * not be proper support for Unix-y file permissions. Need to think - * of a reasonable check to apply on Windows. 
(See also the data - * directory permission check in postmaster.c) - */ -#if !defined(WIN32) && !defined(__CYGWIN__) - if ((buf.st_uid == geteuid() && buf.st_mode & (S_IRWXG | S_IRWXO)) || - (buf.st_uid == 0 && buf.st_mode & (S_IWGRP | S_IXGRP | S_IRWXO))) - ereport(FATAL, - (errcode(ERRCODE_CONFIG_FILE_ERROR), - errmsg("private key file \"%s\" has group or world access", - ssl_key_file), - errdetail("File must have permissions u=rw (0600) or less if owned by the database user, or permissions u=rw,g=r (0640) or less if owned by root."))); -#endif - - if (SSL_CTX_use_PrivateKey_file(SSL_context, - ssl_key_file, - SSL_FILETYPE_PEM) != 1) - ereport(FATAL, - (errmsg("could not load private key file \"%s\": %s", - ssl_key_file, SSLerrmessage(ERR_get_error(); - - if (SSL_CTX_check_private_key(SSL_context) != 1) - ereport(FATAL, - (errmsg("check of private key failed: %s", - SSLerrmessage(ERR_get_error(); - } - - /* set up ephemeral DH keys, and disallow SSL v2/v3 while at it */ - SSL_CTX_set_tmp_dh_callback(SSL_context, tmp_dh_cb); - SSL_CTX_set_options(SSL_context, - SSL_OP_SINGLE_DH_USE | - SSL_OP_NO_SSLv2 | SSL_OP_NO_SSLv3); - - /* set
Re: [HACKERS] Patch to implement pg_current_logfile() function
Le 31/10/2016 à 04:35, Karl O. Pinc a écrit : > Hi Gilles, > > On Sat, 29 Oct 2016 22:00:08 +0200 > Gilles Darold wrote: > >> The attached v10 of the current_logfiles patch > Attached is a patch to apply on top of the v10 patch. > > It updates current_logfiles only once per log rotation. > I see no reason to open and write the file twice > if both csvlog and stderr logging is happening. I did not take the trouble to do that because log rotation is not something that occurs very often, and I don't want to change the conditional "time_based_rotation" code lines in logfile_rotate(), for readability. My thought was also that under high load, log rotation is automatically disabled, so logfile_writename() is not called and there will be no additional I/O. But why not: if committers want to save this additional I/O, this patch can be applied. > I have 2 more questions. > > I don't understand why you're calling > logfile_writename(last_file_name, last_csv_file_name); > in the SIGHUP code. Presumably you've already > written the old logfile names to current_logfiles. > On SIGHUP you want to write the new names, not > the old ones. But the point of the SIGHUP code > is to look for changes in logfile writing and then > call logfile_rotate() to make those changes. And > logfile_rotate() calls logfile_writename() as appropriate. > Shouldn't the logfile_writename call in the SIGHUP > code be eliminated? No: when you change the log_destination and reload the configuration, you have to refresh the content of current_logfiles; in this case no new rotation has been done, and you have to write last_file_name and/or last_csv_file_name. > Second, and I've not paid really close attention here, > you're calling logfile_writename() with last_file_name > and last_csv_file_name in a number of places. Are you > sure that last_file_name and last_csv_file_name is reset > to the empty string if stderr/csvlog logging is turned > off and the config file re-read?
You might be recording > that logging is going somewhere that's been turned off > by a config change. I've not noticed last_file_name and > (especially) last_csv_file_name getting reset to the empty > string. In the patch I do not care whether the values of last_file_name and last_csv_file_name have been reset or not, because depending on the log_destination they are or are not written to current_logfiles. When they are written, they always contain the current log filename, because the call to logfile_writename() always appears after their assignment to the new rotated filename. > FYI, ultimately, in order to re-try writes to current_logfiles > after ENFILE and EMFILE failure, I'm thinking that I'll probably > wind up with logfile_writename() taking no arguments. It will > always write last_file_name and last_csv_file_name. Then it's > a matter of making sure that these variables contain accurate > values. It would be helpful to let me know if you have any > insight regards config file re-read and resetting of these > variables before I dive into writing code which retries writes to > current_logfiles. As explained above, last_file_name and last_csv_file_name always contain the last log file name, even in case of repeated logfile_writename() failures. Those variables might have been updated if log rotation occurs before a new call to logfile_writename(). In case of ENFILE and EMFILE failure, log rotation is disabled and logfile_writename() is not called; last_file_name and last_csv_file_name will still contain the last log file name. -- Gilles Darold Consultant PostgreSQL http://dalibo.com - http://dalibo.org -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Dumb mistakes in WalSndWriteData()
Hi, I^Wsomebody appears to have made a number of dumb mistakes in WalSndWriteData(), namely: 1) The timestamp is set way too late, after pq_putmessage_noblock(). That'll sometimes work, sometimes not. I have no idea how that ended up happening. It's eye-wateringly dumb. 2) We only do WalSndKeepaliveIfNecessary() if we're blocked on socket IO. But on a long-lived connection that might be a lot of data; we should really do that once *before* trying to send the payload in the first place. 3) Similarly to 2), it might be worthwhile checking for interrupts every time, not just when blocked on network IO. See also: http://archives.postgresql.org/message-id/CAMsr%2BYE2dSfHVr7iEv1GSPZihitWX-PMkD9QALEGcTYa%2Bsdsgg%40mail.gmail.com Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical decoding and walsender timeouts
Hi, On 2016-10-31 16:34:38 +0800, Craig Ringer wrote: > TL;DR: Logical decoding clients need to generate their own keepalives > and not rely on the server requesting them to prevent timeouts. Or > admins should raise the wal_sender_timeout by a LOT when using logical > decoding on DBs with any big rows. Unconvinced. > When sending a big message, WalSndWriteData() notices that it's > approaching timeout and tries to send a keepalive request, but the > request just gets buffered behind the remaining output plugin data and > isn't seen by the client until the client has received the rest of the > pending data. Only for individual messages, not the entire transaction though. Are you sure the problem at hand is that we're sending a keepalive, but it's too late? It might very well be that the actual issue is that we're never sending keepalives, because the network is fast enough / the tcp window is large enough. IIRC we only send a keepalive if we're blocked on network IO? > So: We could ask output plugins to deal with this for us, by chunking > up their data in small pieces and calling OutputPluginPrepareWrite() > and OutputPluginWrite() more than once per output plugin callback if > they expect to send a big message. But this pushes the complexity of > splitting up and handling big rows, and big Datums, onto each plugin. > It's awkward to do well and hard to avoid splitting things up > unnecessarily. There's decent reason for doing that independently though, namely that it's a lot more efficient from a memory management POV. I don't think the "unrequested keepalive" approach really solves the problem on a fundamental enough level. > (A separate issue is that we can also time out when doing logical > _replication_ if the downstream side blocks on a lock, since it's not > safe to send on a socket from a signal handler ... ) That's strictly speaking not true. write() / sendmsg() are signal safe functions. 
There are good reasons not to do that, however, namely that the non-signal-handler code might be busy writing data itself. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Logical decoding and walsender timeouts
Hi all I've been investigating some intermittent timeout issues I've seen with BDR users in the wild and I think I've identified an issue with how logical decoding sends data and handles feedback. TL;DR: Logical decoding clients need to generate their own keepalives and not rely on the server requesting them to prevent timeouts. Or admins should raise the wal_sender_timeout by a LOT when using logical decoding on DBs with any big rows. An output plugin accumulates data in a StringInfo which is then sent by WalSndWriteData(), which does nonblocking sends, checks for interrupts, processes client replies, and requests keepalives if approaching walsender timeout. A problem arises when the client's buffered data is big enough that it takes more than wal_sender_timeout to transmit the whole chunk and get a reply from the client on the link. A slow link, a big row, or some badly timed packet loss can easily result in timeout. Unlike in physical replication, a timeout on a logical decoding session can be expensive: * We have to restart logical decoding at restart_lsn, which can be a LONG time ago if there are any long running tx's, and decode up to the last confirmed xact; * We then have to resend the whole xact we were working on when we timed out, which might be quite big. so we should do our best to avoid timeouts in logical decoding sessions. Physical replication chunks data into small pieces and can restart cheaply, but logical decoding will send a 100MB+ row as a single message and have to re-read hundreds of GB of WAL and resend gigabytes of data on the wire to get back where it started after a timeout. When sending a big message, WalSndWriteData() notices that it's approaching timeout and tries to send a keepalive request, but the request just gets buffered behind the remaining output plugin data and isn't seen by the client until the client has received the rest of the pending data. 
The keepalive requests are useful so long as no one message takes more than wal_sender_timeout to send since it'll defer timeout and give us enough time to process the next message. But they won't prevent timeout within a single big message. How to fix this? We can't inject keepalive requests in the middle of a CopyData message. After the cancel-message discussion I know better than to suggest using out-of-band tcp data and SIGURG. So: We could ask output plugins to deal with this for us, by chunking up their data in small pieces and calling OutputPluginPrepareWrite() and OutputPluginWrite() more than once per output plugin callback if they expect to send a big message. But this pushes the complexity of splitting up and handling big rows, and big Datums, onto each plugin. It's awkward to do well and hard to avoid splitting things up unnecessarily. We could extend the logical decoding protocol slightly, so that the CopyData messages that carry output plugin payloads are limited to some fixed, reasonable size (1MB? 10MB?) where the CopyData message overhead is insignificant and most rows won't need to be reassembled from multiple CopyData messages by the downstream. This gives us gaps in which we can inject keepalives / status updates during big messages so we get the client replies and process them. This won't fix existing releases though, and makes clients reassemble the chunks. It also means clients can't know the full size of a message based on the CopyData length anymore so it'll force clients that rely on the CopyData message length to add their own length fields in the payload. CopyData doesn't have anything we can use as a "continuation message follows" flag. We could permit the walsender to reset the keepalive timer whenever it confirms that it's sent a reasonable chunk of data such that it's unlikely to just be lurking in the kernel's socket send buffer or various routers in transit. 
We don't care that the client received up to any particular point, only that it's alive and making progress. We know the client must be receiving and acking it since we're using TCP. The problem is knowing how much sent data is "enough" to mean the client must have received some of it. The socket send buffer is easy to check and bounded, but TCP window scaling means there can be a huge amount of un-acked data in transit. AFAIK we have no visibility into the network stack's view of what's acked, at least portably. Linux has TCP_INFO but it's hard to use and nonportable. So we can't assume that even successfully sending 1GB of data means the client got data 1GB ago, unless we clamp the congestion window or do other possibly-performance-affecting things. Or we could expect logical decoding downstream clients to send their own proactive feedback when receiving and processing large messages. The walsender is already capable of noticing and processing such replies, just not requesting them when it should. Again this pushes complexity onto each client, but less so than requiring output plugins to chunk up their messages. Also, since each
Re: [HACKERS] Patch to implement pg_current_logfile() function
Le 30/10/2016 à 08:04, Karl O. Pinc a écrit : > On Sat, 29 Oct 2016 22:00:08 +0200 > Gilles Darold wrote: > >> The attached v10 of the current_logfiles patch includes your last >> changes on documentation but not the patch on v9 about the >> user-supplied GUC value. I think the v10 patch is ready for committers >> and that the additional patch to add src/include/utils/guc_values.h >> to define user GUC values is something that needs to be taken outside >> this one. Imo, those GUC values (stderr, csvlog) are not expected to >> change so often as to require a global definition, but why not; if >> committers think this must be done I can add it to a v11 patch. > I agree that the guc_values.h patches should be submitted separately > to the committers, when your patch is submitted. Creating symbols > for these values is a matter of coding style they should resolve. > (IMO it's not whether the values will change, it's whether > someone reading the code can read the letters "stdout" and know > to what they refer and where to find related usage elsewhere in > the code.) > > I'll keep up the guc_values.h patches and have them ready for > submission along with your patch. > > I don't think your patch is quite ready for submission to > the committers. > > Attached is a patch to be applied on top of your v10 patch > which does basic fixup to logfile_writename(). The attached v11 patch includes your patch. > > I have some questions about logfile_writename(): > > Why does the logfile_open() call fail silently? > Why not use ereport() here? (With a WARNING level.) logfile_open() "fails silently" in logfile_writename(), like in other parts of syslogger.c where it is called, because this is a function that already reports a message when an error occurs ("could not open log file..."). I think I have already replied to this question. > Why do the ereport() invocations all have a LOG level? > You're not recovering and the errors are unexpected > so I'd think WARNING would be the right level.
> (I previously, IIRC, suggested LOG level -- only if > you are retrying and recovering from an error.) If you look deeper in syslogger.c, all messages are reported at LOG level. I guess the reason is to not call syslogger repeatedly. I think I have already replied to this question in the thread too. > Have you given any thought to my proposal to change > CURRENT_LOG_FILENAME to LOG_METAINFO_FILE? Yes; I don't think the information logged in this file is meta information, and CURRENT_LOG_FILENAME seems obvious. -- Gilles Darold Consultant PostgreSQL http://dalibo.com - http://dalibo.org diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index adab2f8..645cbb9 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4143,6 +4143,12 @@ SELECT * FROM parent WHERE key = 2400; server log +When logs are written to the file-system their paths, names, and +types are recorded in +the file. This provides +a convenient way to find and access log content without establishing a +database connection. + Where To Log diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 2e64cc4..f5bfe59 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -15443,6 +15443,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n); + pg_current_logfile() + text + primary log file name in use by the logging collector + + + + pg_current_logfile(text) + text + log file name, of log in the requested format, in use by the + logging collector + + + pg_my_temp_schema() oid OID of session's temporary schema, or 0 if none @@ -15660,6 +15673,34 @@ SET search_path TO schema , schema, .. the time when the postmaster process re-read the configuration files.) + +pg_current_logfile + + + +Logging +pg_current_logfile function + + + +pg_current_logfile returns, as text, +the path of either the csv or stderr log file currently in use by the +logging collector. This is a path including the + directory and the log file name.
+Log collection must be active or the return value +is NULL. When multiple logfiles exist, each in a +different format, pg_current_logfile called +without arguments returns the path of the file having the first format +found in the ordered +list: stderr, csvlog. +NULL is returned when no log file has any of these +formats. To request a specific file format supply, +as text, either csvlog +or stderr as the value of the optional parameter. The +return value is NULL when the log format requested +is not a configured . + + pg_my_temp_schema diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml index fddb69b..1cfb9da 100644 --- a/doc/src/sgml/storage.sgml +++
Re: [HACKERS] sequential scan result order vs performance
Jim Nasby wrote: > On 10/30/16 9:12 AM, Tom Lane wrote: >> I think there will be a lot of howls. People expect that creating >> a table by inserting a bunch of rows, and then reading back those >> rows, will not change the order. We already futzed with that guarantee >> a bit with syncscans, but that only affects quite large tables --- and >> even there, we were forced to provide a way to turn it off. > > Leaving a 30% performance improvement on the floor because some people > don't grok how sets work seems insane to me. +1 Yours, Laurenz Albe -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Query regarding selectDumpableExtension()
On Fri, Oct 28, 2016 at 6:22 PM, Robert Haas wrote: > On Thu, Oct 27, 2016 at 2:11 AM, amul sul wrote: >> selectDumpableExtension() skips the dump of >> built-in extensions in case of binary-upgrade only; >> why not in a normal dump? >> Can't we assume those will already be installed in the target database >> at restore? Apologise for the delayed response. > There's a comment in dumpExtension() that explains it. > > /* > * In a regular dump, we use IF NOT EXISTS so that there isn't a > * problem if the extension already exists in the target database; > * this is essential for installed-by-default extensions such as > * plpgsql. Understood. > * In binary-upgrade mode, that doesn't work well, so instead we skip > * built-in extensions based on their OIDs; see > * selectDumpableExtension. > */ I don't see the necessity of dumping it in normal mode, unless I'm missing something. Let me explain the case I'm trying to tackle. I have two old dumps; each of them has a couple of objects that depend on plpgsql. I restored the first dump and am trying to restore the second dump using the 'pg_restore -c' command; it is failing with the following error: ERROR: cannot drop extension plpgsql because other objects depend on it It works well without the '-c' option, but that is not a general solution, IMHO. Regards, Amul -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers