Re: [HACKERS] Expanding the use of FLEXIBLE_ARRAY_MEMBER for declarations like foo[1]
On Tue, Feb 17, 2015 at 9:02 AM, Andres Freund wrote:

On 2015-02-17 05:51:22 +0900, Michael Paquier wrote:

-- inv_api.c uses bytea in an internal structure without putting it at the end of the structure. For clarity I think that we should just use a bytea pointer and do a sufficient allocation for the duration of the lobj manipulation.

Hm. I don't really see the problem tbh. There actually is a reason the bytea is at the beginning - the other fields *are* the data part of the bytea (which is just the varlena header).

-- Similarly, tuptoaster.c needed to be patched for toast_save_datum

I'm not a big fan of these two changes. This adds some not-that-small memory allocations to a somewhat hot path, without a big win in clarity. And it doesn't seem to have anything to do with FLEXIBLE_ARRAY_MEMBER.

We could replace those palloc calls with a simple buffer that has a predefined size, but this somewhat reduces the readability of the code.

There are as well a couple of things that are not changed on purpose:

- in namespace.h for FuncCandidateList. I tried manipulating it but it resulted in allocation overflow in PortalHeapMemory

Did you change the allocation in FuncnameGetCandidates()? Note the - sizeof(Oid) there.

Yeah. Missed it.

- I don't think that the t_bits fields in htup_details.h should be updated either.

Why not? Any not-broken code should already use MinHeapTupleSize and similar macros.

Changing t_bits impacts HeapTupleHeaderData, ReorderBufferTupleBuf and similarly a couple of redo routines in heapam.c using HeapTupleHeaderData in a couple of structures not placing it at the end (the compiler complains). We could use for each of them a buffer that has enough room with sizeof(HeapTupleHeaderData) + MaxHeapTupleSize, but wouldn't that reduce the readability of the current code? Opinions welcome.

diff --git a/src/include/catalog/pg_attribute.h b/src/include/catalog/pg_attribute.h
[...]

Generally the catalog changes are much less clearly a benefit imo.

OK, let's drop them then.

From ad8f54cdb5776146f17d1038bb295b5f13b549f1 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 16 Feb 2015 03:53:38 +0900
Subject: [PATCH 3/3] Update varlena in c.h to use FLEXIBLE_ARRAY_MEMBER

Places using a variable-length variable not at the end of a structure are changed with a workaround, without impacting what those features do.

I vote for rejecting most of this, except a corrected version of the pg_authid change. Which doesn't really belong to the rest of the patch anyway ;)

Changing bytea to use FLEXIBLE_ARRAY_MEMBER requires those changes, or at least some changes in this area, as something with FLEXIBLE_ARRAY_MEMBER can only be placed at the end of a structure. But I guess that we can do fine as well by replacing those structures with some buffers with a pre-defined size. I'll draft an additional patch on top of 0001 with all those less-trivial changes implemented.

#define VARHDRSZ ((int32) sizeof(int32))

diff --git a/src/include/catalog/pg_authid.h b/src/include/catalog/pg_authid.h
index e01e6aa..d8789a5 100644
--- a/src/include/catalog/pg_authid.h
+++ b/src/include/catalog/pg_authid.h
@@ -56,8 +56,10 @@ CATALOG(pg_authid,1260) BKI_SHARED_RELATION BKI_ROWTYPE_OID(2842) BKI_SCHEMA_MAC
 	int32		rolconnlimit;	/* max connections allowed (-1=no limit) */

 	/* remaining fields may be null; use heap_getattr to read them! */
-	text		rolpassword;	/* password, if any */
 	timestamptz rolvaliduntil;	/* password expiration time, if any */
+#ifdef CATALOG_VARLEN
+	text		rolpassword;	/* password, if any */
+#endif
 } FormData_pg_authid;

That change IIRC is wrong, because it'll make rolvaliduntil NOT NULL (any column that's fixed width and has only fixed-width columns before it is marked as such).

This sounds better as a separate patch...
--
Michael
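To make the constraint concrete: a minimal, compilable sketch (illustrative names, not the committed PostgreSQL code) of why a flexible array member pins a field to the end of its struct:

	/* Minimal sketch; illustrative, not the actual c.h definitions. */
	#define FLEXIBLE_ARRAY_MEMBER	/* empty on C99 compilers, 1 otherwise */

	#include <stdlib.h>
	#include <stddef.h>

	typedef struct my_varlena
	{
		char	vl_len_[4];							/* 4-byte length word */
		char	vl_dat[FLEXIBLE_ARRAY_MEMBER];		/* data follows header */
	} my_varlena;

	/*
	 * A flexible array member has no size, so nothing may be laid out
	 * after it: the compiler rejects it anywhere but at the end of a
	 * struct.  Allocation must account for the payload explicitly:
	 */
	static my_varlena *
	alloc_varlena(size_t payload_len)
	{
		return malloc(offsetof(my_varlena, vl_dat) + payload_len);
	}

This is why structures that embed such a member in the middle (as inv_api.c and tuptoaster.c effectively do) need a workaround or a fixed-size buffer instead.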
Re: [HACKERS] pg_basebackup may fail to send feedbacks.
Hello, this is the last patch for pg_basebackup/pg_receivexlog on master (9.5). Prior versions don't have this issue.

4. basebackup_reply_fix_mst_v2.patch
receivelog.c patch applicable on master. This is based on the same design as walrcv_reply_fix_91_v2.patch in the aspect of gettimeofday().

regards,

At Tue, 17 Feb 2015 19:45:19 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote in 20150217.194519.58137941.horiguchi.kyot...@lab.ntt.co.jp

Thank you for the comment. Three new patches are attached. I forgot to give a revision number on the previous patch, but I think this is the 2nd version.

1. walrcv_reply_fix_94_v2.patch
Walreceiver patch applicable on master/REL9_4_STABLE/REL9_3_STABLE

2. walrcv_reply_fix_92_v2.patch
Walreceiver patch applicable on REL9_2_STABLE

3. walrcv_reply_fix_91_v2.patch
Walreceiver patch applicable on REL9_1_STABLE

At Sat, 14 Feb 2015 03:01:22 +0900, Fujii Masao masao.fu...@gmail.com wrote in cahgqgwhz_4ywyoka+5ks9s_698adjh8+0hamcsc9wyfh37g...@mail.gmail.com

On Tue, Feb 10, 2015 at 7:48 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote:

Introduce the maximum number of records that we can receive and process between feedbacks? For example, we can change XLogWalRcvSendHSFeedback() so that it's called per at least 8 records. I'm not sure what number is good, though...

At the beginning of the while (len > 0) { if (len > 0) { block, last_recv_timestamp is always kept up to date (by using gettimeofday()). So we can use that variable instead of gettimeofday() in my previous proposal.

Yes, but something like last_recv_timestamp doesn't exist in REL9_1_STABLE. So you don't think that your proposed fix should be back-patched to all supported versions. Right?

Versions back to 9.3 have the loop with the same structure. 9.2 is in a bit different shape but looks to have the same issue. 9.1 and 9.0 have the same steps as 9.2, but 9.0 doesn't have a wal sender timeout and 9.1 does not have a variable holding the current time.

9.3 and later: the previous patch works, but revised as per your comment.

9.2: The time of the last reply is stored in the file-scope variable reply_message, and the current time is stored in walrcv->lastMsgReceiptTime. The timeout is determined using these variables.

9.1: walrcv doesn't have lastMsgReceiptTime. The current time should be acquired explicitly in the innermost loop. I tried calling gettimeofday() once per several loops. The skip count is determined by the anticipated worst duration to process a WAL record and wal_receiver_status_interval. However, I doubt the necessity of doing so: the value of the worst duration is simply a random guess.

9.0: No point in doing this kind of fix.

The start time of the timeout could be last_recv_timestamp at the beginning of the while loop, since it is a bit earlier than the actual time at that point.

Sounds strange to me. I think that last_recv_timestamp should be compared with the time when the last feedback message was sent, e.g., maybe you can expose sendTime in XLogWalRcvSendReply() as a global variable, and use it to compare with last_recv_timestamp.

You're right that that would be exactly the right thing to do, but, as you mentioned, exposing sendTime is required to do so. The solution I proposed is the second best, assuming that wal_sender_timeout is recommended to be long enough (several times longer) than wal_receiver_status_interval. If no one minds exposing sendTime, it is the best solution. The attached patch does it.

# The added comment in the patch was wrong, though.
When we get out of the WAL receive loop due to the timeout check which your patch added, XLogWalRcvFlush() is always executed. I don't think that the current WAL file needs to be flushed at that time.

Thank you for pointing it out. Run XLogWalRcvSendReply(force = true) immediately instead of breaking from the loop.

Another solution would be using RegisterTimeout/enable_timeout_after, and it seemed to work for me. It could throw out the time-counting stuff in the loop we are talking about and that of XLogWalRcvSendHSFeedback and XLogWalRcvSendReply, but it might be a bit too large for the gain.

Yes, sounds overkill. However, gettimeofday() for each recv is also overkill for most cases.

I'll post the patches for receivelog.c based on the patch for the 9.1 wal receiver.

--
Kyotaro Horiguchi NTT Open Source Software Center

From cfe01eadd4d8567f4410bccb8334c52fc897c002 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp
Date: Wed, 18 Feb 2015 12:30:58 +0900
Subject: [PATCH] Make pg_basebackup and pg_receivexlog keep sending WAL receive feedback regularly on heavy load v2.

pg_basebackup and pg_receivexlog fail to send the receiver reply message while they are receiving a continuous WAL stream caused by heavy load or something else. This patch makes
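For readers following the thread, here is a rough, hypothetical sketch of the loop shape being discussed - reusing the timestamp taken when data arrived rather than calling gettimeofday() once per record. Names are illustrative, not the actual receivelog.c code:

	#include <stdbool.h>
	#include <stdint.h>

	typedef int64_t TimestampTz;			/* microseconds, as in PostgreSQL */

	/* Illustrative stand-ins for receivelog.c state; not the real code. */
	extern TimestampTz last_recv_timestamp;	/* set when data last arrived */
	extern TimestampTz last_reply_timestamp;/* set when a reply was last sent */
	extern int	standby_message_timeout;	/* reply interval, in seconds */
	extern void sendFeedback(bool force);	/* hypothetical reply routine */

	static void
	maybe_send_feedback(void)
	{
		/*
		 * last_recv_timestamp was refreshed when the data was received,
		 * so it is cheap to reuse here, at the cost of being slightly
		 * stale while a long run of records is processed.
		 */
		if (standby_message_timeout > 0 &&
			last_recv_timestamp - last_reply_timestamp >=
			(TimestampTz) standby_message_timeout * 1000000)
		{
			sendFeedback(true);		/* force = true; stay in the loop, so
									 * no spurious XLogWalRcvFlush() */
			last_reply_timestamp = last_recv_timestamp;
		}
	}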
Re: [HACKERS] How about to have relnamespace and relrole?
Sorry, I sent the previous mail without patches by accident. The patches are attached to this mail.

Hello, this is the patchset v2 of this feature.

0001-Add-regrole.patch
Adding regrole, as the name says.

0002-Add-regnamespace.patch
Adding regnamespace. This depends on the 0001 patch.

0003-Check-new-reg-types-are-not-used-as-default-values.patch
Inhibiting the new OID alias types from being used as constants in default values, which would make dependencies, unlike the other (existing) OID alias types.

0004-Documentation-for-new-OID-alias-types.patch
Documentation patch for these new types.

regards,

At Tue, 17 Feb 2015 20:19:11 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote in 20150217.201911.239516619.horiguchi.kyot...@lab.ntt.co.jp

Hello, thank you for the comment. The current patch lacks changes in documentation and the dependency stuff. The current framework doesn't consider changing pg_shdepend from column default expressions, so the possible measures are the following, I think.

1. Make pg_shdepend have refobjsubid and add some deptypes of pg_depend; specifically, DEPENDENCY_NORMAL is needed. Then, change the dependency stuff.

2. Simply inhibit specifying them in default expressions. Integer or OID can act as the replacement.

=# create table t1 (a int, b regrole default 'horiguti'::regrole);
ERROR:  Type 'regrole' cannot have a default value

1 is quite overkill and hardly seems usable. So I will go with 2 and post an additional patch.

At Thu, 12 Feb 2015 17:12:24 -0600, Jim Nasby jim.na...@bluetreble.com wrote in 54dd3358.9030...@bluetreble.com

On 2/12/15 5:28 AM, Kyotaro HORIGUCHI wrote:

Hello, I changed the subject.

# Ouch! the subject remains wrong..

1. Expected frequency of use ... For that reason, although the current policy of deciding whether to have reg* types seems to be whether they have schema-qualified names, I think regrole and regnamespace are valuable to have.

Perhaps schema qualification was the original intent, but I think at this point everyone uses them for convenience. Why would I want to type out JOIN pg_namespace n ON n.oid = blah.???namespace when I could simply do ???namespace::regnamespace?

Yes, that is the most important point of this patch.

2. Anticipated un-optimizability

Tom pointed out that these reg* types prevent the planner from optimizing the query, so we should refrain from having such machinery. It should have been a long-standing issue, but reg*s seem to be rather faster than joining the corresponding catalogs for a moderate number of result rows; this, however, raises another, more annoying issue.

3. Potential breakage of MVCC

The other issue Tom pointed out is potential breakage of MVCC by using these reg* types. The syscache is invalidated on updates, so this doesn't seem to be a problem in READ COMMITTED mode, but it breaks SERIALIZABLE mode. But IMHO it is not so serious a problem as long as such inconsistency occurs only on system catalogs and it is explicitly documented together with the un-optimizability issue. I'll add a patch for this later.

The way I look at it, all these arguments went out the window when regclass was introduced. The reality is that reg* is *way* more convenient than manual joins. The arguments about optimization and MVCC presumably apply to all reg* casts, which means that ship has long since sailed.

I agree basically, but I think some caveat is needed. I'll include that in the coming documentation patch.

If we offered views that made it easier to get at this data then maybe this wouldn't matter so much.
But DBAs don't want to join 3 catalogs together to try and answer a simple question. It has been annoying me these days.

--
Kyotaro Horiguchi NTT Open Source Software Center

From 2a6f689afdc8197c2fe2fc235a4819ce1a5e9928 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp
Date: Wed, 18 Feb 2015 14:38:32 +0900
Subject: [PATCH 1/4] Add regrole

Add the new reg type regrole. For development reasons, the system OIDs used for the new procs, types and casts leave a gap from the existing OIDs.
---
 src/backend/bootstrap/bootstrap.c     |   2 +
 src/backend/utils/adt/regproc.c       | 101 ++
 src/backend/utils/adt/selfuncs.c      |   2 +
 src/backend/utils/cache/catcache.c    |   1 +
 src/include/catalog/pg_cast.h         |   7 +++
 src/include/catalog/pg_proc.h         |  10 
 src/include/catalog/pg_type.h         |   5 ++
 src/include/utils/builtins.h          |   5 ++
 src/test/regress/expected/regproc.out |  26 -
 src/test/regress/sql/regproc.sql      |   7 +++
 10 files changed, 165 insertions(+), 1 deletion(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index bc66eac..11e40ee 100644
--- a/src/backend/bootstrap/bootstrap.c
+++
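As a concrete illustration of the convenience argument above (using the syntax these patches propose; regrole and regnamespace do not exist without them):

	-- Today, mapping OIDs to names takes explicit joins:
	SELECT c.relname, n.nspname, r.rolname
	  FROM pg_class c
	  JOIN pg_namespace n ON n.oid = c.relnamespace
	  JOIN pg_roles r ON r.oid = c.relowner;

	-- With the proposed OID alias types it collapses to:
	SELECT relname, relnamespace::regnamespace, relowner::regrole
	  FROM pg_class;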
Re: [HACKERS] Combining Aggregates
Hi Rowley,

Let me put some comments on this patch. The patch itself looks good as infrastructure towards the big picture; however, we still haven't reached consensus on how combine functions are to be used instead of the usual transition functions.

An aggregate function usually consumes one or more values extracted from a tuple, then accumulates its internal state according to the arguments. The existing transition function updates its internal state on the assumption of one function call per record. On the other hand, the new combine function updates its internal state with partially aggregated values produced by a preprocessing node.

An aggregate function is represented with an Aggref node in the plan tree; however, we have no certain way to determine which function shall be called to update the internal state of the aggregate. For example, avg(float) has an internal state of float[3] type, holding the number of rows, sum of X and sum of X^2. If the combine function updates its internal state with partially aggregated values, its argument should be float[3]. That is obviously incompatible with float8_accum(float), the transition function of avg(float). I think we need a new flag on the Aggref node to inform the executor which function shall be called to update the internal state of the aggregate. The executor cannot decide it without this hint.

Also, do you have any ideas about pushing down aggregate functions across joins? Even though it is somewhat old research, I found a systematic approach to pushing down aggregates across joins:
https://cs.uwaterloo.ca/research/tr/1993/46/file.pdf
I think it would be great if the core functionality supported this query rewriting based on cost estimation, without external modules. What are your opinions?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com

-Original Message-
From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of David Rowley
Sent: Friday, December 19, 2014 8:39 PM
To: Simon Riggs
Cc: Kaigai Kouhei(海外 浩平); PostgreSQL-development; Amit Kapila
Subject: Re: [HACKERS] Combining Aggregates

On 18 December 2014 at 02:48, Simon Riggs si...@2ndquadrant.com wrote:

David, if you can update your patch with some docs to explain the behaviour, it looks complete enough to think about committing it in early January, to allow other patches that depend upon it to stand a chance of getting into 9.5. (It is not yet ready, but I see it could be.)

Sure, I've more or less added the docs from your patch. I still need to trawl through and see if there's anywhere else that needs some additions.

The above example is probably the best description of the need, since user defined aggregates must also understand this. Could we please call these combine functions or other? MERGE is an SQL Standard statement type that we will add later, so it will be confusing if we use the merge word in this context.

Ok, changed.

David, your patch avoids creating any mergefuncs for existing aggregates. We would need to supply working examples for at least a few of the builtin aggregates, so we can test the infrastructure. We can add examples for all cases later. In addition to MIN(), MAX(), BIT_AND(), BIT_OR(), SUM() for floating point types, cash and interval.

I've now added combine functions for count(*) and count(col). It seems that int8pl() is suitable for this. Do you think it's worth adding any new functions at this stage, or is it best to wait until there's a patch which can use these?
Regards
David Rowley
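To make the incompatibility point above concrete, a hypothetical sketch of what a combine step for avg(float8) would do with its float8[3] state {N, sum(X), sum(X*X)} - merging two partial states rather than folding in one row per call as float8_accum() does:

	/* Hypothetical sketch, not the actual patch code. */
	static void
	avg_float8_combine(double state1[3], const double state2[3])
	{
		state1[0] += state2[0];		/* row counts add */
		state1[1] += state2[1];		/* sums add */
		state1[2] += state2[2];		/* sums of squares add */
	}

The argument type mismatch is visible here: the transition function takes a single float8 value, while the combine step must take a whole float8[3] state, which is why the executor needs a hint about which one to call.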
Re: [HACKERS] Odd behavior of updatable security barrier views on foreign tables
On 2015/02/18 7:44, Stephen Frost wrote:

* Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote:

On 2015/02/11 4:06, Stephen Frost wrote:

I had been trying to work out an FDW-specific way to address this, but I think Dean's right that this should be addressed in expand_security_qual(), which means it'll apply to all cases and not just these FDW calls. I don't think that's actually an issue though and it'll match up to how SELECT FOR UPDATE is handled today.

Sorry, my explanation was not accurate, but I also agree with Dean's idea. In the above, I just wanted to make it clear that such a lock request done by expand_security_qual() should be limited to the case where the relation that is a former result relation is a foreign table.

Attached is a patch which should address this. Would love your (or anyone else's) feedback on it. It appears to address the issue which you raised and the regression test changes are all in-line with inserting a LockRows into the subquery, as anticipated.

I've looked into the patch.

* The patch applies to the latest head, 'make' passes successfully, but 'make check' fails in the rowsecurity test.

* I found one place in expand_security_qual that I'm concerned about:

+		if (targetRelation)
+			applyLockingClause(subquery, 1, LCS_FORUPDATE,
+							   false, false);

ISTM that it'd be better to use LockWaitBlock as the fourth argument of applyLockingClause.

Other than that, the patch looks good to me. Thanks for the work!

Best regards,
Etsuro Fujita
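Concretely, assuming the 9.5-era signature in which the fourth argument is a LockWaitPolicy rather than a boolean, the suggested correction would look something like:

	/* Pass LockWaitBlock explicitly instead of false, since the fourth
	 * argument is a LockWaitPolicy, not a bool. */
	if (targetRelation)
		applyLockingClause(subquery, 1, LCS_FORUPDATE,
						   LockWaitBlock, false);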
Re: [HACKERS] GSoC 2015 - mentors, students and admins.
On 9 February 2015 at 20:52, Thom Brown t...@linux.com wrote: Hi all, Google Summer of Code 2015 is approaching. I'm intending on registering PostgreSQL again this year. Before I do that, I'd like to have an idea of how many people are interested in being either a student or a mentor. I've volunteered to be admin again, but if anyone else has a strong interest of seeing the projects through this year, please let yourself be known. Who would be willing to mentor projects this year? Project ideas are welcome, and I've copied many from last year's submissions into this year's wiki page. Please feel free to add more (or delete any that stand no chance or are redundant): https://wiki.postgresql.org/wiki/GSoC_2015 Students can find more information about GSoC here: http://www.postgresql.org/developer/summerofcode I'd also like to ask whether anyone would be willing to be a backup admin for this year's GSoC? Please message me directly if you're interested. I need to know A.S.A.P though. Thanks Thom
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
On Wed, Feb 18, 2015 at 1:26 AM, David Steele da...@pgmasters.net wrote:

On 2/17/15 10:23 AM, Simon Riggs wrote:

I vote to include pgaudit in 9.5, albeit with any changes. In particular, David may have some changes to recommend, but I haven't seen a spec or a patch, just a new version of code (which isn't how we do things...).

I submitted the new patch in my name under a separate thread Auditing extension for PostgreSQL (Take 2) (54e005cc.1060...@pgmasters.net)

I played with the patched version of pg_audit a bit and have some basic comments about its spec.

pg_audit doesn't log BIND parameter values when a prepared statement is used. This seems to be an oversight in the patch. Or is it intentional?

pg_audit cannot log a statement like SELECT 1 which doesn't access any database object. Is this intentional? I think that there are many users who want to audit even such statements.

Imagine the case where you call a user-defined function which executes many nested statements. In this case, pg_audit logs only the top-level statement (i.e., the one issued directly by the client) every time a nested statement is executed. In fact, one call of such a UDF can cause lots of copies of the *same* log message. I think this is problematic.

Regards,
--
Fujii Masao
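To illustrate the nested-statement point, a hypothetical example (table and function names invented):

	-- A function that runs two nested statements:
	CREATE TABLE t (a int);
	CREATE FUNCTION f() RETURNS void LANGUAGE plpgsql AS $$
	BEGIN
	  INSERT INTO t VALUES (1);   -- nested statement 1
	  UPDATE t SET a = 2;         -- nested statement 2
	END $$;

	-- Per the behavior described above, each nested statement emits an
	-- audit line carrying only the top-level text, i.e. "SELECT f()"
	-- would appear twice in the log for this single call:
	SELECT f();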
Re: [HACKERS] ExplainModifyTarget doesn't work as expected
On 2015/02/18 8:13, Tom Lane wrote:

So I went back to your v1 patch and have now committed that with some cosmetic modifications. Sorry for making you put time into a dead end.

I don't mind it. Thanks for committing the patch!

Best regards,
Etsuro Fujita
Re: [HACKERS] Parallel Seq Scan
On Tue, Feb 17, 2015 at 9:52 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2015-02-11 15:49:17 -0500, Robert Haas wrote:

A query whose runtime is dominated by a sequential scan (+ attached filter) is certainly going to require a bigger prefetch size than one that does other expensive stuff. Imagine parallelizing SELECT * FROM largetable WHERE col = low_cardinality_value; and SELECT * FROM largetable JOIN gigantic_table ON (index_nestloop_condition) WHERE col = high_cardinality_value; The first query will be a simple sequential scan, and disk reads on largetable will be the major cost of executing it. In contrast the second query might very well sensibly be planned as a parallel sequential scan with the nested loop executing in the same worker. But the cost of the sequential scan itself will likely be completely drowned out by the nestloop execution - index probes are expensive/unpredictable.

I think the work/task given to each worker should be as granular as possible to make it more predictable. I think the better way to parallelize such a join query is that the first worker does the sequential scan and filtering on the large table and then passes the results to the next worker for the join with gigantic_table.

I think it makes sense to think of a set of tasks in which workers can assist. So you have a query tree which is just one query tree, with no copies of the nodes, and then there are certain places in that query tree where a worker can jump in and assist that node. To do that, it will have a copy of the node, but that doesn't mean that all of the stuff inside the node becomes shared data at the code level, because that would be stupid.

My only problem with that description is that I think workers will have to work on more than one node - it'll be entire subtrees of the executor tree.

There could be some cases where it could be beneficial for a worker to process a sub-tree, but I think there will be more cases where it will just work on part of a node and send the result back to either the master backend or another worker for further processing.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] deparsing utility commands
Andres Freund wrote:

Hi,

On 2015-02-15 01:48:15 -0300, Alvaro Herrera wrote:

This is a repost of the patch to add CREATE command deparsing support to event triggers. It now supports not only CREATE but also ALTER and other commands such as GRANT/REVOKE, COMMENT ON and SECURITY LABEL. This patch series is broken up into a rather large number of patches, but my intention would be to commit separately only the first 6 patches; the remaining parts are split up only so that it's easy to see how deparsing of each command is implemented, and would be committed as a single changeset (but before that I need a bunch of extra fixes and additions to other parts.)

I think you should just go ahead and commit 0001, 0002, 0006. Those have previously been discussed and seem like a good idea and uncontroversial even if the rest of the series doesn't go in.

I have pushed 0001 (changed pg_identify_object for opclasses and opfamilies to use USING instead of FOR) to master only, and backpatched a fix for pg_conversion object identities, which were missing schema qualification.

Patch 0002 I think is good to go as well, AFAICT (have the various RENAME commands return the OID and attnum of affected objects).

On 0006 (which is about having ALTER TABLE return the OID/attnum of the affected object on each subcommand), I have a problem with the ALTER TABLE ALTER COLUMN SET DATA TYPE USING subcommand. The problem is that in order to fully deparse that subcommand we need to deparse the expression of the USING clause. But in the parse node we only have info about the untransformed expression, so it is not possible to pass it through ruleutils directly; it needs to be run through transformExpr() first. Doing that in the deparse code seems the wrong solution, so I am toying with the idea of adding a Node * output parameter to some ATExec* routines; the Prep step would insert the transformed expression there, so that it is available at the deparse stage. As far as I know, only the SET DATA TYPE USING form has this issue: for other subcommands, the current code simply grabs the expression from the catalogs. Maybe it would be good to use that Node in those cases as well and avoid catalog lookups, not sure.

I think the GRANT/REVOKE, COMMENT ON and SECURITY LABEL changes are good independently as well, but there previously have been raised some concerns about shared objects. I think the answer in the patches, which is to raise events when affecting database-local objects, makes sense, but others might disagree.

Yes, I will push these unless somebody objects soon, as they seem perfectly reasonable to me. The only troubling thing is that there is no regression test for this kind of thing in event triggers (i.e. verifying which command tags get support and which don't), which seems odd to me. Not these patches' fault, though, so I'm not considering adding any ATM.

--
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
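A hypothetical sketch of the Node * output-parameter idea (names invented; the real ATPrep/ATExec signatures differ):

	/* Hypothetical sketch only: let the Prep step hand back the
	 * transformed USING expression for the deparse stage. */
	typedef struct Node Node;		/* stand-in for PostgreSQL's Node */

	typedef struct AlterColTypeCtx
	{
		Node	   *raw_using;		/* untransformed, from the parse node */
		Node	   *cooked_using;	/* filled in by the Prep step */
	} AlterColTypeCtx;

	static void
	prep_alter_column_type(AlterColTypeCtx *ctx)
	{
		/*
		 * The Prep phase already runs transformExpr() on the USING
		 * clause; it would additionally stash the result, e.g.:
		 *
		 *   ctx->cooked_using = transformExpr(pstate, ctx->raw_using,
		 *                                     EXPR_KIND_ALTER_COL_TRANSFORM);
		 *
		 * so the deparse stage can feed cooked_using to ruleutils
		 * instead of re-running the transformation.
		 */
	}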
Re: [HACKERS] deparsing utility commands
Stephen Frost wrote:

I've started taking a look at this as the pgaudit bits depend on it and noticed that the patch set implies there's 42 patches, but there were only 37 attached..?

Ah, I didn't realize when I posted that the subject was counting those extra patches. They are later patches that add the testing framework, but since the tests don't pass currently, they are not usable yet. Mostly they are about running the deparse_init.sql file that I posted separately. I will post a real patch for that stuff later today to make it clear what it is that we're talking about.

FWIW, one of Robert's main objections is that future hackers will forget to add deparse support for new commands as they are added. In an attempt to get this sorted out, I have modified the stuff in ProcessUtilitySlow() so that instead of having one EventTriggerStashCommand call for each node type, there is only one call at the end of the function. That way, new cases in the big switch there will automatically get something added to the stash, which should make the test fail appropriately. (Things like adding a new clause to existing commands will be tested by running pg_dump on the databases and comparing.)

--
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] deparsing utility commands
Alvaro,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:

This is a repost of the patch to add CREATE command deparsing support to event triggers. It now supports not only CREATE but also ALTER and other commands such as GRANT/REVOKE, COMMENT ON and SECURITY LABEL. This patch series is broken up into a rather large number of patches, but my intention would be to commit separately only the first 6 patches; the remaining parts are split up only so that it's easy to see how deparsing of each command is implemented, and would be committed as a single changeset (but before that I need a bunch of extra fixes and additions to other parts.)

I've started taking a look at this as the pgaudit bits depend on it and noticed that the patch set implies there's 42 patches, but there were only 37 attached..?

Thanks!
Stephen
Re: [HACKERS] deparsing utility commands
Alvaro,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:

Stephen Frost wrote:

I've started taking a look at this as the pgaudit bits depend on it and noticed that the patch set implies there's 42 patches, but there were only 37 attached..?

Ah, I didn't realize when I posted that the subject was counting those extra patches. They are later patches that add the testing framework, but since the tests don't pass currently, they are not usable yet. Mostly they are about running the deparse_init.sql file that I posted separately. I will post a real patch for that stuff later today to make it clear what it is that we're talking about.

Oh, ok, no problem, just wanted to make sure things weren't missing.

FWIW, one of Robert's main objections is that future hackers will forget to add deparse support for new commands as they are added. In an attempt to get this sorted out, I have modified the stuff in ProcessUtilitySlow() so that instead of having one EventTriggerStashCommand call for each node type, there is only one call at the end of the function. That way, new cases in the big switch there will automatically get something added to the stash, which should make the test fail appropriately. (Things like adding a new clause to existing commands will be tested by running pg_dump on the databases and comparing.)

Right, I've been following the discussion and fully agree with Robert that we really need a way to make sure that future work doesn't forget to address deparsing (or the various other bits, object classes, etc, really). The approach you've outlined sounds interesting but I haven't yet gotten to that bit of the code.

Thanks!
Stephen
Re: [HACKERS] deparsing utility commands
Stephen,

Stephen Frost wrote:

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:

Patch 0002 I think is good to go as well, AFAICT (have the various RENAME commands return the OID and attnum of affected objects).

It's not a huge complaint, but it feels a bit awkward to me that ExecRenameStmt is now returning one item and using an out variable for the other when the two really go together (Oid and Object Sub ID, that is). Further, the comment above ExecRenameStmt should make it clear that it's safe to pass NULL into objsubid if you don't care about it. The same probably goes for the COMMENT bits.

Hmm, while I agree that it's a relatively minor point, it seems a fair one. I think we could handle this by returning ObjectAddress rather than Oid in ExecRenameStmt() and CommentObject(); then you have all the bits you need in a single place. Furthermore, the function in another patch, EventTriggerStashCommand(), instead of getting (objType, objectId, objectSubId) separately could take a single argument of type ObjectAddress.

Now, we probably don't want to hack *all* the utility commands to return ObjectAddress instead of OID, because in many cases that's just not going to be convenient (not to speak of the code churn); so I think for most objtypes the ProcessUtilitySlow stanza would look like this:

	case T_AlterTSConfigurationStmt:
		objectId = AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree);
		objectType = OBJECT_TSCONFIGURATION;
		break;

For ExecRenameStmt and CommentObject (and probably other cases such as security labels) the stanza in ProcessUtilitySlow would be simpler:

	case T_CommentStmt:
		address = CommentObject((CommentStmt *) parsetree);
		break;

and at the bottom of the loop we would transform the objid/type into an address for the cases that need it:

	if (!commandStashed)
	{
		if (objectId != InvalidOid)
		{
			address.classId = get_objtype_catalog_oid(objectType);
			address.objectId = objectId;
			address.objectSubId = 0;
		}
		EventTriggerStashCommand(address, secondaryOid, parsetree);
	}

On 0006 (which is about having ALTER TABLE return the OID/attnum of the affected object on each subcommand), I have a problem with the ALTER TABLE ALTER COLUMN SET DATA TYPE USING subcommand. The problem is that in order to fully deparse that subcommand we need to deparse the expression of the USING clause. But in the parse node we only have info about the untransformed expression, so it is not possible to pass it through ruleutils directly; it needs to be run through transformExpr() first.

I agree- I'm pretty sure we definitely don't want to run through transformExpr again in the deparse code. I'm not a huge fan of adding a Node* output parameter, but I haven't got any other great ideas about how to address that.

Yeah, my thoughts exactly :-(

I think the GRANT/REVOKE, COMMENT ON and SECURITY LABEL changes are good independently as well, but there previously have been raised some concerns about shared objects. I think the answer in the patches, which is to raise events when affecting database-local objects, makes sense, but others might disagree.

Yes, I will push these unless somebody objects soon, as they seem perfectly reasonable to me. The only troubling thing is that there is no regression test for this kind of thing in event triggers (i.e. verifying which command tags get support and which don't), which seems odd to me. Not these patches' fault, though, so I'm not considering adding any ATM.

Ugh. I dislike that when we say an event trigger will fire before 'GRANT' what we really mean is GRANT when it's operating against a local object. At the minimum we absolutely need to be very clear in the documentation about that limitation. Perhaps there is something already which reflects that, but I don't see anything in the patch being added about that.

Hmm, good point, will give this some thought. I'm thinking perhaps we can add a table of which object types are supported for generic commands such as GRANT, COMMENT and SECURITY LABEL.

Still looking at the rest of the patches.

Great, thanks -- and thanks for the review so far.

--
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] deparsing utility commands
Alvaro,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:

Andres Freund wrote:

On 2015-02-15 01:48:15 -0300, Alvaro Herrera wrote:

I think you should just go ahead and commit 0001, 0002, 0006. Those have previously been discussed and seem like a good idea and uncontroversial even if the rest of the series doesn't go in.

I have pushed 0001 (changed pg_identify_object for opclasses and opfamilies to use USING instead of FOR) to master only, and backpatched a fix for pg_conversion object identities, which were missing schema qualification.

That looked fine to me.

Patch 0002 I think is good to go as well, AFAICT (have the various RENAME commands return the OID and attnum of affected objects).

It's not a huge complaint, but it feels a bit awkward to me that ExecRenameStmt is now returning one item and using an out variable for the other when the two really go together (Oid and Object Sub ID, that is). Further, the comment above ExecRenameStmt should make it clear that it's safe to pass NULL into objsubid if you don't care about it. The same probably goes for the COMMENT bits.

On 0006 (which is about having ALTER TABLE return the OID/attnum of the affected object on each subcommand), I have a problem with the ALTER TABLE ALTER COLUMN SET DATA TYPE USING subcommand. The problem is that in order to fully deparse that subcommand we need to deparse the expression of the USING clause. But in the parse node we only have info about the untransformed expression, so it is not possible to pass it through ruleutils directly; it needs to be run through transformExpr() first. Doing that in the deparse code seems the wrong solution, so I am toying with the idea of adding a Node * output parameter to some ATExec* routines; the Prep step would insert the transformed expression there, so that it is available at the deparse stage. As far as I know, only the SET DATA TYPE USING form has this issue: for other subcommands, the current code simply grabs the expression from the catalogs. Maybe it would be good to use that Node in those cases as well and avoid catalog lookups, not sure.

I agree- I'm pretty sure we definitely don't want to run through transformExpr again in the deparse code. I'm not a huge fan of adding a Node* output parameter, but I haven't got any other great ideas about how to address that.

I think the GRANT/REVOKE, COMMENT ON and SECURITY LABEL changes are good independently as well, but there previously have been raised some concerns about shared objects. I think the answer in the patches, which is to raise events when affecting database-local objects, makes sense, but others might disagree.

Yes, I will push these unless somebody objects soon, as they seem perfectly reasonable to me. The only troubling thing is that there is no regression test for this kind of thing in event triggers (i.e. verifying which command tags get support and which don't), which seems odd to me. Not these patches' fault, though, so I'm not considering adding any ATM.

Ugh. I dislike that when we say an event trigger will fire before 'GRANT' what we really mean is GRANT when it's operating against a local object. At the minimum we absolutely need to be very clear in the documentation about that limitation. Perhaps there is something already which reflects that, but I don't see anything in the patch being added about that.

Still looking at the rest of the patches.

Thanks!
Stephen
Re: [HACKERS] deparsing utility commands
Alvaro,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:

Stephen Frost wrote:

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:

Patch 0002 I think is good to go as well, AFAICT (have the various RENAME commands return the OID and attnum of affected objects).

It's not a huge complaint, but it feels a bit awkward to me that ExecRenameStmt is now returning one item and using an out variable for the other when the two really go together (Oid and Object Sub ID, that is). Further, the comment above ExecRenameStmt should make it clear that it's safe to pass NULL into objsubid if you don't care about it. The same probably goes for the COMMENT bits.

Hmm, while I agree that it's a relatively minor point, it seems a fair one. I think we could handle this by returning ObjectAddress rather than Oid in ExecRenameStmt() and CommentObject(); then you have all the bits you need in a single place. Furthermore, the function in another patch, EventTriggerStashCommand(), instead of getting (objType, objectId, objectSubId) separately could take a single argument of type ObjectAddress.

Agreed, that thought occurred to me as well while I was looking through the other deparse patches and I agree that it makes sense.

Now, we probably don't want to hack *all* the utility commands to return ObjectAddress instead of OID, because in many cases that's just not going to be convenient (not to speak of the code churn); so I think for most objtypes the ProcessUtilitySlow stanza would look like this:

	case T_AlterTSConfigurationStmt:
		objectId = AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree);
		objectType = OBJECT_TSCONFIGURATION;
		break;

For ExecRenameStmt and CommentObject (and probably other cases such as security labels) the stanza in ProcessUtilitySlow would be simpler:

	case T_CommentStmt:
		address = CommentObject((CommentStmt *) parsetree);
		break;

and at the bottom of the loop we would transform the objid/type into an address for the cases that need it:

	if (!commandStashed)
	{
		if (objectId != InvalidOid)
		{
			address.classId = get_objtype_catalog_oid(objectType);
			address.objectId = objectId;
			address.objectSubId = 0;
		}
		EventTriggerStashCommand(address, secondaryOid, parsetree);
	}

That'd be fine with me, though for my 2c, I wouldn't object to changing them all to return ObjectAddress either. I agree that it'd cause a fair bit of code churn to do so, but there's a fair bit of code churn happening here anyway (looking at what 0008 does to ProcessUtilitySlow anyway).

Yes, I will push these unless somebody objects soon, as they seem perfectly reasonable to me. The only troubling thing is that there is no regression test for this kind of thing in event triggers (i.e. verifying which command tags get support and which don't), which seems odd to me. Not these patches' fault, though, so I'm not considering adding any ATM.

Ugh. I dislike that when we say an event trigger will fire before 'GRANT' what we really mean is GRANT when it's operating against a local object. At the minimum we absolutely need to be very clear in the documentation about that limitation. Perhaps there is something already which reflects that, but I don't see anything in the patch being added about that.

Hmm, good point, will give this some thought. I'm thinking perhaps we can add a table of which object types are supported for generic commands such as GRANT, COMMENT and SECURITY LABEL.

That sounds like a good idea.

Thanks!
Stephen
Re: [HACKERS] deparsing utility commands
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:

Now, we probably don't want to hack *all* the utility commands to return ObjectAddress instead of OID, because in many cases that's just not going to be convenient (not to speak of the code churn); so I think for most objtypes the ProcessUtilitySlow stanza would look like this:

That'd be fine with me, though for my 2c, I wouldn't object to changing them all to return ObjectAddress either. I agree that it'd cause a fair bit of code churn to do so, but there's a fair bit of code churn happening here anyway (looking at what 0008 does to ProcessUtilitySlow anyway).

Well, that would make my life easier I think (even if it's a bit more work), so unless there are objections I will do things this way. It's a bit of a pity that Robert and Dimitri went to huge lengths to have these functions return OID and we're now changing it all to ObjAddress instead, but oh well.

Not sure that I see it as that huge a deal. They're still returning an Oid, it's just embedded in the ObjAddress to provide a complete statement of what the object is.

btw, the hunk in 0026 which adds a 'break;' into standard_ProcessUtility caught me by surprise. Looks like that 'break;' was missing from 0003 (for T_GrantStmt).

Thanks,
Stephen
Re: [HACKERS] Allow snapshot too old error, to prevent bloat
Magnus Hagander mag...@hagander.net wrote:

On Feb 17, 2015 12:26 AM, Andres Freund and...@2ndquadrant.com wrote:

On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:

It seems we already have a mechanism in place that allows tuning of query cancel on standbys vs. preventing standby queries from seeing old data, specifically max_standby_streaming_delay/max_standby_archive_delay. We obsessed about how users were going to react to these odd variables, but there has been little negative feedback.

FWIW, I think that's a somewhat skewed perception. I think it was right to introduce those, because we didn't really have any alternatives.

As far as I can see, the alternatives suggested so far are all about causing heap bloat to progress more slowly, but still without limit. I suggest, based on a lot of user feedback (from the customer I've talked about at some length on this thread, as well as numerous others), that unlimited bloat based on the activity of one connection is a serious deficiency in the product; and that there is no real alternative to something like a snapshot too old error if we want to fix that deficiency. Enhancements to associate a snapshot with a database and use a separate vacuum xmin per database, to not limit vacuum of a particular object by snapshots that cannot see that object, etc., are all Very Good Things and I hope those changes are made, but none of that fixes a very fundamental flaw.

But max_standby_streaming_delay, max_standby_archive_delay and hot_standby_feedback are among the most frequent triggers for questions and complaints that I/we see.

Agreed. And a really bad one used to be vacuum_defer_cleanup_age, because of confusing units amongst other things. Which in turn seems fairly close to Kevin's suggestions, unfortunately.

Particularly my initial suggestion, which was to base snapshot age on the number of transaction IDs assigned. Does this look any better to you if it is something that can be set to '20min' or '1h'? Just to restate, that would not automatically cancel the snapshots past that age; it would allow vacuum of any tuples which became dead that long ago, and would cause a snapshot too old message for any read of a page modified more than that long ago using a snapshot which was older than that.

Unless this idea gets some +1 votes Real Soon Now, I will mark the patch as Rejected and move on.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
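For illustration only, a hypothetical sketch of the page-level check being described - every name below is invented; nothing like this is committed:

	/* Hypothetical sketch; names invented, not committed code. */
	typedef unsigned long XLogRecPtr;

	typedef struct SnapshotSketch
	{
		int			age_minutes;		/* how old this snapshot is */
		XLogRecPtr	lsn_at_threshold;	/* page-LSN fence for the snapshot */
	} SnapshotSketch;

	extern int	old_snapshot_threshold_min;	/* e.g. '20min' or '1h', as floated */
	extern void snapshot_too_old_error(void);

	static void
	check_page_against_snapshot(const SnapshotSketch *snap, XLogRecPtr page_lsn)
	{
		/*
		 * Vacuum may remove tuples that became dead more than the
		 * threshold ago; any read of a page modified past the fence by
		 * an older snapshot must error out rather than return wrong
		 * answers.
		 */
		if (snap->age_minutes > old_snapshot_threshold_min &&
			page_lsn > snap->lsn_at_threshold)
			snapshot_too_old_error();	/* "snapshot too old" */
	}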
Re: [HACKERS] deparsing utility commands
Now, we probably don't want to hack *all* the utility commands to return ObjectAddress instead of OID, because in many cases that's just not going to be convenient (not to speak of the code churn); so I think for most objtypes the ProcessUtilitySlow stanza would look like this:

That'd be fine with me, though for my 2c, I wouldn't object to changing them all to return ObjectAddress either. I agree that it'd cause a fair bit of code churn to do so, but there's a fair bit of code churn happening here anyway (looking at what 0008 does to ProcessUtilitySlow anyway).

Well, that would make my life easier I think (even if it's a bit more work), so unless there are objections I will do things this way. It's a bit of a pity that Robert and Dimitri went to huge lengths to have these functions return OID and we're now changing it all to ObjAddress instead, but oh well.

--
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Expanding the use of FLEXIBLE_ARRAY_MEMBER for declarations like foo[1]
On 2015-02-19 07:10:19 +0900, Michael Paquier wrote:

On Wed, Feb 18, 2015 at 10:09 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2015-02-18 17:15:18 +0900, Michael Paquier wrote:

- I don't think that the t_bits fields in htup_details.h should be updated either.

Why not? Any not-broken code should already use MinHeapTupleSize and similar macros.

Changing t_bits impacts HeapTupleHeaderData, ReorderBufferTupleBuf and similarly a couple of redo routines in heapam.c using HeapTupleHeaderData in a couple of structures not placing it at the end (the compiler complains).

The compiler will complain if you use a FLEXIBLE_ARRAY_MEMBER in the middle of a struct, but not when you embed a struct that uses it in the middle of another struct. At least gcc doesn't, and I think it'd be utterly broken if another compiler did that. If there's a compiler that does so, we need to make it define FLEXIBLE_ARRAY_MEMBER to 1.

clang does complain on my OSX laptop regarding that ;)

I think that then puts the idea of doing it for those structs pretty much to bed.

Greetings,

Andres Freund

--
Andres Freund  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
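A minimal reproducer of the divergence reported in this thread (exact diagnostics vary by compiler version):

	/* A struct ending in a flexible array member... */
	struct inner
	{
		int		n;
		char	data[];		/* flexible array member */
	};

	/* ...embedded in the middle of another struct.  gcc accepts this
	 * as an extension (a pedantic warning at most); clang warns by
	 * default, which matches the behavior reported above. */
	struct outer
	{
		struct inner hdr;	/* not at the end */
		int		more;
	};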
Re: [HACKERS] Expanding the use of FLEXIBLE_ARRAY_MEMBER for declarations like foo[1]
Michael Paquier michael.paqu...@gmail.com writes:

On Wed, Feb 18, 2015 at 10:09 PM, Andres Freund and...@2ndquadrant.com wrote:

The compiler will complain if you use a FLEXIBLE_ARRAY_MEMBER in the middle of a struct, but not when you embed a struct that uses it in the middle of another struct. At least gcc doesn't, and I think it'd be utterly broken if another compiler did that. If there's a compiler that does so, we need to make it define FLEXIBLE_ARRAY_MEMBER to 1.

clang does complain on my OSX laptop regarding that ;)

I'm a bit astonished that gcc doesn't consider this an error. Sure seems like it should. (Has anyone tried it on recent gcc?) I am entirely opposed to Andres' claim that we ought to consider compilers that do warn to be broken; if anything it's the other way around. Moreover, if we have any code that is assuming such cases are okay, it probably needs a second look. Isn't this situation effectively assuming that a variable-length array is fixed-length?

regards, tom lane
Re: [HACKERS] Allow snapshot too old error, to prevent bloat
Kevin,

* Kevin Grittner (kgri...@ymail.com) wrote:

Magnus Hagander mag...@hagander.net wrote:

On Feb 17, 2015 12:26 AM, Andres Freund and...@2ndquadrant.com wrote:

On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:

It seems we already have a mechanism in place that allows tuning of query cancel on standbys vs. preventing standby queries from seeing old data, specifically max_standby_streaming_delay/max_standby_archive_delay. We obsessed about how users were going to react to these odd variables, but there has been little negative feedback.

FWIW, I think that's a somewhat skewed perception. I think it was right to introduce those, because we didn't really have any alternatives.

As far as I can see, the alternatives suggested so far are all about causing heap bloat to progress more slowly, but still without limit. I suggest, based on a lot of user feedback (from the customer I've talked about at some length on this thread, as well as numerous others), that unlimited bloat based on the activity of one connection is a serious deficiency in the product; and that there is no real alternative to something like a snapshot too old error if we want to fix that deficiency. Enhancements to associate a snapshot with a database and use a separate vacuum xmin per database, to not limit vacuum of a particular object by snapshots that cannot see that object, etc., are all Very Good Things and I hope those changes are made, but none of that fixes a very fundamental flaw.

I also agree with the general idea that it makes sense to provide a way to control bloat, but I think you've missed what Andres was getting at with his suggestion (if I understand correctly, apologies if I don't). The problem is that we're only looking at the overall xmin / xmax horizon when it comes to deciding if a given tuple is dead. That's not quite right- the tuple might actually be dead to all *current* transactions: newer than the oldest transaction, yet deleted before any of the later transactions started. Basically, there exist gaps between our cluster-wide xmin / xmax where we might find actually dead rows. Marking those rows dead and reusable *would* stop the bloat, not just slow it down. In the end, with a single long-running transaction, the worst bloat you would have is double the size of the system at the time the long-running transaction started.

Another way of thinking about it is with this timeline:

  x--------------------x----------------x----------------x-------->
  ^                    ^                ^                ^
  Long running         row created      row deleted      vacuum
  transaction starts

Now, if all transactions currently running started either before the row was created or after the row was deleted, then the row is dead and vacuum can reclaim it. Finding those gaps in the transaction space for all of the currently running processes might not be cheap, but for this kind of a use-case, it might be worth the effort. On a high-churn system where the set of rows being constantly rewritten isn't very large, you could end up with some amount of duplication of rows- those to satisfy the ancient transaction plus the current working set- but there wouldn't be any further bloat and it'd almost certainly be a much better situation than what is being seen now.
Just to restate, that would not automatically cancel the snapshots past that age; it would allow vacuum of any tuples which became dead that long ago, and would cause a snapshot too old message for any read of a page modified more than that long ago using a snapshot which was older than that. Unless this idea gets some +1 votes Real Soon Now, I will mark the patch as Rejected and move on. I'm not against having a knob like this, which is defaulted to off, but I do think we'd be a lot better off with a system that could realize when rows are not visible to any currently running transaction and clean them up. If this knob's default is off then I don't think we'd be likely to get the complaints which are being discussed (or, if we did, we could point the individual at the admin who set the knob...). Thanks! Stephen signature.asc Description: Digital signature
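To make the gap test described above concrete, a rough standalone sketch (simplified rules and assumed names; it ignores wraparound, aborts, and in-progress transactions, so it is not actual PostgreSQL visibility logic):

    #include <stdbool.h>

    typedef unsigned int TransactionId;

    /* Simplified snapshot: xids before xmin had finished when the
     * snapshot was taken; xids at or after xmax had not yet started. */
    typedef struct MiniSnapshot
    {
        TransactionId xmin;
        TransactionId xmax;
    } MiniSnapshot;

    /* A tuple created by t_xmin and deleted by t_xmax is dead to a
     * snapshot that already sees the delete, or that cannot see the
     * insert at all. */
    static bool
    dead_to_snapshot(TransactionId t_xmin, TransactionId t_xmax,
                     const MiniSnapshot *s)
    {
        return t_xmax < s->xmin || t_xmin >= s->xmax;
    }

    /* Reclaimable despite an old global xmin horizon: dead to every
     * snapshot still registered in the system. */
    static bool
    dead_to_all(TransactionId t_xmin, TransactionId t_xmax,
                const MiniSnapshot *snaps, int nsnaps)
    {
        for (int i = 0; i < nsnaps; i++)
        {
            if (!dead_to_snapshot(t_xmin, t_xmax, &snaps[i]))
                return false;   /* some snapshot may still see the row */
        }
        return true;
    }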
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Mon, Feb 16, 2015 at 8:55 PM, Andres Freund and...@2ndquadrant.com wrote: On 2015-02-16 11:30:20 +0000, Syed, Rahila wrote: - * As a trivial form of data compression, the XLOG code is aware that - * PG data pages usually contain an unused hole in the middle, which - * contains only zero bytes. If hole_length > 0 then we have removed - * such a hole from the stored data (and it's not counted in the - * XLOG record's CRC, either). Hence, the amount of block data actually - * present is BLCKSZ - hole_length bytes. + * Block images are able to do several types of compression: + * - When wal_compression is off, as a trivial form of compression, the + * XLOG code is aware that PG data pages usually contain an unused hole + * in the middle, which contains only zero bytes. If length < BLCKSZ + * then we have removed such a hole from the stored data (and it is + * not counted in the XLOG record's CRC, either). Hence, the amount + * of block data actually present is length bytes. The hole offset + * on page is defined using hole_offset. + * - When wal_compression is on, block images are compressed using a + * compression algorithm without their hole to improve the compression + * process of the page. length corresponds in this case to the length + * of the compressed block. hole_offset is the hole offset of the page, + * and the length of the uncompressed block is defined by raw_length, + * whose data is included in the record only when compression is enabled + * and with_hole is set to true, see below. + * + * is_compressed is used to identify if a given block image is compressed + * or not. The maximum page size allowed on the system being 32k, the hole + * offset cannot be more than 15 bits long, so the last free bit is used to + * store the compression state of the block image. If the maximum page size + * allowed is increased to a value higher than that, we should consider + * increasing this structure size as well, but this would increase the + * length of the block header in WAL records with alignment. + * + * with_hole is used to identify the presence of a hole in a block image. + * As the length of a block cannot be more than 15 bits long, the extra bit in + * the length field is used for this identification purpose. If the block image + * has no hole, it is ensured that the raw size of a compressed block image is + * equal to BLCKSZ, hence the contents of XLogRecordBlockImageCompressionInfo + * are not necessary. */ typedef struct XLogRecordBlockImageHeader { - uint16 hole_offset; /* number of bytes before hole */ - uint16 hole_length; /* number of bytes in hole */ + uint16 length:15, /* length of block data in record */ + with_hole:1; /* status of hole in the block */ + + uint16 hole_offset:15, /* number of bytes before hole */ + is_compressed:1; /* compression status of image */ + + /* Followed by the data related to compression if block is compressed */ } XLogRecordBlockImageHeader; Yikes, this is ugly. I think we should change the xlog format so that the block_id (which currently is XLR_BLOCK_ID_DATA_SHORT/LONG or an actual block id) isn't the block id but something like XLR_CHUNK_ID. Which is used as is for XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED, XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the block id following the chunk id. Yes, that'll increase the amount of data for a backup block by 1 byte, but I think that's worth it. I'm pretty sure we will be happy about the added extensibility pretty soon.
Yeah, that would help readability and does not cost much compared to BLCKSZ. Still, could you explain what kind of extensibility you have in mind besides code readability? It is hard to make a nice picture with only paper and pencils, and the current patch approach has been taken to minimize the record length, particularly for users who do not care about WAL compression. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Join push-down support for foreign tables
2015-02-17 10:39 GMT+09:00 Kouhei Kaigai kai...@ak.jp.nec.com: Let me put some comments in addition to where you're checking now. [design issues] * Cost estimation Estimation and evaluation of the cost of a remote join query is not an obvious issue. In principle, the local side cannot determine the cost to run a remote join without remote EXPLAIN, because the local side has no information about the JOIN logic applied on the remote side. Probably, we have to assume a remote join algorithm, because the local planner has no idea about the remote planner's choice unless the foreign join uses use_remote_estimate. I think it is a reasonable assumption (even if it is incorrect) to calculate the remote join cost based on the local hash-join algorithm. If the user wants more correct estimation, remote EXPLAIN will make a more reliable cost estimate. Hm, I guess that you chose hash-join as the least-costed join. In the pgbench model, most combinations between two tables generate a hash join as the cheapest path. Remote EXPLAIN is very expensive in the context of planning, so it would easily make the plan optimization meaningless. But giving an option to users is good, I agree. It also needs a consensus whether the cost of remote CPU execution is equivalent to local CPU. If we think local CPU is a rarer resource than the remote one, a discount rate will make the planner prefer a remote join over a local one. Something like cpu_cost_ratio as a new server-level FDW option? Once we assume a join algorithm for the remote join and a unit cost for remote CPU, we can calculate a cost for the foreign join based on the local join logic plus the cost of network translation (maybe fdw_tuple_cost?). Yes, the sum of these costs is the total cost of a remote join. o fdw_startup_cost o hash-join cost, estimated as a local join o fdw_tuple_cost * rows * width * FDW options Unlike a table scan, the FDW options we should refer to are unclear. Table-level FDW options are associated with a foreign table as literal. I think we have two options here: 1. A foreign join refers to the FDW options of the foreign server, but the ones for foreign tables are ignored. 2. A foreign join is prohibited when the relations don't have identical FDW options. My preference is 2. Even for an N-way foreign join, it ensures that all the tables involved in the (N-1)-way foreign join have identical FDW options, so we can build the N-way foreign join with all-identical FDW options. One exception is the updatable flag of postgres_fdw. It does not make sense on a remote join, so I think a mixture of updatable and non-updatable foreign tables should be admitted; however, that is a decision for the FDW driver. Probably, the above points need some time to get consensus. I'd like to see your opinion prior to editing your patch. postgres_fdw can't push down a join which contains foreign tables on multiple servers, so use_remote_estimate and fdw_startup_cost are the only FDW options to consider. So we have choices for each option. 1-a. If all foreign tables in the join have identical use_remote_estimate, allow pushing down. 1-b. If any foreign table in the join has use_remote_estimate set to true, use remote estimate. 2-a. If all foreign tables in the join have identical fdw_startup_cost, allow pushing down. 2-b. Always use the max value in the join. (cost would be more expensive) 2-c. Always use the min value in the join. (cost would be cheaper) I prefer 1-a and 2-b, so more joins avoid remote EXPLAIN but keep a reasonable startup cost. I agree about the updatable option.
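As a sketch of how those components could combine (fdw_startup_cost and fdw_tuple_cost are existing postgres_fdw options; cpu_cost_ratio is the hypothetical knob floated above, and the function itself is illustrative, not part of any patch):

    typedef double Cost;        /* as in nodes/nodes.h */

    /*
     * Sketch only: price a pushed-down join as a local hash join
     * executed on the remote CPU, discounted by the hypothetical
     * cpu_cost_ratio, plus the usual per-connection and per-tuple
     * FDW overheads.
     */
    static Cost
    estimate_foreign_join_cost(Cost local_hashjoin_cost,
                               Cost fdw_startup_cost,
                               Cost fdw_tuple_cost,
                               double cpu_cost_ratio,
                               double retrieved_rows)
    {
        Cost    remote_join_cost = local_hashjoin_cost * cpu_cost_ratio;
        Cost    transfer_cost = fdw_tuple_cost * retrieved_rows;

        return fdw_startup_cost + remote_join_cost + transfer_cost;
    }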
[implementation issues] The interface does not intend to add a new Path/Plan type for each scan that replaces foreign joins. What postgres_fdw should do is add a ForeignPath towards a particular joinrel, then populate a ForeignScan with the remote join query once it gets chosen by the planner. That idea is interesting, and makes many things simpler. Please let me consider it. A few functions added in src/backend/foreign/foreign.c are not called from anywhere at this moment. create_plan_recurse() is reverted to static. It is needed for the custom-join enhancement, if no other infrastructure can support it. I made it back to static because I thought that create_plan_recurse can be called by core before giving control to FDWs. But I'm not sure it can be applied to custom scans. I'll recheck that part. -- Shigeru HANADA -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Expanding the use of FLEXIBLE_ARRAY_MEMBER for declarations like foo[1]
On Thu, Feb 19, 2015 at 3:57 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Feb 19, 2015 at 2:58 PM, Tom Lane t...@sss.pgh.pa.us wrote: Michael Paquier michael.paqu...@gmail.com writes: 1-2) sizeof(ParamListInfoData) is present in a couple of places, assuming that sizeof(ParamListInfoData) has the equivalent of 1 parameter, like prepare.c, functions.c, spi.c and postgres.c: - /* sizeof(ParamListInfoData) includes the first array element */ paramLI = (ParamListInfo) palloc(sizeof(ParamListInfoData) + - (num_params - 1) * sizeof(ParamExternData)); + num_params * sizeof(ParamExternData)); 1-3) FuncCandidateList in namespace.c (thanks Andres!): newResult = (FuncCandidateList) - palloc(sizeof(struct _FuncCandidateList) - sizeof(Oid) + effective_nargs * sizeof(Oid)); + palloc(sizeof(struct _FuncCandidateList) + effective_nargs * sizeof(Oid)); I imagine that we do not want those palloc calls to use #ifdef FLEXIBLE_ARRAY_MEMBER to save some memory, for code readability, even if the compiler does not support flexible array length, right? These are just wrong. As a general rule, we do not want to *ever* take sizeof() a struct that contains a flexible array: the results will not be consistent across platforms. The right thing is to use offsetof() instead. See the helpful comment autoconf provides: [...] And I had this one in front of my eyes a couple of hours ago... Thanks. This point is actually the main reason we've not done this change long since. People did not feel like running around to make sure there were no overlooked uses of sizeof(). Thanks for the clarifications and the review. Attached is a new set. Grr. Completely forgot to use offsetof in dumputils.c as well. A patch that can be applied on top of 0001 is attached. -- Michael diff --git a/src/bin/pg_dump/dumputils.c b/src/bin/pg_dump/dumputils.c index 892d3fa..3459adc 100644 --- a/src/bin/pg_dump/dumputils.c +++ b/src/bin/pg_dump/dumputils.c @@ -1217,7 +1217,7 @@ simple_string_list_append(SimpleStringList *list, const char *val) SimpleStringListCell *cell; /* this calculation correctly accounts for the null trailing byte */ - cell = (SimpleStringListCell *) pg_malloc(sizeof(SimpleStringListCell) + strlen(val)); + cell = (SimpleStringListCell *) pg_malloc(offsetof(SimpleStringListCell, val) + strlen(val) + 1); cell->next = NULL; strcpy(cell->val, val); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
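The rule generalizes beyond these call sites; a minimal illustration (generic code, not a quote of the patch):

    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct SimpleCell
    {
        struct SimpleCell *next;
        char        val[];      /* flexible array member */
    } SimpleCell;

    static SimpleCell *
    make_cell(const char *s)
    {
        /*
         * sizeof(SimpleCell) may include trailing padding once val[]
         * is flexible, so its value is platform-dependent; offsetof()
         * names the start of the array explicitly and portably.
         */
        SimpleCell *cell = malloc(offsetof(SimpleCell, val) + strlen(s) + 1);

        cell->next = NULL;
        strcpy(cell->val, s);
        return cell;
    }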
Re: [HACKERS] Odd behavior of updatable security barrier views on foreign tables
Etsuro, * Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote: On 2015/02/18 7:44, Stephen Frost wrote: * The patch applies to the latest head, 'make' passes successfully, but 'make check' fails in the rowsecurity test. Here's the patch against master. I'm still fiddling with the comment wording and the commit message a bit, but barring objections these patches are what I'm planning to move forward with. Thanks! Stephen From ea3713a8b648459d3024d331ef0374f6c9622247 Mon Sep 17 00:00:00 2001 From: Stephen Frost sfr...@snowman.net Date: Tue, 17 Feb 2015 15:43:33 -0500 Subject: [PATCH] Add locking clause for SB views for update/delete In expand_security_qual(), we were handling locking correctly when a PlanRowMark existed, but not when we were working with the target relation (which doesn't have any PlanRowMarks, but the subquery created for the security barrier quals still needs to lock the rows under it). Noted by Etsuro Fujita when working with the Postgres FDW, which wasn't properly issuing a SELECT ... FOR UPDATE to the remote side under a DELETE. Back-patch to 9.4 where updatable security barrier views were introduced. --- src/backend/optimizer/prep/prepsecurity.c | 24 ++- src/test/regress/expected/rowsecurity.out | 64 --- src/test/regress/expected/updatable_views.out | 264 ++ 3 files changed, 199 insertions(+), 153 deletions(-) diff --git a/src/backend/optimizer/prep/prepsecurity.c b/src/backend/optimizer/prep/prepsecurity.c index af3ee61..ce7b203 100644 --- a/src/backend/optimizer/prep/prepsecurity.c +++ b/src/backend/optimizer/prep/prepsecurity.c @@ -37,7 +37,7 @@ typedef struct } security_barrier_replace_vars_context; static void expand_security_qual(PlannerInfo *root, List *tlist, int rt_index, - RangeTblEntry *rte, Node *qual); + RangeTblEntry *rte, Node *qual, bool targetRelation); static void security_barrier_replace_vars(Node *node, security_barrier_replace_vars_context *context); @@ -63,6 +63,7 @@ expand_security_quals(PlannerInfo *root, List *tlist) Query *parse = root->parse; int rt_index; ListCell *cell; + bool targetRelation = false; /* * Process each RTE in the rtable list. */ @@ -98,6 +99,12 @@ expand_security_quals(PlannerInfo *root, List *tlist) { RangeTblEntry *newrte = copyObject(rte); + /* + * We need to let expand_security_qual know if this is the target + * relation, as it has additional work to do in that case. + */ + targetRelation = true; + parse->rtable = lappend(parse->rtable, newrte); parse->resultRelation = list_length(parse->rtable); @@ -147,7 +154,8 @@ expand_security_quals(PlannerInfo *root, List *tlist) rte->securityQuals = list_delete_first(rte->securityQuals); ChangeVarNodes(qual, rt_index, 1, 0); - expand_security_qual(root, tlist, rt_index, rte, qual); + expand_security_qual(root, tlist, rt_index, rte, qual, + targetRelation); } } } @@ -160,7 +168,7 @@ expand_security_quals(PlannerInfo *root, List *tlist) */ static void expand_security_qual(PlannerInfo *root, List *tlist, int rt_index, - RangeTblEntry *rte, Node *qual) + RangeTblEntry *rte, Node *qual, bool targetRelation) { Query *parse = root->parse; Oid relid = rte->relid; @@ -256,6 +264,16 @@ expand_security_qual(PlannerInfo *root, List *tlist, int rt_index, } /* + * We need to handle the case where this is the target relation + * explicitly since it won't have any row marks, because we still + * need to lock the records coming back from the with-security-quals + * subquery. This might not appear obvious, but it matches what + * we're doing above and keeps FDWs happy too.
+ */
+ if (targetRelation)
+     applyLockingClause(subquery, 1, LCS_FORUPDATE,
+                        LockWaitBlock, false);
+
  /*
   * Replace any variables in the outer query that refer to the
   * original relation RTE with references to columns that we will
   * expose in the new subquery, building the subquery's targetlist

diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out index 21817d8..f41bef1 100644 --- a/src/test/regress/expected/rowsecurity.out +++ b/src/test/regress/expected/rowsecurity.out @@ -1034,22 +1034,25 @@ EXPLAIN (COSTS OFF) EXECUTE p2(2); -- SET SESSION AUTHORIZATION rls_regress_user1; EXPLAIN (COSTS OFF) UPDATE t1 SET b = b || b WHERE f_leak(b);

-                 QUERY PLAN
---------------------------------------------
+                  QUERY PLAN
+---------------------------------------------------
  Update on t1 t1_3
    ->  Subquery Scan on t1
          Filter: f_leak(t1.b)
-        ->  Seq Scan on t1 t1_4
-              Filter: ((a % 2) = 0)
+        ->  LockRows
+              ->  Seq Scan on t1 t1_4
+                    Filter: ((a % 2) = 0)
    ->  Subquery Scan on t1_1
          Filter: f_leak(t1_1.b)
-        ->  Seq Scan on t2
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
Stephen, I meant it to go to the list, but hit the wrong button. On Feb 17, 2015, at 7:01 PM, Stephen Frost sfr...@snowman.net wrote: Neil, I noticed that you email'd me directly on this reply. Not sure if you intended to or not, but I'm fine with my response going to the list. * Neil Tiffin (ne...@neiltiffin.com) wrote: On Feb 17, 2015, at 1:10 PM, Stephen Frost sfr...@snowman.net wrote: If the DB account isn't a superuser then everything changes. There's no point fighting with the OS user- they can run some other PG binary which they've copied over locally and run SQL with that, or copy all the files over to another server which they have complete access to. The fact that they can also connect to the DB and run SQL isn't really an issue. That's the point. If this environment matters, then the db superuser would not be an authorized os superuser (with root or even close privileges). And no, they could not be running some other binary. One way to do pg superuser auditing is to utilize a file (outside of the pg data directory, which probably violates something in pg) like /etc/pg_su_audit that has os root rw and r for all others (or the equivalent for other os's) containing a URL. If this file is present, send all db superuser usage to the URL. If this file is present with the wrong privileges, then don't start pg. Done. No configuration in pg config files, no GUCs, nothing for the pg superuser to mess around with, no tables, no ability for any pg superuser to configure or control. This approach doesn't violate anything in PG and can be used with any of the pgaudit approaches being discussed. The fact that it's postgresql.conf, which represents GUCs, doesn't change anything regarding what you're getting at. It changes everything: pg superusers have complete control of files in the pg data directory. If set up correctly, pg superusers have no control in /etc. If they can copy or install a PG binary, then the OS auditing and security failed. PG code need not be concerned. Sure, someone could still break access to the URL, but again, that is not PG's concern. Other security and auditing would have responsibility to pick that up. It is a really simple use case: record everything any db superuser does and send it to the audit system. Done. Then a designated role can control what gets audited in production. As long as everything the db superuser does can be written to an audit log, then it no longer matters technically if the db superuser can change the rest of the auditing. If they do and it violates policy, then when the audit picks it up, they lose their job, plus more depending on the environment. The issue is really around what we claim to provide with this auditing. We can't claim to provide *full* superuser auditing with any of these approaches since the superuser can disable auditing. We can, however, provide auditing up until that happens, which is likely to be sufficient in many environments. For those environments where full superuser auditing is required, an independent system must be used. Of course, it's important that any auditing mechanism which is used to audit superusers be impossible for the superuser to modify after the fact, meaning that syslog or similar needs to be used. I'm still confused, since you do not differentiate between db superuser and os superuser, and what do you mean by an independent system? With the scheme I described above, how does the db superuser disable auditing without os root privileges? If they can, then pg security is fundamentally broken, and I don't believe it is.
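A sketch of the startup check being described; the path, its policy, and the function itself are assumptions of this illustration, not anything PostgreSQL provides:

    #include <stdbool.h>
    #include <sys/stat.h>

    #define PG_SU_AUDIT_FILE "/etc/pg_su_audit"     /* assumed path */

    /*
     * Sketch: if the file exists, insist it is root-owned and not
     * writable by anyone else, and refuse to start otherwise.
     * Returns true when superuser auditing should be switched on.
     * (ereport/errmsg as in PostgreSQL's elog API.)
     */
    static bool
    su_audit_configured(void)
    {
        struct stat st;

        if (stat(PG_SU_AUDIT_FILE, &st) != 0)
            return false;       /* no file: feature is off */

        if (st.st_uid != 0 || (st.st_mode & (S_IWGRP | S_IWOTH)) != 0)
            ereport(FATAL,
                    (errmsg("\"%s\" exists but is not owned and writable by root only",
                            PG_SU_AUDIT_FILE)));

        return true;    /* read the URL and ship all superuser activity to it */
    }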
How can an independent system monitor what commands are issued inside the database? I understand my comments do not cover what is being proposed or worked on, and that is fine. But saying it does not have value because the superuser could disable any system in pg is wrong IMO. Being able to reliably log db superusers without their ability to change the logging would be a fundamentally good security tool as companies become more serious about internal security. This is, and will become more, important since a lot of people consider insider breaches the biggest security challenge today. Neil -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Expanding the use of FLEXIBLE_ARRAY_MEMBER for declarations like foo[1]
On 2015-02-18 17:15:18 +0900, Michael Paquier wrote: - I don't think that the t_bits fields in htup_details.h should be updated either. Why not? Any not-broken code should already use MinHeapTupleSize and similar macros. Changing t_bits impacts HeapTupleHeaderData, ReorderBufferTupleBuf and similarly a couple of redo routines in heapam.c using HeapTupleHeaderData in a couple of structures not placing it at the end (the compiler complains). The compiler will complain if you use a FLEXIBLE_ARRAY_MEMBER in the middle of a struct, but not when you embed a struct that uses it into the middle of another struct. At least gcc doesn't, and I think it'd be utterly broken if another compiler did that. If there's a compiler that does so, we need to make it define FLEXIBLE_ARRAY_MEMBER to 1. Code embedding structs using *either* [FLEXIBLE_ARRAY_MEMBER] or [1] for variable length obviously has to take care where the variable-length part goes. And that's what the structs you removed were doing - that's where the t_bits et al go: { ... HeapTupleHeaderData header; char data[MaxHeapTupleSize]; ... } the 'data' bit is just the t_bits + data together. And similar in - struct - { - struct varlena hdr; - char data[TOAST_MAX_CHUNK_SIZE]; /* make struct big enough */ - int32 align_it; /* ensure struct is aligned well enough */ - } chunk_data; The 'data' is where the varlena's vl_dat stuff is stored. Places using a variable-length variable not at the end of a structure are changed with a workaround, without impacting what those features do. I vote for rejecting most of this, except a (corrected version) of the pg_authid change. Which doesn't really belong to the rest of the patch anyway ;) Changing bytea to use FLEXIBLE_ARRAY_MEMBER requires those changes, or at least some changes in this area, as something with FLEXIBLE_ARRAY_MEMBER can only be placed at the end of a structure. Again, I think you're confusing things that may not be done in a single struct, and structs that are embedded in other places. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw foreign keys with default sequence
Tim Kane tim.k...@gmail.com wrote: CREATE MATERIALISED VIEW local.devices; CREATE test_table (device_id bigint FOREIGN KEY (device_id) REFERENCES local.devices (device_id) ); ERROR: referenced relation devices is not a table In the future, please show code that you have actually run. In this case it's pretty easy to know how to answer, but in others it may really draw out the process of helping you. At this time materialized views do not support constraints, and may not be referenced in foreign key definitions. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Odd behavior of updatable security barrier views on foreign tables
Etsuro, * Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote: On 2015/02/18 7:44, Stephen Frost wrote: Attached is a patch which should address this. Would love your (or anyone else's) feedback on it. It appears to address the issue which you raised and the regression test changes are all in-line with inserting a LockRows into the subquery, as anticipated. I've looked into the patch. * The patch applies to the latest head, 'make' passes successfully, but 'make check' fails in the rowsecurity test. Apologies for not being clear- the patch was against 9.4, where it passes all the regression tests (at least for me- if you see differently, please let me know!). * I found one place in expand_security_qual that I'm concerned about: + if (targetRelation) + applyLockingClause(subquery, 1, LCS_FORUPDATE, + false, false); ISTM that it'd be better to use LockWaitBlock as the fourth argument of applyLockingClause. LockWaitBlock isn't in 9.4. :) Otherwise, I'd agree, and it's what I plan to do for master. Other than that, the patch looks good to me. Great, thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
Neil, * Neil Tiffin (ne...@neiltiffin.com) wrote: I meant it to go to the list, but hit the wrong button. No problem. On Feb 17, 2015, at 7:01 PM, Stephen Frost sfr...@snowman.net wrote: I noticed that you email'd me directly on this reply. Not sure if you intended to or not, but I'm fine with my response going to the list. This approach doesn't violate anything in PG and can be used with any of the pgaudit approaches being discussed. The fact that it's postgresql.conf, which represents GUCs, doesn't change anything regarding what you're getting at. It changes everything, pg superusers have complete control of files in the pg data directory. If set up correctly pg superusers have no control in /etc. If set up correctly, postgresql.conf is in /etc. :) That's distribution specific though. However, postgresql.auto.conf ends up causing problems since it can't be disabled and it's forced to live in the PG data directory. Even so though, your argument was that, as long as the system is set up to be auditing initially and whatever action the superuser takes to disable auditing will be audited, and disabling auditing is against policy, then it's sufficient to meet the requirement. That's true in either case. The issue is really around what we claim to provide with this auditing. We can't claim to provide *full* superuser auditing with any of these approaches since the superuser can disable auditing. We can, however, provide auditing up until that happens, which is likely to be sufficient in many environments. For those environments where full superuser auditing is required, an independent system must be used. Of course, it's important that any auditing mechanism which is used to audit superusers be impossible for the superuser to modify after the fact, meaning that syslog or similar needs to be used. I'm still confused, since you do not differentiate between db superuser and os superuser, and what do you mean by an independent system? When I'm talking about 'superuser' here, it's the PG superuser, not the OS superuser. What I mean by an independent system is a process which is not owned by the same user that the database is running as, and isn't owned by the user who is connecting, that facilitates the connection from the user to PG, but which logs everything that happens on that connection. With the scheme I described above, how does the db superuser disable auditing without os root privileges? If they can, then pg security is fundamentally broken, and I don't believe it is. The DB superuser can modify the running process, through mechanisms as simple as changing GUCs to creating functions in untrusted languages (including C) and then using them to change or break more-or-less anything that's happening in the backend. How can an independent system monitor what commands are issued inside the database? It can log everything which happens on the connection between the user and the database because it's in the middle. I understand my comments do not cover what is being proposed or worked on and that is fine. But saying it does not have value because the superuser could disable any system in pg, is wrong IMO. Being able to reliably log db superusers without their ability to change the logging would be a fundamentally good security tool as companies become more serious about internal security. This is, and will be more, important since a lot of people consider insider breaches the biggest security challenge today.
It'd be a great tool, certainly, but it's not actually possible to do completely inside the DB process given the capabilities the PG superuser has. Being able to create and run functions in untrusted languages means that the superuser can completely hijack the process; that's why those languages are untrusted. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello, I think we should change the xlog format so that the block_id (which currently is XLR_BLOCK_ID_DATA_SHORT/LONG or an actual block id) isn't the block id but something like XLR_CHUNK_ID. Which is used as is for XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED, XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the block id following the chunk id. Yes, that'll increase the amount of data for a backup block by 1 byte, but I think that's worth it. I'm pretty sure we will be happy about the added extensibility pretty soon. To clarify my understanding of the above change: instead of a block id to reference different fragments of an xlog record, a single-byte field chunk_id should be used. chunk_id will be the same as XLR_BLOCK_ID_DATA_SHORT/LONG for main data fragments. But for block references, it will store the following values in order to record information about the backup blocks. #define XLR_CHUNK_BKP_COMPRESSED 0x01 #define XLR_CHUNK_BKP_WITH_HOLE 0x02 ... The new xlog format should look as follows:

Fixed-size header (XLogRecord struct)
Chunk_id (add a field before the id field in the XLogRecordBlockHeader struct)
XLogRecordBlockHeader
Chunk_id
XLogRecordBlockHeader
...
Chunk_id (rename the id field of the XLogRecordDataHeader struct)
XLogRecordDataHeader[Short|Long]
block data
block data
...
main data

I will post a patch based on this. Thank you, Rahila Syed -Original Message- From: Andres Freund [mailto:and...@2ndquadrant.com] Sent: Monday, February 16, 2015 5:26 PM To: Syed, Rahila Cc: Michael Paquier; Fujii Masao; PostgreSQL mailing lists Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes On 2015-02-16 11:30:20 +0000, Syed, Rahila wrote: - * As a trivial form of data compression, the XLOG code is aware that - * PG data pages usually contain an unused hole in the middle, which - * contains only zero bytes. If hole_length > 0 then we have removed - * such a hole from the stored data (and it's not counted in the - * XLOG record's CRC, either). Hence, the amount of block data actually - * present is BLCKSZ - hole_length bytes. + * Block images are able to do several types of compression: + * - When wal_compression is off, as a trivial form of compression, the + * XLOG code is aware that PG data pages usually contain an unused hole + * in the middle, which contains only zero bytes. If length < BLCKSZ + * then we have removed such a hole from the stored data (and it is + * not counted in the XLOG record's CRC, either). Hence, the amount + * of block data actually present is length bytes. The hole offset + * on page is defined using hole_offset. + * - When wal_compression is on, block images are compressed using a + * compression algorithm without their hole to improve the compression + * process of the page. length corresponds in this case to the length + * of the compressed block. hole_offset is the hole offset of the page, + * and the length of the uncompressed block is defined by raw_length, + * whose data is included in the record only when compression is enabled + * and with_hole is set to true, see below. + * + * is_compressed is used to identify if a given block image is compressed + * or not. The maximum page size allowed on the system being 32k, the hole + * offset cannot be more than 15 bits long, so the last free bit is used to + * store the compression state of the block image.
+ * If the maximum page size allowed is increased to a value higher than that, + * we should consider increasing this structure size as well, but this would + * increase the length of the block header in WAL records with alignment. + * + * with_hole is used to identify the presence of a hole in a block image. + * As the length of a block cannot be more than 15 bits long, the extra bit in + * the length field is used for this identification purpose. If the block image + * has no hole, it is ensured that the raw size of a compressed block image is + * equal to BLCKSZ, hence the contents of XLogRecordBlockImageCompressionInfo + * are not necessary. */ typedef struct XLogRecordBlockImageHeader { - uint16 hole_offset; /* number of bytes before hole */ - uint16 hole_length; /* number of bytes in hole */ + uint16 length:15, /* length of block data in record */ + with_hole:1; /* status of hole in the block */ + + uint16 hole_offset:15, /* number of bytes before hole */ + is_compressed:1; /* compression status of image */ + + /* Followed by the data related to compression if block is compressed */ } XLogRecordBlockImageHeader; Yikes, this is ugly. I think we should change the xlog format so that the block_id (which currently is XLR_BLOCK_ID_DATA_SHORT/LONG or an actual block id) isn't the block id but something like XLR_CHUNK_ID.
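For reference, a sketch of what the proposed chunk ids could look like; only the DATA_SHORT/LONG values exist today (as XLR_BLOCK_ID_DATA_SHORT/LONG), and all of the BKP names and values below are assumptions for illustration:

    /*
     * Sketch of the chunk-id scheme under discussion.  The leading
     * byte of each record fragment becomes a chunk id; backup-block
     * chunks are followed by the block id they describe.
     */
    #define XLR_CHUNK_ID_DATA_SHORT     255 /* as XLR_BLOCK_ID_DATA_SHORT */
    #define XLR_CHUNK_ID_DATA_LONG      254 /* as XLR_BLOCK_ID_DATA_LONG */
    #define XLR_CHUNK_BKP_WITH_HOLE     253 /* assumed value */
    #define XLR_CHUNK_BKP_COMPRESSED    252 /* assumed value */
    #define XLR_CHUNK_BKP_REFERENCE     251 /* assumed value */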
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
Stephen Frost wrote: Abhijit, * Abhijit Menon-Sen (a...@2ndquadrant.com) wrote: I'm confused and unhappy about your characterisation of the state of this patch. You make it seem as though there was broad consensus about the changes that were needed, and that I left the patch sitting in the archives for a long time without addressing important issues. The specific issue which I was referring to there was the #ifdef's for the deparse branch. I thought it was pretty clear, based on the feedback from multiple people, that all of that really needed to be taken out as we don't commit code to the main repo which has such external dependencies. We can remove the #ifdef lines as soon as DDL deparse gets committed, of course. There is no external dependency here, only a dependency on a patch that's being submitted in parallel. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Expanding the use of FLEXIBLE_ARRAY_MEMBER for declarations like foo[1]
Andres Freund and...@2ndquadrant.com writes: On 2015-02-16 21:34:57 -0500, Tom Lane wrote: Also, if we want to insist that these fields be accessed through heap_getattr, I'd be inclined to put them inside the #ifdef CATALOG_VARLEN to enforce that. That we definitely should do. It's imo just a small bug that it was omitted here. I'll fix it, but not backpatch unless you prefer? Agreed, that's really independent of FLEXIBLE_ARRAY_MEMBER, so fix it as its own patch. I also agree that back-patching is probably not appropriate. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)
On 15 February 2015 at 02:34, David Steele da...@pgmasters.net wrote: I've posted a couple of messages over the last few weeks about the work I've been doing on the pg_audit extension. The lack of response could be due to either universal acclaim or complete apathy, but in any case I think this is a very important topic so I want to give it another try. You mentioned you had been following the thread for some time and yet had not contributed to it. Did that indicate your acclaim for the earlier patch, or was that apathy? I think neither. People have been working on this feature for 9 months now, so you having to wait 9 days for a response is neither universal acclaim, nor apathy. I've waited much longer than that for Stephen to give the review he promised, but have not bad-mouthed him for that wait, nor do I do so now. In your first post you had removed the authors' email addresses, so they were likely unaware of your post. I certainly was. I've extensively reworked the code that was originally submitted by 2ndQuadrant. This is not an indictment of their work, but rather an attempt to redress concerns that were expressed by members of the community. I've used core functions to determine how audit events should be classified, and simplified and tightened the code wherever possible. I've removed deparse and event triggers and opted for methods that rely only on existing hooks. In my last message I provided numerous examples of configuration, usage, and output which I hoped would alleviate concerns of complexity. I've also written a ton of unit tests to make sure that the code works as expected. Some people that have contributed ideas to this patch are from 2ndQuadrant, some are not. The main point is that we work together on things, rather than writing a slightly altered version and then claiming credit. If you want to help, please do. We give credit where it's due, not to whoever touched the code last in some kind of bidding war. If we let this happen, we'd generate a flood of confusing patch versions and little would ever get committed. Let's keep to one thread and work to include everybody's ideas; then we'll get something useful committed. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, RemoteDBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Expanding the use of FLEXIBLE_ARRAY_MEMBER for declarations like foo[1]
On 2015-02-16 21:34:57 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2015-02-17 05:51:22 +0900, Michael Paquier wrote: diff --git a/src/include/catalog/pg_authid.h b/src/include/catalog/pg_authid.h index e01e6aa..d8789a5 100644 --- a/src/include/catalog/pg_authid.h +++ b/src/include/catalog/pg_authid.h @@ -56,8 +56,10 @@ CATALOG(pg_authid,1260) BKI_SHARED_RELATION BKI_ROWTYPE_OID(2842) BKI_SCHEMA_MACRO int32 rolconnlimit; /* max connections allowed (-1=no limit) */ /* remaining fields may be null; use heap_getattr to read them! */ - text rolpassword; /* password, if any */ timestamptz rolvaliduntil; /* password expiration time, if any */ +#ifdef CATALOG_VARLEN + text rolpassword; /* password, if any */ +#endif } FormData_pg_authid; That change IIRC is wrong, because it'll make rolvaliduntil NOT NULL (any column that's fixed-width and has only fixed-width columns before it is marked as such). You were muttering about a BKI_FORCE_NOT_NULL option ... for symmetry, maybe we could add BKI_FORCE_NULL as well, and then use that for cases like this? Yea, I guess it'd not be too hard. Also, if we want to insist that these fields be accessed through heap_getattr, I'd be inclined to put them inside the #ifdef CATALOG_VARLEN to enforce that. That we definitely should do. It's imo just a small bug that it was omitted here. I'll fix it, but not backpatch unless you prefer? I'm generally -1 on reordering any catalog columns as part of this patch. There should be zero user-visible change from it IMO. However, if we stick both those columns inside the ifdef, we don't need to reorder. Agreed. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
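For illustration, a sketch of the pg_authid.h tail that suggestion leads to, keeping the existing column order so no reordering is needed:

    /* remaining fields may be null; use heap_getattr to read them! */
    #ifdef CATALOG_VARLEN           /* variable-length fields start here */
        text        rolpassword;    /* password, if any */
        timestamptz rolvaliduntil;  /* password expiration time, if any */
    #endif
    } FormData_pg_authid;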
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
Abhijit, * Abhijit Menon-Sen (a...@2ndquadrant.com) wrote: At 2015-02-17 13:01:46 -0500, sfr...@snowman.net wrote: I have to admit that I'm confused by this. Patches don't stabilise through sitting in the archives, they stabilise when the comments are being addressed, the patch updated, and further comments are addressing less important issues. The issues which Robert and I had both commented on didn't appear to be getting addressed. I'm confused and unhappy about your characterisation of the state of this patch. You make it seem as though there was broad consensus about the changes that were needed, and that I left the patch sitting in the archives for a long time without addressing important issues. The specific issue which I was referring to there was the #ifdef's for the deparse branch. I thought it was pretty clear, based on the feedback from multiple people, that all of that really needed to be taken out as we don't commit code to the main repo which has such external dependencies. That wasn't meant as a characterization of the patch itself but rather a comment on the state of the process and that I, at least, had been hoping for a new version which addressed those bits. You revised and refined your GRANT proposal in stages, and I tried to adapt the code to suit. I'm sorry that my efforts were not fast enough or responsive enough to make you feel that progress was being made. But nobody else commented in detail on the GRANT changes except to express general misgivings, and nobody else even disagreed when I inferred, in light of the lack of consensus that Robert pointed out, that the code was unlikely to make it into 9.5. Progress was being made and I certainly appreciate your efforts. I don't think anyone would want to stand up and say it's going to be in 9.5 or not be in 9.5 which is likely why you didn't get any reaction from your comment about it being unlikely. I'm certainly hopeful that something gets in along these lines as it's a pretty big capability that we're currently missing. Given that I've maintained the code over the past year despite its being rejected in an earlier CF, and given the circumstances outlined above, I do not think it's reasonable to conclude after a couple of weeks without a new version that it was abandoned. As I had mentioned earlier, there are people who already use pgaudit as-is, and complain if I break it. For my part, at least, I didn't intend to characterize it as abandoned but rather that it simply didn't seem to be moving forward, perhaps due to a lack of time to work on it or other priorities; I didn't mean to imply that you wouldn't be continuing to maintain it. As for the comment about people depending on it, I'm not sure what that's getting at, but I don't think that's really a consideration for us as it relates to having a contrib module to provide this capability. We certainly want to consider what capabilities users are looking for and make sure we cover those cases, if possible, but this is at a development stage with regard to core and therefore things may break for existing users. If that's an issue for your users then it would probably be good to have a clean distinction between the stable code you're providing to them for auditing and what's being submitted for inclusion in core, with an expectation that users will need to make some changes when they switch to the version which is included in core. Anyway, I guess there is no such thing as a constructive history discussion, so I'll drop it. 
It's not my intent to continue this as I certainly agree that it seems unlikely to be a constructive use of time, but I wanted to reply and clarify that it wasn't my intent to characterize pgaudit as abandoned and to apologize for my comments coming across that way. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Commit fest 2015-12 enters money time
On Wed, Feb 18, 2015 at 12:34 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Feb 18, 2015 at 5:55 AM, Corey Huinker corey.huin...@gmail.com wrote: What is required to get the New Patch superpower? I'm also in need of it. CCing Magnus... I am not sure myself, but I would imagine that the commit fest needs to be Open and not In Process to be able to add patches. Sorry, I was out on a long travel day with lots of airplane time. I had this thread on my watch list but hadn't time to fix it. So the basic reason is that there was no Open commitfest available. Only Commitfest Managers are allowed to break the rules and add patches to the wrong commitfest - this is why Michael was able to do that, and pretty much nobody else. There's supposed to always be one open commitfest. Since none was added when 2015-02 was switched to In Progress (probably because I didn't actually tell Michael about it), there was nowhere to post new patches. I have fixed this now by creating 2015-06, which is open for submissions. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] Commit fest 2015-12 enters money time
On Tue, Feb 17, 2015 at 12:34 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Tue, Feb 17, 2015 at 7:39 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: 2015-02-16 12:25 GMT+09:00 Michael Paquier michael.paqu...@gmail.com: On Fri, Feb 13, 2015 at 10:06 AM, Michael Paquier michael.paqu...@gmail.com wrote: In order to move on to the next CF, I am going to go through all the remaining patches and update their status accordingly. And sorry for slacking a bit regarding that. If you wish to complain, of course feel free to post messages on this thread or on the thread related to the patch itself. As all the entries have been either moved or updated, CF 2014-12 is closed, and CF 2015-02 has been changed to In Progress. I tried to add my name to the CF entry for the Join pushdown support for foreign tables patch; however, it was unintentionally marked as rejected on the last CF, even though it should be returned with feedback. https://commitfest.postgresql.org/3/20/ Is it possible to move it to the current CF? I don't recall you can... You're correct, you can't. It would probably make sense for at least a CF manager to be able to do that - added to my TODO. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] deparsing utility commands
Hi, On 2015-02-15 01:48:15 -0300, Alvaro Herrera wrote: This is a repost of the patch to add CREATE command deparsing support to event triggers. It now supports not only CREATE but also ALTER and other commands such as GRANT/REVOKE, COMMENT ON and SECURITY LABEL. This patch series is broken up in a rather large number of patches, but my intention would be to commit separately only the first 6 patches; the remaining parts are split up only so that it's easy to see how deparsing of each part is implemented, and would be committed as a single changeset (but before that I need a bunch of extra fixes and additions to other parts.) I think you should just go ahead and commit 0001, 0002, 0006. Those have previously been discussed and seem like a good idea and uncontroversial even if the rest of the series doesn't go in. I think the GRANT/REVOKE, COMMENT ON and SECURITY LABEL changes are good independently as well, but some concerns about shared objects have previously been raised. I think the answer in the patches - which is to raise events only when database-local objects are affected - makes sense, but others might disagree. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
On 2015-02-18 16:59:26 +0530, Amit Kapila wrote: On Tue, Feb 17, 2015 at 9:52 PM, Andres Freund and...@2ndquadrant.com wrote: A query whose runtime is dominated by a sequential scan (+ attached filter) is certainly going to require a bigger prefetch size than one that does other expensive stuff. Imagine parallelizing SELECT * FROM largetable WHERE col = low_cardinality_value; and SELECT * FROM largetable JOIN gigantic_table ON (index_nestloop_condition) WHERE col = high_cardinality_value; The first query will be a simple sequential scan, and disk reads on largetable will be the major cost of executing it. In contrast, the second query might very well sensibly be planned as a parallel sequential scan with the nested loop executing in the same worker. But the cost of the sequential scan itself will likely be completely drowned out by the nestloop execution - index probes are expensive/unpredictable. I think the work/task given to each worker should be as granular as possible to make it more predictable. I think the better way to parallelize such a work (join query) is that the first worker does the sequential scan and filtering on the large table and then passes it to the next worker for doing the join with gigantic_table. I'm pretty sure that'll result in rather horrible performance. IPC is rather expensive; you want to do as little of it as possible. I think it makes sense to think of a set of tasks in which workers can assist. So you have a query tree which is just one query tree, with no copies of the nodes, and then there are certain places in that query tree where a worker can jump in and assist that node. To do that, it will have a copy of the node, but that doesn't mean that all of the stuff inside the node becomes shared data at the code level, because that would be stupid. My only problem with that description is that I think workers will have to work on more than one node - it'll be entire subtrees of the executor tree. There could be some cases where it could be beneficial for a worker to process a sub-tree, but I think there will be more cases where it will just work on a part of a node and send the result back to either the master backend or another worker for further processing. I think many parallelism projects start out that way, and then notice that it doesn't parallelize very efficiently. The most extreme example, but a common one, is aggregation over large amounts of data - unless you want to ship huge amounts of data between processes to parallelize it, you have to do the sequential scan and the pre-aggregate step (that e.g. selects count() and sum() to implement an avg over all the workers) inside one worker. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
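A toy illustration of the pre-aggregate point (plain C, nothing to do with any actual patch): only the tiny per-worker state crosses the IPC boundary, not the scanned rows.

    #include <stdint.h>

    /* partial-aggregate state each worker builds over its rows */
    typedef struct AvgState
    {
        int64_t count;
        double  sum;
    } AvgState;

    /* run in each worker, once per qualifying row */
    static void
    avg_accum(AvgState *state, double value)
    {
        state->count++;
        state->sum += value;
    }

    /* run in the leader, over one AvgState per worker */
    static double
    avg_combine(const AvgState *states, int nworkers)
    {
        int64_t count = 0;
        double  sum = 0.0;

        for (int i = 0; i < nworkers; i++)
        {
            count += states[i].count;
            sum += states[i].sum;
        }
        return (count > 0) ? sum / (double) count : 0.0;
    }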
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On 02/17/2015 08:21 PM, Peter Eisentraut wrote: On 1/20/15 6:32 PM, David G Johnston wrote: In fact, as far as the database knows, the values provided to this function do represent an entire population and such a correction would be unnecessary. I guess it boils down to whether future queries are considered part of the population or whether the population changes upon each query being run and thus we are calculating the ever-changing population variance. I think we should be calculating the population variance. We are clearly taking the population to be all past queries (from the last reset point). Otherwise calculating min and max wouldn't make sense. The difference is likely to be very small in any case where you actually want to examine the standard deviation, so I feel we're rather arguing about how many angels can fit on the head of a pin, but if this is the consensus I'll change it. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
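For reference, the computational difference really is a single divisor; a sketch of an online accumulator (Welford's method; illustrative, not a quote of the patch):

    #include <stdint.h>

    typedef struct
    {
        uint64_t    n;
        double      mean;
        double      sum_sq_dev;     /* running sum of squared deviations */
    } StatAccum;

    static void
    stat_accum(StatAccum *a, double x)
    {
        double  delta = x - a->mean;

        a->n++;
        a->mean += delta / (double) a->n;
        a->sum_sq_dev += delta * (x - a->mean);
    }

    /* population variance divides by n; sample variance would use n - 1 */
    static double
    stat_pop_variance(const StatAccum *a)
    {
        return (a->n > 0) ? a->sum_sq_dev / (double) a->n : 0.0;
    }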
Re: [HACKERS] Proposal : REINDEX xxx VERBOSE
VACUUM [FULL] [FREEZE] ... [ANALYZE] [tname [(cname, ...)]] |
VACUUM [({FULL [bool]|FREEZE [bool]|...}[,...])] [tname [(cname, ...)]] |
VACUUM [tname [(cname, ...)]] [[WITH] ({FULL [bool]|FREEZE [bool]|...})]
REINDEX [{INDEX|TABLE|...}] name [[WITH] (VERBOSE [bool]|...)]
EXPLAIN [[WITH] ({ANALYZE [bool]|VERBOSE [bool]|... [,...]})] statement

I don't think the (OptionName [bool]) style like (VERBOSE on, FORCE on) is needed for the REINDEX command. The EXPLAIN command has such an option style because its FORMAT option can take values other than ON/TRUE or OFF/FALSE (e.g., TEXT, XML). But a REINDEX command option can only be ON or OFF, so I think the option name alone is good enough. Next, regarding the location of such an option: several maintenance commands like CLUSTER and VACUUM put options immediately after the command name. From a consistency perspective, I tend to agree with Robert to put the option immediately after the command name, as follows. REINDEX [(VERBOSE | FORCE)] {INDEX | ...} name; Btw, how long will the FORCE option remain available? Regards, --- Sawada Masahiko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)
On 2/18/15 8:25 AM, Simon Riggs wrote: On 15 February 2015 at 02:34, David Steele da...@pgmasters.net wrote: I've posted a couple of messages over the last few weeks about the work I've been doing on the pg_audit extension. The lack of response could be due to either universal acclaim or complete apathy, but in any case I think this is a very important topic so I want to give it another try. You mentioned you had been following the thread for some time and yet had not contributed to it. Did that indicate your acclaim for the earlier patch, or was that apathy? I think neither. In my case it actually was acclaim. I was happy with the direction things were going and had nothing in particular to add - and I didn't think a +1 from me was going to carry any weight with the community. I can see now that everyone's opinion matters here, so I'll be more active about weighing in when I think something is valuable. People have been working on this feature for 9 months now, so you having to wait 9 days for a response is neither universal acclaim, nor apathy. I've waited much longer than that for Stephen to give the review he promised, but have not bad-mouthed him for that wait, nor do I do so now. In your first post you had removed the authors' email addresses, so they were likely unaware of your post. I certainly was. I understand that, but with the CF closing I felt like I had to act. Abhijit's last comment on the thread was that he was no longer going to work on it in relation to 9.5. I felt that it was an important feature (and one that I have a lot of interest in), so that's when I got involved. I posted two messages, but I only addressed one of them directly to Abhijit. As you said, I'm new here and I'm still getting used to the way things are done. I've extensively reworked the code that was originally submitted by 2ndQuadrant. This is not an indictment of their work, but rather an attempt to redress concerns that were expressed by members of the community. I've used core functions to determine how audit events should be classified, and simplified and tightened the code wherever possible. I've removed deparse and event triggers and opted for methods that rely only on existing hooks. In my last message I provided numerous examples of configuration, usage, and output which I hoped would alleviate concerns of complexity. I've also written a ton of unit tests to make sure that the code works as expected. Some people that have contributed ideas to this patch are from 2ndQuadrant, some are not. The main point is that we work together on things, rather than writing a slightly altered version and then claiming credit. If you want to help, please do. We give credit where it's due, not to whoever touched the code last in some kind of bidding war. If we let this happen, we'd generate a flood of confusing patch versions and little would ever get committed. Agreed, and I apologize if I came off that way. It certainly wasn't my intention. I was hesitant because I had made so many changes and I wasn't sure how the authors would feel about it. I wrote to them privately to get their take on the situation. Let's keep to one thread and work to include everybody's ideas; then we'll get something useful committed. I'm a little confused about how to proceed here. I created a new thread because the other patch had already been rejected. How should I handle that? -- - David Steele da...@pgmasters.net signature.asc Description: OpenPGP digital signature
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
Hi Fujii, Thanks for taking a look at the patch. Comments below: On 2/18/15 6:11 AM, Fujii Masao wrote: On Wed, Feb 18, 2015 at 1:26 AM, David Steele da...@pgmasters.net wrote: On 2/17/15 10:23 AM, Simon Riggs wrote: I vote to include pgaudit in 9.5, albeit with any changes. In particular, David may have some changes to recommend, but I haven't seen a spec or a patch, just a new version of code (which isn't how we do things...). I submitted the new patch in my name under a separate thread Auditing extension for PostgreSQL (Take 2) (54e005cc.1060...@pgmasters.net) I played with the patch version of pg_audit a bit and have basic comments about its spec. pg_audit doesn't log BIND parameter values when a prepared statement is used. This seems to be an oversight in the patch - or is it intentional? It's actually intentional - following the model I talked about in my earlier emails, the idea is to log statements only. This also follows 2ndQuadrant's implementation. Logging values is interesting, but I'm sure the user would want to specify which columns to log, which I felt was beyond the scope of the patch. pg_audit cannot log a statement like SELECT 1, which doesn't access any database object. Is this intentional? I think there are many users who want to audit even such statements. I think I see how to make this work. I'll work on it for the next version of the patch. Imagine the case where you call a user-defined function which executes many nested statements. In this case, pg_audit logs only the top-level statement (i.e., the one issued directly by the client) every time a nested statement is executed. In fact, one call of such a UDF can produce lots of copies of the *same* log message. I think this is problematic. I agree - not sure how to go about addressing it, though. I've tried to cut down on the verbosity of the logging in general, but of course it can still be a problem. Using security definer and a different logging GUC for the defining role might work. I'll add that to my unit tests and see what happens. I appreciate your feedback! -- - David Steele da...@pgmasters.net
[HACKERS] array_push is just stupid ...
While hacking around with expanded arrays, I realized that array_push() is just a horribly badly designed function. There is no reason at all for the same C function to underlie both array_append and array_prepend: the function spends significant time and effort reverse-engineering which way it was called, and there's very little logic-in-common that would justify merging the two code paths. If we split it apart into separate array_append and array_prepend functions, I think we'd actually end up with less code. And it would certainly be noticeably faster because we'd not need to be doing get_fn_expr_argtype and get_element_type (the latter of which is a syscache lookup, hence not so cheap) on *both* arguments every time through. (Admittedly, some of that could be bought back with a bit more effort on locally caching the result, but it would be better to know which input is the array a priori.) Barring objections, I'm going to go fix this independently of the expanded-array changes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
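For anyone wanting to visualize the split, here is a minimal sketch (my illustration, not Tom's actual patch) of what a dedicated append-only entry point could look like. It reuses the existing array_set() helper; the function name is hypothetical. Because the array argument's position is known a priori, only the element argument needs a type lookup:

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/array.h"
    #include "utils/lsyscache.h"

    PG_FUNCTION_INFO_V1(array_append_sketch);

    Datum
    array_append_sketch(PG_FUNCTION_ARGS)
    {
        ArrayType  *arr = PG_GETARG_ARRAYTYPE_P(0);
        Datum       newelem = PG_GETARG_DATUM(1);
        Oid         elemtype = get_fn_expr_argtype(fcinfo->flinfo, 1);
        int16       elmlen;
        bool        elmbyval;
        char        elmalign;
        int         indx;

        if (!OidIsValid(elemtype))
            elog(ERROR, "could not determine data type of input");

        /*
         * Only the element argument needs a type lookup; which argument
         * is the array is known a priori, so nothing has to be
         * reverse-engineered (and this lookup could be cached in
         * fn_extra across calls).
         */
        get_typlenbyvalalign(elemtype, &elmlen, &elmbyval, &elmalign);

        if (ARR_NDIM(arr) == 1)
            indx = ARR_LBOUND(arr)[0] + ARR_DIMS(arr)[0];  /* one past the end */
        else if (ARR_NDIM(arr) == 0)
            indx = 1;           /* appending to an empty array */
        else
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_EXCEPTION),
                     errmsg("argument must be empty or one-dimensional array")));

        /* array_set() performs the insertion at the computed subscript */
        PG_RETURN_ARRAYTYPE_P(array_set(arr, 1, &indx, newelem, false,
                                        -1, elmlen, elmbyval, elmalign));
    }

An array_prepend counterpart would be the mirror image, computing indx = ARR_LBOUND(arr)[0] - 1 and looking up argument 0's type instead.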
Re: [HACKERS] Expanding the use of FLEXIBLE_ARRAY_MEMBER for declarations like foo[1]
On Thu, Feb 19, 2015 at 7:29 AM, Tom Lane t...@sss.pgh.pa.us wrote: Michael Paquier michael.paqu...@gmail.com writes: On Wed, Feb 18, 2015 at 10:09 PM, Andres Freund and...@2ndquadrant.com wrote: The compiler will complain if you use a FLEXIBLE_ARRAY_MEMBER in the middle of a struct but not when you embed a struct that uses it into the middle of another struct. At least gcc doesn't, and I think it'd be utterly broken if another compiler did that. If there's a compiler that does so, we need to make it define FLEXIBLE_ARRAY_MEMBER to 1. clang does complain on my OSX laptop regarding that ;) I'm a bit astonished that gcc doesn't consider this an error. Sure seems like it should. (Has anyone tried it on recent gcc?) Just tried with gcc 4.9.2 on an ArchLinux box and it does not complain after switching the declaration of varlena from [1] to FLEXIBLE_ARRAY_MEMBER in c.h on HEAD. But it does with clang 3.5.1 on the same box. I am entirely opposed to Andres' claim that we ought to consider compilers that do warn to be broken; if anything it's the other way around. I'm on board with that. Moreover, if we have any code that is assuming such cases are okay, it probably needs a second look. Isn't this situation effectively assuming that a variable-length array is fixed-length? AFAIK, switching a bunch of things to use FLEXIBLE_ARRAY_MEMBER has brought to light a couple of things that could be revisited:

1) tuptoaster.c, with this declaration of varlena:

struct
{
    struct varlena hdr;
    char        data[TOAST_MAX_CHUNK_SIZE]; /* make struct big enough */
    int32       align_it;      /* ensure struct is aligned well enough */
} chunk_data;

2) inv_api.c with this thing:

struct
{
    bytea       hdr;
    char        data[LOBLKSIZE];   /* make struct big enough */
    int32       align_it;      /* ensure struct is aligned well enough */
} workbuf;

3) heapam.c in three places with HeapTupleHeaderData:

struct
{
    HeapTupleHeaderData hdr;
    char        data[MaxHeapTupleSize];
} tbuf;

4) pg_authid.h with its use of rolpassword.

5) reorderbuffer.h with its use of HeapTupleHeaderData:

typedef struct ReorderBufferTupleBuf
{
    /* position in preallocated list */
    slist_node  node;

    /* tuple, stored sequentially */
    HeapTupleData tuple;
    HeapTupleHeaderData header;
    char        data[MaxHeapTupleSize];
} ReorderBufferTupleBuf;

Those issues can be grouped depending on where foo[1] is switched to FLEXIBLE_ARRAY_MEMBER, so I will try to get a set of patches depending on that. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
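To make the gcc-versus-clang point concrete, here is a minimal standalone test case (my sketch; varlena_like only mimics varlena, and the flexible array is written as [] directly, as if FLEXIBLE_ARRAY_MEMBER had expanded to the empty string):

    /*
     * flexarray.c - per the reports above, gcc 4.9.2 compiles this
     * silently, while clang 3.5.1 warns that a field with variable
     * sized type is not at the end of the struct.
     */
    #include <stdint.h>

    struct varlena_like
    {
        char        vl_len_[4];
        char        vl_dat[];      /* flexible array member */
    };

    static struct
    {
        struct varlena_like hdr;   /* flexible-array struct NOT at the end */
        char        data[100];     /* make struct big enough */
        int32_t     align_it;      /* ensure alignment */
    } chunk_data;

    int
    main(void)
    {
        (void) chunk_data;
        return 0;
    }

Per C11, a struct containing a flexible array member may not itself be a member of another struct, so a diagnostic here is within the standard's intent, even though gcc chooses not to emit one by default.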
Re: [HACKERS] Replication identifiers, take 4
On 16/02/15 10:46, Andres Freund wrote: On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote: At a quick glance, this basic design seems workable. I would suggest expanding the replication IDs to regular 4 byte oids. Two extra bytes is a small price to pay, to make it work more like everything else in the system. I don't know. Growing from 3 to 5 byte overhead per relevant record (or even 0 to 5 in case the padding is reused) is rather noticeable. If we later find it to be a limit (I seriously doubt that), we can still increase it in a major release without anybody really noticing. I agree that limiting the overhead is important. But I have a related thought about this - I wonder if it's worth trying to map this more directly to the CommitTsNodeId. I mean, it is currently using that for the actual storage, but I am thinking of having the Replication Identifiers be the public API and the CommitTsNodeId stuff be just a hidden implementation detail. This thought is based on the fact that CommitTsNodeId provides a somewhat meaningless number while Replication Identifiers give us a nice name for that number. And if we do that, then the type should perhaps be the same for both? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Exposing the stats snapshot timestamp to SQL
Hi there, On 30.1.2015 06:27, Matt Kelly wrote: Actually, I just did one more code review of myself, and somehow missed that I submitted the version with the wrong oid. The oid used in the first version is wrong (1) and was from before I read the docs on properly picking one. Attached is the fixed version. (hopefully with the right mime-type and wrong extension. Alas, gmail doesn't let you set mime-types; time to find a new email client...) I do have a question regarding the patch, although I see the patch is already marked as 'ready for committer' (sorry, somehow managed to miss the patch until now). I see the patch only works with the top-level snapshot timestamp, stored in globalStats, but since 9.3 (when the stats were split into per-db files) we track per-database timestamps too. Shouldn't we make those timestamps accessible too? It's not necessary for detecting an unresponsive statistics collector (if it's stuck, it's not writing anything, so the global timestamp will be old too), but when querying database-level stats it seems more appropriate to query the database-level timestamp too. But maybe that's not necessary, because to query database stats you have to be connected to that particular database, and that should write fresh stats, so the timestamps should not be very different. regards -- Tomas Vondra   http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow snapshot too old error, to prevent bloat
On Wed, Feb 18, 2015 at 10:57 PM, Kevin Grittner kgri...@ymail.com wrote: Magnus Hagander mag...@hagander.net wrote: On Feb 17, 2015 12:26 AM, Andres Freund and...@2ndquadrant.com wrote: On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote: But max_standby_streaming_delay, max_standby_archive_delay and hot_standby_feedback are among the most frequent triggers for questions and complaints that I/we see. Agreed. And a really bad one used to be vacuum_defer_cleanup_age, because of confusing units amongst other things. Which in turn seems fairly close to Kevin's suggestions, unfortunately. Particularly my initial suggestion, which was to base snapshot age on the number of transaction IDs assigned. Does this look any better to you if it is something that can be set to '20min' or '1h'? Just to restate, that would not automatically cancel the snapshots past that age; it would allow vacuum of any tuples which became dead that long ago, and would cause a snapshot too old message for any read of a page modified more than that long ago using a snapshot which was older than that. Yes, it would definitely look much better. My reference per above was exactly that - having a setting whose unit is a number of xids confused a lot of users and made it really hard to tune. Having something in time units is a lot easier to understand and tune for most people. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
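Restating the described mechanism as code, since it is easy to misread as eager cancellation: the check would be lazy, at page-read time, so a query fails only if it actually visits data that vacuum was allowed to clean up. Roughly (a sketch only - the GUC name, snapshot fields, helper, and error code are all illustrative, not from any committed patch):

    /* in a page-read path, after the buffer is pinned: */
    if (old_snapshot_threshold >= 0 &&          /* feature enabled */
        PageGetLSN(page) > snapshot->lsn &&     /* page changed after snapshot was taken */
        SnapshotPastThreshold(snapshot))        /* hypothetical age test */
        ereport(ERROR,
                (errcode(ERRCODE_SNAPSHOT_TOO_OLD),     /* hypothetical errcode */
                 errmsg("snapshot too old")));

A snapshot that never touches a subsequently modified page keeps running indefinitely, which is the basis for the claim that reporting queries over unmodified tables would be safe.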
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On 02/18/2015 08:34 PM, David Fetter wrote: On Tue, Feb 17, 2015 at 08:21:32PM -0500, Peter Eisentraut wrote: On 1/20/15 6:32 PM, David G Johnston wrote: In fact, as far as the database knows, the values provided to this function do represent an entire population and such a correction would be unnecessary. I guess it boils down to whether future queries are considered part of the population or whether the population changes upon each query being run and thus we are calculating the ever-changing population variance. I think we should be calculating the population variance. Why population variance and not sample variance? In distributions where the second moment about the mean exists, it's an unbiased estimator of the variance. In this, it's different from the population variance. Because we're actually measuring the whole population, and not a sample? cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
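Spelling out the two candidates for measurements x_1 ... x_n with mean xbar (standard definitions, added here for reference):

    population variance:  sigma^2 = (1/n)     * SUM_i (x_i - xbar)^2
    sample variance:      s^2     = (1/(n-1)) * SUM_i (x_i - xbar)^2

The n-1 divisor (Bessel's correction) makes s^2 an unbiased estimator of the variance of a larger population from which the x_i were sampled; if the recorded executions are themselves the entire population of interest, the 1/n form is the appropriate one - which is Andrew's point.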
Re: [HACKERS] anyarray
On 13.2.2015 16:20, Teodor Sigaev wrote: Some users of the intarray contrib module wish to use its features with other kinds of arrays, not only the int4 type. The suggested module generalizes intarray over other (though not all) types of pgsql. Anyarray also provides a calculation of the similarity of two one-dimensional arrays, similar to the smlar module. The anyarray module doesn't provide anything similar to intarray's query_int feature, because that feature is very hard to implement (it requires a new pseudo-type anyquery), it is close to impossible to make its operations extensible, and its complicated queries are hidden from pgsql's optimizer (as with jsquery). As far as I know, query_int isn't very popular for now. Hi Teodor, I started reviewing this patch today - first of all, kudos for having 944kB of regression tests in a 1MB patch ;-) I've noticed two unrelated files

../src/test/modules/dummy_seclabel/expected/dummy_seclabel.out
../src/test/modules/dummy_seclabel/sql/dummy_seclabel.sql

I suppose that's not intentional, right? -- Tomas Vondra   http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Expanding the use of FLEXIBLE_ARRAY_MEMBER for declarations like foo[1]
On 19/02/15 15:00, Tom Lane wrote: Michael Paquier michael.paqu...@gmail.com writes: On Thu, Feb 19, 2015 at 7:29 AM, Tom Lane t...@sss.pgh.pa.us wrote: Moreover, if we have any code that is assuming such cases are okay, it probably needs a second look. Isn't this situation effectively assuming that a variable-length array is fixed-length? AFAIK, switching a bunch of things to use FLEXIBLE_ARRAY_MEMBER has brought to light a couple of things that could be revisited:

1) tuptoaster.c, with this declaration of varlena:

struct
{
    struct varlena hdr;
    char        data[TOAST_MAX_CHUNK_SIZE]; /* make struct big enough */
    int32       align_it;      /* ensure struct is aligned well enough */
} chunk_data;

I'm pretty sure that thing ought to be a union, not a struct.

2) inv_api.c with this thing:

struct
{
    bytea       hdr;
    char        data[LOBLKSIZE];   /* make struct big enough */
    int32       align_it;      /* ensure struct is aligned well enough */
} workbuf;

And probably this too.

3) heapam.c in three places with HeapTupleHeaderData:

struct
{
    HeapTupleHeaderData hdr;
    char        data[MaxHeapTupleSize];
} tbuf;

And this, though I'm not sure if we'd have to change the size of the padding data[] member.

5) reorderbuffer.h with its use of HeapTupleHeaderData:

Hmm. Andres will have to answer for that one ;-) regards, tom lane

Curious, has this problem been raised with the gcc maintainers? Is this still a problem with gcc 5.0 (which is due to be released soon)? Cheers, Gavin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] INSERT ... ON CONFLICT UPDATE and logical decoding
Andres pointed out that the INSERT ... ON CONFLICT UPDATE patch doesn't work well with logical decoding. Basically, as things stand, there is no means of determining that an INSERT had an ON CONFLICT UPDATE clause from logical decoding. It isn't obvious that this matters - after all, the relevant WAL records are generally consistent with there being a vanilla INSERT (or vanilla UPDATE). If there was no possibility of concurrent conflicts, that might actually be the end of it, but of course there is. I guess that the best way of fixing this is exposing to output plugins that a super deletion is a REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE. This is kind of an internal ReorderBufferChangeType constant, because it means that the output plugin should probably just omit the tuple just inserted and now deleted. I tend to think that it would be best if the fact that a speculative insertion was a speculative insertion is not exposed. We just formalize that a REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE is a variant of REORDER_BUFFER_CHANGE_DELETE...it's implicit that things like triggers are not fired here (which may or may not matter, depending on the use case, I suppose). Would that be sufficiently flexible for the various logical replication use cases? I guess we are somewhat forced to push that kind of thing into output plugins, because doing so lets them decide how to handle this. It's a little unfortunate that this implementation detail is exposed to output plugins, though, which is why I'd be willing to believe that it'd be better to have transaction reassembly normalize the records such that a super deleted tuple was never exposed to output plugins in the first place...they'd only see a REORDER_BUFFER_CHANGE_INSERT when that was the definitive outcome of an UPSERT, or alternatively a REORDER_BUFFER_CHANGE_UPDATE when that was the definitive outcome. No need for output plugins to consider REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE at all. Thoughts? Can't say that I've given conflict resolution for multi-master systems a great deal of thought before now, so I might be quite off the mark here. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
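To make the proposal concrete, the visible change would roughly amount to one new value in the existing ReorderBufferChangeType enum from reorderbuffer.h (shown abridged; the new constant's placement and comment are my sketch of what is described above, not committed code):

    enum ReorderBufferChangeType
    {
        REORDER_BUFFER_CHANGE_INSERT,
        REORDER_BUFFER_CHANGE_UPDATE,
        REORDER_BUFFER_CHANGE_DELETE,
        /* ... existing INTERNAL_* values elided ... */

        /*
         * Proposed: deletion of a tuple speculatively inserted earlier in
         * the same transaction.  Output plugins could treat this as a
         * DELETE variant, or transaction reassembly could cancel it
         * against the preceding INSERT so plugins never see either.
         */
        REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE
    };

Under the normalizing alternative, plugins would only ever receive the existing INSERT/UPDATE/DELETE values and the new constant would stay private to reorderbuffer.c.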
Re: [HACKERS] anyarray
On 19.2.2015 03:14, Tomas Vondra wrote: I've noticed two unrelated files Meh, that should have been: I noticed the patch *removes* two unrelated files ...

../src/test/modules/dummy_seclabel/expected/dummy_seclabel.out
../src/test/modules/dummy_seclabel/sql/dummy_seclabel.sql

I suppose that's not intentional, right? -- Tomas Vondra   http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On Tue, Feb 17, 2015 at 08:21:32PM -0500, Peter Eisentraut wrote: On 1/20/15 6:32 PM, David G Johnston wrote: In fact, as far as the database knows, the values provided to this function do represent an entire population and such a correction would be unnecessary. I guess it boils down to whether future queries are considered part of the population or whether the population changes upon each query being run and thus we are calculating the ever-changing population variance. I think we should be calculating the population variance. Why population variance and not sample variance? In distributions where the second moment about the mean exists, it's an unbiased estimator of the variance. In this, it's different from the population variance. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Expanding the use of FLEXIBLE_ARRAY_MEMBER for declarations like foo[1]
Michael Paquier michael.paqu...@gmail.com writes: On Thu, Feb 19, 2015 at 7:29 AM, Tom Lane t...@sss.pgh.pa.us wrote: Moreover, if we have any code that is assuming such cases are okay, it probably needs a second look. Isn't this situation effectively assuming that a variable-length array is fixed-length? AFAIK, switching a bunch of things to use FLEXIBLE_ARRAY_MEMBER has brought to light a couple of things that could be revisited:

1) tuptoaster.c, with this declaration of varlena:

struct
{
    struct varlena hdr;
    char        data[TOAST_MAX_CHUNK_SIZE]; /* make struct big enough */
    int32       align_it;      /* ensure struct is aligned well enough */
} chunk_data;

I'm pretty sure that thing ought to be a union, not a struct.

2) inv_api.c with this thing:

struct
{
    bytea       hdr;
    char        data[LOBLKSIZE];   /* make struct big enough */
    int32       align_it;      /* ensure struct is aligned well enough */
} workbuf;

And probably this too.

3) heapam.c in three places with HeapTupleHeaderData:

struct
{
    HeapTupleHeaderData hdr;
    char        data[MaxHeapTupleSize];
} tbuf;

And this, though I'm not sure if we'd have to change the size of the padding data[] member.

5) reorderbuffer.h with its use of HeapTupleHeaderData:

Hmm. Andres will have to answer for that one ;-) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
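For the tuptoaster.c case, the union form being suggested might look like this (my sketch; note the buffer has to absorb the header bytes itself now, since union members overlay rather than follow each other):

    union
    {
        struct varlena hdr;
        char        data[TOAST_MAX_CHUNK_SIZE + VARHDRSZ]; /* make union big enough */
        int32       align_it;  /* ensure union is aligned well enough */
    } chunk_data;

Because nothing follows the flexible-array struct inside a union, the embedded-in-the-middle complaint goes away, and sizeof() is governed by the plain char array rather than by the varlena header.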
Re: [HACKERS] Odd behavior of updatable security barrier views on foreign tables
On 2015/02/18 21:44, Stephen Frost wrote: * Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote: On 2015/02/18 7:44, Stephen Frost wrote: Attached is a patch which should address this. Would love your (or anyone else's) feedback on it. It appears to address the issue which you raised and the regression test changes are all in-line with inserting a LockRows into the subquery, as anticipated. I've looked into the patch. * The patch applies to the latest head, 'make' passes successfully, but 'make check' fails in the rowsecurity test. Apologies for not being clear- the patch was against 9.4, where it passes all the regression tests (at least for me- if you see differently, please let me know!). Sorry, I assumed that the patch was against HEAD. I confirmed that the back-patched 9.4 version passes all the regression tests! Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow snapshot too old error, to prevent bloat
Stephen Frost sfr...@snowman.net wrote: I also agree with the general idea that it makes sense to provide a way to control bloat, but I think you've missed what Andres was getting at with his suggestion (if I understand correctly, apologies if I don't). The problem is that we're only looking at the overall xmin / xmax horizon when it comes to deciding if a given tuple is dead. That's not quite right- the tuple might actually be dead to all *current* transactions by being newer than the oldest transaction but dead for all later transactions. Basically, there exist gaps between our cluster wide xmin / xmax where we might find actually dead rows. Marking those rows dead and reusable *would* stop the bloat, not just slow it down. In the end, with a single long-running transaction, the worst bloat you would have is double the size of the system at the time the long-running transaction started. I agree that limiting bloat to one dead tuple for every live one for each old snapshot is a limit that has value, and it was unfair of me to characterize that as not being a limit. Sorry for that. This possible solution was discussed with the user whose feedback caused me to write this patch, but there were several reasons they dismissed that as a viable total solution for them, two of which I can share: (1) They have a pool of connections each of which can have several long-running cursors, so the limit from that isn't just doubling the size of their database, it is limiting it to some two-or-three digit multiple of the necessary size. (2) They are already prepared to deal with snapshot too old errors on queries that run more than about 20 minutes and which access tables which are modified. They would rather do that than suffer the bloat beyond that point. IMO all of these changes people are working on are very valuable, and they complement each other. This particular approach is likely to be especially appealing to those moving from Oracle because it is a familiar mechanism, and one which some of them have written their software to be aware of and deal with gracefully. I'm not against having a knob like this, which is defaulted to off, Thanks! I'm not sure that amounts to a +1, but at least it doesn't sound like a -1. :-) but I do think we'd be a lot better off with a system that could realize when rows are not visible to any currently running transaction and clean them up. +1 But they are not mutually exclusive; I see them as complementary. If this knob's default is off then I don't think we'd be likely to get the complaints which are being discussed (or, if we did, we could point the individual at the admin who set the knob...). That's how I see it, too. I would not suggest making the default anything other than off, but there are situations where it would be a nice tool to have in the toolbox. Thanks for the feedback! -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow snapshot too old error, to prevent bloat
On Sun, Feb 15, 2015 at 8:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: There might be something in that, but again it's not much like this patch. The key point I think we're both making is that nondeterministic failures are bad, especially when you're talking about long-running, expensive-to-retry transactions. But I think Kevin would agree with you there. That's why he's interested in having the errors not occur if you don't read from the volatile tables. Ie, your reporting query reading from read-only tables would be guaranteed to succeed even if some other table had had some rows vacuumed away. I'm not sure that's really going to work because of things like hint bit updates or hot pruning. That is, say your table is read-only now but last week some of the records were updated. You might reasonably expect those rows to be safe and indeed the rows themselves will still be in your report. But the earlier versions of the rows could still be sitting on some pages invisible to every query but not vacuumable quite yet and then suddenly they get vacuumed away and the LSN on the page gets bumped. Fwiw I would strongly suggest that instead of having a number of transactions to have a time difference. I haven't been following the commit timestamp thread but I'm hoping that patch has the infrastructure needed to look up an lsn and find out the timestamp and vice versa. So when you take a snapshot you could look up the effective cut-off LSN based on the time difference specified in the GUC. As a DBA I would find it infinitely more comforting specifying old_snapshot_threshold=4h than specifying some arbitrary large number of transactions and hoping I calculated the transaction rate reasonably accurately. -- greg
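In postgresql.conf terms, the time-based knob being suggested would look something like this (a sketch only - the GUC name is taken from Greg's example, and its exact semantics are what this thread is still debating):

    # Allow vacuum to remove dead tuples older than this, at the price of
    # "snapshot too old" errors for older snapshots that subsequently read
    # a modified page.  Hypothetical setting, per the discussion above.
    old_snapshot_threshold = 4h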
Re: [HACKERS] Allow snapshot too old error, to prevent bloat
* Kevin Grittner (kgri...@ymail.com) wrote: Stephen Frost sfr...@snowman.net wrote: I also agree with the general idea that it makes sense to provide a way to control bloat, but I think you've missed what Andres was getting at with his suggestion (if I understand correctly, apologies if I don't). The problem is that we're only looking at the overall xmin / xmax horizon when it comes to deciding if a given tuple is dead. That's not quite right- the tuple might actually be dead to all *current* transactions by being newer than the oldest transaction but dead for all later transactions. Basically, there exist gaps between our cluster wide xmin / xmax where we might find actually dead rows. Marking those rows dead and reusable *would* stop the bloat, not just slow it down. In the end, with a single long-running transaction, the worst bloat you would have is double the size of the system at the time the long-running transaction started. I agree that limiting bloat to one dead tuple for every live one for each old snapshot is a limit that has value, and it was unfair of me to characterize that as not being a limit. Sorry for that. This possible solution was discussed with the user whose feedback caused me to write this patch, but there were several reasons they dismissed that as a viable total solution for them, two of which I can share: (1) They have a pool of connections each of which can have several long-running cursors, so the limit from that isn't just doubling the size of their database, it is limiting it to some two-or-three digit multiple of the necessary size. This strikes me as a bit off-the-cuff; was an analysis done which determined that would be the result? If there is overlap between the long-running cursors then there would be less bloat, and most systems which I'm familiar with don't turn the entire database over in 20 minutes, 20 hours, or even 20 days except in pretty specific cases. Perhaps this is one of those, and if so then I'm all wet, but the feeling I get is that this is a way to dismiss this solution because it's not what's wanted, which is what Oracle did. (2) They are already prepared to deal with snapshot too old errors on queries that run more than about 20 minutes and which access tables which are modified. They would rather do that than suffer the bloat beyond that point. That, really, is the crux here- they've already got support for dealing with it the way Oracle did and they'd like PG to do that too. Unfortunately, that, by itself, isn't a good reason for a particular capability (we certainly aren't going to be trying to duplicate PL/SQL in PG any time soon). That said, there are capabilities in other RDBMS's which are valuable and which we *do* want, so the fact that Oracle does this also isn't a reason to not include it. IMO all of these changes people are working on are very valuable, and complement each other. This particular approach is likely to be especially appealing to those moving from Oracle because it is a familiar mechanism, and one which some of them have written their software to be aware of and deal with gracefully. For my 2c, I'd much rather provide them with a system where they don't have to deal with broken snapshots than give them a way to have them the way Oracle provided them. :) That said, even the approach Andres outlined will cause bloat and it may be beyond what's acceptable in some environments, and it's certainly more complicated and unlikely to get done in the short term. 
I'm not against having a knob like this, which is defaulted to off, Thanks! I'm not sure that amounts to a +1, but at least it doesn't sound like a -1. :-) So, at the time I wrote that, I wasn't sure if it was a +1 or not myself. I've been thinking about it since then, however, and I'm leaning more towards having the capability than not, so perhaps that's a +1, but it doesn't remove the need to come up with an implementation that everyone can be happy with, and what you've come up with so far doesn't have a lot of appeal, based on the feedback (I've only glanced through it myself, but I agree with Andres and Tom that it's a larger footprint than we'd want for this, and the number of places having to be touched is concerning, as it could lead to future bugs). A lot of that would go away if there were a way to avoid having to mess with the index AMs, I'd think, but I wonder if we'd actually need more done there- it's not immediately obvious to me how an index-only scan is safe with this. Whenever an actual page is visited, we can check the LSN, and the relation can't be truncated by vacuum since the transaction will still have a lock on the table which prevents it, but does the visibility-map update check make sure to never mark pages all-visible when one of these old transactions is running around? On what basis? but I do think we'd be a lot better off with a system that could realize when rows