Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Wed, Sep 24, 2014 at 3:18 PM, Dilip kumar wrote:
> On 24 August 2014 11:33, Amit Kapila Wrote
> > > 7. I think in the new mechanism the cancel handler will not work.
> > > In single connection vacuum it was always set/reset
> > > in function executeMaintenanceCommand(). You might need
> > > to set/reset it in function run_parallel_vacuum().
> >
> > Good catch. Now I have called SetCancelConn(pSlot[0].connection) on the first connection. This will enable the cancel handler to cancel the query on the first connection so that the select loop will come out.

I don't think this can handle cancel requests properly, because you are just setting it in GetIdleSlot(). What if the cancel request comes during GetQueryResult(), after the SQL has been sent on all connections? (That is probably the reason why Jeff is not able to cancel vacuumdb when using the parallel option.)

> > > 1. I could see one shortcoming in the way the patch has currently parallelized the
> > > work for --analyze-in-stages. Basically the patch is performing the work for each stage
> > > for multiple tables in concurrent connections. That seems okay for the cases when the
> > > number of parallel connections is less than or equal to the number of tables, but for
> > > the case when the user has asked for more connections than there are tables,
> > > I think this strategy will not be able to use the extra connections.
> >
> > I think --analyze-in-stages should maintain the order.

Yes, you are right. So let's keep the code as it is for this case.

> > > 2. Similarly, for the case of multiple databases, currently it will not be able
> > > to use more connections than the number of tables in each database, because the
> > > parallelizing strategy is to just use the concurrent connections for
> > > tables inside a single database.
> >
> > I think for multiple databases, to do the parallel execution we need to maintain multiple connections to multiple databases, and we need to maintain a table list for all the databases together to run them concurrently. I think this may impact the startup cost, as we need to create multiple connections and disconnect while preparing the list,

Yeah, probably the startup cost will be a bit higher, but that cost we will anyway incur during the overall operation.

> > and i think it will increase the complexity also.

I understand that the complexity of the code might increase, and the strategy to parallelize can also be different in case we want to parallelize for the all-databases case, so let's leave it as it is unless someone else raises a voice for the same.

Today, while again thinking about the strategy used in the patch to parallelize the operation (vacuum database), I think we can improve it for cases when the number of connections is smaller than the number of tables in the database (which I presume will normally be the case). Currently we are sending a command to vacuum one table per connection; how about sending multiple commands (for example, VACUUM t1; VACUUM t2) on one connection? It seems to me there is an extra round trip for cases when there are many small tables in the database and a few large tables. Do you think we should optimize for any such cases?
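To make the round-trip point concrete, here is a minimal libpq sketch of the batching idea (send_vacuum_batch and the table names t1/t2 are illustrative only, not code from the patch); the simple query protocol executes a multi-statement string in one round trip and hands back one PGresult per statement:

#include <stdbool.h>
#include <stdio.h>
#include "libpq-fe.h"

/* Illustrative only: t1/t2 stand in for tables pulled from the work queue. */
static bool
send_vacuum_batch(PGconn *conn)
{
	PGresult   *res;

	/* One network round trip for both commands. */
	if (!PQsendQuery(conn, "VACUUM t1; VACUUM t2;"))
		return false;

	/* Later, once the socket is readable: one result per statement. */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
			fprintf(stderr, "%s", PQerrorMessage(conn));
		PQclear(res);
	}
	return true;
}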
A few other points:

1.
+ vacuum_parallel(const char *dbname, bool full, bool verbose,
{
..
+ connSlot = (ParallelSlot*)pg_malloc(concurrentCons * sizeof(ParallelSlot));
+ connSlot[0].connection = conn;

a. Does the above memory get freed anywhere? If not, isn't it a good idea to do so?
b. For slot 0, you are not setting it with PQsetnonblocking, whereas I think it can be used to run commands like any other connection.

2.
+ /*
+  * If user has given the vacuum of complete db, then if
+  * any of the object vacuum failed it can be ignored and vacuuming
+  * of other object can be continued, this is the same behavior as
+  * vacuuming of complete db is handled without --jobs option
+  */

s/object/object's

3.
+ if(!completedb ||
+	(sqlState && strcmp(sqlState, ERRCODE_UNDEFINED_TABLE) != 0))
+ {
+
+	fprintf(stderr, _("%s: vacuuming of database \"%s\" failed: %s"),
+		progname, dbname, PQerrorMessage (conn));

The indentation in both places is wrong. Check other places for similar issues.

4.
+ bool analyze_only, bool freeze, int numAsyncCons,

In the code there are still references to AsyncCons; as decided, let's change it to concurrent_connections | conc_cons.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] KNN-GiST with recheck
On Fri, Sep 26, 2014 at 5:18 AM, Bruce Momjian wrote:
> On Sun, Sep 14, 2014 at 11:34:26PM +0400, Alexander Korotkov wrote:
> > > Cost estimation of GiST is a big problem anyway. It doesn't care (and
> > > can't) about the amount of recheck for regular operators. In this patch, the same
> > > would be for knn recheck. The problem is that touching heap from access
>
> This is very important work. While our existing KNN-GiST index code
> works fine for scalar values and point-to-point distance ordering, it
> doesn't work well for 2-dimensional objects because they are only
> indexed by their bounding boxes (a rectangle around the object). The
> indexed bounding box can't produce accurate distances to other objects.
>
> As an example, see this PostGIS blog post showing how to use LIMIT in a
> CTE to filter results and then compute the closest object (search for
> "LIMIT 50"):
>
> http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html
>
> This patch fixes our code for distances from a point to indexed 2-D
> objects.
>
> Does this also fix the identical PostGIS problem or is there something
> PostGIS needs to do?

This patch provides general infrastructure for recheck in KNN-GiST. PostGIS needs a corresponding change in its GiST opclass. Since PostGIS already defines the <-> and <#> operators as distance to the bounding box border and the bounding box center, it can't change their behaviour; it has to support a new "exact distance" operator in the opclass.

--
With best regards,
Alexander Korotkov.
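For what it's worth, under this infrastructure the opclass-side work would roughly amount to a distance support function that returns a bounding-box lower bound and sets the new recheck flag. A hedged sketch follows (the argument layout is assumed from the patch, and box_lower_bound_distance is a made-up placeholder for the real bounding-box math):

Datum
gist_geometry_distance(PG_FUNCTION_ARGS)
{
	GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
	Datum		query = PG_GETARG_DATUM(1);
	bool	   *recheck = (bool *) PG_GETARG_POINTER(4);

	/*
	 * Index pages only store bounding boxes, so the best we can return
	 * is a lower bound on the true distance.  Setting *recheck asks the
	 * executor to recompute the exact distance from the heap tuple and
	 * reorder the results accordingly.
	 */
	*recheck = true;

	PG_RETURN_FLOAT8(box_lower_bound_distance(entry, query));
}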
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/25/2014 08:10 PM, Tom Lane wrote:
> I wrote:
>> The "offsets-and-lengths" patch seems like the approach we ought to
>> compare to my patch, but it looks pretty unfinished to me: AFAICS it
>> includes logic to understand offsets sprinkled into a mostly-lengths
>> array, but no logic that would actually *store* any such offsets,
>> which means it's going to act just like my patch for performance
>> purposes.
>
>> In the interests of pushing this forward, I will work today on
>> trying to finish and review Heikki's offsets-and-lengths patch
>> so that we have something we can do performance testing on.
>> I doubt that the performance testing will tell us anything we
>> don't expect, but we should do it anyway.
>
> I've now done that, and attached is what I think would be a committable
> version. Having done this work, I no longer think that this approach
> is significantly messier code-wise than the all-lengths version, and
> it does have the merit of not degrading on very large objects/arrays.
> So at the moment I'm leaning to this solution, not the all-lengths one.
>
> To get a sense of the compression effects of varying the stride distance,
> I repeated the compression measurements I'd done on 14 August with Pavel's
> geometry data (<24077.1408052...@sss.pgh.pa.us>). The upshot of that was
>
>                                          min     max     avg
>
> external text representation             220  172685   880.3
> JSON representation (compressed text)    224   78565   541.3
> pg_column_size, JSONB HEAD repr.         225   82540   639.0
> pg_column_size, all-lengths repr.        225   66794   531.1
>
> Here's what I get with this patch and different stride distances:
>
> JB_OFFSET_STRIDE = 8                     225   68551   559.7
> JB_OFFSET_STRIDE = 16                    225   67601   552.3
> JB_OFFSET_STRIDE = 32                    225   67120   547.4
> JB_OFFSET_STRIDE = 64                    225   66886   546.9
> JB_OFFSET_STRIDE = 128                   225   66879   546.9
> JB_OFFSET_STRIDE = 256                   225   66846   546.8
>
> So at least for that test data, 32 seems like the sweet spot.
> We are giving up a couple percent of space in comparison to the
> all-lengths version, but this is probably an acceptable tradeoff
> for not degrading on very large arrays.
>
> I've not done any speed testing.

I'll do some tomorrow. I should have some different DBs to test on, too.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
I wrote:
> The "offsets-and-lengths" patch seems like the approach we ought to
> compare to my patch, but it looks pretty unfinished to me: AFAICS it
> includes logic to understand offsets sprinkled into a mostly-lengths
> array, but no logic that would actually *store* any such offsets,
> which means it's going to act just like my patch for performance
> purposes.

> In the interests of pushing this forward, I will work today on
> trying to finish and review Heikki's offsets-and-lengths patch
> so that we have something we can do performance testing on.
> I doubt that the performance testing will tell us anything we
> don't expect, but we should do it anyway.

I've now done that, and attached is what I think would be a committable version. Having done this work, I no longer think that this approach is significantly messier code-wise than the all-lengths version, and it does have the merit of not degrading on very large objects/arrays. So at the moment I'm leaning to this solution, not the all-lengths one.

To get a sense of the compression effects of varying the stride distance, I repeated the compression measurements I'd done on 14 August with Pavel's geometry data (<24077.1408052...@sss.pgh.pa.us>). The upshot of that was

                                         min     max     avg

external text representation             220  172685   880.3
JSON representation (compressed text)    224   78565   541.3
pg_column_size, JSONB HEAD repr.         225   82540   639.0
pg_column_size, all-lengths repr.        225   66794   531.1

Here's what I get with this patch and different stride distances:

JB_OFFSET_STRIDE = 8                     225   68551   559.7
JB_OFFSET_STRIDE = 16                    225   67601   552.3
JB_OFFSET_STRIDE = 32                    225   67120   547.4
JB_OFFSET_STRIDE = 64                    225   66886   546.9
JB_OFFSET_STRIDE = 128                   225   66879   546.9
JB_OFFSET_STRIDE = 256                   225   66846   546.8

So at least for that test data, 32 seems like the sweet spot. We are giving up a couple percent of space in comparison to the all-lengths version, but this is probably an acceptable tradeoff for not degrading on very large arrays.

I've not done any speed testing.

			regards, tom lane

diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index 2fd87fc..9beebb3 100644
*** a/src/backend/utils/adt/jsonb.c
--- b/src/backend/utils/adt/jsonb.c
*** jsonb_from_cstring(char *json, int len)
*** 196,207 ****
  static size_t
  checkStringLen(size_t len)
  {
! 	if (len > JENTRY_POSMASK)
  		ereport(ERROR,
  				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
  				 errmsg("string too long to represent as jsonb string"),
  				 errdetail("Due to an implementation restriction, jsonb strings cannot exceed %d bytes.",
! 						   JENTRY_POSMASK)));
  	return len;
  }
--- 196,207 ----
  static size_t
  checkStringLen(size_t len)
  {
! 	if (len > JENTRY_OFFLENMASK)
  		ereport(ERROR,
  				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
  				 errmsg("string too long to represent as jsonb string"),
  				 errdetail("Due to an implementation restriction, jsonb strings cannot exceed %d bytes.",
! 						   JENTRY_OFFLENMASK)));
  	return len;
  }
diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index 04f35bf..f157df3 100644
*** a/src/backend/utils/adt/jsonb_util.c
--- b/src/backend/utils/adt/jsonb_util.c
***
*** 26,40 ****
   * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits
   * reserved for that in the JsonbContainer.header field.
   *
!  * (the total size of an array's elements is also limited by JENTRY_POSMASK,
!  * but we're not concerned about that here)
   */
  #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK))
  #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK))
! static void fillJsonbValue(JEntry *array, int index,
! 			   char *base_addr, JsonbValue *result);
  static bool	equalsJsonbScalarValue(JsonbValue *a, JsonbValue *b);
  static int	compareJsonbScalarValue(JsonbValue *a, JsonbValue *b);
  static Jsonb *convertToJsonb(JsonbValue *val);
  static void convertJsonbValue(StringInfo buffer, JEntry *header,
  				  JsonbValue *val, int level);
--- 26,41 ----
   * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits
   * reserved for that in the JsonbContainer.header field.
   *
!  * (The total size of an array's or object's elements is also limited by
!  * JENTRY_OFFLENMASK, but we're not concerned about that here.)
   */
  #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK))
  #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK))
!
! static void fillJsonbValue(JsonbContainer *container, int index,
! 			   char *base_addr, uint32 offset,
! 			   JsonbValue *result);
  static bool	equalsJsonbScalarValue(JsonbValue *a, JsonbValue *b);
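To spell out the scheme being measured above: every JB_OFFSET_STRIDE'th JEntry stores an offset from the start of the container's data area, while all other entries store lengths, which keeps the entry array compressible; random access then walks back at most a stride's worth of entries. A sketch of the lookup, modeled on the patch (macro names as in the patched jsonb.h; illustrative, not the committed code):

static uint32
get_element_offset(const JEntry *children, int index)
{
	uint32		offset = 0;
	int			i;

	/*
	 * Sum lengths backwards until we hit an entry that stores an offset
	 * rather than a length -- at most JB_OFFSET_STRIDE - 1 iterations.
	 */
	for (i = index - 1; i >= 0; i--)
	{
		offset += JBE_OFFLENFLD(children[i]);
		if (JBE_HAS_OFF(children[i]))
			break;
	}
	return offset;
}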
Re: [HACKERS] Replication identifiers, take 3
Thanks for this write-up.

On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund wrote:
> 1) The ability to monitor how far replication has progressed in a
>    crash-safe manner, to allow it to restart at the right point after
>    errors/crashes.
> 2) Efficiently identify the origin of individual changes and
>    transactions. In multimaster and some cascading scenarios it is
>    necessary to do so to avoid sending out replayed changes again.
> 3) Allow to efficiently filter out replayed changes from logical
>    decoding. It's currently possible to filter changes from inside the
>    output plugin, but it's much more efficient to filter them out
>    before decoding.

I agree with these goals. Let me try to summarize the information requirements for each of these things. For #1, you need to know, after crash recovery, for each standby, the last commit LSN which the client has confirmed via a feedback message. For #2, you need to know, when decoding each change, what the origin node was. And for #3, you need to know, when decoding each change, whether it is of local origin. The information requirements for #3 are a subset of those for #2.

> A rather efficient solution for 1) is to attach the 'origin node' and
> the remote commit LSN to every local commit record produced by
> replay. That allows to have a shared memory "table" (remote_node,
> local_lsn, remote_lsn).

This seems OK to me, modulo some arguing about what the origin node information ought to look like. People who are not using logical replication can use the compact form of the commit record in most cases, and people who are using logical replication can pay for it.
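For concreteness, the shared memory "table" described above could be little more than an array of entries like the following, rebuilt during crash recovery from the (origin, remote commit LSN) payload attached to replayed commit records. This is a hedged sketch; the names are illustrative, not from the patch:

typedef struct ReplicationProgress
{
	uint16		origin_id;		/* short, node-local origin identifier */
	XLogRecPtr	remote_lsn;		/* commit LSN on the origin node */
	XLogRecPtr	local_lsn;		/* matching local commit LSN, so the
								 * entry can be recovered crash-safely */
} ReplicationProgress;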
> Similarly, to solve the problem of identifying the origin of changes
> during decoding, the problem can be solved nicely by adding the origin
> node of every change to changes/transactions. At first it might seem
> to be sufficient to do so on transaction level, but for cascading
> scenarios it's very useful to be able to have changes from different
> source transactions combined into a larger one.

I think this is a lot more problematic. I agree that having the data in the commit record isn't sufficient here, because for filtering purposes (for example) you really want to identify the problematic transactions at the beginning, so you can chuck their WAL, rather than reassembling the transaction first and then throwing it out. But putting the origin ID in every insert/update/delete is pretty unappealing from my point of view - not just because it adds bytes to WAL, though that's a non-trivial concern, but also because it adds complexity - IMHO, a non-trivial amount of complexity. I'd be a lot happier with a solution where, say, we have a separate WAL record that says "XID 1234 will be decoding for origin 567 until further notice".

> == Replication Identifiers ==
>
> The above explains the need to have as small as possible identifiers
> for nodes. Two years back I'd suggested that we rely on the user to
> manually assign 16bit ids to individual nodes. Not very surprisingly
> that was shot down because a) 16bit numbers are not descriptive b) a
> per node identifier is problematic because it prohibits replication
> inside the same cluster.
>
> So, what I've proposed since is to have two different forms of
> identifiers. A long one, that's as descriptive as
> $replication_solution wants. And a small one (16bit in my case) that's
> *only meaningful within one node*. The long, user facing, identifier
> is locally mapped to the short one.
>
> In the first version I proposed these long identifiers had a fixed
> form, including the system identifier, timeline id, database id, and a
> freeform name. That wasn't well received and I agree that that's too
> narrow. I think it should be a freeform text of arbitrary length.
>
> Note that it depends on the replication solution whether these
> external identifiers need to be coordinated across systems or not. I
> think it's *good* if we don't propose a solution for that - different
> replication solutions will have different requirements.

I'm pretty fuzzy on how this actually works. Like, the short form here is just getting injected into WAL by the apply process. How does it figure out what value to inject? What if it injects a value that doesn't have a short-to-long mapping? What's the point of the short-to-long mappings in the first place? Is that only required because of the possibility that there might be multiple replication solutions in play on the same node?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Add CREATE support to event triggers
On Fri, Sep 26, 2014 at 6:59 AM, Alvaro Herrera wrote:
> Actually here's a different split of these patches, which I hope makes
> more sense. My intention here is that patches 0001 to 0004 are simple
> changes that can be pushed right away; they are not directly related to
> the return-creation-command feature. Patches 0005 to 0027 implement
> that feature incrementally. You can see in patch 0005 the DDL commands
> that are still not implemented in deparse (they are the ones that have
> an elog(ERROR) rather than a "command = NULL"). Patch 0006 adds calls
> in ProcessUtilitySlow() to each command, so that the object(s) being
> touched are added to the event trigger command stash.
>
> Patches from 0007 to 0027 (excepting patch 0017) implement one or a
> small number of commands in deparse. Patch 0017 is necessary
> infrastructure in ALTER TABLE to support deparsing that one.
>
> My intention with the later patches is that they would all be pushed as
> a single commit, i.e. the deparse support would be implemented for all
> commands in one fell swoop rather than piecemeal -- except possibly patch
> 0017 (the ALTER TABLE infrastructure). I split them up only for ease of
> review. Of course, before pushing we (I) need to implement deparsing
> for all the remaining commands.

Thanks for the update. This stuff is still on my TODO list, and I was planning to look at it to some extent today, btw :) Andres has already sent some comments (20140925235724.gh16...@awork2.anarazel.de), so I'll try not to overlap with what has already been mentioned.

Patch 1:
I still like this patch, as it gives a clear separation of the built-in functions and the sub-functions of ruleutils.c that are completely independent. Have you considered adding the external declaration of quote_all_identifiers as well? It is true that this impacts extensions (some of my stuff as well), but my point is to bite the bullet and make the separation cleaner between builtins.h and ruleutils.h. Attached is a patch that can be applied on top of patch 1 doing so... Feel free to discard it for the potential breakage it would create, though.

Patch 2:
1) The declarations of renameatt are inconsistent:
@tablecmds.c:
-renameatt(RenameStmt *stmt)
+renameatt(RenameStmt *stmt, int32 *objsubid)
@tablecmds.h:
-extern Oid renameatt(RenameStmt *stmt);
+extern Oid renameatt(RenameStmt *stmt, int *attnum);
Looking at this code, for correctness renameatt should use "int *attnum" everywhere, and ExecRenameStmt should use "int32 *subobjid".
2) Just a smallish, picky formatting remark: I would have done this on a single line:
+ attnum =
+ renameatt_internal(relid,
3) In ExecRenameStmt, I think that you should set subobjid for aggregates, collations, fdws, etc. There is an access to ObjectAddress, so that's easy to set using address.objectSubId.
4) This comment is misleading, as for an attribute what is returned is actually an attribute number:
+ * Return value is the OID of the renamed object. The objectSubId, if any,
+ * is returned in objsubid.
   */
5) Perhaps it is worth mentioning at the top of renameatt that the attribute number is returned as well?
Patch 3:
Looks fine; a bit of testing shows that this works as expected:

=# CREATE ROLE foo;
CREATE ROLE
=# CREATE TABLE aa (a int);
CREATE TABLE
=# CREATE OR REPLACE FUNCTION abort_grant() RETURNS event_trigger AS $$
BEGIN
  RAISE NOTICE 'Event: %', tg_event;
  RAISE EXCEPTION 'Execution of command % forbidden', tg_tag;
END; $$ LANGUAGE plpgsql;
CREATE FUNCTION
=# CREATE EVENT TRIGGER abort_grant_trigger ON ddl_command_start WHEN TAG IN ('GRANT', 'REVOKE') EXECUTE PROCEDURE abort_grant();
CREATE EVENT TRIGGER
=# GRANT SELECT ON aa TO foo;
NOTICE: 0: Event: ddl_command_start
LOCATION: exec_stmt_raise, pl_exec.c:3068
ERROR: P0001: Execution of command GRANT forbidden
LOCATION: exec_stmt_raise, pl_exec.c:3068
=# DROP EVENT TRIGGER abort_grant_trigger;
DROP EVENT TRIGGER
=# CREATE EVENT TRIGGER abort_grant_trigger ON ddl_command_end WHEN TAG IN ('GRANT', 'REVOKE') EXECUTE PROCEDURE abort_grant();
CREATE EVENT TRIGGER
=# REVOKE SELECT ON aa FROM foo;
NOTICE: 0: Event: ddl_command_end
LOCATION: exec_stmt_raise, pl_exec.c:3068
ERROR: P0001: Execution of command REVOKE forbidden
LOCATION: exec_stmt_raise, pl_exec.c:3068

Patch 4:
Shouldn't int32 be used instead of uint32 in the declaration of CommentObject? And yes, adding support for only ddl_command_start and ddl_command_end is enough. Let's not play with dropped objects in this area... Similarly to the test above:

=# comment on table aa is 'test';
NOTICE: 0: Event: ddl_command_start
LOCATION: exec_stmt_raise, pl_exec.c:3068
ERROR: P0001: Execution of command COMMENT forbidden
LOCATION: exec_stmt_raise, pl_exec.c:3068

Regards,
--
Michael

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index d77b3ee..fc8dc80 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -57,6 +57,7 @@
 #include "utils/l
Re: [HACKERS] KNN-GiST with recheck
On Sun, Sep 14, 2014 at 11:34:26PM +0400, Alexander Korotkov wrote:
> > Cost estimation of GiST is a big problem anyway. It doesn't care (and
> > can't) about the amount of recheck for regular operators. In this patch, the same
> > would be for knn recheck. The problem is that touching heap from access

This is very important work. While our existing KNN-GiST index code works fine for scalar values and point-to-point distance ordering, it doesn't work well for 2-dimensional objects because they are only indexed by their bounding boxes (a rectangle around the object). The indexed bounding box can't produce accurate distances to other objects.

As an example, see this PostGIS blog post showing how to use LIMIT in a CTE to filter results and then compute the closest object (search for "LIMIT 50"):

http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html

This patch fixes our code for distances from a point to indexed 2-D objects.

Does this also fix the identical PostGIS problem or is there something PostGIS needs to do?

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ Everyone has their own god. +
Re: [HACKERS] better atomics - v0.6
On 2014-09-25 21:02:28 -0400, Robert Haas wrote:
> I feel like this could really use a developer README: what primitives
> do we have, which ones are likely to be efficient or inefficient on
> which platforms, how do atomics interact with barriers, etc.

atomics.h has most of that, including documentation of which barrier semantics the individual operations have. One change I can see as being a good idea is to move/transform the definition of acquire/release barriers from s_lock.h to README.barrier.

It doesn't document which operations are efficient on which platform, but I wouldn't know what to say about that topic - the current set of operations should be fast on pretty much all platforms.

What information would you like to see? An example of how to use atomics?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
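Since "an example of how to use atomics" seems to be exactly the open question, here is a minimal usage sketch against the API in the patch (names as in atomics.h; the surrounding function is invented purely for illustration):

#include "port/atomics.h"

static pg_atomic_uint32 counter;

void
atomics_usage_example(void)
{
	uint32		expected;

	pg_atomic_init_u32(&counter, 0);

	/* Atomic read-modify-write; acts as a full memory barrier. */
	pg_atomic_fetch_add_u32(&counter, 1);

	/*
	 * Compare-and-swap loop: on failure, 'expected' is overwritten with
	 * the value actually observed, so we simply retry with it.
	 */
	expected = pg_atomic_read_u32(&counter);
	while (!pg_atomic_compare_exchange_u32(&counter, &expected, expected * 2))
		;
}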
Re: [HACKERS] better atomics - v0.6
On Thu, Sep 25, 2014 at 6:03 PM, Andres Freund wrote:
> On 2014-09-24 20:27:39 +0200, Andres Freund wrote:
>> On 2014-09-24 21:19:06 +0300, Heikki Linnakangas wrote:
>> I won't repost a version with it removed, as removing a function as the
>> only change doesn't seem to warrant reposting it.
>
> I've fixed (thanks Alvaro!) some minor additional issues besides the
> removal and addressing earlier comments from Heikki:
> * removal of two remaining non-ascii copyright signs. The only non-ascii
>   name character that remains is Alvaro's name in the commit message.
> * additional comments for STATIC_IF_INLINE/STATIC_IF_INLINE_DECLARE
> * there was a leftover HAVE_GCC_INT_ATOMICS reference - it's now split
>   into different configure tests and thus has a different name. A later
>   patch entirely removes that reference, which is why I'd missed that...
>
> Heikki has marked the patch as 'ready for committer' in the commitfest
> (conditional on his remarks being addressed) and I agree. There seems to
> be little benefit in waiting further. There *definitely* will be some
> platform-dependent issues, but that won't change by waiting longer.
>
> I plan to commit this quite soon unless somebody protests really
> quickly.
>
> Sorting out the issues on platforms I don't have access to based on
> buildfarm feedback will take a while given how infrequently some of the
> respective animals run... But that's just life.

I feel like this could really use a developer README: what primitives do we have, which ones are likely to be efficient or inefficient on which platforms, how do atomics interact with barriers, etc.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] delta relations in AFTER triggers
On 09/25/2014 11:54 PM, Kevin Grittner wrote:
> I have fixed the bug reported by Heikki; be sure to grab that.

Will do.

> I have been merging in changes to master as I go, so that bit rot
> doesn't accumulate, but I don't squash or rebase; hopefully that
> style works for you.

IMO it only really matters before the final push to master; before then it's all just a matter of how you prefer to work.

I'm a big fan of rebasing my feature branches as I go:

    git tag before-rebase
    git pull --rebase
    ... do any merges during rebase ...
    git tag -d before-rebase

For bug fixes I tend to commit them separately, then when I rebase I squash them into the relevant patch. Git's "fixup!" commits are really handy for this; if I have a commit:

    Add widget support

    Everyone wants more widgets.

and want to fix an issue in that commit, I can just commit

    fixup! Add widget support

    It's spelled widget not wodget

and when I "git rebase --autosquash master" they get automatically squashed into the relevant changeset. (I usually run with the config rebase.autosquash enabled, so this happens during my rebase pulls on top of master.)

I got into the habit while working on RLS, to keep me sane with that patchset, and find it works well for me. However, everyone has a different work style. Colour me happy if it's in git at all.

--
Craig Ringer                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Add CREATE support to event triggers
On 2014-09-25 18:59:31 -0300, Alvaro Herrera wrote:
> /* tid.c */
> diff --git a/src/include/utils/ruleutils.h b/src/include/utils/ruleutils.h
> new file mode 100644
> index 000..520b066
> --- /dev/null
> +++ b/src/include/utils/ruleutils.h
> @@ -0,0 +1,34 @@
> +/*-
> + *
> + * ruleutils.h
> + *	  Declarations for ruleutils.c
> + *
> + * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + * src/include/ruleutils.h
> + *
> + *-
> + */
> +#ifndef RULEUTILS_H
> +#define RULEUTILS_H
> +
> +#include "nodes/nodes.h"
> +#include "nodes/parsenodes.h"
> +#include "nodes/pg_list.h"
> +
> +
> +extern char *pg_get_indexdef_string(Oid indexrelid);
> +extern char *pg_get_indexdef_columns(Oid indexrelid, bool pretty);
> +
> +extern char *pg_get_constraintdef_string(Oid constraintId);
> +extern char *deparse_expression(Node *expr, List *dpcontext,
> +				bool forceprefix, bool showimplicit);
> +extern List *deparse_context_for(const char *aliasname, Oid relid);
> +extern List *deparse_context_for_planstate(Node *planstate, List *ancestors,
> +				List *rtable, List *rtable_names);
> +extern List *select_rtable_names_for_explain(List *rtable,
> +				Bitmapset *rels_used);
> +extern char *generate_collation_name(Oid collid);
> +
> +#endif	/* RULEUTILS_H */
> --
> 1.9.1

I wondered for a minute whether any of these are likely to cause problems for code just including builtins.h - but I think there will be sufficiently few callers for them to make that not much of a concern.

> From 5e3555058a1bbb93e25a3a104c26e6b96dce994d Mon Sep 17 00:00:00 2001
> From: Alvaro Herrera
> Date: Thu, 25 Sep 2014 16:34:50 -0300
> Subject: [PATCH 02/27] deparse/core: have RENAME return attribute number

Looks "harmless". I.e. +1.

> From 72c4d0ef875f9e9b0f3b424499738fd1bd946c32 Mon Sep 17 00:00:00 2001
> From: Alvaro Herrera
> Date: Fri, 9 May 2014 18:32:23 -0400
> Subject: [PATCH 03/27] deparse/core: event triggers support GRANT/REVOKE

Hm. The docs don't mention whether these only work for database-local objects or also shared objects. Generally event triggers are only triggered for the former... But the code doesn't seem to make a difference?

> From 5bb35d2e51fe6c6e219fe5cf7ddbab5423e6be1b Mon Sep 17 00:00:00 2001
> From: Alvaro Herrera
> Date: Thu, 25 Sep 2014 15:44:44 -0300
> Subject: [PATCH 04/27] deparse/core: event triggers support COMMENT
>
> ---
>  doc/src/sgml/event-trigger.sgml      |  8 +++-
>  src/backend/commands/comment.c       |  5 -
>  src/backend/commands/event_trigger.c |  1 +
>  src/backend/tcop/utility.c           | 21 +
>  src/include/commands/comment.h       |  2 +-
>  5 files changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/doc/src/sgml/event-trigger.sgml b/doc/src/sgml/event-trigger.sgml
> index 49b8384..39ecd94 100644
> --- a/doc/src/sgml/event-trigger.sgml
> +++ b/doc/src/sgml/event-trigger.sgml
> @@ -37,7 +37,7 @@
>
>     The ddl_command_start event occurs just before the
>     execution of a CREATE, ALTER, DROP,
> -   GRANT or REVOKE
> +   COMMENT, GRANT or REVOKE
>     command. No check whether the affected object exists or doesn't exist is
>     performed before the event trigger fires.
>     As an exception, however, this event does not occur for
> @@ -281,6 +281,12 @@
>     -
>
>
> +    COMMENT
> +    X
> +    X
> +    -
> +
> +

Hm. No drop support, because COMMENTs are odd and use odd NULL semantics. Fair enough.
> From 098f5acabd774004dc5d9c750d55e7c9afa60238 Mon Sep 17 00:00:00 2001
> From: Alvaro Herrera
> Date: Wed, 24 Sep 2014 15:53:04 -0300
> Subject: [PATCH 05/27] deparse: infrastructure needed for command deparsing
> ---
>  src/backend/catalog/objectaddress.c  | 115 +
>  src/backend/commands/event_trigger.c | 941 ++-
>  src/backend/tcop/Makefile            |   2 +-
>  src/backend/tcop/deparse_utility.c   | 877 
>  src/backend/tcop/utility.c           |   2 +
>  src/backend/utils/adt/format_type.c  | 113 -
>  src/include/catalog/objectaddress.h  |   2 +
>  src/include/catalog/pg_proc.h        |   4 +
>  src/include/commands/event_trigger.h |   3 +
>  src/include/commands/extension.h     |   2 +-
>  src/include/nodes/parsenodes.h       |   2 +
>  src/include/tcop/deparse_utility.h   |  60 +++
>  src/include/utils/builtins.h         |   5 +
>  13 files changed, 2117 insertions(+), 11 deletions(-)
>  create mode 100644 src/backend/tcop/deparse_utility.c
>  create mode 100644 src/include/tcop/deparse_utility.h
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On 26/09/14 08:21, Simon Riggs wrote:
> On 25 September 2014 20:11, Robert Haas wrote:
> > > My approach would be to insert an index tuple for that value into the index, but with the leaf ituple marked with an xid rather than a ctid. If someone tries to insert into the index they would see this and wait for the inserting transaction to end. The inserting transaction would then resolve what happens in the heap (insert/update) and later repoint the index tuple to the inserted/updated row version. I don't see the need for page level locking since it would definitely result in deadlocks (e.g. SQLServer).
> >
> > I think that something like this might work, but the devil is in the details. Suppose two people try to upsert into the same table at the same time. There's one index. If the transactions search that index for conflicts first, neither sees any conflicting tuples, and both proceed. That's no good. OK, so suppose each transaction inserts the special index tuple which you mention, to lock out concurrent inserts of that value, and then searches for already-existing conflicts. Each sees the other's tuple, and they deadlock. That's no good, either.
>
> The test index is unique, so our to-be-inserted value exists on only one page, hence page locking applies while we insert it. The next person to insert waits for the page lock and then sees the test tuple. The page lock lasts only for the duration of the insertion of the ituple, not for the whole operation.
>
> > Also, I think there are other cases where we think we're going to insert, so we put the special index tuple in there, but then we decide to update, so we don't need the promise tuple any more, but other sessions are potentially still waiting for our XID to terminate even though there's no conflict any more. I'm having a hard time bringing the details of those cases to mind ATM, though.
>
> We make the decision to INSERT or UPDATE based upon what we find in the test index. If a value is there already, we assume it's an UPDATE and go to update the row this points to. If it has been deleted we loop back and try again/error. If the value is not present, we insert the test tuple and progress as an INSERT, then loop back later to set the ctid. There is no case of "don't need promise id anymore". We would use the PK, identity or first unique index as the test index. There is a case where an UPSERT conflicts with an INSERT, causing the latter to abort.
>
> Anyway, this is why we need the design more clearly exposed, so you can tell me I'm wrong by showing me the URL of it done right.

What happens if the new value(s) of the INSERT/UPDATE require the page to be split? I assume the mechanics of this are catered for, but how does it affect locking & potential deadlocks?

Cheers,
Gavin
Re: [HACKERS] better atomics - v0.6
On Thu, Sep 25, 2014 at 3:23 PM, Stephen Frost wrote:
> * Andres Freund (and...@anarazel.de) wrote:
>> Heikki has marked the patch as 'ready for committer' in the commitfest
>> (conditional on his remarks being addressed) and I agree. There seems to
>> be little benefit in waiting further. There *definitely* will be some
>> platform-dependent issues, but that won't change by waiting longer.
>
> Agreed. I won't claim to have done as good a review as others, but I've
> been following along as best I've been able to.

FWIW, I agree that you've already done plenty to ensure that this works well on common platforms. We cannot reasonably expect you to be just as sure that there are no problems on obscure platforms ahead of commit. It isn't unheard of for certain committers (that don't like Windows, say) to commit things despite having a suspicion that their patch will turn some fraction of the buildfarm red. To a certain extent, that's what we have it for.

--
Peter Geoghegan
Re: [HACKERS] better atomics - v0.6
Andres,

* Andres Freund (and...@anarazel.de) wrote:
> Heikki has marked the patch as 'ready for committer' in the commitfest
> (conditional on his remarks being addressed) and I agree. There seems to
> be little benefit in waiting further. There *definitely* will be some
> platform-dependent issues, but that won't change by waiting longer.

Agreed. I won't claim to have done as good a review as others, but I've been following along as best I've been able to.

> I plan to commit this quite soon unless somebody protests really
> quickly.

+1 from me for at least moving forward to see what happens.

> Sorting out the issues on platforms I don't have access to based on
> buildfarm feedback will take a while given how infrequently some of the
> respective animals run... But that's just life.

The buildfarm is another tool (as well as Coverity) that provides a lot of insight into issues we may not realize ahead of time, but using them (currently) means we have to eventually go beyond looking at the code individually and actually commit it.

I'm very excited to see this progress, and trust you, Heikki, Robert, and the others involved in this work to have done an excellent job and to be ready and prepared to address any issues which come from it, or to pull it out if it turns out to be necessary (which I seriously doubt, but still).

Thanks!

Stephen
Re: [HACKERS] RLS feature has been committed
On Thu, Sep 25, 2014 at 3:07 PM, Stephen Frost wrote:
> A revert wasn't requested by the individual who raised the concern (or,
> indeed, explicitly by *anyone*.. it was hinted at, but I felt the
> individuals who were hinting at it were leaving it up to that individual
> who had a concern), and I discussed it with him extensively and we came
> to what I believe is an understanding. I'm not sure that I understand
> the need to bring it up again, unless you have a concern regarding what
> was committed.

FWIW, I think Heikki's proposal to revisit this in the next commitfest is fine. I don't think that this is something that needs to be pored over.

--
Peter Geoghegan
Re: [HACKERS] RLS feature has been committed
Simon,

* Simon Riggs (si...@2ndquadrant.com) wrote:
> On 23 September 2014 07:45, Heikki Linnakangas wrote:
> > OTOH, if the patch is actually OK as it was committed, there's no point
> > reverting it only to commit it again later. At the end of the day, the
> > important thing is that the patch gets sufficient review. Clearly Stephen
> > thinks that it did, while you and Andres do not.
>
> I would observe that not requesting a revert would be inconsistent
> with all other situations I have seen or been involved with.

A revert wasn't requested by the individual who raised the concern (or, indeed, explicitly by *anyone*.. it was hinted at, but I felt the individuals who were hinting at it were leaving it up to that individual who had a concern), and I discussed it with him extensively and we came to what I believe is an understanding. I'm not sure that I understand the need to bring it up again, unless you have a concern regarding what was committed. I agree that I jumped the gun on it and said as much. On the flip side, it's now had quite a bit of review and there is a promise of more, which I feel is great for the project as a whole.

> My major reason to revert is the following: the documentation contains
> no examples of real world usage, making the feature essentially
> unusable out of the box.

I find this to be an interesting argument, considering most of our documentation doesn't include real-world examples. I'm happy to add more examples than those listed thus far, certainly, though I've understood documentation to be of less general importance than code quality and maintainability - and someone willing and committed to maintain it.

> I attended Stephen's talk at PostgresOpen and
> even that didn't contain real cases either, leaving me to ask
> questions about stuff I thought I knew. This would be sufficient for
> me to reject commit of any other patch, so it is sufficient reason
> here also. Yeb can supply a useful real world case if there is some
> restriction on explaining what this might be used for.

This wouldn't be the only case of documentation (indeed, *any* documentation) being added after a commit, and so I'm mystified by this requirement for *real-world* examples in documentation to be provided prior to commit.

> Stephen, I want this patch in 9.5 and I would very much like to see it
> go in before Oct 31. But please follow consensus, revert the patch
> now, address the minor issues as requested and then ask for re-commit
> later, as you should have done in the first place.

Simon, if you want me to revert it because of an objection over the design, code quality, maintainability, or utter lack of documentation, then I absolutely respect that request and would be happy to do so. This request I am completely lost on.

Thanks,

Stephen
Re: [HACKERS] better atomics - v0.6
On 2014-09-24 20:27:39 +0200, Andres Freund wrote:
> On 2014-09-24 21:19:06 +0300, Heikki Linnakangas wrote:
> I won't repost a version with it removed, as removing a function as the
> only change doesn't seem to warrant reposting it.

I've fixed (thanks Alvaro!) some minor additional issues besides the removal and addressing earlier comments from Heikki:
* removal of two remaining non-ascii copyright signs. The only non-ascii
  name character that remains is Alvaro's name in the commit message.
* additional comments for STATIC_IF_INLINE/STATIC_IF_INLINE_DECLARE
* there was a leftover HAVE_GCC_INT_ATOMICS reference - it's now split
  into different configure tests and thus has a different name. A later
  patch entirely removes that reference, which is why I'd missed that...

Heikki has marked the patch as 'ready for committer' in the commitfest (conditional on his remarks being addressed) and I agree. There seems to be little benefit in waiting further. There *definitely* will be some platform-dependent issues, but that won't change by waiting longer.

I plan to commit this quite soon unless somebody protests really quickly.

Sorting out the issues on platforms I don't have access to based on buildfarm feedback will take a while given how infrequently some of the respective animals run... But that's just life.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

>From b64d92f1a5602c55ee8b27a7ac474f03b7aee340 Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Thu, 25 Sep 2014 23:49:05 +0200
Subject: [PATCH] Add a basic atomic ops API abstracting away
 platform/architecture details.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Several upcoming performance/scalability improvements require atomic operations. This new API avoids the need to splatter compiler- and architecture-dependent code over all the locations employing atomic ops.

For several of the potential usages it'd be problematic to maintain both an atomics-using implementation and one using spinlocks or similar. In all likelihood one of the implementations would not get tested regularly under concurrency. To avoid that scenario the new API provides an automatic fallback of atomic operations to spinlocks. All properties of atomic operations are maintained. This fallback - obviously - isn't as fast as just using atomic ops, but it's not bad either. For one of the future users the atomics-on-top-of-spinlocks implementation was actually slightly faster than the old purely spinlock-using implementation. That's important because it reduces the fear of regressing older platforms when improving the scalability for new ones.

The API, loosely modeled after the C11 atomics support, currently provides 'atomic flags' and 32 bit unsigned integers. If the platform efficiently supports atomic 64 bit unsigned integers those are also provided.

To implement atomics support for a platform/architecture/compiler for a type of atomics, 32bit compare and exchange needs to be implemented. If available and more efficient, native support for flags, 32 bit atomic addition, and corresponding 64 bit operations may also be provided. Additional useful atomic operations are implemented generically on top of these.

The implementations for various versions of gcc, msvc and sun studio have been tested. Additional existing stub implementations for
* Intel icc
* HUPX acc
* IBM xlc
are included but have never been tested.
These will likely require fixes based on buildfarm and user feedback.

As atomic operations also require barriers for some operations, the existing barrier support has been moved into the atomics code.

Author: Andres Freund with contributions from Oskari Saarenmaa
Reviewed-By: Amit Kapila, Robert Haas, Heikki Linnakangas and Álvaro Herrera
Discussion: CA+TgmoYBW+ux5-8Ja=Mcyuy8=vxanvrhp3kess6pn3dmxap...@mail.gmail.com,
    20131015123303.gh5...@awork2.anarazel.de,
    20131028205522.gi20...@awork2.anarazel.de
---
 config/c-compiler.m4                 |  98 +
 configure                            | 274 ++--
 configure.in                         |  34 +-
 src/backend/main/main.c              |   1 +
 src/backend/port/Makefile            |   2 +-
 src/backend/port/atomics.c           | 127 ++
 src/backend/storage/lmgr/spin.c      |   7 +-
 src/include/c.h                      |  21 +-
 src/include/pg_config.h.in           |  24 +-
 src/include/pg_config.h.win32        |   3 +
 src/include/pg_config_manual.h       |   8 +
 src/include/port/atomics.h           | 531 +++
 src/include/port/atomics/arch-arm.h  |  25 ++
 src/include/port/atomics/arch-hppa.h |  17 +
 src/include/port/atomics/arch-ia64.h |  26 ++
 src/include/port/atomics/ar
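To picture the spinlock fallback the commit message describes: each atomic variable carries its own spinlock, and every operation (compare-exchange shown here) takes it briefly, preserving both atomicity and the barrier semantics. This is a sketch modeled on the description above; the real code lives in src/backend/port/atomics.c and differs in detail:

/* Fallback representation: the atomic variable carries its own spinlock. */
typedef struct pg_atomic_uint32
{
	slock_t		sema;
	volatile uint32 value;
} pg_atomic_uint32;

bool
pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
									uint32 *expected, uint32 newval)
{
	bool		ret;

	/* The spinlock provides both the atomicity and the barrier. */
	SpinLockAcquire((slock_t *) &ptr->sema);
	if (ptr->value == *expected)
	{
		ptr->value = newval;
		ret = true;
	}
	else
	{
		*expected = ptr->value;
		ret = false;
	}
	SpinLockRelease((slock_t *) &ptr->sema);
	return ret;
}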
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Thu, Sep 25, 2014 at 2:37 PM, Simon Riggs wrote:
>> I'd hoped that the commit messages, and my discussion of the feature
>> were adequate.
>
> I'd hoped that my discussion was sufficient to persuade you too, but it
> wasn't.

I'll write user-visible docs soon, then.

--
Peter Geoghegan
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On 25 September 2014 22:13, Peter Geoghegan wrote:
>> All of the above, as a separate committable patch. I hate the fact
>> that you have written no user facing documentation for this feature.
>> How can anyone tell whether the tests you've written are correct or
>> even consistent with a particular definition of correctness?
>
> I'd hoped that the commit messages, and my discussion of the feature
> were adequate.

I'd hoped that my discussion was sufficient to persuade you too, but it wasn't.

--
Simon Riggs                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] RLS feature has been committed
On 23 September 2014 07:45, Heikki Linnakangas wrote:
> On 09/23/2014 09:15 AM, Peter Geoghegan wrote:
>> On Mon, Sep 22, 2014 at 8:25 PM, Robert Haas wrote:
>>>> Should RLS be reverted, and revisited in a future CF?
>>>
>>> IMHO, that would be entirely appropriate.
>>
>> That seems pretty straightforward, then. I think that it will have to
>> be reverted.
>
> OTOH, if the patch is actually OK as it was committed, there's no point
> reverting it only to commit it again later. At the end of the day, the
> important thing is that the patch gets sufficient review. Clearly Stephen
> thinks that it did, while you and Andres do not.

I would observe that not requesting a revert would be inconsistent with all other situations I have seen or been involved with. Stephen has acted against explicit requests not to commit. I think Robert should apply more care about issuing lock instructions, since the concurrency of our community is important, but nonetheless, an explicit request was made and that should be honored, even if it does very quickly get re-committed.

My major reason to revert is the following: the documentation contains no examples of real-world usage, making the feature essentially unusable out of the box. I attended Stephen's talk at PostgresOpen and even that didn't contain real cases either, leaving me to ask questions about stuff I thought I knew. This would be sufficient for me to reject commit of any other patch, so it is sufficient reason here also. Yeb can supply a useful real-world case if there is some restriction on explaining what this might be used for.

Stephen, I want this patch in 9.5 and I would very much like to see it go in before Oct 31. But please follow consensus, revert the patch now, address the minor issues as requested and then ask for re-commit later, as you should have done in the first place.

--
Simon Riggs                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Thu, Sep 25, 2014 at 1:48 PM, Simon Riggs wrote:
> A. UPDATE/INSERT privilege infrastructure.
> Add tests to it, make it separately committable, so we can get that done.
> Submit to Oct CF; get that done early.

Makes sense. As long as we assume that we want a unified syntax like this - that is, that we need something vaguely insert-update or update-insertish - then we need this. Unfortunately, we cannot add regression tests for this without almost the full patch set.

> B. Agree command semantics by producing these things
> * Explanatory documentation (Ch 6.4 Data Manipulation - Upsert)
> * SQL Reference Documentation (INSERT)
> * Test cases for feature
> * Test cases for concurrency
> * Test cases for pgbench

Okay. I do have stress-tests, which are separately maintained, in case you missed that:

https://github.com/petergeoghegan/upsert

> All of the above, as a separate committable patch. I hate the fact
> that you have written no user facing documentation for this feature.
> How can anyone tell whether the tests you've written are correct or
> even consistent with a particular definition of correctness?

I'd hoped that the commit messages and my discussion of the feature were adequate.

> Submit as patch for review only to Oct 15 CF
> We then agree what is required for further work
> At this stage, poll the Django and Rails communities for acceptance
> and early warning of these features. Listen.

I know an original founder of the Django project quite well - Jacob Kaplan-Moss (a co-worker - the guy that keynoted pgOpen in its second year). He is very interested in this effort.

> C. Internal weirdness
> Submit C based upon earlier agreed B, submit to Dec 15 CF, major patch
> deadline, so we can fine tune for last CF.
> Then Heikki rewrites half your patch in a better way, you thank him
> and then we commit. All done.

I don't have a problem with Heikki or anyone else rewriting the value locking part of the patch, provided it meets my requirements for such a mechanism. Since Heikki already agreed that that standard should be imposed, he'd hardly take issue with it now.

However, the fact is that once you actually make something like promise tuples meet that standard, at the very least it becomes a lot messier than you'd think. Heikki's final prototype "super deleted" tuples by setting their xmin to InvalidTransactionId. We weren't sure that that doesn't break some random other heapam code. Consider this, for example:

https://github.com/postgres/postgres/blob/REL9_4_STABLE/src/backend/executor/execMain.c#L1961

So that looks safe in the face of setting xmin to InvalidTransactionId in the way the later prototype patch did, if you think about it for a while, but there are other places where that is less clear. In short, it becomes something that we have to worry about forever, because "xmin cannot change without the tuple in the slot changing" is clearly an invariant for certain purposes. It might accidentally fail to fail right now, but I'm not comfortable with it.

Now, I might be convinced that that's actually the way to go. I have an open mind. But that will take discussion. I like that page hwlocking is something that many systems do (even including Oracle, I believe). Making big changes to nbtree is always something that deserves to be met with skepticism, but it is nice to have an implementation that lives in the head of the AM. Sorry, I forgot to not talk about locking.

> But we're still discussing SQL semantics.
> So first things first, then loop back around, hoping our design has
> not been concurrently deleted...

I hope the discussion can avoid "unprincipled deadlocks".

--
Peter Geoghegan
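For concreteness, the "super deletion" in Heikki's later prototype boiled down to something like the following sketch (not committed code), which is exactly where the invariant worry above comes from - code holding a tuple in a slot may assume its xmin cannot change underneath it:

/*
 * Retract a promise tuple in place by clearing its xmin, so that
 * concurrent snapshots treat it as never having been inserted.
 * (Sketch of the prototype's approach only.)
 */
HeapTupleHeaderSetXmin(htup->t_data, InvalidTransactionId);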
Re: [HACKERS] END_OF_RECOVERY shutdowns and ResetUnloggedRelations()
On 2014-09-26 02:34:06 +0530, Abhijit Menon-Sen wrote:
> At 2014-09-25 22:41:18 +0200, and...@2ndquadrant.com wrote:
> > Unless I miss something this isn't sufficient. We need to fsync the
> > files in the data directory, not just the toplevel directory?
>
> No, of course you're right. So a separate function that does the moral
> equivalent of "find $PGDATA -exec fsync_fname …"?

Probably will require some care to deal correctly with tablespaces.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] END_OF_RECOVERY shutdowns and ResetUnloggedRelations()
At 2014-09-25 22:41:18 +0200, and...@2ndquadrant.com wrote:
> * Also recovery shouldn't be regarded successful if the reset fails -
> * e.g. because of ENOSPC.

OK.

> * Doing this in a separate pass is advantageous for performance reasons
> * because it allows the kernel to perform all the flushes at once.

OK.

> Unless I miss something this isn't sufficient. We need to fsync the
> files in the data directory, not just the toplevel directory?

No, of course you're right. So a separate function that does the moral equivalent of "find $PGDATA -exec fsync_fname …"?

Will resubmit with the additional comments.

-- Abhijit
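Something like the following would be the C equivalent of that find invocation - a minimal sketch assuming POSIX dirent, PostgreSQL's MAXPGPATH, and the existing fsync_fname(name, isdir); as Andres notes, real code must also chase the tablespace symlinks under pg_tblspc and decide on error handling:

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static void
fsync_recursively(const char *path)
{
	DIR		   *dir = opendir(path);
	struct dirent *de;

	if (dir == NULL)
		return;					/* sketch only; real code must report this */

	while ((de = readdir(dir)) != NULL)
	{
		char		subpath[MAXPGPATH];
		struct stat st;

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(subpath, sizeof(subpath), "%s/%s", path, de->d_name);
		if (lstat(subpath, &st) < 0)
			continue;
		if (S_ISDIR(st.st_mode))
			fsync_recursively(subpath);		/* recurse into subdirectories */
		else if (S_ISREG(st.st_mode))
			fsync_fname(subpath, false);	/* fsync each regular file */
	}
	closedir(dir);
	fsync_fname(path, true);	/* finally, fsync the directory itself */
}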
Re: [HACKERS] Immediate standby promotion
On 25 September 2014 18:30, Andres Freund wrote:
> On 2014-09-25 18:18:09 +0100, Simon Riggs wrote:
>> On 25 September 2014 16:29, Andres Freund wrote:
>> > I think that's not really related. Such a promotion doesn't cause data
>> > loss in the sense of losing data a *clueful* operator wanted to
>> > keep. Yes, it can be used wrongly, but it's far from alone in that.
>>
>> Yes it does cause data loss. The clueful operator has no idea where
>> they are, so there is no "used rightly" in that case.
>
> What? There definitely are cases where you don't need to know that to
> the T. Just think of the - quite frequently happening - need to promote
> a standby to run tests or reporting queries that can't be run on a
> standby.

What do they do with the standby afterwards?

Perhaps for testing, but I'd hope that Business Intelligence is done by freezing databases at known target times. So at least you can say, "using a database snapshot of 9am, we had the following results".

We seem to be trying to justify something that is dangerous and will destroy data for incautious users. Of course it has uses, but that's not the point; it's the danger that is the problem, not the lack of use. We go to a lot of trouble to avoid footguns elsewhere across many years, so I can't see why you'd want to have the --footgun option added here.

recovery-target = 'vague'

> Sure, you shouldn't use it if you expect a very specific set of the data
> being there, but that's not always necessary. And that's why it should
> never, ever be the default.

--
Simon Riggs                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Thu, Sep 25, 2014 at 1:21 PM, Simon Riggs wrote: > The test index is unique, so our to-be-inserted value exists on only > one page, hence page locking applies while we insert it. The next > person to insert waits for the page lock and then sees the test tuple. > > The page lock lasts only for the duration of the insertion of the > ituple, not for the whole operation. (by page lock, I take it you mean buffer lock - converting that into a page hwlock is what I do). This is where it gets quite complicated. What happens if row locking on upsert finds a conflicting update changing uniquely-constrained attributes? Sure, a vanilla non-HOT update will fail on inserting a unique index tuple, but *it* can still cause us a row-level conflict, and *it* can only fail (with a dup violation) when we commit/abort. But now we're obligated to wait on it to get the row lock, and it's obligated to wait on us to get the promise tuple lock, or any other sort of "value lock" that hasn't already been released when we go to row lock. Deadlock. You cannot get away with failing to release the promise tuple/value lock if you want to maintain useful guarantees. It doesn't need to be a vanilla non-HOT update. That's just the simplest example I can think of. >> Also, I think there are other cases where we think we're going to >> insert, so we put the special index tuple in there, but then we decide >> to update, so we don't need the promise tuple any more, but other >> sessions are potentially still waiting for our XID to terminate even >> though there's no conflict any more. I'm having a hard time bringing >> the details of those cases to mind ATM, though. > > We make the decision to INSERT or UPDATE based upon what we find in > the test index. If a value is there already, we assume it's an UPDATE > and go to update the row this points to. If it has been deleted we > loop back and try again/error. Sure, you can throw an error, and that makes things a lot easier. It also implies that the implementation is inferior to the subxact looping pattern, which you've already implied is a thing we must beat in every way. Frankly, I think it's a cop-out to just throw an error, and I don't think it'll end up being some theoretical risk. It'll happen often if it is allowed to happen at all. Allowing it to happen almost defeats the purpose of the feature - the big appeal of the feature is that it makes guarantees about the outcome. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On 25 September 2014 20:38, Peter Geoghegan wrote: > On Thu, Sep 25, 2014 at 7:12 AM, Simon Riggs wrote: >> The way forwards, in my view, is to define precisely the behaviour we >> wish to have. That definition will include the best current mechanism >> for running an UPSERT using INSERT/UPDATE/loops and comparing that >> against what is being provided here. We will then have a functional >> test of equivalence of the approaches, and a basis for making a >> performance test that shows that performance is increased without any >> loss of concurrency. > > That sounds very reasonable. So I promise not to discuss locking until we get the first things done. My suggested approach to get this committed is...

A. UPDATE/INSERT privilege infrastructure. Add tests to it, make it separately committable, so we can get that done. Submit to the Oct CF; get that done early.

B. Agree command semantics by producing these things:
* Explanatory documentation (Ch6.4 Data Manipulation - Upsert)
* SQL Reference Documentation (INSERT)
* Test cases for feature
* Test cases for concurrency
* Test cases for pgbench
All of the above, as a separate committable patch. I hate the fact that you have written no user-facing documentation for this feature. How can anyone tell whether the tests you've written are correct, or even consistent with a particular definition of correctness? Submit as a patch for review only to the Oct 15 CF. We then agree what is required for further work. At this stage, poll the Django and Rails communities for acceptance and early warning of these features. Listen.

C. Internal weirdness. Submit C based upon the earlier agreed B, submit to the Dec 15 CF, the major patch deadline, so we can fine-tune for the last CF. Then Heikki rewrites half your patch in a better way, you thank him and then we commit. All done.

> While I'm sure that what I have here can > decisively beat the xact looping pattern in terms of performance as > measured by pgbench, the real performance advantage is that this > approach doesn't burn through XIDs. That was a concern that Andres > highlighted in relation to using the subxact looping pattern with > BDR's multi-master replication conflict resolution. But we're still discussing SQL semantics. So first things first, then loop back around, hoping our design has not been concurrently deleted... -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: rounding up time value less than its unit.
* Gregory Smith (gregsmithpg...@gmail.com) wrote: > On 9/25/14, 2:02 PM, Peter Eisentraut wrote: > >But having the same parameter setting mean different things in > >different versions is the path to complete madness. > > Could we go so far as to remove support for unitless time settings > eventually? The fact that people are setting raw numbers in the > configuration file and have to know the unit to understand what they > just did has never been something I like. I could certainly get behind that idea... Though I do understand that people will complain about backwards compatibility, etc, etc. Thanks, Stephen
Re: [HACKERS] END_OF_RECOVERY shutdowns and ResetUnloggedRelations()
Hi, On 2014-09-24 17:06:05 +0530, Abhijit Menon-Sen wrote: > Hi Andres, Robert. > > I've attached four patches here. Cool. Will review. > 1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to >earlier in StartupXLOG. > > 2. Inside that function, issue fsync()s for the main forks we create by >copying the _init fork. These two imo should definitely be backpatched. > 3. A small fixup to add a const to "typedef char *FileName", because the >earlier patch gave me warnings about discarding const-ness. This is >consistent with many other functions in fd.c that take const char *. I'm happy to consider that one for master (although I seem to recall having had a patch for it rejected?), but I don't think we want to do that in the backbranches. > 4. Issue an fsync() on the data directory at startup if we need to >perform crash recovery. I personally think this one should be backpatched too. Does anyone believe differently? > From d8726c06cdf11674661eac1d091cf7edd05c2a0c Mon Sep 17 00:00:00 2001 > From: Abhijit Menon-Sen > Date: Wed, 24 Sep 2014 14:43:18 +0530 > Subject: Call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier > > We need to call this after recovery, but not after the END_OF_RECOVERY > checkpoint is written. If we crash after that checkpoint, for example > because of ENOSPC in PreallocXlogFiles, the checkpoint has already set > the ControlFile->state to DB_SHUTDOWNED, so we don't enter recovery > again at startup. > > Because we did call ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP) > in the first cleanup, this leaves the database with a bunch of _init > forks for unlogged relations, but no main forks, leading to scary > errors. > > See thread from 20140912112246.ga4...@alap3.anarazel.de for details. With an explanation. Shiny! > --- > src/backend/access/transam/xlog.c | 16 > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/src/backend/access/transam/xlog.c > b/src/backend/access/transam/xlog.c > index 46eef5f..218f7fb 100644 > --- a/src/backend/access/transam/xlog.c > +++ b/src/backend/access/transam/xlog.c > @@ -6863,6 +6863,14 @@ StartupXLOG(void) > ShutdownWalRcv(); > > /* > + * Reset initial contents of unlogged relations. This has to be done > + * AFTER recovery is complete so that any unlogged relations created > + * during recovery also get picked up. > + */ > + if (InRecovery) > + ResetUnloggedRelations(UNLOGGED_RELATION_INIT); > + + * Also recovery shouldn't be regarded successful if the reset fails - * e.g. because of ENOSPC. > + > + /* > + * copy_file() above has already called pg_flush_data() on the > + * files it created. Now we need to fsync those files, because > + * a checkpoint won't do it for us while we're in recovery. > + */ + * Doing this in a separate pass is advantageous for performance reasons * because it allows the kernel to perform all the flushes at once. > + /* > + * If we need to perform crash recovery, we issue an fsync on the > + * data directory to try to ensure that any data written before the > + * crash are flushed to disk. Otherwise a power failure in the near > + * future might mean that earlier unflushed writes are lost, but the > + * more recent data written to disk from here on are persisted. 
> + */ > + > + if (ControlFile->state != DB_SHUTDOWNED && > + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY) > + fsync_fname(data_directory, true); > + > if (ControlFile->state == DB_SHUTDOWNED) > { > /* This is the expected case, so don't be chatty in standalone > mode */ > -- Unless I miss something this isn't sufficient. We need to fsync the files in the data directory, not just the toplevel directory? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 25, 2014 at 3:17 PM, Peter Geoghegan wrote: >> To find out how much that optimization buys, you >> should use tuples with many variable-length columns (say, 50) >> preceding the text column you're sorting on. I won't be surprised if >> that turns out to be expensive enough to be worth worrying about, but >> I have not benchmarked it. > > Sorry, but I don't follow. I don't think the pertinent question is if > it's a noticeable cost. I think the pertinent question is if it's > worth it. Doing something about it necessitates a lot of extra memory > access. Not doing something about it hardly affects the amount of > memory access required, perhaps not at all. I think you're mincing words. If you go back and fix datum1, you'll spend a bunch of effort doing that. If you don't, you'll pay a cost on every comparison to re-find the relevant column inside each tuple. You can compare those costs in a variety of cases, including the one I mentioned, where the latter cost will be relatively high. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review of GetUserId() Usage
* Peter Eisentraut (pete...@gmx.net) wrote: > On 9/24/14 4:58 PM, Stephen Frost wrote: > > * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: > >> I think the case for pgstat_get_backend_current_activity() and > >> pg_stat_get_activity and the other pgstatfuncs.c callers is easy to make > >> and seems acceptable to me; but I would leave pg_signal_backend out of > >> that discussion, because it has a potentially harmful side effect. By > >> requiring SET ROLE you add an extra layer of protection against > >> mistakes. (Hopefully, pg_signal_backend() is not a routine thing for > >> well-run systems, which means human intervention, and therefore the room > >> for error isn't insignificant.) > > > > While I certainly understand where you're coming from, I don't really > > buy into it. Yes, cancelling a query (the only thing normal users can > > do anyway- they can't terminate backends) could mean the loss of any > > in-progress work, but it's not like 'rm' and I don't see that it needs > > to require extra hoops for individuals to go through. > > It would be weird if it were inconsistent: some things require role > membership, some things require SET ROLE. Try explaining that. I agree. We already have that distinction, through inherit vs. noinherit. I don't think it makes sense to have it also for individual commands which we feel "might not be as safe". You could still go delete all their data without a SET ROLE if you wanted... Thanks, Stephen
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On 25 September 2014 19:59, Peter Geoghegan wrote: > On Thu, Sep 25, 2014 at 9:20 AM, Robert Haas wrote: >> I've never been a fan of putting the index name in there. > > Me neither. Although I do understand Kevin's concern about the user's > intent surrounding which unique index to merge on. The use case cited is real. My solution of using an after trigger works, yet without adding specific functionality to this command. So we can achieve what users want without complicating things here. If we do decide we really want it, let's add it as a later patch. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On 25 September 2014 20:11, Robert Haas wrote: >> My approach would be to insert an index tuple for that value into the >> index, but with the leaf ituple marked with an xid rather than a ctid. >> If someone tries to insert into the index they would see this and wait >> for the inserting transaction to end. The inserting transaction would >> then resolve what happens in the heap (insert/update) and later >> repoint the index tuple to the inserted/updated row version. I don't >> see the need for page level locking since it would definitely result >> in deadlocks (e.g. SQLServer). > > I think that something like this might work, but the devil is in the > details. Suppose two people try to upsert into the same table at the > same time. There's one index. If the transactions search that index > for conflicts first, neither sees any conflicting tuples, and both > proceed. That's no good. OK, so suppose each transaction inserts the > special index tuple which you mention, to lock out concurrent inserts > of that value, and then searches for already-existing conflicts. Each > sees the other's tuple, and they deadlock. That's no good, either. The test index is unique, so our to-be-inserted value exists on only one page, hence page locking applies while we insert it. The next person to insert waits for the page lock and then sees the test tuple. The page lock lasts only for the duration of the insertion of the ituple, not for the whole operation. > Also, I think there are other cases where we think we're going to > insert, so we put the special index tuple in there, but then we decide > to update, so we don't need the promise tuple any more, but other > sessions are potentially still waiting for our XID to terminate even > though there's no conflict any more. I'm having a hard time bringing > the details of those cases to mind ATM, though. We make the decision to INSERT or UPDATE based upon what we find in the test index. If a value is there already, we assume it's an UPDATE and go to update the row this points to. If it has been deleted we loop back and try again/error. If the value is not present, we insert the test tuple and progress as an INSERT, then loop back later to set the ctid. There is no case of "don't need promise id anymore". We would use the PK, identity or first unique index as the test index. There is a case where an UPSERT conflicts with an INSERT causing the latter to abort. Anyway, this is why we need the design more clearly exposed, so you can tell me I'm wrong by showing me the URL of it done right. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: rounding up time value less than its unit.
On 9/25/14, 2:02 PM, Peter Eisentraut wrote: But having the same parameter setting mean different things in different versions is the path to complete madness. Could we go so far as to remove support for unitless time settings eventually? The fact that people are setting raw numbers in the configuration file and have to know the unit to understand what they just did has never been something I like. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Thu, Sep 25, 2014 at 10:00 AM, Jeff Janes wrote: > On Wed, Sep 24, 2014 at 2:48 AM, Dilip kumar > wrote: > >> On 24 August 2014 11:33, Amit Kapila Wrote >> >> >> >> Thanks for you comments, i have worked on both the review comment lists, >> sent on 19 August, and 24 August. >> >> >> >> Latest patch is attached with the mail.. >> > > Hi Dilip, > > I think you have an off-by-one error in the index into the array of file > handles. > Actually the problem is that the socket for the master connection was not getting initialized, see my one line addition here. connSlot = (ParallelSlot*)pg_malloc(concurrentCons * sizeof(ParallelSlot)); connSlot[0].connection = conn; + connSlot[0].sock = PQsocket(conn); However, I don't think it is good to just ignore errors from the select call (like the EBADF) and go into a busy loop instead, so there are more changes needed than this. Also, cancelling the run (by hitting ctrl-C in the shell that invoked it) does not seem to work on linux. I get a message that says "Cancel request sent", but then it continues to finish the job anyway. Cheers, Jeff
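To make the two fixes concrete, here is a sketch (hypothetical code following the patch's naming conventions, not the actual submission) of the slot setup plus a select() call that fails hard instead of busy-looping on errors such as EBADF:

    ParallelSlot   *connSlot;
    fd_set          slotset;
    int             maxsock = -1;
    int             i;

    /* slot 0 reuses the already-open master connection */
    connSlot = (ParallelSlot *) pg_malloc(concurrentCons * sizeof(ParallelSlot));
    connSlot[0].connection = conn;
    connSlot[0].sock = PQsocket(conn);      /* the missing initialization */

    /* ... remaining slots connected and made nonblocking as in the patch ... */

    /* wait for any connection to become readable */
    FD_ZERO(&slotset);
    for (i = 0; i < concurrentCons; i++)
    {
        FD_SET(connSlot[i].sock, &slotset);
        if (connSlot[i].sock > maxsock)
            maxsock = connSlot[i].sock;
    }

    if (select(maxsock + 1, &slotset, NULL, NULL, NULL) < 0)
    {
        if (errno != EINTR)
        {
            fprintf(stderr, "select() failed: %s\n", strerror(errno));
            exit(1);            /* a bad fd must be fatal, not a busy loop */
        }
    }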
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Thu, Sep 25, 2014 at 7:12 AM, Simon Riggs wrote: > The way forwards, in my view, is to define precisely the behaviour we > wish to have. That definition will include the best current mechanism > for running an UPSERT using INSERT/UPDATE/loops and comparing that > against what is being provided here. We will then have a functional > test of equivalence of the approaches, and a basis for making a > performance test that shows that performance is increased without any > loss of concurrency. That sounds very reasonable. While I'm sure that what I have here can decisively beat the xact looping pattern in terms of performance as measured by pgbench, the real performance advantage is that this approach doesn't burn through XIDs. That was a concern that Andres highlighted in relation to using the subxact looping pattern with BDR's multi-master replication conflict resolution. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Thu, Sep 25, 2014 at 12:11 PM, Robert Haas wrote: > I think that something like this might work, but the devil is in the > details. Suppose two people try to upsert into the same table at the > same time. There's one index. If the transactions search that index > for conflicts first, neither sees any conflicting tuples, and both > proceed. That's no good. OK, so suppose each transaction inserts the > special index tuple which you mention, to lock out concurrent inserts > of that value, and then searches for already-existing conflicts. Each > sees the other's tuple, and they deadlock. That's no good, either. I'm very glad that you share my concern about deadlocks like this. > Also, I think there are other cases where we think we're going to > insert, so we put the special index tuple in there, but then we decide > to update, so we don't need the promise tuple any more, but other > sessions are potentially still waiting for our XID to terminate even > though there's no conflict any more. I'm having a hard time bringing > the details of those cases to mind ATM, though. Well, you might have a promise tuple in a unique index on attributes not appearing in the UPDATE's targetlist, for one. You have the other session waiting (doesn't have to be an upserter) just because we *thought about* inserting a value as part of an upsert. That's pretty bad. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
Bruce Momjian wrote: > 3. 9.3 multi-xact bugs spooked us into being more careful Uh. Multixact changes in 9.3 were infinitely more invasive than the jsonb changes will ever be.
a) they touched basic visibility design and routines, which are complex, understood by very few people, and have remained mostly unchanged for ages;
b) they changed on-disk format for an underlying support structure, requiring pg_upgrade to handle the conversion;
c) they added new catalog infrastructure to keep track of required freezing;
d) they introduced new uint32 counters subject to wraparound;
e) they introduced a novel user of slru.c with 5-char long filenames;
f) they messed with tuple locking protocol and EvalPlanQual logic for traversing update chains.
Maybe I'm forgetting others. JSONB has none of these properties. As far as I can see, the only hairy issue here (other than getting Josh Berkus to actually test the proposed patches) is that JSONB is changing on-disk format; but we're avoiding most issues there by dictating that people with existing JSONB databases need to pg_dump them, i.e. there is no conversion step being written for pg_upgrade. It's good to be careful; it's even better to be more careful. I too have learned a lesson there. Anyway I have no opinion on the JSONB stuff, other than considering that ignoring performance for large arrays and large objects seems to run counter to the whole point of JSONB in the first place (and of course failing to compress is part of that, too.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Thu, Sep 25, 2014 at 09:00:07PM +0200, Andres Freund wrote: > On 2014-09-25 14:46:18 -0400, Bruce Momjian wrote: > > On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote: > > > BTW, it seems like there is consensus that we ought to reorder the items > > > in a jsonb object to have keys first and then values, independently of the > > > other issues under discussion. This means we *will* be breaking on-disk > > > compatibility with 9.4beta2, which means pg_upgrade will need to be taught > > > to refuse an upgrade if the database contains any jsonb columns. Bruce, > > > do you have time to crank out a patch for that? > > > > Yes, I can do that easily. Tell me when you want it --- I just need a > > catalog version number to trigger on. > > Do you plan to make it conditional on jsonb being used in the database? > That'd not be bad to reduce the pain for testers that haven't used jsonb. Yes, I already have code that scans pg_attribute looking for columns with problematic data types, outputs them to a file, and then throws an error. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
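Presumably such a check runs something along these lines in each database (a sketch of the idea, not Bruce's actual code):

    SELECT n.nspname, c.relname, a.attname
      FROM pg_catalog.pg_attribute a
      JOIN pg_catalog.pg_class c ON c.oid = a.attrelid
      JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
     WHERE a.atttypid = 'pg_catalog.jsonb'::pg_catalog.regtype
       AND NOT a.attisdropped
       AND c.relkind IN ('r', 'm')    -- tables and matviews
     ORDER BY 1, 2, 3;
    -- a complete check would also have to chase domains, arrays and
    -- composite types built on top of jsonb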
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 25, 2014 at 11:53 AM, Robert Haas wrote: > I haven't looked at that part of the patch in detail yet, so... not > really. But I don't see why you'd ever need to restart heap tuple > copying. At most you'd need to re-extract datum1 from the tuples you > have already copied. Well, okay, technically it's not restarting heap tuple copying, but it's about the same thing. The point is that you have to stream a significant chunk of the memtuples array through memory again. Not to mention, having infrastructure to do that, and pick up where we left off, which is significantly more code, all to make comparetup_heap() a bit clearer (i.e. making it so that it won't have to think about an extra sortsupport state). > To find out how much that optimization buys, you > should use tuples with many variable-length columns (say, 50) > preceding the text column you're sorting on. I won't be surprised if > that turns out to be expensive enough to be worth worrying about, but > I have not benchmarked it. Sorry, but I don't follow. I don't think the pertinent question is if it's a noticeable cost. I think the pertinent question is if it's worth it. Doing something about it necessitates a lot of extra memory access. Not doing something about it hardly affects the amount of memory access required, perhaps not at all. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Thu, Sep 25, 2014 at 2:17 PM, Simon Riggs wrote: >> 1. You don't accept that value locks must be easily released in the >> event of a conflict. Is anyone in this camp? It's far from obvious to >> me what side of this question Andres is on at this stage, for example. >> Robert might have something to say here too. >> >> 2. Having taken into account the experience of myself and Heikki, and >> all that is implied by taking that approach ***while avoiding >> unprincipled deadlocks***, you continue to believe that an approach >> based on speculative heap insertion, or some alternative scheme is >> better than what I have done to the nbtree code here, or you otherwise >> dislike something about the proposed value locking scheme. You accept >> that value locks must be released and released easily in the event of >> a conflict, but like Heikki you just don't like what I've done to get >> there. >> >> Since we can (I believe) talk about the value locking aspect and the >> rest of the patch independently, we should do so...unless you're in >> camp 1, in which case I guess that we'll have to thrash it out. > > I'm trying to understand and help out with pushing this patch forwards > to completion. > > Basically, I have absolutely no idea whether I object to or agree with > 1) and don't know where to look to find out. We need a clear > exposition of design and the alternatives. I laughed when I read this, because I think a lot of the discussion on this topic has been unnecessarily muddled by jargon. > My approach would be to insert an index tuple for that value into the > index, but with the leaf ituple marked with an xid rather than a ctid. > If someone tries to insert into the index they would see this and wait > for the inserting transaction to end. The inserting transaction would > then resolve what happens in the heap (insert/update) and later > repoint the index tuple to the inserted/updated row version. I don't > see the need for page level locking since it would definitely result > in deadlocks (e.g. SQLServer). I think that something like this might work, but the devil is in the details. Suppose two people try to upsert into the same table at the same time. There's one index. If the transactions search that index for conflicts first, neither sees any conflicting tuples, and both proceed. That's no good. OK, so suppose each transaction inserts the special index tuple which you mention, to lock out concurrent inserts of that value, and then searches for already-existing conflicts. Each sees the other's tuple, and they deadlock. That's no good, either. Also, I think there are other cases where we think we're going to insert, so we put the special index tuple in there, but then we decide to update, so we don't need the promise tuple any more, but other sessions are potentially still waiting for our XID to terminate even though there's no conflict any more. I'm having a hard time bringing the details of those cases to mind ATM, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-09-25 14:46:18 -0400, Bruce Momjian wrote: > On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote: > > BTW, it seems like there is consensus that we ought to reorder the items > > in a jsonb object to have keys first and then values, independently of the > > other issues under discussion. This means we *will* be breaking on-disk > > compatibility with 9.4beta2, which means pg_upgrade will need to be taught > > to refuse an upgrade if the database contains any jsonb columns. Bruce, > > do you have time to crank out a patch for that? > > Yes, I can do that easily. Tell me when you want it --- I just need a > catalog version number to trigger on. Do you plan to make it conditional on jsonb being used in the database? That'd not be bad to reduce the pain for testers that haven't used jsonb. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Thu, Sep 25, 2014 at 9:20 AM, Robert Haas wrote: > I've never been a fan of putting the index name in there. Me neither. Although I do understand Kevin's concern about the user's intent surrounding which unique index to merge on. > I agree > that's stuff that a DML statement shouldn't need to know about. What > I've advocated for in the past is specifying the list of columns that > should be used to determine whether to insert or update. If you have > a match on those columns, update the row; else insert. Any other > unique indexes stand or fall as may be. > > I still think that idea has merit. As I've said, my problem with that idea is the corner cases. Consider the possible ambiguity. Could DML queries in production start failing ("ambiguous unique index specification") because the DBA created a new unique index on attributes that somewhat overlap the attributes of the unique index that the DML author actually meant? What about the effects of BEFORE triggers, and their interaction with partial unique indexes? If you can describe an exact behavior that overcomes these issues, then I'll give serious consideration to implementing it. As things stand, to be perfectly honest it sounds like a footgun to me. There are interactions that make getting it right very ticklish. I don't want to make a unique index specification mandatory because that's ugly - that's the only reason, TBH. However, while what you describe here accomplishes the same thing without being ugly, it is potentially very surprising. Naming the unique index directly has the great advantage of very clearly demonstrating user intent. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 25, 2014 at 2:05 PM, Peter Geoghegan wrote: > On Thu, Sep 25, 2014 at 9:21 AM, Robert Haas wrote: >> The top issue on my agenda is figuring out a way to get rid of the >> extra SortSupport object. > > Really? I'm surprised. Clearly the need to restart heap tuple copying > from scratch, in order to make the datum1 representation consistent, > rather than abandoning datum1 for storing abbreviated keys or pointers > entirely is a very important aspect of whether or not we should change > that. In turn, that's something that's going to (probably > significantly) affect the worst case. > > Do you have an opinion on that? I haven't looked at that part of the patch in detail yet, so... not really. But I don't see why you'd ever need to restart heap tuple copying. At most you'd need to re-extract datum1 from the tuples you have already copied. To find out how much that optimization buys, you should use tuples with many variable-length columns (say, 50) preceding the text column you're sorting on. I won't be surprised if that turns out to be expensive enough to be worth worrying about, but I have not benchmarked it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
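The structure at issue, roughly (a simplified sketch of tuplesort's in-memory entry, not the exact definition):

    typedef struct
    {
        void   *tuple;      /* the full tuple, out of line */
        Datum   datum1;     /* cached value of the leading sort key */
        bool    isnull1;    /* is datum1 NULL? */
        int     tupindex;   /* run/merge bookkeeping */
    } SortTuple;

    /*
     * With datum1 cached, most comparisons never touch *tuple.  Without
     * it, each comparison must deform the tuple far enough to re-find
     * the sort column, which is expensive when many variable-length
     * columns precede it, as in the 50-column example above.
     */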
Re: [HACKERS] jsonb format is pessimal for toast compression
On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote: > BTW, it seems like there is consensus that we ought to reorder the items > in a jsonb object to have keys first and then values, independently of the > other issues under discussion. This means we *will* be breaking on-disk > compatibility with 9.4beta2, which means pg_upgrade will need to be taught > to refuse an upgrade if the database contains any jsonb columns. Bruce, > do you have time to crank out a patch for that? Yes, I can do that easily. Tell me when you want it --- I just need a catalog version number to trigger on. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Thu, Sep 25, 2014 at 7:35 AM, Robert Haas wrote: > On Thu, Sep 25, 2014 at 10:12 AM, Simon Riggs wrote: >> IMHO it is impossible to know if any of the other code is correct >> until we have a clear and stable vision of what the command is >> supposed to perform. > > +1. > >> The inner workings are less important than what the feature does. > > +1. > >> FWIW, the row available at the end of all BEFORE triggers is clearly >> the object we should be manipulating, not the original VALUES() >> clause. Otherwise this type of INSERT would behave differently from >> normal INSERTs. Which would likely violate RLS, if nothing else. > > +1. I agree with all of this. I'm glad that my opinion on how a CONFLICTING() expression interacts with BEFORE triggers is accepted, too. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
BTW, it seems like there is consensus that we ought to reorder the items in a jsonb object to have keys first and then values, independently of the other issues under discussion. This means we *will* be breaking on-disk compatibility with 9.4beta2, which means pg_upgrade will need to be taught to refuse an upgrade if the database contains any jsonb columns. Bruce, do you have time to crank out a patch for that? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Thu, Sep 25, 2014 at 11:17 AM, Simon Riggs wrote: > Basically, I have absolutely no idea whether I object to or agree with > 1) and don't know where to look to find out. We need a clear > exposition of design and the alternatives. > > My approach would be to insert an index tuple for that value into the > index, but with the leaf ituple marked with an xid rather than a ctid. > If someone tries to insert into the index they would see this and wait > for the inserting transaction to end. The inserting transaction would > then resolve what happens in the heap (insert/update) and later > repoint the index tuple to the inserted/updated row version. I don't > see the need for page level locking since it would definitely result > in deadlocks (e.g. SQLServer). The page level locks are only used to prevent concurrent insertion for as long as it takes to get consensus to proceed among unique indexes, and to actually insert a heap tuple. They're all released before we lock the tuple for update, should we take that path (yes, really). This is consistent with the behavior of other systems, I think. That's my whole reason for preferring to do things that way. If you have a "promise tuples" approach - be it what you outline here, or what Heikki prototyped with heap tuple insertion, or any other - then you need a way to *release* those "value locks" in the event of a conflict/needing to update, before locking/updating. Otherwise, you get deadlocks. This is an issue I highlighted when it came up with Heikki's prototype. AFAICT, any scheme for "value locking" needs to strongly consider the need to *release* value locks inexpensively. Whatever else they do, they cannot persist for the duration of the transaction IMV. Does that make sense? If not, my next suggestion is applying an earlier revision of Heikki's prototype, and seeing for yourself how it can be made to deadlock in an unprincipled/impossible to prevent way [1]. You've quite rightly highlighted the existing subxact looping pattern as something that this needs to be better than in every way. This is one important way in which we might fail to live up to that standard. [1] http://www.postgresql.org/message-id/52b4aaf0.5090...@vmware.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
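For reference, the "subxact looping pattern" that any of these schemes has to beat is essentially the example from the PostgreSQL documentation; every pass through the EXCEPTION block opens a subtransaction, which is why the pattern burns through XIDs under contention (sketch, assuming a table "db" with a primary key):

    CREATE TABLE db (a INT PRIMARY KEY, b TEXT);

    CREATE FUNCTION merge_db(key INT, data TEXT) RETURNS VOID AS
    $$
    BEGIN
        LOOP
            -- first try to update an existing row
            UPDATE db SET b = data WHERE a = key;
            IF found THEN
                RETURN;
            END IF;
            -- not there: try to insert; a concurrent insert of the same
            -- key raises unique_violation, and we loop to retry the UPDATE
            BEGIN
                INSERT INTO db (a, b) VALUES (key, data);
                RETURN;
            EXCEPTION WHEN unique_violation THEN
                -- do nothing, loop back and try the UPDATE again
            END;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;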
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/25/2014 11:22 AM, Tom Lane wrote: > In the interests of pushing this forward, I will work today on > trying to finish and review Heikki's offsets-and-lengths patch > so that we have something we can do performance testing on. > I doubt that the performance testing will tell us anything we > don't expect, but we should do it anyway. OK. I'll spend some time trying to get Socorro with JSONB working so that I'll have a second test case. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
Josh Berkus writes: > On 09/25/2014 10:26 AM, Andres Freund wrote: >> On 2014-09-25 10:25:24 -0700, Josh Berkus wrote: >>> If Heikki says it's ready, I'll test. So far he's said that it wasn't >>> done yet. >> http://www.postgresql.org/message-id/541c242e.3030...@vmware.com > Yeah, and that didn't include some of Tom's bug fixes apparently, per > the succeeding message. Which is why I asked Heikki if he was done, to > which he has not replied. I took a quick look at the two patches Heikki posted. I find the "separate offsets array" approach unappealing. It takes more space than the other approaches, and that space will be filled with data that we already know will not be at all compressible. Moreover, AFAICS we'd have to engrave the stride on stone tablets, which as I already mentioned I'd really like to not do. The "offsets-and-lengths" patch seems like the approach we ought to compare to my patch, but it looks pretty unfinished to me: AFAICS it includes logic to understand offsets sprinkled into a mostly-lengths array, but no logic that would actually *store* any such offsets, which means it's going to act just like my patch for performance purposes. In the interests of pushing this forward, I will work today on trying to finish and review Heikki's offsets-and-lengths patch so that we have something we can do performance testing on. I doubt that the performance testing will tell us anything we don't expect, but we should do it anyway. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
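To sketch what "offsets sprinkled into a mostly-lengths array" means in practice (an illustration with made-up macro names, not Heikki's actual patch): each child entry normally stores its length, every Nth entry instead stores the absolute end offset of its child, and random access walks backwards summing lengths until it hits a stored offset:

    typedef unsigned int JEntry;    /* one word of metadata per child */

    #define STRIDE      32          /* writer sets HAS_OFF on every STRIDE-th entry */
    #define HAS_OFF(je) ((je) & 0x80000000u)    /* payload is an end offset */
    #define PAYLOAD(je) ((je) & 0x7fffffffu)    /* length, or end offset */

    /*
     * Start offset of child "index": sum lengths backwards until an
     * entry whose payload is an absolute end offset is found.
     */
    static unsigned int
    child_offset(const JEntry *entries, int index)
    {
        unsigned int off = 0;
        int          i;

        for (i = index - 1; i >= 0; i--)
        {
            off += PAYLOAD(entries[i]);
            if (HAS_OFF(entries[i]))
                break;
        }
        return off;
    }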
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On 28 August 2014 03:43, Peter Geoghegan wrote: > "Value locking" > === > > To date, on-list discussion around UPSERT has almost exclusively > concerned what I've called "value locking"; the idea of locking values > in unique indexes in the abstract (to establish the right to insert > ahead of time). There was some useful discussion on this question > between myself and Heikki back around December/January. Ultimately, we > were unable to reach agreement on an approach and discussion tapered > off. However, Heikki did understand the concerns that informed my > design. He recognized the need to be able to easily *release* value > locks, so as to avoid "unprincipled deadlocks", where under high > concurrency there are deadlocks between sessions that only UPSERT a > single row at a time. I'm not sure how widely appreciated this point > is, but I believe that Heikki appreciates it. It is a very important > point in my opinion. I don't want an implementation that is in any way > inferior to the "UPSERT looping subxact" pattern (i.e. the plpgsql > thing that the docs suggest). > > When we left off, Heikki continued to favor an approach that involved > speculatively inserting heap tuples, and then deleting them in the > event of a conflict. This design was made more complicated when the > need to *release* value locks became apparent (Heikki ended up making > some changes to HeapTupleSatisfiesDirty(), as well as sketching a > design for what you might call a "super delete", where xmin can be set > to InvalidTransactionId for speculatively-inserted heap tuples). After > all, it wasn't as if we could abort a subxact to release locks, which > is what the "UPSERT looping subxact" pattern does. I think it's fair > to say that that design became more complicated than initially > anticipated [4] [5]. > > Anyway, the greater point here is that fundamentally, AFAICT Heikki > and I were in agreement. Once you buy into the idea that we must avoid > holding on to "value locks" of whatever form - as Heikki evidently did > - then exactly what form they take is ultimately only a detail. > Granted, it's a very important detail, but a detail nonetheless. It > can be discussed entirely independently of all of this new stuff, and > thank goodness for that. > > If anyone finds my (virtually unchanged) page heavyweight lock based > value locking approach objectionable, I ask that the criticism be > framed in a way that makes a sharp distinction between each of the > following: > > 1. You don't accept that value locks must be easily released in the > event of a conflict. Is anyone in this camp? It's far from obvious to > me what side of this question Andres is on at this stage, for example. > Robert might have something to say here too. > > 2. Having taken into account the experience of myself and Heikki, and > all that is implied by taking that approach ***while avoiding > unprincipled deadlocks***, you continue to believe that an approach > based on speculative heap insertion, or some alternative scheme is > better than what I have done to the nbtree code here, or you otherwise > dislike something about the proposed value locking scheme. You accept > that value locks must be released and released easily in the event of > a conflict, but like Heikki you just don't like what I've done to get > there. > > Since we can (I believe) talk about the value locking aspect and the > rest of the patch independently, we should do so...unless you're in > camp 1, in which case I guess that we'll have to thrash it out. 
I'm trying to understand and help out with pushing this patch forwards to completion. Basically, I have absolutely no idea whether I object to or agree with 1) and don't know where to look to find out. We need a clear exposition of design and the alternatives. My approach would be to insert an index tuple for that value into the index, but with the leaf ituple marked with an xid rather than a ctid. If someone tries to insert into the index they would see this and wait for the inserting transaction to end. The inserting transaction would then resolve what happens in the heap (insert/update) and later repoint the index tuple to the inserted/updated row version. I don't see the need for page level locking since it would definitely result in deadlocks (e.g. SQLServer). -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review of GetUserId() Usage
On 9/24/14 4:58 PM, Stephen Frost wrote: > Alvaro, > > * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: >> I think the case for pgstat_get_backend_current_activity() and >> pg_stat_get_activity and the other pgstatfuncs.c callers is easy to make >> and seems acceptable to me; but I would leave pg_signal_backend out of >> that discussion, because it has a potentially harmful side effect. By >> requiring SET ROLE you add an extra layer of protection against >> mistakes. (Hopefully, pg_signal_backend() is not a routine thing for >> well-run systems, which means human intervention, and therefore the room >> for error isn't insignificant.) > > While I certainly understand where you're coming from, I don't really > buy into it. Yes, cancelling a query (the only thing normal users can > do anyway- they can't terminate backends) could mean the loss of any > in-progress work, but it's not like 'rm' and I don't see that it needs > to require extra hoops for individuals to go through. It would be weird if it were inconsistent: some things require role membership, some things require SET ROLE. Try explaining that. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part2: fast scan
On 12 March 2014 16:29, Heikki Linnakangas wrote: > Good point. We have done two major changes to GIN in this release cycle: > changed the data page format and made it possible to skip items without > fetching all the keys ("fast scan"). gincostestimate doesn't know about > either change. Did this cost estimation issue get fixed? If not, and if this is due for 9.5, can this be removed from the 9.4 open items list? Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 25, 2014 at 9:21 AM, Robert Haas wrote: > The top issue on my agenda is figuring out a way to get rid of the > extra SortSupport object. Really? I'm surprised. Clearly the need to restart heap tuple copying from scratch, in order to make the datum1 representation consistent, rather than abandoning datum1 for storing abbreviated keys or pointers entirely, is a very important aspect of whether or not we should change that. In turn, that's something that's going to (probably significantly) affect the worst case. Do you have an opinion on that? If you want me to start from scratch, and then have a consistent datum1 representation, and then be able to change the structure of comparetup_heap() as you outline (so as to get rid of the extra SortSupport object), I can do that. My concern is the regression. The datum1 pointer optimization appears to matter very little for pass-by-value types (traditionally, before abbreviated keys), and so I have a hard time imagining this working out. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: rounding up time value less than its unit.
On 9/25/14 11:03 AM, Robert Haas wrote: > I couldn't agree more. There's something to be said for just leaving > this alone. I agree. > potentially draw complaints. But I also agree with his last one - of > those three possible complaints, I certainly prefer "I had to fix my > configuration file for the new, stricter validation" over any variant > of "my configuration file still worked but it did something > surprisingly different from what it used to do.". Yes. I don't mind that we rename parameters from time to time when semantics change or are refined. But having the same parameter setting mean different things in different versions is the path to complete madness. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
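To illustrate the hazard with an arbitrary example (not a proposal): a unitless value is silently interpreted in the parameter's implicit unit, so the reader has to know that statement_timeout counts milliseconds:

    -- interpreted in the implicit unit (milliseconds for this parameter)
    SET statement_timeout = 60000;

    -- equivalent, but self-documenting
    SET statement_timeout = '60s';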
Re: [HACKERS] Immediate standby promotion
On Thu, Sep 25, 2014 at 1:30 PM, Andres Freund wrote: >> Yes it does cause data loss. The clueful operator has no idea where >> they are so there is no "used rightly" in that case. > > What? There definitely are cases where you don't need to know that to > the T. Just think of the - quite frequently happening - need to promote > a standby to run tests or reporting queries that can't be run on a > standby. > > Sure, you shouldn't use it if you expect a very specific set of the data > being there, but that's not always necessary. Very well put. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-09-25 10:25:24 -0700, Josh Berkus wrote: > If Heikki says it's ready, I'll test. So far he's said that it wasn't > done yet. http://www.postgresql.org/message-id/541c242e.3030...@vmware.com Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-09-25 10:29:51 -0700, Josh Berkus wrote: > On 09/25/2014 10:26 AM, Andres Freund wrote: > > On 2014-09-25 10:25:24 -0700, Josh Berkus wrote: > >> If Heikki says it's ready, I'll test. So far he's said that it wasn't > >> done yet. > > > > http://www.postgresql.org/message-id/541c242e.3030...@vmware.com > > Yeah, and that didn't include some of Tom's bug fixes apparently, per > the succeeding message. Which is why I asked Heikki if he was done, to > which he has not replied. Well, Heikki said he doesn't see any fixes in Tom's patch. But either way, this isn't anything that should prevent you from testing. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Immediate standby promotion
On 2014-09-25 18:18:09 +0100, Simon Riggs wrote: > On 25 September 2014 16:29, Andres Freund wrote: > > I think that's not really related. Such a promotion doesn't cause data > > loss in the sense of losing data a *clueful* operator wanted to > > keep. Yes, it can be used wrongly, but it's far from alone in that. > > Yes, it does cause data loss. The clueful operator has no idea where > they are, so there is no "used rightly" in that case. What? There definitely are cases where you don't need to know that to the T. Just think of the - quite frequently happening - need to promote a standby to run tests or reporting queries that can't be run on a standby. Sure, you shouldn't use it if you expect a very specific set of the data being there, but that's not always necessary. And that's why it should never, ever be the default. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/25/2014 10:26 AM, Andres Freund wrote: > On 2014-09-25 10:25:24 -0700, Josh Berkus wrote: >> If Heikki says it's ready, I'll test. So far he's said that it wasn't >> done yet. > > http://www.postgresql.org/message-id/541c242e.3030...@vmware.com Yeah, and that didn't include some of Tom's bug fixes apparently, per the succeeding message. Which is why I asked Heikki if he was done, to which he has not replied. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/25/2014 10:20 AM, Andres Freund wrote: > On 2014-09-25 10:18:24 -0700, Josh Berkus wrote: >> On 09/25/2014 10:14 AM, Bruce Momjian wrote: >>> On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote: But independent of which version is chosen, we *REALLY* need to make the decision soon. This issue has held up the next beta (like jsonb has blocked previous beta) for *weeks*. Personally it doesn't make me very happy that Heikki and Tom had to be the people stepping up to fix this. >>> >>> I think there are a few reasons this has been delayed, aside from the >>> scheduling ones: >>> >>> 1. compression issues were a surprise, and we are wondering if >>> there are any other surprises >>> 2. pg_upgrade makes future data format changes problematic >>> 3. 9.3 multi-xact bugs spooked us into being more careful >>> >>> I am not sure what we can do to increase our speed based on these items. >> >> Alternately, this is delayed because: >> >> 1. We have one tested patch to fix the issue. >> >> 2. However, people are convinced that there's a better patch possible. >> >> 3. But nobody is working on this better patch except "in their spare time". >> >> Given this, I once again vote for releasing based on Tom's lengths-only >> patch, which is done, tested, and ready to go. > > Heikki's patch is there and polished. If Heikki says it's ready, I'll test. So far he's said that it wasn't done yet. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] KNN-GiST with recheck
On Thu, Sep 25, 2014 at 9:00 PM, Emre Hasegeli wrote: > > Fixed, thanks. > > Here are my questions and comments about the code. > > doc/src/sgml/gist.sgml:812: > >be rechecked from heap tuple before tuple is returned. If > >recheck flag isn't set then it's true by default for > >compatibility reasons. The recheck flag can be used > only > > Recheck flag is set to false in gistget.c so I think it should say > "false by default". On the other hand, it is true by default on > the consistent function. It is written as "the safest assumption" > on the code comments. I don't know why the safest is chosen over > the backwards compatible for the consistent function. > Agreed. It should be clarified in the docs. src/backend/access/gist/gistget.c:505: > > /* Recheck distance from heap tuple if needed */ > > if (GISTSearchItemIsHeap(*item) && > > searchTreeItemNeedDistanceRecheck(scan, > so->curTreeItem)) > > { > > searchTreeItemDistanceRecheck(scan, > so->curTreeItem, item); > > continue; > > } > > Why is so->curTreeItem passed to these functions? They can use > scan->opaque->curTreeItem. > I didn't get the difference. A few lines before: GISTScanOpaque so = (GISTScanOpaque) scan->opaque; > src/backend/access/gist/gistscan.c:49: > > /* > >* When all distance values are the same, items without > recheck > >* can be immediately returned. So they are placed first. > >*/ > > if (recheckCmp == 0 && distance_a.recheck != > distance_b.recheck) > > recheckCmp = distance_a.recheck ? 1 : -1; > > I don't understand why items without recheck can be immediately > returned. Do you think it will work correctly when there is > an operator class which will return recheck true and false for > the items under the same page? > Yes, I believe so. An item with recheck can't decrease its distance; it can only increase it. In the corner case, an item can have the same distance after recheck as it had before. Then items whose distances are the same can anyway be returned in any order. > src/backend/access/index/indexam.c:258: > > /* Prepare data structures for getting original indexed values > from heap */ > > scan->indexInfo = BuildIndexInfo(scan->indexRelation); > > scan->estate = CreateExecutorState(); > > scan->slot = > MakeSingleTupleTableSlot(RelationGetDescr(heapRelation)); > > With the changes in indexam.c, heap access becomes legal for all index > access methods. I think it is better than the previous version but > I am leaving the judgement to someone experienced. I will try to > summarize the pros and cons of sorting the rows in the GiST access > method, as far as I understand. > > Pros: > > * It does not require another queue. It should be effective to sort > the rows inside the queue the GiST access method already has. > * It does not complicate index access method infrastructure. > > Cons: > > * It could be done without additional heap access. > * Other access methods could make use of the sorting infrastructure > one day. > * It could be more transparent to the users. Sorting information > could be shown on the explain output. > It would be also nice to show some information about KNN itself. > * A more suitable data structure like binary heap could be used > for the queue to sort the rows. > A binary heap seems to be a better data structure for the whole of KNN-GiST. But that's a subject for a separate patch: replace the RB-tree with a heap in KNN-GiST. It's not related to the recheck stuff. -- With best regards, Alexander Korotkov.
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-09-25 10:18:24 -0700, Josh Berkus wrote: > On 09/25/2014 10:14 AM, Bruce Momjian wrote: > > On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote: > >> But independent of which version is chosen, we *REALLY* need to make the > >> decision soon. This issue has held up the next beta (like jsonb has > >> blocked previous beta) for *weeks*. > >> > >> Personally it doesn't make me very happy that Heikki and Tom had to be > >> the people stepping up to fix this. > > > > I think there are a few reasons this has been delayed, aside from the > > scheduling ones: > > > > 1. compression issues were a surprise, and we are wondering if > > there are any other surprises > > 2. pg_upgrade makes future data format changes problematic > > 3. 9.3 multi-xact bugs spooked us into being more careful > > > > I am not sure what we can do to increase our speed based on these items. > > Alternately, this is delayed because: > > 1. We have one tested patch to fix the issue. > > 2. However, people are convinced that there's a better patch possible. > > 3. But nobody is working on this better patch except "in their spare time". > > Given this, I once again vote for releasing based on Tom's lengths-only > patch, which is done, tested, and ready to go. Heikki's patch is there and polished. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/25/2014 10:14 AM, Bruce Momjian wrote: > On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote: >> But independent of which version is chosen, we *REALLY* need to make the >> decision soon. This issue has held up the next beta (like jsonb has >> blocked previous beta) for *weeks*. >> >> Personally it doesn't make me very happy that Heikki and Tom had to be >> the people stepping up to fix this. > > I think there are a few reasons this has been delayed, aside from the > scheduling ones: > > 1. compression issues were a surprise, and we are wondering if > there are any other surprises > 2. pg_upgrade makes future data format changes problematic > 3. 9.3 multi-xact bugs spooked us into being more careful > > I am not sure what we can do to increase our speed based on these items. Alternately, this is delayed because: 1. We have one tested patch to fix the issue. 2. However, people are convinced that there's a better patch possible. 3. But nobody is working on this better patch except "in their spare time". Given this, I once again vote for releasing based on Tom's lengths-only patch, which is done, tested, and ready to go. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Immediate standby promotion
On 25 September 2014 16:29, Andres Freund wrote: >> > To me, being able to say "pg_ctl promote_right_now -m yes_i_mean_it" >> > seems like a friendlier interface than making somebody shut down the >> > server, run pg_resetxlog, and start it up again. >> >> It makes sense to go from paused --> promoted. >> >> It doesn't make sense to go from normal running --> promoted, since >> that is just random data loss. > > Why? I don't see what's random in promoting a node in the current state > *iff* it's currently consistent. > > Just imagine something like promoting a current standby to a full node > because you want to run some tests on it that require writes. There's > absolutely no need to investigate the current state for that. > >> I very much understand the case where >> somebody is shouting "get the web site up, we are losing business". >> Implementing a feature that allows people to do exactly what they >> asked (go live now), but loses business transactions that we thought >> had been safely recorded is not good. It implements only the exact >> request, not its actual intention. > > That seems to be a problem of massive misunderstanding on the part of the > user. And I don't see how this is going to be safer by requiring the > user to first issue a pause request. > > I think we should attempt to solve this by naming the command > appropriately. Something like 'abort_replay_and_promote'. Long, > nontrivial to type, and descriptive. > >> Any feature that lumps both cases together is wrongly designed and >> will cause data loss. >> >> We go to a lot of trouble to ensure data is successfully on disk and >> in WAL. I won't give that up, nor do I want to make it easier to lose >> data than it already is. > > I think that's not really related. Such a promotion doesn't cause data > loss in the sense of losing data a *clueful* operator wanted to > keep. Yes, it can be used wrongly, but it's far from alone in that. Yes, it does cause data loss. The clueful operator has no idea where they are, so there is no "used rightly" in that case. If I were to give this feature a name it would be --discard, --random-data-loss, or --reset-hard. The point of pausing is being misunderstood; it is close but not quite relevant. If you are at a known location and request promotion, we can presume you know what you are doing, so it is simply Promote. If you are at an unknown location and therefore have clearly not verified any state before promotion, you are clearly making an uninformed decision that will likely result in data loss, for which there is no way of knowing the impact and no mechanism for recovering from it. Trying to promote something while it is still recovering proves we don't know the state; we're just picking a random LSN. So if you have a time-delayed standby and the master breaks, or there is a bad transaction, then the correct action would be to
* pause the delayed standby
* discover where the master broke, or the xid of the bad transaction
* restart recovery to go up to the correct time/xid/lsn
* promote the standby
That is already possible in 9.4. The original patch for pausing contained code to reset the PITR target with functions, which would make the above even easier. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
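For readers following along, the delayed-standby procedure Simon outlines maps onto the existing 9.4 recovery-target machinery roughly as follows (a sketch only; the timestamp and data directory are hypothetical placeholders):

    # recovery.conf on the delayed standby
    standby_mode = 'on'
    recovery_target_time = '2014-09-25 10:00:00'  # just before the bad transaction
    pause_at_recovery_target = true               # stop and wait at the target

    # once replay has paused at the target and the state has been verified:
    pg_ctl promote -D /path/to/standby/data

recovery_target_xid works the same way when the xid of the bad transaction is known rather than its timestamp.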
Re: [HACKERS] jsonb format is pessimal for toast compression
On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote: > But independent of which version is chosen, we *REALLY* need to make the > decision soon. This issue has held up the next beta (like jsonb has > blocked previous beta) for *weeks*. > > Personally it doesn't make me very happy that Heikki and Tom had to be > the people stepping up to fix this. I think there are a few reasons this has been delayed, aside from the scheduling ones: 1. compression issues were a surprise, and we are wondering if there are any other surprises 2. pg_upgrade makes future data format changes problematic 3. 9.3 multi-xact bugs spooked us into being more careful I am not sure what we can do to increase our speed based on these items. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Wed, Sep 24, 2014 at 2:48 AM, Dilip kumar wrote: > On 24 August 2014 11:33, Amit Kapila Wrote > > > > Thanks for you comments, i have worked on both the review comment lists, > sent on 19 August, and 24 August. > > > > Latest patch is attached with the mail.. > Hi Dilip, I think you have an off-by-one error in the index into the array of file handles. vacuumdb runs at full CPU, and if I run: strace -ttt -T -f ../parallel_vac/bin/vacuumdb -z -a -j 8 I get the select returning immediately with a bad file descriptor error: 1411663937.641177 select(55, [4 5 6 7 8 9 10 54], NULL, NULL, NULL) = -1 EBADF (Bad file descriptor) <0.12> 1411663937.641232 recvfrom(3, 0x104e3f0, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.12> 1411663937.641279 recvfrom(4, 0x1034bc0, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.11> 1411663937.641326 recvfrom(5, 0x10017c0, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.14> 1411663937.641415 recvfrom(6, 0x10097e0, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.20> 1411663937.641487 recvfrom(7, 0x1012330, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.13> 1411663937.641538 recvfrom(8, 0x101af00, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.11> 1411663937.641584 recvfrom(9, 0x1023af0, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.12> 1411663937.641631 recvfrom(10, 0x1054600, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.12> Cheers, Jeff
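For anyone digging into Jeff's report, the shape a correct select() setup usually takes is roughly the following (a sketch only; pSlot and concurrentCons follow the patch's naming from upthread, and error handling is elided). The first argument to select() must be one greater than the highest descriptor actually in the set, and the loop bound must not run one slot past the end of the array:

    fd_set  rmask;
    int     maxfd = -1;
    int     i;

    FD_ZERO(&rmask);
    for (i = 0; i < concurrentCons; i++)   /* "<=" here would index past the array */
    {
        int     sock = PQsocket(pSlot[i].connection);

        FD_SET(sock, &rmask);
        if (sock > maxfd)
            maxfd = sock;
    }
    if (select(maxfd + 1, &rmask, NULL, NULL, NULL) < 0)
        ;   /* handle EINTR, or report the error */

A stale or out-of-range descriptor sneaking into the set, as an off-by-one slot index would produce, is exactly what yields the EBADF in the strace output above.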
Re: [HACKERS] KNN-GiST with recheck
> Fixed, thanks. Here are my questions and comments about the code. doc/src/sgml/gist.sgml:812: >be rechecked from heap tuple before tuple is returned. If >recheck flag isn't set then it's true by default for >compatibility reasons. The recheck flag can be used only The recheck flag is set to false in gistget.c, so I think it should say "false by default". On the other hand, it is true by default on the consistent function. It is written as "the safest assumption" in the code comments. I don't know why the safest was chosen over the backwards-compatible for the consistent function. src/backend/access/gist/gistget.c:505: > /* Recheck distance from heap tuple if needed */ > if (GISTSearchItemIsHeap(*item) && > searchTreeItemNeedDistanceRecheck(scan, > so->curTreeItem)) > { > searchTreeItemDistanceRecheck(scan, > so->curTreeItem, item); > continue; > } Why is so->curTreeItem passed to these functions? They can use scan->opaque->curTreeItem. src/backend/access/gist/gistscan.c:49: > /* >* When all distance values are the same, items without recheck >* can be immediately returned. So they are placed first. >*/ > if (recheckCmp == 0 && distance_a.recheck != distance_b.recheck) > recheckCmp = distance_a.recheck ? 1 : -1; I don't understand why items without recheck can be immediately returned. Do you think it will work correctly when there is an operator class which will return recheck true and false for the items under the same page? src/backend/access/index/indexam.c:258: > /* Prepare data structures for getting original indexed values from > heap */ > scan->indexInfo = BuildIndexInfo(scan->indexRelation); > scan->estate = CreateExecutorState(); > scan->slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation)); With the changes in indexam.c, heap access becomes legal for all index access methods. I think it is better than the previous version but I am leaving the judgement to someone experienced. I will try to summarize the pros and cons of sorting the rows in the GiST access method, as far as I understand them.

Pros:
* It does not require another queue. It should be effective to sort the rows inside the queue the GiST access method already has.
* It does not complicate the index access method infrastructure.

Cons:
* It could be done without additional heap access.
* Other access methods could make use of the sorting infrastructure one day.
* It could be more transparent to the users. Sorting information could be shown in the explain output.
* A more suitable data structure like a binary heap could be used for the queue to sort the rows.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/25/2014 09:01 AM, Andres Freund wrote: > But independent of which version is chosen, we *REALLY* need to make the > decision soon. This issue has held up the next beta (like jsonb has > blocked previous beta) for *weeks*. Yes, please! -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Wed, Sep 24, 2014 at 7:04 PM, Peter Geoghegan wrote: > On Fri, Sep 19, 2014 at 2:54 PM, Peter Geoghegan wrote: >> Probably not - it appears to make very little difference to >> unoptimized pass-by-reference types whether or not datum1 can be used >> (see my simulation of Kevin's worst case, for example [1]). Streaming >> through a not inconsiderable proportion of memtuples again is probably >> a lot worse. The datum1 optimization (which is not all that old) made >> a lot of sense when initially introduced, because it avoided chasing >> through a pointer for pass-by-value types. I think that's its sole >> justification, though. > > Just to be clear -- I am blocked on this. Do you really prefer to > restart copying heap tuples from scratch in the event of aborting, > just to make sure that the datum1 representation is consistently > either a pointer to text, or an abbreviated key? I don't think that > the costs involved make that worth it, as I've said, but I'm not sure > how to resolve that controversy. > > I suggest that we focus on making sure the abort logic itself is > sound. There probably hasn't been enough discussion of that. Once that > is resolved, we can revisit the question of whether or not copying > should restart to keep the datum1 representation consistent. I suspect > that leaving that until later will be easier all around. The top issue on my agenda is figuring out a way to get rid of the extra SortSupport object. I'm not going to commit any version of this patch that uses a second SortSupport for the tiebreak. I doubt anyone else will like that either, but you can try. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
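For readers not steeped in the patch, the general abbreviated-key pattern being argued over looks roughly like this (a sketch of the technique only, not the patch's actual API; compare_full_values is a hypothetical helper standing in for whatever the tiebreak mechanism ends up being, which is precisely the point of contention):

    /*
     * datum1 normally caches the leading sort key; with abbreviation it
     * instead holds a compact encoding chosen so that plain unsigned
     * Datum comparison orders values correctly.
     */
    static int
    compare_leading_key(const SortTuple *a, const SortTuple *b, SortSupport ssup)
    {
        if (a->datum1 < b->datum1)
            return -1;
        if (a->datum1 > b->datum1)
            return 1;

        /*
         * Equal abbreviated keys prove nothing: resolve the tie with an
         * authoritative comparison of the full values stored in the
         * tuples themselves.  Whether that comparator lives in a second
         * SortSupport object or somewhere else is the open question.
         */
        return compare_full_values(a->tuple, b->tuple, ssup);
    }

The "consistent datum1 representation" debate is about whether memtuples may temporarily hold a mix of abbreviated keys and plain pointers after an abort of abbreviation, or whether copying must restart so the array is uniform.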
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Thu, Sep 25, 2014 at 11:21 AM, Simon Riggs wrote: > On 25 September 2014 15:35, Robert Haas wrote: >>> The only problem I see is if the newly inserted row matches one row on >>> one unique value and a different row on a different unique index. >>> Turning the INSERT into an UPDATE will still fail on one or the other, no >>> matter which index you pick. If there is one row for ALL unique >>> indexes then it is irrelevant which index you pick. So either way, I >>> cannot see a reason to specify an index. >> >> Failure could be the right thing in some cases. For example, imagine >> that a user has a table containing names, email addresses, and (with >> apologies for the American-ism, but I don't know what would be >> comparable elsewhere) social security numbers. The user has unique >> indexes on both email addresses and SSNs. If a new record arrives for >> the same email address, they want to replace the existing record; but >> if a new record arrives with the same SSN, they want the transaction to >> fail. Otherwise, a newly-arrived record might overwrite the email >> address of an existing record, which they never want to do, because >> they view email address as the primary key. > > I agree with your example, but not your conclusion. > > If a new record arrives with a new email address that matches an > existing record, it will fail. There is a case that would be allowed, > which would be a record that creates an entirely new email address. So > you do have a point to argue from. > > However, IMV enforcing such a restriction should be done with an After > trigger, which is already possible, not by complicating a DML > statement with information it shouldn't need to know, or that might > change in the future. I've never been a fan of putting the index name in there. I agree that's stuff that a DML statement shouldn't need to know about. What I've advocated for in the past is specifying the list of columns that should be used to determine whether to insert or update. If you have a match on those columns, update the row; else insert. Any other unique indexes stand or fall as the case may be. I still think that idea has merit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
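A sketch of the column-list form Robert is advocating, applied to his own example (the syntax is hypothetical at this point; nothing like it is committed):

    -- unique indexes exist on both email and ssn
    INSERT INTO person (email, ssn, name)
    VALUES ('a@example.com', '123-45-6789', 'Alice')
    ON CONFLICT (email) UPDATE SET name = 'Alice';

A match on email updates the existing row; a clash on ssn alone would still raise a unique violation, which is exactly the failure behavior the example calls for.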
Re: [HACKERS] Immediate standby promotion
On Thu, Sep 25, 2014 at 11:34 AM, Andres Freund wrote: > On 2014-09-25 11:13:50 -0400, Robert Haas wrote: >> On Wed, Sep 24, 2014 at 4:36 PM, Simon Riggs wrote: >> >> To me, being able to say "pg_ctl promote_right_now -m yes_i_mean_it" >> >> seems like a friendlier interface than making somebody shut down the >> >> server, run pg_resetxlog, and start it up again. >> > >> > It makes sense to go from paused --> promoted. >> >> Well, we could certainly require that people call >> pg_xlog_replay_pause() before they call pg_promote_now(). I don't >> think I like that better, but it beats not having pg_promote_now(). > > FWIW I think it's a mistake to only allow this on a hot standby. This > should be doable using pg_ctl alone. a) works with wal_level=archive, b) > sometimes hot_standby=off is a good bit more efficient c) sometimes you > don't want to allow connections. Good point. Also, a pg_ctl command is more friendly to cluster-ware, of which there is a lot these days. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RLS Design
On 25 September 2014 15:26, Stephen Frost wrote: >> I expected this to still trigger an error due to the first policy. Am >> I to infer from this that the policy model is permissive rather than >> restrictive? > > That's correct and I believe pretty clear in the documentation: policies > are OR'd together, just the same as how roles are handled. As a > logged-in user, you have the rights of all of the roles you are a member > of (subject to inheritance rules, of course), and similarly, you are > able to view and add all rows which match any policy which applies to > you (either through role membership or through different policies). Okay, I see now. This is a mindset issue for me, as I'm looking at them like constraints rather than permissions. Thanks for the explanation. Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
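A compact illustration of the OR'd (permissive) behavior, using the colours table and the two policies quoted earlier in the thread (a sketch, restating the example rather than adding anything new):

    -- policy 1 (for joe):  FOR ALL USING (visible = true)
    --                      WITH CHECK (visible = true)
    -- policy 2 (for all):  FOR INSERT
    --                      WITH CHECK (name NOT IN ('grey','gray'))

    INSERT INTO colours (name, visible) VALUES ('pink', false);
    -- allowed: the row fails policy 1's check but passes policy 2's,
    -- and the checks are OR'd together, not AND'd

So a row is insertable if it satisfies at least one applicable policy, just as a user holds the union of the privileges of all their roles.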
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-09-19 15:40:14 +0300, Heikki Linnakangas wrote: > On 09/18/2014 09:27 PM, Heikki Linnakangas wrote: > >I'll try to write a more polished patch tomorrow. We'll then see what it > >looks like, and can decide if we want it. > > Ok, here are two patches. One is a refined version of my earlier patch, and > the other implements the separate offsets array approach. They are both > based on Tom's jsonb-lengths-merged.patch, so they include all the > whitespace fixes etc. he mentioned. > > There is no big difference in terms of code complexity between the patches. > IMHO the separate offsets array is easier to understand, but it makes for > more complicated accessor macros to find the beginning of the > variable-length data. I personally am pretty clearly in favor of Heikki's version. I think it could stand to slightly expand the reasoning behind the mixed length/offset format; it's not immediately obvious why the offsets are problematic for compression. Otherwise, based on a cursory look, it looks good. But independent of which version is chosen, we *REALLY* need to make the decision soon. This issue has held up the next beta (like jsonb has blocked previous beta) for *weeks*. Personally it doesn't make me very happy that Heikki and Tom had to be the people stepping up to fix this. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
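To expand slightly on why an all-offsets format is problematic for compression, consider an array of equal-length elements (made-up numbers for illustration):

    element sizes:  5  5  5  5  5 ...
    lengths array:  5  5  5  5  5 ...   <- highly repetitive, pglz finds long matches
    offsets array:  5 10 15 20 25 ...   <- strictly increasing, almost nothing repeats

pglz can only exploit repeated byte runs, so monotonically growing offsets defeat it even when the underlying data is extremely regular; trading some of the offsets for lengths is the balance the mixed format is trying to strike.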
Re: [HACKERS] delta relations in AFTER triggers
Heikki Linnakangas wrote: > You cast the TuplestoreRelation to Plan, and pass it to CopyPlanFields. > That will crash, because TuplestoreRelation is nothing like a Plan: Oops. That's a copy/paste error I should have noticed. Fixed, even though the node type might be going away. Since all of this seems to be working very well from a user point of view, I'm going to try to generate a lot more regression tests against the existing code before taking another run at the API, to make sure that things don't break in the refactoring. I didn't hit the copy/out bugs in testing so far -- any suggestions on a test that would exercise this code? (I'm probably missing something obvious.) Craig Ringer wrote: > On 09/15/2014 10:25 PM, Kevin Grittner wrote: > >> I broke out the changes from the previous patch in multiple commits >> in my repository on github: > > *Thankyou* > A nice patch series published in a git repo is so much easier to work > with than a giant squashed patch as an attachment. I have fixed the bug reported by Heikki; be sure to grab that. I have been merging in changes to master as I go, so that bit rot doesn't accumulate, but I don't squash or rebase; hopefully that style works for you. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Immediate standby promotion
On 2014-09-25 11:13:50 -0400, Robert Haas wrote: > On Wed, Sep 24, 2014 at 4:36 PM, Simon Riggs wrote: > >> To me, being able to say "pg_ctl promote_right_now -m yes_i_mean_it" > >> seems like a friendlier interface than making somebody shut down the > >> server, run pg_resetxlog, and start it up again. > > > > It makes sense to go from paused --> promoted. > > Well, we could certainly require that people call > pg_xlog_replay_pause() before they call pg_promote_now(). I don't > think I like that better, but it beats not having pg_promote_now(). FWIW I think it's a mistake to only allow this on a hot standby. This should be doable using pg_ctl alone. a) works with wal_level=archive, b) sometimes hot_standby=off is a good bit more efficient c) sometimes you don't want to allow connections. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Immediate standby promotion
On 2014-09-24 21:36:50 +0100, Simon Riggs wrote: > On 18 September 2014 01:22, Robert Haas wrote: > > >> "fast" promotion was actually a supported option in r8 of Postgres but > >> this option was removed when we implemented streaming replication in > >> r9.0 > >> > >> The *rough* requirement is sane, but that's not the same thing as > >> saying this exact patch makes sense. > > > > Granted. Fair point. > > > >> If you are paused and you can see that WAL up ahead is damaged, then > >> YES, you do want to avoid applying it. That is possible by setting a > >> PITR target so that recovery stops at a precise location specified by > >> you. As an existing option is it better than the blunt force trauma > >> suggested here. > > > > You can pause at a recovery target, but then what if you want to go > > read/write at that point? Or what if you've got a time-delayed > > standby and you want to break replication so that it doesn't replay > > the DROP TABLE students that somebody ran on the master? It doesn't > > have to be that WAL is unreadable or corrupt; it's enough for it to > > contain changes you wish to avoid replaying. > > > >> If you really don't care, just shutdown server, resetxlog and start > >> her up - again, no need for new option. I think that should pretty much never be something an admin has to run. It's just about impossible to get this right. In all likelihood just running pg_resetxlog on a database in recovery will have corrupted your database. Which is why pg_resetxlog won't even let you proceed without using -f because it checks for DB_SHUTDOWNED. Rightly so. pg_resetxlog *removes* *all* existing WAL and sets the current control file state to DB_SHUTDOWNED. Thus there will be no recovery when starting afterwards. > > To me, being able to say "pg_ctl promote_right_now -m yes_i_mean_it" > > seems like a friendlier interface than making somebody shut down the > > server, run pg_resetxlog, and start it up again. > > It makes sense to go from paused --> promoted. > > It doesn't make sense to go from normal running --> promoted, since > that is just random data loss. Why? I don't see what's random in promoting a node in the current state *iff* it's currently consistent. Just imagine something like promoting a current standby to a full node because you want to run some tests on it that require writes. There's absolutely no need to investigate the current state for that. > I very much understand the case where > somebody is shouting "get the web site up, we are losing business". > Implementing a feature that allows people to do exactly what they > asked (go live now), but loses business transactions that we thought > had been safely recorded is not good. It implements only the exact > request, not its actual intention. That seems to be a problem of massive misunderstanding on the part of the user. And I don't see how this is going to be safer by requiring the user to first issue a pause request. I think we should attempt to solve this by naming the command appropriately. Something like 'abort_replay_and_promote'. Long, nontrivial to type, and descriptive. > Any feature that lumps both cases together is wrongly designed and > will cause data loss. > > We go to a lot of trouble to ensure data is successfully on disk and > in WAL. I won't give that up, nor do I want to make it easier to lose > data than it already is. I think that's not really related. Such a promotion doesn't cause data loss in the sense of losing data a *clueful* operator wanted to keep. 
Yes, it can be used wrongly, but it's far from alone in that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: rounding up time value less than its unit.
* Robert Haas (robertmh...@gmail.com) wrote: > On Thu, Sep 25, 2014 at 12:46 AM, Tom Lane wrote: > > TBH I've also been wondering whether any of these proposed cures are > > better than the disease. > > I couldn't agree more. There's something to be said for just leaving > this alone. I've been coming around to this also. I had thought earlier that there was consensus happening, but clearly that's not the case. > > The changes that can be argued to make the > > behavior more sane are also ones that introduce backwards compatibility > > issues of one magnitude or another. > > But on this point I think David Johnston said it best: > > # Any change has the potential to draw complaints. For you it seems that "hey, > # I upgraded to 9.5 and my logs are being rotated out every minute now. I > # thought I had that turned off" is the desired complaint. Greg wants: "hey, my > # 1 hour log rotation is now happening every minute". If the error message is > # written correctly most people upon seeing the error will simply fix their > # configuration and move on - regardless of whether they were proactive in > # doing so having read the release notes. > > I particularly agree with his first sentence - any change can > potentially draw complaints. But I also agree with his last one - of > those three possible complaints, I certainly prefer "I had to fix my > configuration file for the new, stricter validation" over any variant > of "my configuration file still worked but it did something > surprisingly different from what it used to do." I'll agree with this also (which is why I had suggested moving forward with the idea that I thought had consensus: keep things the way they are, but toss an error if we round down a non-zero value to zero). As with Tom, I'm not against being argued to a different position, such as rounding up instead of down, but I still don't like the "near-zero goes to zero" situation we currently have. I'd be much happier if we'd pick one or the other and move forward with it, or agree that we can't reach a consensus and leave well enough alone. Not entirely sure what the best way to get to one of the above is, but I don't feel like we're really making much more progress at this point. Thanks, Stephen
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On 25 September 2014 15:35, Robert Haas wrote: >> The only problem I see is if the newly inserted row matches one row on >> one unique value and a different row on a different unique index. >> Turning the INSERT into an UPDATE will still fail on one or the other, no >> matter which index you pick. If there is one row for ALL unique >> indexes then it is irrelevant which index you pick. So either way, I >> cannot see a reason to specify an index. > > Failure could be the right thing in some cases. For example, imagine > that a user has a table containing names, email addresses, and (with > apologies for the American-ism, but I don't know what would be > comparable elsewhere) social security numbers. The user has unique > indexes on both email addresses and SSNs. If a new record arrives for > the same email address, they want to replace the existing record; but > if a new record arrives with the same SSN, they want the transaction to > fail. Otherwise, a newly-arrived record might overwrite the email > address of an existing record, which they never want to do, because > they view email address as the primary key. I agree with your example, but not your conclusion. If a new record arrives with a new email address that matches an existing record, it will fail. There is a case that would be allowed, which would be a record that creates an entirely new email address. So you do have a point to argue from. However, IMV enforcing such a restriction should be done with an After trigger, which is already possible, not by complicating a DML statement with information it shouldn't need to know, or that might change in the future. Let's keep this new feature as simple as possible. ORMs everywhere need to be encouraged to implement this and they won't do it unless it is bone simple to use. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Immediate standby promotion
On Wed, Sep 24, 2014 at 4:36 PM, Simon Riggs wrote: >> To me, being able to say "pg_ctl promote_right_now -m yes_i_mean_it" >> seems like a friendlier interface than making somebody shut down the >> server, run pg_resetxlog, and start it up again. > > It makes sense to go from paused --> promoted. Well, we could certainly require that people call pg_xlog_replay_pause() before they call pg_promote_now(). I don't think I like that better, but it beats not having pg_promote_now(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: rounding up time value less than its unit.
On Thu, Sep 25, 2014 at 12:46 AM, Tom Lane wrote: > TBH I've also been wondering whether any of these proposed cures are > better than the disease. I couldn't agree more. There's something to be said for just leaving this alone. > The changes that can be argued to make the > behavior more sane are also ones that introduce backwards compatibility > issues of one magnitude or another. But on this point I think David Johnston said it best: # Any change has the potential to draw complaints. For you it seems that "hey, # I upgraded to 9.5 and my logs are being rotated out every minute now. I # thought I had that turned off" is the desired complaint. Greg wants: "hey, my # 1 hour log rotation is now happening every minute". If the error message is # written correctly most people upon seeing the error will simply fix their # configuration and move on - regardless of whether they were proactive in # doing so having read the release notes. I particularly agree with his first sentence - any change can potentially draw complaints. But I also agree with his last one - of those three possible complaints, I certainly prefer "I had to fix my configuration file for the new, stricter validation" over any variant of "my configuration file still worked but it did something surprisingly different from what it used to do." YMMV. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
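To ground the log-rotation example from the quoted text, the surprising case looks like this (a sketch; log_rotation_age is, if I have the units right, measured in minutes):

    # postgresql.conf
    log_rotation_age = 30s   # less than one unit (1min)
    # today this truncates to 0 -- and 0 means rotation is disabled,
    # which is the surprise under debate; the competing proposals are
    # to round such values up to 1min instead, or to reject the setting
    # outright with an error
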
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-25 09:34:57 -0500, Merlin Moncure wrote: > On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund wrote: > >> Why stop at 128 mapping locks? Theoretical downsides to having more > >> mapping locks have been mentioned a few times but has this ever been > >> measured? I'm starting to wonder if the # mapping locks should be > >> dependent on some other value, perhaps the # of shared buffers... > > > > Wrong way round. You need to prove the upside of increasing it further, > > not the contrary. The primary downside is cache hit ratio and displacing > > other cache entries... > > I can't do that because I don't have the hardware. One interesting part of this is making sure it doesn't regress older/smaller machines. So at least that side you could check... > what's wrong with trying it out? If somebody is willing to do it: nothing. I'd just much rather do the, by now proven, simple change before starting with more complex solutions. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund wrote: > On 2014-09-25 10:22:47 -0400, Robert Haas wrote: >> On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund >> wrote: >> > That leads me to wonder: Have you measured different, lower, numbers of >> > buffer mapping locks? 128 locks is, if we align them properly as we >> > should, 8KB of memory. Common L1 cache sizes are around 32k... >> >> Amit has some results upthread showing 64 being good, but not as good >> as 128. I haven't verified that myself, but have no reason to doubt >> it. > > How about you push the spinlock change and I crosscheck the partition > number on a multi-socket x86 machine? Seems worthwhile to make sure that > it doesn't cause problems on x86. I seriously doubt it'll, but ... OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund wrote: >> Why stop at 128 mapping locks? Theoretical downsides to having more >> mapping locks have been mentioned a few times but has this ever been >> measured? I'm starting to wonder if the # mapping locks should be >> dependent on some other value, perhaps the # of shared buffers... > > Wrong way round. You need to prove the upside of increasing it further, > not the contrary. The primary downside is cache hit ratio and displacing > other cache entries... I can't do that because I don't have the hardware. I wasn't suggesting to just set it but to measure the effects of setting it. But the benefits from going from 16 to 128 are pretty significant at least on this hardware; I'm curious how much further it can be pushed...what's wrong with trying it out? merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-25 10:09:30 -0400, Robert Haas wrote: > I think the long-term solution here is that we need a lock-free hash > table implementation for our buffer mapping tables, because I'm pretty > sure that just cranking the number of locks up and up is going to > start to have unpleasant side effects at some point. We may be able > to buy a few more years by just cranking it up, though. I think mid to long term we actually need something other than a hashtable: something capable of efficiently checking for the existence of 'neighboring' buffers, so we can intelligently prefetch far enough ahead that the read actually completes when we get there. Also, I'm pretty sure that we'll need a way to efficiently remove all buffers for a relfilenode from shared buffers - linearly scanning for that isn't a good solution. So I think we need a different data structure. I've played around a bit with replacing buf_table.c with a custom hand-rolled hashtable, because I've seen more than one production workload where hash_search_with_hash_value() is the #1 entry in profiles, in both CPU time and cache misses, with most calls coming from the buffer mapping and then from the lock manager. There are two reasons for that: a) dynahash just isn't very good, and it does a lot of things that will never be necessary for these hashes; b) the key into the hash table is *far* too wide. A significant portion of the time is spent comparing buffer/lock tags. The aforementioned replacement hash table was a good bit faster for fully cached workloads - but at the time I wrote it, I could still make it crash in very high cache pressure workloads, so that should be taken with a fair bit of salt. I think we can comparatively easily get rid of the tablespace in buffer tags. Getting rid of the database would already be a fair bit harder. I haven't really come up with an idea for how to remove the fork number without making the catalog much more complicated. I don't think we can go too long without at least some of these steps :(. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
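For reference, the "far too wide" key in question is the buffer tag: five 4-byte fields, all of which must be hashed on every lookup and compared on every probe (lightly abridged from storage/relfilenode.h and storage/buf_internals.h):

    typedef struct RelFileNode
    {
        Oid     spcNode;    /* tablespace */
        Oid     dbNode;     /* database */
        Oid     relNode;    /* relation */
    } RelFileNode;

    typedef struct buftag
    {
        RelFileNode rnode;      /* physical relation identifier */
        ForkNumber  forkNum;    /* main/fsm/vm/init fork */
        BlockNumber blockNum;   /* block number within the fork */
    } BufferTag;

Dropping the tablespace (and, harder, the database and fork number) from the tag is what would shrink both the hashing and the comparison cost Andres describes.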
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Thu, Sep 25, 2014 at 10:12 AM, Simon Riggs wrote: > IMHO it is impossible to know if any of the other code is correct > until we have a clear and stable vision of what the command is > supposed to perform. +1. > The inner workings are less important than what the feature does. +1. > FWIW, the row available at the end of all BEFORE triggers is clearly > the object we should be manipulating, not the original VALUES() > clause. Otherwise this type of INSERT would behave differently from > normal INSERTs. Which would likely violate RLS, if nothing else. +1. > Surely if there are multiple unique indexes then the result row must > be validated against all unique indexes before it is allowed at all? > > The only problem I see is if the newly inserted row matches one row on > one unique value and a different row on a different unique index. > Turning the INSERT into an UPDATE will still fail on one or the other, no > matter which index you pick. If there is one row for ALL unique > indexes then it is irrelevant which index you pick. So either way, I > cannot see a reason to specify an index. Failure could be the right thing in some cases. For example, imagine that a user has a table containing names, email addresses, and (with apologies for the American-ism, but I don't know what would be comparable elsewhere) social security numbers. The user has unique indexes on both email addresses and SSNs. If a new record arrives for the same email address, they want to replace the existing record; but if a new record arrives with the same SSN, they want the transaction to fail. Otherwise, a newly-arrived record might overwrite the email address of an existing record, which they never want to do, because they view email address as the primary key. I think this kind of scenario will actually be pretty common. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index scan optimization
On 22 September 2014 14:46, Heikki Linnakangas wrote: > On 09/22/2014 04:45 PM, Tom Lane wrote: >> >> Heikki Linnakangas writes: >>> >>> On 09/22/2014 07:47 AM, Rajeev rastogi wrote: So my proposal is to skip the condition check on the first scan key condition for every tuple. >> >> >>> The same happens in a single-column case. If you have a query like >>> "SELECT * FROM tbl2 where id2 > 'a'", once you've found the start >>> position of the scan, you know that all the rows that follow match too. >> >> >> ... unless you're doing a backwards scan. > > > Sure. And you have to still check for NULLs. Have to get the details right.. And also that the tests don't use volatile functions. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RLS Design
Thom, * Thom Brown (t...@linux.com) wrote: > I find it a bit of a limitation that I can't specify both INSERT and > UPDATE for a policy. I'd want to be able to specify something like > this: > > CREATE POLICY no_greys_allowed > ON colours > FOR INSERT, UPDATE > WITH CHECK (name NOT IN ('grey','gray')); > > I would expect this to be rather common to prevent certain values > making their way into a table. Instead I'd have to create 2 policies > as it stands. That's not actually the case... CREATE POLICY no_greys_allowed ON colours FOR ALL USING (true) -- assuming this is what you intended WITH CHECK (name NOT IN ('grey','gray')); Right? That said, I'm not against the idea of supporting multiple commands with one policy (similar to how ALL is done). It wouldn't be difficult or much of a change: make the 'cmd' a bitfield instead. If others feel the same then I'll look at doing that. > In order to debug issues with accessing table data, perhaps it would > be useful to output the name of the policy that was violated. If a > table had 20 policies on, it could become time-consuming to debug. Good point. That'll involve a bit more as I'll need to look at the existing with check options structure, but I believe it's just adding the field to the structure, populating it when adding the WCO entries, and then checking for it in the ereport() call. The policy name is already stashed in the relcache entry, so it's already pretty easily available. > I keep getting tripped up by overlapping policies. On the one hand, I > created a policy to ensure rows being added or selected have a > "visible" column set to true. On the other hand, I have a policy that > ensures that the name of a colour doesn't appear in a list. Policy 1 > is violated until policy 2 is added: > > (using the table I created in a previous post on this thread...) > > # create policy must_be_visible ON colours for all to joe using > (visible = true) with check (visible = true); > CREATE POLICY > > \c - joe > > > insert into colours (name, visible) values ('pink',false); > ERROR: new row violates WITH CHECK OPTION for "colours" > DETAIL: Failing row contains (28, pink, f). > > \c - thom > > # create policy no_greys_allowed on colours for insert with check > (name not in ('grey','gray')); > CREATE POLICY > > \c - joe > > # insert into colours (name, visible) values ('pink',false); > INSERT 0 1 > > I expected this to still trigger an error due to the first policy. Am > I to infer from this that the policy model is permissive rather than > restrictive? That's correct and I believe pretty clear in the documentation: policies are OR'd together, just the same as how roles are handled. As a logged-in user, you have the rights of all of the roles you are a member of (subject to inheritance rules, of course), and similarly, you are able to view and add all rows which match any policy which applies to you (either through role membership or through different policies). > I've also attached a few corrections for the docs. Thanks! I'll plan to include these with a few other typos and the fix for the bug that Andres pointed out, once I finish testing (and doing another CLOBBER_CACHE_ALWAYS run..). Thanks again, Stephen
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-25 10:22:47 -0400, Robert Haas wrote: > On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund > wrote: > > That leads me to wonder: Have you measured different, lower, numbers of > > buffer mapping locks? 128 locks is, if we align them properly as we > > should, 8KB of memory. Common L1 cache sizes are around 32k... > > Amit has some results upthread showing 64 being good, but not as good > as 128. I haven't verified that myself, but have no reason to doubt > it. How about you push the spinlock change and I crosscheck the partition number on a multi-socket x86 machine? Seems worthwhile to make sure that it doesn't cause problems on x86. I seriously doubt it'll, but ... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund wrote: > That leads me to wonder: Have you measured different, lower, numbers of > buffer mapping locks? 128 locks is, if we align them properly as we > should, 8KB of memory. Common L1 cache sizes are around 32k... Amit has some results upthread showing 64 being good, but not as good as 128. I haven't verified that myself, but have no reason to doubt it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-25 09:02:25 -0500, Merlin Moncure wrote: > On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas wrote: > > 1. To see the effect of reduce-replacement-locking.patch, compare the > > first TPS number in each line to the third, or the second to the > > fourth. At scale factor 1000, the patch wins in all of the cases with > > 32 or more clients and exactly half of the cases with 1, 8, or 16 > > clients. The variations at low client counts are quite small, and the > > patch isn't expected to do much at low concurrency levels, so that's > > probably just random variation. At scale factor 3000, the situation > > is more complicated. With only 16 bufmappinglocks, the patch gets its > > biggest win at 48 clients, and by 96 clients it's actually losing to > > unpatched master. But with 128 bufmappinglocks, it wins - often > > massively - on everything but the single-client test, which is a small > > loss, hopefully within experimental variation. > > > > Comments? > > Why stop at 128 mapping locks? Theoretical downsides to having more > mapping locks have been mentioned a few times but has this ever been > measured? I'm starting to wonder if the # mapping locks should be > dependent on some other value, perhaps the # of shared buffers... Wrong way round. You need to prove the upside of increasing it further, not the contrary. The primary downside is cache hit ratio and displacing other cache entries... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-25 09:51:17 -0400, Robert Haas wrote: > On Tue, Sep 23, 2014 at 5:50 PM, Robert Haas wrote: > > The patch I attached the first time was just the last commit in the > > git repository where I wrote the patch, rather than the changes that I > > made on top of that commit. So, yes, the results from the previous > > message are with the patch attached to the follow-up. I just typed > > the wrong git command when attempting to extract that patch to attach > > it to the email. > > Here are some more results. TL;DR: The patch still looks good, but we > should raise the number of buffer mapping partitions as well. > So I'm inclined to (a) push > reduce-replacement-locking.patch and then also (b) bump up the number > of buffer mapping locks to 128 (and increase MAX_SIMUL_LWLOCKS > accordingly so that pg_buffercache doesn't get unhappy). I'm happy with that. I don't think it's likely that a moderate increase in the number of mapping LWLocks will be noticeably bad for any workload. One difference is that the total number of LWLock acquisitions will be a bit higher because currently it's more likely for the old and new to fall into different partitions. But that's not really significant. The other difference is the number of cachelines touched. Currently, in concurrent workloads, there are already lots of L1 cache misses around the buffer mapping locks because they're exclusively owned by a different core/socket. So, to be effectively worse, the increase would need to lead to lower overall cache hit rates by them not being in *any* cache or displacing other content. That leads me to wonder: Have you measured different, lower, numbers of buffer mapping locks? 128 locks is, if we align them properly as we should, 8KB of memory. Common L1 cache sizes are around 32k... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inefficient barriers on solaris with sun cc
On Thu, Sep 25, 2014 at 9:34 AM, Andres Freund wrote: > Binaries compiled on solaris using sun studio cc currently don't have > compiler and memory barriers implemented. That means we fall back to > relatively slow generic implementations for those. Especially compiler, > read, write barriers will be much slower than necessary (since they all > just need to prevent compiler reordering as both sparc and x86 are run > in TSO mode under solaris). > > Since my estimate is that we'll use more and more barriers, that's going > to hurt more and more. > > I do *not* plan to do anything about it atm, I just thought it might be > helpful to have this stated somewhere searchable. To put that another way: If there are any Sun Studio users out there who care about performance on big iron, please send a patch to fix this... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
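A sketch of what such a patch might contain, for anyone motivated (the mbarrier.h intrinsics are, as far as I know, available from Sun Studio 12 onwards, but treat the header and the exact names as assumptions to verify):

    #include <mbarrier.h>

    /* full fence */
    #define pg_memory_barrier()    __machine_rw_barrier()

    /* under TSO (sparc and x86 on Solaris), read and write barriers only
     * need to stop compiler reordering, per the observation above */
    #define pg_read_barrier()      __compiler_barrier()
    #define pg_write_barrier()     __compiler_barrier()
    #define pg_compiler_barrier()  __compiler_barrier()

That would replace the slow generic fallbacks with what are, for the read/write cases, zero-instruction barriers.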
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On 28 August 2014 03:43, Peter Geoghegan wrote: > The patch currently lacks a way of referencing datums rejected for > insertion when updating. The way MySQL handles the issue seems > questionable. They allow you to do something like this: > > INSERT INTO upsert (key, val) VALUES (1, 'val') ON DUPLICATE KEY UPDATE > val = VALUES(val); > > The implication is that the updated value comes from the INSERT's > VALUES() list, but emulating that seems like a bad idea. In general, > at least with Postgres it's entirely possible that values rejected > differ from the values appearing in the VALUES() list, due to the > effects of before triggers. I'm not sure whether or not we should > assume equivalent transformations during any UPDATE before triggers. > > This is an open item. I think it makes sense to deal with it a bit later. IMHO it is impossible to know if any of the other code is correct until we have a clear and stable vision of what the command is supposed to perform. The inner workings are less important than what the feature does. FWIW, the row available at the end of all BEFORE triggers is clearly the object we should be manipulating, not the original VALUES() clause. Otherwise this type of INSERT would behave differently from normal INSERTs. Which would likely violate RLS, if nothing else. > As I mentioned, I have incorporated feedback from Kevin Grittner. You > may specify a unique index to merge on from within the INSERT > statement, thus avoiding the risk of inadvertently having the update > affect the wrong tuple due to the user failing to consider that there > was a would-be unique violation within some other unique index > constraining some other attribute. You may write the DML statement > like this: > > INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT WITHIN > upsert_pkey UPDATE SET val = 'update'; > > I think that there is a good chance that at least some people will > want to make this mandatory. I guess that's fair enough, but I > *really* don't want to *mandate* that users specify the name of their > unique index in DML for obvious reasons. Perhaps we can come up with a > more tasteful syntax that covers all interesting cases (consider the > issues with partial unique indexes and before triggers for example, > where a conclusion reached about which index to use during parse > analysis may subsequently be invalidated by user-defined code, or > ambiguous specifications in the face of overlapping attributes between > two unique composite indexes, etc). The Right Thing is far from > obvious, and there is very little to garner from other systems, since > SQL MERGE promises essentially nothing about concurrency, both as > specified by the standard and in practice. You don't need a unique > index at all, and as I showed in my pgCon talk, there are race > conditions even for trivial UPSERT operations in all major SQL MERGE > implementations. Surely if there are multiple unique indexes then the result row must be validated against all unique indexes before it is allowed at all? The only problem I see is if the newly inserted row matches one row on one unique value and a different row on a different unique index. Turning the INSERT into an UPDATE will still fail on one or the other, no matter which index you pick. If there is one row for ALL unique indexes then it is irrelevant which index you pick. So either way, I cannot see a reason to specify an index. 
If we do need such a construct, we already have the concept of an IDENTITY for a table, added in 9.4, currently targeted at replication. Listing indexes or columns in the DML statement is more pushups for developers and ORMs, so let's KISS. The way forward, in my view, is to define precisely the behaviour we wish to have. That definition will include the best current mechanism for running an UPSERT using INSERT/UPDATE/loops, comparing that against what is being provided here. We will then have a functional test of equivalence of the approaches, and a basis for making a performance test that shows that performance is increased without any loss of concurrency. Once we have that, we can then be certain our time spent on internals is not wasted by overlooking a simple userland gotcha. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgcrypto: PGP armor headers
On 9/25/14 4:08 PM, Heikki Linnakangas wrote:
> On 09/25/2014 04:56 PM, Marko Tiikkaja wrote:
>> On 9/25/14 3:50 PM, Heikki Linnakangas wrote:
>>> On 09/10/2014 04:35 PM, Marko Tiikkaja wrote:
>>> It might've been a tad more efficient to return the StringInfo buffer directly from pgp_armor/dearmor, and avoid the extra palloc and memcpy, but this isn't performance critical enough for it to really matter.
>>
>> I couldn't see any way of doing that without breaking the VARDATA abstraction. I even went looking for similar cases in the source code, but couldn't find any. If you can come up with a way, feel free to change that -- I'd like to learn myself.
>
> You could first append VARHDRSZ zeros to the StringInfo, then append the base64-encoded data, and last replace the zeros with the real length, using SET_VARSIZE.

That assumes that VARDATA() starts exactly VARHDRSZ bytes into the datum. I couldn't find any callers making that assumption.

.marko
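To make the suggestion concrete, here is a minimal sketch of the approach Heikki describes, assuming the usual varlena layout where VARDATA() starts exactly VARHDRSZ bytes into the datum (the function name and its caller are illustrative, not from the patch):

#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Sketch only: build a bytea directly in a StringInfo, avoiding the
 * extra palloc and memcpy.  Relies on VARDATA() being exactly VARHDRSZ
 * bytes past the start of the datum.
 */
static bytea *
build_bytea_result(const char *payload, int payload_len)
{
	StringInfoData buf;

	initStringInfo(&buf);
	/* Reserve placeholder bytes where the varlena header will go. */
	appendStringInfoSpaces(&buf, VARHDRSZ);
	/* Append the payload (the base64-encoded armor, in this case). */
	appendBinaryStringInfo(&buf, payload, payload_len);
	/* Overwrite the placeholder with the real total length. */
	SET_VARSIZE(buf.data, buf.len);
	return (bytea *) buf.data;
}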
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 25, 2014 at 10:02 AM, Merlin Moncure wrote:
> On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas wrote:
>> 1. To see the effect of reduce-replacement-locking.patch, compare the first TPS number in each line to the third, or the second to the fourth. At scale factor 1000, the patch wins in all of the cases with 32 or more clients and exactly half of the cases with 1, 8, or 16 clients. The variations at low client counts are quite small, and the patch isn't expected to do much at low concurrency levels, so that's probably just random variation. At scale factor 3000, the situation is more complicated. With only 16 bufmappinglocks, the patch gets its biggest win at 48 clients, and by 96 clients it's actually losing to unpatched master. But with 128 bufmappinglocks, it wins - often massively - on everything but the single-client test, which is a small loss, hopefully within experimental variation.
>>
>> Comments?
>
> Why stop at 128 mapping locks? Theoretical downsides to having more mapping locks have been mentioned a few times, but has this ever been measured? I'm starting to wonder if the # of mapping locks should be dependent on some other value, perhaps the # of shared buffers...

Good question. My belief is that the number of buffer mapping locks required to avoid serious contention will be roughly proportional to the number of hardware threads. At the time the value 16 was chosen, there were probably no more than 8-core CPUs in common use; but now we've got a machine with 64 hardware threads and, what do you know, it wants 128 locks.

I think the long-term solution here is that we need a lock-free hash table implementation for our buffer mapping tables, because I'm pretty sure that just cranking the number of locks up and up is going to start to have unpleasant side effects at some point. We may be able to buy a few more years by just cranking it up, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
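For readers following along, a simplified sketch of the partitioned locking scheme being tuned here -- each buffer-tag hash picks one of N locks, so raising N from 16 to 128 spreads lookups across more locks. Plain pthread rwlocks stand in for PostgreSQL's LWLocks, and all names are illustrative, not the actual source:

#include <stdint.h>
#include <pthread.h>

#define NUM_PARTITIONS 128			/* the knob being benchmarked: 16 vs. 128 */

static pthread_rwlock_t partition_locks[NUM_PARTITIONS];

/*
 * Map a buffer-tag hash to its partition lock.  A power-of-two count
 * lets the modulo operation reduce to a mask.
 */
static pthread_rwlock_t *
partition_lock_for(uint32_t hashcode)
{
	return &partition_locks[hashcode & (NUM_PARTITIONS - 1)];
}

Lookups take the partition lock shared and mapping changes take it exclusive, so with more partitions, concurrent lookups of different buffers collide on the same lock less often -- which is why the 64-thread machine above wants 128 of them.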
Re: [HACKERS] pgcrypto: PGP armor headers
On 09/25/2014 04:56 PM, Marko Tiikkaja wrote:
> On 9/25/14 3:50 PM, Heikki Linnakangas wrote:
>> On 09/10/2014 04:35 PM, Marko Tiikkaja wrote:
>> It might've been a tad more efficient to return the StringInfo buffer directly from pgp_armor/dearmor, and avoid the extra palloc and memcpy, but this isn't performance critical enough for it to really matter.
>
> I couldn't see any way of doing that without breaking the VARDATA abstraction. I even went looking for similar cases in the source code, but couldn't find any. If you can come up with a way, feel free to change that -- I'd like to learn myself.

You could first append VARHDRSZ zeros to the StringInfo, then append the base64-encoded data, and last replace the zeros with the real length, using SET_VARSIZE.

>> Are you planning to post the main patch rebased on top of this soon? As in the next day or two? Otherwise I'll mark this as "Returned with feedback" for this commitfest.
>
> Yes. With good luck I'll get you a rebased one today; otherwise it'll have to wait until tomorrow.

Ok, I'll leave this in Waiting on Author state then.

- Heikki
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas wrote:
> 1. To see the effect of reduce-replacement-locking.patch, compare the first TPS number in each line to the third, or the second to the fourth. At scale factor 1000, the patch wins in all of the cases with 32 or more clients and exactly half of the cases with 1, 8, or 16 clients. The variations at low client counts are quite small, and the patch isn't expected to do much at low concurrency levels, so that's probably just random variation. At scale factor 3000, the situation is more complicated. With only 16 bufmappinglocks, the patch gets its biggest win at 48 clients, and by 96 clients it's actually losing to unpatched master. But with 128 bufmappinglocks, it wins - often massively - on everything but the single-client test, which is a small loss, hopefully within experimental variation.
>
> Comments?

Why stop at 128 mapping locks? Theoretical downsides to having more mapping locks have been mentioned a few times, but has this ever been measured? I'm starting to wonder if the # of mapping locks should be dependent on some other value, perhaps the # of shared buffers...

merlin
Re: [HACKERS] missing isinf declaration on solaris
On 2014-09-25 10:56:56 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
>> From VS 2013 onwards they're trying hard to be C99 and C11 compatible.
>
> Sounds great. Is VS2013 released already?

Yes.

> If so, maybe we can think about moving to C99 in 2016 or so; at least assuming you can build for older Windows versions with the new compiler.

I think we should simply pick and choose the features we want, and go for it now. Leaving aside the fact that one of them wasn't applied, I'd sent patches a couple of months back that would allow setting up a buildfarm animal that fails for any patch going beyond C89 plus the chosen features.

Greetings,

Andres Freund

--
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)
On 2014-09-25 14:43:14 +0100, Simon Riggs wrote:
> On 25 September 2014 10:41, Andres Freund wrote:
>> On 2014-09-25 10:24:39 +0530, Abhijit Menon-Sen wrote:
>>> At 2014-09-24 11:09:24 +0200, and...@2ndquadrant.com wrote:
>>>> I think it's completely unacceptable to copy a visibility routine.
>>>
>>> OK. Which visibility routine should I use, and should I try to create a variant that doesn't set hint bits?
>>
>> I've not yet followed your premise that you actually need one that doesn't set hint bits...
>
> Not least because I'm trying to solve a similar problem on another thread, so no need to make a special case here.

That's mostly unrelated, though. Abhijit tried to avoid setting hint bits because he wanted to avoid taking *any* form of lock on the buffer. Since I don't believe that's safe (at least there's by far not enough evidence that it is), there's simply no reason to avoid setting them.

Greetings,

Andres Freund

--
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
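For context, a minimal sketch of the pattern Andres is defending -- calling a visibility routine only while holding at least a shared content lock on the buffer, since the check may write hint bits into the tuple header. Function and variable names are illustrative, not the patch's actual code:

#include "postgres.h"
#include "access/heapam.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"
#include "utils/tqual.h"

/* Sketch only: count dead tuples in one heap block. */
static int
count_dead_in_block(Relation rel, BlockNumber blkno, TransactionId OldestXmin)
{
	Buffer		buf = ReadBuffer(rel, blkno);
	Page		page;
	OffsetNumber offnum,
				maxoff;
	int			ndead = 0;

	/* Shared content lock: enough to read tuples and to set hint bits. */
	LockBuffer(buf, BUFFER_LOCK_SHARE);
	page = BufferGetPage(buf);
	maxoff = PageGetMaxOffsetNumber(page);

	for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
	{
		ItemId		itemid = PageGetItemId(page, offnum);
		HeapTupleData tuple;

		if (!ItemIdIsNormal(itemid))
			continue;

		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
		tuple.t_len = ItemIdGetLength(itemid);
		tuple.t_tableOid = RelationGetRelid(rel);
		ItemPointerSet(&tuple.t_self, blkno, offnum);

		/* May write hint bits on the tuple -- hence the buffer lock. */
		if (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf) == HEAPTUPLE_DEAD)
			ndead++;
	}

	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
	ReleaseBuffer(buf);

	return ndead;
}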
Re: [HACKERS] missing isinf declaration on solaris
Andres Freund wrote:
> On 2014-09-24 17:39:19 -0300, Alvaro Herrera wrote:
>> AFAIK we cannot move all the way to C99, because MSVC doesn't support it.
>
> FWIW, msvc has supported a good part of C99 for a long while. There are bits and pieces it doesn't, but not things I think we're likely to adopt. The most commonly complained-about one is C99 variable declarations. I can't see PG adopting that tomorrow.

I guess I got unlucky, then, to have been bitten by precisely that missing feature twice.

> From VS 2013 onwards they're trying hard to be C99 and C11 compatible.

Sounds great. Is VS2013 released already? If so, maybe we can think about moving to C99 in 2016 or so; at least assuming you can build for older Windows versions with the new compiler.

--
Álvaro Herrera                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
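To illustrate the specific feature mentioned above, a trivial before/after sketch of C89-style declarations versus the C99 mixed declarations the tree does not currently allow:

/* C89: every declaration at the start of the block. */
static int
sum_c89(const int *vals, int n)
{
	int		i;
	int		total = 0;

	for (i = 0; i < n; i++)
		total += vals[i];
	return total;
}

/* C99: declarations may appear where first used, e.g. inside for(). */
static int
sum_c99(const int *vals, int n)
{
	int		total = 0;

	for (int i = 0; i < n; i++)
		total += vals[i];
	return total;
}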