Re: [HACKERS] Triconsistent catalog declaration
On 09/15/2014 08:56 PM, Robert Haas wrote:
> On Mon, Sep 15, 2014 at 10:13 AM, Heikki Linnakangas wrote:
>> That makes for a bit awkward input and output from psql, when the
>> values used are 0, 1, 2, rather than ASCII characters. But that's OK,
>> because as you said these functions are not callable from psql anyway,
>> as they have "internal" arguments. Maybe we should change them to
>> something a bit more understandable.
>>
>> We can't change the return datatype to anything wider, or the values
>> from 0, 1, 2, because those values have been chosen so that they are
>> "compatible" with booleans. A boolean can be safely cast to a
>> GinTernaryValue. I'm not sure if we make use of that anywhere ATM, but
>> it's a useful property.
>>
>> This requires a catalog change to fix. Are we still planning to do a
>> catversion bump for 9.4 because of the jsonb changes?
>
> That was my understanding, although we seem to be proceeding at an
> inexplicably glacial pace.

Ok, committed.

- Heikki

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Anonymous code block with parameters
I'd like to propose support for IN and OUT parameters in 'DO' blocks.
Currently, anonymous code blocks (DO statements) cannot receive or
return parameters. I suggest:

1) Add a new clause to the DO statement for specifying the names, types,
directions and values of parameters:

DO [ LANGUAGE lang_name ] [ USING (argument_list) ]

where argument_list has the same syntax as in
'CREATE FUNCTION name (argument_list)'. Example:

do $$
begin
  z := x || y;
end;
$$ language plpgsql
using ( x text = '1', in out y int4 = 123, out z text );

2) Values for IN and IN OUT parameters are specified using the syntax
for default values of function arguments.

3) If a DO statement has at least one OUT or IN OUT parameter, then it
returns one tuple containing the values of the OUT and IN OUT
parameters.

Do you think that this feature would be useful? I have a
proof-of-concept patch in progress that I intend to publish soon.
Re: [HACKERS] WAL format and API changes (9.5)
On Mon, Sep 15, 2014 at 8:00 PM, Michael Paquier wrote:
> Alvaro got faster than me... I was just looking at the first patch and
> +1 on those comments. It is worth noting that the first patch, as it
> does only a large refactoring, does not impact performance or size of
> WAL records.

Some comments after a second and closer read of the first patch:

1) Wouldn't it be better to call GetFullPageWriteInfo directly in
XLogRecordAssemble, removing RedoRecPtr and doPageWrites from its
arguments?

2) XLogCheckBufferNeedsBackup is not used. It can be removed.

3) If I am following correctly, there are two code paths where
XLogInsertRecData can return InvalidXLogRecPtr, and then there is this
process in XLogInsert:

+	{
+		/*
+		 * Undo the changes we made to the rdata chain.
+		 *
+		 * XXX: This doesn't undo *all* the changes; the XLogRecData
+		 * entries for buffers that we had already decided to back up have
+		 * had their data-pointers cleared. That's OK, as long as we
+		 * decide to back them up on the next iteration as well. Hence,
+		 * don't allow "doPageWrites" value to go from true to false after
+		 * we've modified the rdata chain.
+		 */
+		bool		newDoPageWrites;
+
+		GetFullPageWriteInfo(&RedoRecPtr, &newDoPageWrites);
+		if (!doPageWrites && newDoPageWrites)
+			doPageWrites = true;
+		rdt_lastnormal->next = NULL;
+	}

Wouldn't it be better to keep that in XLogInsertRecData? This way
GetFullPageWriteInfo could be removed (I actually see little point in
keeping it as part of this patch, but does this change make sense in
patch 2?).

4) This patch is in DOS encoding (?!)

That's all I have for now...
Regards,
-- 
Michael
Re: [HACKERS] proposal: plpgsql - Assert statement
Jan Wieck wrote:
> On 09/14/2014 02:25 PM, Pavel Stehule wrote:
>> a) Isn't possible handle a assertion exception anywhere .. it enforce
>> ROLLBACK in 100%
>>
>> b) Assertions should be disabled globally .. I am not sure, it it is a
>> good idea, but I can understand so some tests based on queries to data
>> can be performance issue.
>>
>> Important question is a relation assertations and exceptions. Is it
>> only shortcut for exception or some different?
>
> I think that most data integrity issues can be handled by a well
> designed database schema that uses UNIQUE, NOT NULL, REFERENCES and
> CHECK constraints. Assertions are usually found inside of complex
> code constructs to check values of local variables. I don't think it
> is even a good idea to implement assertions that can query arbitrary
> data.

Actually Peter Eisentraut posted a patch for SQL assertions:
http://www.postgresql.org/message-id/1384486216.5008.17.ca...@vanquo.pezone.net

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] WAL format and API changes (9.5)
On Mon, Sep 15, 2014 at 7:03 PM, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> Here we go. I've split this again into two patches. The first patch
>> is just refactoring the current code. It moves XLogInsert into a new
>> file, xloginsert.c, and the definition of XLogRecord to a new
>> xlogrecord.h header file. As a result, there is a lot of churn in
>> the #includes in C files that generate WAL records, or contain redo
>> routines. The number of files that pull in xlog.h - directly or
>> indirectly through other headers - is greatly reduced.
>
> I think you should push the first patch for now and continue to
> investigate the issues in the second patch. Some minor suggestions I
> have for the first patch are that since xlog.h is no longer used by
> other headers (but only by .c files), you should just #include
> xloginsert.h instead of doing the forward declaration of struct
> XLogRecData; and in xlog_internal.h, instead of
>
> - * XLogRecord is defined in xlog.h, but we avoid #including that to keep
> + * XLogRecord is defined in xlogrecord.h, but we avoid #including that to keep
>
> I would just suggest to #include xlogrecord.h, since it should just
> work to include that file from frontend programs. It didn't work for
> xlog.h because that one #includes fmgr.h IIRC and that one causes Datum
> to appear, which makes compilation blow up. Shouldn't be the case here.

Alvaro got there faster than me... I was just looking at the first
patch, and +1 on those comments. It is worth noting that the first
patch, as it does only a large refactoring, does not impact performance
or size of WAL records.

Btw, an #include of access/xlog.h in fd.c seems to have been forgotten.
Also, if somebody is interested in running the test suite, there are
comments on how to do it at the top of compare.sql.
-- 
Michael
Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW
(2014/09/13 0:13), Tom Lane wrote:
> Albe Laurenz writes:
>> Tom Lane wrote:
>>> I'm not sure offhand what the new plan tree ought to look like. We could
>>> just generate a ForeignScan node, but that seems like rather a misnomer.
>>> Is it worth inventing a new ForeignUpdate node type? Or maybe it'd still
>>> be ForeignScan but with a flag field saying "hey this is really an update
>>> (or a delete)". The main benefit I can think of right now is that the
>>> EXPLAIN output would be less strange-looking --- but EXPLAIN is hardly
>>> the only thing that ever looks at plan trees, so having an outright
>>> misleading plan structure is likely to burn us down the line.
>
> I was envisioning that the EXPLAIN output would look like
>
>   Foreign Scan on tab1
>     Remote SQL: SELECT ...
>
> for the normal case, versus
>
>   Foreign Update on tab1
>     Remote SQL: UPDATE ...
>
> for the pushed-down-update case (and similarly for DELETE). For a
> non-optimized update it'd still be a ForeignScan underneath a ModifyTable.
> As for the internal representation, I was thinking of adding a CmdType
> field to struct ForeignScan, with currently only CMD_SELECT, CMD_UPDATE,
> CMD_DELETE as allowed values, though possibly in future we'd think of a
> reason to allow CMD_INSERT there.

+1

> The only thing that's bothering me about this concept is that I'm not
> seeing how to scale it up to handling a pushed-down update on a join,
> ie, "UPDATE foo ... FROM bar ..." where both foo and bar are remote.
> Maybe it's silly to worry about that until join push-down is done;
> but in that case I'd vote for postponing this whole patch until we
> have join push-down.

OK.

Thanks,

PS: I'll help Hanada-san do the work if there is anything I can do.

Best regards,
Etsuro Fujita
Re: [HACKERS] WAL format and API changes (9.5)
Heikki Linnakangas wrote:
> Here we go. I've split this again into two patches. The first patch
> is just refactoring the current code. It moves XLogInsert into a new
> file, xloginsert.c, and the definition of XLogRecord to a new
> xlogrecord.h header file. As a result, there is a lot of churn in
> the #includes in C files that generate WAL records, or contain redo
> routines. The number of files that pull in xlog.h - directly or
> indirectly through other headers - is greatly reduced.

I think you should push the first patch for now and continue to
investigate the issues in the second patch. Some minor suggestions I
have for the first patch are that since xlog.h is no longer used by
other headers (but only by .c files), you should just #include
xloginsert.h instead of doing the forward declaration of struct
XLogRecData; and in xlog_internal.h, instead of

- * XLogRecord is defined in xlog.h, but we avoid #including that to keep
+ * XLogRecord is defined in xlogrecord.h, but we avoid #including that to keep

I would just suggest to #include xlogrecord.h, since it should just
work to include that file from frontend programs. It didn't work for
xlog.h because that one #includes fmgr.h IIRC and that one causes Datum
to appear, which makes compilation blow up. Shouldn't be the case here.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/16/2014 07:44 AM, Peter Geoghegan wrote:
> FWIW, I am slightly concerned about weighing use cases around very
> large JSON documents too heavily. Having enormous jsonb documents just
> isn't going to work out that well, but neither will equivalent designs
> in popular document database systems for similar reasons. For example,
> the maximum BSON document size supported by MongoDB is 16 megabytes,
> and that seems to be something that their users don't care too much
> about. Having 270 pairs in an object isn't unreasonable, but it isn't
> going to be all that common either.

Also, at a certain size the fact that Pg must rewrite the whole
document for any change to it starts to introduce other practical
problems.

Anyway - this is looking like the change will go in, and with it a
catversion bump. Introduction of a jsonb version/flags byte might be
worthwhile at the same time. It seems likely that there'll be more room
for improvement in jsonb, possibly even down to using different formats
for different data. Is it worth paying a byte per value to save on
possible upgrade pain?

-- 
Craig Ringer                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] jsonb format is pessimal for toast compression
I couldn't get my hands on the Twitter data, but I'm generating my own.
The json template is http://paste2.org/wJ1dfcjw and the data was
generated with http://www.json-generator.com/. It has 35 top level
keys, just in case someone is wondering. I generated 1 random objects
and I'm inserting them repeatedly until I got 320k rows.

Test query: SELECT data->>'name', data->>'email' FROM t_json
Test storage: EXTERNAL
Test jsonb lengths quartiles: {1278,1587,1731,1871,2231}

Tom's lengths+cache aware: 455ms
HEAD: 440ms

This is a realistic-ish workload in my opinion, and Tom's patch
performs within 4% of HEAD.

Due to the overall lengths I couldn't really test compressibility, so
I re-ran the test. This time I inserted an array of 2 objects in each
row, as in: [obj, obj]. The objects were taken in sequence from the 1
pool so contents match in both tests.

Test query: SELECT data #> '{0, name}', data #> '{0, email}',
data #> '{1, name}', data #> '{1, email}' FROM t_json
Test storage: EXTENDED

HEAD: 17mb table + 878mb toast
HEAD size quartiles: {2015,2500,2591,2711,3483}
HEAD query runtime: 15s

Tom's: 220mb table + 580mb toast
Tom's size quartiles: {1665,1984,2061,2142.25,2384}
Tom's query runtime: 13s

This is an intriguing edge case where Tom's patch actually outperforms
the base implementation for 3~4kb jsons.
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 4:21 PM, Peter Geoghegan wrote:
> I attach a much simpler patch, that only adds an opportunistic
> "memcmp() == 0" before a possible strcoll(). Both
> bttextfastcmp_locale() and varstr_cmp() have the optimization added,
> since there is no point in leaving anyone out for this part.

FWIW, it occurs to me that this could be a big win for cases like
ExecMergeJoin(). Obviously, abbreviated keys will usually make your
merge join on text attributes a lot faster in the common case where a
sort is involved (if we consider that the sort is integral to the cost
of the join). However, when making sure that inner and outer tuples
match, the MJCompare() call will *very* frequently get away with a
cheap memcmp(). That could make a very significant additional
difference, I think.

-- 
Peter Geoghegan
Re: [HACKERS] jsonb format is pessimal for toast compression
On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus wrote:
> Actually, having the keys all at the same level *is* relevant for the
> issue we're discussing. If those 270 keys are organized in a tree,
> it's not the same as having them all on one level (and not as
> problematic).

I believe Robert meant that the 270 keys are not at the top level, but
are at some level (in other words, some object has 270 pairs). That is
equivalent to having them at the top level for the purposes of this
discussion.

FWIW, I am slightly concerned about weighing use cases around very
large JSON documents too heavily. Having enormous jsonb documents just
isn't going to work out that well, but neither will equivalent designs
in popular document database systems for similar reasons. For example,
the maximum BSON document size supported by MongoDB is 16 megabytes,
and that seems to be something that their users don't care too much
about. Having 270 pairs in an object isn't unreasonable, but it isn't
going to be all that common either.

-- 
Peter Geoghegan
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 14/09/14 17:37, Simon Riggs wrote:
> On 12 September 2014 18:19, Simon Riggs wrote:
>> On 12 September 2014 15:30, Tom Lane wrote:
>>> After a little bit I remembered there was already a function for this.
>>> So specifically, I'd suggest using ExecRelationIsTargetRelation()
>>> to decide whether to mark the scan as requiring pruning.
>>
>> Sounds cool. Thanks both, this is sounding like a viable route now.
>
> Yes, this is viable.
>
> Patch attached, using Alvaro's idea of use-case specific pruning and
> Tom's idea of aiming at target relations. Patch uses or extends
> existing infrastructure, so its shorter than it might have been, yet
> with all that bufmgr yuck removed.
>
> This is very, very good because while going through this I notice the
> dozen or more places where we were pruning blocks in annoying places I
> didn't even know about such as about 4-5 constraint checks. In more
> than a few DDL commands like ALTER TABLE and CLUSTER we were even
> pruning the old relation prior to rewrite.

A simple performance test with the following variables:

LOOP=50
CONN=60
TXSS=500
SCALE=30

Select only:
WITH PATCH Average: 20716.1 tps
NO PATCH Average: 19141.7 tps

With writes:
WITH PATCH Average: 2602.65
NO PATCH Average: 2565.32

TODO:
- Consistency check.
- ALTER and CLUSTER test.
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 11:25 AM, Peter Geoghegan wrote:
> OK, I'll draft a patch for that today, including similar alterations
> to varstr_cmp() for the benefit of Windows and so on.

I attach a much simpler patch, that only adds an opportunistic
"memcmp() == 0" before a possible strcoll(). Both
bttextfastcmp_locale() and varstr_cmp() have the optimization added,
since there is no point in leaving anyone out for this part.

When this is committed, and I hear back from you on the question of
what to do about having an independent sortsupport state for
abbreviated tie-breakers (and possibly other issues of concern), I'll
produce a rebased patch with a single commit.

-- 
Peter Geoghegan

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
new file mode 100644
index 7549542..53be8ec
*** a/src/backend/utils/adt/varlena.c
--- b/src/backend/utils/adt/varlena.c
*** varstr_cmp(char *arg1, int len1, char *a
*** 1405,1410 ****
--- 1405,1438 ----
  #endif
  	}
  
+ 	/*
+ 	 * Fast path: Attempt an entirely opportunistic "memcmp() == 0".
+ 	 *
+ 	 * In general there is no reason to be optimistic about being able to
+ 	 * resolve the string comparison in this way. There are many usage
+ 	 * patterns in which this fast path will never be taken, and some in
+ 	 * which it will frequently be taken. When it is frequently taken,
+ 	 * clearly the optimization is worthwhile, since strcoll() is known to
+ 	 * be very expensive. When it is not taken, the overhead is completely
+ 	 * negligible, and perhaps even immeasurably small, even in the worst
+ 	 * case. We have nothing to lose and everything to gain.
+ 	 *
+ 	 * These counter-intuitive performance characteristics are likely
+ 	 * explained by the dominant memory latency cost when performing a
+ 	 * strcoll(); in practice, this may give the compiler and CPU
+ 	 * sufficient leeway to order things in a way that hides the overhead
+ 	 * of useless memcmp() calls. Sorting is very probably bottlenecked on
+ 	 * memory bandwidth, and so it's worth it to use a few arithmetic
+ 	 * instructions on the off chance that it results in a saving in memory
+ 	 * bandwidth. In effect, this makes use of a capacity that would
+ 	 * otherwise go unused. The same memory needed to be read into CPU
+ 	 * cache at this juncture anyway. Modern CPUs can execute hundreds of
+ 	 * instructions in the time taken to fetch a single cache line from
+ 	 * main memory.
+ 	 */
+ 	if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
+ 		return 0;
+ 
  #ifdef WIN32
  	/* Win32 does not have UTF-8, so we need to map to UTF-16 */
  	if (GetDatabaseEncoding() == PG_UTF8)
*** bttextfastcmp_locale(Datum x, Datum y, S
*** 1842,1847 ****
--- 1870,1885 ----
  	len1 = VARSIZE_ANY_EXHDR(arg1);
  	len2 = VARSIZE_ANY_EXHDR(arg2);
  
+ 	/*
+ 	 * Fast path: Attempt an entirely opportunistic memcmp() == 0, as
+ 	 * explained within varstr_cmp()
+ 	 */
+ 	if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
+ 	{
+ 		result = 0;
+ 		goto done;
+ 	}
+ 
  	if (len1 >= tss->buflen1)
  	{
  		pfree(tss->buf1);
*** bttextfastcmp_locale(Datum x, Datum y, S
*** 1875,1880 ****
--- 1913,1919 ----
  	if (result == 0)
  		result = strcmp(tss->buf1, tss->buf2);
  
+ done:
  	/* We can't afford to leak memory here. */
  	if (PointerGetDatum(arg1) != x)
  		pfree(arg1);
Re: [HACKERS] Suspicious check (src/backend/access/gin/gindatapage.c)
On Fri, Sep 12, 2014 at 5:42 PM, Heikki Linnakangas wrote:
> On 09/12/2014 11:38 AM, Heikki Linnakangas wrote:
>> Now that the logic is fixed, I hope we won't get complaints that the
>> indexes are bigger, if you fill a table by appending to the end. I
>> wonder if we should aim at an even more uneven split; the default
>> fillfactor for B-trees is 90%, for example. I didn't go that high
>> when I wrote that, because the code in previous versions always did a
>> 50/50 split. But it could be argued that a higher fillfactor makes
>> sense for a GIN index - they typically don't get as much random
>> updates as a B-tree.
>
> Actually, we should add a fillfactor reloption to GIN. But that's 9.5
> material.

Actually gin is the only method that does not have this parameter, no?
Then the following extra processing would be enough, I imagine:
1) Tune freespace using fillfactor when placing keys on a leaf data
page (dataPlaceToPageLeaf)
2) On split, instead of (BLCKSZ * 3) / 4, have the left leaf full at
(BLCKSZ * fillfactor) / 100
Regards,
-- 
Michael
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/15/2014 02:16 PM, Robert Haas wrote:
> On Mon, Sep 15, 2014 at 3:09 PM, Josh Berkus wrote:
>> On 09/15/2014 10:23 AM, Claudio Freire wrote:
>>> Now, large small keys could be 200 or 2000, or even 20k. I'd guess
>>> several should be tested to find the shape of the curve.
>>
>> Well, we know that it's not noticeable with 200, and that it is
>> noticeable with 100K. It's only worth testing further if we think
>> that having more than 200 top-level keys in one JSONB value is going
>> to be a use case for more than 0.1% of our users. I personally do not.
>
> FWIW, I have written one (1) application that uses JSONB and it has
> one sub-object (not the top-level object) that in the most typical
> configuration contains precisely 270 keys. Now, granted, that is not
> the top-level object, if that distinction is actually relevant here,
> but color me just a bit skeptical of this claim anyway. This was just
> a casual thing I did for my own use, not anything industrial strength,
> so it's hard to believe I'm stressing the system more than 99.9% of
> users will.

Actually, having the keys all at the same level *is* relevant for the
issue we're discussing. If those 270 keys are organized in a tree, it's
not the same as having them all on one level (and not as problematic).

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Collation-aware comparisons in GIN opclasses
On Mon, Sep 15, 2014 at 12:45 PM, Tom Lane wrote:
> No. And we don't know how to change the default opclass without
> breaking things, either.

Is there a page on the wiki along the lines of "things that we would
like to change if ever there is a substantial change in on-disk format
that will break pg_upgrade"? ISTM that we should be intelligently
saving those some place, just as Redhat presumably saves up ABI
breakage over many years for the next major release of RHEL.
Alexander's complaint is a good example of such a change, IMV. Isn't it
more or less expected that the day will come when we'll make a clean
break?

-- 
Peter Geoghegan
Re: [HACKERS] jsonb format is pessimal for toast compression
On Mon, Sep 15, 2014 at 3:09 PM, Josh Berkus wrote:
> On 09/15/2014 10:23 AM, Claudio Freire wrote:
>> Now, large small keys could be 200 or 2000, or even 20k. I'd guess
>> several should be tested to find the shape of the curve.
>
> Well, we know that it's not noticeable with 200, and that it is
> noticeable with 100K. It's only worth testing further if we think that
> having more than 200 top-level keys in one JSONB value is going to be
> a use case for more than 0.1% of our users. I personally do not.

FWIW, I have written one (1) application that uses JSONB and it has
one sub-object (not the top-level object) that in the most typical
configuration contains precisely 270 keys. Now, granted, that is not
the top-level object, if that distinction is actually relevant here,
but color me just a bit skeptical of this claim anyway. This was just
a casual thing I did for my own use, not anything industrial strength,
so it's hard to believe I'm stressing the system more than 99.9% of
users will.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 15 September 2014 17:09, Robert Haas wrote:
> Do we really want to disable HOT for all catalog scans?

The intention of the patch is that catalog scans are treated
identically to non-catalog scans. The idea here is that HOT cleanup
only occurs on scans on target relations, so only INSERT, UPDATE and
DELETE do HOT cleanup.

It's possible that many catalog scans don't follow the normal target
relation logic, so we might argue we should use HOT every time. OTOH,
since we now have separate catalog xmins we may find that using HOT on
catalogs is no longer effective.

So I could go either way on how to proceed; it's an easy change either
way.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/15/2014 12:25 PM, Claudio Freire wrote:
> On Mon, Sep 15, 2014 at 4:17 PM, Josh Berkus wrote:
>> On 09/15/2014 12:15 PM, Claudio Freire wrote:
>>> So while you're right that it's perhaps above what would be a common
>>> use case, the range "somewhere between 200 and 100K" for the tipping
>>> point seems overly imprecise to me.
>>
>> Well, then, you know how to solve that.
>
> I was hoping testing with other numbers was a simple hitting a key for
> someone else.

Nope. My test case has a fixed size.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Collation-aware comparisons in GIN opclasses
Peter Geoghegan writes:
> On Mon, Sep 15, 2014 at 8:28 AM, Alexander Korotkov wrote:
>> Rename such opclasses and make them not default.
>> Create new default opclasses with bitwise comparison functions.
>> Write recommendation to re-create indexes with default opclasses into
>> documentation.
>
> I certainly think this should be fixed if at all possible, but I'm not
> sure about this plan. Can we really rename an opclass without
> consequence, including having that respected across pg_upgrade?

No. And we don't know how to change the default opclass without
breaking things, either. See previous discussions about how we might
fix the totally-broken default gist opclass that btree_gist creates for
the inet type [1]. The motivation for getting rid of that is *way*
stronger than "it might be slow", but there's no apparent way to make
something else be the default without creating havoc.

			regards, tom lane

[1] http://www.postgresql.org/message-id/flat/cae2gyzyuesd188j0b290gf16502h9b-lwnrs3rfi1swdb9q...@mail.gmail.com
Re: [HACKERS] jsonb format is pessimal for toast compression
On Mon, Sep 15, 2014 at 4:17 PM, Josh Berkus wrote:
> On 09/15/2014 12:15 PM, Claudio Freire wrote:
>> So while you're right that it's perhaps above what would be a common
>> use case, the range "somewhere between 200 and 100K" for the tipping
>> point seems overly imprecise to me.
>
> Well, then, you know how to solve that.

I was hoping testing with other numbers was a simple matter of hitting
a key for someone else.

But sure. I'll set something up.
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/15/2014 12:15 PM, Claudio Freire wrote:
> So while you're right that it's perhaps above what would be a common
> use case, the range "somewhere between 200 and 100K" for the tipping
> point seems overly imprecise to me.

Well, then, you know how to solve that.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
On Mon, Sep 15, 2014 at 4:09 PM, Josh Berkus wrote:
> On 09/15/2014 10:23 AM, Claudio Freire wrote:
>> Now, large small keys could be 200 or 2000, or even 20k. I'd guess
>> several should be tested to find the shape of the curve.
>
> Well, we know that it's not noticeable with 200, and that it is
> noticeable with 100K. It's only worth testing further if we think that
> having more than 200 top-level keys in one JSONB value is going to be
> a use case for more than 0.1% of our users. I personally do not.

Yes, but bear in mind that the worst case is exactly at the use case
jsonb was designed to speed up: element access within relatively big
json documents. Having them uncompressed is to be expected, because
people using jsonb will often favor speed over compactness if it's a
tradeoff (otherwise they'd use plain json).

So while you're right that it's perhaps above what would be a common
use case, the range "somewhere between 200 and 100K" for the tipping
point seems overly imprecise to me.
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/15/2014 10:23 AM, Claudio Freire wrote:
> Now, large small keys could be 200 or 2000, or even 20k. I'd guess
> several should be tested to find the shape of the curve.

Well, we know that it's not noticeable with 200, and that it is
noticeable with 100K. It's only worth testing further if we think that
having more than 200 top-level keys in one JSONB value is going to be a
use case for more than 0.1% of our users. I personally do not.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Collation-aware comparisons in GIN opclasses
On Mon, Sep 15, 2014 at 10:51 PM, Peter Geoghegan wrote:
> On Mon, Sep 15, 2014 at 8:28 AM, Alexander Korotkov wrote:
> > some GIN opclasses uses collation-aware comparisons while they don't need to
> > do especially collation-aware comparison. Examples are text[] and hstore
> > opclasses. Depending on collation this may make them a much slower.
>
> I'm glad that I saw how pointless this was with the jsonb GIN default
> opclass during development.
>
> > Rename such opclasses and make them not default.
> > Create new default opclasses with bitwise comparison functions.
> > Write recommendation to re-create indexes with default opclasses into
> > documentation.
>
> I certainly think this should be fixed if at all possible, but I'm not
> sure about this plan. Can we really rename an opclass without
> consequence, including having that respected across pg_upgrade?

A simple rename doesn't seem to be safe. Since pg_upgrade uses pg_dump, all indexes are linked to opclasses by their names, so existing indexes would end up linked to the new opclasses. It's likely we'd need to execute an SQL script renaming the opclasses before pg_upgrade. Another option is not to rename the old opclasses at all, and just create new default opclasses with new names. Bruce, what is your opinion about pg_upgrade? Contrib opclasses would be safe to rename using a migration script. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Support for N synchronous standby servers
On Sat, Sep 13, 2014 at 3:56 AM, Robert Haas wrote:
> I think so. If we make it a function, then it's either the kind of
> function that you access via pg_proc, or it's the kind that's written
> in C and installed by storing a function pointer in a hook variable
> from _PG_init(). The first approach is a non-starter because it would
> require walsender to be connected to the database where that function
> lives, which is a non-starter at least for logical replication where
> we need walsender to be connected to the database being replicated.
> Even if we found some way around that problem, and I'm not sure there
> is one, I suspect the overhead would be pretty high. The second
> approach - a hook that can be accessed directly by loadable modules -
> seems like it would work fine; the only problem is that you've
> got to write your policy function in C. But I have no issue with
> exposing it that way if someone wants to write a patch. There is no
> joy in getting between the advanced user and his nutty-complicated
> sync rep configuration. However, I suspect that most users will
> prefer a more user-friendly interface.

Reading both your answers, I'd tend to think that having a set of hooks to satisfy all the potential user requests would be enough. We could let the server code decide what the priority of the standbys is, using the information in synchronous_standby_names, then have the hooks interact with SyncRepReleaseWaiters and pg_stat_get_wal_senders. We would need two hooks:
- one able to get an array of WAL sender positions defining all the nodes considered as sync nodes. This is enough for pg_stat_get_wal_senders. SyncRepReleaseWaiters would need it as well...
- a second able to define the update policy of the write and flush positions in walsndctl (SYNC_REP_WAIT_FLUSH and SYNC_REP_WAIT_WRITE), as well as the next failover policy. This would be needed when a WAL sender calls SyncRepReleaseWaiters.
Perhaps that's overthinking it, but I am getting the impression that whatever decision is taken, it will involve modifications of the sync-standby parametrization at the GUC level, and whatever path is chosen (dedicated language, set of integer params), there will be complaints about what things can or cannot be done. At least a set of hooks has the merit of saying: do what you like with your synchronous node policy. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collation-aware comparisons in GIN opclasses
On Mon, Sep 15, 2014 at 8:28 AM, Alexander Korotkov wrote: > some GIN opclasses uses collation-aware comparisons while they don't need to > do especially collation-aware comparison. Examples are text[] and hstore > opclasses. Depending on collation this may make them a much slower. I'm glad that I saw how pointless this was with the jsonb GIN default opclass during development. > Rename such opclasses and make them not default. > Create new default opclasses with bitwise comparison functions. > Write recommendation to re-create indexes with default opclasses into > documentation. I certainly think this should be fixed if at all possible, but I'm not sure about this plan. Can we really rename an opclass without consequence, including having that respected across pg_upgrade? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb contains behaviour weirdness
On Mon, Sep 15, 2014 at 11:29 AM, Tom Lane wrote: > It might be that the benefit is very close to nil; that would depend a lot > on workload, so it's hard to be sure. I'd say though that the cost is > also very close to nil, in the sense that we're considering two additional > compare-and-branch instructions in a function that will surely expend > hundreds or thousands of instructions if there's no such short-circuit. Fair enough. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb contains behaviour weirdness
Peter Geoghegan writes: > On Mon, Sep 15, 2014 at 11:12 AM, Tom Lane wrote: >> Personally I'd think that we should retain it for objects; Peter's >> main argument against that was that the comment would be too complicated, >> but that seems a bit silly from here. > I just don't see any point to it. My argument against the complexity > of explaining why the optimization is only used with objects is based > on the costs and the benefits. I think the benefits are very close to > nil. It might be that the benefit is very close to nil; that would depend a lot on workload, so it's hard to be sure. I'd say though that the cost is also very close to nil, in the sense that we're considering two additional compare-and-branch instructions in a function that will surely expend hundreds or thousands of instructions if there's no such short-circuit. I've certainly been on the side of "that optimization isn't worth its keep" many times before, but I don't think the case is terribly clear cut here. Since somebody (possibly you) thought it was worth having to begin with, I'm inclined to follow that lead. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb contains behaviour weirdness
On Mon, Sep 15, 2014 at 2:18 PM, Peter Geoghegan wrote: > On Mon, Sep 15, 2014 at 11:12 AM, Tom Lane wrote: >> Personally I'd think that we should retain it for objects; Peter's >> main argument against that was that the comment would be too complicated, >> but that seems a bit silly from here. > > I just don't see any point to it. My argument against the complexity > of explaining why the optimization is only used with objects is based > on the costs and the benefits. I think the benefits are very close to > nil. That seems pessimistic to me; I'm with Tom. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 11:20 AM, Robert Haas wrote: > ...looks like about a 10-line patch. We have the data to show that > the loss is trivial even in the worst case, and we have or should be > able to get data showing that the best-case win is significant even > without the abbreviated key stuff. If you'd care to draft a patch for > just that, I assume we could get it committed in a day or two, whereas > I'm quite sure that considerably more work than that remains for the > main patch. OK, I'll draft a patch for that today, including similar alterations to varstr_cmp() for the benefit of Windows and so on. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 1:55 PM, Peter Geoghegan wrote: > On Mon, Sep 15, 2014 at 10:53 AM, Robert Haas wrote: >> I think there's probably more than that to work out, but in any case >> there's no harm in getting a simple optimization done first before >> moving on to a complicated one. > > I guess we never talked about the abort logic in all that much detail. > I suppose there's that, too. Well, the real point is that from where I'm sitting, this... >> x = memcmp(); >> if (x == 0) >>return x; >> y = strcoll(); >> if (y == 0) >>return x; >> return y; ...looks like about a 10-line patch. We have the data to show that the loss is trivial even in the worst case, and we have or should be able to get data showing that the best-case win is significant even without the abbreviated key stuff. If you'd care to draft a patch for just that, I assume we could get it committed in a day or two, whereas I'm quite sure that considerably more work than that remains for the main patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Triconsistent catalog declaration
Robert Haas writes: > On Mon, Sep 15, 2014 at 10:13 AM, Heikki Linnakangas > wrote: >> This requires a catalog change to fix. Are we still planning to do a >> catversion bump for 9.4 because of the jsonb changes? > That was my understanding, although we seem to be proceeding at an > inexplicably glacial pace. We already had a post-beta2 catversion bump for 30c05261a. As long as this is a catalog fix and not an on-disk-data change, there's no reason not to make it in 9.4. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb contains behaviour weirdness
On Mon, Sep 15, 2014 at 11:12 AM, Tom Lane wrote: > Personally I'd think that we should retain it for objects; Peter's > main argument against that was that the comment would be too complicated, > but that seems a bit silly from here. I just don't see any point to it. My argument against the complexity of explaining why the optimization is only used with objects is based on the costs and the benefits. I think the benefits are very close to nil. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb contains behaviour weirdness
Josh Berkus writes: > On 09/12/2014 01:33 PM, Tom Lane wrote: >> No, it's a bug, not a documentation deficiency. > Hmmm? Are you proposing that we should change how ARRAY @> ARRAY works > for non-JSON data? No. > Or are you proposing that JSONARRAY @> JSONARRAY should work differently > from ARRAY @> ARRAY? And no. It's a bug that jsonb array containment works differently from regular array containment. We understand the source of the bug, ie a mistaken optimization. I don't see why there's much need for discussion about anything except whether removing the optimization altogether (as Peter proposed) is the best fix, or whether we want to retain some weaker form of it. Personally I'd think that we should retain it for objects; Peter's main argument against that was that the comment would be too complicated, but that seems a bit silly from here. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 10:53 AM, Robert Haas wrote: > I think there's probably more than that to work out, but in any case > there's no harm in getting a simple optimization done first before > moving on to a complicated one. I guess we never talked about the abort logic in all that much detail. I suppose there's that, too. > I rather assume we could reuse the results of the first memcmp() > instead of doing it again. > > x = memcmp(); > if (x == 0) >return x; > y = strcoll(); > if (y == 0) >return x; > return y; Of course, but you know what I mean. (I'm sure the compiler will realize this if the programmer doesn't) -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Triconsistent catalog declaration
On Mon, Sep 15, 2014 at 10:13 AM, Heikki Linnakangas wrote: > That makes for a bit awkward input and output from psql, when the values > used are 0, 1, 2, rather than ascii characters. But that's OK, because as > you said these functions are not callable from psql anyway, as they have > "internal" arguments. Maybe we should change them to something a bit more understandable. > This requires a catalog change to fix. Are we still planning to do a > catversion bump for 9.4 because of the jsonb changes? That was my understanding, although we seem to be proceeding at an inexplicably glacial pace. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 1:34 PM, Peter Geoghegan wrote: > On Mon, Sep 15, 2014 at 10:17 AM, Robert Haas wrote: >> It strikes me that perhaps we should make this change (rearranging >> things so that the memcmp tiebreak is run before strcoll) first, >> before dealing with the rest of the abbreviated keys infrastructure. >> It appears to be a separate improvement which is worthwhile >> independently of what we do about that patch. > > I guess we could do that, but AFAICT the only open item blocking the > commit of a basic version of abbreviated keys (the informally agreed > to basic version lacking support for single-attribute aggregates) is > what to do about the current need to create a separate sortsupport > state. I've talked about my thoughts on that question in detail now > [1]. I think there's probably more than that to work out, but in any case there's no harm in getting a simple optimization done first before moving on to a complicated one. > BTW, you probably realize this, but we still need a second memcmp() > after strcoll() too. hu_HU will care about that [2]. > > [1] > http://www.postgresql.org/message-id/cam3swzqcdcnfwd3qzoo4qmy4g8okhuqyrd26bbla7fl2x-n...@mail.gmail.com > [2] > http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad I rather assume we could reuse the results of the first memcmp() instead of doing it again. x = memcmp(); if (x == 0) return x; y = strcoll(); if (y == 0) return x; return y; -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench throttling latency limit
> I'm not sure I like the idea of printing a percentage. It might be
> unclear what the denominator was if somebody feels the urge to work back
> to the actual number of skipped transactions. I mean, I guess it's
> probably just the value you passed to -R, so maybe that's easy enough,
> but then why bother dividing in the first place? The user can do that
> easily enough if they want the data that way.

Indeed, "skipped" and "late" per second may have an unclear denominator. If you divide by the time, the unit would be "tps", but a 120 tps performance including 20 late tps, plus 10 skipped tps... I do not think that is very clear. Reporting "tps" for transactions *not* performed looks strange. Maybe late transactions could be given as a percentage of all processed transactions in the interval. But for skipped, a percentage of what? The only number that would make sense is the total number of transactions scheduled in the interval, but that would mean that the denominator for late would be different from the denominator for skipped, which is basically incomprehensible.

> I agree with you that it would be good to get some statistics on
> late/skipped transactions, but it's not obvious what people will want.
> Late transactions, straight up? Late by more than a threshold value?

Yes. Under throttling, transactions are given a scheduled start time. When a transaction can actually start: (1) if it is already later than the latency limit (a threshold) before even starting, it is *NOT* started, but counted as "skipped"; (2) otherwise it is started. When it finishes, it may be (2a) outside the latency limit (scheduled time + limit) => it is counted as "late", or (2b) within the latency limit => all is well. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 10:17 AM, Robert Haas wrote: > It strikes me that perhaps we should make this change (rearranging > things so that the memcmp tiebreak is run before strcoll) first, > before dealing with the rest of the abbreviated keys infrastructure. > It appears to be a separate improvement which is worthwhile > independently of what we do about that patch. I guess we could do that, but AFAICT the only open item blocking the commit of a basic version of abbreviated keys (the informally agreed to basic version lacking support for single-attribute aggregates) is what to do about the current need to create a separate sortsupport state. I've talked about my thoughts on that question in detail now [1]. BTW, you probably realize this, but we still need a second memcmp() after strcoll() too. hu_HU will care about that [2]. [1] http://www.postgresql.org/message-id/cam3swzqcdcnfwd3qzoo4qmy4g8okhuqyrd26bbla7fl2x-n...@mail.gmail.com [2] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.5] Custom Plan API
On Mon, Sep 15, 2014 at 8:38 AM, Kouhei Kaigai wrote:
>> > The only reason why I put separate hooks here is that create_custom_scan()
>> > needs to know the exact size of the CustomScan node (including private
>> > fields); however, it is helpful for extensions to have their callback invoked
>> > to initialize the node next to the common initialization stuff.
>>
>> Why does it need to know that? I don't see that it's doing anything that
>> requires knowing the size of that node, and if it is, I think it shouldn't
>> be. That should get delegated to the callback provided by the custom plan
>> provider.
>>
> Sorry, my explanation might have been confusing. create_custom_scan() does not
> need to know the exact size of the CustomScan (or a type derived from it)
> precisely because of the two separate hooks: one at node allocation time,
> the other at the tail of the initialization sequence.
> If we had only one hook here, we would need a mechanism to inform
> create_custom_scan() of the exact size of the CustomScan node, including
> private fields managed by the provider, instead of the first hook at node
> allocation time. In that case, node allocation would be done by
> create_custom_scan() itself, so it would have to know the exact size of the
> node to be allocated.
>
> How should I implement the feature here? Is the combination of a static node
> size and a callback at the tail simpler than the existing design that takes
> two individual hooks in create_custom_scan()?

I still don't get it. Right now, the logic in create_custom_scan(), which I think should really be create_custom_plan() or create_plan_from_custom_path(), basically looks like this:

1. call hook function CreateCustomPlan
2. compute values for tlist and clauses
3. pass those values to hook function InitCustomScan()
4. call copy_path_costsize

What I think we should do is:

1. compute values for tlist and clauses
2. pass those values to hook function PlanCustomPath(), which will return a Plan
3.
call copy_path_costsize

create_custom_scan() does not need to allocate the node. You don't need the node to be allocated before computing tlist and clauses. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Mon, Sep 15, 2014 at 2:12 PM, Josh Berkus wrote:
> If not, I think the corner case is so obscure as to be not worth
> optimizing for. I can't imagine that more than a tiny minority of our
> users are going to have thousands of keys per datum.

Worst case is cost scaling linearly with the number of keys, so how expensive it is depends on the number of keys. It would have an effect only on uncompressed jsonb, since compressed jsonb already pays a linear cost for decompression. I'd suggest testing performance with a large number of small keys in uncompressed form. It's bound to show a noticeable regression there. Now, a "large number" of small keys could be 200 or 2000, or even 20k. I'd guess several should be tested to find the shape of the curve. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Sun, Sep 14, 2014 at 10:37 AM, Heikki Linnakangas wrote: > On 09/13/2014 11:28 PM, Peter Geoghegan wrote: >> Anyway, attached rough test program implements what you outline. This >> is for 30,000 32 byte strings (where just the final two bytes differ). >> On my laptop, output looks like this (edited to only show median >> duration in each case): > > Got to be careful to not let the compiler optimize away microbenchmarks like > this. At least with my version of gcc, the strcoll calls get optimized away, > as do the memcmp calls, if you don't use the result for anything. Clang was > even more aggressive; it ran both comparisons in 0.0 seconds. Apparently it > optimizes away the loops altogether. > > Also, there should be a setlocale(LC_ALL, "") call somewhere. Otherwise it > runs in C locale, and we don't use strcoll() at all for C locale. > > After fixing those, it runs much slower, so I had to reduce the number of > strings. Here's a fixed program. I'm now getting numbers like this: > > (baseline) duration of comparisons without useless memcmp()s: 6.007368 > seconds > > duration of comparisons with useless memcmp()s: 6.079826 seconds > > Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless > memcmp() is free with these parameters. > > Is this the worst case scenario? I can't see a worse one, and I replicated your results here on my MacBook Pro. I also tried with 1MB strings and, surprisingly, it was basically free there, too. It strikes me that perhaps we should make this change (rearranging things so that the memcmp tiebreak is run before strcoll) first, before dealing with the rest of the abbreviated keys infrastructure. It appears to be a separate improvement which is worthwhile independently of what we do about that patch. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/12/2014 01:30 PM, Heikki Linnakangas wrote:
>
> Performance was one argument for sure. It's not hard to come up with a
> case where the all-lengths approach is much slower: take a huge array
> with, say, million elements, and fetch the last element in a tight loop.
> And do that in a PL/pgSQL function without storing the datum to disk, so
> that it doesn't get toasted. Not a very common thing to do in real life,
> although something like that might come up if you do a lot of json
> processing in PL/pgSQL. but storing offsets makes that faster.

While I didn't post the results (because they were uninteresting), I did specifically test the "last element" in a set of 200 elements for all-lengths vs. original offsets for JSONB, and the results were not statistically different. I did not test against your patch; is there some reason why your patch would be faster for the "last element" case than the original offsets version? If not, I think the corner case is so obscure as to be not worth optimizing for. I can't imagine that more than a tiny minority of our users are going to have thousands of keys per datum. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb contains behaviour weirdness
On 09/12/2014 01:33 PM, Tom Lane wrote: > Josh Berkus writes: >> However, this better become a FAQ item, because it's not necessarily the >> behavior that folks used to JSON but not Postgres will expect. > > No, it's a bug, not a documentation deficiency. Hmmm? Are you proposing that we should change how ARRAY @> ARRAY works for non-JSON data? Or are you proposing that JSONARRAY @> JSONARRAY should work differently from ARRAY @> ARRAY? Or are you agreeing with Peter that it's the first case that's a bug, and the 2nd two tests are correct, and just being unclear? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench throttling latency limit
On Mon, Sep 15, 2014 at 6:34 AM, Heikki Linnakangas wrote: > Please have a look. I have not looked at the docs changes yet. > > One thing that needs some thinking and changing is the progress reporting. > It currently looks like this: > > progress: 1.0 s, 4863.0 tps, lat 3.491 ms stddev 2.487, lag 1.809 ms, 99 > skipped > progress: 2.0 s, 5042.8 tps, lat 3.265 ms stddev 2.264, lag 1.584 ms, 16 > skipped > progress: 3.0 s, 4926.1 tps, lat 2.731 ms stddev 2.371, lag 1.196 ms, 45 > skipped > progress: 4.0 s, 4963.9 tps, lat 1.904 ms stddev 1.212, lag 0.429 ms, 0 > skipped > progress: 5.0 s, 4971.2 tps, lat 2.592 ms stddev 1.722, lag 0.975 ms, 0 > skipped > > The absolute number of skipped transactions doesn't make much sense when all > the other numbers are averages, and tps is a 1/s value. If you don't know > the total number of transactions executed, the absolute number is > meaningless. (Although you can calculate the absolute number of transactions > executed by multiplying the TPS value with the interval). I think a > percentage would be better here. > > Should we also print the number of late transactions here? I think that > would be an even more important detail than the number of skipped > transactions. It might be better to print only the percentage of late > transactions, including skipped ones. Or both, but it's difficult to cram > everything on a single line. This needs some further thinking.. I'm not sure I like the idea of printing a percentage. It might be unclear what the denominator was if somebody feels the urge to work back to the actual number of skipped transactions. I mean, I guess it's probably just the value you passed to -R, so maybe that's easy enough, but then why bother dividing in the first place? The user can do that easily enough if they want the data that way. I agree with you that it would be good to get some statistics on late/skipped transactions, but it's not obvious what people will want. Late transactions, straight up? 
Late by more than a threshold value? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Postgres code for a query intermediate dataset
On Sun, Sep 14, 2014 at 5:18 AM, Rohit Goyal wrote:
> Thanks for the reply. But I think I confused you. I am talking about access
> using indexes. So, I assume that the B+ tree stores key-value pairs where
> 'rohit' is the key and all the versions are its value.
>
> Another way to think of it: I have a secondary index on emp.name and there
> are 4 'rohit' rows in the DB. So, the B+ tree gives me 4 different tuple
> pointers for the key 'rohit'. I want to know the code portion where I can
> see all 4 tuple pointers before any I/O access is done to fetch the tuples.

You may want to look at index_getnext(), index_getnext_tid(), and/or heap_hot_search_buffer(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Sun, Sep 14, 2014 at 4:37 PM, Simon Riggs wrote: > On 12 September 2014 18:19, Simon Riggs wrote: >> On 12 September 2014 15:30, Tom Lane wrote: > >>> After a little bit I remembered there was already a function for this. >>> So specifically, I'd suggest using ExecRelationIsTargetRelation() >>> to decide whether to mark the scan as requiring pruning. >> >> Sounds cool. Thanks both, this is sounding like a viable route now. > > Yes, this is viable. > > Patch attached, using Alvaro's idea of use-case specific pruning and > Tom's idea of aiming at target relations. Patch uses or extends > existing infrastructure, so its shorter than it might have been, yet > with all that bufmgr yuck removed. > > This is very, very good because while going through this I notice the > dozen or more places where we were pruning blocks in annoying places I > didn't even know about such as about 4-5 constraint checks. In more > than a few DDL commands like ALTER TABLE and CLUSTER we were even > pruning the old relation prior to rewrite. Do we really want to disable HOT for all catalog scans? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] .ready files appearing on slaves
Hi hackers, An issue that seems related to this has been posted on pgsql-admin. See: http://www.postgresql.org/message-id/caas3tylnxaydz0+zhxlpdvtovhqovr+jsphp30o8kvwqqs0...@mail.gmail.com How can we help on this issue? Cheers, On Thu, 4 Sep 2014 17:50:36 +0200 Jehan-Guillaume de Rorthais wrote:
> Hi hackers,
>
> For a few months, we have occasionally seen .ready files appearing on some slave
> instances in various contexts. The two I have in mind are under 9.2.x.
>
> I tried to investigate a bit. These .ready files are created when a WAL file
> from pg_xlog has no corresponding file in pg_xlog/archive_status. I could
> easily reproduce this by deleting such a file: it is created again at the
> next restartpoint or checkpoint received from the master.
>
> Looking at the WAL files in the pg_xlog folder corresponding to these .ready
> files, they are all much older than the current WAL "cycle" in both mtime and
> name sequence. For instance, on one of these boxes we currently have 6 of
> those "ghost" WALs:
>
> 00021E5300FF
> 00021F1800FF
> 0002204700FF
> 000220BF00FF
> 0002214000FF
> 0002237000FF
> 0002255D00A8
> 0002255D00A9
> [...normal WAL sequence...]
> 0002255E009D
>
> And on another box:
>
> 0001040E00FF
> 0001041400DA
> 0001046E00FF
> 0001047000FF
> 00010485000F
> 000104850010
> [...normal WAL sequence...]
> 000104850052
>
> So it seems that for some reason these old WALs were "forgotten" by the
> restartpoint mechanism when they should have been recycled/deleted.
>
> For one of these servers, I could correlate this with some brutal
> disconnections of the streaming replication appearing in its logs. But there
> was no known SR disconnection on the second one.
>
> Any idea about this weird behaviour? What can we do to help you investigate
> further?
>
> Regards, -- Jehan-Guillaume de Rorthais Dalibo http://www.dalibo.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Collation-aware comparisons in GIN opclasses
Hackers, some GIN opclasses use collation-aware comparisons even though they don't actually need collation-aware comparison at all. Examples are the text[] and hstore opclasses. Depending on the collation, this can make them much slower. See this example:

# show lc_collate ;
 lc_collate
─────────────
 ru_RU.UTF-8
(1 row)

# create table test as (select array_agg(i::text) from generate_series(1,100) i group by (i-1)/10);
SELECT 10
# create index test_idx on test using gin(array_agg);
CREATE INDEX
Time: 26930,423 ms
# create index test_idx2 on test using gin(array_agg collate "C");
CREATE INDEX
Time: 5143,682 ms

Index creation with collation "ru_RU.UTF-8" is 5 times slower, even though the collation has absolutely no effect on the index's functionality. However, we can't simply replace the comparison function for those opclasses, because that would break binary compatibility for pg_upgrade. I see the following solution:

1. Rename such opclasses and make them non-default.
2. Create new default opclasses with bitwise comparison functions.
3. Add a recommendation to the documentation to re-create existing indexes with the default opclasses.

-- With best regards, Alexander Korotkov.
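The proposal above relies on the fact that GIN only needs some consistent total order over keys, not a linguistically correct one. A minimal, language-neutral sketch of the bitwise comparison idea (Python for brevity; a real opclass support function would be C using memcmp()):

```python
def bitwise_cmp(a: bytes, b: bytes) -> int:
    """Compare two values byte by byte, like memcmp() under the "C"
    collation: cheap, locale-independent, and still a perfectly valid
    total order for organizing and looking up index keys."""
    if a == b:
        return 0
    return -1 if a < b else 1

# Any consistent order works for key lookup; it just won't match a
# linguistic (e.g. ru_RU.UTF-8) sort order, which GIN never exposes anyway.
```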
Re: [HACKERS] delta relations in AFTER triggers
Tom Lane wrote:
> Heikki Linnakangas writes:
>> On 08/30/2014 12:15 AM, Kevin Grittner wrote:
>>> If we were to go with the hooks as you propose, we would still need
>>> to take the information from TriggerData and put it somewhere else
>>> for the hook to reference.
>
>> Sure.
>
> FWIW, I agree with Heikki on this point. It makes a lot more sense for
> the parser to provide hooks comparable to the existing hooks for resolving
> column refs, and it's not apparent that loading such functionality into
> SPI is sane at all.
>
> OTOH, I agree with Kevin that the things we're talking about are
> lightweight relations, not variables.

Try as I might, I was unable to find any sensible way to use hooks. If the issue were only the parsing, the route would be fairly obvious, but the execution side needs to access the correct tuplestore(s) for each run, too -- so the sort of information provided by relcache needed to be passed in based on the context within the process (i.e., if you have nested triggers firing, each needs to use a different tuplestore with the same name in the same function, even though it's using the same plan). On both sides it seemed easier to pass things through the same sorts of techniques as "normal" parameters; I couldn't find a way to use hooks that didn't just make things uglier.

I see the downside of making this a feature which can only be used from SPI, so I've updated the patch to allow it from other contexts. On the other hand, I see many uses for it where SPI *is* used, and the SPI interface needs to change to support that. The functions I had added are one way to do that on the parsing/planning side without breaking any existing code. The same information (i.e., the metadata) needs to be passed to the executor along with references to the actual tuplestores, and that needs to go through a different path -- at least for the plpgsql usage.
I broke out the changes from the previous patch into multiple commits in my repository on github:

commit 2520db8fbb41c68a82c2c750c8543154c6d85eb9
Author: Kevin Grittner
Date: Mon Sep 15 01:17:14 2014 -0500

    Use executor state rather than SPI to get named tuplestores.

    spi.c and trigger.c will need a little more clean-up to match this,
    and it's likely that not all places that need to pass the baton are
    doing so, but at least this passes regression tests.

commit de9067258125226d1625f160c3eee9aff90ca598
Author: Kevin Grittner
Date: Sun Sep 14 22:01:04 2014 -0500

    Pass the Tsrcache through ParserState to parse analysis.

commit e94779c1e22ec587446a7aa2593ba5f102b6a28b
Author: Kevin Grittner
Date: Sun Sep 14 19:23:45 2014 -0500

    Modify SPI to use Tsrcache instead of List.

commit 7af841881d9113eb4c8dca8e82dc1867883bf75d
Author: Kevin Grittner
Date: Sun Sep 14 18:45:54 2014 -0500

    Create context-specific Tsrcache. Not used yet.

commit 35790a4b6c236d09e0be261be9b0017d34eaf9c9
Author: Kevin Grittner
Date: Sun Sep 14 16:49:37 2014 -0500

    Create Tsrmd structure so tuplestore.h isn't needed in parser.

commit 93d57c580da095b71d9214f69fede71d2f8ed840
Author: Kevin Grittner
Date: Sun Sep 14 15:59:45 2014 -0500

    Improve some confusing naming for TuplestoreRelation node.

Anyone who has already reviewed the earlier patch may want to look at these individually: https://github.com/kgrittn/postgres/compare/transition

Attached is a new patch including all of that. Hopefully I haven't misunderstood what Heikki and Tom wanted; if I have, just let me know where I went wrong.

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

trigger-transition-tables-v4.patch.gz Description: application/tgz
Re: [HACKERS] orangutan seizes up during isolation-check
On Mon, Sep 15, 2014 at 10:11:57AM +0300, Heikki Linnakangas wrote: > On 09/15/2014 07:51 AM, Noah Misch wrote: > >libintl replaces setlocale(). Its setlocale(LC_x, "") uses OS-specific APIs > >to determine the default locale when $LANG and similar environment variables > >are empty, as they are during "make check NO_LOCALE=1". On OS X, it calls[1] > >CFLocaleCopyCurrent(), which in turn spins up a thread. See the end of this > >message for the postmaster thread stacks active upon hitting a breakpoint set > >at _dispatch_mgr_thread. > > Ugh. I'd call that a bug in libintl. setlocale() has no business to > make the process multi-threaded. Fair interpretation. libintl's options for mitigating the problem are even more constrained, unfortunately. > Do we have the same problem in backends? At a quick glance, aside > from postmaster we only use PG_SETMASK(&BlockSig) in signal > handlers, to prevent another signal handler from running > concurrently. In backends, we pass a specific value, not "", to pg_perm_setlocale(). (I expect the problem in EXEC_BACKEND builds, but I doubt anyone uses that as a production configuration on OS X.) Every frontend program does call setlocale(LC_ALL, "") via set_pglocale_pgservice(). None use sigprocmask(), but a few do use fork(). > >1. Fork, call setlocale(LC_x, "") in the child, pass back the effective > >locale > >name through a pipe, and pass that name to setlocale() in the original > >process. The short-lived child will get the extra threads, and the > >postmaster will remain clean. > >I'm skeptical of the value of looking up locale information using other OS X > >facilities when the usual environment variables are inconclusive, but I see > >no > >clear cause to reverse that decision now. I lean toward (1). > > Both of those are horrible hacks. And who's to say that calling > setlocale(LC_x, "foo") won't also call some function that makes the > process multi-threaded. 
If not in any current OS X release, it might > still happen in a future one. Right. Not just setlocale(); a large body of APIs could change that way. The CAVEATS section[1] of the OS X fork() manual page suggests Apple had that in mind at some point. To be 100% safe, we would write the postmaster as though it's always multithreaded, including making[2] EXEC_BACKEND the only supported configuration on OS X. Assuming we don't do that anytime soon, I did have in mind to make the postmaster raise an error if it becomes multithreaded. (Apple's pthread_is_threaded_np() is usable for this check.) Then, at least, we'll more-easily hear about new problems of this nature. > One idea would be to use an extra pthread mutex or similar, in > addition to PG_SETMASK(). Whenever you do PG_SETMASK(&BlockSig), > also grab the mutex, and release it when you do > PG_SETMASK(&UnBlockSig). I had considered and tested such a thing, and it sufficed to clear orangutan's present troubles. You still face undefined behavior after fork(). It's okay in practice if libdispatch threads are careful to register pthread_atfork() handlers for any relevant resources. It's a fair option, but running setlocale(LC_x, "") in a short-lived child assumes less about the system library implementation, seems no more of a hack to me, and isolates the overhead at postmaster start. [1] https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/fork.2.html [2] http://www.postgresql.org/message-id/1349280184-sup-...@alvh.no-ip.org -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
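Option (1) above, determining the default locale in a short-lived child so that any threads spawned by the lookup die with the child, can be sketched as follows. This is an illustrative POSIX sketch (in Python rather than the C the postmaster would use), not the actual proposal's code:

```python
import locale
import os

def default_locale_via_child() -> str:
    """Run setlocale(LC_ALL, "") in a forked child and pass the
    effective locale name back through a pipe.  Any threads spawned
    during the lookup (e.g. by CFLocaleCopyCurrent() on OS X) exist
    only in the child, which exits immediately; the parent then sets
    the locale by explicit name, which needs no OS-specific lookup."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                              # child
        os.close(r)
        name = locale.setlocale(locale.LC_ALL, "")
        os.write(w, name.encode())
        os._exit(0)
    os.close(w)                               # parent
    name = b""
    while True:
        chunk = os.read(r, 4096)
        if not chunk:
            break
        name += chunk
    os.close(r)
    os.waitpid(pid, 0)
    # Parent sets the locale by name, not "", so no extra threads appear.
    return locale.setlocale(locale.LC_ALL, name.decode())
```

The parent process stays single-threaded; only the throwaway child pays the cost, and only once at startup.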
Re: [HACKERS] Triconsistent catalog declaration
On 09/15/2014 04:58 PM, Alexander Korotkov wrote:
> Hi! 9.4 GIN introduces the new triconsistent method, which can return one of
> three values, unlike the plain consistent method. But in the catalog
> declaration triconsistent is still declared as returning bool. It doesn't
> affect anything, because nobody calls triconsistent from SQL.

Good point.

> But I think it would be correct to declare it as returning int2.

No, int2 is two bytes wide, while the return value is a single byte. I guess "char" would be the right type. That makes for somewhat awkward input and output from psql, since the values used are 0, 1, 2 rather than ASCII characters. But that's OK, because as you said these functions are not callable from psql anyway, as they have "internal" arguments.

This requires a catalog change to fix. Are we still planning to do a catversion bump for 9.4 because of the jsonb changes?

- Heikki
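The values 0, 1, 2 discussed above line up with the boolean encoding; as noted elsewhere in the thread, a boolean can be safely cast to the ternary type. A small illustration of that invariant (the names mirror the in-tree GIN_FALSE/GIN_TRUE/GIN_MAYBE definitions, but this is a sketch, not the C declarations):

```python
# Sketch of the GIN ternary logic values.  In C, GinTernaryValue is a
# single byte ("char") whose values line up with the boolean encoding.
GIN_FALSE = 0   # same representation as boolean false
GIN_TRUE  = 1   # same representation as boolean true
GIN_MAYBE = 2   # "don't know / don't care"; no boolean counterpart

def from_bool(b: bool) -> int:
    """Casting bool -> ternary is value-preserving, which is why the
    values 0, 1, 2 were chosen instead of ASCII characters."""
    return int(b)
```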
[HACKERS] Triconsistent catalog declaration
Hi!

9.4 GIN introduces the new triconsistent method, which can return one of three values, unlike the plain consistent method. But in the catalog declaration triconsistent still returns the bool type. It doesn't affect anything, because nobody calls triconsistent from SQL. But I think it would be correct to declare it as returning int2.

-- With best regards, Alexander Korotkov.

tri-consistent-catalog.patch Description: Binary data
Re: [HACKERS] implement subject alternative names support for SSL connections
On 09/15/2014 01:44 PM, Alexey Klyukin wrote:
>> Committed, with that change, ie. the CN is not checked if SANs are present.
>
> Actually, I disagree with the way the patch ignores the CN. Currently, it
> skips the CN unconditionally if the SubjectAltName section is present. But
> what RFC 6125 says is:
>
> "If a subjectAltName extension of type dNSName is present, that MUST be
> used as the identity. Otherwise, the (most specific) Common Name field in
> the Subject field of the certificate MUST be used."
>
> This means that we have to check that at least one dNSName resource is
> present before declining to examine the CN. Attached is a one-liner
> (excluding comments) that fixes this.

Ok, good catch. Fixed.

- Heikki
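The logic of the fix can be modeled simply: count the dNSName entries actually examined in the SAN section, and fall back to the CN only when that count is zero, so a certificate whose SANs contain only non-dNSName entries (say, IP addresses) still gets its CN checked. A simplified, hypothetical model of the decision, not the libpq code itself:

```python
def names_to_verify(dns_sans, other_sans, common_name):
    """Per RFC 6125: if any dNSName SAN is present, those names are the
    identity and the CN must be ignored; otherwise fall back to the
    (most specific) Common Name.  Non-dNSName SANs (other_sans) do not
    count toward suppressing the CN."""
    names_examined = list(dns_sans)   # only dNSName entries are examined
    if names_examined:                # i.e. names_examined != 0 in the patch
        return names_examined
    return [common_name]
```

The original commit's bug, in these terms, was falling back to the CN only when the whole SAN section was absent, rather than when no dNSName entries were examined.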
Re: [HACKERS] [v9.5] Custom Plan API
> On Thu, Sep 11, 2014 at 8:40 PM, Kouhei Kaigai wrote:
>> On Thu, Sep 11, 2014 at 11:24 AM, Kouhei Kaigai wrote:
>> >> Don't the changes to src/backend/optimizer/plan/createplan.c
>> >> belong in patch #2?
>> >>
>> > The borderline between #1 and #2 is a little bit fuzzy. So, I moved
>> > most of it into #1; however, the invocation of InitCustomScan
>> > (that is, a callback in CustomPlanMethod) in create_custom_plan() is still in #2.
>>
>> Eh, create_custom_scan() certainly looks like it is in #1 from here,
>> or at least part of it is. It calculates tlist and clauses and then
>> does nothing with them. That clearly can't be the right division.
>>
>> I think it would make sense to have create_custom_scan() compute
>> tlist and clauses first, and then pass those to CreateCustomPlan().
>> Then you don't need a separate InitCustomScan() - which is misnamed
>> anyway, since it has nothing to do with ExecInitCustomScan().
>>
> The only reason why I put separate hooks here is that create_custom_scan()
> needs to know the exact size of the CustomScan node (including private
> fields); however, it is helpful for extensions to have their callback invoked
> to initialize the node right after the common initialization stuff.

> Why does it need to know that? I don't see that it's doing anything that
> requires knowing the size of that node, and if it is, I think it shouldn't
> be. That should get delegated to the callback provided by the custom plan
> provider.

Sorry, my explanation may have been confusing. create_custom_scan() does not need to know the exact size of the CustomScan (or a node derived from it), because of the two separate hooks: one at node allocation time, the other at the tail of the series of initialization steps.
If we have only one hook here, we need a mechanism to inform create_custom_scan() of the exact size of the CustomScan node, including the private fields managed by the provider, instead of the first hook at node allocation time. In that case, node allocation would have to be performed by create_custom_scan(), and it would have to know the exact size of the node to be allocated. How would I implement the feature in that case? Is the combination of a static node size and a callback at the tail really simpler than the existing design that takes two individual hooks in create_custom_scan()?

> > Regarding the naming, how about GetCustomScan() instead of InitCustomScan()?
> > It follows the manner of create_foreignscan_plan().
>
> I guess that's a bit better, but come to think of it, I'd really like to
> avoid baking in the assumption that the custom path provider has to return
> any particular type of plan node. A good start would be to give it a name
> that doesn't imply that - e.g. PlanCustomPath().

OK, I'll use this naming.

> >> > OK, I revised. Now custom-scan assumes it has a particular valid
> >> > relation to be scanned, so there is no code path with scanrelid == 0 at this moment.
> >> >
> >> > Let us revisit this scenario when custom-scan replaces relation joins.
> >> > In that case, custom-scan will not be associated with a particular
> >> > base relation, thus it needs to admit a custom-scan node with
> >> > scanrelid == 0.
> >>
> >> Yeah, I guess the question there is whether we'll want to let CustomScan
> >> have scanrelid == 0 or require that CustomJoin be used there instead.
>
> Right now, I cannot imagine a use case that requires an individual
> CustomJoin node, because CustomScan with scanrelid == 0 (which actually
> performs like a custom plan rather than a custom scan) is sufficient.
>
> If a CustomScan gets chosen instead of the built-in join logic, it would
> look like a relation scan on a virtual relation that consists of the two
> underlying relations.
Callbacks of the CustomScan have the responsibility
> to join the underlying relations; that is invisible to the core executor.
>
> It seems to me CustomScan with scanrelid == 0 is sufficient to implement
> alternative logic for relation joins; we don't need an individual node
> from the standpoint of the executor.

> That's valid logic, but it's not the only way to do it. If we have CustomScan
> and CustomJoin, either of them will require some adaptation to handle this
> case. We can either allow a custom scan that isn't scanning any particular
> relation (i.e. scanrelid == 0), or we can allow a custom join that has no
> children. I don't know which way will come out cleaner, and I think it's
> good to leave that decision to one side for now.

Yep. I agree with you. It may not be productive to try to settle this design topic right now. Let's assume CustomScan scans a particular relation (scanrelid != 0) in the first revision.

> >> >> Why can't the Custom(GpuHashJoin) node build the hash table
> >> >> internally instead of using a separate node?
> >> >>
> >> > It's possible; however, it prevents checking sub-plans using
> >> > EXPLAIN if we manage inner plans internally. So, I'd like to have a separa
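The allocation question in this subthread (does the core need to know the provider's node size up front, or can allocation be delegated entirely to a provider callback?) can be modeled abstractly. A toy sketch of the single-callback approach Robert favors; apart from the names CustomScan and PlanCustomPath taken from the discussion, everything here is illustrative, not the actual API:

```python
class CustomScan:
    """Stand-in for the core CustomScan plan node: common fields only."""
    def __init__(self, tlist, clauses):
        self.tlist = tlist
        self.clauses = clauses

class GpuScan(CustomScan):
    """A provider's extended node.  The core never needs to know its
    size, because the provider's callback performs the allocation."""
    def __init__(self, tlist, clauses, device_id):
        super().__init__(tlist, clauses)
        self.device_id = device_id

def plan_custom_path(provider_callback, tlist, clauses):
    """Core computes tlist/clauses, then delegates node creation
    entirely to the provider (the PlanCustomPath() idea): one hook,
    no size negotiation."""
    node = provider_callback(tlist, clauses)
    assert isinstance(node, CustomScan)   # must extend the common node
    return node

node = plan_custom_path(lambda t, c: GpuScan(t, c, device_id=0),
                        tlist=['x'], clauses=[])
```

The two-hook design, by contrast, would split this into an allocation callback plus a later initialization callback; the sketch shows why a single callback can subsume both when the provider owns the allocation.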
Re: [HACKERS] PoC: Partial sort
On Sun, Sep 14, 2014 at 9:32 AM, Peter Geoghegan wrote:
> I think we might be better off if a tuplesort function was called
> shortly after tuplesort_begin_heap() is called. How top-n heap sorts
> work is something that largely lives in tuplesort's head. Today, we
> call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
> top-n heap sort applicable sort". I think that with this patch, we
> should then hint (where applicable) "by the way, you won't actually be
> required to sort those first n indexed attributes; rather, you can
> expect to scan those in logical order. You may work the rest out
> yourself, and may be clever about exploiting the sorted-ness of the
> first few columns". The idea of managing a bunch of tiny sorts from
> within ExecSort(), and calling the new function tuplesort_reset(), seems
> questionable. tuplesortstate is supposed to be private/opaque to
> nodeSort.c, and the current design strains that.
>
> I'd like to keep nodeSort.c simple. I think it's pretty clear that the
> guts of this do not belong within ExecSort(), in any case. Also, the
> additions there should be much better commented, wherever they finally
> end up.

As I understand it, you propose to encapsulate the partial sort algorithm in tuplesort. However, in that case we need a significant change to its interface anyway: let tuplesort decide when it's able to return a tuple. Otherwise, we would miss a significant part of the LIMIT clause optimization. tuplesort_set_bound() can't solve all the cases: there could be other planner nodes between the partial sort and the LIMIT.

-- With best regards, Alexander Korotkov.
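For readers following the thread: in a partial sort, tuples arrive already ordered by the leading (indexed) columns, so only each group of equal leading keys needs sorting on the remaining columns, and a LIMIT can stop as soon as enough complete groups have been emitted. A hedged illustration of that algorithm, not the patch's code:

```python
from itertools import groupby

def partial_sort(rows, presorted_key, rest_key, limit=None):
    """rows are already sorted by presorted_key (e.g. coming from an
    index scan).  Sort each equal-key group by rest_key, emitting
    tuples as soon as a group is complete.  That is why LIMIT can stop
    early: later groups are never sorted at all -- the optimization
    the interface discussion above is about preserving."""
    out = []
    for _, group in groupby(rows, key=presorted_key):
        out.extend(sorted(group, key=rest_key))
        if limit is not None and len(out) >= limit:
            return out[:limit]
    return out
```

This corresponds to a query like ORDER BY v1, v2 LIMIT n with an index on v1 alone: the "bunch of tiny sorts" are the per-group sorted() calls.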
Re: [HACKERS] PoC: Partial sort
On Sun, Sep 14, 2014 at 7:39 AM, Peter Geoghegan wrote:
> On Fri, Sep 12, 2014 at 2:19 PM, Alexander Korotkov wrote:
>> Actually, higher cardinality skip columns is better. Sorting of smaller
>> groups is faster than sorting larger groups of the same total size. Also,
>> with smaller groups you reach the limit more accurately (on average), i.e.
>> you sort a smaller number of total rows.
>
> Higher cardinality leading key columns are better for what? Do you
> mean they're better in that those cases are more sympathetic to your
> patch, or better in the general sense that they'll perform better for
> the user? The first example query that you chose to demonstrate your
> patch had a leading, indexed column (column "v1") with much lower
> cardinality/n_distinct than the column that had to be sorted on
> (column "v2"). That was what my comments were based on.
>
> When this feature is used, there will often be a significantly lower
> cardinality in the leading, indexed column (as in your example).
> Otherwise, the user might well have been happy to just order on the
> indexed/leading column alone. That isn't definitely true, but it's
> often true.

I just meant that higher cardinality makes the partial sort itself cheaper. We may have misunderstood each other because the notions "high" and "low" are relative. When the cardinality is 1, partial sort seems to be useless. When the cardinality is low, it could still be cheaper to do a sequential scan + sort rather than an index scan + partial sort. When the cardinality is close to the number of rows, then, as you mentioned, the user probably doesn't need to sort by the rest of the columns; at least one exception is when the user needs to force uniqueness of the order. So, we need to target something in the middle of these two corner cases.
> >> I'm not sure if that's worth it to more or less
> >> duplicate heap_tuple_attr_equals() to save a "mere" n expensive
> >> comparisons, but it's something to think about (actually, there are
> >> probably less than even n comparisons in practice because there'll be
> >> a limit).
> >
> > Not correct. Smaller groups are not OK. Imagine that two representations
> > of the same skip column value exist. The index may return them in any
> > order, even alternating between them. In this case sorting on the other
> > column never takes place, while it should.
>
> I think I explained this badly - it wouldn't be okay to make the
> grouping smaller, but as you say we could tie-break with a proper
> B-Tree support function 1 comparison to make sure we really had
> reached the end of our grouping.
>
> FWIW I want all bttextcmp()/varstr_cmp() comparisons to try a memcmp()
> first, so the use of the equality operator with text in mind that you
> mention may soon not be useful at all. The evidence suggests that
> memcmp() is so much cheaper than special preparatory NUL-termination +
> strcoll() that we should always try it first when sorting text, even
> when we have no good reason to think it will work. The memcmp() is
> virtually free. And so, you see why it might be worth thinking about
> this when we already have reasonable confidence that many comparisons
> will indicate that datums are equal. Other datatypes will have
> expensive "real" comparators, not just text or numeric.

When strings are not equal, bttextcmp still needs to use strcoll, while texteq doesn't. So there would still be an advantage to using the equality operator over the comparison function: the equality operator doesn't have to order unequal values. However, the real cost of this advantage depends on the cardinality of the presorted columns as well.

-- With best regards, Alexander Korotkov.
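The memcmp()-before-strcoll() idea can be made concrete: byte-identical strings are equal with no locale work at all, and only unequal strings need the expensive collation-aware comparison to be ordered, with a bytewise tie-break in case the locale considers distinct strings equal. A language-neutral sketch (locale.strcoll stands in for the expensive C strcoll path):

```python
import locale

def text_cmp(a: bytes, b: bytes) -> int:
    """Try the cheap bytewise check first: byte-identical strings are
    equal with no NUL-termination or strcoll() work (the "virtually
    free" memcmp() fast path).  Only unequal strings fall through to
    the collation-aware comparison needed to order them."""
    if a == b:
        return 0
    r = locale.strcoll(a.decode(), b.decode())
    if r == 0:
        # Some locales consider distinct strings equal; tie-break
        # bytewise so the comparator stays a total order.
        return -1 if a < b else 1
    return -1 if r < 0 else 1
```

This also shows why the equality-operator advantage Alexander mentions shrinks: with the fast path in the comparator itself, equal values never reach strcoll either way.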
Re: [HACKERS] [bug fix or improvement?] Correctly place DLLs for ECPG apps in bin folder
This patch has been sitting in the commitfest with no activity since this email from July. What's the real status? Who's got the ball on this one? Please update the commitfest app accordingly. If nothing happens in the next few days, I'll mark this as Returned with Feedback.

On 07/09/2014 03:53 PM, MauMau wrote:
> From: "Alvaro Herrera"
>> I think the suggestion by Peter Eisentraut upthread was pretty reasonable
>> -- the Makefiles are already aware that they are building a shared
>> library, so scrape the info off them. The less hardcoded stuff in the
>> msvc framework, the better.
>
> I overlooked his mail. Do you mean something like this? (I don't know yet
> how to express this in Perl)
>
> for each Makefile under top_dir_in_source_tree or src/interfaces {
>     if (Makefile contains SO_MAJOR_VERSION) {
>         extract NAME value from Makefile
>         move lib/lib${NAME}.dll to bin/
>     }
> }
>
> But the source tree contains as many as 206 Makefiles. I'm worried that
> examining all of them will slow down installation. Should all these
> Makefiles be examined, or only the 16 Makefiles in src/interfaces/?
>
> Regards MauMau

- Heikki
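MauMau's pseudocode translates almost directly, and restricting the scan to a few root directories addresses the 206-Makefiles worry. A hypothetical sketch (in Python rather than the build system's Perl; the NAME/SO_MAJOR_VERSION patterns are assumptions about the Makefile format):

```python
import os
import re

def dlls_to_move(roots):
    """Scan Makefiles for shared libraries (marked by SO_MAJOR_VERSION),
    extract their NAME, and return the lib -> bin moves MauMau sketched.
    Passing only e.g. src/interfaces as a root keeps the scan cheap."""
    moves = []
    for root in roots:
        for dirpath, _, files in os.walk(root):
            if 'Makefile' not in files:
                continue
            with open(os.path.join(dirpath, 'Makefile')) as f:
                text = f.read()
            if 'SO_MAJOR_VERSION' not in text:
                continue
            m = re.search(r'^NAME\s*=\s*(\S+)', text, re.MULTILINE)
            if m:
                moves.append(('lib/lib%s.dll' % m.group(1), 'bin/'))
    return moves
```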
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On 09/14/2014 11:34 PM, Peter Geoghegan wrote:
> On Sun, Sep 14, 2014 at 7:37 AM, Heikki Linnakangas wrote:
>> Both values vary in the range 5.9 - 6.1 s, so it's fair to say that the
>> useless memcmp() is free with these parameters. Is this the worst case
>> scenario?
>
> Other than pushing the differences much, much later in the strings (which
> you surely thought of already), yes.

Please test the absolute worst case scenario you can think of. As I said earlier, if you can demonstrate that the slowdown of that is acceptable, we don't even need to discuss how likely or realistic the case is.

- Heikki
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On 09/15/2014 02:42 AM, Arthur Silva wrote:
> Em 14/09/2014 12:21, "Andres Freund" escreveu:
>> On 2014-09-13 20:27:51 -0500, k...@rice.edu wrote:
>>> What we are looking for here is uniqueness, thus better error detection.
>>> Not avalanche effect, nor cryptographically secure, nor bit distribution.
>>> As far as I'm aware CRC32C is unbeaten collision-wise and time proven.
>
> I couldn't find tests with xxhash and crc32 on the same hardware, so I
> spent some time putting together a benchmark (see attachment; to run it,
> just start run.sh). I included a crc32 implementation using SSE4.2
> instructions (which work on pretty much any Intel processor built after
> 2008 and AMD built after 2012), a portable slice-by-8 software
> implementation, and xxhash, since it's the fastest software 32-bit hash I
> know of. Here are the results running the test program on my i5-4200M:
>
> crc sb8: 90444623 elapsed: 0.513688s speed: 1.485220 GB/s
> crc hw: 90444623 elapsed: 0.048327s speed: 15.786877 GB/s
> xxhash: 7f4a8d5 elapsed: 0.182100s speed: 4.189663 GB/s
>
> The hardware version is insanely fast and works on the majority of
> Postgres setups, and the fallback software implementation is 2.8x slower
> than the fastest 32-bit hash around. Hopefully it'll be useful in the
> discussion.
>
>> Note that all these numbers aren't fully relevant to the use case here.
>> For the WAL - which is what we're talking about and the only place where
>> CRC32 is used with high throughput - the individual parts of a record
>> are pretty darn small on average. So performance of checksumming small
>> amounts of data is more relevant. Mind, that's not likely to go for
>> CRC32, especially not slice-by-8. The cache footprint of the large
>> tables is likely going to be noticeable in non-micro benchmarks.
>
> Indeed, the small input sizes are something I was missing. Something more
> cache friendly would be better; it's just a matter of finding a better
> candidate.

It's worth noting that the extra tables that slicing-by-4 requires are *in addition to* the lookup table we already have.
And slicing-by-8 builds on the slicing-by-4 lookup tables. Our current algorithm uses a 1kB lookup table, slicing-by-4 a 4kB, and slicing-by-8 8kB. But the first 1kB of the slicing-by-4 lookup table is identical to the current 1kB lookup table, and the first 4kB of the slicing-by-8 are identical to the slicing-by-4 tables. It would be pretty straightforward to use the current algorithm when the WAL record is very small, and slicing-by-4 or slicing-by-8 for larger records (like FPWs), where the larger table is more likely to pay off. I have no idea where the break-even point is with the current algorithm vs. slicing-by-4 and a cold cache, but maybe we can get a handle on that with some micro-benchmarking. Although this is complicated by the fact that slicing-by-4 or -8 might well be a win even with very small records, if you generate a lot of them. - Heikki
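For reference, the 1kB table mentioned above corresponds to the classic one-byte-per-step (Sarwate) table-driven CRC-32: 256 four-byte entries. Slicing-by-4/-8 add further 256-entry tables so that 4 or 8 input bytes can be folded per iteration. A sketch of the base algorithm using the reflected polynomial 0xEDB88320 (the zlib/Ethernet CRC-32; shown for illustration rather than as PostgreSQL's exact WAL CRC variant):

```python
# Build the 256-entry (1 kB) lookup table for reflected CRC-32.
POLY = 0xEDB88320
TABLE = []
for n in range(256):
    c = n
    for _ in range(8):
        c = (c >> 1) ^ POLY if c & 1 else c >> 1
    TABLE.append(c)

def crc32(data: bytes, crc: int = 0) -> int:
    """One table lookup per input byte.  Slicing-by-N instead keeps N
    such tables and folds N bytes per iteration, trading table (cache)
    footprint for fewer iterations."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF
```

Because the extended slicing tables contain the base table as a prefix, a size-based switch between the algorithms, as suggested above, costs no extra memory for the small-record path.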
Re: [HACKERS] implement subject alternative names support for SSL connections
On Mon, Sep 15, 2014 at 10:23 AM, Alexey Klyukin wrote: > On Fri, Sep 12, 2014 at 4:20 PM, Heikki Linnakangas > wrote: > >>> Hmm. If that's what the browsers do, I think we should also err on the >>> side of caution here. Ignoring the CN is highly unlikely to cause anyone >>> a problem; a CA worth its salt should not issue a certificate with a CN >>> that's not also listed in the SAN section. But if you have such a >>> certificate anyway for some reason, it shouldn't be too difficult to get >>> a new certificate. Certificates expire every 1-3 years anyway, so there >>> must be a procedure to renew them anyway. >> >> >> Committed, with that change, ie. the CN is not checked if SANs are present. Actually, I disagree with the way the patch ignores the CN. Currently, it skips the CN unconditionally if the SubjectAltName section is present. But what RFC 6125 says is: "If a subjectAltName extension of type dNSName is present, that MUST be used as the identity. Otherwise, the (most specific) Common Name field in the Subject field of the certificate MUST be used." This means that we have to check that at least one dNSName resource is present before rejecting to examine the CN. Attached is a one-liner (excluding comments) that fixes this. Regards, Alexey diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c new file mode 100644 index 98d02b6..dd4fab8 *** a/src/interfaces/libpq/fe-secure-openssl.c --- b/src/interfaces/libpq/fe-secure-openssl.c *** verify_peer_name_matches_certificate(PGc *** 626,637 sk_GENERAL_NAME_free(peer_san); } /* !* If there is no subjectAltName extension, check the Common Name. * !* (Per RFC 2818 and RFC 6125, if the subjectAltName extension is present, * the CN must be ignored.) */ ! else { X509_NAME *subject_name; --- 626,637 sk_GENERAL_NAME_free(peer_san); } /* !* If there is no subjectAltName extension of type dNSName, check the Common Name. 
* !* (Per RFC 2818 and RFC 6125, if the subjectAltName extension of type dNSName is present, * the CN must be ignored.) */ ! if (names_examined == 0) { X509_NAME *subject_name; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench throttling latency limit
On 09/13/2014 11:25 AM, Fabien COELHO wrote:
> [about logging...] Here is an attempt at updating the log features,
> including the aggregate and sampling stuff, with skipped transactions
> under throttling. I moved the logging stuff into a function which is
> called when a transaction is skipped or finished.

Makes sense. I spent some time on this, and this is what I ended up with. Notable changes:

* I split this into two patches for easier review. The first patch contains the refactoring of the logging stuff, and the second patch contains the new functionality.
* I renamed many of the variables, to be more consistent with existing similar variables.
* Progress reporting was broken with !PTHREAD_FORK_EMULATION. Need to collect the number of skipped xacts from all threads.
* I renamed the long option to --latency-limit. --limit is too generic.

Please have a look. I have not looked at the docs changes yet.

One thing that needs some thinking and changing is the progress reporting. It currently looks like this:

progress: 1.0 s, 4863.0 tps, lat 3.491 ms stddev 2.487, lag 1.809 ms, 99 skipped
progress: 2.0 s, 5042.8 tps, lat 3.265 ms stddev 2.264, lag 1.584 ms, 16 skipped
progress: 3.0 s, 4926.1 tps, lat 2.731 ms stddev 2.371, lag 1.196 ms, 45 skipped
progress: 4.0 s, 4963.9 tps, lat 1.904 ms stddev 1.212, lag 0.429 ms, 0 skipped
progress: 5.0 s, 4971.2 tps, lat 2.592 ms stddev 1.722, lag 0.975 ms, 0 skipped

The absolute number of skipped transactions doesn't make much sense when all the other numbers are averages, and tps is a 1/s value. If you don't know the total number of transactions executed, the absolute number is meaningless. (Although you can calculate the absolute number of transactions executed by multiplying the TPS value with the interval.) I think a percentage would be better here.

Should we also print the number of late transactions here? I think that would be an even more important detail than the number of skipped transactions.
It might be better to print only the percentage of late transactions, including skipped ones. Or both, but it's difficult to cram everything on a single line. This needs some further thinking..

- Heikki

commit 6f1158e9a7e77710164d503b40d77fd9dcde7a08
Author: Heikki Linnakangas
Date:   Mon Sep 15 11:09:29 2014 +0300

    Refactor log-printing in pgbench to a separate function.

    The doCustom function was getting really unwieldy. Upcoming patches
    will add even more code there.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 087e0d3..0daf92f 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -347,6 +347,9 @@ static char *select_only = {
 static void setalarm(int seconds);
 static void *threadRun(void *arg);
 
+static void doLog(TState *thread, CState *st, FILE *logfile, instr_time *now,
+                  AggVals *agg);
+
 static void
 usage(void)
 {
@@ -1016,6 +1019,16 @@ doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile, AggVa
     PGresult   *res;
     Command   **commands;
     bool        trans_needs_throttle = false;
+    instr_time  now;
+
+    /*
+     * gettimeofday() isn't free, so we get the current timestamp lazily the
+     * first time it's needed, we reuse the same value throughout this
+     * function after that. This also ensures that e.g. the calculated
+     * latency reported in the log file and in the totals are the same.
+     * Zero means "not set yet".
+     */
+    INSTR_TIME_SET_ZERO(now);
 
 top:
     commands = sql_files[st->use_file];
@@ -1049,10 +1062,10 @@ top:
 
     if (st->sleeping)
     {                           /* are we sleeping? */
-        instr_time  now;
         int64       now_us;
 
-        INSTR_TIME_SET_CURRENT(now);
+        if (INSTR_TIME_IS_ZERO(now))
+            INSTR_TIME_SET_CURRENT(now);
         now_us = INSTR_TIME_GET_MICROSEC(now);
         if (st->txn_scheduled <= now_us)
         {
@@ -1074,11 +1087,6 @@ top:
 
     if (st->listen)
     {                           /* are we receiver? */
-        instr_time  now;
-        bool        now_valid = false;
-
-        INSTR_TIME_SET_ZERO(now);   /* initialize to keep compiler quiet */
-
         if (commands[st->state]->type == SQL_COMMAND)
         {
             if (debug)
@@ -1100,181 +1108,40 @@ top:
             {
                 int         cnum = commands[st->state]->command_num;
 
-                if (!now_valid)
-                {
+                if (INSTR_TIME_IS_ZERO(now))
                     INSTR_TIME_SET_CURRENT(now);
-                    now_valid = true;
-                }
                 INSTR_TIME_ACCUM_DIFF(thread->exec_elapsed[cnum],
                                       now, st->stmt_begin);
                 thread->exec_count[cnum]++;
             }
 
-            /* transaction finished: record latency under progress or throttling */
-            if ((progress || throttle_delay) && commands[st->state + 1] == NULL)
+            /* transaction finished: calculate latency and log the transaction */
+            if (commands[st->state + 1] == NULL)
             {
-                int64       latency;
-
-                if (!now_valid)
+                /* only calculate latency if an option is used that needs it. */
+                if (progress || throttle_delay)
                 {
-                    INSTR_TIME_SET_CURRENT(now);
-                    now_valid = true;
-                }
-
-                latency = INSTR_TIME_GET_MICROSEC(now) - st->txn_scheduled;
+                    int64       latenc
Re: [HACKERS] implement subject alternative names support for SSL connections
On Fri, Sep 12, 2014 at 4:20 PM, Heikki Linnakangas wrote:
>> Hmm. If that's what the browsers do, I think we should also err on the
>> side of caution here. Ignoring the CN is highly unlikely to cause anyone
>> a problem; a CA worth its salt should not issue a certificate with a CN
>> that's not also listed in the SAN section. But if you have such a
>> certificate anyway for some reason, it shouldn't be too difficult to get
>> a new certificate. Certificates expire every 1-3 years anyway, so there
>> must be a procedure to renew them anyway.
>
> Committed, with that change, ie. the CN is not checked if SANs are present.
>
> Thanks for bearing through all these iterations!

Great news! Thank you very much for devoting your time and energy to the review and providing such useful feedback!

On the CN thing, I don't have particularly strong arguments for either of the possible behaviors, so sticking to the RFC makes sense here.

Sincerely,
--
Alexey Klyukin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] orangutan seizes up during isolation-check
On 09/15/2014 07:51 AM, Noah Misch wrote:
> libintl replaces setlocale(). Its setlocale(LC_x, "") uses OS-specific
> APIs to determine the default locale when $LANG and similar environment
> variables are empty, as they are during "make check NO_LOCALE=1". On OS
> X, it calls[1] CFLocaleCopyCurrent(), which in turn spins up a thread.
> See the end of this message for the postmaster thread stacks active upon
> hitting a breakpoint set at _dispatch_mgr_thread.

Ugh. I'd call that a bug in libintl; setlocale() has no business making the process multi-threaded.

Do we have the same problem in backends? At a quick glance, aside from the postmaster we only use PG_SETMASK(&BlockSig) in signal handlers, to prevent another signal handler from running concurrently.

> I see two options for fixing this in pg_perm_setlocale(LC_x, ""):
>
> 1. Fork, call setlocale(LC_x, "") in the child, pass back the effective
> locale name through a pipe, and pass that name to setlocale() in the
> original process. The short-lived child will get the extra threads, and
> the postmaster will remain clean.
>
> 2. On OS X, check for relevant environment variables. Finding none, set
> LC_x=C before calling setlocale(LC_x, ""). A variation is to raise
> ereport(FATAL) if sufficient environment variables aren't in place.
> Either way ensures the libintl setlocale() will never call
> CFLocaleCopyCurrent(). This is simpler than (1), but it entails a
> behavior change: "LANG= initdb" will use LANG=C or fail rather than use
> the OS X user account locale. I'm skeptical of the value of looking up
> locale information using other OS X facilities when the usual environment
> variables are inconclusive, but I see no clear cause to reverse that
> decision now.
>
> I lean toward (1).

Both of those are horrible hacks. And who's to say that calling setlocale(LC_x, "foo") won't also call some function that makes the process multi-threaded? If not in any current OS X release, it might still happen in a future one.
One idea would be to use an extra pthread mutex or similar, in addition to PG_SETMASK(): whenever you do PG_SETMASK(&BlockSig), also grab the mutex, and release it when you do PG_SETMASK(&UnBlockSig).

It would be nice to stop doing non-trivial things in the signal handler in the first place. It's pretty scary, even though it works when the process is single-threaded. I believe the reason it's currently implemented like that are the same problems that the latch code solves with the self-pipe trick: select() is not interrupted by a signal on all platforms, and even if it were, you would need pselect(), which is not available (or does not work correctly even if it exists) on all platforms. I think we could use a latch in the postmaster too.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers