Re: [HACKERS] Support for N synchronous standby servers - take 2
On Wed, Jul 15, 2015 at 5:03 AM, Michael Paquier michael.paqu...@gmail.com wrote:
> Group labels are essential. OK, so this is leading us to the following points:
> - Use a JSON object to define the quorum/priority groups for the sync state.
> - Store it as a GUC, and use the check hook to validate its format, which is what we have now with s_s_names.
> - Rely on SIGHUP to maintain an in-memory image of the quorum/priority sync state.
> - Have the possibility to define group labels in this JSON blob, and be able to use those labels in a quorum or priority sync definition.
> - For backward-compatibility, use for example s_s_names = 'json' to switch to the new system.

Personally, I think we're going to find that using JSON for this rather than a custom syntax makes the configuration strings two or three times as long for no discernible benefit. But I just work here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 16 July 2015 at 18:27, Robert Haas robertmh...@gmail.com wrote:
> On Wed, Jul 15, 2015 at 5:03 AM, Michael Paquier michael.paqu...@gmail.com wrote:
>> Group labels are essential. OK, so this is leading us to the following points:
>> - Use a JSON object to define the quorum/priority groups for the sync state.
>> - Store it as a GUC, and use the check hook to validate its format, which is what we have now with s_s_names.
>> - Rely on SIGHUP to maintain an in-memory image of the quorum/priority sync state.
>> - Have the possibility to define group labels in this JSON blob, and be able to use those labels in a quorum or priority sync definition.
>> - For backward-compatibility, use for example s_s_names = 'json' to switch to the new system.
>
> Personally, I think we're going to find that using JSON for this rather than a custom syntax makes the configuration strings two or three times as long for

They may well be 2-3 times as long. Why is that a negative?

> no discernible benefit.

Benefits:
* More readable
* Easy to validate
* No additional code required in the server to support this syntax (so no bugs)
* Developers will immediately understand the format
* Easy to programmatically manipulate in a range of languages

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Thu, Jul 16, 2015 at 1:32 PM, Simon Riggs si...@2ndquadrant.com wrote:
>> Personally, I think we're going to find that using JSON for this rather than a custom syntax makes the configuration strings two or three times as long for
>
> They may well be 2-3 times as long. Why is that a negative?

In my opinion, brevity makes things easier to read and understand. We also don't support multi-line GUCs, so if your configuration takes 140 characters, you're going to have a very long line in your postgresql.conf (and in your pg_settings output, etc.)

> * No additional code required in the server to support this syntax (so no bugs)

I think you'll find that this is far from true. Presumably not just any arbitrary JSON object will be acceptable. You'll have to parse it as JSON, and then validate that it is of the expected form. It may not be MORE code than implementing a mini-language from scratch, but I wouldn't expect to save much.

> * Developers will immediately understand the format

I doubt it. I think any format that we pick will have to be carefully documented. People may know what JSON looks like in general, but they will not immediately know what bells and whistles are available in this context.

> * Easy to programmatically manipulate in a range of languages

I agree that JSON has that advantage, but I doubt that it is important here. I would expect that people might need to generate a new config string and dump it into postgresql.conf, but that should be easy with any reasonable format. I think it will be rare to need to parse the postgresql.conf string, manipulate it programmatically, and then put it back. As we've already said, most configurations are simple and shouldn't change frequently. If they're not, or they do, that's a problem in itself.

However, I'm not trying to ram my idea through; I'm just telling you my opinion.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
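For comparison's sake, a quorum/priority definition in the JSON form discussed upthread might look something like this. This is purely illustrative -- no format was agreed on in the thread, and the key names and group names here are invented:

```json
{
  "sync_standbys": {
    "quorum": 2,
    "groups": {
      "london": { "priority": ["node_a", "node_b"] },
      "nyc":    { "quorum": 1, "members": ["node_c", "node_d"] }
    }
  }
}
```

A hypothetical compact custom syntax for the same thing might be `2(london: node_a, node_b; nyc: 1(node_c, node_d))` -- which illustrates Robert's point about length and Simon's point about easy validation, respectively.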
Re: [HACKERS] [PATCH] Generalized JSON output functions
On Wed, Jul 15, 2015 at 12:58 PM, Andrew Dunstan and...@dunslane.net wrote:
> The approach taken was both invasive and broken.

Well, then let's not do it that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH] Generalized JSON output functions
On Wed, Jul 15, 2015 at 1:10 PM, Ryan Pedela rped...@datalanche.com wrote:
> Like I said previously, the situation with Javascript will hopefully be remedied in a few years with ES7 anyway.

I don't understand these issues in great technical depth, but if somebody is arguing that it's OK for PostgreSQL to be difficult to use for a certain category of user for several years until the next language rev becomes mainstream, then I disagree. The fact that somebody wrote a patch to try to solve a problem means that the thing in question is a problem for at least that one user. If he's the only one, maybe we don't need to care all that much. If his needs are representative of a significant user community, we should not turn our backs on that community, regardless of whether we like the patch he wrote, and regardless of how well we are meeting the needs of other communities (like node.js users).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH] Generalized JSON output functions
2015-07-16 19:51 GMT+02:00 Robert Haas robertmh...@gmail.com:
> On Wed, Jul 15, 2015 at 1:10 PM, Ryan Pedela rped...@datalanche.com wrote:
>> Like I said previously, the situation with Javascript will hopefully be remedied in a few years with ES7 anyway.
>
> I don't understand these issues in great technical depth, but if somebody is arguing that it's OK for PostgreSQL to be difficult to use for a certain category of user for several years until the next language rev becomes mainstream, then I disagree. The fact that somebody wrote a patch to try to solve a problem means that the thing in question is a problem for at least that one user. If he's the only one, maybe we don't need to care all that much. If his needs are representative of a significant user community, we should not turn our backs on that community, regardless of whether we like the patch he wrote, and regardless of how well we are meeting the needs of other communities (like node.js users).

I don't think this issue is so hot. How long have we supported XML? The output format is static - the date format is fixed. How many issues were there? Was there any issue that was not solvable by casting? If somebody needs different quoting, then it can be solved by an explicit cast in the SQL query, not by hacking our output routines.

Regards

Pavel
[HACKERS] Bugs in our qsort implementation
I've been trying to figure out the crash in qsort reported here: http://www.postgresql.org/message-id/flat/cal8hzunr2fr1owzhwg-p64gjtnfbbmpx1y2oxmj_xuq3p8y...@mail.gmail.com

I first noticed that our qsort code uses an int to hold some transient values representing numbers of elements. Since the passed array size argument is a size_t, this is broken; the integer could overflow. I do not think this is a live bug so far as our core code is concerned, because tuplesort.c still limits itself to no more than INT_MAX items to be sorted, and nothing else in the core would even approach that. However, it's in principle a hazard for third-party modules that might try to sort more than that; and in any case it's a bug waiting to bite us on the rear whenever somebody decides they have enough RAM that they should be able to sort more than INT_MAX items.

However, Yiqing reported the crash as occurring here:

Program terminated with signal 11, Segmentation fault.
#0 0x00785180 in med3_tuple (a=0x7f31613f1028, b=0x7f31613f1040, c=0x3ffd, cmp_tuple=0x7f43613f1010, state=0x1) at qsort_tuple.c:66
66 {

which is a bit curious, because that function does not itself access any of the data --- it just calls the cmp_tuple function, so even if we'd somehow computed a bad address, the crash should occur inside the comparator function, not here. After a while a theory occurred to me: the qsort functions recurse without bothering with a stack depth check, so maybe the SEGV actually represents running out of stack space. And after a bit of research, that theory seems pretty plausible. It turns out that qsort is guaranteed to recurse no deeper than log(N) levels, but *only if you take care to recurse on the smaller partition and iterate on the larger one*. And we're not doing that, we just blindly recurse on the left partition. So given a fairly huge amount of data and some bad luck in partition-picking, it seems possible that stack overflow explains this report.
I propose to (1) fix the code to use a size_t variable rather than int where appropriate; (2) teach it to recurse on the smaller partition. It's possible that this issue can only manifest on 9.4 and up where we have the ability for tuplesort to allocate work arrays approaching INT_MAX elements. But I don't have a lot of faith in that; I think the worst-case stack depth for the way we have it now could be as bad as O(N), so in principle a crash could be possible with significantly smaller input arrays. I think we'd better back-patch this all the way.

regards, tom lane
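The recurse-on-smaller / iterate-on-larger pattern Tom describes can be sketched in a few lines. This is an illustrative toy on int arrays with invented names, not the actual qsort_tuple.c code:

```c
#include <stddef.h>

/* Partition a[0..n-1] around a pivot (Lomuto scheme), returning the
 * pivot's final index.  Both resulting partitions exclude the pivot and
 * are therefore strictly smaller than n, guaranteeing progress. */
static size_t
partition(int *a, size_t n)
{
    int     pivot = a[n - 1];
    size_t  store = 0;

    for (size_t k = 0; k + 1 < n; k++)
    {
        if (a[k] < pivot)
        {
            int swap = a[k];

            a[k] = a[store];
            a[store] = swap;
            store++;
        }
    }
    a[n - 1] = a[store];
    a[store] = pivot;
    return store;
}

static void
sketch_qsort(int *a, size_t n)
{
    while (n > 1)
    {
        size_t  p = partition(a, n);
        size_t  left = p;            /* elements before the pivot */
        size_t  right = n - p - 1;   /* elements after the pivot */

        /* Recurse on the smaller partition and loop on the larger one:
         * the recursed side is always <= n/2, so the stack depth is
         * bounded by O(log N) instead of the O(N) worst case you get
         * from blindly recursing on the left partition. */
        if (left < right)
        {
            sketch_qsort(a, left);
            a += p + 1;
            n = right;
        }
        else
        {
            sketch_qsort(a + p + 1, right);
            n = left;
        }
    }
}
```

The tail-call-turned-loop is the whole trick: each iteration of the while loop handles the larger half without consuming a stack frame.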
Re: [HACKERS] TABLESAMPLE patch is really in pretty sad shape
On 2015-07-13 00:36, Tom Lane wrote:
> We have a far better model to follow already, namely the foreign data wrapper API. In that, there's a single SQL-visible function that returns a dummy datatype indicating that it's an FDW handler, and when called, it hands back a struct containing pointers to all the other functions that the particular wrapper needs to supply (and, if necessary, the struct could have non-function-pointer fields containing other info the core system might need to know about the handler). We could similarly invent a pseudotype tablesample_handler and represent each tablesample method by a single SQL function returning tablesample_handler. Any future changes in the API for tablesample handlers would then appear as changes in the C definition of the struct returned by the handler, which requires no SQL-visible thrashing, hence creates no headaches for pg_upgrade, and it is pretty easy to design it so that failure to update an external module that implements the API results in C compiler errors and/or sane fallback behavior.

(back from vacation, going over this thread)

Yes, this sounds very sane (and we use something similar also for logical decoding plugins, not just FDWs). I wish this had occurred to me before; I would not have had to spend time on the DDL support, which didn't even get in.

> Once we've done that, I think we don't even need a special catalog representing tablesample methods. Given FROM foo TABLESAMPLE bernoulli(...), the parser could just look for a function bernoulli() returning tablesample_handler, and it's done. The querytree would have an ordinary function dependency on that function, which would be sufficient to handle DROP dependency behaviors properly.

It seems possible that we might not need a catalog indeed. This would also simplify the parser part, which currently contains specialized function search code, as we could most likely just reuse the generic code.
> (On reflection, maybe better if it's bernoulli(internal) returns tablesample_handler, so as to guarantee that such a function doesn't create a conflict with any user-defined function of the same name.)

The probability of conflict seems high with system(), so yeah, we'd need some kind of differentiator.

> PS: now that I've written this rant, I wonder why we don't redesign the index AM API along the same lines. It probably doesn't matter much at the moment, but if we ever get serious about supporting index AM extensions, I think we ought to consider doing that.

+1

I think this is very relevant to the proposed sequence am patch as well.

-- 
Petr Jelinek
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
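For readers unfamiliar with the FdwRoutine pattern being proposed here, a self-contained sketch of the idea follows. The struct and function names are invented for illustration (they are not the API that was eventually committed), and plain C stand-ins replace the PostgreSQL typedefs and Datum plumbing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins for PostgreSQL typedefs, to keep the sketch self-contained. */
typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

/*
 * One struct of function pointers describes an entire tablesample method,
 * mirroring how FdwRoutine describes a foreign data wrapper.  Adding a
 * callback later changes only this C definition -- no SQL-visible or
 * catalog changes, hence no pg_upgrade headaches.
 */
typedef struct TsmRoutine
{
    void  (*InitSample) (void *state, uint32_t seed);
    bool  (*NextBlock)  (void *state, BlockNumber *blockno);
    bool  (*NextTuple)  (void *state, OffsetNumber *offset);
    void  (*EndSample)  (void *state);
} TsmRoutine;

/* Trivial stub implementations, just so the handler has something to return. */
static void stub_init(void *state, uint32_t seed) { (void) state; (void) seed; }
static bool stub_next_block(void *state, BlockNumber *blockno)
{ (void) state; *blockno = 0; return false; }
static bool stub_next_tuple(void *state, OffsetNumber *offset)
{ (void) state; *offset = 0; return false; }
static void stub_end(void *state) { (void) state; }

/*
 * The single "handler" function.  In PostgreSQL this would be the
 * SQL-visible function returning the tablesample_handler pseudotype, which
 * the parser resolves by name from FROM foo TABLESAMPLE bernoulli(...).
 */
TsmRoutine *
bernoulli_handler(void)
{
    TsmRoutine *tsm = calloc(1, sizeof(TsmRoutine));

    assert(tsm != NULL);
    tsm->InitSample = stub_init;
    tsm->NextBlock = stub_next_block;
    tsm->NextTuple = stub_next_tuple;
    tsm->EndSample = stub_end;
    return tsm;
}
```

The core system would call the handler once and then drive the scan entirely through the returned pointers, which is what makes a method swappable, and makes a stale external module fail at the C level rather than with subtle catalog mismatches.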
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello,

> Naming the GUC pgstat* seems a little inconsistent.

Sorry, there is a typo in the mail. The GUC name is 'track_activity_progress'.

> Also, adding the new GUC to src/backend/utils/misc/postgresql.conf.sample might be helpful

Yes. I will update.

Thank you,
Rahila Syed
Re: [HACKERS] LWLock deadlock and gdb advice
On Wed, Jul 15, 2015 at 8:44 AM, Heikki Linnakangas hlinn...@iki.fi wrote:
> Both. Here's the patch.
>
> Previously, LWLockAcquireWithVar set the variable associated with the lock atomically with acquiring it. Before the lwlock-scalability changes, that was straightforward because you held the spinlock anyway, but it's a lot harder/expensive now. So I changed the way acquiring a lock with a variable works. There is now a separate flag, LW_FLAG_VAR_SET, which indicates that the current lock holder has updated the variable. The LWLockAcquireWithVar function is gone - you now just use LWLockAcquire(), which always clears the LW_FLAG_VAR_SET flag, and you can call LWLockUpdateVar() after that if you want to set the variable immediately. LWLockWaitForVar() always waits if the flag is not set, i.e. it will not return, regardless of the variable's value, if the current lock-holder has not updated it yet.

I ran this for a while without casserts and it seems to work. But with casserts, I get failures in the autovac process on the GIN index. I don't see how this is related to the LWLock issue, but I didn't see it without your patch. Perhaps the system just didn't survive long enough to uncover it without the patch (although it shows up pretty quickly). It could just be an overzealous Assert, since with casserts off it didn't show problems.

bt and bt full are shown below.
Cheers,

Jeff

#0 0x003dcb632625 in raise () from /lib64/libc.so.6
#1 0x003dcb633e05 in abort () from /lib64/libc.so.6
#2 0x00930b7a in ExceptionalCondition (
    conditionName=0x9a1440 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))",
    errorType=0x9a12bc "FailedAssertion", fileName=0x9a12b0 "ginvacuum.c", lineNumber=713) at assert.c:54
#3 0x004947cf in ginvacuumcleanup (fcinfo=0x7fffee073a90) at ginvacuum.c:713
#4 0x0093b53a in FunctionCall2Coll (flinfo=0x7fffee073e60, collation=0, arg1=140737186840448, arg2=0) at fmgr.c:1323
#5 0x004d5c7a in index_vacuum_cleanup (info=0x7fffee073f80, stats=0x0) at indexam.c:717
#6 0x00664f69 in lazy_cleanup_index (indrel=0x7faafbcace20, stats=0x0, vacrelstats=0x28809c8) at vacuumlazy.c:1400
#7 0x00664142 in lazy_scan_heap (onerel=0x7faafbcab6d0, vacrelstats=0x28809c8, Irel=0x2881090, nindexes=2, scan_all=1 '\001') at vacuumlazy.c:
#8 0x00662905 in lazy_vacuum_rel (onerel=0x7faafbcab6d0, options=65, params=0x286ea00, bstrategy=0x286ea90) at vacuumlazy.c:247
#9 0x006623e4 in vacuum_rel (relid=16409, relation=0x7fffee074550, options=65, params=0x286ea00) at vacuum.c:1377
#10 0x00660bea in vacuum (options=65, relation=0x7fffee074550, relid=16409, params=0x286ea00, va_cols=0x0, bstrategy=0x286ea90, isTopLevel=1 '\001') at vacuum.c:295
#11 0x0075caa9 in autovacuum_do_vac_analyze (tab=0x286e9f8, bstrategy=0x286ea90) at autovacuum.c:2811
#12 0x0075be67 in do_autovacuum () at autovacuum.c:2331
#13 0x0075ac1c in AutoVacWorkerMain (argc=0, argv=0x0) at autovacuum.c:1648
#14 0x0075a84d in StartAutoVacWorker () at autovacuum.c:1455
#15 0x0076f745 in StartAutovacuumWorker () at postmaster.c:5261
#16 0x0076effd in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:4918
#17 signal handler called
#18 0x003dcb6e1353 in __select_nocancel () from /lib64/libc.so.6
#19 0x0076a8f0 in ServerLoop () at postmaster.c:1582
#20 0x0076a141 in PostmasterMain (argc=4, argv=0x27b2cf0) at postmaster.c:1263
#21 0x006c1e6e in main (argc=4, argv=0x27b2cf0) at main.c:223

#0 0x003dcb632625 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x003dcb633e05 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00930b7a in ExceptionalCondition (
    conditionName=0x9a1440 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))",
    errorType=0x9a12bc "FailedAssertion", fileName=0x9a12b0 "ginvacuum.c", lineNumber=713) at assert.c:54
No locals.
#3 0x004947cf in ginvacuumcleanup (fcinfo=0x7fffee073a90) at ginvacuum.c:713
        buffer = 186
        page = 0x7faafc565b20
        info = 0x7fffee073f80
        stats = 0x2880858
        index = 0x7faafbcace20
        needLock = 1 '\001'
        npages = 22569
        blkno = 12025
        totFreePages = 11502
        ginstate = {index = 0x7faafbcace20, oneCol = 1 '\001', origTupdesc = 0x7faafbcad150, tupdesc = {0x7faafbcad150, 0x0 <repeats 31 times>}, compareFn = {{fn_addr = 0x90224d <bttextcmp>, fn_oid = 360, fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x285e3c0, fn_expr = 0x0}, {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0} <repeats 31 times>}, extractValueFn = {{
        fn_addr = 0x494adc
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
Amit Kapila wrote:
> This can be tracked either in 9.5 Open Items or for next CF, any opinions? If nobody else has any opinion on this, I will add it to 9.5 Open Items list.

I think this belongs in the open items list, yeah.

-- 
Álvaro Herrera
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Patch to fix spelling mistake in error message
On Thu, Jul 16, 2015 at 2:23 AM, David Rowley david.row...@2ndquadrant.com wrote:
> Attached is a small patch to fix a spelling mistake in an error message

Applied, thanks.

-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] [PATCH] Comment fix for miscinit.c
On Wed, Jul 15, 2015 at 11:18 PM, David Christensen da...@endpoint.com wrote:
> Quick comment fix for edit issue.

Applied, thanks.

-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] TABLESAMPLE patch is really in pretty sad shape
Petr Jelinek wrote:
> On 2015-07-13 00:36, Tom Lane wrote:
>> PS: now that I've written this rant, I wonder why we don't redesign the index AM API along the same lines. It probably doesn't matter much at the moment, but if we ever get serious about supporting index AM extensions, I think we ought to consider doing that.
>
> +1
>
> I think this is very relevant to the proposed sequence am patch as well.

Hmm, how would this work? Would we have index AM implementations run some function that registers their support methods somehow at startup? Hopefully we're not going to have the index AMs become shared libraries.

In any case, if index AMs and sequence AMs go this route, that probably means the column store AM we're working on will have to go the same route too.

-- 
Álvaro Herrera
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] TABLESAMPLE patch is really in pretty sad shape
On 2015-07-13 15:39, Tom Lane wrote:
> Datum
> tsm_system_rows_init(PG_FUNCTION_ARGS)
> {
>     TableSampleDesc *tsdesc = (TableSampleDesc *) PG_GETARG_POINTER(0);
>     uint32 seed = PG_GETARG_UINT32(1);
>     int32 ntuples = PG_ARGISNULL(2) ? -1 : PG_GETARG_INT32(2);
>
> This is rather curious coding. Why should this function check only one of its arguments for nullness? If it needs to defend against any of them being null, it really needs to check all. But in point of fact, this function is declared STRICT, which means it's a violation of the function call protocol if the core code ever passes a null to it, and so this test ought to be dead code. A similar pattern of ARGISNULL checks in declared-strict functions exists in all the tablesample modules, not just this one, showing that this is an overall design error not just a thinko here. My inclination would be to make the core code enforce non-nullness of all tablesample arguments so as to make it unnecessary to check strictness of the tsm*init functions, but in any case it is Not Okay to just pass nulls willy-nilly to strict C functions.

The reason for this ugliness came from having to strike a balance between modularity and following the SQL Standard error codes for BERNOULLI and SYSTEM, which came up as an issue during reviews (the original code did the checks before calling the sampling method's functions, but produced just a generic error code about a wrong parameter). I considered it okayish mainly because those functions are not SQL callable and the code which calls them knows how to handle it correctly, but I understand why you don't. I guess if we did what you proposed in another email in this thread - don't have the API exposed at the SQL level - we could send the additional parameters as a List * and leave the validation completely to the function.
(And maybe not allow NULLs at all.)

> Also, I find this coding pretty sloppy even without the strictness angle, because the net effect of this way of dealing with nulls is that an argument-must-not-be-null complaint is reported as out of range, which is both confusing and the wrong ERRCODE.
>
>     if (ntuples < 1)
>         ereport(ERROR,
>                 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
>                  errmsg("invalid sample size"),
>                  errhint("Sample size must be positive integer value.")));
>
> I don't find this to be good error message style. The secondary comment is not a hint, it's an ironclad statement of what you did wrong, so if we wanted to phrase it like this it should be an errdetail not errhint. But the whole thing is overly cute anyway, because there is no reason at all not to just say what we mean in the primary error message, eg
>
>     ereport(ERROR,
>             (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
>              errmsg("sample size must be greater than zero")));

Same as above, although now that I re-read the standard I am sure I misunderstood it the first time - it says: "If S is the null value or if S < 0 (zero) or if S > 100, then an exception condition is raised: data exception — invalid sample size." I took it as a literal error message originally, but they just mean the status code by it.

> While we're on the subject, what's the reason for disallowing a sample size of zero? Seems like a reasonable edge case.

Yes, that's a bug.

>     /* All blocks have been read, we're done */
>     if (sampler->doneblocks > sampler->nblocks ||
>         sampler->donetuples >= sampler->ntuples)
>         PG_RETURN_UINT32(InvalidBlockNumber);
>
> Okay, I lied, I *am* going to complain about this comment. Comments that do not accurately describe the code they're attached to are worse than useless.

That's a copy-pasto from tsm_system_time.

> In short, I'd do something more like
>
>     uint32 r;
>
>     /* Safety check to avoid infinite loop or zero result for small n. */
>     if (n <= 1)
>         return 1;
>
>     /*
>      * This should only take 2 or 3 iterations, as the probability of 2
>      * numbers being relatively prime is ~61%; but just in case, we'll
>      * include a CHECK_FOR_INTERRUPTS in the loop.
>      */
>     do
>     {
>         CHECK_FOR_INTERRUPTS();
>         r = (uint32) (sampler_random_fract(randstate) * n);
>     } while (r == 0 || gcd(r, n) > 1);
>
> Note however that this coding would result in an unpredictable number of iterations of the RNG, which might not be such a good thing if we're trying to achieve repeatability. It doesn't matter in the context of this module since the RNG is not used after initialization, but it would matter if we then went on to do Bernoulli-style sampling. (Possibly that could be dodged by reinitializing the RNG after the initialization steps.)

Bernoulli-style sampling does not need this kind of code, so it's not really an issue. That is unless you'd like to combine the linear probing and bernoulli of course, but I don't see any benefit in doing that.

-- 
Petr Jelinek
http://www.2ndQuadrant.com/
Re: [HACKERS] [DESIGN] Incremental checksums
>>>> - pg_disable_checksums(void) => turn checksums off for a cluster. Sets the state to disabled, which means bg_worker will not do anything.
>>>> - pg_request_checksum_cycle(void) => if checksums are enabled, increment the data_checksum_cycle counter and set the state to enabling.
>>>
>>> If the cluster is already enabled for checksums, then what is the need for any other action?
>>
>> You are assuming this is a one-way action.
>
> No, I was confused by the state (enabling) this function will set.

Okay.

>> Requesting an explicit checksum cycle would be desirable in the case where you want to proactively verify there is no cluster corruption to be found.
>
> Sure, but I think that is different from setting the state to enabling. In your proposal above, in the enabling state the cluster needs to write checksums, where for such a feature you only need read validation.

With "revalidating", since your database is still actively making changes, you need to validate writes too (think new tables, etc). "Enabling" needs reads unvalidated because you're starting from an unknown state (i.e., not checksummed already).

Thanks,
David

-- 
David Christensen
PostgreSQL Team Manager
End Point Corporation
da...@endpoint.com
785-727-1171
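The distinction David is drawing (enabling must leave reads unverified, while a revalidation cycle verifies both directions) can be captured as a small state table. This is an illustrative sketch only, loosely following the state names used in the thread, not code from the actual proposal:

```c
#include <stdbool.h>

/* Cluster-wide checksum states for an incremental-checksums scheme;
 * names are loosely borrowed from the thread, details are invented. */
typedef enum ChecksumState
{
    CHECKSUMS_DISABLED,     /* bgworker idle; pages not checksummed */
    CHECKSUMS_ENABLING,     /* writes get checksums, but reads are NOT
                             * verified: not-yet-rewritten pages are in
                             * an unknown (unchecksummed) state */
    CHECKSUMS_REVALIDATING, /* a requested cycle: writes get checksums
                             * AND reads are verified, since every page
                             * was already checksummed once */
    CHECKSUMS_ENABLED       /* steady state: checksum on write, verify
                             * on read */
} ChecksumState;

/* Should a page's checksum be verified when it is read in this state? */
static bool
verify_checksum_on_read(ChecksumState state)
{
    return state == CHECKSUMS_REVALIDATING || state == CHECKSUMS_ENABLED;
}

/* Should checksums be computed and written in this state? */
static bool
write_checksums(ChecksumState state)
{
    return state != CHECKSUMS_DISABLED;
}
```

The asymmetry between `CHECKSUMS_ENABLING` and `CHECKSUMS_REVALIDATING` is exactly the point of the exchange above: both write checksums, but only the latter may safely verify on read.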
Re: [HACKERS] TABLESAMPLE patch is really in pretty sad shape
On 2015-07-13 18:00, Tom Lane wrote:
> And here's that. I do *not* claim that this is a complete list of design issues with the patch, it's just things I happened to notice in the amount of time I've spent so far (which is already way more than I wanted to spend on TABLESAMPLE right now).
>
> I'm not sure that we need an extensible sampling-method capability at all, let alone that an API for that needs to be the primary focus of a patch. Certainly the offered examples of extension modules aren't inspiring: tsm_system_time is broken by design (more about that below) and nobody would miss tsm_system_rows if it weren't there. What is far more important is to get sampling capability into indexscans, and designing an API ahead of time seems like mostly a distraction from that. I'd think seriously about tossing the entire executor-stage part of the patch, creating a stateless (history-independent) sampling filter that just accepts or rejects TIDs, and sticking calls to that into all the table scan node types. Once you had something like that working well it might be worth considering whether to try to expose an API to generalize it. But even then it's not clear that we really need any more options than true-Bernoulli and block-level sampling.

I think this is not really an API issue so much as an opposite approach on what to implement first. I prioritized true block sampling support in the first iteration, as that's what I've been mostly asked for. But my plan was to add, at some later stage (9.6+), the ability to do subquery scans etc. - basically a new SamplingFilter executor node which would pass the tuples to examinetuple(), which would then decide what to do with them. The selection between using nextblock/nexttuple and examinetuple was supposed to be an extension of the API, where the sampling method would say if it supports examinetuple, nextblock/nexttuple, or both.
And eventually I wanted to rewrite bernoulli to just use examinetuple() on top of whatever scan, once this additional support was in. I think I explained this during the review stage.

> The IBM paper I linked to in the other thread mentions that their optimizer will sometimes choose to do Bernoulli sampling even if SYSTEM was requested. Probably that happens when it decides to do a simple indexscan, because then there's no advantage to trying to cluster the sampled rows.

Yeah, it happens when there is an index which is used in the WHERE clause and has bigger selectivity than the percentage specified in the TABLESAMPLE clause. This of course breaks one of the common use-cases though: SELECT count(*) * 100 FROM table TABLESAMPLE SYSTEM(1) WHERE foo = bar;

> But in the case of a bitmap scan, you could very easily do either true Bernoulli or block-level sampling simply by adjusting the rules about which bits you keep or clear in the bitmap (given that you apply the filter between collection of the index bitmap and accessing the heap, which seems natural). The only case where a special scan type really seems to be needed is if you want block-level sampling, the query would otherwise use a seqscan, *and* the sampling percentage is pretty low --- if you'd be reading every second or third block anyway, you're likely better off with a plain seqscan so that the kernel sees sequential access and does prefetching. The current API doesn't seem to make it possible to switch between seqscan and read-only-selected-blocks based on the sampling percentage, but I think that could be an interesting optimization.

Well, you can do that if you write your own sampling method. We don't do that in the optimizer, and that's a design choice, because you can't really do that at a high level like that if you want to keep extensibility.
And given the number of people who asked if they could do their own sampling when I talked to them about this during the design stage, I consider the extensibility more important. Especially if extensibility gives you the option to do the switching anyway, albeit at a lower level and not out of the box.

> (Another bet that's been missed is having the samplescan logic request prefetching when it is doing selective block reads. The current API can't support that AFAICS, since there's no expectation that nextblock calls could be done asynchronously from nexttuple calls.)

Not sure I follow; would it not be possible to achieve this using tsmseqscan set to true (it's a misnomer then, I know)?

> Another issue with the API as designed is the examinetuple() callback. Allowing sample methods to see invisible tuples is quite possibly the worst idea in the whole patch. They can *not* do anything with such tuples, or they'll totally break reproducibility: if the tuple is invisible to your scan, it might well be or soon become invisible to everybody, whereupon it would be subject to garbage collection at the drop of a hat. So if an invisible tuple affects the sample method's behavior at all, repeated scans in the same query would
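Tom's "stateless (history-independent) sampling filter that just accepts or rejects TIDs" is easy to sketch outside of any executor machinery: hash the TID together with the seed and accept it when the hash falls below the sampling fraction. The function names and the mixing constants here are invented for illustration (the mixer is a MurmurHash3-style 64-bit finalizer), not anything from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mix block number, offset, and seed into a pseudo-random 32-bit value.
 * Any decent integer mixer would do; this one is a MurmurHash3-style
 * 64-bit finalizer. */
static uint32_t
tid_hash(uint32_t blockno, uint16_t offset, uint32_t seed)
{
    uint64_t x = ((uint64_t) seed << 32) ^ ((uint64_t) blockno << 16) ^ offset;

    x ^= x >> 33;
    x *= UINT64_C(0xff51afd7ed558ccd);
    x ^= x >> 33;
    x *= UINT64_C(0xc4ceb9fe1a85ec53);
    x ^= x >> 33;
    return (uint32_t) x;
}

/*
 * History-independent Bernoulli filter: the verdict depends only on the
 * TID, the seed, and the fraction, so seqscans, indexscans, and bitmap
 * scans all sample the same rows, and repeated scans are repeatable.
 */
static bool
sample_accepts_tid(uint32_t blockno, uint16_t offset,
                   uint32_t seed, double fraction)
{
    if (fraction <= 0.0)
        return false;
    if (fraction >= 1.0)
        return true;
    return tid_hash(blockno, offset, seed) < (uint32_t) (fraction * 4294967296.0);
}

/* SYSTEM-style block sampling is the same test with the offset ignored. */
static bool
sample_accepts_block(uint32_t blockno, uint32_t seed, double fraction)
{
    return sample_accepts_tid(blockno, 0, seed, fraction);
}
```

In a bitmap scan, such a filter could be applied between collecting the index bitmap and visiting the heap, as suggested above: clear the bits for individual TIDs (Bernoulli) or whole blocks (SYSTEM-style) that the filter rejects.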
Re: [HACKERS] Freeze avoidance of very large table.
On Wed, Jul 15, 2015 at 3:07 AM, Sawada Masahiko sawada.m...@gmail.com wrote: On Wed, Jul 15, 2015 at 12:55 AM, Simon Riggs si...@2ndquadrant.com wrote: On 10 July 2015 at 15:11, Sawada Masahiko sawada.m...@gmail.com wrote: Oops, I had forgotten to add the new file heapfuncs.c. Latest patch is attached.

I think we've established that the approach is desirable and defined the way forwards for this, so this is looking good. If we want to move stuff like pg_stattuple, pg_freespacemap into core, we could move them into heapfuncs.c. Some of my requests haven't been actioned yet, so I personally would not commit this yet. I am happy to continue as reviewer/committer unless others wish to take over. The main missing item is pg_upgrade support, which won't happen by end of CF1, so I am marking this as Returned With Feedback. Hopefully we can review this again before CF2.

I appreciate your reviewing. Yeah, the pg_upgrade support and regression test for the VFM patch are almost done now; I will submit the patch this week after testing it.

Attached patch is the latest v9 patch. I added:
- regression test for visibility map (visibilitymap.sql and visibilitymap.out files)
- pg_upgrade support (rewriting vm file to vfm file)
- regression test for pg_upgrade

Please review it.

Regards,

--
Masahiko Sawada

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	 * If the page has only visible tuples, then we can find out the free
 	 * space from the FSM and move on.
 	 */
-	if (visibilitymap_test(rel, blkno, vmbuffer))
+	if (visibilitymap_test(rel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
 	{
 		freespace = GetRecordedFreeSpace(rel, blkno);
 		stat->tuple_len += BLCKSZ - freespace;
diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index b83d496..806ce27 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,6 +12,7 @@ subdir = src/backend/access/heap
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global

-OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
+OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o \
+	heapfuncs.o

 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86a2e6b..796b76f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2131,8 +2131,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);

 	/*
-	 * Find buffer to insert this tuple into.  If the page is all visible,
-	 * this will also pin the requisite visibility map page.
+	 * Find buffer to insert this tuple into.  If the page is all visible
+	 * or all frozen, this will also pin the requisite visibility map and
+	 * frozen map page.
 	 */
 	buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
 									   InvalidBuffer, options, bistate,
@@ -2147,7 +2148,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	if (PageIsAllVisible(BufferGetPage(buffer)))
 	{
 		all_visible_cleared = true;
+
+		/* all-frozen information is also cleared at the same time */
 		PageClearAllVisible(BufferGetPage(buffer));
+		PageClearAllFrozen(BufferGetPage(buffer));
+
 		visibilitymap_clear(relation,
 							ItemPointerGetBlockNumber(&(heaptup->t_self)),
 							vmbuffer);
@@ -2448,7 +2453,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 		if (PageIsAllVisible(page))
 		{
 			all_visible_cleared = true;
+
+			/* all-frozen information is also cleared at the same time */
 			PageClearAllVisible(page);
+			PageClearAllFrozen(page);
+
 			visibilitymap_clear(relation,
 								BufferGetBlockNumber(buffer),
 								vmbuffer);
@@ -2731,9 +2740,9 @@ heap_delete(Relation relation, ItemPointer tid,
 	/*
 	 * If we didn't pin the visibility map page and the page has become all
-	 * visible while we were busy locking the buffer, we'll have to unlock and
-	 * re-lock, to avoid holding the buffer lock across an I/O.  That's a bit
-	 * unfortunate, but hopefully shouldn't happen often.
+	 * visible or all frozen while we were busy locking the buffer, we'll
+	 * have to unlock and re-lock, to avoid holding the buffer lock across an
+	 * I/O.  That's a bit unfortunate, but hopefully shouldn't happen often.
 	 */
 	if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
 	{
@@ -2925,10 +2934,15 @@ l1:
 	 */
 	PageSetPrunable(page, xid);

+	/* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */
 	if (PageIsAllVisible(page))
 	{
 		all_visible_cleared = true;
+
+		/* all-frozen information is also cleared at the same time */
 		PageClearAllVisible(page);
+
Re: [HACKERS] optimizing vacuum truncation scans
On Thu, Jul 16, 2015 at 11:21 AM, Haribabu Kommi kommi.harib...@gmail.com wrote: On Wed, Jul 15, 2015 at 11:19 PM, Amit Kapila amit.kapil...@gmail.com wrote: One case where this patch can behave differently than the current code is when, before taking the Exclusive lock on the relation for truncation, some backend clears the vm bit and then inserts-deletes-prunes that page. I think with the patch it will not consider truncating such pages, whereas the current code will allow truncating them as it re-verifies each page. I think such a case would be rare and we might not want to bother about it, but it is still worth thinking about once.

Thanks for your review. Corrected the code: instead of returning the blkno after a visibility map check failure, the code now just continues to verify the contents as in the earlier approach.

Okay, I think this should work.

Here I attached updated patches: 1) without prefetch logic, 2) with combined vm and prefetch logic.

I think it is better to just get the first patch in, as that in itself is a clear win, and then we can evaluate the second one (prefetching during truncation) separately. I think after the first patch, the use case for doing prefetch seems to be less beneficial, and I think it could hurt by loading pages that are not required into the OS buffer cache (assume you have prefetched 32 pages and after inspecting the first page it decides to quit the loop because that is a non-empty page; another case is when it has prefetched just because for one page the visibility map bit was cleared, but for the others it is set). Having said that, I am not against prefetching in this scenario as it can help in more cases than it could hurt, but I think it will be better to evaluate that separately.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] TABLESAMPLE patch is really in pretty sad shape
Petr Jelinek wrote: On 2015-07-13 15:39, Tom Lane wrote: I don't find this to be good error message style. The secondary comment is not a hint, it's an ironclad statement of what you did wrong, so if we wanted to phrase it like this it should be an errdetail not errhint. But the whole thing is overly cute anyway, because there is no reason at all not to just say what we mean in the primary error message, e.g.

ereport(ERROR,
        (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
         errmsg("sample size must be greater than zero")));

Same as above, although now that I re-read the standard I am sure I misunderstood it the first time - it says:

If S is the null value or if S < 0 (zero) or if S > 100, then an exception condition is raised: data exception — invalid sample size.

I took it as a literal error message originally, but they just mean the status code by it.

Yes, we must use a new errcode ERRCODE_INVALID_SAMPLE_SIZE defined to 2202H here.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Fri, Jul 3, 2015 at 12:15 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Jul 2, 2015 at 7:44 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Amit Kapila wrote: Added the above log messages in the attached patch, with the small change that in the message the file names will be displayed with quotes, as most other usages of rename (failure) in that file use quotes to display filenames.

Why emit two messages? Not necessary. Can we reduce that to a single one? Maybe the first one could be errdetail or something.

I think it is better the other way (basically have the second one as errdetail). We already have one similar message in xlog.c that way:

ereport(LOG,
        (errmsg("online backup mode canceled"),
         errdetail("\"%s\" was renamed to \"%s\".",
                   BACKUP_LABEL_FILE, BACKUP_LABEL_OLD)));

Attached patch consolidates the errmsg into one message.

Here are some minor comments:

+				ereport(LOG,
+						(errmsg("ignoring \"%s\" file because no \"%s\" file exists",
+								TABLESPACE_MAP, BACKUP_LABEL_FILE),
+						 errdetail("could not rename file \"%s\" to \"%s\": %m",
+								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));

WARNING is better than LOG here because it indicates a problematic case?

In the detail message, the first word of the sentence needs to be capitalized.

+						 errdetail("renamed file \"%s\" to \"%s\"",
+								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));

In a detail message, we should basically use a complete sentence. So like other similar detail messages in xlog.c, I think that it's better to use "\"%s\" was renamed to \"%s\"." as the detail message here.

Regards,

--
Fujii Masao
[HACKERS] Volatility of pg_xact_commit_timestamp() and pg_last_committed_xact()
Hi,

The volatilities of pg_xact_commit_timestamp() and pg_last_committed_xact() are currently STABLE. But ISTM that those functions can return different results even within a single statement. So shouldn't we change their volatilities to VOLATILE?

Regards,

--
Fujii Masao
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Jul 14, 2015, at 5:25 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: I really think we should do the simple thing first. If we make this complicated and add lots of bells and whistles, it is going to be much harder to get anything committed, because there will be more things for somebody to object to. If we start with something simple, we can always improve it later, if we are confident that the design for improving it is good. The hardest thing about a proposal like this is going to be getting down the overhead to a level that is acceptable, and every expansion of the basic design that has been proposed - gathering more than one byte of information, or gathering times, or having the backend update a tracking hash - adds *significant* overhead to the design I proposed.

FWIW, I entirely share Robert's opinion that adding gettimeofday() overhead in routinely-taken paths is likely not to be acceptable. But there's no need to base this solely on opinions. I suggest somebody try instrumenting just one hotspot in the various ways that are being proposed, and then we can actually measure what it costs, instead of guessing about that.

regards, tom lane

I made a benchmark of gettimeofday() (gtd below). I believe it is certainly usable for monitoring.

Testing configuration: 24 cores, Intel Xeon CPU X5675 @ 3.07GHz, RAM 24 GB

54179703 - microseconds total
2147483647 - (INT_MAX), the number of gettimeofday() calls

54179703 / 2147483647.0 = 0.025229390256679331

Here we have the average duration of one gettimeofday() in microseconds. Now we get the count of all waits in one minute (pgbench -i -s 500, 2 GB shared buffers, started with -c 96 -j 4, it is almost 100% cpu load):

b1=# select sum(wait_count) from pg_stat_wait_profile;
   sum
---------
 2113608

So, we can estimate the time spent in gtd in one minute (multiplied by two, because we are going to call it twice for each wait):

2113608 * 0.025229390256679331 * 2 = 106650.08216327897

This is the time in microseconds that we spend on gtd in one minute. Calculation of overhead (in percent):

106650 / 60000000.0 * 100 = 0.17775
Re: [HACKERS] TABLESAMPLE patch is really in pretty sad shape
On 2015-07-16 15:59, Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Petr Jelinek wrote: On 2015-07-13 00:36, Tom Lane wrote: PS: now that I've written this rant, I wonder why we don't redesign the index AM API along the same lines. It probably doesn't matter much at the moment, but if we ever get serious about supporting index AM extensions, I think we ought to consider doing that. I think this is very relevant to the proposed sequence am patch as well. Hmm, how would this work? Would we have index AM implementation run some function that register their support methods somehow at startup? Well, registration of new index AMs is an unsolved question ATM anyhow. But what I'm imagining is that pg_am would reduce to about two columns, amname and a handler function OID, and everything else that constitutes the API for AMs would get moved down to the C level. We have to keep that catalog because we still need index AMs to have OIDs that will represent them in pg_opclass etc; but we don't need to nail the exact set of AM interface functions into the catalog. (I'm not sure whether we'd want to remove all the bool columns from pg_am. At the C level it would be about as convenient to have them in a struct returned by the handler function. But it's occasionally useful to have those properties visible to SQL queries.) This is along the lines of how I was thinking also (when I read your previous email). I think the properties of the index will have to be decided on individual basis once somebody actually starts working on this. But functions can clearly go into C struct if they are called only from C anyway. I'm not clear on whether sequence AMs would need explicit catalog representation, or could be folded down to just a single SQL function with special signature as I suggested for tablesample handlers. Is there any need for a sequence AM to have additional catalog infrastructure like index AMs need? 
That depends on the route we will choose to take with the storage there. If we allow custom columns for sequence AMs (which is what both Heikki and me seem to be inclined to do) then I think it will still need a catalog, plus it's also easier to just reuse the relam behavior than coming up with something completely new IMHO.

In any case, if index AMs and sequence AMs go this route, that probably means the column store AM we're working on will have to go the same route too.

It's worth considering anyway. The FDW API has clearly been far more successful than the index AM API in terms of being practically usable by extensions.

Yep, I now consider it to be a clear mistake that I modeled both sequence AM and tablesample after indexes, given that I targeted both at extensibility.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] TABLESAMPLE patch is really in pretty sad shape
On Thu, Jul 16, 2015 at 10:33 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Petr Jelinek wrote: On 2015-07-13 00:36, Tom Lane wrote: PS: now that I've written this rant, I wonder why we don't redesign the index AM API along the same lines. It probably doesn't matter much at the moment, but if we ever get serious about supporting index AM extensions, I think we ought to consider doing that.

+1

I think this is very relevant to the proposed sequence AM patch as well.

Hmm, how would this work? Would we have index AM implementations run some function that registers their support methods somehow at startup? Hopefully we're not going to have the index AMs become shared libraries.

I recall a proposal by Alexander Korotkov about extensible access methods, although his proposal also included a CREATE AM command that would add a pg_am row, so that perhaps differs from what Tom seems to allude to here.

http://www.postgresql.org/message-id/capphfdsxwzmojm6dx+tjnpyk27kt4o7ri6x_4oswcbyu1rm...@mail.gmail.com

Thanks,
Amit
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
Ildus Kurbangaliev i.kurbangal...@postgrespro.ru writes: I made a benchmark of gettimeofday(). I believe it is certainly usable for monitoring. Testing configuration: 24 cores, Intel Xeon CPU X5675 @ 3.07GHz, RAM 24 GB. 54179703 - microseconds total; 2147483647 - (INT_MAX), the number of gettimeofday() calls; 54179703 / 2147483647.0 = 0.025229390256679331. Here we have the average duration of one gettimeofday() in microseconds.

25 nsec per gettimeofday() is in the same ballpark as what I measured on a new-ish machine last year:
http://www.postgresql.org/message-id/flat/31856.1400021...@sss.pgh.pa.us

The problem is that (a) on modern hardware that is not a small number, it's the equivalent of 100 or more instructions; and (b) the results look very much worse on less-modern hardware, particularly machines where gettimeofday requires a kernel call.

regards, tom lane
Re: [HACKERS] [PATCH] postgres_fdw extension support
On Thu, Jul 16, 2015 at 11:06 PM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Langote langote_amit...@lab.ntt.co.jp writes: On 2015-07-16 PM 12:43, Tom Lane wrote: The basic issue here is "how can a user control which functions/operators can be sent for remote execution?". While it's certainly true that sometimes you might want function-by-function control of that, Paul's point was that extension-level granularity would be extremely convenient for PostGIS, and probably for other extensions.

Perhaps just paranoid, but is the extension version number of any significance?

In any scenario for user control of sending functions to the far end, it's on the user's head to make sure that he's telling us the truth about which functions are compatible between local and remote servers. That would extend to checking cross-version compatibility if he's running different versions, too. We already have risks of that kind with built-in functions, really, and I've not heard complaints about it.

Yeah, that's true.

Thanks,
Amit
Re: [HACKERS] TABLESAMPLE patch is really in pretty sad shape
Amit Langote amitlangot...@gmail.com writes: On Thu, Jul 16, 2015 at 10:33 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Hmm, how would this work? Would we have index AM implementations run some function that registers their support methods somehow at startup? I recall a proposal by Alexander Korotkov about extensible access methods, although his proposal also included a CREATE AM command that would add a pg_am row, so that perhaps differs from what Tom seems to allude to here.

I think we'd still need to invent CREATE AM if we wanted to allow index AMs to be created as extensions; we'd still have to have the pg_am catalog, and extensions still couldn't write rows directly into that, for the same reasons I pointed out with respect to tablesample methods. However, if the contents of pg_am could be boiled down to just a name and a handler function, then that would represent a simple and completely stable definition for CREATE AM's arguments, which would be a large improvement over trying to reflect the current contents of pg_am directly in a SQL statement. We add new columns to pg_am all the time, and that would create huge backward-compatibility headaches if we had to modify the behavior of a CREATE AM statement every time.

regards, tom lane
Re: [HACKERS] [PATCH] postgres_fdw extension support
Amit Langote langote_amit...@lab.ntt.co.jp writes: On 2015-07-16 PM 12:43, Tom Lane wrote: The basic issue here is "how can a user control which functions/operators can be sent for remote execution?". While it's certainly true that sometimes you might want function-by-function control of that, Paul's point was that extension-level granularity would be extremely convenient for PostGIS, and probably for other extensions.

Perhaps just paranoid, but is the extension version number of any significance?

In any scenario for user control of sending functions to the far end, it's on the user's head to make sure that he's telling us the truth about which functions are compatible between local and remote servers. That would extend to checking cross-version compatibility if he's running different versions, too. We already have risks of that kind with built-in functions, really, and I've not heard complaints about it.

regards, tom lane
Re: [HACKERS] TABLESAMPLE doesn't actually satisfy the SQL spec, does it?
Petr Jelinek p...@2ndquadrant.com writes: On 2015-07-12 18:02, Tom Lane wrote: A possible way around this problem is to redefine the sampling rule so that it is not history-dependent but depends only on the tuple TIDs. For instance, one could hash the TID of a candidate tuple, xor that with a hash of the seed being used for the current query, and then select the tuple if (hash/MAXINT) < P.

That would work for bernoulli for physical tuples, yes. Only thing that worries me is future extensibility for data sources that only provide virtual tuples.

Well, repeatability of a TABLESAMPLE attached to a join seems like an unsolved and possibly unsolvable problem anyway. I don't think we should assume that the API we define today will cope with that. But that is another reason why the current API is inadequate: there's no provision for specifying whether or how a tablesample method can be applied to non-base-table RTEs. (I re-read the thread and noted that Peter E. complained about that some time ago, but nothing was done about it. I'm fine with not supporting the case right now, but nonetheless it's another reason why we'd better make the API more easily extensible.)

regards, tom lane
Re: [HACKERS] TABLESAMPLE patch is really in pretty sad shape
Petr Jelinek p...@2ndquadrant.com writes: On 2015-07-16 15:59, Tom Lane wrote: I'm not clear on whether sequence AMs would need explicit catalog representation, or could be folded down to just a single SQL function with special signature as I suggested for tablesample handlers. Is there any need for a sequence AM to have additional catalog infrastructure like index AMs need?

That depends on the route we will choose to take with the storage there. If we allow custom columns for sequence AMs (which is what both Heikki and me seem to be inclined to do) then I think it will still need a catalog, plus it's also easier to just reuse the relam behavior than coming up with something completely new IMHO.

Hm. I've not been following the sequence AM stuff carefully, but if you want to use relam to point at a sequence's AM, then really sequence AMs have to be represented in pg_am. (It would be quite broken from a relational theory standpoint if relam could point to either of two catalogs; not to mention that we have no way to guarantee OID uniqueness across multiple catalogs.) As things stand today, putting both kinds of AM into one catalog would be pretty horrible, but it seems not hard to make it work if we did this sort of refactoring first. We'd end up with three columns in pg_am: amname, amkind (index or sequence), and amhandler. The type and contents of the struct returned by amhandler could be different depending on amkind, which means that the APIs could be as different as we need, despite sharing just one catalog. The only real restriction is that index and sequence AMs could not have the same names, which doesn't seem like much of a problem from here.

regards, tom lane
Re: [HACKERS] TABLESAMPLE doesn't actually satisfy the SQL spec, does it?
On 2015-07-16 16:22, Tom Lane wrote: Petr Jelinek p...@2ndquadrant.com writes: On 2015-07-12 18:02, Tom Lane wrote: A possible way around this problem is to redefine the sampling rule so that it is not history-dependent but depends only on the tuple TIDs. For instance, one could hash the TID of a candidate tuple, xor that with a hash of the seed being used for the current query, and then select the tuple if (hash/MAXINT) < P.

That would work for bernoulli for physical tuples, yes. Only thing that worries me is future extensibility for data sources that only provide virtual tuples.

Well, repeatability of a TABLESAMPLE attached to a join seems like an unsolved and possibly unsolvable problem anyway. I don't think we should assume that the API we define today will cope with that.

OK, it's true that the implementations I've seen in other databases so far only concern themselves with sampling physical relations and ignore the rest.

But that is another reason why the current API is inadequate: there's no provision for specifying whether or how a tablesample method can be applied to non-base-table RTEs. (I re-read the thread and noted that Peter E. complained about that some time ago, but nothing was done about it. I'm fine with not supporting the case right now, but nonetheless it's another reason why we'd better make the API more easily extensible.)

Nothing in terms of implementation, yes; I did write up my idea of how this could be done by extending the current API in the future. I won't try to pretend that I am absolutely sure that the API might not need some breaking change to do that, though.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Thu, Jul 16, 2015 at 10:54 AM, Tom Lane t...@sss.pgh.pa.us wrote: Ildus Kurbangaliev i.kurbangal...@postgrespro.ru writes: I made a benchmark of gettimeofday(). I believe it is certainly usable for monitoring. Testing configuration: 24 cores, Intel Xeon CPU X5675 @ 3.07GHz, RAM 24 GB. 54179703 - microseconds total; 2147483647 - (INT_MAX), the number of gettimeofday() calls; 54179703 / 2147483647.0 = 0.025229390256679331. Here we have the average duration of one gettimeofday() in microseconds.

25 nsec per gettimeofday() is in the same ballpark as what I measured on a new-ish machine last year:
http://www.postgresql.org/message-id/flat/31856.1400021...@sss.pgh.pa.us

The problem is that (a) on modern hardware that is not a small number, it's the equivalent of 100 or more instructions; and (b) the results look very much worse on less-modern hardware, particularly machines where gettimeofday requires a kernel call.

Yes, we've been through this many times before. All you have to do is look at how much slower a query gets when you run EXPLAIN ANALYZE vs. when you run it without EXPLAIN ANALYZE. The slowdown there is platform-dependent, but I think it's significant even on platforms where gettimeofday is fast, like modern Linux machines. That overhead is precisely the reason why we added EXPLAIN (ANALYZE, TIMING OFF) - so that if you want to, you can see the row-count estimates without incurring the timing overhead.

There is *plenty* of evidence that using gettimeofday in contexts where it may be called many times per query measurably hurts performance. It is possible that we can have an *optional feature* where timing can be turned on, but it is dead certain that turning it on unconditionally will be unacceptable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Retrieve the snapshot's LSN
On Thu, Jul 16, 2015 at 12:54 PM, Andres Freund and...@anarazel.de wrote: Well, in combination with logical decoding it kinda has one: It should allow you to take a dump of the database with a certain snapshot and replay all transactions with a commit LSN bigger than the snapshot's LSN and end up with a continually consistent database. Visibility for HS actually works precisely in commit LSN order, even if that is possibly different than on the primary...

That makes sense, and hopefully answers Florent's question about why this is only exposed through the slot mechanism.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Retrieve the snapshot's LSN
On 2015-07-16 12:40:07 -0400, Robert Haas wrote: On Wed, Jul 15, 2015 at 12:51 PM, Florent Guiliani flor...@guiliani.fr wrote: During slot creation, the snapshot building and exporting code seems highly coupled with the logical decoding stuff. It doesn't seem very reusable for retrieving the snapshot's LSN outside of logical decoding.

I don't think the snapshot's LSN has a well-defined meaning in general. The obvious meaning would be the LSN such that all commits prior to that LSN are visible and all later commits are invisible, but such an LSN need not exist. Suppose A writes a commit record at LSN 0/1, and then B writes a commit record at 0/10100, and then B calls ProcArrayEndTransaction(). At this point, B is visible and A is not visible, even though A's commit record precedes that of B.

Well, in combination with logical decoding it kinda has one: It should allow you to take a dump of the database with a certain snapshot and replay all transactions with a commit LSN bigger than the snapshot's LSN and end up with a continually consistent database. Visibility for HS actually works precisely in commit LSN order, even if that is possibly different than on the primary...
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Thu, Jul 16, 2015 at 9:41 AM, Fujii Masao masao.fu...@gmail.com wrote: Here are some minor comments:

+				ereport(LOG,
+						(errmsg("ignoring \"%s\" file because no \"%s\" file exists",
+								TABLESPACE_MAP, BACKUP_LABEL_FILE),
+						 errdetail("could not rename file \"%s\" to \"%s\": %m",
+								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));

WARNING is better than LOG here because it indicates a problematic case?

No, that's not the right distinction. Remember that, when sending messages to the client, WARNING > LOG, and when sending messages to the log, LOG > WARNING. So messages that a user is more likely to care about than the administrator should be logged at WARNING; those that the administrator is more likely to care about should be LOG. I think LOG is clearly the appropriate thing here.

In the detail message, the first word of the sentence needs to be capitalized.

+						 errdetail("renamed file \"%s\" to \"%s\"",
+								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));

In a detail message, we should basically use a complete sentence. So like other similar detail messages in xlog.c, I think that it's better to use "\"%s\" was renamed to \"%s\"." as the detail message here.

Right, that's what the style guidelines say.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Retrieve the snapshot's LSN
On Wed, Jul 15, 2015 at 12:51 PM, Florent Guiliani flor...@guiliani.fr wrote: During slot creation, the snapshot building and exporting code seems highly coupled with the logical decoding stuff. It doesn't seem very reusable for retrieving the snapshot's LSN outside of logical decoding. I don't think the snapshot's LSN has a well-defined meaning in general. The obvious meaning would be the LSN such that all commits prior to that LSN are visible and all later commits are invisible, but such an LSN need not exist. Suppose A writes a commit record at LSN 0/1, and then B writes a commit record at 0/10100, and then B calls ProcArrayEndTransaction(). At this point, B is visible and A is not visible, even though A's commit record precedes that of B. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
Robert Haas wrote: On Thu, Jul 16, 2015 at 9:41 AM, Fujii Masao masao.fu...@gmail.com wrote: Here are some minor comments: +ereport(LOG, +(errmsg("ignoring \"%s\" file because no \"%s\" file exists", +TABLESPACE_MAP, BACKUP_LABEL_FILE), + errdetail("could not rename file \"%s\" to \"%s\": %m", + TABLESPACE_MAP, TABLESPACE_MAP_OLD))); WARNING is better than LOG here because it indicates a problematic case? No, that's not the right distinction. Remember that, when sending messages to the client, WARNING > LOG, and when sending messages to the log, LOG > WARNING. So messages that a user is more likely to care about than the administrator should be logged at WARNING; those that the administrator is more likely to care about should be LOG. I think LOG is clearly the appropriate thing here. Right. Now that we have errors marked with criticality, it will be pretty obvious what message is indicating a system problem rather than just a problem that can be ignored without taking any action. ... oh, actually we don't have the criticality marking yet, do we. I think we need to fix that at some point. (Some guys have said to grep the log for certain SQLSTATE codes, but, uh, really?) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] [PATCH] postgres_fdw extension support
Michael, thanks so much for the review! On July 15, 2015 at 7:35:11 PM, Michael Paquier (michael.paqu...@gmail.com) wrote: This patch includes some diff noise, it would be better to remove that. Done. - if (!is_builtin(fe->funcid)) + if (!is_builtin(fe->funcid) && (!is_in_extension(fe->funcid, fpinfo))) Extra parentheses are not needed. OK, will remove. + if ( (!is_builtin(oe->opno)) && (!is_in_extension(oe->opno, fpinfo)) ) ... And this does not respect the project code format. See here for more details for example: http://www.postgresql.org/docs/devel/static/source.html I’m sorry, that link doesn’t clarify for me what’s stylistically wrong here (it’s almost all about error messages), could you help (is it the padding around the conditionals? removed that) + /* PostGIS metadata */ + List *extensions; + bool use_postgis; + Oid postgis_oid; This addition in PgFdwRelationInfo is surprising. What's the point of keeping use_postgis and postgis_oid when they are actually used nowhere? Whoops, a couple of old pieces from my proof-of-concept survived the conversion to a generic feature. Removed. appendStringInfo(buf, "::%s", - format_type_with_typemod(node->consttype, - node->consttypmod)); + format_type_be_qualified(node->consttype)); What's the reason for this change? Good question. As I recall, the very sparse search path the FDW connection makes can sometimes leave remote functions failing to find other functions they need, so we want to force the calls to be schema-qualified. Unfortunately there’s no perfect public call for that; ideally it would be return format_type_internal(type_oid, typemod, true, false, true), but there’s no function like that, so I settled for format_type_be_qualified(), which forces qualification at the expense of ignoring the typmod. Thinking a bit wider, why is this just limited to extensions? You may have as well other objects defined locally and remotely like functions or operators that are not defined in extensions, but as normal objects.
Hence I think that what you are looking for is not this option, but a boolean option enforcing the behavior of code paths using is_builtin() in foreign_expr_walker such that the sanity checks on existing objects are not limited to FirstBootstrapObjectId but extend to other objects in the catalogs. Well, as I see it there are three broad categories of behavior available: 1- Forward nothing non-built-in (current behavior) 2- Use options to forward only specified non-built-in things (either in chunks (extensions, as in this patch) or one-by-one (mark your desired functions / ops)) 3- Forward everything if a “forward everything” option is set I hadn’t actually considered the possibility of option 3, but for my purposes it would work just as well, with the added efficiency bonus of not having to check whether particular funcs/ops are inside declared extensions. Both the current state of FDW and the patch I’m providing expect a *bit* of smarts on the part of users, to make sure their remote/local environments are in sync (in particular versions of pgsql and of extensions). Option 3 just ups the ante on that requirement. I’d be fine w/ this; it makes the patch very simple and fast. For option 2, marking things one at a time really isn’t practical for a package like PostGIS, with several hundred functions and operators. Using the larger block of “extension” makes more sense. I think in general, expecting users to itemize every func/op they want to forward, particularly if they just want an extension to “work” over FDW, is too big an expectation. That’s not to minimize the utility of being able to mark individual functions/ops for forwarding, but I think that’s a separate use case that doesn’t eliminate the need for extension-level forwarding. Thanks again for the review! P.
That's a risky option because it could lead to inconsistencies among objects, so obviously the default is false, but by correctly documenting the risks of using this option we may be able to get something integrated (aka be sure that each object is defined consistently across the servers queried or you'll have surprising results!). In short, it seems to me that you are taking the wrong approach. -- fdw-extension-support-2.diff Description: Binary data
Re: [HACKERS] proposal: multiple psql option -c
On Thu, Jul 16, 2015 at 4:42 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Hi, can we support multiple -c options? Why? Because some statements like VACUUM cannot be used together with any other statements with a single -c option. The current solution is using echo and a pipe, but it is a complication in some complex scripts - a higher complication when you run psql via multiple sudo statements. Example: psql -c "select pg_stat_reset()" -c "vacuum full analyze" dbname or on all dbs psql -At -c "select datname from pg_database" postgres | \ xargs -n 1 -P 3 psql -c ... -c ... Ideas, notes, comments? Why do you want it if we already have the -f option that covers this use case? Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog: http://fabriziomello.github.io Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello Github: http://github.com/fabriziomello
[HACKERS] proposal: multiple psql option -c
Hi, can we support multiple -c options? Why? Because some statements like VACUUM cannot be used together with any other statements with a single -c option. The current solution is using echo and a pipe, but it is a complication in some complex scripts - a higher complication when you run psql via multiple sudo statements. Example: psql -c "select pg_stat_reset()" -c "vacuum full analyze" dbname or on all dbs psql -At -c "select datname from pg_database" postgres | \ xargs -n 1 -P 3 psql -c ... -c ... Ideas, notes, comments? Regards Pavel
Re: [HACKERS] proposal: multiple psql option -c
2015-07-16 23:10 GMT+02:00 Rosser Schwarz rosser.schw...@gmail.com: On Thu, Jul 16, 2015 at 1:44 PM, Pavel Stehule pavel.steh...@gmail.com wrote: 2015-07-16 22:07 GMT+02:00 Fabrízio de Royes Mello fabriziome...@gmail.com: Why do you want it if we already have the -f option that covers this use case? It doesn't help me - we want to run a script, possibly remotely (via ssh), without the necessity of creating (and later dropping) files on production servers. Does piping a series of commands into psql work in your scenario? You can even say things like: cat $local_file | ssh $production_server 'psql $database' probably not - the first remote command is sudo su - due to security reasons -- :wq
Re: [HACKERS] proposal: multiple psql option -c
2015-07-16 22:07 GMT+02:00 Fabrízio de Royes Mello fabriziome...@gmail.com : On Thu, Jul 16, 2015 at 4:42 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Hi, can we support multiple -c options? Why? Because some statements like VACUUM cannot be used together with any other statements with a single -c option. The current solution is using echo and a pipe, but it is a complication in some complex scripts - a higher complication when you run psql via multiple sudo statements. Example: psql -c "select pg_stat_reset()" -c "vacuum full analyze" dbname or on all dbs psql -At -c "select datname from pg_database" postgres | \ xargs -n 1 -P 3 psql -c ... -c ... Ideas, notes, comments? Why do you want it if we already have the -f option that covers this use case? It doesn't help me - we want to run a script, possibly remotely (via ssh), without the necessity of creating (and later dropping) files on production servers. Remote execution of scripts is much simpler if you don't need to create any script files. Regards Pavel Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog: http://fabriziomello.github.io Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello Github: http://github.com/fabriziomello
Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore
Peter Geoghegan p...@heroku.com writes: Attached patch adds a portable Postgres wrapper on the GCC intrinsic. Meh. I don't like the assumption that non-GCC compilers will be smart enough to optimize away the useless-to-them if() tests this adds. Please refactor that so that there is exactly 0 new code when the intrinsic doesn't exist. regards, tom lane
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On Tue, Jul 14, 2015 at 7:25 AM, Tom Lane t...@sss.pgh.pa.us wrote: FWIW, I entirely share Robert's opinion that adding gettimeofday() overhead in routinely-taken paths is likely not to be acceptable. I think that it can depend on many factors. For example, the availability of vDSO support on Linux/glibc. I've heard that clock_gettime() with CLOCK_REALTIME_COARSE, or with CLOCK_MONOTONIC_COARSE can have significantly lower overhead than gettimeofday(). -- Peter Geoghegan
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Thu, Jul 16, 2015 at 12:30 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Amit Kapila wrote: This can be tracked either in 9.5 Open Items or for next CF, any opinions? If nobody else has any opinion on this, I will add it to 9.5 Open Items list. I think this belongs in the open items list, yeah. Okay, added to the list of 9.5 Open Items. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Fri, Jul 17, 2015 at 1:28 AM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jul 16, 2015 at 9:41 AM, Fujii Masao masao.fu...@gmail.com wrote: Here are some minor comments: +ereport(LOG, +(errmsg("ignoring \"%s\" file because no \"%s\" file exists", +TABLESPACE_MAP, BACKUP_LABEL_FILE), + errdetail("could not rename file \"%s\" to \"%s\": %m", + TABLESPACE_MAP, TABLESPACE_MAP_OLD))); WARNING is better than LOG here because it indicates a problematic case? No, that's not the right distinction. Remember that, when sending messages to the client, WARNING > LOG, and when sending messages to the log, LOG > WARNING. So messages that a user is more likely to care about than the administrator should be logged at WARNING; those that the administrator is more likely to care about should be LOG. I think LOG is clearly the appropriate thing here. Isn't this rule confusing to administrators? ISTM that administrators would intuitively and literally pay more attention to WARNING than LOG. Also there are already some warning messages with WARNING level that administrators rather than clients should care about. For example, the warning message which is output when archive_command fails: ereport(WARNING, (errmsg("archiving transaction log file \"%s\" failed too many times, will try again later", xlog))); Regards, -- Fujii Masao
Re: [HACKERS] Memory Accounting v11
On 16 July 2015 at 05:07, Jeff Davis pg...@j-davis.com wrote: On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote: I think a heuristic would be more suited here and ignoring memory consumption for internal types means that we are not making the memory accounting useful for a set of usecases. OK, I will drop this patch and proceed with the HashAgg patch, with a heuristic for internal types. Should we mark the patch as returned with feedback in the commitfest app then? -- David Rowley http://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Add CINE for ALTER TABLE ... ADD COLUMN
On Fri, Jun 26, 2015 at 12:41 AM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote: On Wed, Jun 24, 2015 at 3:36 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Fabrízio de Royes Mello wrote: Another rebased version. There are a number of unrelated whitespace changes in this patch; also please update the comment on top of check_for_column_name_collision. Sorry, bad merging after a pgindent run. Comments on top of check_for_column_name_collision also updated. I had a look at this patch, and here are some minor comments: 1) In alter_table.sgml, you need a space here: [ IF NOT EXISTS ]<replaceable> 2) + check_for_column_name_collision(targetrelation, newattname, false); (void) needs to be added in front of check_for_column_name_collision where its return value is not checked, or static code analyzers are surely going to complain. 3) Something minor, some lines of code exceed 80 characters, see the declaration of check_for_column_name_collision for example... 4) This comment needs more precision: /* new name should not already exist */ - check_for_column_name_collision(rel, colDef->colname); + if (!check_for_column_name_collision(rel, colDef->colname, if_not_exists)) The new name can actually exist if if_not_exists is true. Except for that, the implementation looks sane to me. Regards, -- Michael
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Thu, Jul 16, 2015 at 9:54 PM, Fujii Masao masao.fu...@gmail.com wrote: Isn't this rule confusing the administrators? I'd like to think not, but yeah, it probably is. It is not like it isn't documented. There are even comments in postgresql.conf explaining it. But the fact that we have committers who are confused probably isn't a great sign. I really rather like the system we have and find it quite intuitive, but it's obvious that a lot of other people don't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
Peter Geoghegan p...@heroku.com writes: I've heard that clock_gettime() with CLOCK_REALTIME_COARSE, or with CLOCK_MONOTONIC_COARSE can have significantly lower overhead than gettimeofday(). It can, but it also has *much* lower precision, typically 1ms or so. regards, tom lane
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Wed, Jul 1, 2015 at 9:31 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Jul 1, 2015 at 2:21 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/29/2015 09:44 AM, Michael Paquier wrote: On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote: But we'll still need to handle the pg_xlog symlink case somehow. Perhaps it would be enough to special-case pg_xlog for now. Well, sure, pg_rewind does not copy the soft links either way. Now it would be nice to have an option to be able to recreate the soft link of at least pg_xlog even if it can be scripted as well after a run. Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In any non-trivial scenario, just copying all the files from pg_xlog isn't enough anyway, and you need to set up a recovery.conf after running pg_rewind that contains a restore_command or primary_conninfo, to fetch the WAL. So you can argue that by not copying pg_xlog automatically, we're actually doing a favour to the DBA, by forcing him to set up the recovery.conf file correctly. Because if you just test simple scenarios where not much time has passed between the failover and running pg_rewind, it might be enough to just copy all the WAL currently in pg_xlog, but it would not be enough if more time had passed and not all the required WAL is present in pg_xlog anymore. And by not copying the WAL, we can avoid some copying, as restore_command or streaming replication will only copy what's needed, while pg_rewind would copy all the WAL it can find in the target's data directory. pg_basebackup also doesn't include any WAL, unless you pass the --xlog option. It would be nice to also add an optional --xlog option to pg_rewind, but with pg_rewind it's possible that all the required WAL isn't present in the pg_xlog directory anymore, so you wouldn't always achieve the same effect of making the backup self-contained. So, I propose the attached. It makes pg_rewind ignore the pg_xlog directory in both the source and the target.
If pg_xlog is simply ignored, some old WAL files may remain in the target server. Don't these old files cause the subsequent startup of the target server as a new standby to fail? That is, it's the case where a WAL file with the same name but different content exists both in target and source. If that's harmful, pg_rewind also should remove the files in pg_xlog of the target server. This would reduce usability. The rewound node will replay WAL from the previous checkpoint where WAL forked up to the minimum recovery point of the source node where pg_rewind has been run. Hence if we remove completely the contents of pg_xlog we'd lose a portion of the logs that need to be replayed until the timeline is switched on the rewound node when recovering it (while streaming from the promoted standby, or whatever). I don't really see why recycled segments would be a problem, as that's perhaps what you are referring to, but perhaps I am missing something. Attached is a rebased version of the previous patch to ignore the contents of pg_xlog/ when rewinding. -- Michael 20150717_pgrewind_ignore_xlog.patch Description: binary/octet-stream
Re: [HACKERS] proposal: multiple psql option -c
it is one possible solution too multiple -c option has advantage of simple evaluation of backslash statements .. -c \l -c \dt - but this advantage is not highly important. Or just properly understand the ; ? -c "select * from foo; update bar set baz = 'bing'; vacuum bar;" JD Pavel Best Regards, Dinesh manojadinesh.blogspot.com http://manojadinesh.blogspot.com Regards Pavel -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Centered full stack support, consulting and development. Announcing I'm offended is basically telling the world you can't control your own emotions, so everyone else should do it for you.
Re: [HACKERS] proposal: multiple psql option -c
2015-07-17 6:26 GMT+02:00 Joshua D. Drake j...@commandprompt.com: it is one possible solution too multiple -c option has advantage of simple evaluation of backslash statements .. -c \l -c \dt - but this advantage is not highly important. Or just properly understand the ; ? -c "select * from foo; update bar set baz = 'bing'; vacuum bar;" there is a risk of compatibility issues - all statements run under one transaction implicitly Pavel JD Pavel Best Regards, Dinesh manojadinesh.blogspot.com http://manojadinesh.blogspot.com Regards Pavel -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Centered full stack support, consulting and development. Announcing I'm offended is basically telling the world you can't control your own emotions, so everyone else should do it for you.
Re: [HACKERS] Bugs in our qsort implementation
Peter Geoghegan p...@heroku.com writes: On Thu, Jul 16, 2015 at 5:05 PM, Tom Lane t...@sss.pgh.pa.us wrote: It's possible that this issue can only manifest on 9.4 and up where we have the ability for tuplesort to allocate work arrays approaching INT_MAX elements. But I don't have a lot of faith in that; I think the worst-case stack depth for the way we have it now could be as bad as O(N), so in principle a crash could be possible with significantly smaller input arrays. I think we'd better back-patch this all the way. +1. If you want to generate a worst case, McIlroy wrote a program that will generate one [1]. AFAIR, it will generate a series of self-consistent comparisons in the gas comparator that produce a worst-case outcome (as opposed to producing a simple worst-case input, which would be more convenient in this kind of scenario). This is known specifically to affect the Bentley McIlroy implementation, as the paper goes into. [1] http://www.cs.dartmouth.edu/~doug/mdmspe.pdf Wow, interesting article. I stuck that into a test program with our qsort, and found out that (1) our presort check totally beats that oracle, because it causes the oracle to freeze the gas in perfectly sequential order, so that the presort run gets all the way through and decides that the input is sorted. Hence, N-1 comparisons and no recursion. (2) if you dike out the presort check, you do see quadratic comparison behavior but the max recursion depth is only 1. Apparently, the oracle effectively always makes the right-hand partition as large as possible, so that recursing on the left partition is the optimal policy as far as stack depth goes.
However, you can trivially modify this oracle to break our code: just negate its comparison result, ie - return val[x] - val[y]; /* only the sign matters */ + return -(val[x] - val[y]); /* only the sign matters */ This stops the presort check immediately, since now the data looks to be reverse-sorted instead of correctly ordered; and now it makes the left-hand partition as large as possible, instead of the right. With our unmodified code and the tweaked oracle, I see maximum recursion depth in the vicinity of N/4. So it's definitely possible to get O(N) stack growth without a fix, though it may take very unusual input. With the correction to recurse to the smaller partition, we get max recursion depth of just 1, since that always chooses a very small partition to recurse to. regards, tom lane
Re: [HACKERS] proposal: multiple psql option -c
2015-07-17 0:03 GMT+02:00 dinesh kumar dineshkuma...@gmail.com: On Thu, Jul 16, 2015 at 12:42 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Hi, can we support multiple -c options? Why? Because some statements like VACUUM cannot be used together with any other statements with a single -c option. The current solution is using echo and a pipe, but it is a complication in some complex scripts - a higher complication when you run psql via multiple sudo statements. Example: psql -c "select pg_stat_reset()" -c "vacuum full analyze" dbname or on all dbs psql -At -c "select datname from pg_database" postgres | \ xargs -n 1 -P 3 psql -c ... -c ... Ideas, notes, comments? IMO, rather than having multiple -c args, it would be good to have another flag like -C which accepts and executes multiple SQL statements sequentially. it is one possible solution too multiple -c option has advantage of simple evaluation of backslash statements .. -c \l -c \dt - but this advantage is not highly important. Pavel Best Regards, Dinesh manojadinesh.blogspot.com Regards Pavel
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Thu, Jul 16, 2015 at 11:10 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jul 16, 2015 at 1:32 PM, Simon Riggs si...@2ndquadrant.com wrote: * Developers will immediately understand the format I doubt it. I think any format that we pick will have to be carefully documented. People may know what JSON looks like in general, but they will not immediately know what bells and whistles are available in this context. I also think any format where the user has to carefully remember how to provide the values is not user-friendly, which is why I wonder whether an SQL-based syntax wouldn't be preferable here; with that we could even achieve consistency of this parameter across all servers. That is not of utmost importance for this feature, but I still think it would make users happy. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] auto_explain sample rate
On 7 July 2015 at 21:37, Julien Rouhaud julien.rouh...@dalibo.com wrote: Well, I obviously missed that pg_srand48() is only used if the system lacks random/srandom, sorry for the noise. So yes, random() must be used instead of pg_lrand48(). I'm attaching a new version of the patch fixing this issue just in case. Thanks for picking this up. I've been trying to find time to come back to it but been swamped in priority work. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
On Thu, Jul 16, 2015 at 4:01 PM, Peter Geoghegan p...@heroku.com wrote: Attached patch adds a portable Postgres wrapper on the GCC intrinsic. It also adds a client within tuplesort.c -- a common function that seems like a good generic choke point. I can get a speed-up of 6% - 9% for all of these cases by prefetching a few SortTuples ahead, for the tuple proper. (In-memory sorts only.) I added a silly bug during last minute clean-up. I attach a V2. -- Peter Geoghegan From 5d3093e9c0d10f4140072da8d18d0dddf6b84b1e Mon Sep 17 00:00:00 2001 From: Peter Geoghegan peter.geoghega...@gmail.com Date: Sun, 12 Jul 2015 13:14:01 -0700 Subject: [PATCH] Prefetch from memtuples array in tuplesort Testing shows that prefetching the tuple proper of a slightly later SortTuple in the memtuples array during each of many sequential, in-logical-order SortTuple fetches speeds up various sorting intense operations considerably. For example, B-Tree index builds are accelerated as leaf pages are created from the memtuples array. (i.e. The operation following actually performing the sort, but before a tuplesort_end() call is made as a B-Tree spool is destroyed.) Similarly, ordered set aggregates (all cases except the datumsort case with a pass-by-value type), and regular heap tuplesorts benefit to about the same degree. The optimization is only used when sorts fit in memory, though. Also, prefetch a few places ahead within the analogous fetching point in tuplestore.c. This appears to offer similar benefits in certain cases. For example, queries involving large common table expressions significantly benefit. 
--- config/c-compiler.m4| 17 + configure | 30 ++ configure.in| 1 + src/backend/utils/sort/tuplesort.c | 18 ++ src/backend/utils/sort/tuplestore.c | 11 +++ src/include/c.h | 19 +++ src/include/pg_config.h.in | 3 +++ src/include/pg_config.h.win32 | 3 +++ 8 files changed, 102 insertions(+) diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 050bfa5..5776201 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -287,6 +287,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE +# PGAC_C_BUILTIN_PREFETCH +# - +# Check if the C compiler understands __builtin_prefetch(), +# and define HAVE__BUILTIN_PREFETCH if so. +AC_DEFUN([PGAC_C_BUILTIN_PREFETCH], +[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch, +[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], +[int i = 0;__builtin_prefetch(i, 0, 3);])], +[pgac_cv__builtin_prefetch=yes], +[pgac_cv__builtin_prefetch=no])]) +if test x$pgac_cv__builtin_prefetch = xyes ; then +AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1, + [Define to 1 if your compiler understands __builtin_prefetch.]) +fi])# PGAC_C_BUILTIN_PREFETCH + + + # PGAC_C_VA_ARGS # -- # Check if the C compiler understands C99-style variadic macros, diff --git a/configure b/configure index 77048e8..f14e65d 100755 --- a/configure +++ b/configure @@ -11248,6 +11248,36 @@ if test x$pgac_cv__builtin_unreachable = xyes ; then $as_echo #define HAVE__BUILTIN_UNREACHABLE 1 confdefs.h fi +{ $as_echo $as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch 5 +$as_echo_n checking for __builtin_prefetch... 6; } +if ${pgac_cv__builtin_prefetch+:} false; then : + $as_echo_n (cached) 6 +else + cat confdefs.h - _ACEOF conftest.$ac_ext +/* end confdefs.h. 
*/ + +int +main () +{ +int i = 0;__builtin_prefetch(i, 0, 3); + ; + return 0; +} +_ACEOF +if ac_fn_c_try_compile $LINENO; then : + pgac_cv__builtin_prefetch=yes +else + pgac_cv__builtin_prefetch=no +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +fi +{ $as_echo $as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch 5 +$as_echo $pgac_cv__builtin_prefetch 6; } +if test x$pgac_cv__builtin_prefetch = xyes ; then + +$as_echo #define HAVE__BUILTIN_PREFETCH 1 confdefs.h + +fi { $as_echo $as_me:${as_lineno-$LINENO}: checking for __VA_ARGS__ 5 $as_echo_n checking for __VA_ARGS__... 6; } if ${pgac_cv__va_args+:} false; then : diff --git a/configure.in b/configure.in index 5724a4d..2132d69 100644 --- a/configure.in +++ b/configure.in @@ -1318,6 +1318,7 @@ PGAC_C_TYPES_COMPATIBLE PGAC_C_BUILTIN_BSWAP32 PGAC_C_BUILTIN_CONSTANT_P PGAC_C_BUILTIN_UNREACHABLE +PGAC_C_BUILTIN_PREFETCH PGAC_C_VA_ARGS PGAC_STRUCT_TIMEZONE PGAC_UNION_SEMUN diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 435041a..70b53d1 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -1663,6 +1663,24 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward, if (state-current state-memtupcount) { *stup = state-memtuples[state-current++]; + + /* + * Perform memory prefetch of tuple proper of the + * SortTuple that's
Re: [HACKERS] [PATCH] postgres_fdw extension support
On Fri, Jul 17, 2015 at 1:08 AM, Paul Ramsey wrote:

+ if ( (!is_builtin(oe->opno)) && (!is_in_extension(oe->opno, fpinfo)) )

... And this does not respect the project code format. See here for more details, for example: http://www.postgresql.org/docs/devel/static/source.html

I'm sorry, that link doesn't clarify for me what's stylistically wrong here (it's almost all about error messages). Could you help? (Is it the padding around the conditionals? Removed that.)

Two things: 1) Spaces between the parentheses at the beginning and the end of the expression. 2) The parentheses around is_in_extension are not really necessary.

appendStringInfo(buf, "::%s", - format_type_with_typemod(node->consttype, - node->consttypmod)); + format_type_be_qualified(node->consttype));

What's the reason for this change?

Good question. As I recall, the very sparse search path the FDW connection uses can sometimes leave remote functions failing to find other functions they need, so we want to force the calls to be schema-qualified. Unfortunately there's no perfect public call for that; ideally it would be format_type_internal(type_oid, typemod, true, false, true), but no such function is exposed, so I settled for format_type_be_qualified(), which forces qualification at the expense of ignoring the typmod.

Hm. I don't have my mind wrapped around that yet, but this needs to be checked with data types like char or varchar. It may matter.

Thinking a bit wider, why is this limited to extensions? You may have other objects defined locally and remotely, like functions or operators, that are not part of extensions but are normal objects. Hence I think that what you are looking for is not this option, but a boolean option enforcing the behavior of code paths using is_builtin() in foreign_expr_walker, so that the sanity checks on existing objects are not limited to FirstBootstrapObjectId but extend to other objects in the catalogs.
Well, as I see it, there are three broad categories of behavior available:

1- Forward nothing non-built-in (current behavior)
2- Use options to forward only specified non-built-in things, either in chunks (extensions, as in this patch) or one-by-one (mark your desired functions/ops)
3- Forward everything if a "forward everything" option is set

So what you are describing here is a parameter that switches among these modes: nothing (check only built-in stuff), an extension list, or all.

I hadn't actually considered the possibility of option 3, but for my purposes it would work just as well, with the added efficiency bonus of not having to check whether particular funcs/ops are inside declared extensions. Both the current state of the FDW and the patch I'm providing expect a *bit* of smarts on the part of users, to make sure their remote/local environments are in sync (in particular, the versions of pgsql and of extensions). Option 3 just ups the ante on that requirement. I'd be fine with this; it makes the patch very simple and fast.

Yeah, perhaps too simple though :) You may want something that is more advanced. For example, with a set of objects you may want to decide which ones to push down and which ones to keep evaluated locally.

For option 2, marking things one at a time really isn't practical for a package like PostGIS, with several hundred functions and operators. Using the larger block of "extension" makes more sense. I think in general, expecting users to itemize every func/op they want to forward, particularly if they just want an extension to "work" over FDW, is too big an expectation. That's not to minimize the utility of being able to mark individual functions/ops for forwarding, but I think that's a separate use case that doesn't eliminate the need for extension-level forwarding.

OK, that's good to know. Perhaps then using a parameter with an extension list is a good balance.
Now, would there be practical cases where it is actually useful to have an extension list? For example, let's say that there are two extensions installed: foo1 and foo2. Do you think (or know) if some users would be willing to include only the objects of foo1, but not those of foo2? Also, could it be possible to consider an exclude list? That is, ignoring all the objects in the given extension list and allowing the rest. I am just trying to consider all the possibilities here.

Thanks again for the review! Good to see that you added it in the CF app: https://commitfest.postgresql.org/6/304/

Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] optimizing vacuum truncation scans
On Thu, Jul 16, 2015 at 10:17 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Jul 16, 2015 at 11:21 AM, Haribabu Kommi kommi.harib...@gmail.com wrote:

Here I attached updated patches: 1) without prefetch logic; 2) with combined vm and prefetch logic.

I think it is better to just get the first patch in, as that in itself is a clear win, and then we can evaluate the second one (prefetching during truncation) separately. I think that after the first patch, the case for prefetching seems less beneficial, and I think it could hurt by loading unneeded pages into the OS buffer cache (say it has prefetched 32 pages and, after inspecting the first one, decides to quit the loop because that page is non-empty; or it has prefetched a page just because that page's visibility map bit was cleared while the bits for the others are set).

Yes, in the above cases the prefetch is an overhead. I am not sure how frequently such scenarios occur in practice. Having said that, I am not against prefetching in this scenario, as it can help in more cases than it could hurt, but I think it will be better to evaluate that separately.

Yes, the prefetch patch works better in the cases where the first vacuum skips the truncation because of lock waiters. If such cases are common in practice, then the prefetch patch is needed.

Regards, Hari Babu Fujitsu Australia -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore
I've always thought that GCC intrinsics for software-based memory prefetching are a very sharp tool. While it's really hard to use GCC's __builtin_prefetch() effectively (I've tried enough times to know), I always thought that there'd probably end up being a handful of cases where using it presented a clear and worthwhile win. Hash joins seemed to me to be a likely candidate, because memory latency is a real killer there.

However, I stumbled upon another case where prefetching pays clear dividends while actually being quite simple: sequentially reading an already-sorted memtuples array (an array of SortTuple structs) during some kind of post-processing. This comes up for many common and important cases, including regular sort nodes (heaptuple sorts), datum sorts with pass-by-reference types (e.g. ordered set aggregates), and B-Tree index builds, where the sequential fetching occurs as actual B-Tree leaf pages are built from the array.

Patch
=====

Attached patch adds a portable Postgres wrapper on the GCC intrinsic. It also adds a client within tuplesort.c -- a common function that seems like a good generic choke point. I can get a speed-up of 6% - 9% for all of these cases by prefetching a few SortTuples ahead, for the tuple proper. (In-memory sorts only.)

ISTM that this alone makes the patch worthwhile, but we should still consider the break-down of costs in each of these operations, given that the cost of sequentially reading and processing SortTuples is all the patch touches. There are only about 3 real lines of code added to tuplesort.c.

Obviously, the dominant costs within each of these queries/utility commands are the sort proper and, to a lesser extent, the heap scan preceding it. I am not targeting those at all -- I'm only targeting the tertiary cost of sequentially scanning the memtuples array by hiding memory latency, and yet I end up with a notable overall speed-up.
It's only really fair to look at this final sequential fetching cost in isolation. The reduction in that cost, taken in isolation, seems to be huge. I tried out the proprietary Intel VTune Amplifier utility with this case, and it indicated that CPU time spent in calls to _bt_buildadd was less than 1/8 of the time in a comparable test of the master branch, with no other changes that I can attribute to this patch (i.e. there is a bit of noise). You can easily get some hint of the size of the improvement by reviewing trace_sort = on log output, and the reduction in time spent after performsort is done but before the sort is shut down.

Tuplestores
===========

tuplestore.c has a similar optimization added, on the theory that it should match tuplesort.c wherever possible, and because it seems to help a lot there too. For example, queries that perform a big CTE Scan seem to benefit to about the same degree (although, again, only when the array is fully in memory).

Portability concerns
====================

I started writing this patch with an intuition that this would help. I arrived at the fixed-up version posted here by frobbing the code based on feedback from Postgres running on my laptop. I iteratively tweaked the number of tuples to fetch ahead, and the value of __builtin_prefetch()'s temporal locality argument, which helped a little for a number of test cases. If that doesn't inspire much confidence in this patch, then good -- that's the idea. Obviously, if we are going to commit this, there needs to be testing on a reasonably representative selection of platforms. On the other hand, I couldn't find a case that was regressed, and I am encouraged by the fact that this helps the post-processing of SortTuples so effectively. A platform would have to have very different performance characteristics for this to be the wrong thing.
However, it seems possible that there is a case that has been regressed, since the prefetch is added at a very generic point within tuplesort.c and tuplestore.c.

-- Peter Geoghegan

From 3c44ec9a9c68da3408716e06d0e822acfc68a746 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan peter.geoghega...@gmail.com
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH] Prefetch from memtuples array in tuplesort

Testing shows that prefetching the tuple proper of a slightly later SortTuple in the memtuples array, during each of many sequential, in-logical-order SortTuple fetches, considerably speeds up various sort-intensive operations. For example, B-Tree index builds are accelerated as leaf pages are created from the memtuples array (i.e. the operation following the sort proper, but before a tuplesort_end() call is made as a B-Tree spool is destroyed). Similarly, ordered set aggregates (all cases except the datum sort case with a pass-by-value type) and regular heap tuplesorts benefit to about the same degree. The optimization is only used when sorts fit in memory, though.

Also, prefetch a few places ahead within the analogous fetching point in tuplestore.c.
Re: [HACKERS] Bugs in our qsort implementation
On Thu, Jul 16, 2015 at 5:05 PM, Tom Lane t...@sss.pgh.pa.us wrote: It's possible that this issue can only manifest on 9.4 and up, where we have the ability for tuplesort to allocate work arrays approaching INT_MAX elements. But I don't have a lot of faith in that; I think the worst-case stack depth for the way we have it now could be as bad as O(N), so in principle a crash could be possible with significantly smaller input arrays. I think we'd better back-patch this all the way.

+1.

If you want to generate a worst case, McIlroy wrote a program that will do it [1]. AFAIR, it generates a series of self-consistent comparisons in its "gas" comparator that produce a worst-case outcome (as opposed to producing a simple worst-case input, which would be more convenient in this kind of scenario). This is known to specifically affect the Bentley/McIlroy implementation, as the paper goes into.

[1] http://www.cs.dartmouth.edu/~doug/mdmspe.pdf

-- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Size of TOAST pointer
Vignesh Raghunathan vignesh.pg...@gmail.com writes: I was looking at the documentation for TOAST ( http://www.postgresql.org/docs/devel/static/storage-toast.html) and it's specified that the toast pointer occupies 18 bytes. However, the struct representing the toast pointer is defined as follows typedef struct varatt_external The data that actually ends up on disk is a varattrib_1b_e wrapping a varatt_external, and that header is where the extra 2 bytes come from. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers