Re: SegFault on 9.6.14
On Sat, Jul 27, 2019 at 8:29 AM Thomas Munro wrote: > > On Fri, Jul 26, 2019 at 4:13 PM Amit Kapila wrote: > > On Tue, Jul 23, 2019 at 5:28 PM Amit Kapila wrote: > > > Right, that will be lesser code churn and it can also work. However, > > > one thing that needs some thought is till now es_top_eflags is only > > > set in ExecutorStart and same is mentioned in comments where it is > > > declared and it seems we are going to change that with this idea. How > > > about having a separate function ExecBlahShutdown which will clean up > > > resources as parallel context and can be called only from ExecutePlan > > > where we are calling ExecShutdownNode? I think both these and the > > > other solution we have discussed are on similar lines and another idea > > > could be to relax the assert which again is not a superb idea. > > > > It seems we don't have a clear preference for any particular solution > > among these and neither there appears to be any better idea. I guess > > we can wait for a few days to see if Robert has any views on this, > > otherwise, pick one of the above and move ahead. > > I take the EXEC_FLAG_DONE idea back. It's ugly and too hard to verify > that every appropriate path sets it, and a flag that means the > opposite would be even more of a kluge, and generally I think I was > looking at this too myopically: I was looking for a way to shut down > processes ASAP without giving up the shared memory we'll need for > rescanning, but what I should have been looking at is the reason you > did that in the first place: to get the instrumentation data. Can you > explain why it's necessary to do that explicitly for Limit? Wouldn't > the right place to collect instrumentation be at the end of execution > when Shutdown will run in all cases anyway (and possibly also during > ExecParallelReinitialize() or something like that if it's being > clobbered by rescans, I didn't check)? What's special about Limit? 
I think here you are missing the point that to collect the instrumentation information one also needs to use InstrStartNode and InstrStopNode. So, for the Limit node, InstrStopNode would already have been done by the time we call the shutdown of workers at the end of execution. For a bit more detail, see [1][2][3].

> Today while poking at this and trying to answer those questions for
> myself, I realised that the repro I posted earlier[1] crashes exactly
> as Jerry reported on REL9_6_STABLE, but in later release branches it
> runs to completion. That's because the crashing code was removed in
> commit 41b0dd98 "Separate reinitialization of shared parallel-scan
> state from ExecReScan.".
>
> So newer branches get past that problem, but they all spit out tons of
> each of these three warnings:
>
> WARNING: buffer refcount leak: [172] (rel=base/12558/16390,
> blockNum=5, flags=0x9380, refcount=1 2998)
> ...
> WARNING: relcache reference leak: relation "join_bar" not closed
> ...
> WARNING: Snapshot reference leak: Snapshot 0x7ff20383bfb0 still referenced
> ...
>
> Oops.

This is due to exactly the same problem: before rescans, we have destroyed the shared memory. If you do the earlier trick of not cleaning up the shared memory till ExecEndNode, then you won't see this problem.

[1] - https://www.postgresql.org/message-id/CAA4eK1KZEbYKj9HHP-6WqqjAXuoB%2BWJu-w1s9uovj%3DeeBxC48Q%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CA%2BTgmoY3kcTcc5bFCZeY5NMFna-xaMPuTHA-z-z2Bmfg%2Bdb-XQ%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAA4eK1L0KAZWgnRJz%3DVNVpyS3FFbVh8E5egyziaR0E10bC204Q%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: proposal: type info support functions for functions that use "any" type
On Fri, Jul 26, 2019 at 10:53 PM Tom Lane wrote:
> I wrote:
> > TBH, I don't like this proposal one bit. As far as I can see, the idea
> > is to let a function's support function redefine the function's declared
> > argument and result types on-the-fly according to no predetermined rules,
> > and that seems to me like it's a recipe for disaster. How will anyone
> > understand which function(s) are candidates to match a query, or why one
> > particular candidate got selected over others? It's already hard enough
> > to understand the behavior of polymorphic functions in complex cases,
> > and those are much more constrained than this would be.
>
> After thinking about this a bit more, it seems like you could avoid
> a lot of problems if you restricted what the support function call
> does to be potentially replacing the result type of a function
> declared to return ANY with some more-specific type (computed from
> examination of the actual arguments). That would make it act much
> more like a traditional polymorphic function. It'd remove the issues
> about interactions among multiple potentially-matching functions,
> since we'd only call a single support function for an already-identified
> target function.

I am not sure I understand completely, so let me restate it in my own words: calculating the result type (replacing ANY with some specific type) could be OK? I am able to implement that if there is agreement.

I proposed the possibility of specifying argument types as an optimization, as protection against repeated type identification and casting (which can be done at planning time and should not be repeated). This feature should be used only for functions of the form fx("any", "any", ..) returns "any", so it is very probable that at execution time you still have to do some work on parameter type identification. But if we find agreement just on working with the return type, that is a good enough solution. The practical overhead of a type cache inside the function should not be dramatic.
> You'd still need to touch everyplace that knows about polymorphic
> type resolution, since this would essentially be another form of
> polymorphic function. And I'm still very dubious that it's worth
> the trouble. But it would be a lot more controllable than the
> proposal as it stands.

ok

> regards, tom lane
Re: proposal: type info support functions for functions that use "any" type
On Fri, Jul 26, 2019 at 10:03 PM Tom Lane wrote:
> Pavel Stehule writes:
> > On Sat, Mar 9, 2019 at 7:22 AM Pavel Stehule wrote:
> >> Tom introduced support functions for calculating a function's selectivity.
> >> Still I have a similar idea: to use a support function for calculating a
> >> function's parameter types and function return type.
> >> Motivation: reduce the necessity of overloading functions. My motivation
> >> relates primarily to Orafce, but this feature should be helpful for anybody
> >> with similar goals. Function overloading is great functionality, but it is
> >> hard to maintain.
> > here is a patch
>
> TBH, I don't like this proposal one bit. As far as I can see, the idea
> is to let a function's support function redefine the function's declared
> argument and result types on-the-fly according to no predetermined rules,
> and that seems to me like it's a recipe for disaster. How will anyone
> understand which function(s) are candidates to match a query, or why one
> particular candidate got selected over others? It's already hard enough
> to understand the behavior of polymorphic functions in complex cases,
> and those are much more constrained than this would be.

I expect that this feature would be used without overloading. But the combination of a support function and overloading can be explicitly disabled (at runtime, for a simple implementation).

> Moreover, I don't think you've even provided a compelling example
> case. What's this doing that you couldn't do with existing polymorphic
> types or the anycompatibletype proposal?

There are three usage cases:

a) combinations of polymorphic types - fx(t1, t1, t2, t1, t2, t1, t2, ...)
b) forcing types - fx(t1, t2): t1 forces an explicit cast of t2 to t1
c) optimization of repeated calls of functions like fx("any", "any", "any", ...)

It is pretty hard to create a simple non-procedural language to describe signatures like @a.
But with procedural code it is easy. @c is a special case that we can already do, but we cannot push the casting outside the function, and inside the function there is casting overhead. Implementing the type cast inside the function increases startup time, which is overhead for functions started by the plpgsql runtime.

> I also strongly suspect that this would break pieces of the system
> that expect that the stored pg_proc.prorettype has something to do
> with reality. At minimum, you'd need to fix a number of places you
> haven't touched here that have their own knowledge of function type
> resolution, such as enforce_generic_type_consistency,
> resolve_polymorphic_argtypes, resolve_aggregate_transtype. Probably
> anyplace that treats polymorphics as being any sort of special case
> would have to be taught to re-call the support function to find out
> what it should think the relevant types are.
>
> (I don't even want to think about what happens if the support function's
> behavior changes between original parsing and these re-checking spots.)

The helper function should be immutable - as far as I know, it is not possible to change data types dynamically, so repeated calls may not be efficient, but they will produce the same result, so it should not be a problem.

> Another thing that's very much less than compelling about your example
> is that your support function seems to be happy to throw errors
> if the argument types don't match what it's expecting. That seems
> quite unacceptable, since it would prevent the parser from moving on
> to consider other possibly-matching functions. Maybe that's just
> because it's a quick hack not a polished example, but it doesn't
> seem like a good precedent.

In this case that is a deliberate decision, because I don't expect overloading. I understand your objections about mixing parser helper functions and overloading.
Currently it is pretty hard to understand what the expected behavior will be when somebody overloads a function with a polymorphic function. With a parser helper function, overloading is not necessary and can be disabled.

> In short, I think the added complexity and bug potential outweigh
> any possible gain from this.
>
> regards, tom lane
Re: LLVM compile failing in seawasp
On Fri, Jun 7, 2019 at 12:13 PM Tom Lane wrote: > didier writes: > > c.h defines a C Min macro conflicting with llvm new class > > llvm:ElementCount Min member > > Really? Well, we will hardly be the only code they broke with that. > I think we can just wait for them to reconsider. FYI This is now on LLVM's release_90 branch, due out on August 28. -- Thomas Munro https://enterprisedb.com
Re: SegFault on 9.6.14
On Fri, Jul 26, 2019 at 4:13 PM Amit Kapila wrote: > On Tue, Jul 23, 2019 at 5:28 PM Amit Kapila wrote: > > Right, that will be lesser code churn and it can also work. However, > > one thing that needs some thought is till now es_top_eflags is only > > set in ExecutorStart and same is mentioned in comments where it is > > declared and it seems we are going to change that with this idea. How > > about having a separate function ExecBlahShutdown which will clean up > > resources as parallel context and can be called only from ExecutePlan > > where we are calling ExecShutdownNode? I think both these and the > > other solution we have discussed are on similar lines and another idea > > could be to relax the assert which again is not a superb idea. > > It seems we don't have a clear preference for any particular solution > among these and neither there appears to be any better idea. I guess > we can wait for a few days to see if Robert has any views on this, > otherwise, pick one of the above and move ahead. I take the EXEC_FLAG_DONE idea back. It's ugly and too hard to verify that every appropriate path sets it, and a flag that means the opposite would be even more of a kluge, and generally I think I was looking at this too myopically: I was looking for a way to shut down processes ASAP without giving up the shared memory we'll need for rescanning, but what I should have been looking at is the reason you did that in the first place: to get the instrumentation data. Can you explain why it's necessary to do that explicitly for Limit? Wouldn't the right place to collect instrumentation be at the end of execution when Shutdown will run in all cases anyway (and possibly also during ExecParallelReinitialize() or something like that if it's being clobbered by rescans, I didn't check)? What's special about Limit? 
Today while poking at this and trying to answer those questions for myself, I realised that the repro I posted earlier[1] crashes exactly as Jerry reported on REL9_6_STABLE, but in later release branches it runs to completion. That's because the crashing code was removed in commit 41b0dd98 "Separate reinitialization of shared parallel-scan state from ExecReScan.".

So newer branches get past that problem, but they all spit out tons of each of these three warnings:

WARNING: buffer refcount leak: [172] (rel=base/12558/16390, blockNum=5, flags=0x9380, refcount=1 2998)
...
WARNING: relcache reference leak: relation "join_bar" not closed
...
WARNING: Snapshot reference leak: Snapshot 0x7ff20383bfb0 still referenced
...

Oops. I don't know exactly why yet, but the problem goes away if you just comment out the offending ExecShutdownNode() call in nodeLimit.c. I tried to understand whether the buffer stats were wrong with that code commented out (Adrien Nayrat's original complaint[2]), but I ran out of time for debugging adventures today.

[1] https://www.postgresql.org/message-id/CA%2BhUKGJyqDp9FZSHLTjiNMcz-c6%3DRdStB%2BUjVZsR8wfHnJXy8Q%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/86137f17-1dfb-42f9-7421-82fd786b04a1%40anayrat.info

--
Thomas Munro
https://enterprisedb.com
Re: POC: Cleaning up orphaned files using undo logs
Hi,

On 2019-07-25 17:51:33 +1200, Thomas Munro wrote:
> 1. WAL's use of fdatasync(): The reason we fill and then fsync()
> newly created WAL files up front is because we want to make sure the
> blocks are definitely on disk. The comment doesn't spell out exactly
> why the author considered later fdatasync() calls to be insufficient,
> but they were: it was many years after commit 33cc5d8a4d0d that Linux
> ext3/4 filesystems began flushing file size changes to disk in
> fdatasync()[1][2]. I don't know if its original behaviour was
> intentional or not. So, if you didn't use the bigger fsync() hammer
> on that OS, you might lose the end of a recently extended file in a
> power failure even though fdatasync() had returned success.
>
> By my reading of POSIX, that shouldn't be necessary on a conforming
> implementation of fdatasync(), and that was fixed years ago in Linux.
> I'm not proposing any changes there, and I'm not proposing to take
> advantage of that in the new code. I'm pointing out that we
> don't have to worry about that for these undo segments, because they
> are already flushed with fsync(), not fdatasync().
>
> (To understand POSIX's descriptions of fsync() and fdatasync() you
> have to find the meanings of "Synchronized I/O Data Integrity
> Completion" and "Synchronized I/O File Integrity Completion" elsewhere
> in the spec. TL;DR: fdatasync() is only allowed to skip flushing
> attributes like the modified time, it's not allowed to skip flushing a
> file size change since that would interfere with retrieving the data.)

Note that there are very good performance reasons for trying to avoid metadata changes at e.g. commit time. They're commonly journaled at the FS level, which can add a good chunk of IO and synchronization to an operation that we commonly want to be as fast as possible. Basically you often at least double the amount of synchronous writes.
And for the potential future where we use async direct IO - writes that change the file size take considerably slower codepaths, and add a lot of synchronization. I suspect that's much more likely to be the reason for the preallocation in 33cc5d8a4d0d than avoiding an ext* bug (I doubt the bug you reference existed back then; IIUC it didn't apply to ext2, and ext3 was introduced after 33cc5d8a4d0d).

> 2. Time of reservation: Although they don't call fsync(), regular
> relations and these new undo files still write zeroes up front
> (respectively, for a new block and for a new segment). One reason for
> that is that most popular filesystems reserve space at write time, so
> you'll get ENOSPC when trying to allocate undo space, and that's a
> non-fatal ERROR. If we deferred until writing back buffer contents,
> we might get file holes, and deferred ENOSPC is much harder to report
> to users and for users to deal with.

FWIW, the hole bit I don't quite buy - we could zero the hole at that time (and not be worse than today, except that it might be done by somebody that didn't cause the extension), or even better just look up the buffers between the FS end of the relation and the block currently written, and write them out in order. The whole thing with deferred ENOSPC being harder to report to users is obviously true regardless of that.

> BTW we could probably use posix_fallocate() instead of writing zeroes;
> I think Andres mentioned that recently. I see also that someone tried
> that for WAL and it got reverted back in 2013 (commit
> b1892aaeaaf34d8d1637221fc1cbda82ac3fcd71, I didn't try to hunt down
> the discussion).

IIRC the problem from back then was that while the space is reserved at the FS level, the actual blocks don't contain zeroes at that time. Which means that:

a) Small writes need to write more, because the surrounding data also needs to be zeroed (annoying but not terrible).
b) Writes into the fallocated but not written range IIRC effectively cause metadata writes, because while the "allocated file ending" doesn't change anymore, the new "non-zero written to" file ending does need to be journaled to disk before an f[data]sync - otherwise you could end up with the old value after a crash, and would read spurious zeroes. That's quite bad.

Those don't necessarily apply to e.g. extending relations, as we don't granularly fsync them. Although even there the performance picture is mixed - it helps a lot in certain workloads, but there are others where it mildly regresses performance on ext4. Not sure why yet; possibly it's due to more heavyweight locking needed when later changing the "non-zero size", or it's the additional metadata changes. I suspect those would be mostly gone if we didn't write back blocks in random order under memory pressure.

Note that neither of those means that it's not a good idea to posix_fallocate() and *then* write zeroes when initializing. For several filesystems that's more likely to result in more optimally sized filesystem extents, reducing fragmentation.
Re: Patch for SortSupport implementation on inet/cdir
On Fri, Jul 26, 2019 at 6:58 PM Peter Geoghegan wrote: > I found this part of your approach confusing: > > > + /* > > +* Number of bits in subnet. e.g. An IPv4 that's /24 is 32 - 24 = 8. > > +* > > +* However, only some of the bits may have made it into the fixed sized > > +* datum, so take the smallest number between bits in the subnet and > > bits > > +* in the datum which are not part of the network. > > +*/ > > + datum_subnet_size = Min(ip_maxbits(authoritative) - > > ip_bits(authoritative), > > + SIZEOF_DATUM * BITS_PER_BYTE - > > ip_bits(authoritative)); > > The way that you put a Min() on the subnet size potentially constrains > the size of the bitmask used for the network component of the > abbreviated key (the component that comes immediately after the > ipfamily status bit). Why not just let the bitmask be a bitmask, > without bringing SIZEOF_DATUM into it? Doing it that way allowed for a > more streamlined approach, with significantly fewer special cases. I'm > not sure whether or not your approach had bugs, but I didn't like the > way you sometimes did a straight "network = ipaddr_datum" assignment > without masking. I guess that the idea here was to prevent masking on ipv6 addresses, though not on ipv4 addresses. Obviously we're only dealing with a prefix with ipv6 addresses, whereas we usually have the whole raw ipaddr with ipv4. Not sure if I'm doing the right thing there in v3, even though the tests pass. In any case, this will need to be a lot clearer in the final version. -- Peter Geoghegan
Re: Patch for SortSupport implementation on inet/cdir
On Fri, Feb 8, 2019 at 10:20 AM Brandur Leach wrote: > Attached a V2 patch: identical to V1 except rebased and > with a new OID selected. Attached is a revised version that I came up with, based on your v2. I found this part of your approach confusing: > + /* > +* Number of bits in subnet. e.g. An IPv4 that's /24 is 32 - 24 = 8. > +* > +* However, only some of the bits may have made it into the fixed sized > +* datum, so take the smallest number between bits in the subnet and bits > +* in the datum which are not part of the network. > +*/ > + datum_subnet_size = Min(ip_maxbits(authoritative) - > ip_bits(authoritative), > + SIZEOF_DATUM * BITS_PER_BYTE - > ip_bits(authoritative)); The way that you put a Min() on the subnet size potentially constrains the size of the bitmask used for the network component of the abbreviated key (the component that comes immediately after the ipfamily status bit). Why not just let the bitmask be a bitmask, without bringing SIZEOF_DATUM into it? Doing it that way allowed for a more streamlined approach, with significantly fewer special cases. I'm not sure whether or not your approach had bugs, but I didn't like the way you sometimes did a straight "network = ipaddr_datum" assignment without masking. I really liked your diagrams, but much of the text that went with them either seemed redundant (it described established rules about how the underlying types sort), or seemed to discuss things that were better discussed next to the relevant network_abbrev_convert() code. Thoughts? -- Peter Geoghegan v3-0001-Add-sort-support-for-inet-cidr-opfamily.patch Description: Binary data
warning on reload for PGC_POSTMASTER, guc.c duplication, ...
Hi,

When specifying a PGC_POSTMASTER variable on the command line (i.e. -c something=other), the config processing emits a spurious warning on reload about not being able to change that value. E.g. when specifying shared_buffers via -c, I get:

2019-07-26 16:28:04.795 PDT [14464][] LOG: 0: received SIGHUP, reloading configuration files
2019-07-26 16:28:04.795 PDT [14464][] LOCATION: SIGHUP_handler, postmaster.c:2629
2019-07-26 16:28:04.798 PDT [14464][] LOG: 55P02: parameter "shared_buffers" cannot be changed without restarting the server
2019-07-26 16:28:04.798 PDT [14464][] LOCATION: set_config_option, guc.c:7107
2019-07-26 16:28:04.798 PDT [14464][] LOG: F: configuration file "/srv/dev/pgdev-dev/postgresql.conf" contains errors; unaffected changes were applied
2019-07-26 16:28:04.798 PDT [14464][] LOCATION: ProcessConfigFileInternal, guc-file.l:502

ISTM that the code blocks throwing these warnings:

	if (prohibitValueChange)
	{
		if (*conf->variable != newval)
		{
			record->status |= GUC_PENDING_RESTART;
			ereport(elevel,
					(errcode(ERRCODE_CANT_CHANGE_RUNTIME_PARAM),
					 errmsg("parameter \"%s\" cannot be changed without restarting the server",
							name)));
			return 0;
		}
		record->status &= ~GUC_PENDING_RESTART;
		return -1;
	}

ought to only enter the error path if changeVal indicates that we're actually intending to apply the value. I.e. something roughly like the attached.

Two more things I noticed when looking at this code:

1) Aren't we leaking memory if prohibitValueChange is set, but newextra is present? The cleanup path for that:

	/* Perhaps we didn't install newextra anywhere */
	if (newextra && !extra_field_used(&conf->gen, newextra))
		free(newextra);

isn't reached in the prohibitValueChange path shown above. ISTM the return -1 in the prohibitValueChange path ought to be removed?

2) The amount of PGC_*-dependent code duplication in set_config_option() imo is over the top. ISTM that those branches should be merged, a call_*_check_hook wrapper should take care of invoking the check hooks, and another wrapper should take care of calling the assign hook, ->variable, and reset_val processing. Those wrappers could probably also reduce the amount of code in InitializeOneGUCOption(), parse_and_validate_value(), ResetAllOptions(), AtEOXact_GUC().

I'm also wondering whether we shouldn't just use config_var_value for at least config_*->{reset_val, boot_val}. It seems pretty clear that reset_extra ought to be moved? I'm even wondering whether the various hooks shouldn't actually just take config_var_value. But changing that would probably cause more pain to external users - in contrast to looking directly at reset_val, boot_val, reset_extra, they're much more likely to have hooks.

Greetings,

Andres Freund

diff --git i/src/backend/utils/misc/guc.c w/src/backend/utils/misc/guc.c
index fc463601ff3..0dde7d3bf40 100644
--- i/src/backend/utils/misc/guc.c
+++ w/src/backend/utils/misc/guc.c
@@ -7008,7 +7008,7 @@ set_config_option(const char *name, const char *value,
 
 		if (prohibitValueChange)
 		{
-			if (*conf->variable != newval)
+			if (changeVal && *conf->variable != newval)
 			{
 				record->status |= GUC_PENDING_RESTART;
 				ereport(elevel,
@@ -7098,7 +7098,7 @@ set_config_option(const char *name, const char *value,
 
 		if (prohibitValueChange)
 		{
-			if (*conf->variable != newval)
+			if (changeVal && *conf->variable != newval)
 			{
 				record->status |= GUC_PENDING_RESTART;
 				ereport(elevel,
@@ -7188,7 +7188,7 @@ set_config_option(const char *name, const char *value,
 
 		if (prohibitValueChange)
 		{
-			if (*conf->variable != newval)
+			if (changeVal && *conf->variable != newval)
 			{
 				record->status |= GUC_PENDING_RESTART;
 				ereport(elevel,
@@ -7295,8 +7295,9 @@ set_config_option(const char *name, const char *value,
 
 		if (prohibitValueChange)
 		{
 			/* newval shouldn't be NULL, so we're a bit sloppy here */
-			if (*conf->variable == NULL || newval == NULL ||
-				strcmp(*conf->variable, newval) != 0)
+			if (changeVal && (*conf->variable == NULL ||
+							  newval == NULL ||
+							  strcmp(*conf->variable, newval) != 0))
 			{
 				record->status |= GUC_PENDING_RESTART;
 				ereport(elevel,
@@ -7391,7 +7392,7 @@ set_config_option(const char *name, const char *value,
 
 		if (prohibitValueChange)
 		{
-			if (*conf->variable != newval)
+			if (changeVal && *conf->variable != newval)
 			{
 				record->status |= GUC_PENDING_RESTART;
 				ereport(elevel,
Re: Duplicated LSN in ReorderBuffer
Hi,

Petr, Simon, see the potential issue related to fast forward at the bottom.

On 2019-07-26 18:46:35 -0400, Alvaro Herrera wrote:
> On 2019-Jul-09, Masahiko Sawada wrote:
> > I think the cause of this bug would be that a ReorderBufferTXN entry
> > of sub transaction is created as top-level transaction. And this
> > happens because we skip to decode ASSIGNMENT during the state of
> > snapshot builder < SNAPBUILD_FULL.
>
> That explanation seems to make sense.

Yea. The comment that "in the assignment case we'll not decode those xacts" is true, but it misses that we *do* currently process XLOG_HEAP2_NEW_CID records for transactions that started before reaching FULL_SNAPSHOT.

Thinking about it, it was not immediately clear to me that it is necessary to process XLOG_HEAP2_NEW_CID at that stage. We only need the cid mapping when decoding the contents of the transaction that the XLOG_HEAP2_NEW_CID record was about - which will not happen if it started before SNAPBUILD_FULL. Except that these records are also responsible for signalling that a transaction performed catalog modifications (cf the ReorderBufferXidSetCatalogChanges() call), which in turn is important for SnapBuildCommitTxn() to know whether that transaction needs to be included in historic snapshots.

So unless I am missing something - which is entirely possible, I've had this code thoroughly swapped out - that means that we only need to process XLOG_HEAP2_NEW_CID before SNAPBUILD_FULL if there can be transactions with relevant catalog changes that don't have any invalidation messages. After thinking about it for a bit, that's not guaranteed, however. For one, even for system catalog tables, looking at CacheInvalidateHeapTuple() et al there can be catalog modifications that create neither a snapshot invalidation message nor a catcache one. There's also the remote scenario that we might be modifying only a toast relation. But more importantly, the only modified table could be a user-defined catalog table (i.e.
WITH (user_catalog_table = true)), which in all likelihood won't cause invalidation messages.

So I think it is currently required to process NEW_CID records - although we don't need to actually execute the ReorderBufferAddNewTupleCids() etc calls. Perhaps the right fix for the future would actually be to not rely on NEW_CID for recognizing such transactions, but instead have an xact.c marker that signals whether a transaction performed catalog modifications. Hm, need to think more about this.

> > Instead, I wonder if we can decode ASSIGNMENT even when the state of
> > snapshot builder < SNAPBUILD_FULL_SNAPSHOT. That way, the
> > ReorderBufferTXN entries of both top transaction and sub transaction
> > are created properly before we decode NEW_CID.
>
> Yeah, that seems a sensible remediation to me.

That does seem like a reasonable approach. I can see two alternatives:

1) Change SnapBuildProcessNewCid()'s ReorderBufferXidSetCatalogChanges() call to reference the toplevel xid. That has the disadvantage that if the subtransaction that performed DDL rolls back, the main transaction will still be treated as a catalog transaction - I have a hard time seeing that being common, however. That'd then also require SnapBuildProcessNewCid() in SNAPBUILD_FULL_SNAPSHOT to return before processing any data assigned to subtransactions. Which would be good, because that's currently unnecessarily stored in memory.

2) We could simply assign the subtransaction to the parent using ReorderBufferAssignChild() in SnapBuildProcessNewCid() or its caller. That ought to also fix the bug. It also has the advantage that we can save some memory in transactions that have some, but fewer than the ASSIGNMENT limit, subtransactions, because it allows us to avoid having a separate base snapshot for them (c.f. ReorderBufferTransferSnapToParent()).
Like 1), that could be combined with adding an early return when < SNAPBUILD_FULL_SNAPSHOT, after ReorderBufferXidSetCatalogChanges(), but I don't think it'd be required for correctness, in contrast to 1).

Both of these would have the advantage that we would only track additional information for transactions that have modified the catalog, whereas the proposal to process ASSIGNMENT earlier would mean that we additionally track all transactions with more than 64 children.

So provided that I didn't mis-analyze here, I think both of my alternatives are preferable? I think 2) is simpler?

> /*
> - * No point in doing anything yet, data could not be decoded anyway. It's
> - * ok not to call ReorderBufferProcessXid() in that case, except in the
> - * assignment case there'll not be any later records with the same xid;
> - * and in the assignment case we'll not decode those xacts.
> + * If the snapshot isn't yet fully built, we cannot decode anything, so
> + * bail out.
> + *
> + * However, it's critical to process
RE: [PATCH] Fix Proposal - Deadlock Issue in Single User Mode When IO Failure Occurs
Hi Kyotaro,

Thank you so much for your valued feedback and suggestions!

> I assume that we are in a consensus about the problem we are to fix here.
>
> > 0a 0004`8080cc30 0004`80dcf917 postgres!PGSemaphoreLock+0x65 [d:\orcasqlagsea10\14\s\src\backend\port\win32_sema.c @ 158]
> > 0b 0004`8080cc90 0004`80db025c postgres!LWLockAcquire+0x137 [d:\orcasqlagsea10\14\s\src\backend\storage\lmgr\lwlock.c @ 1234]
> > 0c 0004`8080ccd0 0004`80db25db postgres!AbortBufferIO+0x2c [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 3995]
> > 0d 0004`8080cd20 0004`80dbce36 postgres!AtProcExit_Buffers+0xb [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 2479]
> > 0e 0004`8080cd50 0004`80dbd1bd postgres!shmem_exit+0xf6 [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 262]
> > 0f 0004`8080cd80 0004`80dbccfd postgres!proc_exit_prepare+0x4d [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 188]
> > 10 0004`8080cdb0 0004`80ef9e74 postgres!proc_exit+0xd [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 141]
> > 11 0004`8080cde0 0004`80ddb6ef postgres!errfinish+0x204 [d:\orcasqlagsea10\14\s\src\backend\utils\error\elog.c @ 624]
> > 12 0004`8080ce50 0004`80db0f59 postgres!mdread+0x12f [d:\orcasqlagsea10\14\s\src\backend\storage\smgr\md.c @ 806]

Yes, this is one of the two situations we want to fix. The other situation is a cascading exception case like the following.
#0 0x7f0fdb7cb6d6 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f0fd14c81b8) at ../sysdeps/unix/sysv/linux/futex-internal.h:205 #1 do_futex_wait (sem=sem@entry=0x7f0fd14c81b8, abstime=0x0) at sem_waitcommon.c:111 #2 0x7f0fdb7cb7c8 in __new_sem_wait_slow (sem=0x7f0fd14c81b8, abstime=0x0) at sem_waitcommon.c:181 #3 0x5630d475658a in PGSemaphoreLock (sema=0x7f0fd14c81b8) at pg_sema.c:316 #4 0x5630d47f689e in LWLockAcquire (lock=0x7f0fd9ae9c00, mode=LW_EXCLUSIVE) at /path/to/postgres/source/build/../src/backend/storage/lmgr/lwlock.c:1243 #5 0x5630d47cd087 in AbortBufferIO () at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:3988 #6 0x5630d47cb3f9 in AtProcExit_Buffers (code=1, arg=0) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2473 #7 0x5630d47dbc32 in shmem_exit (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:272 #8 0x5630d47dba5e in proc_exit_prepare (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:194 #9 0x5630d47db9c6 in proc_exit (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:107 #10 0x5630d49811bc in errfinish (dummy=0) at /path/to/postgres/source/build/../src/backend/utils/error/elog.c:541 #11 0x5630d4801f1f in mdwrite (reln=0x5630d6588c68, forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7f0fd1ae9c00 "", skipFsync=false) at /path/to/postgres/source/build/../src/backend/storage/smgr/md.c:843 #12 0x5630d4804716 in smgrwrite (reln=0x5630d6588c68, forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7f0fd1ae9c00 "", skipFsync=false) at /path/to/postgres/source/build/../src/backend/storage/smgr/smgr.c:650 #13 0x5630d47cb824 in FlushBuffer (buf=0x7f0fd19e9c00, reln=0x5630d6588c68) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2751 #14 0x5630d47cb219 in SyncOneBuffer (buf_id=0, skip_recently_used=false, wb_context=0x7ffccc371a70) at
/path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2394 #15 0x5630d47cab00 in BufferSync (flags=6) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:1984 #16 0x5630d47cb57f in CheckPointBuffers (flags=6) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2578 #17 0x5630d44a685b in CheckPointGuts (checkPointRedo=23612304, flags=6) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:9149 #18 0x5630d44a62cf in CreateCheckPoint (flags=6) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:8937 #19 0x5630d44a45e3 in StartupXLOG () at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:7723 #20 0x5630d4995f88 in InitPostgres (in_dbname=0x5630d65582b0 "postgres", dboid=0, username=0x5630d653d7d0 "chengyu", useroid=0, out_dbname=0x0, override_allow_connections=false) at /path/to/postgres/source/build/../src/backend/utils/init/postinit.c:636 #21 0x5630d480b68b in PostgresMain (argc=7, argv=0x5630d6534d20, dbname=0x5630d65582b0 "postgres", username=0x5630d653d7d0 "chengyu") at /path/to/postgres/source/build/../src/backend/tcop/postgres.c:3810 #22 0x5630d4695e8b in main (argc=7, argv=0x5630d6534d20) at /path/to/postgres/source/build/../src/backend/main/main.c:224 Though ENOSPC is avoided by reservation in PG,
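Both traces end the same way: an error raised while the buffer I/O lock is still held escalates to proc_exit(), and the AtProcExit_Buffers -> AbortBufferIO cleanup then tries to re-acquire that very lock. Since LWLocks are not re-entrant and in single user mode there is no other process to release the lock, the process sleeps on the semaphore forever. A minimal sketch of that failure pattern, using an invented toy lock type rather than the real LWLock code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a PostgreSQL LWLock in a single-process (single user mode)
 * server: it is not re-entrant, and no other process exists to release it. */
typedef struct { bool held; } ToyLWLock;

typedef enum { ACQ_OK, ACQ_SELF_DEADLOCK } AcquireResult;

/* Acquiring a lock we already hold can never succeed in single user mode:
 * nobody else will ever release it, so the real LWLockAcquire would sleep
 * on its semaphore forever, exactly as in the reported stacks. */
static AcquireResult toy_acquire(ToyLWLock *lk)
{
    if (lk->held)
        return ACQ_SELF_DEADLOCK;
    lk->held = true;
    return ACQ_OK;
}

/* StartBufferIO analogue: the I/O lock is taken, then the md.c read/write
 * fails and the ERROR escalates to proc_exit() with the lock still held. */
static void start_io_then_error(ToyLWLock *io_lock)
{
    assert(toy_acquire(io_lock) == ACQ_OK);
}

/* AbortBufferIO analogue, run from AtProcExit_Buffers during exit: it
 * re-acquires the same lock, which is where the process hangs. */
static AcquireResult abort_buffer_io(ToyLWLock *io_lock)
{
    return toy_acquire(io_lock);
}
```

This is only an illustration of the deadlock condition the fix has to break, not the proposed patch itself.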
Re: TopoSort() fix
Andres Freund writes: > On 2019-07-26 18:05:38 -0400, Alvaro Herrera wrote: >> Hello, is anybody looking into this issue? > I guess this is on Robert's docket otherwise. He's on vacation till > early next week... I think this is a sufficiently obvious bug, and a sufficiently obvious fix, that we should just fix it and not insist on getting a reproducible test case first. I think a test case would soon bit-rot anyway, and no longer exercise the problem. I certainly do *not* want to wait so long that we miss the upcoming minor releases. regards, tom lane
Re: proposal: make NOTIFY list de-duplication optional
Filip Rembiałkowski writes: > Still no hash table fallback is implemented, so this is *not* a > performance improvement. Only a little more flexibility. I think that we'd probably be better off fixing the root performance issue than adding semantic complexity to bypass it. Especially since, if you don't de-duplicate, that's going to cost you when it comes time to actually send the notifications, receive them, forward them to clients, and process them in the clients. Admittedly, if the app *knows* that it's generating non-duplicate events, maybe it'd be all right to skip checking that. But if we can make the check cheap, I'd just as soon keep it. Accordingly, I looked into making a hash table when there are more than a small number of notifications pending, and attached is a lightly-tested version of that. This seems to be more or less similar speed to the existing code for up to 100 or so distinct notifies, but it soon pulls away above that. A point that needs discussion is that this patch, unlike the existing code, *does* de-duplicate fully: any events generated by a subtransaction that duplicate events already emitted by a parent will get removed when the subxact is merged to its parent. I did this partly because we have to expend O(N) work to merge N subtransaction notifies in any case, now that we have to make new hashtable entries in the parent xact; so the old excuse that subxact-end processing is really cheap no longer applies. Also because the Assert(!found) assertions in the attached hash coding fall over if we cheat on this. If we really want to maintain exactly the old semantics here, we could relax the hashtable code to just ignore duplicate entries. But, per the argument above, de-duplication is a good thing so I'm inclined to keep it like this. I also noticed that as things stand, it costs us two or three pallocs to construct a Notification event.
It wouldn't be terribly hard to reduce that to one palloc, and maybe it'd be worthwhile if we're thinking that transactions with many many notifies are a case worth optimizing. But I didn't do that here; it seems like a separable patch. I also thought for awhile about not having the hashtable as an auxiliary data structure, but making it the main data structure. We could preserve the required notification ordering information by threading the live hashtable entries into an slist, say. However, this would greatly increase the overhead for transactions with just one or a few distinct NOTIFY events, since we'd have to set up the hashtable at the first one. I think that's a common enough case that we shouldn't de-optimize it. A smaller objection is that such a data structure would absolutely commit us to de-duplication semantics, whereas the list plus separate hashtable can cope with not de-duping if someone persuades us that's sane. Thoughts? regards, tom lane diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c index 6e9c580..c21daa5 100644 --- a/src/backend/commands/async.c +++ b/src/backend/commands/async.c @@ -135,6 +135,7 @@ #include "storage/sinval.h" #include "tcop/tcopprot.h" #include "utils/builtins.h" +#include "utils/hashutils.h" #include "utils/memutils.h" #include "utils/ps_status.h" #include "utils/snapmgr.h" @@ -323,14 +324,25 @@ static List *upperPendingActions = NIL; /* list of upper-xact lists */ /* * State for outbound notifies consists of a list of all channels+payloads - * NOTIFYed in the current transaction. We do not actually perform a NOTIFY - * until and unless the transaction commits. pendingNotifies is NIL if no - * NOTIFYs have been done in the current transaction. + * NOTIFYed in the current transaction. We do not actually perform a NOTIFY + * until and unless the transaction commits. pendingNotifies is NULL if no + * NOTIFYs have been done in the current (sub) transaction. 
+ * + * We discard duplicate notify events issued in the same transaction. + * Hence, in addition to the list proper (which we need to track the order + * of the events, since we guarantee to deliver them in order), we build a + * hash table which we can probe to detect duplicates. Since building the + * hash table is somewhat expensive, we do so only once we have at least + * MIN_HASHABLE_NOTIFIES events queued in the current (sub) transaction; + * before that we just scan the events linearly. * * The list is kept in CurTransactionContext. In subtransactions, each * subtransaction has its own list in its own CurTransactionContext, but - * successful subtransactions attach their lists to their parent's list. - * Failed subtransactions simply discard their lists. + * successful subtransactions add their entries to their parent's list. + * Failed subtransactions simply discard their lists. Since these lists + * are independent, there may be notify events in a subtransaction's list + * that duplicate events in some ancestor (sub) transaction; we get rid
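The scheme described above (keep the ordered pending list, and probe a hash table for duplicates only once the list grows past a threshold) can be sketched roughly as follows. This is a toy model for illustration, not the async.c patch: the names, the demo threshold of 4, and the fixed-size open-addressing table are all invented here.

```c
#include <assert.h>
#include <string.h>

#define MAX_EVENTS 64
#define HTAB_SIZE 256               /* power of two, open addressing */
#define MIN_HASHABLE_NOTIFIES 4     /* demo threshold; the real patch uses a larger one */

static const char *events[MAX_EVENTS];  /* pending list, preserves delivery order */
static int  n_events = 0;
static const char *htab[HTAB_SIZE];
static int  hashed = 0;             /* has the hash table been built yet? */

/* djb2 string hash */
static unsigned hash_str(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h;
}

/* Insert into the hash table; returns 1 if the event was already there. */
static int htab_insert(const char *ev)
{
    unsigned i = hash_str(ev) & (HTAB_SIZE - 1);

    while (htab[i])
    {
        if (strcmp(htab[i], ev) == 0)
            return 1;               /* duplicate */
        i = (i + 1) & (HTAB_SIZE - 1);
    }
    htab[i] = ev;
    return 0;
}

/* Queue one notify event, discarding duplicates.  Small transactions pay
 * only for a linear scan of the list; once the list reaches the threshold
 * we build the hash table from the existing entries and probe it from
 * then on. */
static void queue_notify(const char *ev)
{
    if (!hashed && n_events >= MIN_HASHABLE_NOTIFIES)
    {
        for (int i = 0; i < n_events; i++)
            htab_insert(events[i]);
        hashed = 1;
    }

    if (hashed)
    {
        if (htab_insert(ev))
            return;                 /* duplicate, drop it */
    }
    else
    {
        for (int i = 0; i < n_events; i++)
            if (strcmp(events[i], ev) == 0)
                return;             /* duplicate, drop it */
    }
    events[n_events++] = ev;        /* list keeps delivery order */
}
```

The point of the threshold is the one Tom makes about not de-optimizing transactions with only a handful of NOTIFYs: they never pay the cost of building the table at all.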
Re: TopoSort() fix
Hi, On 2019-07-26 18:05:38 -0400, Alvaro Herrera wrote: > On 2019-Jul-04, Rui Hai Jiang wrote: > > > I'll try to figure out some scenarios to do the test. A parallel process > > group is needed for the test. Rui, have you made any progress on this? > > Actually I was trying to do some testing against the locking mechanism. I > > happened to see this issue. > > Hello, is anybody looking into this issue? I guess this is on Robert's docket otherwise. He's on vacation till early next week... Greetings, Andres Freund
Re: Duplicated LSN in ReorderBuffer
On 2019-Jul-09, Masahiko Sawada wrote: > I think the cause of this bug would be that a ReorderBufferTXN entry > of sub transaction is created as top-level transaction. And this > happens because we skip to decode ASSIGNMENT during the state of > snapshot builder < SNAPBUILD_FULL. That explanation seems to make sense. > Instead, I wonder if we can decode ASSIGNMENT even when the state of > snapshot builder < SNAPBUILD_FULL_SNAPSHOT. That way, the > ReorderBufferTXN entries of both top transaction and sub transaction > are created properly before we decode NEW_CID. Yeah, that seems a sensible remediation to me. I would reduce the scope a little bit -- only create the assignment in the BUILDING state, and skip it in the START state. I'm not sure that it's possible to get assignments while in START state that are significant (I'm still trying to digest SnapBuildFindSnapshot). I would propose the attached. Andres, do you have an opinion on this? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 0d37a9a64bc20414742a8e9497dc8dffeeea16d1 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 26 Jul 2019 18:38:44 -0400 Subject: [PATCH v3] decode XACT_ASSIGNMENT while building snapshot --- src/backend/replication/logical/decode.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index 151c3ef882..a6f7dc2723 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -215,14 +215,19 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) ReorderBuffer *reorder = ctx->reorder; XLogReaderState *r = buf->record; uint8 info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK; + SnapBuildState state; /* - * No point in doing anything yet, data could not be decoded anyway.
It's - * ok not to call ReorderBufferProcessXid() in that case, except in the - * assignment case there'll not be any later records with the same xid; - * and in the assignment case we'll not decode those xacts. + * If the snapshot isn't yet fully built, we cannot decode anything, so + * bail out. + * + * However, it's critical to process XLOG_XACT_ASSIGNMENT records even + * when the snapshot is being built: it is possible to get later records + * that require subxids to be properly assigned. */ - if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) + state = SnapBuildCurrentState(builder); + if (state < SNAPBUILD_FULL_SNAPSHOT || + (info == XLOG_XACT_ASSIGNMENT && state < SNAPBUILD_BUILDING_SNAPSHOT)) return; switch (info) -- 2.17.1
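To see why skipping ASSIGNMENT bites: one way the reorder buffer learns that a subtransaction XID belongs to a top-level XID is the XLOG_XACT_ASSIGNMENT record, and if that record is not decoded, a later record carrying the subxid gets a ReorderBufferTXN created for it as if it were a top-level transaction, which matches Sawada-san's diagnosis. A toy sketch of that bookkeeping (all names invented, not the ReorderBuffer code):

```c
#include <assert.h>

#define MAX_XID 1024

/* Toy assignment map: 0 means "no ASSIGNMENT seen for this xid", in which
 * case the xid is treated as a top-level transaction of its own -- the
 * mistaken state the bug report describes. */
static unsigned assigned_top[MAX_XID];

/* XLOG_XACT_ASSIGNMENT handling reduced to its essence: remember which
 * top-level xact owns the subxid. */
static void process_assignment(unsigned topxid, unsigned subxid)
{
    assigned_top[subxid] = topxid;
}

/* Which transaction a later record stamped with this xid belongs to. */
static unsigned owning_xact(unsigned xid)
{
    return assigned_top[xid] ? assigned_top[xid] : xid;
}
```

The patch's goal is simply to make sure the first call happens (while the snapshot is still being built) before any record forces the second one.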
Re: Attached partition not considering altered column properties of root partition.
On 2019-Jul-03, Amit Langote wrote: > Thanks for the report. This seems like a bug. Documentation claims > that the child tables inherit column storage options from the parent > table. That's actually enforced in only some cases. > To fix this, MergeAttributesIntoExisting() should check that the > attribute options of a child don't conflict with the parent, which the > attached patch implements. Note that partitioning uses the same code > as inheritance, so the fix applies to it too. After the patch: Thanks for the patch! I'm not completely sold on back-patching this. Should we consider changing it in 12beta and up only, or should we just backpatch it all the way back to 9.4? It's going to raise errors in cases that previously worked. On the patch itself: I think ERRCODE_DATATYPE_MISMATCH is not the correct one to use for this; maybe ERRCODE_INVALID_OBJECT_DEFINITION or, more likely, ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE. "FOO versus BAR" does not sound proper style. Maybe "Child table has FOO, parent table expects BAR." Or maybe put it all in errmsg, errmsg("child table \"%s\" has storage option \"%s\" for column \"%s\" mismatching \"%s\" on parent", -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: TopoSort() fix
On 2019-Jul-04, Rui Hai Jiang wrote: > I'll try to figure out some scenarios to do the test. A parallel process > group is needed for the test. > > Actually I was trying to do some testing against the locking mechanism. I > happened to see this issue. Hello, is anybody looking into this issue? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Multivariate MCV list vs. statistics target
On Fri, Jul 26, 2019 at 07:03:41AM +, Jamison, Kirk wrote: On Sat, July 20, 2019 8:12 AM (GMT+9), Tomas Vondra wrote: >+ /* XXX What if the target is set to 0? Reset the statistic? */ > >This also makes me wonder. I haven't looked deeply into the code, but >since 0 is a valid value, I believe it should reset the stats. I agree resetting the stats after setting the target to 0 seems quite reasonable. But that's not what we do for attribute stats, because in that case we simply skip the attribute during the future ANALYZE runs - we don't reset the stats, we keep the existing stats. So I've done the same thing here, and I've removed the XXX comment. If we want to change that, I'd do it in a separate patch for both the regular and extended stats. Hi, Tomas Sorry for my late reply. You're right. I have no strong opinion whether we'd want to change that behavior. I've also confirmed the change in the patch where setting statistics target 0 skips the statistics. OK, thanks. Maybe only some minor nitpicks in the source code comments below: 1. "it's" should be "its": +* Compute statistic target, based on what's set for the statistic +* object itself, and for it's attributes. 2. Consistency whether you'd use either "statistic " or "statisticS ". Ex. statistic target vs statisticS target, statistics object vs statistic object, etc. Attached is v4 of the patch, with a couple more improvements: 1) I've renamed the if_not_exists flag to missing_ok, because that's more consistent with the "IF EXISTS" clause in the grammar (the old flag was kinda the exact opposite), and I've added a NOTICE about the skip. + bool missing_ok; /* do nothing if statistics does not exist */ Confirmed. So we ignore if statistic does not exist, and skip the error. Maybe to make it consistent with other data structures in parsernodes.h, you can change the comment of missing_ok to: /* skip error if statistics object does not exist */ Thanks, I've fixed all those places in the attached v5.
2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows, because that's what the function was doing anyway (computing sample size). 3) I've added a couple of regression tests to stats_ext.sql Aside from that, I've cleaned up a couple of places and improved a bunch of comments. Nothing huge. I have a question though regarding ComputeExtStatisticsRows. I'm just curious with the value 300 when computing sample size. Where did this value come from? + /* compute sample size based on the statistic target */ + return (300 * result); Overall, the patch is almost already in good shape for commit. I'll wait for the next update. That's how we compute number of rows to sample, based on the statistics target. See std_typanalyze() in analyze.c, which also cites the paper where this comes from. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/doc/src/sgml/ref/alter_statistics.sgml b/doc/src/sgml/ref/alter_statistics.sgml index 58c7ed020d..be4c3f1f05 100644 --- a/doc/src/sgml/ref/alter_statistics.sgml +++ b/doc/src/sgml/ref/alter_statistics.sgml @@ -26,6 +26,7 @@ PostgreSQL documentation ALTER STATISTICS name OWNER TO { new_owner | CURRENT_USER | SESSION_USER } ALTER STATISTICS name RENAME TO new_name ALTER STATISTICS name SET SCHEMA new_schema +ALTER STATISTICS name SET STATISTICS new_target @@ -93,6 +94,22 @@ ALTER STATISTICS name SET SCHEMA + + new_target + + +The statistic-gathering target for this statistics object for subsequent + ANALYZE operations. +The target can be set in the range 0 to 10000; alternatively, set it +to -1 to revert to using the system default statistics +target (). +For more information on the use of statistics by the +PostgreSQL query planner, refer to +.
+ + + + diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 8d633f2585..c27df32d92 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -307,7 +307,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params, VacAttrStats **vacattrstats; AnlIndexData *indexdata; int targrows, - numrows; + numrows, + minrows; double totalrows, totaldeadrows; HeapTuple *rows; @@ -491,6 +492,16 @@ do_analyze_rel(Relation onerel, VacuumParams *params, } } + /* +* Look at extended statistics objects too, as those may define custom +* statistics target. So we may need to sample more rows and then build +* the statistics with enough detail. +*/ +
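To make Tomas's answer concrete: the sample size is simply 300 times the statistics target, the same rule std_typanalyze() applies for per-attribute stats (so the default target of 100 samples 30000 rows). A trivial restatement, with a helper name of my own choosing:

```c
#include <assert.h>

/* Sample-size rule cited from std_typanalyze() in analyze.c: sample
 * 300 * target rows.  The constant 300 comes from the histogram-error
 * bound in the paper that function's comments cite. */
static int stats_sample_rows(int stats_target)
{
    return 300 * stats_target;
}
```

So raising the target on an extended statistics object linearly raises the number of rows ANALYZE must read for it.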
Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )
On 2019-Jul-25, Michael Paquier wrote: > On Wed, Jul 24, 2019 at 11:23:30AM -0400, Alvaro Herrera wrote: > > Heh, yesterday I revised the original patch as attached and was about to > > push when the bell rang. I like this one because it keeps the comment > > to one line and it mentions the function name in charge of the > > validation (useful for grepping later on). It's a bit laconic because > > of the long function name and the desire to keep it to one line, but it > > seems sufficient to me. > > Looks fine to me. A nit: addition of braces for the if block. Even > if that's a one-liner, there is a comment so I think that this makes the > code more readable. Yeah, I had removed those on purpose, but that was probably inconsistent with my own reviews of others' patches. I pushed it with them. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: double free in ExecHashJoin, 9.6.12
On Wed, Jul 24, 2019 at 11:01 PM Thomas Munro wrote: > > On Thu, Jul 25, 2019 at 2:39 AM Merlin Moncure wrote: > > Server is generally running pretty well, and is high volume. This > > query is not new and is also medium volume. Database rebooted in > > about 4 seconds with no damage; fast enough we didn't even trip alarms > > (I noticed this troubleshooting another issue). We are a couple of > > bug fixes releases behind but I didn't see anything obvious in the > > release notes suggesting a resolved issue. Anyone have any ideas? > > thanks in advance. > > > postgres: rms ysconfig 10.33.190.21(36788) > > SELECT(ExecHashJoin+0x5a2)[0x5e2d32] > > Hi Merlin, > > Where's the binary from (exact package name, if installed with a > package)? Not sure if this is going to help, but is there any chance > you could disassemble that function so we can try to see what it's > doing at that offset? For example on Debian if you have > postgresql-9.6 and postgresql-9.6-dbg installed you could run "gdb > /usr/lib/postgresql/9.6/bin/postgres" and then "disassemble > ExecHashJoin". The code at "<+1442>" (0x5a2) is presumably calling > free or some other libc thing (though I'm surprised not to see an > intervening palloc thing). Thanks -- great suggestion. I'll report back with any interesting findings. merlin
Re: proposal: type info support functions for functions that use "any" type
I wrote: > TBH, I don't like this proposal one bit. As far as I can see, the idea > is to let a function's support function redefine the function's declared > argument and result types on-the-fly according to no predetermined rules, > and that seems to me like it's a recipe for disaster. How will anyone > understand which function(s) are candidates to match a query, or why one > particular candidate got selected over others? It's already hard enough > to understand the behavior of polymorphic functions in complex cases, > and those are much more constrained than this would be. After thinking about this a bit more, it seems like you could avoid a lot of problems if you restricted what the support function call does to be potentially replacing the result type of a function declared to return ANY with some more-specific type (computed from examination of the actual arguments). That would make it act much more like a traditional polymorphic function. It'd remove the issues about interactions among multiple potentially-matching functions, since we'd only call a single support function for an already-identified target function. You'd still need to touch everyplace that knows about polymorphic type resolution, since this would essentially be another form of polymorphic function. And I'm still very dubious that it's worth the trouble. But it would be a lot more controllable than the proposal as it stands. regards, tom lane
Re: pgbench - allow to create partitioned tables
Attached v3 fixes strcasecmp non-portability on Windows, per postgresql patch tester. -- Fabien. diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml index 816f9cc4c7..3e8e292e39 100644 --- a/doc/src/sgml/ref/pgbench.sgml +++ b/doc/src/sgml/ref/pgbench.sgml @@ -306,6 +306,32 @@ pgbench options d + + --partitions=NUM + + +Create a partitioned pgbench_accounts table with +NUM partitions of nearly equal size for +the scaled number of accounts. +Default is 0, meaning no partitioning. + + + + + + --partition-method=NAME + + +Create a partitioned pgbench_accounts table with +NAME method. +Expected values are range or hash. +This option is only taken into account if +--partitions is non-zero. +Default is range. + + + + --tablespace=tablespace diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 570cf3306a..6fa8ed7f81 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -186,6 +186,11 @@ int64 latency_limit = 0; char *tablespace = NULL; char *index_tablespace = NULL; +/* partitioning for pgbench_accounts table, 0 for no partitioning */ +int partitions = 0; +enum { PART_RANGE, PART_HASH } + partition_method = PART_RANGE; + /* random seed used to initialize base_random_sequence */ int64 random_seed = -1; @@ -617,6 +622,9 @@ usage(void) " --foreign-keys create foreign key constraints between tables\n" " --index-tablespace=TABLESPACE\n" " create indexes in the specified tablespace\n" + " --partitions=NUM partition account table in NUM parts (defaults: 0)\n" + " --partition-method=(range|hash)\n" + " partition account table with this method (default: range)\n" " --tablespace=TABLESPACE create tables in the specified tablespace\n" " --unlogged-tables create tables as unlogged tables\n" "\nOptions to select what to run:\n" @@ -3601,6 +3609,17 @@ initDropTables(PGconn *con) "pgbench_tellers"); } +/* + * add fillfactor percent option if not 100.
+ */ +static void +append_fillfactor(char *opts, int len) +{ + if (fillfactor < 100) + snprintf(opts + strlen(opts), len - strlen(opts), + " with (fillfactor=%d)", fillfactor); +} + /* * Create pgbench's standard tables */ @@ -3625,6 +3644,7 @@ initCreateTables(PGconn *con) const char *bigcols; /* column decls if accountIDs are 64 bits */ int declare_fillfactor; }; + static const struct ddlinfo DDLs[] = { { "pgbench_history", @@ -3651,11 +3671,10 @@ initCreateTables(PGconn *con) 1 } }; - int i; fprintf(stderr, "creating tables...\n"); - for (i = 0; i < lengthof(DDLs); i++) + for (int i = 0; i < lengthof(DDLs); i++) { char opts[256]; char buffer[256]; @@ -3664,9 +3683,17 @@ initCreateTables(PGconn *con) /* Construct new create table statement. */ opts[0] = '\0'; - if (ddl->declare_fillfactor) + + /* Partition pgbench_accounts table */ + if (partitions >= 1 && strcmp(ddl->table, "pgbench_accounts") == 0) + { snprintf(opts + strlen(opts), sizeof(opts) - strlen(opts), - " with (fillfactor=%d)", fillfactor); + " partition by %s (aid)", + partition_method == PART_RANGE ? 
"range" : "hash"); + } + else if (ddl->declare_fillfactor) + append_fillfactor(opts, sizeof(opts)); + if (tablespace != NULL) { char *escape_tablespace; @@ -3686,6 +3713,54 @@ initCreateTables(PGconn *con) executeStatement(con, buffer); } + + if (partitions >= 1) + { + int64 part_size = (naccounts * (int64) scale + partitions - 1) / partitions; + char ff[64]; + ff[0] = '\0'; + append_fillfactor(ff, sizeof(ff)); + + fprintf(stderr, "creating %d partitions...\n", partitions); + + for (int p = 1; p <= partitions; p++) + { + char query[256]; + + if (partition_method == PART_RANGE) + { +char minvalue[32], maxvalue[32]; + +if (p == 1) + sprintf(minvalue, "MINVALUE"); +else + sprintf(minvalue, INT64_FORMAT, (p-1) * part_size + 1); + +if (p < partitions) + sprintf(maxvalue, INT64_FORMAT, p * part_size + 1); +else + sprintf(maxvalue, "MAXVALUE"); + +snprintf(query, sizeof(query), + "create%s table pgbench_accounts_%d\n" + " partition of pgbench_accounts\n" + " for values from (%s) to (%s)%s\n", + unlogged_tables ? " unlogged" : "", p, + minvalue, maxvalue, ff); + } + else if (partition_method == PART_HASH) +snprintf(query, sizeof(query), + "create%s table pgbench_accounts_%d\n" + " partition of pgbench_accounts\n" + " for values with (modulus %d, remainder %d)%s\n", + unlogged_tables ? " unlogged" : "", p, + partitions, p-1, ff); + else /* cannot get there */ +
Re: Built-in connection pooler
Hi Konstantin, I've started reviewing this patch and experimenting with it, so let me share some initial thoughts. 1) not handling session state (yet) I understand handling session state would mean additional complexity, so I'm OK with not having it in v1. That being said, I think this is the primary issue with connection pooling on PostgreSQL - configuring and running a separate pool is not free, of course, but when people complain to us it's when they can't actually use a connection pool because of this limitation. So what are your plans regarding this feature? I think you mentioned you already have the code in another product. Do you plan to submit it in the pg13 cycle, or what's the plan? I'm willing to put some effort into reviewing and testing that. FWIW it'd be nice to expose it as some sort of interface, so that other connection pools can leverage it too. There are use cases that don't work with a built-in connection pool (say, PAUSE/RESUME in pgbouncer allows restarting the database) so projects like pgbouncer or odyssey are unlikely to disappear anytime soon. I also wonder if we could make it more permissive even in v1, without implementing dump/restore of session state. Consider for example patterns like this: BEGIN; SET LOCAL enable_nestloop = off; ... COMMIT; or PREPARE x(int) AS SELECT ...; EXECUTE x(1); EXECUTE x(2); ... EXECUTE x(10); DEALLOCATE x; or perhaps even CREATE FUNCTION f() AS $$ ... $$ LANGUAGE sql SET enable_nestloop = off; In all those cases (and I'm sure there are other similar examples) the connection pool considers the session 'tainted': it marks it as such, and we never reset that. So even when an application tries to play nice, it can't use pooling. Would it be possible to maybe track this with more detail (number of prepared statements, ignore SET LOCAL, ...)? That should allow us to do pooling even without full support for restoring session state. 2) configuration I think we need to rethink how the pool is configured.
The options available at the moment are more a consequence of the implementation and are rather cumbersome to use in some cases. For example, we have session_pool_size, which is (essentially) the number of backends kept in the pool. Which seems fine at first, because it seems like you might say max_connections = 100 session_pool_size = 50 to say the connection pool will only ever use 50 connections, leaving the rest for "direct" connection. But that does not work at all, because the number of backends the pool can open is session_pool_size * connection_proxies * databases * roles which pretty much means there's no limit, because while we can specify the number of proxies, the number of databases and roles is arbitrary. And there's no way to restrict which dbs/roles can use the pool. So you can happily do this max_connections = 100 connection_proxies = 4 session_pool_size = 10 pgbench -c 24 -U user1 test1 pgbench -c 24 -U user2 test2 pgbench -c 24 -U user3 test3 pgbench -c 24 -U user4 test4 at which point it's pretty much game over, because each proxy has 4 pools, each with ~6 backends, 96 backends in total. And because non-tainted connections are never closed, no other users/dbs can use the pool (will just wait indefinitely). To allow practical configurations, I think we need to be able to define: * which users/dbs can use the connection pool * minimum/maximum pool size per user, per db and per user/db * maximum number of backend connections We need to be able to close connections when needed (when not assigned, and we need the connection for someone else). Plus those limits need to be global, not "per proxy" - it's just strange that increasing connection_proxies bumps up the effective pool size. I don't know what's the best way to specify this configuration - whether to store it in a separate file, in some system catalog, or what. 3) monitoring I think we need much better monitoring capabilities. 
At this point we have a single system catalog (well, a SRF) giving us proxy-level summary. But I think we need much more detailed overview - probably something like pgbouncer has - listing of client/backend sessions, with various details. Of course, that's difficult to do when those lists are stored in private memory of each proxy process - I think we need to move this to shared memory, which would also help to address some of the issues I mentioned in the previous section (particularly that the limits need to be global, not per proxy). 4) restart_pooler_on_reload I find it quite strange that restart_pooler_on_reload binds restart of the connection pool to reload of the configuration file. That seems like a rather surprising behavior, and I don't see why would you ever want that? Currently it seems like the only way to force the proxies to close the connections (the docs mention DROP DATABASE), but why shouldn't we have separate functions to do that? In particular, why would
Re: proposal: type info support functions for functions that use "any" type
Pavel Stehule writes: > On Sat, Mar 9, 2019 at 7:22 AM Pavel Stehule > wrote: >> Tom introduced supported functions for calculation function's selectivity. >> Still I have similar idea to use supported function for calculation >> function's parameter's types and function return type. >> Motivation: >> Reduce a necessity of overloading of functions. My motivation is related >> primary to Orafce, but this feature should be helpful for anybody with >> similar goals. The function's overloading is great functionality but it is >> hard for maintenance. > here is a patch TBH, I don't like this proposal one bit. As far as I can see, the idea is to let a function's support function redefine the function's declared argument and result types on-the-fly according to no predetermined rules, and that seems to me like it's a recipe for disaster. How will anyone understand which function(s) are candidates to match a query, or why one particular candidate got selected over others? It's already hard enough to understand the behavior of polymorphic functions in complex cases, and those are much more constrained than this would be. Moreover, I don't think you've even provided a compelling example case. What's this doing that you couldn't do with existing polymorphic types or the anycompatibletype proposal? I also strongly suspect that this would break pieces of the system that expect that the stored pg_proc.prorettype has something to do with reality. At minimum, you'd need to fix a number of places you haven't touched here that have their own knowledge of function type resolution, such as enforce_generic_type_consistency, resolve_polymorphic_argtypes, resolve_aggregate_transtype. Probably anyplace that treats polymorphics as being any sort of special case would have to be taught to re-call the support function to find out what it should think the relevant types are.
(I don't even want to think about what happens if the support function's behavior changes between original parsing and these re-checking spots.) Another thing that's very much less than compelling about your example is that your support function seems to be happy to throw errors if the argument types don't match what it's expecting. That seems quite unacceptable, since it would prevent the parser from moving on to consider other possibly-matching functions. Maybe that's just because it's a quick hack not a polished example, but it doesn't seem like a good precedent. In short, I think the added complexity and bug potential outweigh any possible gain from this. regards, tom lane
Re: Optimization of some jsonb functions
Thomas Munro wrote: > This doesn't apply -- to attract reviewers, could we please have a rebase? To help the review go forward, I have rebased the patch on 27cd521e6e. It passes `make check` for me, but that's as far as I've verified the correctness. I squashed the changes into a single patch, sorry if that makes it harder to review than the original set of five patch files... -- Joe Nelson https://begriffs.com diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c index 69f41ab455..8dced4ef6c 100644 --- a/src/backend/utils/adt/jsonb.c +++ b/src/backend/utils/adt/jsonb.c @@ -1873,9 +1873,7 @@ jsonb_object_agg_finalfn(PG_FUNCTION_ARGS) bool JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) { - JsonbIterator *it; - JsonbIteratorToken tok PG_USED_FOR_ASSERTS_ONLY; - JsonbValue tmp; + JsonbValue *scalar PG_USED_FOR_ASSERTS_ONLY; if (!JsonContainerIsArray(jbc) || !JsonContainerIsScalar(jbc)) { @@ -1884,25 +1882,8 @@ JsonbExtractScalar(JsonbContainer *jbc, JsonbValue *res) return false; } - /* - * A root scalar is stored as an array of one element, so we get the array - * and then its first (and only) member. 
- */ - it = JsonbIteratorInit(jbc); - - tok = JsonbIteratorNext(&it, &tmp, true); - Assert(tok == WJB_BEGIN_ARRAY); - Assert(tmp.val.array.nElems == 1 && tmp.val.array.rawScalar); - - tok = JsonbIteratorNext(&it, res, true); - Assert(tok == WJB_ELEM); - Assert(IsAJsonbScalar(res)); - - tok = JsonbIteratorNext(&it, &tmp, true); - Assert(tok == WJB_END_ARRAY); - - tok = JsonbIteratorNext(&it, &tmp, true); - Assert(tok == WJB_DONE); + scalar = getIthJsonbValueFromContainer(jbc, 0, res); + Assert(scalar); return true; } diff --git a/src/backend/utils/adt/jsonb_op.c b/src/backend/utils/adt/jsonb_op.c index a64206eeb1..82c4b0b2cb 100644 --- a/src/backend/utils/adt/jsonb_op.c +++ b/src/backend/utils/adt/jsonb_op.c @@ -24,6 +24,7 @@ jsonb_exists(PG_FUNCTION_ARGS) Jsonb *jb = PG_GETARG_JSONB_P(0); text *key = PG_GETARG_TEXT_PP(1); JsonbValue kval; + JsonbValue vval; JsonbValue *v = NULL; /* @@ -38,7 +39,7 @@ jsonb_exists(PG_FUNCTION_ARGS) v = findJsonbValueFromContainer(&jb->root, JB_FOBJECT | JB_FARRAY, - &kval); + &kval, &vval); PG_RETURN_BOOL(v != NULL); } @@ -59,6 +60,7 @@ jsonb_exists_any(PG_FUNCTION_ARGS) for (i = 0; i < elem_count; i++) { JsonbValue strVal; + JsonbValue valVal; if (key_nulls[i]) continue; @@ -69,7 +71,7 @@ jsonb_exists_any(PG_FUNCTION_ARGS) if (findJsonbValueFromContainer(&jb->root, JB_FOBJECT | JB_FARRAY, - &strVal) != NULL) + &strVal, &valVal) != NULL) PG_RETURN_BOOL(true); } @@ -92,6 +94,7 @@ jsonb_exists_all(PG_FUNCTION_ARGS) for (i = 0; i < elem_count; i++) { JsonbValue strVal; + JsonbValue valVal; if (key_nulls[i]) continue; @@ -102,7 +105,7 @@ jsonb_exists_all(PG_FUNCTION_ARGS) if (findJsonbValueFromContainer(&jb->root, JB_FOBJECT | JB_FARRAY, - &strVal) == NULL) + &strVal, &valVal) == NULL) PG_RETURN_BOOL(false); } diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c index ac04c4a57b..05e1c18472 100644 --- a/src/backend/utils/adt/jsonb_util.c +++ b/src/backend/utils/adt/jsonb_util.c @@ -57,6 +57,8 @@ static void appendValue(JsonbParseState *pstate, JsonbValue *scalarVal); static void
appendElement(JsonbParseState *pstate, JsonbValue *scalarVal); static int lengthCompareJsonbStringValue(const void *a, const void *b); static int lengthCompareJsonbPair(const void *a, const void *b, void *arg); +static int lengthCompareJsonbString(const char *val1, int len1, + const char *val2, int len2); static void uniqueifyJsonbObject(JsonbValue *object); static JsonbValue *pushJsonbValueScalar(JsonbParseState **pstate, JsonbIteratorToken seq, @@ -297,6 +299,102 @@ compareJsonbContainers(JsonbContainer *a, JsonbContainer *b) return res; } +/* Find scalar element in Jsonb array and return it. */ +static JsonbValue * +findJsonbElementInArray(JsonbContainer *container, JsonbValue *elem, + JsonbValue *res) +{ + JsonbValue *result; + JEntry *children = container->children; + int count = JsonContainerSize(container); + char *baseAddr = (char *) (children + count); + uint32 offset = 0; + int i; + + result = res ? res : palloc(sizeof(*result)); + + for (i = 0; i < count; i++) + { + fillJsonbValue(container, i, baseAddr, offset, result); + + if (elem->type == result->type && + equalsJsonbScalarValue(elem, result)) + return result; + + JBE_ADVANCE_OFFSET(offset, children[i]); + } + + if (!res) + pfree(result); + + return NULL; +} + +/* Find value by key in Jsonb object and fetch it into 'res'. */ +JsonbValue * +findJsonbKeyInObject(JsonbContainer *container, const char *keyVal, int keyLen, + JsonbValue *res) +{ + JEntry *children = container->children; + JsonbValue *result = res; + int count = JsonContainerSize(container); + /* Since this is an object, account for *Pairs* of
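To make the shape of findJsonbElementInArray in the patch above easier to follow, here is a minimal Python model of the same linear scan over a flattened array container. The layout is deliberately simplified (a list of child lengths plus a packed byte buffer); real jsonb JEntry headers encode more than a length, so treat the names and layout here as illustrative assumptions, not the actual on-disk format.

```python
# Toy model of the container layout the C code walks: child "JEntry" headers
# first (here just lengths), then the packed element bytes behind them.

def find_element_in_array(children, base, elem):
    """Linear scan: decode each child at its running offset, compare to elem."""
    offset = 0
    for length in children:              # each entry gives the child's length
        value = base[offset:offset + length]
        if value == elem:
            return offset                # found; caller can decode from here
        offset += length                 # JBE_ADVANCE_OFFSET equivalent
    return None                          # not found (C version returns NULL)

# Pack three string elements: lengths up front, raw bytes behind them.
items = [b"apple", b"pear", b"plum"]
children = [len(i) for i in items]
base = b"".join(items)

assert find_element_in_array(children, base, b"pear") == 5
assert find_element_in_array(children, base, b"kiwi") is None
```

The point of the patch is the same as here: a root scalar or a sought element can be fetched by direct offset arithmetic instead of spinning up a full iterator.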
Re: seems like a bug in pgbench -R
Fabien COELHO writes: >> |...] So I'll mark this ready for committer. > Ok, thanks for the review. LGTM, pushed. regards, tom lane
Re: Optimze usage of immutable functions as relation
I wrote: > * It would be useful for the commentary to point out that in principle we > could pull up any immutable (or, probably, even just stable) expression; > but we don't, (a) for fear of multiple evaluations of the result costing > us more than we can save, and (b) because a primary goal is to let the > constant participate in further const-folding, and of course that won't > happen for a non-Const. BTW, so far as I can see, none of the test cases demonstrate that such further const-folding can happen. Since this is all pretty processing- order-sensitive, I think an explicit test that that's possible would be a good idea. regards, tom lane
Re: Optimze usage of immutable functions as relation
Anastasia Lubennikova writes: > New version is in attachments. I took a quick look at this and I have a couple of gripes --- * The naming and documentation of transform_const_function_to_result seem pretty off-point to me. ISTM the real goal of that function is to pull up constant values from RTE_FUNCTION RTEs. Replacing the RTE with an RTE_RESULT is just a side-effect that's needed so that we don't generate a useless FunctionScan plan node. I'd probably call it pull_up_constant_function or something like that. * It would be useful for the commentary to point out that in principle we could pull up any immutable (or, probably, even just stable) expression; but we don't, (a) for fear of multiple evaluations of the result costing us more than we can save, and (b) because a primary goal is to let the constant participate in further const-folding, and of course that won't happen for a non-Const. * The test cases are really pretty bogus, because int4(1) or int4(0) are not function invocations at all. The parser thinks those are no-op type casts, and so what comes out of parse analysis is already just a plain Const. Thus, not one of these test cases is actually verifying that const-folding of an immutable function has happened before we try to pull up. While you could dodge the problem today by, say, writing int8(0) which isn't a no-op cast, I'd recommend staying away from typename() notation altogether. There's too much baggage there and too little certainty that you wrote a function call not something else. The existing test cases you changed, with int4(sin(1)) and so on, are better examples of something that has to actively be folded to a constant. regards, tom lane
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 7/1/19 5:20 PM, Alexey Kondratov wrote: Hi Thomas, On 01.07.2019 15:02, Thomas Munro wrote: Hi Alexey, This no longer applies. Since the Commitfest is starting now, could you please rebase it? Thank you for a reminder. Rebased version of the patch is attached. I've also modified my logging code in order to obey new unified logging system for command-line programs commited by Peter (cc8d415117). Regards Hi Alexey, I would like to suggest a couple of changes to docs and comments, please see the attachment. The "...or fetched on startup" part also seems wrong here, but it's not a part of your patch, so I'm going to ask about it on psql-docs separately. It might also be useful to reword the following error messages: - "using restored from archive version of file \"%s\"" - "could not open restored from archive file \"%s\" We could probably say something like "could not open file \"%s\" restored from WAL archive" instead. On a more general note, I wonder if everyone is happy with the --using-postgresql-conf option name, or we should continue searching for a narrower term. Unfortunately, I don't have any better suggestions right now, but I believe it should be clear that its purpose is to fetch missing WAL files for target. What do you think? -- Liudmila Mantrova Technical writer at Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 52a1caa..7e76fcc 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -66,9 +66,12 @@ PostgreSQL documentation can be found either on the target timeline, the source timeline, or their common ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the - target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. 
In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or + target cluster ran for a long time after the divergence, its old WAL + files might no longer be present. In this case, you can manually copy them + from the WAL archive to the pg_wal directory, or run + pg_rewind with the -r or + -R option to automatically retrieve them from the WAL + archive, or fetched on startup by configuring or . The use of pg_rewind is not limited to failover, e.g. a standby @@ -203,6 +206,39 @@ PostgreSQL documentation + -r + --use-postgresql-conf + + +Use the restore_command defined in +postgresql.conf to retrieve WAL files from +the WAL archive if these files are no longer available in the +pg_wal directory of the target cluster. + + +This option cannot be used together with --restore-command. + + + + + + -R restore_command + --restore-command=restore_command + + +Specifies the restore_command to use for retrieving +WAL files from the WAL archive if these files are no longer available +in the pg_wal directory of the target cluster. + + +If restore_command is already set in +postgresql.conf, you can provide the +--use-postgresql-conf option instead. + + + + + --debug @@ -288,7 +324,10 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b history forked off from the target cluster. For each WAL record, record each data block that was touched. This yields a list of all the data blocks that were changed in the target cluster, after the - source cluster forked off. + source cluster forked off. If some of the WAL files are no longer + available, try re-running pg_rewind with + the -r or -R option to search + for the missing files in the WAL archive. 
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index 287af60..5a7f759 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -12,6 +12,7 @@ #include "postgres_fe.h" #include +#include #include "pg_rewind.h" #include "filemap.h" @@ -44,6 +45,7 @@ static char xlogfpath[MAXPGPATH]; typedef struct XLogPageReadPrivate { const char *datadir; + const char *restoreCommand; int tliIndex; } XLogPageReadPrivate; @@ -52,6 +54,9 @@ static int SimpleXLogPageRead(XLogReaderState *xlogreader, int reqLen, XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *pageTLI); +static int RestoreArchivedWAL(const char *path, const char *xlogfname, + off_t expectedSize, const char *restoreCommand); + /* * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline * index 'tliIndex' in target timeline history, until 'endpoint'. Make note of
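The RestoreArchivedWAL helper the patch introduces builds a shell command from the configured restore_command by substituting the standard placeholders, the same way the server's own archive recovery does. Below is a small Python sketch of that placeholder expansion; the function name is mine, but the %f (WAL file name), %p (destination path), and %% (literal percent) semantics follow the documented restore_command contract.

```python
# Expand a restore_command template the way archive recovery does:
#   %f -> name of the WAL file to fetch
#   %p -> path the file should be restored to
#   %% -> a literal '%'

def build_restore_command(template, xlogfname, xlogpath):
    out, i = [], 0
    while i < len(template):
        if template[i] == "%" and i + 1 < len(template):
            c = template[i + 1]
            if c == "f":                 # %f: WAL file name only
                out.append(xlogfname); i += 2; continue
            if c == "p":                 # %p: destination path
                out.append(xlogpath); i += 2; continue
            if c == "%":                 # %%: literal percent sign
                out.append("%"); i += 2; continue
        out.append(template[i]); i += 1
    return "".join(out)

cmd = build_restore_command("cp /archive/%f %p",
                            "000000010000000000000005",
                            "pg_wal/RECOVERYXLOG")
assert cmd == "cp /archive/000000010000000000000005 pg_wal/RECOVERYXLOG"
```

The resulting string is then handed to the shell; pg_rewind additionally checks that the restored file has the expected size before using it.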
Re: Built-in connection pooler
I attached a patch to apply after your latest patch [2] with my suggested changes to the docs. I tried to make things read smoother without altering your meaning. I don't think the connection pooler chapter fits in The SQL Language section, it seems more like Server Admin functionality so I moved it to follow the chapter on HA, load balancing and replication. That made more sense to me looking at the overall ToC of the docs. Thank you. I have committed your documentation changes in my Git repository and attached a new patch with your corrections. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 84341a3..2758506 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -719,6 +719,123 @@ include_dir 'conf.d' + + max_sessions (integer) + + max_sessions configuration parameter + + + + + The maximum number of client sessions that can be handled by + one connection proxy when session pooling is enabled. + This parameter does not add any memory or CPU overhead, so + specifying a large max_sessions value + does not affect performance. + If the max_sessions limit is reached, new connections are not accepted. + + + The default value is 1000. This parameter can only be set at server start. + + + + + + session_pool_size (integer) + + session_pool_size configuration parameter + + + + + Enables session pooling and defines the maximum number of + backends that can be used by client sessions for each database/user combination. + Launched non-tainted backends are never terminated even if there are no active sessions. + A backend is considered tainted if the client updates GUCs, creates temporary tables or prepared statements. + A tainted backend can serve only one client. 
+ + + The default value is 10, so up to 10 backends will serve each database, + + + + + + proxy_port (integer) + + proxy_port configuration parameter + + + + + Sets the TCP port for the connection pooler. + Clients connected to the main "port" will be assigned dedicated backends, + while clients connected to the proxy port will be connected to backends through a proxy which + performs transaction-level scheduling. + + The default value is 6543. + + + + + + connection_proxies (integer) + + connection_proxies configuration parameter + + + + + Sets the number of connection proxies. + The postmaster spawns a separate worker process for each proxy and scatters connections between proxies using one of the scheduling policies (round-robin, random, load-balancing). + Each proxy launches its own subset of backends. + So the maximal number of non-tainted backends is session_pool_size*connection_proxies*databases*roles. + + The default value is 0, so session pooling is disabled. + + + + + + session_schedule (enum) + + session_schedule configuration parameter + + + + + Specifies the scheduling policy for assigning sessions to proxies when + connection pooling is enabled. The default policy is round-robin. + + With the round-robin policy, the postmaster cyclically scatters sessions between proxies. + + With the random policy, the postmaster randomly chooses a proxy for each new session. + + With the load-balancing policy, the postmaster chooses the proxy with the lowest load average. + A proxy's load average is estimated from the number of client connections assigned to it, with extra weight for SSL connections. + + + + + + restart_pooler_on_reload (string) + + restart_pooler_on_reload configuration parameter + + + + + Restart session pool workers once pg_reload_conf() is called. + The default value is false. 
+ + + + unix_socket_directories (string) diff --git a/doc/src/sgml/connpool.sgml b/doc/src/sgml/connpool.sgml new file mode 100644 index 000..a4b2720 --- /dev/null +++ b/doc/src/sgml/connpool.sgml @@ -0,0 +1,173 @@ + + + + Connection pooling + + + built-in connection pool proxy + + + +PostgreSQL spawns a separate process (backend) for each client. +For large number of clients this model can consume a large number of system +resources and lead to significant performance degradation, especially on computers with large +number of CPU cores. The
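The session_schedule policies described in the docs patch above (round-robin, random, load-balancing with extra weight for SSL connections) can be sketched in a few lines. This is a hypothetical illustration, not code from the patch: the class name, the 1.5 SSL weight, and the load bookkeeping are assumptions made only to show how the three policies differ.

```python
import random

# Hedged sketch of the three proxy-scheduling policies from the docs above.
class ProxyScheduler:
    def __init__(self, n_proxies, policy="round-robin"):
        self.loads = [0.0] * n_proxies   # per-proxy load estimate
        self.policy = policy
        self.next = 0                    # round-robin cursor

    def assign(self, is_ssl=False):
        if self.policy == "round-robin":
            proxy = self.next
            self.next = (self.next + 1) % len(self.loads)
        elif self.policy == "random":
            proxy = random.randrange(len(self.loads))
        else:  # load-balancing: pick the currently least-loaded proxy
            proxy = min(range(len(self.loads)), key=self.loads.__getitem__)
        # SSL connections count with extra weight (1.5 here is an assumption)
        self.loads[proxy] += 1.5 if is_ssl else 1.0
        return proxy

s = ProxyScheduler(2, "load-balancing")
# One weighted SSL session makes proxy 0 "heavier", so the next two plain
# sessions both land on proxy 1.
assert [s.assign(is_ssl=True), s.assign(), s.assign()] == [0, 1, 1]
```

The load-balancing branch shows why the docs mention estimating load per proxy: without the per-proxy counters, the policy degenerates into round-robin.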
Re: POC: Cleaning up orphaned files using undo logs
On Fri, 26 Jul 2019 at 12:25, Amit Kapila wrote: > I agree with all your other comments. Thanks for addressing the comments. Below is the continuation of my comments from 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch : + * Perform rollback request. We need to connect to the database for first + * request and that is required because we access system tables while for first request and that is required => for the first request. This is required --- +UndoLauncherShmemSize(void) +{ +Size size; + +/* + * Need the fixed struct and the array of LogicalRepWorker. + */ +size = sizeof(UndoApplyCtxStruct); The fixed structure size should be offsetof(UndoApplyCtxStruct, workers) rather than sizeof(UndoApplyCtxStruct) --- In UndoWorkerCleanup(), we set individual fields of the UndoApplyWorker structure, whereas in UndoLauncherShmemInit(), for all the UndoApplyWorker array elements, we just memset all the UndoApplyWorker structure elements to 0. I think we should be consistent for the two cases. I guess we can just memset to 0 as you do in UndoLauncherShmemInit(), but this will cause the worker->undo_worker_queue to be 0 i.e. XID_QUEUE , whereas in UndoWorkerCleanup(), it is set to -1. Is the -1 value essential, or we can just set it to XID_QUEUE initially ? Also, if we just use memset in UndoWorkerCleanup(), we need to first save generation into a temp variable, and then after memset(), restore it back. That brought me to another point : We already have a macro ResetUndoRequestInfo(), so UndoWorkerCleanup() can just call ResetUndoRequestInfo(). +bool allow_peek; + +CHECK_FOR_INTERRUPTS(); + +allow_peek = !TimestampDifferenceExceeds(started_at, Some comments would be good about what is allow_peek used for. 
Something like : "Arrange to prevent the worker from restarting quickly to switch databases" - +++ b/src/backend/access/undo/README.UndoProcessing - +worker then start reading from one of the queues the requests for that start=>starts --- +work, it lingers for UNDO_WORKER_LINGER_MS (10s as default). This avoids As per the latest definition, it is 20s. IMHO, there's no need to mention the default value in the readme. --- +++ b/src/backend/access/undo/discardworker.c --- + * portion of transaction that is overflowed into a separate log can be processed 80-col crossed. +#include "access/undodiscard.h" +#include "access/discardworker.h" Not in alphabetical order +++ b/src/backend/access/undo/undodiscard.c --- +next_insert = UndoLogGetNextInsertPtr(logno); I checked UndoLogGetNextInsertPtr() definition. It calls find_undo_log_slot() to get back the slot from logno. Why not make it accept slot as against logno ? At all other places, the slot->logno is passed, so it is convenient to just pass the slot there. And in UndoDiscardOneLog(), first call find_undo_log_slot() just before the above line (or call it at the end of the do-while loop). This way, during each of the UndoLogGetNextInsertPtr() calls in undorequest.c, we will have one less find_undo_log_slot() call. My suggestion is of course valid only under the assumption that when you call UndoLogGetNextInsertPtr(fooslot->logno), then inside UndoLogGetNextInsertPtr(), find_undo_log_slot() will return back the same fooslot. - In UndoDiscardOneLog(), there are at least 2 variable declarations that can be moved inside the do-while loop : uur and next_insert. I am not sure about the other variables viz : undofxid and latest_discardxid. Values of these variables in one iteration continue across to the second iteration. For latest_discardxid, it looks like we do want its value to be carried forward, but is it also true for undofxid ? + /* If we reach here, this means there is something to discard. 
*/ + need_discard = true; + } while (true); Also, about need_discard; there is no place where need_discard is set to false. That means, from 2nd iteration onwards, it will never be false. So even if the code that explicitly sets need_discard to true does not get run, still the undolog will be discarded. Is this expected ? - +if (request_rollback && dbid_exists(uur->uur_txn->urec_dbid)) +{ +(void) RegisterRollbackReq(InvalidUndoRecPtr, + undo_recptr, + uur->uur_txn->urec_dbid, + uur->uur_fxid); + +pending_abort = true; +} We can get rid of request_rollback variable. Whatever the "if" block above is doing, do it in this upper condition : if (!IsXactApplyProgressCompleted(uur->uur_txn->urec_progress)) Something like this : if (!IsXactApplyProgressCompleted(uur->uur_txn->urec_progress)) { if (dbid_exists(uur->uur_txn->urec_dbid)) {
Re: Fetching timeline during recovery
On Fri, 26 Jul 2019 10:02:58 +0200 Jehan-Guillaume de Rorthais wrote: > On Fri, 26 Jul 2019 16:49:53 +0900 (Tokyo Standard Time) > Kyotaro Horiguchi wrote: [...] > > We have an LSN reporting function each for several objectives. > > > > pg_current_wal_lsn > > pg_current_wal_insert_lsn > > pg_current_wal_flush_lsn > > pg_last_wal_receive_lsn > > pg_last_wal_replay_lsn > > Yes. In fact, my current implementation might be split as: > > pg_current_wal_tl: returns TL on a production cluster > pg_last_wal_received_tl: returns last received TL on a standby > > If useful, I could add pg_last_wal_replayed_tl. I don't think *insert_tl and > *flush_tl would be useful as a cluster in production is not supposed to > change its timeline during its lifetime. > > > But, I'm not sure just adding further pg_last_*_timeline() to > > this list is a good thing.. > > I think this is a much better idea than mixing different cases (production and > standby) in the same function as I did. Moreover, it's much more coherent with > other existing functions. Please find attached a new version of the patch. It now creates two new functions: pg_current_wal_tl() pg_last_wal_received_tl() Regards, From 1e21fb7203e66ed514129d41c9bbf947c5284d7b Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais Date: Thu, 25 Jul 2019 19:36:40 +0200 Subject: [PATCH] Add functions to get timeline pg_current_wal_tl() returns the current timeline of a cluster in production. pg_last_wal_received_tl() returns the timeline of the last xlog record flushed to disk. 
--- src/backend/access/transam/xlog.c | 17 + src/backend/access/transam/xlogfuncs.c | 20 src/backend/replication/walreceiver.c | 19 +++ src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 12 5 files changed, 69 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index da3d250986..fd30c88534 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -12243,3 +12243,20 @@ XLogRequestWalReceiverReply(void) { doRequestWalReceiverReply = true; } + +/* + * Returns current active timeline. + */ +TimeLineID +GetCurrentTimeLine(void) +{ + TimeLineID localTimeLineID; + + SpinLockAcquire(&XLogCtl->info_lck); + + localTimeLineID = XLogCtl->ThisTimeLineID; + + SpinLockRelease(&XLogCtl->info_lck); + + return localTimeLineID; +} diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c index b35043bf71..ae877be351 100644 --- a/src/backend/access/transam/xlogfuncs.c +++ b/src/backend/access/transam/xlogfuncs.c @@ -776,3 +776,23 @@ pg_promote(PG_FUNCTION_ARGS) (errmsg("server did not promote within %d seconds", wait_seconds))); PG_RETURN_BOOL(false); } + +/* + * Returns the current timeline on a production cluster + */ +Datum +pg_current_wal_tl(PG_FUNCTION_ARGS) +{ + TimeLineID currentTL; + + if (RecoveryInProgress()) + ereport(ERROR, +(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("%s cannot be executed during recovery.", + "pg_current_wal_tl()"))); + + currentTL = GetCurrentTimeLine(); + + PG_RETURN_INT32(currentTL); +} diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 6abc780778..97d1c900c7 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -1454,3 +1454,22 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS) /* Returns the record as Datum */ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls))); } + +/* + * Returns 
the timeline of the last xlog record flushed to WAL + */ +Datum +pg_last_wal_received_tl(PG_FUNCTION_ARGS) +{ + TimeLineID localTimeLineID; + WalRcvData *walrcv = WalRcv; + + SpinLockAcquire(&walrcv->mutex); + localTimeLineID = walrcv->receivedTLI; + SpinLockRelease(&walrcv->mutex); + + if (!localTimeLineID) + PG_RETURN_NULL(); + + return localTimeLineID; +} diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index d519252aad..f0502c0b41 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -313,6 +313,7 @@ extern XLogRecPtr GetInsertRecPtr(void); extern XLogRecPtr GetFlushRecPtr(void); extern XLogRecPtr GetLastImportantRecPtr(void); extern void RemovePromoteSignalFiles(void); +extern TimeLineID GetCurrentTimeLine(void); extern bool CheckPromoteSignal(void); extern void WakeupRecovery(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 0902dce5f1..d7ec6ea100 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -6006,6 +6006,18 @@ { oid => '2851', descr => 'wal filename, given a wal location', proname => 'pg_walfile_name', prorettype => 'text', proargtypes => 'pg_lsn', prosrc => 'pg_walfile_name' }, +{ oid => '3434', + descr => 'current timeline', + proname
Re: Built-in connection pooler
> I attached new version of the patch with fixed indentation problems and > Win32 specific fixes. Great, this latest patch applies cleanly to master. installcheck world still passes. > "connections_proxies" is used mostly to toggle connection pooling. > Using more than 1 proxy is needed only for huge workloads (hundreds of > connections). My testing showed that using only one proxy performed very poorly versus not using the pooler, even at 300 connections, with -3% TPS. At lower numbers of connections it was much worse than other configurations I tried. I just shared my full pgbench results [1]; the "No Pool" and "# Proxies 2" data is what I used to generate the charts I previously shared. I had referred to the 1-proxy and 10-proxy data but hadn't shared those results; sorry about that. > And "session_pool_size" is the core parameter which determines the efficiency of > pooling. > The main trouble with it now, is that it is per database/user > combination. Each such combination will have its own connection pool. > Choosing the optimal number of pooler backends is a non-trivial task. It > certainly depends on the number of available CPU cores. > But if backends are mostly disk-bound, then the optimal number of pooler > workers can be larger than the number of cores. I will do more testing around this variable next. It seems that increasing session_pool_size for connection_proxies = 1 might help and leaving it at its default was my problem. > The PgPRO EE version of the connection pooler has an "idle_pool_worker_timeout" > parameter which allows terminating idle workers. +1 > It is possible to implement it for the vanilla version of the pooler as well. But > the primary intention of this patch was to minimize changes in the Postgres core Understood. I attached a patch to apply after your latest patch [2] with my suggested changes to the docs. I tried to make things read smoother without altering your meaning. 
I don't think the connection pooler chapter fits in The SQL Language section, it seems more like Server Admin functionality so I moved it to follow the chapter on HA, load balancing and replication. That made more sense to me looking at the overall ToC of the docs. Thanks, [1] https://docs.google.com/spreadsheets/d/11XFoR26eiPQETUIlLGY5idG3fzJKEhuAjuKp6RVECOU [2] https://www.postgresql.org/message-id/attachment/102848/builtin_connection_proxy-11.patch *Ryan* > builtin_connection_proxy-docs-1.patch Description: Binary data
Re: psql FETCH_COUNT feature does not work with combined queries
FETCH_COUNT does not work with combined queries, and probably has never worked since 2006. For the record, this bug has already been reported & discussed by Daniel Vérité a few months back, see: https://www.postgresql.org/message-id/flat/a0a854b6-563c-4a11-bf1c-d6c6f924004d%40manitou-mail.org Given the points of view expressed on this thread, especially Tom's view against improving the lexer used by psql to know where a query ends, it looks like my documentation patch is the way to go in the short term, and that if the "always show query results" patch is committed, it might somewhat simplify solving this issue as the PQexec call is turned into PQsendQuery. -- Fabien.
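To see why "improving the lexer used by psql" is the hard part here, consider the naive alternative: splitting a combined query string on semicolons. The deliberately broken Python sketch below (my illustration, not psql code) shows that semicolons inside single-quoted or dollar-quoted literals defeat any splitter that does not track quoting state, which is exactly the lexing complexity the thread is weighing.

```python
# A deliberately naive splitter: treat every ';' as a statement boundary.
def naive_split(sql):
    return [q.strip() for q in sql.split(";") if q.strip()]

combined = "select 1; select 'a;b'; select $$x;y$$"
parts = naive_split(combined)

# The semicolons inside the quoted and dollar-quoted literals are wrongly
# treated as statement separators, mangling two of the three queries:
assert parts == ["select 1", "select 'a", "b'", "select $$x", "y$$"]
```

A correct splitter would need to understand single quotes, double quotes, dollar quoting, and comments - essentially a small copy of the psql lexer, which is the maintenance burden being objected to.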
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Fri, Jul 26, 2019 at 10:57 AM Jonathan S. Katz wrote: > > Hi, > > Before my reply, I wanted to say that I've been lurking on this thread > for a bit as I've tried to better inform myself on encryption at rest > and how it will apply to what we want to build. I actually built a > (poor) prototype in Python of the key management system that Joe & > Masahiko both laid out, in addition to performing some "buffer > encrpytion" with it. It's not worth sharing at this point. > > With the disclaimer that I'm not as familiar with a lot of concepts as I > would like to be: > > On 7/25/19 1:54 PM, Masahiko Sawada wrote: > > On Fri, Jul 26, 2019 at 2:18 AM Bruce Momjian wrote: > >> > >> On Thu, Jul 18, 2019 at 12:04:25PM +0900, Masahiko Sawada wrote: > >>> I've re-considered the design of TDE feature based on the discussion > >>> so far. The one of the main open question is the granular of > >>> encryption objects: cluster encryption or more-granular-than-cluster > >>> encryption. The followings describe about the new TDE design when we > >>> choose table-level encryption or something-new-group-level encryption. > >>> > >>> General > >>> > >>> We will use AES and support both AES-128 and AES-256. User can specify > >>> the new initdb option something like --aes-128 or --aes-256 to enable > >>> encryption and must specify --encryption-key-passphrase-command along > >>> with. (I guess we also require openssl library.) If these options are > >>> specified, we write the key length to the control file and derive the > >>> KEK and generate MDEK during initdb. wal_log_hints will be enabled > >>> automatically in encryption mode, like we do for checksum mode, > >> > >> Agreed. pg_control will store the none/AES128/AES256 indicator. > >> > >>> Key Management > >>> == > >>> We will use 3-tier key architecture as Joe proposed. > >>> > >>> 1. 
A master key encryption key (KEK): this is the key supplied by the > >>> database admin using something akin to ssl_passphrase_command > >>> > >>> 2. A master data encryption key (MDEK): this is a generated key using a > >>> cryptographically secure pseudo-random number generator. It is > >>> encrypted using the KEK, probably with Key Wrap (KW): > >>> or maybe better Key Wrap with Padding (KWP): > >>> > >>> 3a. Per table data encryption keys (TDEK): use MDEK and HKDF to generate > >>> table specific keys. > >> > >> What is the value of a per-table encryption key? How is HKDF derived? > > > > Per-table encryption key is derived from MDEK with salt and its OID as > > info. I think we can store salts for each encryption key into a > > separate file so that off-line tools also can read it. > > +1 with using the info/salt for the HKDF as described above. The other > decision will be the hashing algorithm to use. SHA-256? Yeah, SHA-256 would be better for safety. > > > >>> 3b. WAL data encryption keys (WDEK): Similarly use MDEK and a HKDF to > >>> generate new keys when needed for WAL. > >>> > >>> We store MDEK to the plain file (say global/pgkey) after encrypted > >>> with the KEK. I might want to store the hash of passphrase of the KEK > >>> in order to verify the correctness of the given passphrase. However we > >>> don't need to store TDEK and WDEK as we can derive them as needed. The > >>> key file can be read by both backend processes and front-end tools. > >> > >> Yes, we need to verify the pass phrase. > > Just to clarify, this would be a hash of the KEK? No, it's a hash of the passphrase. Or we might be able to use crypt(3) to verify the input passphrase. Apart from passing the passphrase, there are users who rather want to pass the key directly, for example when using external key management services. So it might be good if we provide both ways. > > From my experiments, the MDEK key unwrapping fails if you do not have > the correct KEK (as it should). 
If it's a matter of storing a hash of > the KEK, I'm not sure if there is much added benefit to have it, but I > would not necessarily oppose it either. > > >>> When the postmaster starts up, it reads the key file, decrypts the MDEK and > >>> derives the WDEK using the key id for WDEK. > > I don't know if this is getting too far ahead, but what happens if the > supplied KEK fails to decrypt the MDEK? Will postmaster refuse to start up? I think it should refuse to start up. It would not be able to operate properly without the correct keys, and refusing to start also protects against a possibly malicious user. > > >>> The WDEK is loaded into the key hash map > >>> (keyid -> key) in shared memory. Also we derive the TDEK as needed > >>> when reading tables or indexes and add it to the key hash map as well > >>> if it does not exist. > > +1 to this approach. > > >>> > >>> Buffer Encryption > >>> == > >>> We will use AES-CBC for buffer encryption. We will add a key id (4 bytes) > >> > >> I think we might want to use CTR for this, and will post after this. > > Not sure if I missed this post or not (as
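To make the key-derivation scheme discussed above concrete: the TDEK derivation (HKDF over the MDEK, with a per-key salt and the table OID as the "info" input, using SHA-256) can be sketched with a stdlib-only RFC 5869 HKDF. The 32-byte key length, salt handling, and big-endian OID encoding below are illustrative assumptions, not the patch's actual on-disk format:

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """RFC 5869 HKDF (extract-then-expand) using SHA-256."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()          # extract step
    okm, block = b"", b""
    for counter in range(1, -(-length // 32) + 1):              # expand step
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
    return okm[:length]


def derive_tdek(mdek: bytes, salt: bytes, table_oid: int) -> bytes:
    # Hypothetical encoding: the table OID, big-endian, as the HKDF "info",
    # so each table gets its own key derived from the single MDEK.
    return hkdf_sha256(mdek, salt, table_oid.to_bytes(4, "big"))
```

The derivation is deterministic, which is why TDEKs never need to be stored: as long as the MDEK and the salts survive, any table's key can be recomputed on demand.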
Re: Optimze usage of immutable functions as relation
23.07.2019 14:36, Anastasia Lubennikova : 08.07.2019 4:18, Thomas Munro: The July Commitfest is here. Could we please have a rebase of this patch? Updated patch is in attachments. I've only resolved one small cosmetic merge conflict. Later this week I'm going to send a more thoughtful review. Well, I looked through the patch and didn't find any issues, so I'll mark this ready for committer. Last implementation differs from the original one and is based on suggestion from this message: https://www.postgresql.org/message-id/2906.1542395026%40sss.pgh.pa.us It does eval_const_expressions() earlier and pulls up only Const functions. I also added a few more tests and comments. New version is in attachments. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 36fefd9..acc3776 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -1071,7 +1071,8 @@ preprocess_expression(PlannerInfo *root, Node *expr, int kind) expr = flatten_join_alias_vars(root->parse, expr); /* - * Simplify constant expressions. + * Simplify constant expressions. For function RTEs, this was already done + * by pullup_simple_subqueries. * * Note: an essential effect of this is to convert named-argument function * calls to positional notation and insert the current actual values of @@ -1085,7 +1086,8 @@ preprocess_expression(PlannerInfo *root, Node *expr, int kind) * careful to maintain AND/OR flatness --- that is, do not generate a tree * with AND directly under AND, nor OR directly under OR. */ - expr = eval_const_expressions(root, expr); + if (kind != EXPRKIND_RTFUNC && kind != EXPRKIND_RTFUNC_LATERAL) + expr = eval_const_expressions(root, expr); /* * If it's a qual or havingQual, canonicalize it. 
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c index 4fbc03f..f96b290 100644 --- a/src/backend/optimizer/prep/prepjointree.c +++ b/src/backend/optimizer/prep/prepjointree.c @@ -79,6 +79,9 @@ static Node *pull_up_simple_union_all(PlannerInfo *root, Node *jtnode, static void pull_up_union_leaf_queries(Node *setOp, PlannerInfo *root, int parentRTindex, Query *setOpQuery, int childRToffset); +static void transform_const_function_to_result(PlannerInfo *root, Node *jtnode, + RangeTblEntry *rte, + JoinExpr *lowest_nulling_outer_join); static void make_setop_translation_list(Query *query, Index newvarno, List **translated_vars); static bool is_simple_subquery(Query *subquery, RangeTblEntry *rte, @@ -754,6 +757,18 @@ pull_up_subqueries_recurse(PlannerInfo *root, Node *jtnode, is_simple_values(root, rte)) return pull_up_simple_values(root, jtnode, rte); + /* + * Or is it an immutable function that evaluated to a single Const? + */ + if (rte->rtekind == RTE_FUNCTION) + { + rte->functions = (List *) +eval_const_expressions(root, (Node *) rte->functions); + transform_const_function_to_result(root, jtnode, rte, + lowest_nulling_outer_join); + return jtnode; + } + /* Otherwise, do nothing at this node. */ } else if (IsA(jtnode, FromExpr)) @@ -1784,6 +1799,78 @@ is_simple_union_all_recurse(Node *setOp, Query *setOpQuery, List *colTypes) } /* + * If the function of this RTE_FUNCTION entry evaluated to a single Const + * after eval_const_expressions, transform it to RTE_RESULT. + */ +static void +transform_const_function_to_result(PlannerInfo *root, Node *jtnode, + RangeTblEntry *rte, + JoinExpr *lowest_nulling_outer_join) +{ + ListCell *lc; + pullup_replace_vars_context rvcontext; + RangeTblFunction *rtf = (RangeTblFunction *) linitial(rte->functions); + Query *parse = root->parse; + + if (!IsA(rtf->funcexpr, Const)) + return; + + /* Fail if the RTE has ORDINALITY - we don't implement that here. 
*/ + if (rte->funcordinality) + return; + + /* Fail if RTE isn't a single, simple Const expr */ + if (list_length(rte->functions) != 1) + return; + + rvcontext.targetlist = list_make1(makeTargetEntry((Expr *) rtf->funcexpr, + 1 /* resno */, NULL /* resname */, false /* resjunk */)); + rvcontext.root = root; + rvcontext.target_rte = rte; + + /* + * Since this function was reduced to Const, it can't really have lateral + * references, even if it's marked as LATERAL. This means we don't need + * to fill relids. + */ + rvcontext.relids = NULL; + + rvcontext.outer_hasSubLinks = &parse->hasSubLinks; + rvcontext.varno = ((RangeTblRef *) jtnode)->rtindex; + + /* + * If this RTE is on a nullable side of an outer join, we have to insert + * PHVs around our Consts so that they go to null when needed. + */ + rvcontext.need_phvs = lowest_nulling_outer_join != NULL; + + rvcontext.wrap_non_vars = false; + rvcontext.rv_cache = palloc0((list_length(rvcontext.targetlist) +
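The transformation in the patch above — run eval_const_expressions() on a function RTE and, when the whole expression collapses to a single Const, replace the function scan with that constant — can be illustrated on a toy expression tree. The node classes and the FUNCS catalog here are invented for the sketch; they are not the planner's real structures:

```python
from dataclasses import dataclass


@dataclass
class Const:
    value: object


@dataclass
class FuncCall:
    name: str
    args: list
    immutable: bool


# Toy "catalog" of callable functions, standing in for pg_proc lookups.
FUNCS = {"pi": lambda: 3.14159, "upper": lambda s: s.upper()}


def fold_constants(node):
    """Collapse an immutable call whose arguments all fold to Consts,
    the way eval_const_expressions() pre-evaluates immutable functions."""
    if isinstance(node, Const):
        return node
    args = [fold_constants(a) for a in node.args]
    if node.immutable and all(isinstance(a, Const) for a in args):
        return Const(FUNCS[node.name](*(a.value for a in args)))
    return FuncCall(node.name, args, node.immutable)
```

When fold_constants() returns a Const, the analogue of the patch's transform_const_function_to_result() can drop the function scan entirely and substitute the constant; a non-immutable call stays a FuncCall and must still be evaluated at run time.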
Re: Support for jsonpath .datetime() method
On 7/23/19 6:48 PM, Nikita Glukhov wrote: > Some concrete pieces of review: >> + >> +FF1 >> +decisecond (0-9) >> + >> >> Let's not use such weird terms as "deciseconds". We could say >> "fractional seconds, 1 digit" etc. or something like that. > And what about "tenths of seconds", "hundredths of seconds"? Yes, those are much better. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Support for jsonpath .datetime() method
On 7/24/19 4:25 PM, Peter Eisentraut wrote: > On 2019-07-24 00:48, Nikita Glukhov wrote: >> It seems that our YY works like RR should: >> >> SELECT to_date('69', 'YY'); >> to_date >> >> 2069-01-01 >> (1 row) >> >> SELECT to_date('70', 'YY'); >> to_date >> >> 1970-01-01 >> (1 row) >> >> But by the standard the first two digits of the current year should be used in YY. > Is this behavior even documented anywhere in our documentation? I > couldn't find it. What's the exact specification of what it does in > these cases? > >> So it's unclear what we should do: >> - implement YY and RR strictly following the standard only in .datetime() >> - fix YY implementation in to_date()/to_timestamp() and implement RR >> - use our non-standard templates in .datetime() > I think we definitely should try to use the same template system in both > the general functions and in .datetime(). Agreed. It's too hard to maintain otherwise. > This might involve some > compromises between existing behavior, Oracle behavior, SQL standard. > So far I'm not worried: If you're using two-digit years like above, > you're playing with fire anyway. Also some of the other cases like > dealing with trailing spaces are probably acceptable as slight > incompatibilities or extensions. My instinct would be to move as close as possible to the standard, especially if the current behaviour isn't documented. > > We should collect a list of test cases that illustrate the differences > and then work out how to deal with them. > Agreed. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
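The two-digit-year behaviors being compared can be pinned down as two small pivot rules. This is a sketch of the observed to_date() output versus a literal reading of the standard's YY, not of any actual PostgreSQL code:

```python
def postgres_yy(two_digits: int) -> int:
    """Observed to_date(.., 'YY') behavior: a fixed 1970 pivot,
    so 69 -> 2069 but 70 -> 1970 (which is how RR is supposed to act)."""
    return (2000 if two_digits < 70 else 1900) + two_digits


def standard_yy(two_digits: int, current_year: int) -> int:
    """SQL-standard YY: prefix the first two digits of the current year."""
    return (current_year // 100) * 100 + two_digits
```

With current_year=2019 the two rules disagree exactly on inputs 70..99 (standard YY yields 2070..2099, the observed behavior yields 1970..1999), which is the incompatibility the thread has to decide how to handle.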
Re: Contribution to Perldoc for TestLib module in Postgres
On 7/10/19 9:38 AM, Alvaro Herrera wrote: > On 2019-Apr-11, Iwata, Aya wrote: > >> In the above document, why not write a description after the function name? >> I think it is better to write the function name first and then the >> description. >> In your code; >> #Checks if all the tests passed or not >> all_tests_passing() >> >> In my suggestion; >> all_tests_passing() >> Checks if all the tests passed or not > Yeah, so there are two parts in the submitted patch: first the synopsis > lists the methods using this format you describe, and later the METHODS > section lists them again, using your suggested style. I think we should > do away with the long synopsis -- maybe keep it as just the "use > TestLib" line, and then let the METHODS section list and describe the > methods. > >> And some functions return a value. How about adding return information >> to the above doc? > That's already in the METHODS section for some of them. For example: > > all_tests_passing() > Returns 1 if all the tests pass. Otherwise returns 0 > > It's missing for others, such as "tempdir". > > In slurp_file you have this: > Opens the file provided as an argument to the function in read mode (as > indicated by <). > I think the parenthetical comment is useless; remove that. > > Please break long source lines (say to 78 chars -- make sure pgperltidy > agrees), and keep some spaces after sentence-ending periods and other > punctuation. > I've fixed the bitrot and some other infelicities in this patch. It's not committable yet but I think it's more reviewable. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm index 6195c21c59..ebb0eb82e7 100644 --- a/src/test/perl/TestLib.pm +++ b/src/test/perl/TestLib.pm @@ -1,9 +1,108 @@ -# TestLib, low-level routines and actions regression tests.
-# -# This module contains a set of routines dedicated to environment setup for -# a PostgreSQL regression test run and includes some low-level routines -# aimed at controlling command execution, logging and test functions. This -# module should never depend on any other PostgreSQL regression test modules. + +=pod + +=head1 NAME + +TestLib - module containing routines for environment setup, low-level routines +for command execution, logging and test functions + +=head1 SYNOPSIS + + use TestLib; + + # Checks if all the tests passed or not + all_tests_passing() + + # Creates a temporary directory + tempdir(prefix) + + # Creates a temporary directory outside the build tree for the + # Unix-domain socket + tempdir_short() + + # Returns the real directory for a virtual path directory under msys + real_dir(dir) + + # Runs the command which is passed as an argument + system_log() + + # Runs the command which is passed as an argument. On failure abandons + # all other tests + system_or_bail() + + # Runs the command which is passed as an argument. 
+ run_log() + + # Returns the output after running a command + run_command(dir) + + # Generate a string made of the given range of ASCII characters + generate_ascii_string(from_char, to_char) + + # Read the contents of the directory + slurp_dir(dir) + + # Read the contents of the file + slurp_file(filename) + + # Appends a string at the end of a given file + append_to_file(filename, str) + + # Check that all file/dir modes in a directory match the expected values + check_mode_recursive(dir, expected_dir_mode, expected_file_mode, + ignore_list) + + # Change mode recursively on a directory + chmod_recursive(dir, dir_mode, file_mode) + + # Check presence of a given regexp within pg_config.h + check_pg_config(regexp) + + # Test function to check that the command runs successfully + command_ok(cmd, test_name) + + # Test function to check that the command fails + command_fails(cmd, test_name) + + # Test function to check that the command exit code matches the + # expected exit code + command_exit_is(cmd, expected, test_name) + + # Test function to check that the command supports --help option + program_help_ok(cmd) + + # Test function to check that the command supports --version option + program_version_ok(cmd) + + # Test function to check that a command with invalid option returns + # non-zero exit code and error message + program_options_handling_ok(cmd) + + # Test function to check that the command runs successfully and the + # output matches the given regular expression + command_like(cmd, expected_stdout, test_name) + + #TODO + command_like_safe(cmd, expected_stdout, test_name) + + # Test function to check that the command fails and the error message + # matches the given regular expression + command_fails_like(cmd, expected_stderr, test_name) + + # Test function to run a command and check its status and outputs + command_checks_all(cmd, expected_ret, out,
Re: SQL:2011 PERIODS vs Postgres Ranges?
The patch requires a rebase on the master branch. The new status of this patch is: Waiting on Author
Re: psql - add SHOW_ALL_RESULTS option
Fabien COELHO wrote: > sh> /usr/bin/psql > psql (12beta2 ...) > fabien=# \set FETCH_COUNT 2 > fabien=# SELECT 1234 \; SELECT 5432 ; > fabien=# > > same thing with pg 11.4, and probably down to every version of postgres > since the feature was implemented... > > I think that fixing this should be a separate bug report and patch. I'll > try to look at it. That reminds me that it was already discussed in [1]. I should add the proposed fix to the next commitfest. [1] https://www.postgresql.org/message-id/flat/a0a854b6-563c-4a11-bf1c-d6c6f924004d%40manitou-mail.org Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Re: Hypothetical indexes using BRIN broken since pg10
On Fri, Jul 26, 2019 at 1:34 PM Heikki Linnakangas wrote: > > There seems to be consensus on going with the approach from the > original patch, so I had a closer look at it. > > The patch first does this: > > > > > /* > >* Obtain some data from the index itself, if possible. Otherwise > > invent > >* some plausible internal statistics based on the relation page > > count. > >*/ > > if (!index->hypothetical) > > { > > indexRel = index_open(index->indexoid, AccessShareLock); > > brinGetStats(indexRel, &statsData); > > index_close(indexRel, AccessShareLock); > > } > > else > > { > > /* > >* Assume default number of pages per range, and estimate the > > number > >* of ranges based on that. > >*/ > > indexRanges = Max(ceil((double) baserel->pages / > > > > BRIN_DEFAULT_PAGES_PER_RANGE), 1.0); > > > > statsData.pagesPerRange = BRIN_DEFAULT_PAGES_PER_RANGE; > > statsData.revmapNumPages = (indexRanges / > > REVMAP_PAGE_MAXITEMS) + 1; > > } > > ... > > And later in the function, there's this: > > > /* work out the actual number of ranges in the index */ > > indexRanges = Max(ceil((double) baserel->pages / > > statsData.pagesPerRange), > > 1.0); > > It seems a bit error-prone that essentially the same formula is used > twice in the function, to compute 'indexRanges', with some distance > between them. Perhaps some refactoring would help with that, although I'm not > sure what exactly would be better. Maybe move the second computation > earlier in the function, like in the attached patch? I had the same thought without a great idea on how to refactor it. I'm fine with the one in this patch. > The patch assumes the default pages_per_range setting, but looking at > the code at https://github.com/HypoPG/hypopg, the extension actually > takes pages_per_range into account when it estimates the index size. I > guess there's no easy way to pass the pages_per_range setting down to > brincostestimate().
I'm not sure what we should do about that, but it seems > that just using BRIN_DEFAULT_PAGES_PER_RANGE here is not very accurate. Yes, hypopg can use a custom pages_per_range as it intercepts it when the hypothetical index is created. I didn't find any way to get that information in brincostestimate(), especially for something that could be backpatched. I don't like it, but I don't see how to do better than just using BRIN_DEFAULT_PAGES_PER_RANGE :( > The attached patch is based on PG v11, because I tested this with > https://github.com/HypoPG/hypopg, and it didn't compile with later > versions. There's a small difference in the locking level used between > v11 and 12, which makes the patch not apply across versions, but that's > easy to fix by hand. FTR I created a REL_1_STABLE branch for hypopg which is compatible with pg12 (it's already used for debian packages), as master is still in beta and v12 compatibility is still being worked on.
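For reference, the fallback path for a hypothetical BRIN index boils down to two formulas, mirrored here in Python. The revmap_items_per_page default below is a rough stand-in, not the real block-size-derived REVMAP_PAGE_MAXITEMS:

```python
import math


def hypothetical_brin_stats(heap_pages: int,
                            pages_per_range: int = 128,
                            revmap_items_per_page: int = 1360):
    """Estimate range and revmap page counts for a hypothetical BRIN index
    from the heap's page count alone, assuming the default pages_per_range
    (BRIN_DEFAULT_PAGES_PER_RANGE is 128)."""
    index_ranges = max(math.ceil(heap_pages / pages_per_range), 1)
    revmap_pages = index_ranges // revmap_items_per_page + 1
    return index_ranges, revmap_pages
```

This also shows why a hypopg-style custom pages_per_range matters: a 640000-page heap has 5000 ranges at the default setting but 80000 ranges at pages_per_range=8, a 16x difference in the estimated index size.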
Re: Hypothetical indexes using BRIN broken since pg10
On 27/06/2019 23:09, Alvaro Herrera wrote: On 2019-Jun-27, Tom Lane wrote: Dunno, I just can't get excited about exposing REVMAP_PAGE_MAXITEMS. Especially not since we seem to agree on the long-term solution here, and what you're suggesting to Julien doesn't particularly fit into that long-term solution. Well, it was brin_page.h, which is supposed to be internal to BRIN itself. But since we admit that in its current state selfuncs.c is not salvageable as a module and we'll redo the whole thing in the short term, I withdraw my comment. There seems to be consensus on going with the approach from the original patch, so I had a closer look at it. The patch first does this: /* * Obtain some data from the index itself, if possible. Otherwise invent * some plausible internal statistics based on the relation page count. */ if (!index->hypothetical) { indexRel = index_open(index->indexoid, AccessShareLock); brinGetStats(indexRel, &statsData); index_close(indexRel, AccessShareLock); } else { /* * Assume default number of pages per range, and estimate the number * of ranges based on that. */ indexRanges = Max(ceil((double) baserel->pages / BRIN_DEFAULT_PAGES_PER_RANGE), 1.0); statsData.pagesPerRange = BRIN_DEFAULT_PAGES_PER_RANGE; statsData.revmapNumPages = (indexRanges / REVMAP_PAGE_MAXITEMS) + 1; } ... And later in the function, there's this: /* work out the actual number of ranges in the index */ indexRanges = Max(ceil((double) baserel->pages / statsData.pagesPerRange), 1.0); It seems a bit error-prone that essentially the same formula is used twice in the function, to compute 'indexRanges', with some distance between them. Perhaps some refactoring would help with that, although I'm not sure what exactly would be better. Maybe move the second computation earlier in the function, like in the attached patch?
The patch assumes the default pages_per_range setting, but looking at the code at https://github.com/HypoPG/hypopg, the extension actually takes pages_per_range into account when it estimates the index size. I guess there's no easy way to pass the pages_per_range setting down to brincostestimate(). I'm not sure what we should do about that, but it seems that just using BRIN_DEFAULT_PAGES_PER_RANGE here is not very accurate. The attached patch is based on PG v11, because I tested this with https://github.com/HypoPG/hypopg, and it didn't compile with later versions. There's a small difference in the locking level used between v11 and 12, which makes the patch not apply across versions, but that's easy to fix by hand. - Heikki diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 7ac6d2b339..0a2e7898cc 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -102,6 +102,7 @@ #include <math.h> #include "access/brin.h" +#include "access/brin_page.h" #include "access/gin.h" #include "access/htup_details.h" #include "access/sysattr.h" @@ -8079,11 +8080,31 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, &spc_seq_page_cost); /* - * Obtain some data from the index itself. + * Obtain some data from the index itself, if possible. Otherwise invent + * some plausible internal statistics based on the relation page count. */ - indexRel = index_open(index->indexoid, AccessShareLock); - brinGetStats(indexRel, &statsData); - index_close(indexRel, AccessShareLock); + if (!index->hypothetical) + { + indexRel = index_open(index->indexoid, AccessShareLock); + brinGetStats(indexRel, &statsData); + index_close(indexRel, AccessShareLock); + + /* work out the actual number of ranges in the index */ + indexRanges = Max(ceil((double) baserel->pages / + statsData.pagesPerRange), 1.0); + } + else + { + /* + * Assume default number of pages per range, and estimate the number + * of ranges based on that.
+ */ + indexRanges = Max(ceil((double) baserel->pages / + BRIN_DEFAULT_PAGES_PER_RANGE), 1.0); + + statsData.pagesPerRange = BRIN_DEFAULT_PAGES_PER_RANGE; + statsData.revmapNumPages = (indexRanges / REVMAP_PAGE_MAXITEMS) + 1; + } /* * Compute index correlation @@ -8184,10 +8205,6 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count, baserel->relid, JOIN_INNER, NULL); - /* work out the actual number of ranges in the index */ - indexRanges = Max(ceil((double) baserel->pages / statsData.pagesPerRange), - 1.0); - /* * Now calculate the minimum possible ranges we could match with if all of * the rows were in the perfect order in the table's heap.
Re: make libpq documentation navigable between functions
Hello Peter, I have committed this with some additions. Thanks for the push. It was really a pain to write a small libpq app without navigation. Also, due to some mysterious problems with the PDF toolchain I had to remove some links. Your script would find those, so I won't list them here. If you put those back in, the PDF build breaks. If you want to work out why, feel free to submit more patches. Otherwise I'm happy to leave it as is now; it's very useful. Ok fine with me for now as well. I'm not keen to invest more time on the documentation tool chain. -- Fabien.
psql FETCH_COUNT feature does not work with combined queries
Hello devs, As pointed out by Kyotaro Horiguchi in https://www.postgresql.org/message-id/20190726.131704.86173346.horikyota@gmail.com FETCH_COUNT does not work with combined queries, and probably has never worked since 2006. What seems to happen is that ExecQueryUsingCursor is hardcoded to handle one simple query. It simply inserts the cursor declaration in front of the query believed to be a select: DECLARE _psql_cursor NO SCROLL CURSOR FOR <query> ; For combined queries, say two selects, it results in: DECLARE _psql_cursor NO SCROLL CURSOR FOR <query 1> \; <query 2> ; Then PQexec returns the result of the second one, and nothing is printed. However, if the second query is not a select, eg: "select ... \; update ... ;", the result of the *first* query is shown. How fun! This is because PQexec returns the second result. The cursor declaration expects a PGRES_COMMAND_OK before proceeding. With a select it gets PGRES_TUPLES_OK, so it decides it is an error and silently skips to the end. With the update it indeed obtains the expected PGRES_COMMAND_OK, not really for the command it sent but who cares, and proceeds to show the cursor results. Basically, the whole logic is broken. The minimum is to document that it does not work properly with combined queries. The attached patch does that, so that the bug becomes a documented bug, aka a feature:-) Otherwise, psql's lexer could probably detect, with some effort, that it is a combined query (detect embedded ; and check that they are not empty queries), so that it could skip the feature if that is the case. Another approach would be to try to detect reliably whether the returned result does not correspond to the cursor one. Maybe some result counting could be added somewhere so that the number of results under PQexec is accessible to the user, i.e. the result struct would contain its own number. Hmmm. A more complex approach would be to keep the position of embedded queries, and to insert cursor declarations where needed, currently the last one if it is a SELECT.
However, for the previous ones the allocation and such could be prohibitive as no cursor would be used. Not sure it is worth the effort as the bug has not been detected for 13 years. -- Fabien.diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index 7789fc6177..d217d82f57 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -3684,6 +3684,12 @@ bar fail after having already displayed some rows. + + +This feature does not work properly with combined queries. + + + Although you can use any output format with this feature,
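The "detect embedded ;" idea above can be sketched as a toy scanner that counts statement separators outside quoted strings; a real implementation would live in psql's lexer and would also have to handle dollar quoting, doubled quotes, and comments. This is only a rough illustration of the check, not psql code:

```python
def count_statements(sql: str) -> int:
    """Count non-empty statements in one query buffer: a separator only
    counts if something other than whitespace follows it, so a plain
    trailing ';' does not make the query 'combined'."""
    count, in_quote = 1, None
    for i, ch in enumerate(sql):
        if in_quote:
            if ch == in_quote:
                in_quote = None          # closing quote
        elif ch in ("'", '"'):
            in_quote = ch                # opening quote
        elif ch == ";" and sql[i + 1:].strip():
            count += 1                   # separator with content after it
    return count
```

psql could then skip the cursor-based FETCH_COUNT machinery whenever such a check reports more than one statement, instead of silently misbehaving.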
Re: make libpq documentation navigable between functions
On 2019-07-22 22:56, Fabien COELHO wrote: > Attached script does, hopefully, the expected transformation. It adds ids > to occurrences when the id is not defined elsewhere. > > Attached v3 is the result of applying your kindly provided xslt patch plus > the script on "libpq.sgml". > > Three functions are ignored because no documentation is found: > PQerrorField (does not exist anywhere in the sources), > PQsetResultInstanceData (idem) and PQregisterThreadLock (it exists). > > Doc build works for me and looks ok. I have committed this with some additions. I have changed all the function-related id attributes in libpq.sgml to be mixed case, for easier readability. So if you run your script again, you can omit the lc() call. I also needed to make some changes to the markup in some places to remove extra whitespace that would have appeared in the generated link. (This was already happening in some places, but your patch would have repeated it in many places.) Also, due to some mysterious problems with the PDF toolchain I had to remove some links. Your script would find those, so I won't list them here. If you put those back in, the PDF build breaks. If you want to work out why, feel free to submit more patches. Otherwise I'm happy to leave it as is now; it's very useful. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: POC: Cleaning up orphaned files using undo logs
On Wed, Jul 24, 2019 at 10:00 AM Dilip Kumar wrote: > > On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar wrote: > > > Please find my review comments for > 0013-Allow-foreground-transactions-to-perform-undo-action > > > + * We can't postpone applying undo actions for subtransactions as the > + * modifications made by aborted subtransaction must not be visible even if > + * the main transaction commits. > + */ > + if (IsSubTransaction()) > + return; > > I am not completely sure but is it possible that the outer function > CommitTransactionCommand/AbortCurrentTransaction can avoid > calling this function in the switch case based on the current state, > so that under subtransaction this will never be called? > We can do that and also can have an additional check similar to "if (!s->performUndoActions)", but such a check would have to be added at all places from which this function is called. I feel that will make the code less readable in many places. > > + bool undo_req_pushed[UndoLogCategories]; /* undo request pushed > + * to worker? */ > + bool performUndoActions; > + > struct TransactionStateData *parent; /* back link to parent */ > > We must have some comments to explain how performUndoActions is used, > where it's set. If it's explained somewhere else then we can > give reference to that code. > I am planning to remove this variable in the next version and have an explicit check as we have in UndoActionsRequired. I agree with your other comments and will address them in the next version of the patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
COPY command on a table column marked as GENERATED ALWAYS
Hi All, I'm able to insert data into a table column marked as GENERATED ALWAYS using the COPY command; however, it fails with the INSERT command. Isn't that a bug with the COPY command? Here is the test case for more clarity.

postgres=# create table tab_always (i int generated always as identity, j int);
CREATE TABLE
postgres=# insert into tab_always values(1, 10);
ERROR: cannot insert into column "i"
DETAIL: Column "i" is an identity column defined as GENERATED ALWAYS.
HINT: Use OVERRIDING SYSTEM VALUE to override.

[ashu@localhost bin]$ cat /tmp/always.csv
13	10
14	20
15	30
16	40

postgres=# copy tab_always from '/tmp/always.csv';
COPY 4
postgres=# select * from tab_always;
 i  | j
----+----
 13 | 10
 14 | 20
 15 | 30
 16 | 40
(4 rows)

-- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
Re: pg_walfile_name_offset can return inconsistent values
On Fri, 26 Jul 2019 17:21:20 +0900 (Tokyo Standard Time) Kyotaro Horiguchi wrote:
> Hello.
>
> While looking at [1], I noticed that pg_walfile_name_offset behaves
> somewhat oddly at segment boundaries.
>
> select * from (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as
>   t(lsn), lateral pg_walfile_name_offset(lsn::pg_lsn);
>     lsn     |        file_name         | file_offset
> ------------+--------------------------+-------------
>  0/16ffffff | 000000020000000000000016 |    16777215
>  0/17000000 | 000000020000000000000016 |           0
>  0/17000001 | 000000020000000000000017 |           1
>
> The file names are right as defined, but the return value on the
> second line is wrong, or at least misleading.

+1 I noticed it as well and put this report on hold while working on my patch. Thanks for reporting this!

> It should be (16, 1000000) or (16, FFFFFF). The former is out-of-domain so we
> would have no choice other than the latter. I'm not sure of the purpose of
> the second output parameter, thus the former might be the right
> decision.
>
> The function returns the following result after this patch is
> applied.
>
> select * from (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as
>   t(lsn), lateral pg_walfile_name_offset(lsn::pg_lsn);
>     lsn     |        file_name         | file_offset
> ------------+--------------------------+-------------
>  0/16ffffff | 000000020000000000000016 |    16777214
>  0/17000000 | 000000020000000000000016 |    16777215
>  0/17000001 | 000000020000000000000017 |           0

So you shift the file offset for all LSNs by one byte? This could lead to regressions in various tools relying on this function. Moreover, it looks weird as the LSN doesn't reflect the given offset anymore (FFFFFF <> 16777214, 01 <> 0, etc). Another solution might be to return the same result for both 0/16ffffff and 0/17000000, but it doesn't feel right either. So in fact, returning 0x1000000 seems to be the cleaner result to me. Regards,
pg_walfile_name_offset can return inconsistent values
Hello. While looking at [1], I noticed that pg_walfile_name_offset behaves somewhat oddly at segment boundaries.

select * from (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as
  t(lsn), lateral pg_walfile_name_offset(lsn::pg_lsn);
    lsn     |        file_name         | file_offset
------------+--------------------------+-------------
 0/16ffffff | 000000020000000000000016 |    16777215
 0/17000000 | 000000020000000000000016 |           0
 0/17000001 | 000000020000000000000017 |           1

The file names are right as defined, but the return value on the second line is wrong, or at least misleading. It should be (16, 1000000) or (16, FFFFFF). The former is out-of-domain so we would have no choice other than the latter. I'm not sure of the purpose of the second output parameter, thus the former might be the right decision. The function returns the following result after this patch is applied.

select * from (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as
  t(lsn), lateral pg_walfile_name_offset(lsn::pg_lsn);
    lsn     |        file_name         | file_offset
------------+--------------------------+-------------
 0/16ffffff | 000000020000000000000016 |    16777214
 0/17000000 | 000000020000000000000016 |    16777215
 0/17000001 | 000000020000000000000017 |           0

regards. [1]: https://www.postgresql.org/message-id/20190725193808.1648ddc8@firost -- Kyotaro Horiguchi NTT Open Source Software Center >From ca9e174f53ac4d513163cbe8201746c8d3d2eb62 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 26 Jul 2019 17:12:24 +0900 Subject: [PATCH] Fix offset of pg_walfile_name_offset. The function returns the name of the previous segment with the offset of the given location at segment boundaries. For example, ("...16", 0) is returned for '0/17000000'. That is inconsistent, or at least confusing. This patch changes the function to return the given LSN - 1 as the offset.
--- src/backend/access/transam/xlogfuncs.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c index b35043bf71..79ea0495b4 100644 --- a/src/backend/access/transam/xlogfuncs.c +++ b/src/backend/access/transam/xlogfuncs.c @@ -447,6 +447,8 @@ pg_last_wal_replay_lsn(PG_FUNCTION_ARGS) * Note that a location exactly at a segment boundary is taken to be in * the previous segment. This is usually the right thing, since the * expected usage is to determine which xlog file(s) are ready to archive. + * To be consistent to filename, returns the offset one byte before the given + * location as offset. */ Datum pg_walfile_name_offset(PG_FUNCTION_ARGS) @@ -480,10 +482,16 @@ pg_walfile_name_offset(PG_FUNCTION_ARGS) resultTupleDesc = BlessTupleDesc(resultTupleDesc); + /* + * We assume the given location point as one-byte after the location + * really wanted. + */ + locationpoint--; + /* * xlogfilename */ - XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size); + XLByteToSeg(locationpoint, xlogsegno, wal_segment_size); XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno, wal_segment_size); values[0] = CStringGetTextDatum(xlogfilename); -- 2.16.3
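The inconsistency being fixed is easiest to see by writing out what the pre-patch code computes: the file name uses XLByteToPrevSeg, i.e. segno = (lsn - 1) / segment_size, while the offset is an independent lsn % segment_size. A Python sketch of that logic, assuming the default 16 MB segment size and (for illustration) timeline 2:

```python
WAL_SEG_SIZE = 16 * 1024 * 1024                # default wal_segment_size
SEGS_PER_XLOGID = 0x100000000 // WAL_SEG_SIZE  # 256 segments per "xlogid"


def walfile_name_offset(tli: int, lsn: int):
    """Mimic pre-patch pg_walfile_name_offset: a boundary LSN is attributed
    to the *previous* segment (XLByteToPrevSeg), but the offset is computed
    independently as lsn % segment size."""
    segno = (lsn - 1) // WAL_SEG_SIZE
    name = "%08X%08X%08X" % (tli, segno // SEGS_PER_XLOGID,
                             segno % SEGS_PER_XLOGID)
    return name, lsn % WAL_SEG_SIZE
```

At a boundary LSN this pairs the segment-16 file name with offset 0, even though offset 0 of that file actually corresponds to the start of segment 16; that mismatched pair is exactly what the report and the patch are about.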
Re: psql - add SHOW_ALL_RESULTS option
Hello Kyotaro-san,

Attached a v2 for the always-show-all-results variant. Thanks for the debug!

> I have some comments on this patch. I'm +1 for always output all results
> without having knobs.

That makes 4 opinions expressed towards this change of behavior, and none
against.

> Documentation (psql-ref.sgml) has another place that needs the same
> amendment.

Indeed.

> Looking the output for -t, -0, -A or something like, we might need to
> introduce result-set separator.

Yep, possibly. I'm not sure this is material for this patch, though.

> # -eH looks broken for me but it would be another issue.

It seems to work for me. Could you be more precise about how it is broken?

> Valid setting of FETCH_COUNT disables this feature. I think it is
> unwanted behavior.

Yes and no: this behavior (bug, really) is pre-existing, FETCH_COUNT does
not work with combined queries:

  sh> /usr/bin/psql
  psql (12beta2 ...)
  fabien=# \set FETCH_COUNT 2
  fabien=# SELECT 1234 \; SELECT 5432 ;
  fabien=#

same thing with pg 11.4, and probably down to every version of postgres
since the feature was implemented... I think that fixing this should be a
separate bug report and patch. I'll try to look at it.

Thanks for the feedback. Attached v3 with further documentation updates.

-- 
Fabien.

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 7789fc6177..4e6ab5a0a5 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -127,18 +127,11 @@ echo '\x \\ SELECT * FROM foo;' | psql
     commands included in the string to divide it into multiple
     transactions. (See for more details about how the server handles
     multi-query strings.)
-    Also, psql only prints the
-    result of the last SQL command in the string.
-    This is different from the behavior when the same string is read from
-    a file or fed to psql's standard input,
-    because then psql sends
-    each SQL command separately.
-    Because of this behavior, putting more than one SQL command in a
-    single -c string often has unexpected results.
-    It's better to use repeated -c commands or feed
-    multiple commands to psql's standard input,
+    If having several commands executed in one transaction is not desired,
+    use repeated -c commands or feed multiple commands to
+    psql's standard input,
     either using echo as illustrated above, or
     via a shell here-document, for example:
@@ -3355,10 +3348,8 @@ select 1\; select 2\; select 3;
     commands included in the string to divide it into multiple
     transactions. (See for more details about how the server handles
     multi-query strings.)
-    psql prints only the last query result
-    it receives for each request; in this example, although all
-    three SELECTs are indeed executed, psql
-    only prints the 3.
+    psql prints all results it receives, one
+    after the other.
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 44a782478d..4534c45854 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -472,6 +482,16 @@ ResetCancelConn(void)
 #endif
 }
 
+static void
+ShowErrorMessage(const PGresult *result)
+{
+	const char *error = PQerrorMessage(pset.db);
+
+	if (strlen(error))
+		pg_log_info("%s", error);
+
+	CheckConnection();
+}
 
 /*
  * AcceptResult
@@ -482,7 +492,7 @@
  * Returns true for valid result, false for error state.
  */
 static bool
-AcceptResult(const PGresult *result)
+AcceptResult(const PGresult *result, bool show_error)
 {
 	bool		OK;
 
@@ -513,15 +523,8 @@ AcceptResult(const PGresult *result)
 			break;
 	}
 
-	if (!OK)
-	{
-		const char *error = PQerrorMessage(pset.db);
-
-		if (strlen(error))
-			pg_log_info("%s", error);
-
-		CheckConnection();
-	}
+	if (!OK && show_error)
+		ShowErrorMessage(result);
 
 	return OK;
 }
@@ -701,7 +704,7 @@ PSQLexec(const char *query)
 
 	ResetCancelConn();
 
-	if (!AcceptResult(res))
+	if (!AcceptResult(res, true))
 	{
 		ClearOrSaveResult(res);
 		res = NULL;
@@ -743,7 +746,7 @@ PSQLexecWatch(const char *query, const printQueryOpt *opt)
 
 	ResetCancelConn();
 
-	if (!AcceptResult(res))
+	if (!AcceptResult(res, true))
 	{
 		ClearOrSaveResult(res);
 		return 0;
@@ -999,199 +1002,113 @@ loop_exit:
 	return success;
 }
 
-
 /*
- * ProcessResult: utility function for use by SendQuery() only
- *
- * When our command string contained a COPY FROM STDIN or COPY TO STDOUT,
- * PQexec() has stopped at the PGresult associated with the first such
- * command. In that event, we'll marshal data for the COPY and then cycle
- * through any subsequent PGresult objects.
- *
- * When the command string contained no such COPY command, this function
- * degenerates to an AcceptResult() call.
- *
- * Changes its argument to point to the last PGresult of the command string,
- * or NULL if
Re: Add parallelism and glibc dependent only options to reindexdb
On Fri, Jul 26, 2019 at 10:53:03AM +0300, Sergei Kornilov wrote:
> Explicit is better than implicit, so I am +1 to commit both patches.

Hence my count is incorrect:
- Forbid --jobs and --index: Michael P, Sergei K.
- Enforce --jobs=1 with --index: Julien R.
- Have no restrictions: 0.
--
Michael
Re: Fetching timeline during recovery
On Fri, 26 Jul 2019 16:49:53 +0900 (Tokyo Standard Time) Kyotaro Horiguchi
wrote:

> Hi.
>
> At Thu, 25 Jul 2019 19:38:08 +0200, Jehan-Guillaume de Rorthais wrote
> in <20190725193808.1648ddc8@firost>
> > On Wed, 24 Jul 2019 14:33:27 +0200
> > Jehan-Guillaume de Rorthais wrote:
> > > On Wed, 24 Jul 2019 09:49:05 +0900
> > > Michael Paquier wrote:
> > > > On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais
> > > > wrote:
> [...]
> > > > I think that there are arguments for being more flexible with it, and
> > > > perhaps have a system-level view to be able to look at some of its
> > > > fields.
> > >
> > > Great idea. I'll give it a try to keep the discussion on.
> >
> > After some thinking, I did not find enough data to expose to justify the
> > creation of a system-level view. As I just need the current timeline, I
> > wrote "pg_current_timeline()". Please find the patch in attachment.
> >
> > The current behavior is quite simple:
> > * if the cluster is in production, return ThisTimeLineID
> > * else return walrcv->receivedTLI (using GetWalRcvWriteRecPtr)
> >
> > This is a really naive implementation. We should probably add some code
> > around the startup process to gather and share general recovery stats.
> > This would allow fetching e.g. the current recovery method, the latest
> > xlog file name restored from archives or streaming, its timeline, etc.
> >
> > Any thoughts?
>
> If replay is delayed behind a timeline switch point, the replay LSN and
> the receive/write/flush LSNs are on different timelines. When the replica
> has not yet reached the new timeline to which an already-received file
> belongs, the function returns a wrong file name, specifically a name
> consisting of the latest segment number and the older timeline, where the
> segment doesn't belong.

Indeed.

> We have an LSN reporting function each for several objectives.
> pg_current_wal_lsn
> pg_current_wal_insert_lsn
> pg_current_wal_flush_lsn
> pg_last_wal_receive_lsn
> pg_last_wal_replay_lsn

Yes. In fact, my current implementation might be split as:

  pg_current_wal_tl: returns the TL on a production cluster
  pg_last_wal_received_tl: returns the last received TL on a standby

If useful, I could add pg_last_wal_replayed_tl. I don't think *insert_tl
and *flush_tl would be useful, as a cluster in production is not supposed
to change its timeline during its lifetime.

> But, I'm not sure just adding further pg_last_*_timeline() to
> this list is a good thing..

I think this is a much better idea than mixing different cases (production
and standby) in the same function, as I did. Moreover, it's much more
coherent with the other existing functions.

> The function returns NULL for NULL input (STRICT behavior) but
> returns (NULL, NULL) for undefined timeline. I don't think the
> difference is meaningful.

Unless I'm missing something, nothing returns "(NULL, NULL)" in
0001-v1-Add-function-pg_current_timeline.patch.

Thank you for your feedback!
Re: block-level incremental backup
On Fri, Jul 26, 2019 at 11:21 AM Jeevan Ladhe wrote:
> Hi Vignesh,
>
> Please find my comments inline below:
>
>> 1) If the relation file has changed due to truncate or vacuum,
>> during incremental backup the new files will be copied.
>> There are chances that both the old file and the new file
>> will be present. I'm not sure if cleaning up of the
>> old file is handled.
>
> When an incremental backup is taken, it either copies the file in its
> entirety if the file has changed more than 90%, or writes a .partial file
> with a changed-block bitmap and the actual data. For files that are
> unchanged, it writes 0 data bytes but still creates a .partial file, so
> there is a .partial file for every file that is to be looked up in the
> full backup.
> While composing a synthetic backup from an incremental backup, the
> pg_combinebackup tool will only look for those relation files in the
> full (parent) backup which have .partial files in the incremental backup.
> So, if a vacuum/truncate happened between the full and incremental
> backups, the incremental backup image will not have a 0-length .partial
> file for that relation, and so the synthetic backup restored using
> pg_combinebackup will not have that file either.

Thanks Jeevan for the update, I feel this logic is good. It will handle
the case of deleting the old relation files.

>> 2) Just a small thought on building the bitmap:
>> can the bitmap be built and maintained as
>> and when the changes are happening in the system?
>> If we are building the bitmap while doing the incremental backup,
>> scanning through each file might take more time.
>> This can be a configurable parameter; the system can run
>> without capturing this information by default, but those
>> who take incremental backups frequently can enable this
>> configuration, which would track the modified blocks.
>
> IIUC, this will need changes in the backend.
> Honestly, I think backup is a maintenance task, and hampering the backend
> for this does not look like a good idea. But, having said that, even if
> we have to provide this as a switch for some users, it will need a
> different infrastructure than what we are building here for constructing
> the bitmap, where we scan all the files one by one. Maybe for the initial
> version we can go with the current proposal that Robert has suggested,
> and add this switch at a later point as an enhancement.

That sounds fair to me.

Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com
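The per-file decision Jeevan describes can be sketched as follows. The 90% threshold, the .partial changed-block bitmap, and the 0-length .partial marker for unchanged files are taken from the thread; the function name and bitmap representation are a hypothetical illustration, not the actual patch code:

```python
def plan_file_copy(total_blocks, changed_blocks, threshold=0.9):
    """Decide how one relation file is written into an incremental backup."""
    changed = len(changed_blocks)
    if total_blocks and changed / total_blocks > threshold:
        return ("full", None)  # copy the file in its entirety
    # otherwise emit a .partial file: a changed-block bitmap plus the data
    # of only those blocks (0 data bytes if nothing changed at all)
    bitmap = [blkno in changed_blocks for blkno in range(total_blocks)]
    return ("partial", bitmap)

print(plan_file_copy(100, set(range(95)))[0])  # full
print(plan_file_copy(100, {1, 7})[0])          # partial
print(plan_file_copy(100, set())[0])           # partial
```

The key property for pg_combinebackup is the last case: an unchanged file still yields a (0-data-byte) .partial entry, so the absence of any .partial file signals that the relation no longer exists.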
Re: Add parallelism and glibc dependent only options to reindexdb
Hi

> So here we go:
> - 0001 is your original thing, with --jobs enforced to 1 for the index
>   part.
> - 0002 is my addition to forbid --index with --jobs.
>
> I am fine to be outvoted regarding 0002, and it is the case based on
> the state of this thread with 2:1. We could always revisit this
> decision in this development cycle anyway.

Explicit is better than implicit, so I am +1 to commit both patches.

regards, Sergei
Re: Fetching timeline during recovery
Hi.

At Thu, 25 Jul 2019 19:38:08 +0200, Jehan-Guillaume de Rorthais wrote in
<20190725193808.1648ddc8@firost>
> On Wed, 24 Jul 2019 14:33:27 +0200
> Jehan-Guillaume de Rorthais wrote:
> > On Wed, 24 Jul 2019 09:49:05 +0900
> > Michael Paquier wrote:
> > > On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais
> > > wrote:
[...]
> > > I think that there are arguments for being more flexible with it, and
> > > perhaps have a system-level view to be able to look at some of its
> > > fields.
> >
> > Great idea. I'll give it a try to keep the discussion on.
>
> After some thinking, I did not find enough data to expose to justify the
> creation of a system-level view. As I just need the current timeline, I
> wrote "pg_current_timeline()". Please find the patch in attachment.
>
> The current behavior is quite simple:
> * if the cluster is in production, return ThisTimeLineID
> * else return walrcv->receivedTLI (using GetWalRcvWriteRecPtr)
>
> This is a really naive implementation. We should probably add some code
> around the startup process to gather and share general recovery stats.
> This would allow fetching e.g. the current recovery method, the latest
> xlog file name restored from archives or streaming, its timeline, etc.
>
> Any thoughts?

If replay is delayed behind a timeline switch point, the replay LSN and
the receive/write/flush LSNs are on different timelines. When the replica
has not yet reached the new timeline to which an already-received file
belongs, the function returns a wrong file name, specifically a name
consisting of the latest segment number and the older timeline, where the
segment doesn't belong.

We have an LSN reporting function each for several objectives:

  pg_current_wal_lsn
  pg_current_wal_insert_lsn
  pg_current_wal_flush_lsn
  pg_last_wal_receive_lsn
  pg_last_wal_replay_lsn

But I'm not sure just adding further pg_last_*_timeline() functions to
this list is a good thing..
The function returns NULL for NULL input (STRICT behavior) but returns
(NULL, NULL) for an undefined timeline. I don't think the difference is
meaningful.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
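Kyotaro's point is that a WAL file name binds a timeline to a segment number, so pairing the newest segment number with a stale timeline names a file that was never written. A rough Python rendering of the XLogFileName() formula (illustrative only) makes the composition explicit:

```python
def xlog_file_name(tli, segno, wal_segment_size=16 * 1024 * 1024):
    # 24 hex digits: 8 for the timeline, then the segment number split
    # into the "xlogid" part and the segment-within-xlogid part
    segs_per_xlogid = 0x100000000 // wal_segment_size
    return "%08X%08X%08X" % (tli, segno // segs_per_xlogid,
                             segno % segs_per_xlogid)

# same segment number, different timeline => an entirely different file
print(xlog_file_name(1, 0x16))  # 000000010000000000000016
print(xlog_file_name(2, 0x16))  # 000000020000000000000016
```

So a timeline-reporting function must return a timeline that actually matches the LSN it is paired with (replayed vs. received), not just the latest one known.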
Re: Add parallelism and glibc dependent only options to reindexdb
On Fri, Jul 26, 2019 at 09:36:32AM +0200, Julien Rouhaud wrote:
> I see that you iterate over the SimpleStringList after it's generated.
> Why not compute that while building it in get_parallel_object_list
> (and keep the provided table list count) instead?

Yeah. I was hesitating between doing that, or just breaking out of the
counting loop if there are more objects than concurrent jobs, but that's
less intuitive.
--
Michael
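Michael's "break out of the counting loop" idea can be sketched like this; Python is used for illustration and the names are hypothetical, not the actual reindexdb code. The slot count is min(concurrentCons, number of objects), and counting can stop as soon as the requested number of jobs is covered:

```python
def parallel_slots_needed(concurrent_cons, object_list):
    # Count list items only until we know the requested number of jobs is
    # already covered; allocating more slots than objects would be wasted.
    count = 0
    for _ in object_list:
        count += 1
        if count >= concurrent_cons:
            return concurrent_cons
    return max(count, 1)  # at least one slot, even for an empty list

print(parallel_slots_needed(8, ["t1", "t2", "t3"]))  # 3
print(parallel_slots_needed(2, ["t1", "t2", "t3"]))  # 2
```

Counting while building the list, as Julien suggests, achieves the same result without the second pass.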
Re: Add parallelism and glibc dependent only options to reindexdb
On Fri, Jul 26, 2019 at 5:27 AM Michael Paquier wrote:
>
> On Thu, Jul 25, 2019 at 01:00:34PM +0200, Julien Rouhaud wrote:
> > The problem is that a user doing something like:
> >
> > reindexdb -j48 -i some_index -S s1 -S s2 -S s3
> >
> > will probably be disappointed to learn that he has to run a specific
> > command for the index(es) that should be reindexed. Maybe we can
> > issue a warning that parallelism isn't used when an index list is
> > processed and the user asked for multiple jobs?
>
> Arguments go in both directions, as some other users may be surprised
> by the performance of index processing when serialization is enforced.

Sure, but there is no easy solution in that case, as you'd have to do all
the work of spawning multiple reindexdb processes according to the
underlying tables, so probably what will happen here is that there'll just
be two simple calls to reindexdb: one for the indexes, serialized anyway,
and one for everything else. My vote is still to allow it, possibly
emitting a notice or a warning.

> > I don't send a new patch since the --index wanted behavior is not
> > clear yet.
>
> So I am sending one patch (actually two) after a closer review that I
> have spent time shaping into a committable state. And for this part I
> have another suggestion, which is to use a switch/case without a default,
> so that any newly-added object type would make somebody think about
> those code paths, as this would generate compiler warnings.

Thanks for that! I'm fine with using a switch to avoid future bad
surprises.

> While reviewing I have found an extra bug in the logic: when using a
> list of tables, the number of parallel slots is the minimum between
> concurrentCons and tbl_count, but this does not get applied after
> building a list of tables for a schema or database reindex, so we had
> better recompute the number of items in reindex_one_database() before
> allocating the number of parallel slots.

I see that you iterate over the SimpleStringList after it's generated.
Why not compute that while building it in get_parallel_object_list
(and keep the provided table list count) instead?
RE: Multivariate MCV list vs. statistics target
On Sat, July 20, 2019 8:12 AM (GMT+9), Tomas Vondra wrote:
> > +/* XXX What if the target is set to 0? Reset the statistic? */
> >
> > This also makes me wonder. I haven't looked deeply into the code, but
> > since 0 is a valid value, I believe it should reset the stats.
>
> I agree resetting the stats after setting the target to 0 seems quite
> reasonable. But that's not what we do for attribute stats, because in
> that case we simply skip the attribute during future ANALYZE runs - we
> don't reset the stats, we keep the existing stats. So I've done the same
> thing here, and I've removed the XXX comment.
>
> If we want to change that, I'd do it in a separate patch for both the
> regular and extended stats.

Hi, Tomas

Sorry for my late reply. You're right; I have no strong opinion on whether
we'd want to change that behavior. I've also confirmed the change in the
patch where setting the statistics target to 0 skips the statistics.

Maybe only some minor nitpicks in the source code comments below:

1. "it's" should be "its":

> + * Compute statistic target, based on what's set for the statistic
> + * object itself, and for it's attributes.

2. Consistency in whether you use "statistic " or "statisticS ".
Ex. statistic target vs. statisticS target, statistics object vs.
statistic object, etc.

> Attached is v4 of the patch, with a couple more improvements:
>
> 1) I've renamed the if_not_exists flag to missing_ok, because that's more
> consistent with the "IF EXISTS" clause in the grammar (the old flag was
> kinda the exact opposite), and I've added a NOTICE about the skip.

+	bool		missing_ok;	/* do nothing if statistics does not exist */

Confirmed. So we ignore the case where the statistics object does not
exist, and skip the error.
Maybe to make it consistent with the other data structures in parsenodes.h,
you can change the comment of missing_ok to:

  /* skip error if statistics object does not exist */

> 2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows, because
> that's what the function was doing anyway (computing sample size).
>
> 3) I've added a couple of regression tests to stats_ext.sql
>
> Aside from that, I've cleaned up a couple of places and improved a bunch
> of comments. Nothing huge.

I have a question though regarding ComputeExtStatisticsRows. I'm just
curious about the value 300 when computing the sample size. Where did this
value come from?

+	/* compute sample size based on the statistic target */
+	return (300 * result);

Overall, the patch is almost in good shape for commit. I'll wait for the
next update.

Regards,
Kirk Jamison
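For context on Kirk's question: the 300 multiplier mirrors the long-standing heuristic in ANALYZE's std_typanalyze(), which samples 300 × statistics_target rows (a bound motivated by Chaudhuri, Motwani and Narasayya's work on sample sizes for histogram construction). A toy version of the computation being discussed (names and the max-over-targets rule are an illustration of the patch's description, not the exact committed code):

```python
def compute_ext_statistics_rows(object_target, attribute_targets):
    # effective target: the object's own target if explicitly set (>= 0),
    # otherwise the maximum over its attributes' targets
    target = object_target if object_target >= 0 else max(attribute_targets)
    # same 300x heuristic as per-attribute ANALYZE sampling
    return 300 * target

print(compute_ext_statistics_rows(-1, [100, 10]))  # 30000
print(compute_ext_statistics_rows(10, [100, 10]))  # 3000
```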
Re: POC: Cleaning up orphaned files using undo logs
On Tue, Jul 23, 2019 at 8:12 PM Amit Khandekar wrote:
>
> On Tue, 23 Jul 2019 at 08:48, Amit Kapila wrote:
> > > --
> > > + if (!InsertRequestIntoErrorUndoQueue(urinfo))
> > > I was thinking what happens if for some reason
> > > InsertRequestIntoErrorUndoQueue() itself errors out. In that case,
> > > the entry will not be marked invalid, and so there will be no undo
> > > action carried out because I think the undo worker will exit. What
> > > happens next with this entry?
> >
> > The same entry is present in two queues, xid and size, so next time it
> > will be executed from the second queue based on its priority in that
> > queue. However, if it fails a second time in the same way, then we will
> > be in trouble, because now the hash table has the entry but neither of
> > the queues does, so none of the workers will attempt to execute it
> > again. Also, when the discard worker again tries to register it, we
> > won't allow adding the entry to a queue, thinking either some backend
> > is executing the same or it must already be part of some queue.
> >
> > One possibility to deal with this could be that we somehow allow the
> > discard worker to register it again in the queue, or we can do this in
> > a critical section so that it allows a system restart on error.
> > However, the main thing is: is it possible that
> > InsertRequestIntoErrorUndoQueue will fail unless there is some bug in
> > the code? If so, we might want to have an Assert for this rather than
> > handling this condition.
>
> Yes, I also think that the function would error out only because of
> can't-happen cases, like "too many locks taken" or "out of binary heap
> slots" or "out of memory" (this last one is not such a can't-happen
> case). These cases happen probably due to some bugs, I suppose. But I
> was wondering: generally when the code errors out with such can't-happen
> elog() calls, the worst thing that happens is that the transaction gets
> aborted. Whereas, in this case, the worst thing that could happen is:
> the undo action would never get executed, which means selects for this
> tuple will keep on accessing the undo log?

Yeah, or in zheap, we have a page-wise rollback facility which rolls back
the transaction for a particular page (this gets triggered whenever we try
to update/delete a tuple which was last updated by an aborted xact, or
when we try to reuse a slot of an aborted xact), and we don't need to
traverse the undo chain.

> This does not sound like any data consistency issue, so we should be
> fine after all?

I will see if we can have an Assert in the code for this.

> --
> + if (UndoGetWork(false, false, , NULL) &&
> +     IsUndoWorkerAvailable())
> +     UndoWorkerLaunch(urinfo);
>
> There is no lock acquired between IsUndoWorkerAvailable() and
> UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable()
> returns true, there is a small window where UndoWorkerLaunch() does not
> find any worker slot with in_use false, causing an assertion failure for
> (worker != NULL).
> --

Yeah, I think UndoWorkerLaunch should be able to return without launching
a worker in such a case.

> + if (!RegisterDynamicBackgroundWorker(, _handle))
> + {
> +     /* Failed to start worker, so clean up the worker slot. */
> +     LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> +     UndoWorkerCleanup(worker);
> +     LWLockRelease(UndoWorkerLock);
> +
> +     return false;
> + }
>
> Is it intentional that there is no (warning?) message logged when we
> can't register a bg worker?

I don't think it was intentional. I think it will be good to have a
warning here.

I agree with all your other comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
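The race Amit Khandekar points out, availability checked without a lock and the launch then asserting a free slot, is the classic check-then-act problem; holding the lock across both the check and the slot claim removes the window. A minimal Python sketch of that pattern (an illustration of the locking discipline, not the actual undo-worker code):

```python
import threading

class WorkerSlots:
    def __init__(self, nslots):
        self._lock = threading.Lock()
        self._in_use = [False] * nslots

    def try_acquire(self):
        # Check availability and claim the slot under one lock, so no
        # concurrent launcher can observe "available" and then find no
        # free slot a moment later.
        with self._lock:
            for i, used in enumerate(self._in_use):
                if not used:
                    self._in_use[i] = True
                    return i
            return None  # caller just skips launching instead of Assert-ing

slots = WorkerSlots(2)
print(slots.try_acquire())  # 0
print(slots.try_acquire())  # 1
print(slots.try_acquire())  # None
```

This matches Amit Kapila's conclusion: UndoWorkerLaunch should treat "no slot" as a normal outcome and return without launching, rather than asserting.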
RE: [Patch] PQconnectPoll() is blocked if target_session_attrs is read-write
Tsunakawa-san

Thank you for your comment. I understand your point; I don't require an
explicit rule.

Regards
Ryo Matsumura