Re: Index-only scan for "f(x)" without "x"
On Thu, 23 Jan 2020 at 00:00, Tom Lane wrote:
> The problem is that the planner's initial analysis of the query tree
> concludes that the scan of "tab" has to return "x", because it looks
> through the tree for plain Vars, and "x" is what it's going to find.
> This conclusion is in fact true for any plan that doesn't involve
> scanning a suitable index on f(x), so it's not something we can just
> dispense with.

If f(x) were actually added to the table as a virtual generated column V, then the planner could perhaps reasonably find that a particular expression depends in part on V (rather than on the set of real columns underlying the virtual generated column). That is, the planner would then know that it needs to return V = "f(x)" rather than "x", and so it could match that requirement to our index and decide to use an index-only scan.

> To back up from that and conclude that the indexscan doesn't really
> need to return "x" because every use of it is in the context "f(x)"
> seems like a pretty expensive proposition, especially once you start
> considering index expressions that are more complex than a single
> function call. A brute-force matching search could easily be O(N^2)
> or worse in the size of the query tree, which is a lot to pay when
> there's a pretty high chance of failure. (IOW, we have to expect
> that most queries on "tab" are in fact not going to be able to use
> that index, so we can't afford to spend a lot of planning time just
> because the index exists.)

In the approach outlined above, the set of expressions to match on would be limited by the set of virtual generated columns defined on the table -- perhaps some sort of prefix tree that allows efficient matching on a given expression syntax tree? There would be no cost to this on a table that has no virtual generated columns.

The question is whether this is a reasonable take on virtual generated columns.

regards
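A hypothetical sketch (plain Python, not PostgreSQL planner code) of the idea above: because the set of candidate expressions is fixed per table (one per virtual generated column), each query-tree node can be checked with a single hash lookup, keeping the whole pass linear in the tree size rather than the O(N^2) brute-force matching Tom describes. Expression trees are modeled as nested tuples; all names here are made up for illustration.

```python
# Expression trees as nested tuples, e.g. ('f', ('var', 'x')) for f(x).
def find_generated_matches(tree, generated):
    """Return {subtree: column} for every subtree equal to a generated-column
    expression. One hash lookup per node -> O(N) in the tree size."""
    matches = {}

    def walk(node):
        if node in generated:                 # O(1) lookup per node
            matches[node] = generated[node]
            return                            # don't descend into a match
        if isinstance(node, tuple):
            for child in node[1:]:            # node[0] is the operator name
                walk(child)

    walk(tree)
    return matches

# Table with virtual column v = f(x); query tree for "f(x) < 1".
generated = {('f', ('var', 'x')): 'v'}
query = ('<', ('f', ('var', 'x')), ('const', 1))

print(find_generated_matches(query, generated))
# {('f', ('var', 'x')): 'v'}
```

With no virtual generated columns, `generated` is empty and the walk finds nothing, which mirrors the "no cost on tables without such columns" claim.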
Re: Run-time pruning for ModifyTable
Sorry, I didn't notice this email until now.

On Wed, Nov 27, 2019 at 5:17 PM Michael Paquier wrote:
> On Tue, Nov 05, 2019 at 04:04:25PM +1300, Thomas Munro wrote:
> > On Thu, Sep 12, 2019 at 10:10 AM Alvaro Herrera wrote:
> > > Here's a rebased version of this patch (it had a trivial conflict).
> >
> > Hi, FYI partition_prune.sql currently fails (maybe something to do
> > with commit d52eaa09?):
>
> David, perhaps you did not notice that? For now I have moved this
> patch to next CF waiting on author to look after the failure.
>
> Amit, Kato-san, both of you are marked as reviewers of this patch.
> Are you planning to look at it?

Sorry, I never managed to look at the patch closely. As I commented up-thread, the functionality added by this patch would be unnecessary if we were to move forward with the other project related to UPDATE and DELETE over inheritance trees: https://www.postgresql.org/message-id/357.1550612935%40sss.pgh.pa.us

I had volunteered to submit a patch in that thread and even managed to write one, but didn't get time to get it into good enough shape to post to the list; for example, I couldn't make it handle foreign child tables. The gist of the new approach is that ModifyTable will always have *one* subplan under it, not N for N target partitions as currently. That one subplan being the same plan as one would get if the query were SELECT instead of UPDATE/DELETE, it would automatically take care of run-time pruning if needed, freeing ModifyTable itself from having to do it.

Now, the chances of such a big overhaul of how UPDATEs of inheritance trees are handled getting into PG 13 seem pretty thin even if I post the patch in a few days, so perhaps it would make sense to get this patch in so that we can give users run-time pruning for UPDATE/DELETE in PG 13, provided the code is not very invasive. If and when the aforesaid overhaul takes place, that code would go away along with a lot of other code.

Thanks, Amit
Re: Do we need to handle orphaned prepared transactions in the server?
On Thu, Jan 23, 2020 at 12:56:41PM +0800, Craig Ringer wrote:
> We could definitely improve on that by exposing a view that integrates
> everything that holds down xmin and catalog_xmin. It'd show
>
> * the datfrozenxid and datminmxid for the oldest database
> * if that database is the current database, the relation(s) with the
>   oldest relfrozenxid and relminmxid
> * ... and the catalog relation(s) with the oldest relfrozenxid and
>   relminmxid if greater
> * the absolute xid and xid-age positions of entries in pg_replication_slots
> * pg_stat_replication connections (joined to pg_stat_replication if
>   connected) with their feedback xmin
> * pg_stat_activity backend_xid and backend_xmin for the backend(s)
>   with oldest values; this may be different sets of backends
> * pg_prepared_xacts entries by oldest xid

It seems to me that what you are describing here is a set of properties good for a monitoring tool that we don't necessarily need to maintain in core. There are already tools able to do that in ways I think are better than what we could ever design, like check_pgactivity and such, and there are years of experience behind them from the maintainers of such scripts and/or extensions.

The argument about Postgres getting more and more complex is true as the code grows, but I am not really convinced that we need to make it grow more with more layers that we think are good, because we may finish by piling up stuff that is not actually that good in the long term. I'd rather just focus in the core code on the basics, with views that map directly to what we have in memory and/or on disk.
--
Michael
Re: vacuum verbose detail logs are unclear; log at *start* of each stage; show allvisible/frozen/hintbits
On Wed, Jan 22, 2020 at 02:34:57PM +0900, Michael Paquier wrote:
> Shouldn't the last part of the sentence be "we should mark it so"
> instead of "we should so mark it"? I would rephrase the whole as
> follows:
> "If the all-visible page is all-frozen but not marked as such yet,
> mark it as all-frozen."

Applied this one to HEAD after chewing on it a bit.
--
Michael
Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?
On Wed, Jan 22, 2020 at 11:18:12PM +0100, Daniel Gustafsson wrote:
> Once my plate clears up a bit I will return to this one, but feel free to mark
> it rwf for this cf.

Thanks for the update. I have switched the patch status to returned with feedback.
--
Michael
Re: Do we need to handle orphaned prepared transactions in the server?
On Thu, 23 Jan 2020 at 01:20, Tom Lane wrote:
> Bruce Momjian writes:
> > I think the big question is whether we want to make active prepared
> > transactions more visible to administrators, either during server start
> > or idle duration.
>
> There's already the pg_prepared_xacts view ...

I think Bruce has a point here. We shouldn't go around "resolving" prepared xacts, but the visibility of them is a problem for users. I've seen that myself quite enough times, even now that they cannot be used by default.

Our monitoring and admin views are not keeping up with Pg's complexity. Resource retention is one area where that's becoming a usability and admin challenge. If a user has growing bloat (and has managed to figure that out, since we don't make it easy to do that either) or unexpected WAL retention, they may find it hard to quickly work out why.

We could definitely improve on that by exposing a view that integrates everything that holds down xmin and catalog_xmin. It'd show

* the datfrozenxid and datminmxid for the oldest database
* if that database is the current database, the relation(s) with the oldest relfrozenxid and relminmxid
* ... and the catalog relation(s) with the oldest relfrozenxid and relminmxid if greater
* the absolute xid and xid-age positions of entries in pg_replication_slots
* pg_stat_replication connections (joined to pg_stat_replication if connected) with their feedback xmin
* pg_stat_activity backend_xid and backend_xmin for the backend(s) with oldest values; this may be different sets of backends
* pg_prepared_xacts entries by oldest xid

... probably sorted by xid age.
It'd be good to expose some internal state too, which would usually correspond to the oldest values found in the above, but is useful for cross-checking:

* RecentGlobalXmin and RecentGlobalDataXmin to show the xmin and catalog_xmin actually used
* procArray->replication_slot_xmin and procArray->replication_slot_catalog_xmin

I'm not sure whether WAL retention (lsn tracking) should be in the same view or a different one, but I lean toward different.

I already have another TODO kicking around for me to write a view that generates a blocking locks graph, since pg_locks is really more of a building block than a directly useful view for admins to understand the system's state. And if that's not enough I also want to write a decent bloat-checking view to include in the system views, since IMO lock-blocking, bloat, and resource retention are real monitoring pain points right now.

--
Craig Ringer
http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
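The kind of "sorted by xid age" report the view above would produce can be sketched in a few lines (illustrative Python, not a real PostgreSQL view; all names and xid values are invented for the example). The point is just the aggregation: collect every xmin holder from its source, compute its age against the current xid, and sort so the thing holding the horizon back appears first.

```python
# Invented sample data standing in for the sources listed in the email
# (slots, backends, walsender feedback, prepared xacts).
current_xid = 5000

holders = [
    ("replication slot 'standby1'", 3200),
    ("backend pid 4242 (backend_xmin)", 4800),
    ("prepared xact 'tx_payment'", 1500),
    ("walsender feedback xmin", 4100),
]

# xid age = how far behind the current xid a holder is (wraparound ignored
# here for simplicity); oldest holder first.
report = sorted(holders, key=lambda h: current_xid - h[1], reverse=True)

for name, xmin in report:
    print(f"{name}: xmin={xmin} age={current_xid - xmin}")
```

In this sample the forgotten prepared transaction tops the report, which is exactly the visibility problem the thread is about.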
Re: Error message inconsistency
On Wed, Jan 22, 2020 at 8:48 PM Tom Lane wrote:
> Alvaro Herrera writes:
> > I wonder if we shouldn't be using errtablecol() here instead of (in
> > addition to?) patching the errmsg() to include the table name.
> >
> > Discussion: If we really like having the table names in errtable(), then
> > we should have psql display it by default, and other tools will follow
> > suit; in that case we should remove the table name from error messages,
> > or at least not add it to even more messages.
> >
> > If we instead think that errtable() is there just for programmatically
> > knowing the affected table, then we should add the table name to all
> > errmsg() where relevant, as in the patch under discussion.
> >
> > What do others think?
>
> I believe that the intended use of errtable() and friends is so that
> applications don't have to parse those names out of a human-friendly
> message. We should add calls to them in cases where (a) an application
> can tell from the SQLSTATE that some particular table is involved
> and (b) it's likely that the app would wish to know which table that is.
> I don't feel a need to sprinkle every single ereport() in the backend
> with errtable(), just ones where there's a plausible use-case for the
> extra cycles that will be spent.
>
> On the other side of the coin, whether we use errtable() is not directly
> a factor in deciding what the human-friendly messages should say.
> I do find it hard to envision a case where we'd want to use errtable()
> and *not* put the table name in the error message, just because if
> applications need to know something then humans probably do too.

makes sense.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Allow to_date() and to_timestamp() to accept localized names
Arthur Zakirov writes:
> On 2020/01/23 7:11, Tom Lane wrote:
> > Closer examination shows that the "max" argument is pretty bogus as
> > well. It doesn't do anything except confuse the reader, because there
> > are no cases where the value passed is less than the maximum array entry
> > length, so it never acts to change seq_search's behavior. So we should
> > just drop that behavior from seq_search, too, and redefine "max" as
> > having no purpose except to specify how much of the string to show in
> > error messages. There's still a question of what that should be for
> > non-English cases, but at least we now have a clear idea of what we
> > need the value to do.
>
> Shouldn't we just show all remaining string instead of truncating it?

That would avoid a bunch of arbitrary decisions, for sure. Anybody have an objection?

regards, tom lane
Re: Do we need to handle orphaned prepared transactions in the server?
On Wed, 22 Jan 2020 at 23:22, Tom Lane wrote:
> Thomas Kellerer writes:
> > Tom Lane schrieb am 22.01.2020 um 16:05:
> > > Right. It's the XA transaction manager's job not to forget uncommitted
> > > transactions. Reasoning as though no TM exists is not only not very
> > > relevant, but it might lead you to put in features that actually
> > > make the TM's job harder. In particular, a timeout (or any other
> > > mechanism that leads PG to abort or commit a prepared transaction
> > > of its own accord) does that.
> >
> > That's a fair point, but the reality is that not all XA transaction managers
> > do a good job with that.
>
> If you've got a crappy XA manager, you should get a better one, not
> ask us to put in features that make PG unsafe to use with well-designed
> XA managers.

Agreed. Or use some bespoke script that does the cleanup that you think is appropriate for your particular environment and set of bugs.

--
Craig Ringer
http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
Re: Error message inconsistency
On Wed, Jan 22, 2020 at 3:23 PM MBeena Emerson wrote:
> > 14. src/backend/commands/tablecmds.c
> >
> > 5310    else
> > 5311        ereport(ERROR,
> > 5312                (errcode(ERRCODE_CHECK_VIOLATION),
> > 5313                 errmsg("partition constraint is violated by some row")));
> >
> > Added relation name for this error. This can be verified by below example:
> > Ex:
> > CREATE TABLE list_parted (a int,b char)PARTITION BY LIST (a);
> > CREATE TABLE part_1 (LIKE list_parted);
> > INSERT INTO part_1 VALUES (3, 'a');
> > ALTER TABLE list_parted ATTACH PARTITION part_1 FOR VALUES IN (2);
> >
> > Without patch:
> > ERROR: partition constraint is violated by some row
> > With patch:
> > ERROR: partition constraint "part_1" is violated by some row
>
> Here it seems as if "part_1" is the constraint name.

I agree.

> It would be better to change it to:
>
> partition constraint is violated by some row in relation "part_1" OR
> partition constraint of relation "part_1" is violated by some row

+1 for the second option suggested by Beena.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Jan 22, 2020 at 10:07 PM Alvaro Herrera wrote:
> I looked at this patchset and it seemed natural to apply 0008 next
> (adding work_mem to subscriptions).

I am not so sure whether we need this patch, as the exact scenario where it can help is not very clear to me, and nobody has explained it. I have raised this concern earlier as well [1]. The point is that 'logical_decoding_work_mem' applies to the entire ReorderBuffer on the publisher's side, so how will a parameter from a particular subscription help with that?

[1] - https://www.postgresql.org/message-id/CAA4eK1J%2B3kab6RSZrgj0YiQV1r%2BH3FWVaNjKhWvpEe5-bpZiBw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Allow to_date() and to_timestamp() to accept localized names
On 2020/01/23 7:11, Tom Lane wrote:
> Closer examination shows that the "max" argument is pretty bogus as
> well. It doesn't do anything except confuse the reader, because there
> are no cases where the value passed is less than the maximum array entry
> length, so it never acts to change seq_search's behavior. So we should
> just drop that behavior from seq_search, too, and redefine "max" as
> having no purpose except to specify how much of the string to show in
> error messages. There's still a question of what that should be for
> non-English cases, but at least we now have a clear idea of what we
> need the value to do.

Shouldn't we just show the whole remaining string instead of truncating it? For example, I get the following output:

=# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH ');
ERROR: invalid value "ЯНВА" for "MONTH"
DETAIL: The given value did not match any of the allowed values for this field.

This behavior is reproduced without the patch too (on Postgres 12). And the English case might confuse as well, I think:

=# select to_date('3 JANUZRY 1999', 'DD MONTH ');
ERROR: invalid value "JANUZRY 1" for "MONTH"
DETAIL: The given value did not match any of the allowed values for this field.

Users might wonder what "1" means in the "JANUZRY 1" string.

I attached a draft patch which shows the whole remaining string. So the query above will show the following:

=# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH ');
ERROR: invalid value "ЯНВАРЯ 1999" for "MONTH"
DETAIL: The remaining value did not match any of the allowed values for this field.

> The 0001 patch attached cleans up all the above complaints. I felt
> that given the business about scribbling on strings we shouldn't,
> it would also be wise to const-ify the string pointers if possible.
> That seems not too painful, per 0002 attached.

+1 on the patch.
--
Arthur

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index ef21b7e426..63e773c4c1 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -2539,17 +2539,11 @@ from_char_seq_search(int *dest, const char **src, const char *const *array,
 	if (len <= 0)
 	{
-		char		copy[DCH_MAX_ITEM_SIZ + 1];
-
-		/* Use multibyte-aware truncation to avoid generating a bogus string */
-		max = pg_mbcliplen(*src, strlen(*src), max);
-		strlcpy(copy, *src, max + 1);
-
 		RETURN_ERROR(ereport(ERROR,
 							 (errcode(ERRCODE_INVALID_DATETIME_FORMAT),
 							  errmsg("invalid value \"%s\" for \"%s\"",
-									 copy, node->key->name),
-							  errdetail("The given value did not match any of "
+									 *src, node->key->name),
+							  errdetail("The remaining value did not match any of "
 										"the allowed values for this field.")));
 	}
 	*src += len;
Re: Improve search for missing parent downlinks in amcheck
On Fri, Jan 17, 2020 at 1:05 AM Tomas Vondra wrote:
> On Fri, Nov 29, 2019 at 03:03:01PM +0900, Michael Paquier wrote:
> > On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote:
> > > The revised patch seems to fix all of above.
> >
> > The latest patch is failing to apply. Please provide a rebase.
>
> This still does not apply (per cputube). Can you provide a fixed patch?

Rebased patch is attached. Sorry for such a huge delay.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

amcheck-btree-improve-missing-parent-downlinks-check-4.patch
Re: Index Skip Scan
On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan wrote:
> On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee wrote:
> > As far as I know, a page split could happen at any random element in the
> > page. One of the situations we could be left with is:
> > Leaf page 1 = [2,4]
> > Leaf page 2 = [5,6,8]
> > However, our scan is still pointing to leaf page 1. For non-skip scans this
> > is not a problem, as we already read all matching elements in our local
> > buffer and we'll return those. But the skip scan currently:
> > a) checks the lo-key of the page to see if the next prefix can be found on
> > the leaf page 1
> > b) finds out that this is actually true
> > c) does a search on the page and returns value=4 (while it should have
> > returned value=6)
> >
> > Peter, is my understanding about the btree internals correct so far?
>
> This is a good summary. This is the kind of scenario I had in mind
> when I expressed a general concern about "stopping between pages".
> Processing a whole page at a time is a crucial part of how
> _bt_readpage() currently deals with concurrent page splits.

I want to be clear about what it means that the page doesn't have a "low key". Let us once again start with a very simple one leaf-page btree containing four elements:

leaf page 1 = [2,4,6,8]

-- just like in Floris' original page split scenario. Let us also say that page 1 has a left sibling page -- page 0. Page 0 happens to have a high key with the integer value 0. So you could *kind of* claim that the "low key" of page 1 is the integer value 0 (page 1 values must be > 0) -- *not* the integer value 2 (the so-called "low key" here is neither > 2, nor >= 2). More formally, an invariant exists that says that all values on page 1 must be *strictly* greater than the integer value 0.
However, this formal invariant thing is hard or impossible to rely on when we actually reach page 1 and want to know about its lower bound -- since there is no "low key" pivot tuple on page 1 (we can only speak of a "low key" as an abstract concept, or something that works transitively from the parent -- there is only a physical high key pivot tuple on page 1 itself).

Suppose further that page 0 is now empty, apart from its "all values on page are <= 0" high key (page 0 must have had a few negative integer values in its tuples at some point, but not anymore). VACUUM will delete the page, *changing the effective low key* of page 1 in the process. The lower bound from the shared parent page will move lower/left as a consequence of the deletion of page 0. nbtree page deletion makes the "keyspace move right, not left". So the "conceptual low key" of page 1 just went down from 0 to -5 (say), without there being any practical way for a skip scan reading page 1 to notice the change (the left sibling of page 0, page -1, has a high key of <= -5, say). Not only is it possible for somebody to insert the value 1 in page 1 -- now they can insert the value -3 or -4!

More concretely, the pivot tuple in the parent that originally pointed to page 0 is still there -- all that page deletion changed about this tuple is its downlink, which now points to page 1 instead of page 0. Confusingly, page deletion removes the pivot tuple of the right sibling page from the parent -- *not* the pivot tuple of the empty page that gets deleted (in this case page 0) itself.

Note: this example ignores things like negative infinity values in truncated pivot tuples, and the heap TID tiebreaker column -- in reality this would look a bit different because of those factors. See also: amcheck's bt_right_page_check_scankey() function, which has a huge comment that reasons about a race involving page deletion.
In general, page deletion is by far the biggest source of complexity when reasoning about the key space. -- Peter Geoghegan
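The page-split race in Floris' scenario can be reduced to a toy model (plain Python, not nbtree code; page layout and values follow the example quoted above). A skip scan that re-searches a remembered leaf page after a concurrent split sees only the values the split left behind, and so returns a stale answer:

```python
# Leaf page 1 before the split, as in the scenario above.
pages = {1: [2, 4, 5, 6, 8]}
scan_page = 1                          # the scan is still "pointing to" page 1

# Concurrent split: page 1 keeps [2,4], new right sibling page 2 gets [5,6,8].
pages[2] = pages[1][2:]
pages[1] = pages[1][:2]

def last_item_leq(page, bound):
    """Largest value on the page <= bound, or None if there is none."""
    candidates = [v for v in page if v <= bound]
    return max(candidates) if candidates else None

# The skip scan asks "largest value <= 7", but re-searches only the stale page.
stale = last_item_leq(pages[scan_page], 7)        # 4 -- the wrong answer
correct = last_item_leq(pages[1] + pages[2], 7)   # 6 -- what it should return
print(stale, correct)
```

This is exactly the value=4 vs. value=6 discrepancy from the quoted message; the real fix discussed in the thread is about detecting the split (or processing whole pages at a time), which the toy model deliberately omits.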
Re: A rather hackish POC for alternative implementation of WITH TIES
> "Alvaro" == Alvaro Herrera writes: Alvaro> My own inclination is that Andrew's implementation, being more Alvaro> general in nature, would be the better one to have in the Alvaro> codebase; but we don't have a complete patch yet. Can we reach Alvaro> some compromise such as if Andrew's patch is not completed then Alvaro> we push Surafel's? Mine needs some attention to where exactly in planning the necessary transformation work should be done; right now the planner part is a hack, intended to demonstrate the idea (and to let the executor changes work) rather than actually be the final version. As I mentioned before, some stuff does need to be pushed out to an InitPlan to make it work without multiple-evaluation problems. (A second opinion from another planner expert would be welcome on that part) I was largely holding off on doing further work hoping for some discussion of which way we should go. If you think my approach is worth pursuing (I haven't seriously tested the performance, but I'd expect it to be slower than Surafel's - the price you pay for flexibility) then I can look at it further, but figuring out the planner stuff will take some time. -- Andrew.
Re: error context for vacuum to include block number
On Mon, Jan 20, 2020 at 11:11:20AM -0800, Andres Freund wrote:
> > @@ -966,8 +986,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
> >  		/* Work on all the indexes, then the heap */
> > +		/* Don't use the errcontext handler outside this function */
> > +		error_context_stack = errcallback.previous;
> >  		lazy_vacuum_all_indexes(onerel, Irel, indstats,
> >  								vacrelstats, lps, nindexes);
> > +		error_context_stack = 
>
> Alternatively we could push another context for each index inside
> lazy_vacuum_all_indexes(). There's been plenty bugs in indexes
> triggering problems, so that could be worthwhile.

Is the callback for index vacuum useful without a block number?

FYI, I have another patch which would add DEBUG output before each stage, which would be just as much information, and without needing to use a callback. It's 0004 here:
https://www.postgresql.org/message-id/20200121134934.GY26045%40telsasoft.com

@@ -1752,9 +1753,12 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 {
 	int		idx;

-	for (idx = 0; idx < nindexes; idx++)
+	for (idx = 0; idx < nindexes; idx++)
 	{
+		ereport(DEBUG1, (errmsg("\"%s\": vacuuming index",
+				RelationGetRelationName(Irel[idx]))));
 		lazy_vacuum_index(Irel[idx], [idx], vacrelstats->dead_tuples,
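The error-context-stack discipline being discussed — push a callback while scanning the heap, pop it before calling into index vacuuming so an index error doesn't report a bogus heap block number — can be sketched in miniature (illustrative Python, not the C implementation; all names here are invented):

```python
# A stack of callbacks, mimicking PostgreSQL's error_context_stack.
error_context_stack = []

def report_error(msg):
    """Format an error plus whatever CONTEXT lines are currently pushed."""
    ctx = "; ".join(cb() for cb in reversed(error_context_stack))
    return f"ERROR: {msg}" + (f" CONTEXT: {ctx}" if ctx else "")

def vacuum_heap_with_indexes():
    blkno = 42
    # Push the heap-scan callback for the duration of heap scanning.
    error_context_stack.append(lambda: f"while scanning block {blkno}")
    try:
        # ... heap scanning work would happen here ...
        # Pop before calling into index vacuuming, so an error raised there
        # doesn't carry a misleading heap block number.
        cb = error_context_stack.pop()
        msg = report_error("index is corrupt")   # error inside index vacuum
        error_context_stack.append(cb)           # restore for more heap work
        return msg
    finally:
        error_context_stack.pop()

print(vacuum_heap_with_indexes())
```

The returned error has no "while scanning block" context, which is the point of the pop/restore dance in the patch hunk above.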
Re: Index Skip Scan
On Wed, Jan 22, 2020 at 1:35 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Oh, that's what you mean. Yes, I was somehow tricked by the name of this
> function and didn't notice that it checks only one boundary, so in case
> of backward scan it returns wrong result. I think in the situation
> you've described it would actually not find any item on the current page
> and restart from the root, but nevertheless we need to check for both
> keys in _bt_scankey_within_page.

I suggest reading the nbtree README file's description of backwards scans. Read the paragraph that begins with 'We support the notion of an ordered "scan" of an index...'. I also suggest that you read a bit of the stuff in the large section on page deletion. Certainly read the paragraph that begins with 'Moving left in a backward scan is complicated because...'.

It's important to grok why it's okay that we don't "couple" or "crab" buffer locks as we descend the tree with Lehman & Yao's design -- we can get away with having *no* interlock against page splits (e.g., pin, buffer lock) when we are "between" levels of the tree. This is safe, since the page that we land on must still be "substantively the same page", no matter how much time passes. That is, it must at least cover the leftmost portion of the keyspace covered by the original version of the page that we saw that we needed to descend to within the parent page. The worst that can happen is that we have to recover from a concurrent page split by moving right one or more times. (Actually, page deletion can change the contents of a page entirely, but that's not really an exception to the general rule -- page deletion is careful about recycling pages that an in-flight index scan might land on.)

Lehman & Yao don't have backwards scans (or left links, or page deletion), unlike nbtree. This is why the usual Lehman & Yao guarantees don't quite work with backward scans.
We must therefore compensate as described by the README file (basically, we check and re-check for races, possibly returning to the original page when we think that we might have overlooked something and need to make sure). It's an exception to the general rule, you could say. -- Peter Geoghegan
Re: Index-only scan for "f(x)" without "x"
Malthe writes:
> Referencing the example given in the documentation for index-only
> scans [0], we consider an index:
>
> CREATE INDEX tab_f_x ON tab (f(x));
>
> This index currently will not be used for an index-scan for the
> following query since the planner isn't smart enough to know that "x"
> is not needed:
>
> SELECT f(x) FROM tab WHERE f(x) < 1;
>
> However, any function applied to a column for an index expression is
> required to be immutable so as far as I can tell the planner doesn't
> have to be very smart to know that the index can indeed be used for an
> index-only scan (without having "x" included).

The problem is that the planner's initial analysis of the query tree concludes that the scan of "tab" has to return "x", because it looks through the tree for plain Vars, and "x" is what it's going to find. This conclusion is in fact true for any plan that doesn't involve scanning a suitable index on f(x), so it's not something we can just dispense with.

To back up from that and conclude that the indexscan doesn't really need to return "x" because every use of it is in the context "f(x)" seems like a pretty expensive proposition, especially once you start considering index expressions that are more complex than a single function call. A brute-force matching search could easily be O(N^2) or worse in the size of the query tree, which is a lot to pay when there's a pretty high chance of failure. (IOW, we have to expect that most queries on "tab" are in fact not going to be able to use that index, so we can't afford to spend a lot of planning time just because the index exists.)

> What's required in order to move forward on these capabilities?

Some non-brute-force solution to the above problem.

regards, tom lane
Index-only scan for "f(x)" without "x"
Referencing the example given in the documentation for index-only scans [0], we consider an index:

CREATE INDEX tab_f_x ON tab (f(x));

This index currently will not be used for an index scan for the following query, since the planner isn't smart enough to know that "x" is not needed:

SELECT f(x) FROM tab WHERE f(x) < 1;

However, any function applied to a column for an index expression is required to be immutable, so as far as I can tell the planner doesn't have to be very smart to know that the index can indeed be used for an index-only scan (without having "x" included).

One interesting use-case for this is to be able to create space-efficient indexes for raw log data. For example, for each type of message (which might be encoded as JSON), one could create a partial index with the relevant fields extracted and converted into native data types and use index-only scanning to query. This is not particularly attractive today because the message itself would need to be added to the index, effectively duplicating the log data.

In the same vein, being able to add this auxiliary data (which is basically immutable expressions on one or more columns) explicitly using INCLUDE would make the technique actually reliable. This is not possible right now since expressions are not supported as included columns.

What's required in order to move forward on these capabilities?

[0] https://www.postgresql.org/docs/current/indexes-index-only-scans.html
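The log-data use-case above can be modeled in a few lines (plain Python standing in for the index machinery; the JSON field name is invented for the example). The point of an index-only scan is that once f(x) is stored in the index, a query that touches only f(x) never needs to read the base rows — here, the raw log lines:

```python
import json

# Raw log data: the "x" column. Each line might be a large JSON message.
logs = ['{"type":"login","user_id":17}', '{"type":"login","user_id":3}']

# The indexed expression f(x): extract a field as a native type.
def f(line):
    return json.loads(line)["user_id"]

# The "index" stores only f(x) values -- not the raw messages themselves,
# which is what makes it space-efficient for log data.
index = sorted(f(line) for line in logs)

# SELECT f(x) FROM tab WHERE f(x) < 10  -- answered from the index alone;
# note that `logs` is never consulted below.
result = [v for v in index if v < 10]
print(result)
```

This also shows why duplicating the message in the index (to satisfy the planner's demand for "x") defeats the purpose: the whole space saving comes from the index holding only the extracted values.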
Re: Index Skip Scan
On Wed, Jan 22, 2020 at 1:09 PM Floris Van Nee wrote:
> The loose index scan shouldn't return a tuple twice. It should only be able
> to skip 'further', so that shouldn't be a problem. Out of curiosity, why
> can't index scans return the same tuple twice? Is there something in the
> executor that isn't able to handle this?

I have no reason to believe that the executor has a problem with index scans that return a tuple more than once, aside from the very obvious: in general, that will often be wrong. It might not be wrong when the scan happens to be input to a unique node anyway, or something like that.

I'm not particularly concerned about it. Just wanted to be clear on our assumptions for loose index scans -- if loose index scans were allowed to return a tuple more than once, then that would at least have to be considered in the wider context of the executor (but apparently they're not, so no need to worry about it). This may have been mentioned somewhere already; if it was, then I must have missed it.

--
Peter Geoghegan
Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?
> On 26 Nov 2019, at 06:44, Michael Paquier wrote:

Re this patch being in WoA state for some time [0]:

> The regression tests of contrib/test_decoding are still failing here:
> +ERROR: could not resolve cmin/cmax of catalog tuple

This is the main thing left with this patch, and I've been unable so far to figure it out. I have an unscientific hunch that this patch is shaking out something (previously unexercised) in the logical decoding code supporting multi-inserts in the catalog. If anyone has ideas or insights, I would love the help.

Once my plate clears up a bit I will return to this one, but feel free to mark it rwf for this cf.

cheers ./daniel

[0] postgr.es/m/20200121144937.n24oacjkegu4pnpe@development
Re: Allow to_date() and to_timestamp() to accept localized names
I wrote: > One thing I completely don't understand is why it's sufficient for > seq_search_localized to do "initcap" semantics. Shouldn't it cover > the all-upper and all-lower cases as well, as the existing seq_search > code does? (That is, why is it okay to ignore the "type" argument?) I took a closer look at this and decided you were probably doing that just because the seq_search code uses initcap-style case folding to match month and day names, relying on the assumption that the constant arrays it's passed are cased that way. But we shouldn't (and the patch doesn't) assume that the localized names we'll get from pg_locale.c are cased that way. However ... it seems to me that the way seq_search is defined is plenty bogus. In the first place, it scribbles on its source string, which is surely not okay. You can see that in error messages that are thrown on match failures: regression=# select to_date('3 JANUZRY 1999', 'DD MONTH '); ERROR: invalid value "JanuzRY 1" for "MONTH" DETAIL: The given value did not match any of the allowed values for this field. Now, maybe somebody out there thinks this is a cute way of highlighting how much of the string we examined, but it looks more like a bug from where I sit. It's mere luck that what's being clobbered is a local string copy created by do_to_timestamp(), and not the original input data passed to to_date(). In the second place, there's really no need at all for seq_search to rely on any assumptions about the case state of the array it's searching. pg_toupper() is pretty cheap already, and we can make it guaranteed cheap if we use pg_ascii_toupper() instead. So I think we ought to just remove the "type" argument altogether and have seq_search dynamically convert all the strings it examines to upper case (or lower case would work as well, at least for English). 
> On the other hand, you probably *should* be ignoring the "max" > argument, because AFAICS the values that are passed for that are > specific to the English ASCII spellings of the day and month names. > It seems very likely that they could be too small for some sets of > non-English names. Closer examination shows that the "max" argument is pretty bogus as well. It doesn't do anything except confuse the reader, because there are no cases where the value passed is less than the maximum array entry length, so it never acts to change seq_search's behavior. So we should just drop that behavior from seq_search, too, and redefine "max" as having no purpose except to specify how much of the string to show in error messages. There's still a question of what that should be for non-English cases, but at least we now have a clear idea of what we need the value to do. I also noted while I was looking at it that from_char_seq_search() would truncate at "max" bytes even when dealing with multibyte input. That has a clear risk of generating an invalidly-encoded error message. The 0001 patch attached cleans up all the above complaints. I felt that given the business about scribbling on strings we shouldn't, it would also be wise to const-ify the string pointers if possible. That seems not too painful, per 0002 attached. I'm tempted to go a bit further than 0001 does, and remove the 'max' argument from from_char_seq_search() altogether. Since we only need 'max' in error cases, which don't need to be super-quick, we could drop the requirement for the caller to specify that and instead compute it when we do need it as the largest of the constant array's string lengths. That'd carry over into the localized-names case in a reasonably straightforward way (though we might need to count characters not bytes for that to work nicely). 
Because of the risk of incorrectly-encoded error messages, I'm rather tempted to claim that these patches (or at least 0001) are a bug fix and should be back-patched. In any case I think we should apply this to HEAD and then rebase the localized-names patch over it. It makes a whole lot more sense, IMO, for the localized-names comparison logic to match what this is doing than what seq_search does today. Comments? regards, tom lane diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index ca3c48d..3ef10dc 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -317,10 +317,6 @@ static const char *const numth[] = {"st", "nd", "rd", "th", NULL}; * Flags & Options: * -- */ -#define ONE_UPPER 1 /* Name */ -#define ALL_UPPER 2 /* NAME */ -#define ALL_LOWER 3 /* name */ - #define MAX_MONTH_LEN 9 #define MAX_MON_LEN 3 #define MAX_DAY_LEN 9 @@ -1068,9 +1064,9 @@ static int from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node, bool *have_error); static int from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error); -static int seq_search(char *name, const char *const *array, int type, int
Re: Online checksums patch - once again
On Wed, Jan 22, 2020 at 3:28 PM Magnus Hagander wrote: > > I think the argument about adding catalog flags adding overhead is > > pretty much bogus. Fixed-width fields in catalogs are pretty cheap. > > If that's the general view, then yeah our "cost calculations" were > off. I guess I may have been colored by the cost of adding statistics > counters, and had that influence the thinking. Incorrect judgement on > that cost certainly contributed to the decision. back then. For either statistics or for pg_class, the amount of data that we have to manage is proportional to the number of relations (which could be big) multiplied by the data stored for each relation. But the difference is that the stats file has to be rewritten, at least on a per-database basis, very frequently, while pg_class goes through shared-buffers and so doesn't provoke the same stupid write-the-whole-darn-thing behavior. That is a pretty key difference, IMHO. Now, it would be nice to fix the stats system, but until we do, here we are. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: A rather hackish POC for alternative implementation of WITH TIES
Hello, As this is a valuable feature, it would be good to have something happen here. I wouldn't like to have pg13 ship with no implementation of WITH TIES at all. My own inclination is that Andrew's implementation, being more general in nature, would be the better one to have in the codebase; but we don't have a complete patch yet. Can we reach a compromise such as pushing Surafel's patch if Andrew's is not completed in time? Thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Fix crash in BRIN inclusion op functions, due to missing datum c
On 2020-Jan-20, Heikki Linnakangas wrote: > That case arises at least with the range_union() function, when one of > the arguments is an 'empty' range: > > CREATE TABLE brintest (n numrange); > CREATE INDEX brinidx ON brintest USING brin (n); > INSERT INTO brintest VALUES ('empty'); > INSERT INTO brintest VALUES (numrange(0, 2^1000::numeric)); > INSERT INTO brintest VALUES ('(-1, 0)'); > > SELECT brin_desummarize_range('brinidx', 0); > SELECT brin_summarize_range('brinidx', 0); I noticed that this test increases line-wise coverage of brin_inclusion.c by a few percentage points, so I added it. Again, thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Index Skip Scan
> On Wed, Jan 22, 2020 at 09:08:59PM +, Floris Van Nee wrote: > > Note in particular that index scans cannot return the same index tuple twice -- processing a page at a time ensures that that cannot happen. > > > > Can a loose index scan return the same tuple (i.e. a tuple with the same heap TID) to the executor more than once? > > > > The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a problem. Yes, it shouldn't happen.
Re: Index Skip Scan
> On Wed, Jan 22, 2020 at 05:24:43PM +, Floris Van Nee wrote: > > > > Anyone please correct me if I'm wrong, but I think one case where the > > > current patch relies on some data from the page it has locked before it > > > in checking this hi/lo key. I think it's possible for the following > > > sequence to happen. Suppose we have a very simple one leaf-page btree > > > containing four elements: leaf page 1 = [2,4,6,8] > > > We do a backwards index skip scan on this and have just returned our > > > first tuple (8). The buffer is left pinned but unlocked. Now, someone > > > else comes in and inserts a tuple (value 5) into this page, but suppose > > > the page happens to be full. So a page split occurs. As far as I know, a > > > page split could happen at any random element in the page. One of the > > > situations we could be left with is: > > > Leaf page 1 = [2,4] > > > Leaf page 2 = [5,6,8] > > > However, our scan is still pointing to leaf page 1. > > > In case if we just returned a tuple, the next action would be either > >> check the next page for another key or search down to the tree. Maybe > > But it won't look at the 'next page for another key', but rather at the 'same > page or another key', right? In the _bt_scankey_within_page shortcut we're > taking, there's no stepping to a next page involved. It just locks the page > again that it previously also locked. Yep, it would look only on the same page. Not sure what you mean by "another key"; if the current key is not found within the current page at the first stage, we restart from the root.
Then the split happens and the items get rearranged as in my > example. We're still pointing with so->currPos.buf to page=1, but the page > now contains [2,4]. The split happened to the right, so there's a page=2 with > [5,6,8], however the ongoing index scan is unaware of that. > Now _bt_skip gets called to fetch the next tuple. It starts by checking > _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the > result of which will be 'true': we're comparing the skip key to the low key > of the page. So it thinks the next key can be found on the current page. It > locks the page and does a _binsrch, finding item=4 to be returned. > > The problem here is that _bt_scankey_within_page mistakenly returns true, > thereby limiting the search to just the page that it's pointing to already. > It may be fine to just fix this function to return the proper value (I guess > it'd also need to look at the high key in this example). It could also be > fixed by not looking at the lo/hi key of the page, but to use the local tuple > buffer instead. We already did a _read_page once, so if we have any matching > tuples on that specific page, we have them locally in the buffer already. > That way we never need to lock the same page twice. Oh, that's what you mean. Yes, I was somehow tricked by the name of this function and didn't notice that it checks only one boundary, so in the case of a backward scan it returns the wrong result. I think in the situation you've described it would actually not find any item on the current page and restart from the root, but nevertheless we need to check for both keys in _bt_scankey_within_page.
RE: Index Skip Scan
> Note in particular that index scans cannot return the same index tuple twice -- processing a page at a time ensures that that cannot happen. > > Can a loose index scan return the same tuple (i.e. a tuple with the same heap TID) to the executor more than once? The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a problem. Out of curiosity, why can't index scans return the same tuple twice? Is there something in the executor that isn't able to handle this? -Floris
Re: [PATCH] Windows port, fix some resources leaks
Hi, After reviewing the patches, building everything, and running the regression checks for each patch, these are the ones that don't break anything. Not all of the leaks detected by Coverity are fixed. regards, Ranier Vilela

Attachments: auth_leak.patch, logging_leaks.patch, postmaster_leak.patch, regex_leak.patch, restricted_token_leaks.patch, win32_shmem_leak.patch
Re: Online checksums patch - once again
On Wed, Jan 22, 2020 at 12:20 PM Robert Haas wrote: > > On Wed, Jan 22, 2020 at 2:50 PM Magnus Hagander wrote: > > The reasoning that led us to *not* doing that is that it's a one-off > > operation. That along with the fact that we hope to at some point be > > able to change the default to checksums on (and it wouldn't be > > necessary for the transition on->off as that is very fast), it would > > become an increasingly rare one-off operation. And by adding these > > flags to the catalogs, everybody is paying the overhead for this > > one-off rare operation. Another option would be to add the flag on the > > pg_database level which would decrease the overhead, but our guess was > > that this would also decrease the usefulness in most cases to make it > > not worth it (most people with big databases don't have many big > > databases in the same cluster -- it's usually just one or two, so in > > the end the results would be more or less the same as we have now as > > it would have to keep re-doing the big ones) > > I understand, but the point for me is that the patch does not seem > robust as written. Nobody's going to be happy if there are reasonably > high-probability scenarios where it turns checksums part way on and > then just stops. Now, that can probably be improved to some degree > without adding catalog flags, but I bet it can be improved more and > for less effort if we do add catalog flags. Maybe being able to > survive a cluster restart without losing track of progress is not a > hard requirement for this feature, but it certainly seems nice. And I It's certainly nice, but that is of course a cost/benefit tradeoff calculation. Our thoughts on that were that the cost was higher than the benefit -- which may of course be wrong, and in that case it's better to have it changed. 
> would venture to say that continuing to run without giving up if there > happen to be no background workers available for a while IS a hard > requirement, because that can easily happen due to normal use of > parallel query. We do not normally commit features if, without any > error occurring, they might just give up part way through the > operation. That part I agree with, but I don't think that in itself requires per-relation level tracking. > I think the argument about adding catalog flags adding overhead is > pretty much bogus. Fixed-width fields in catalogs are pretty cheap. If that's the general view, then yeah our "cost calculations" were off. I guess I may have been colored by the cost of adding statistics counters, and had that influence the thinking. Incorrect judgement on that cost certainly contributed to the decision back then. But as noted, work is being done on adding it, so let's see what that ends up looking like in reality. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: Online checksums patch - once again
On Wed, Jan 22, 2020 at 2:50 PM Magnus Hagander wrote: > The reasoning that led us to *not* doing that is that it's a one-off > operation. That along with the fact that we hope to at some point be > able to change the default to checksums on (and it wouldn't be > necessary for the transition on->off as that is very fast), it would > become an increasingly rare one-off operation. And by adding these > flags to the catalogs, everybody is paying the overhead for this > one-off rare operation. Another option would be to add the flag on the > pg_database level which would decrease the overhead, but our guess was > that this would also decrease the usefulness in most cases to make it > not worth it (most people with big databases don't have many big > databases in the same cluster -- it's usually just one or two, so in > the end the results would be more or less the same as we have now as > it would have to keep re-doing the big ones) I understand, but the point for me is that the patch does not seem robust as written. Nobody's going to be happy if there are reasonably high-probability scenarios where it turns checksums part way on and then just stops. Now, that can probably be improved to some degree without adding catalog flags, but I bet it can be improved more and for less effort if we do add catalog flags. Maybe being able to survive a cluster restart without losing track of progress is not a hard requirement for this feature, but it certainly seems nice. And I would venture to say that continuing to run without giving up if there happen to be no background workers available for a while IS a hard requirement, because that can easily happen due to normal use of parallel query. We do not normally commit features if, without any error occurring, they might just give up part way through the operation. I think the argument about adding catalog flags adding overhead is pretty much bogus. Fixed-width fields in catalogs are pretty cheap. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: making the backend's json parser work in frontend code
On 2020-Jan-22, Robert Haas wrote: > On Wed, Jan 22, 2020 at 2:26 PM Alvaro Herrera > wrote: > > I'm not sure I see the point of keeping json.h split from jsonapi.h. It > > seems to me that you could move back all the contents from jsonapi.h > > into json.h, and everything would work just as well. (Evidently the > > Datum in JsonEncodeDateTime's proto is problematic ... perhaps putting > > that prototype in jsonfuncs.h would work.) > > > > I don't really object to your 0001 patch as posted, though. > > The goal is to make it possible to use the JSON parser in the > frontend, and we can't do that if the header files that would have to > be included on the frontend side rely on things that only work in the > backend. As written, the patch series leaves json.h with a dependency > on Datum, so the stuff that it leaves in jsonapi.h (which is intended > to be the header that gets moved to src/common and included by > frontend code) can't be merged with it. Right, I agree with that goal, and as I said, I don't object to your patch as posted. > Now, we could obviously rearrange that. I don't think any of the file > naming here is great. But I think we probably want, as far as > possible, for the code in FOO.c to correspond to the prototypes in > FOO.h. What I'm thinking we should work towards is: > > json.c/h - support for the 'json' data type > jsonb.c/h - support for the 'jsonb' data type > jsonfuncs.c/h - backend code that doesn't fit in either of the above > jsonapi.c/h - lexing/parsing code that can be used in either the > frontend or the backend ... it would probably require more work to make this 100% attainable, but I don't really care all that much. > I'm not wedded to that. It just looks like the most natural thing from > where we are now. Let's go with it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Amcheck: do rightlink verification with lock coupling
On Fri, Jan 17, 2020 at 5:43 PM Peter Geoghegan wrote: > I tried to come up with a specific example of how this could be > unsafe, but my explanation was all over the place (this could have had > something to do with it being Friday evening). Even still, it's up to > the patch to justify why it's safe, and that seems even more > difficult. I can't see a way around this problem, so I'm marking the patch rejected. Thanks -- Peter Geoghegan
Re: making the backend's json parser work in frontend code
On Wed, Jan 22, 2020 at 2:26 PM Alvaro Herrera wrote: > I'm not sure I see the point of keeping json.h split from jsonapi.h. It > seems to me that you could move back all the contents from jsonapi.h > into json.h, and everything would work just as well. (Evidently the > Datum in JsonEncodeDateTime's proto is problematic ... perhaps putting > that prototype in jsonfuncs.h would work.) > > I don't really object to your 0001 patch as posted, though. The goal is to make it possible to use the JSON parser in the frontend, and we can't do that if the header files that would have to be included on the frontend side rely on things that only work in the backend. As written, the patch series leaves json.h with a dependency on Datum, so the stuff that it leaves in jsonapi.h (which is intended to be the header that gets moved to src/common and included by frontend code) can't be merged with it. Now, we could obviously rearrange that. I don't think any of the file naming here is great. But I think we probably want, as far as possible, for the code in FOO.c to correspond to the prototypes in FOO.h. What I'm thinking we should work towards is: json.c/h - support for the 'json' data type jsonb.c/h - support for the 'jsonb' data type jsonfuncs.c/h - backend code that doesn't fit in either of the above jsonapi.c/h - lexing/parsing code that can be used in either the frontend or the backend I'm not wedded to that. It just looks like the most natural thing from where we are now. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Online checksums patch - once again
On Mon, Jan 20, 2020 at 12:14 PM Robert Haas wrote: > > On Sat, Jan 18, 2020 at 6:18 PM Daniel Gustafsson wrote: > > Thanks again for reviewing (and working on the infrastructure required for > > this > > patch to begin with)! Regarding persisting the progress; that would be > > a > > really neat feature but I don't have any suggestion on how to do that safely > > for real use-cases. > > Leaving to one side the question of how much work is involved, could > we do something conceptually similar to relfrozenxid/datfrozenxid, > i.e. use catalog state to keep track of which objects have been > handled and which not? > > Very rough sketch: > > * set a flag indicating that checksums must be computed for all page writes > * use barriers and other magic to make sure everyone has gotten the > memo from the previous step > * use new catalog fields pg_class.relhaschecksums and > pg_database.dathaschecksums to track whether checksums are enabled > * keep launching workers for databases where !pg_database.dathaschecksums > until none remain > * mark checksums as fully enabled > * party We did discuss this back when we started work on this (I can't remember if it was just me and Daniel and someone else or on a list -- but that's not important atm). The reasoning that led us to *not* doing that is that it's a one-off operation. That along with the fact that we hope to at some point be able to change the default to checksums on (and it wouldn't be necessary for the transition on->off as that is very fast), it would become an increasingly rare one-off operation. And by adding these flags to the catalogs, everybody is paying the overhead for this one-off rare operation. 
Another option would be to add the flag on the pg_database level which would decrease the overhead, but our guess was that this would also decrease the usefulness in most cases to make it not worth it (most people with big databases don't have many big databases in the same cluster -- it's usually just one or two, so in the end the results would be more or less the same as we have now as it would have to keep re-doing the big ones) Unless we actually want to support running systems more or less permanently with some tables with checksums and other tables without checksums. But that's going to have an effect on the validation of checksums that would generate a huge overhead (since each buffer check would have to look up the pg_class entry). FYI, Daniel is working on an update that will include this -- so we can see what the actual outcome is, in terms of complexity as well. Should hopefully be ready soon. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: making the backend's json parser work in frontend code
On 2020-Jan-22, Robert Haas wrote: > Here is a new version that is, I think, much closer to what I would > consider a final form. 0001 through 0003 are as before, and unless > somebody says pretty soon that they see a problem with those or want > more time to review them, I'm going to commit them; David Steele has > endorsed all three, and they seem like independently sensible > cleanups. I'm not sure I see the point of keeping json.h split from jsonapi.h. It seems to me that you could move back all the contents from jsonapi.h into json.h, and everything would work just as well. (Evidently the Datum in JsonEncodeDateTime's proto is problematic ... perhaps putting that prototype in jsonfuncs.h would work.) I don't really object to your 0001 patch as posted, though. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] kqueue
I wrote: > This just says it doesn't lock up, of course. I've not attempted > any performance-oriented tests. I've now done some light performance testing -- just stuff like pgbench -S -M prepared -c 20 -j 20 -T 60 bench I cannot see any improvement on either FreeBSD 12 or NetBSD 8.1, either as to net TPS or as to CPU load. If anything, the TPS rate is a bit lower with the patch, though I'm not sure that that effect is above the noise level. It's certainly possible that to see any benefit you need stress levels above what I can manage on the small box I've got these OSes on. Still, it'd be nice if a performance patch could show some improved performance, before we take any portability risks for it. regards, tom lane
Re: Index Skip Scan
On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan wrote: > This is a good summary. This is the kind of scenario I had in mind > when I expressed a general concern about "stopping between pages". > Processing a whole page at a time is a crucial part of how > _bt_readpage() currently deals with concurrent page splits. Note in particular that index scans cannot return the same index tuple twice -- processing a page at a time ensures that that cannot happen. Can a loose index scan return the same tuple (i.e. a tuple with the same heap TID) to the executor more than once? -- Peter Geoghegan
Re: Index Skip Scan
On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee wrote: > Anyone please correct me if I'm wrong, but I think one case where the current > patch relies on some data from the page it has locked before it in checking > this hi/lo key. I think it's possible for the following sequence to happen. > Suppose we have a very simple one leaf-page btree containing four elements: > leaf page 1 = [2,4,6,8] > We do a backwards index skip scan on this and have just returned our first > tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in > and inserts a tuple (value 5) into this page, but suppose the page happens to > be full. So a page split occurs. As far as I know, a page split could happen > at any random element in the page. One of the situations we could be left > with is: > Leaf page 1 = [2,4] > Leaf page 2 = [5,6,8] > However, our scan is still pointing to leaf page 1. For non-skip scans this > is not a problem, as we already read all matching elements in our local > buffer and we'll return those. But the skip scan currently: > a) checks the lo-key of the page to see if the next prefix can be found on > the leaf page 1 > b) finds out that this is actually true > c) does a search on the page and returns value=4 (while it should have > returned value=6) > > Peter, is my understanding about the btree internals correct so far? This is a good summary. This is the kind of scenario I had in mind when I expressed a general concern about "stopping between pages". Processing a whole page at a time is a crucial part of how _bt_readpage() currently deals with concurrent page splits. Holding a buffer pin on a leaf page is only effective as an interlock against VACUUM completely removing a tuple, which could matter with non-MVCC scans. > Now that I look at the patch again, I fear there currently may also be such a > dependency in the "Advance forward but read backward"-case. 
It saves the > offset number of a tuple in a variable, then does a _bt_search (releasing the > lock and pin on the page). At this point, anything can happen to the tuples > on this page - the page may be compacted by vacuum such that the offset > number you have in your variable does not match the actual offset number of > the tuple on the page anymore. Then, at the check for (nextOffset == > startOffset) later, there's a possibility the offsets are different even > though they relate to the same tuple. If skip scan is restricted to heapkeyspace indexes (i.e. those created on Postgres 12+), then it might be reasonable to save an index tuple, and relocate it within the same page using a fresh binary search that uses a scankey derived from the same index tuple -- without unsetting scantid/the heap TID scankey attribute. I suppose that you'll need to "find your place again" after releasing the buffer lock on a leaf page for a time. Also, I think that this will only be safe with MVCC scans, because otherwise the page could be concurrently deleted by VACUUM. -- Peter Geoghegan
Re: Index Skip Scan
On Tue, Jan 21, 2020 at 9:06 AM Jesper Pedersen wrote: > If you apply the attached patch on master it will fail the test suite; > did you mean something else ? Yeah, this is exactly what I had in mind for the _bt_readpage() assertion. As I said, it isn't a great sign that this kind of assertion is even necessary in index access method code (code like bufmgr.c is another matter). Usually it's just obvious that a buffer lock is held. I can't really blame this patch for that, though. You could say the same thing about the existing "buffer pin held" _bt_readpage() assertion. It's good that it verifies what is actually a fragile assumption, even though I'd prefer to not make a fragile assumption. -- Peter Geoghegan
Re: Duplicate Workers entries in some EXPLAIN plans
>> + int num_workers = planstate->worker_instrument->num_workers; >> + int n; >> + worker_strs = (StringInfo *) palloc0(num_workers * >> sizeof(StringInfo)); >> + for (n = 0; n < num_workers; n++) { >> >> I think C99 would be better here. Also no parenthesis needed. > > > Pardon my C illiteracy, but what am I doing that's not valid C99 here? My bad, I should have been more clear. I meant that it is preferable to use the C99 style, which allows declaring variables in the scope where you need them. In this case, 'n' is needed only in the for loop, so something like for (int n = 0; n < num_workers; n++) is preferable. To be clear, your code was perfectly valid. It was only the style I was referring to. >> + for (n = 0; n < w->num_workers; ++n) >> >> I think C99 would be better here. > > > And here (if it's not the same problem)? Exactly the same as above. >> int indent; /* current indentation level */ >> List *grouping_stack; /* format-specific grouping state */ >> + bool print_workers; /* whether current node has worker metadata >> */ >> >> Hmm.. commit introduced `hide_workers` in the struct. Having both names in the struct so far apart even seems a bit confusing and sloppy. Do you think it would be possible to combine or rename? > > > I noticed that. I was thinking about combining them, but > "hide_workers" seems to be about "pretend there is no worker output > even if there is" and "print_workers" is "keep track of whether or not > there is worker output to print". Maybe I'll rename to > "has_worker_output"? The rename sounds a bit better in my humble opinion. Thanks.
Re: Index Skip Scan
Hi Dmitry, > > On Wed, Jan 22, 2020 at 07:50:30AM +, Floris Van Nee wrote: > > > > Anyone please correct me if I'm wrong, but I think there is one case where the > > current patch relies on data from a page it locked earlier when > > checking this hi/lo key. I think it's possible for the following sequence > > to happen. Suppose we have a very simple one-leaf-page btree containing > > four elements: leaf page 1 = [2,4,6,8] > > We do a backwards index skip scan on this and have just returned our first > > tuple (8). The buffer is left pinned but unlocked. Now, someone else comes > > in and inserts a tuple (value 5) into this page, but suppose the page > > happens to be full. So a page split occurs. As far as I know, a page split > > could happen at any random element in the page. One of the situations we > > could be left with is: > > Leaf page 1 = [2,4] > > Leaf page 2 = [5,6,8] > > However, our scan is still pointing to leaf page 1. > If we just returned a tuple, the next action would be either to >> check the next page for another key or to search down the tree. Maybe But it won't look at the 'next page for another key', but rather at the 'same page for another key', right? In the _bt_scankey_within_page shortcut we're taking, there's no stepping to a next page involved. It just re-locks the page that it previously locked. > I'm missing something in your scenario, but the latter will land us on a > required page (we do not point to any leaf here), and before the former > there is a check for high/low key. Is there anything else missing? Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's the only page after all). We've returned item=8. Then the split happens and the items get rearranged as in my example. We're still pointing with so->currPos.buf to page=1, but the page now contains [2,4].
The split happened to the right, so there's a page=2 with [5,6,8], however the ongoing index scan is unaware of that. Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key of the page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding item=4 to be returned. The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the page that it's pointing to already. It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key in this example). It could also be fixed by not looking at the lo/hi key of the page, but to use the local tuple buffer instead. We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally in the buffer already. That way we never need to lock the same page twice.
Re: Do we need to handle orphaned prepared transactions in the server?
Bruce Momjian writes: > I think the big question is whether we want to make active prepared > transactions more visible to administrators, either during server start > or idle duration. There's already the pg_prepared_xacts view ... regards, tom lane
Re: [HACKERS] kqueue
Matteo Beccati writes: > On 22/01/2020 17:06, Tom Lane wrote: >> Matteo Beccati writes: >>> I had a NetBSD 8.0 VM lying around and I gave the patch a spin on latest >>> master. >>> With the kqueue patch, a pgbench -c basically hangs the whole postgres >>> instance. Not sure if it's a kernel issue, HyperVM issue o what, but >>> when it hangs, I can't even kill -9 the postgres processes or get the VM >>> to properly shutdown. The same doesn't happen, of course, with vanilla >>> postgres. >> I'm a bit confused about what you are testing --- the kqueue patch >> as per this thread, or that plus the WaitLatch refactorizations in >> the other thread you point to above? > my bad, I tested the v14 patch attached to the email. Thanks for clarifying. FWIW, I can't replicate the problem here using NetBSD 8.1 amd64 on bare metal. I tried various pgbench parameters up to "-c 20 -j 20" (on a 4-cores-plus-hyperthreading CPU), and it seems fine. One theory is that NetBSD fixed something since 8.0, but I trawled their 8.1 release notes [1], and the only items mentioning kqueue or kevent are for fixes in the pty and tun drivers, neither of which seem relevant. (But wait ... could your VM setup be dependent on a tunnel network interface for outside-the-VM connectivity? Still hard to see the connection though.) My guess is that what you're seeing is a VM bug. regards, tom lane [1] https://cdn.netbsd.org/pub/NetBSD/NetBSD-8.1/CHANGES-8.1
Re: Do we need to handle orphaned prepared transactions in the server?
On Wed, Jan 22, 2020 at 10:22:21AM -0500, Tom Lane wrote: > Thomas Kellerer writes: > > Tom Lane schrieb am 22.01.2020 um 16:05: > >> Right. It's the XA transaction manager's job not to forget uncommitted > >> transactions. Reasoning as though no TM exists is not only not very > >> relevant, but it might lead you to put in features that actually > >> make the TM's job harder. In particular, a timeout (or any other > >> mechanism that leads PG to abort or commit a prepared transaction > >> of its own accord) does that. > > > That's a fair point, but the reality is that not all XA transaction managers > > do a good job with that. > > If you've got a crappy XA manager, you should get a better one, not > ask us to put in features that make PG unsafe to use with well-designed > XA managers. I think the big question is whether we want to make active prepared transactions more visible to administrators, either during server start or idle duration. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Duplicate Workers entries in some EXPLAIN plans
Thanks! I'll fix the brace issues. Re: the other items: > + int num_workers = planstate->worker_instrument->num_workers; > + int n; > + worker_strs = (StringInfo *) palloc0(num_workers * > sizeof(StringInfo)); > + for (n = 0; n < num_workers; n++) { > > I think C99 would be better here. Also no parenthesis needed. Pardon my C illiteracy, but what am I doing that's not valid C99 here? > + /* Prepare worker general execution details */ > + if (es->analyze && es->verbose && planstate->worker_instrument) > + { > + WorkerInstrumentation *w = planstate->worker_instrument; > + int n; > + > + for (n = 0; n < w->num_workers; ++n) > > I think C99 would be better here. And here (if it's not the same problem)? > + ExplainCloseGroup("Workers", "Workers", false, es); > + // do we have any other cleanup to do? > > This comment does not really explain anything. Either remove it > or rephrase it. Also, use C-style comments. Good catch, thanks--I had put this in to remind myself (and reviewers) about cleanup, but I don't think there's anything else to do, so I'll just drop it. > int indent; /* current indentation level */ > List *grouping_stack; /* format-specific grouping state */ > + bool print_workers; /* whether current node has worker metadata */ > > Hmm.. commit introduced `hide_workers` in the struct. Having > both > names in the struct so far apart even seems a bit confusing and sloppy. Do you > think it would be possible to combine or rename? I noticed that. I was thinking about combining them, but "hide_workers" seems to be about "pretend there is no worker output even if there is" and "print_workers" is "keep track of whether or not there is worker output to print". Maybe I'll rename to "has_worker_output"? > +extern void ExplainOpenWorker(StringInfo worker_str, ExplainState *es); > +extern void ExplainCloseWorker(ExplainState *es); > +extern void ExplainFlushWorkers(StringInfo *worker_strs, int num_workers, > ExplainState *es); > > No need to expose those, is there?
> I feel they should be static. Good call, I'll update.
Re: [HACKERS] Restricting maximum keep segments by repslots
Hi, First, it seems you did not reply to Alvaro's concerns in your new patch set. See: https://www.postgresql.org/message-id/20190917195800.GA16694%40alvherre.pgsql On Tue, 24 Dec 2019 21:26:14 +0900 (JST) Kyotaro Horiguchi wrote: [...] > > Indeed, "losing" is a better match for this state. > > > > However, what's the point of this state from the admin point of view? In > > various situations, the admin will have no time to react immediately and fix > > whatever could help. > > > > How useful is this specific state? > > If we assume "losing" segments as "lost", a segment once "lost" can > return to "keeping" or "streaming" state. That is intuitively > impossible. On the other hand, if we assume it as "keeping", it should > not be removed by the next checkpoint, but actually it can be > removed. The state "losing" means such an unstable state, different from > both "lost" and "keeping". OK, indeed. But I'm still uncomfortable with this "unstable" state. It would be better if we could settle on a stable state: either "keeping" or "lost". > > + Availability of WAL records claimed by this > > + slot. streaming, keeping, > > > > Slots are keeping WALs, not WAL records. Shouldn't it be "Availability of > > WAL files claimed by this slot"? > > I chose "record" since a slot points to a record. I'm not sure, but I'm > fine with "file". Fixed catalogs.sgml and config.sgml that way. Thanks. > > + streaming means that the claimed records are > > + available within max_wal_size. keeping means > > > > I wonder if streaming is the appropriate name here. The WALs required might > > be available for streaming, but the slot not active, thus not "streaming". > > What about merging with the "active" field, in the same fashion as > > pg_stat_activity.state? 
We would have an enum "pg_replication_slots.state" > > with the following states: > > * inactive: non-active slot > > * active: activated, required WAL within max_wal_size > > * keeping: activated, max_wal_size is exceeded but still held by replication > > slots or wal_keep_segments. > > * lost: some WAL is definitely lost > > > > Thoughts? > > In the first place, I realized that I had missed a point about the > relationship between max_wal_size and max_slot_wal_keep_size > here. Since the v15 of this patch, GetLsnAvailability returns > "streaming" when the restart_lsn is within max_wal_size. That behavior > makes sense when max_slot_wal_keep_size > max_wal_size. However, in > the contrary case, restart_lsn could be lost even if it is > within max_wal_size. So we would see "streaming" (or "normal") even > though restart_lsn is already lost. That is broken. > > In short, the "streaming/normal" state is useless if > max_slot_wal_keep_size < max_wal_size. Good catch! > Finally I used the following wording. > > (there's no "inactive" wal_state) > > * normal: required WAL within max_wal_size when max_slot_wal_keep_size > is larger than max_wal_size. > * keeping: required segments are held by replication slots or > wal_keep_segments. > > * losing: required segments are about to be removed or may already be > removed but streaming is not dead yet. As I wrote, I'm still uncomfortable with this state. Maybe we should ask other reviewers' opinions on this. [...] > > WARNING: some replication slots have lost required WAL segments > > DETAIL: Slot slot_limit_st lost 177 segment(s) > > > > I wonder if it is useful to show these messages for slots that were > > already dead before this checkpoint? > > Makes sense. I changed KeepLogSeg so that it emits the message only on > slot_names changes. Thanks. Below is some code review. Regarding FindOldestXLogFileSegNo(): > /* > * Return the oldest WAL segment file. 
> * > * The returned value is XLogGetLastRemovedSegno() + 1 when the function > * returns a valid value. Otherwise this function scans over WAL files and > * finds the oldest segment at the first time, which could be very slow. > */ > XLogSegNo > FindOldestXLogFileSegNo(void) The comment is not clear to me. I suppose "at the first time" might better be expressed as "if none has been removed since last startup"? Moreover, what about patching XLogGetLastRemovedSegno() itself instead of adding a function? Regarding GetLsnAvailability(): > /* > * Detect availability of the record at given targetLSN. > * > * targetLSN is restart_lsn of a slot. Wrong argument name. It is called "restart_lsn" in the function declaration. > * restBytes is the pointer to uint64 variable, to store the remaining bytes > * until the slot goes into "losing" state. I'm not convinced by this argument name. What about "remainingBytes"? Note that you use remaining_bytes elsewhere in your patch. > * -1 is stored to restBytes if the value is useless. What about returning a true negative value when the slot is really lost? All in all, I feel like this function is on the fence between being generic because of its
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
I looked at this patchset and it seemed natural to apply 0008 next (adding work_mem to subscriptions). Attached is Dilip's latest version, plus my review changes. This will break the patch tester's logic; sorry about that. Which part of this change actually sets the process's logical_decoding_work_mem to the given value? I was unable to figure that out. Is it missing, or am I just stupid? Changes: * the patch adds logical_decoding_work_mem SGML, but that has already been applied (cec2edfa7859); remove dupe. * parse_subscription_options() comment says that it will raise an error if a caller does not pass the pointer for an option but the option list specifies that option. It does not really implement that behavior (an existing problem): instead, if the pointer is not passed, the option is ignored. Moreover, this new patch continued to fail to handle things as the comment says. I decided to implement the documented behavior instead; it's now inconsistent with how the other options are implemented. I think we should fix the other options to behave as the comment says, because it's a more convenient API; if we instead opted to update the code comment to match the code, each caller would have to be checked to verify that the correct options are passed, which is pointless and error prone. * the parse_subscription_options API is a mess. I reordered the arguments a little bit; I also changed the argument layout in callers so that each caller is grouped more sensibly. Also added comments to simplify reading the argument lists. I think this could be fixed by using an ad-hoc struct to pass in and out. Didn't get around to doing that; it seems an unrelated potential improvement. * trying to do our own range checking in pgoutput and subscriptioncmds.c seems pointless and likely to get out of sync with guc.c. It is simpler to call set_config_option() to verify that the argument is in range. 
(Note a further problem in the patch series: the range check in subscriptioncmds.c is only added in patch 0009). * parsing integers using scanint8() seemed weird (error messages there do not correspond to what we want). After a couple of false starts, I decided to rely on guc.c's set_config_option() followed by parse_int(). That also has the benefit that you can give it units. * psql \dRs+ should display the work_mem; patch failed to do that. Added. Unit display is done by pg_size_pretty(), which might be different from what guc.c does, but I think it works OK. It's the first place where we use pg_size_pretty to show a memory limit, however. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From a31b4ebd90dd7a4c94a35f2b3452258078c30e37 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 22 Jan 2020 12:44:13 -0300 Subject: [PATCH 1/2] Dilip's original --- doc/src/sgml/config.sgml | 21 + doc/src/sgml/ref/create_subscription.sgml | 12 + src/backend/catalog/pg_subscription.c | 1 + src/backend/commands/subscriptioncmds.c | 44 --- .../libpqwalreceiver/libpqwalreceiver.c | 3 ++ src/backend/replication/logical/worker.c | 1 + src/backend/replication/pgoutput/pgoutput.c | 30 - src/include/catalog/pg_subscription.h | 3 ++ src/include/replication/walreceiver.h | 1 + 9 files changed, 108 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 3ccacd528b..163cc77d1d 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1751,6 +1751,27 @@ include_dir 'conf.d' + + logical_decoding_work_mem (integer) + + logical_decoding_work_mem configuration parameter + + + + +Specifies the maximum amount of memory to be used by logical decoding, +before some of the decoded changes are either written to local disk. +This limits the amount of memory used by logical streaming replication +connections. It defaults to 64 megabytes (64MB). 
+Since each replication connection only uses a single buffer of this size, +and an installation normally doesn't have many such connections +concurrently (as limited by max_wal_senders), it's +safe to set this value significantly higher than work_mem, +reducing the amount of decoded changes written to disk. + + + + max_stack_depth (integer) diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml index 1a90c244fb..91790b0c95 100644 --- a/doc/src/sgml/ref/create_subscription.sgml +++ b/doc/src/sgml/ref/create_subscription.sgml @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION subscription_name + + +work_mem (integer) + + + Limits the
Re: [HACKERS] kqueue
On 22/01/2020 17:06, Tom Lane wrote: > Matteo Beccati writes: >> On 21/01/2020 02:06, Thomas Munro wrote: >>> [1] >>> https://www.postgresql.org/message-id/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com > >> I had a NetBSD 8.0 VM lying around and I gave the patch a spin on latest >> master. >> With the kqueue patch, a pgbench -c basically hangs the whole postgres >> instance. Not sure if it's a kernel issue, HyperVM issue o what, but >> when it hangs, I can't even kill -9 the postgres processes or get the VM >> to properly shutdown. The same doesn't happen, of course, with vanilla >> postgres. > > I'm a bit confused about what you are testing --- the kqueue patch > as per this thread, or that plus the WaitLatch refactorizations in > the other thread you point to above? my bad, I tested the v14 patch attached to the email. The quoted url was just above the patch name in the email client and somehow my brain thought I was quoting the v14 patch name. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/
Re: [HACKERS] kqueue
Matteo Beccati writes: > On 21/01/2020 02:06, Thomas Munro wrote: >> [1] >> https://www.postgresql.org/message-id/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com > I had a NetBSD 8.0 VM lying around and I gave the patch a spin on latest > master. > With the kqueue patch, a pgbench -c basically hangs the whole postgres > instance. Not sure if it's a kernel issue, HyperVM issue o what, but > when it hangs, I can't even kill -9 the postgres processes or get the VM > to properly shutdown. The same doesn't happen, of course, with vanilla > postgres. I'm a bit confused about what you are testing --- the kqueue patch as per this thread, or that plus the WaitLatch refactorizations in the other thread you point to above? I've gotten through check-world successfully with the v14 kqueue patch atop yesterday's HEAD on: * macOS Catalina 10.15.2 (current release) * FreeBSD/amd64 12.0-RELEASE-p12 * NetBSD/amd64 8.1 * NetBSD/arm 8.99.41 * OpenBSD/amd64 6.5 (These OSes are all on bare metal, no VMs involved) This just says it doesn't lock up, of course. I've not attempted any performance-oriented tests. regards, tom lane
Re: Index Skip Scan
> On Wed, Jan 22, 2020 at 07:50:30AM +, Floris Van Nee wrote: > > Anyone please correct me if I'm wrong, but I think there is one case where the current > patch relies on data from a page it locked earlier when checking > this hi/lo key. I think it's possible for the following sequence to happen. > Suppose we have a very simple one-leaf-page btree containing four elements: > leaf page 1 = [2,4,6,8] > We do a backwards index skip scan on this and have just returned our first > tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in > and inserts a tuple (value 5) into this page, but suppose the page happens to > be full. So a page split occurs. As far as I know, a page split could happen > at any random element in the page. One of the situations we could be left > with is: > Leaf page 1 = [2,4] > Leaf page 2 = [5,6,8] > However, our scan is still pointing to leaf page 1. If we just returned a tuple, the next action would be either to check the next page for another key or to search down the tree. Maybe I'm missing something in your scenario, but the latter will land us on a required page (we do not point to any leaf here), and before the former there is a check for high/low key. Is there anything else missing? > Now that I look at the patch again, I fear there currently may also be such a > dependency in the "Advance forward but read backward"-case. It saves the > offset number of a tuple in a variable, then does a _bt_search (releasing the > lock and pin on the page). At this point, anything can happen to the tuples > on this page - the page may be compacted by vacuum such that the offset > number you have in your variable does not match the actual offset number of > the tuple on the page anymore. Then, at the check for (nextOffset == > startOffset) later, there's a possibility the offsets are different even > though they relate to the same tuple. Interesting point. 
The original idea here was to check that we haven't returned to the same position after jumping, so maybe instead of comparing offsets we can compare the tuple we found.
Re: pgsql: walreceiver uses a temporary replication slot by default
Hello > In short, the following things: > - wal_receiver_create_temp_slot should be made PGC_POSTMASTER, > similarly to primary_slot_name and primary_conninfo. > - WalReceiverMain() should not load the parameter from the GUC context > by itself. > - RequestXLogStreaming(), called by the startup process, should be in > charge of defining if a temp slot should be used or not. I would like to cross-post here a patch with such changes that I posted in "allow online change primary_conninfo" thread. This thread is more appropriate for discussion about wal_receiver_create_temp_slot. PS: I posted this patch in both threads mostly to make cfbot happy. regards, Sergeidiff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index e07dc01e80..14992a08d7 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4137,9 +4137,7 @@ ANY num_sync ( ). The default is on. The only reason to turn this off would be if the remote instance is currently out of available replication slots. This -parameter can only be set in the postgresql.conf -file or on the server command line. Changes only take effect when the -WAL receiver process starts a new connection. +parameter can only be set at server start. diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 7f4f784c0e..55e9294ae3 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -283,6 +283,7 @@ bool StandbyModeRequested = false; char *PrimaryConnInfo = NULL; char *PrimarySlotName = NULL; char *PromoteTriggerFile = NULL; +bool wal_receiver_create_temp_slot = true; /* are we currently in standby mode? 
*/ bool StandbyMode = false; diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index a5e85d32f3..264b544194 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -73,7 +73,6 @@ /* GUC variables */ -bool wal_receiver_create_temp_slot; int wal_receiver_status_interval; int wal_receiver_timeout; bool hot_standby_feedback; @@ -349,42 +348,23 @@ WalReceiverMain(void) WalRcvFetchTimeLineHistoryFiles(startpointTLI, primaryTLI); /* - * Create temporary replication slot if no slot name is configured or - * the slot from the previous run was temporary, unless - * wal_receiver_create_temp_slot is disabled. We also need to handle - * the case where the previous run used a temporary slot but - * wal_receiver_create_temp_slot was changed in the meantime. In that - * case, we delete the old slot name in shared memory. (This would + * Create temporary replication slot if requested. In that + * case, we update slot name in shared memory. (This would * all be a bit easier if we just didn't copy the slot name into * shared memory, since we won't need it again later, but then we * can't see the slot name in the stats views.) 
*/ - if (slotname[0] == '\0' || is_temp_slot) + if (is_temp_slot) { - bool changed = false; + snprintf(slotname, sizeof(slotname), + "pg_walreceiver_%lld", + (long long int) walrcv_get_backend_pid(wrconn)); - if (wal_receiver_create_temp_slot) - { -snprintf(slotname, sizeof(slotname), - "pg_walreceiver_%lld", - (long long int) walrcv_get_backend_pid(wrconn)); - -walrcv_create_slot(wrconn, slotname, true, 0, NULL); -changed = true; - } - else if (slotname[0] != '\0') - { -slotname[0] = '\0'; -changed = true; - } + walrcv_create_slot(wrconn, slotname, true, 0, NULL); - if (changed) - { -SpinLockAcquire(>mutex); -strlcpy(walrcv->slotname, slotname, NAMEDATALEN); -walrcv->is_temp_slot = wal_receiver_create_temp_slot; -SpinLockRelease(>mutex); - } + SpinLockAcquire(>mutex); + strlcpy(walrcv->slotname, slotname, NAMEDATALEN); + SpinLockRelease(>mutex); } /* diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c index 89c903e45a..a36bc83d2c 100644 --- a/src/backend/replication/walreceiverfuncs.c +++ b/src/backend/replication/walreceiverfuncs.c @@ -248,10 +248,16 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo, else walrcv->conninfo[0] = '\0'; - if (slotname != NULL) + if (slotname != NULL && slotname[0] != '\0') + { strlcpy((char *) walrcv->slotname, slotname, NAMEDATALEN); + walrcv->is_temp_slot = false; + } else + { walrcv->slotname[0] = '\0'; + walrcv->is_temp_slot = wal_receiver_create_temp_slot; + } if (walrcv->walRcvState == WALRCV_STOPPED) { diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 9f179a9129..c4b36eb2ad 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -1994,7 +1994,7 @@ static struct config_bool ConfigureNamesBool[] = }, { - {"wal_receiver_create_temp_slot", PGC_SIGHUP,
Re: allow online change primary_conninfo
Hello > Yeah, you are right. I was not paying much attention but something > does not stick here. My understanding is that we should have the WAL > receiver receive the value it needs to use from the startup process > (aka via RequestXLogStreaming from xlog.c), and that we ought to make > this new parameter PGC_POSTMASTER instead of PGC_SIGHUP. HEAD is > inconsistent here. Thank you! I attached two patches: - first changes wal_receiver_create_temp_slot to PGC_POSTMASTER and moved the logic to RequestXLogStreaming - second is based on last published v6 version of main patch. It changes wal_receiver_create_temp_slot back to PGC_SIGHUP along with primary_conninfo and primary_slot_name and will restart walreceiver if need. Regarding the main patch: we have several possibilities for moving RequestXLogStreaming call. We need to choose one. And, of course, changes in the text... regards, Sergeidiff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 14992a08d7..19f36a3318 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4000,9 +4000,15 @@ ANY num_sync ( @@ -4017,9 +4023,13 @@ ANY num_sync ( ). - This parameter can only be set at server start. + This parameter can only be set in the postgresql.conf + file or on the server command line. This setting has no effect if primary_conninfo is not - set. + set or the server is not in standby mode. + + + The WAL receiver is restarted after an update of primary_slot_name. @@ -4137,7 +4147,11 @@ ANY num_sync ( ). The default is on. The only reason to turn this off would be if the remote instance is currently out of available replication slots. This -parameter can only be set at server start. +parameter can only be set in the postgresql.conf +file or on the server command line. + + +The WAL receiver is restarted after an update of wal_receiver_create_temp_slot. 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 55e9294ae3..2861b9f22a 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -803,6 +803,12 @@ static XLogSource readSource = 0; /* XLOG_FROM_* code */ static XLogSource currentSource = 0; /* XLOG_FROM_* code */ static bool lastSourceFailed = false; +/* + * Need for restart running WalReceiver due the configuration change. + * Suitable only for XLOG_FROM_STREAM source + */ +static bool pendingWalRcvRestart = false; + typedef struct XLogPageReadPrivate { int emode; @@ -11818,6 +11824,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, for (;;) { int oldSource = currentSource; + bool startWalReceiver = false; /* * First check if we failed to read from the current source, and @@ -11851,54 +11858,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, if (!StandbyMode) return false; - /* - * If primary_conninfo is set, launch walreceiver to try - * to stream the missing WAL. - * - * If fetching_ckpt is true, RecPtr points to the initial - * checkpoint location. In that case, we use RedoStartLSN - * as the streaming start position instead of RecPtr, so - * that when we later jump backwards to start redo at - * RedoStartLSN, we will have the logs streamed already. - */ - if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0) - { - XLogRecPtr ptr; - TimeLineID tli; - - if (fetching_ckpt) - { - ptr = RedoStartLSN; - tli = ControlFile->checkPointCopy.ThisTimeLineID; - } - else - { - ptr = RecPtr; - - /* - * Use the record begin position to determine the - * TLI, rather than the position we're reading. 
- */ - tli = tliOfPointInHistory(tliRecPtr, expectedTLEs); - - if (curFileTLI > 0 && tli < curFileTLI) -elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u", - (uint32) (tliRecPtr >> 32), - (uint32) tliRecPtr, - tli, curFileTLI); - } - curFileTLI = tli; - RequestXLogStreaming(tli, ptr, PrimaryConnInfo, - PrimarySlotName); - receivedUpto = 0; - } - /* * Move to XLOG_FROM_STREAM state in either case. We'll * get immediate failure if we didn't launch walreceiver, * and move on to the next state. */ currentSource = XLOG_FROM_STREAM; + startWalReceiver = true; break; case XLOG_FROM_STREAM: @@ -12044,7 +12010,69 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, Assert(StandbyMode); /* - * Check if WAL receiver is still active. + * shutdown WAL receiver if restart is requested. + */ + if
Re: Do we need to handle orphaned prepared transactions in the server?
Thomas Kellerer writes: > Tom Lane schrieb am 22.01.2020 um 16:05: >> Right. It's the XA transaction manager's job not to forget uncommitted >> transactions. Reasoning as though no TM exists is not only not very >> relevant, but it might lead you to put in features that actually >> make the TM's job harder. In particular, a timeout (or any other >> mechanism that leads PG to abort or commit a prepared transaction >> of its own accord) does that. > That's a fair point, but the reality is that not all XA transaction managers > do a good job with that. If you've got a crappy XA manager, you should get a better one, not ask us to put in features that make PG unsafe to use with well-designed XA managers. regards, tom lane
Re: Error message inconsistency
Alvaro Herrera writes:
> I wonder if we shouldn't be using errtablecol() here instead of (in
> addition to?) patching the errmsg() to include the table name.

> Discussion: If we really like having the table names in errtable(), then
> we should have psql display it by default, and other tools will follow
> suit; in that case we should remove the table name from error messages,
> or at least not add it to even more messages.

> If we instead think that errtable() is there just for programmatically
> knowing the affected table, then we should add the table name to all
> errmsg() where relevant, as in the patch under discussion.

> What do others think?

I believe that the intended use of errtable() and friends is so that
applications don't have to parse those names out of a human-friendly
message.  We should add calls to them in cases where (a) an application
can tell from the SQLSTATE that some particular table is involved and
(b) it's likely that the app would wish to know which table that is.
I don't feel a need to sprinkle every single ereport() in the backend
with errtable(), just ones where there's a plausible use-case for the
extra cycles that will be spent.

On the other side of the coin, whether we use errtable() is not directly
a factor in deciding what the human-friendly messages should say.  I do
find it hard to envision a case where we'd want to use errtable() and
*not* put the table name in the error message, just because if
applications need to know something then humans probably do too.  But
saying that we can make the messages omit info because it's available
from these program-friendly fields seems 100% wrong to me, even if one
turns a blind eye to the fact that existing client code likely won't
show those fields to users.

			regards, tom lane
Re: Do we need to handle orphaned prepared transactions in the server?
Tom Lane schrieb am 22.01.2020 um 16:05:
> Craig Ringer writes:
>> So I don't really see the point of doing anything with 2PC xacts
>> within Pg proper. It's the job of the app that prepares the 2PC xacts,
>> and if that app is unable to resolve them for some reason there's no
>> generally-correct action to take without administrator action.
>
> Right.  It's the XA transaction manager's job not to forget uncommitted
> transactions.  Reasoning as though no TM exists is not only not very
> relevant, but it might lead you to put in features that actually
> make the TM's job harder.  In particular, a timeout (or any other
> mechanism that leads PG to abort or commit a prepared transaction
> of its own accord) does that.
>
> Or another way to put it: the fundamental premise of a prepared
> transaction is that it will be possible to commit it on-demand with
> extremely low chance of failure.  Designing in a reason why we'd
> fail to be able to do that would be an anti-feature.

That's a fair point, but the reality is that not all XA transaction
managers do a good job with that.  Having something on the database side
that can handle such exceptional cases would be very welcome.

(In Oracle you sometimes can't even run DML on tables where you have
orphaned XA transactions - which is extremely annoying, because by
default only the DBA can clean that up.)

Thomas
Re: Do we need to handle orphaned prepared transactions in the server?
Craig Ringer writes:
> So I don't really see the point of doing anything with 2PC xacts
> within Pg proper. It's the job of the app that prepares the 2PC xacts,
> and if that app is unable to resolve them for some reason there's no
> generally-correct action to take without administrator action.

Right.  It's the XA transaction manager's job not to forget uncommitted
transactions.  Reasoning as though no TM exists is not only not very
relevant, but it might lead you to put in features that actually
make the TM's job harder.  In particular, a timeout (or any other
mechanism that leads PG to abort or commit a prepared transaction
of its own accord) does that.

Or another way to put it: the fundamental premise of a prepared
transaction is that it will be possible to commit it on-demand with
extremely low chance of failure.  Designing in a reason why we'd
fail to be able to do that would be an anti-feature.

			regards, tom lane
Re: We're getting close to the end of 2020-01 CF
On Wed, Jan 22, 2020 at 02:09:39PM +0900, Michael Paquier wrote:
> On Tue, Jan 21, 2020 at 05:20:17PM +0100, Tomas Vondra wrote:
>> Yeah, you're right returning them with feedback seems more appropriate,
>> given the long inactivity. Plus, the CF app apparently does not allow
>> moving WoA patches to the next CF anyway.
>
> FWIW, I tend to take a base of two weeks as a sensible period of time
> as that's half the CF period when I do the classification job.

Yeah. I've only nagged about patches that have been set to WoA before
the CF began, so far.

>> Those are the patches that have been set as WoA before this CF, and
>> have not been updated since. It's quite possible the state is stale
>> for some of those patches, although I've tried to check if there were
>> any messages on the list.
>
> You need to be careful about bug fixes, as these are things that we
> don't want to lose track of.  Another thing that I noticed in the past
> is that some patches are registered as bug fixes, but they actually
> implement a new feature.  So there can be tricky cases.

Makes sense.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: We're getting close to the end of 2020-01 CF
On 2020-Jan-22, Michael Paquier wrote:

> On Tue, Jan 21, 2020 at 05:20:17PM +0100, Tomas Vondra wrote:
>> Those are the patches that have been set as WoA before this CF, and
>> have not been updated since. It's quite possible the state is stale
>> for some of those patches, although I've tried to check if there were
>> any messages on the list.
>
> You need to be careful about bug fixes, as these are things that we
> don't want to lose track of.

Oh yeah, I forgot to mention the exception for bug fixes.  I agree these
should almost never be RwF (or in any way closed other than Committed,
really, except in, err, exceptional cases).

> Another thing that I noticed in the past is that some patches are
> registered as bug fixes, but they actually implement a new feature.
> So there can be tricky cases.

Oh yeah, I think the CFM should exercise judgement and reclassify.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Error message inconsistency
I wonder if we shouldn't be using errtablecol() here instead of (in
addition to?) patching the errmsg() to include the table name.

Discussion: If we really like having the table names in errtable(), then
we should have psql display it by default, and other tools will follow
suit; in that case we should remove the table name from error messages,
or at least not add it to even more messages.

If we instead think that errtable() is there just for programmatically
knowing the affected table, then we should add the table name to all
errmsg() where relevant, as in the patch under discussion.

What do others think?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Duplicate Workers entries in some EXPLAIN plans
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

> TEXT format was tricky due to its inconsistencies, but I think I have
> something working reasonably well. I added a simple test for TEXT
> format output as well, using a similar approach as the JSON format

Great!

> test, and liberally regexp_replacing away any volatile output. I
> suppose in theory we could do this for YAML, too, but I think it's
> gross enough not to be worth it, especially given the high similarity
> of all the structured outputs.

Agreed, what is in the patch suffices.

Overall great work, a couple of minor nitpicks if you allow me.

+	/* Prepare per-worker output */
+	if (es->analyze && planstate->worker_instrument) {

Style: parenthesis on its own line.

+		int			num_workers = planstate->worker_instrument->num_workers;
+		int			n;
+		worker_strs = (StringInfo *) palloc0(num_workers * sizeof(StringInfo));
+		for (n = 0; n < num_workers; n++) {

I think C99 would be better here. Also no parenthesis needed.

+			worker_strs[n] = makeStringInfo();
+		}
+	}

@@ -1357,6 +1369,58 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
 	}

+	/* Prepare worker general execution details */
+	if (es->analyze && es->verbose && planstate->worker_instrument)
+	{
+		WorkerInstrumentation *w = planstate->worker_instrument;
+		int			n;
+
+		for (n = 0; n < w->num_workers; ++n)

I think C99 would be better here.

+		{
+			Instrumentation *instrument = &w->instrument[n];
+			double		nloops = instrument->nloops;

-	appendStringInfoSpaces(es->str, es->indent * 2);
-	if (n > 0 || !es->hide_workers)
-		appendStringInfo(es->str, "Worker %d: ", n);
+	if (indent)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+	}

Style: No parenthesis needed

-	if (opened_group)
-		ExplainCloseGroup("Workers", "Workers", false, es);
+	/* Show worker detail */
+	if (planstate->worker_instrument) {
+		ExplainFlushWorkers(worker_strs, planstate->worker_instrument->num_workers, es);
+	}

Style: No parenthesis needed

+	 * just indent once, to add worker info on the next worker line.
+	 */
+	if (es->str == es->root_str)
+	{
+		es->indent += es->format == EXPLAIN_FORMAT_TEXT ? 1 : 2;
+	}
+

Style: No parenthesis needed

+	ExplainCloseGroup("Workers", "Workers", false, es);
+	// do we have any other cleanup to do?

This comment does not really explain anything. Either remove or rephrase.
Also, C-style comments, please.

+	es->print_workers = false;
+}

	int			indent;			/* current indentation level */
	List	   *grouping_stack; /* format-specific grouping state */
+	bool		print_workers;	/* whether current node has worker metadata */

Hmm.. commit introduced `hide_workers` in the struct. Having both names
in the struct, so far apart, seems a bit confusing and sloppy. Do you
think it would be possible to combine or rename them?

+extern void ExplainOpenWorker(StringInfo worker_str, ExplainState *es);
+extern void ExplainCloseWorker(ExplainState *es);
+extern void ExplainFlushWorkers(StringInfo *worker_strs, int num_workers, ExplainState *es);

No need to expose those, is there? I feel these should be static.

Awaiting an answer to, or resolution of, these comments to change the
status.

//Georgios
Re: Do we need to handle orphaned prepared transactions in the server?
On Wed, 22 Jan 2020 at 16:45, Ants Aasma wrote:
> The intended use case of two phase transactions is ensuring atomic
> durability of transactions across multiple database systems.

Exactly. I was trying to find a good way to say this.

It doesn't make much sense to embed a 2PC resolver in Pg unless it's an
XA coordinator or similar. And generally it doesn't make sense for the
distributed transaction coordinator to reside alongside one of the
datasources being managed anyway, especially where failover and HA are
in the picture.

I *can* see it being useful, albeit rather heavyweight, to implement an
XA coordinator on top of PostgreSQL. Mostly for HA and replication
reasons. But generally you'd use postgres instances for the HA
coordinator and the DB(s) in which 2PC txns are being managed. While you
could run them in the same instance it'd probably mostly be for
toy-scale PoC/demo/testing use.

So I don't really see the point of doing anything with 2PC xacts within
Pg proper. It's the job of the app that prepares the 2PC xacts, and if
that app is unable to resolve them for some reason there's no
generally-correct action to take without administrator action.
Re: Error message inconsistency
Hi Mahendra,

Thanks for working on this.

On Wed, 22 Jan 2020 at 13:26, Mahendra Singh Thalor wrote:
>
> On Tue, 21 Jan 2020 at 10:51, Amit Kapila wrote:
>>
>> On Mon, Jan 6, 2020 at 6:31 PM Mahendra Singh Thalor wrote:
>>>
>>> On Sat, 6 Jul 2019 at 09:53, Amit Kapila wrote:
>>>>
>>>> On Mon, Jul 1, 2019 at 10:05 PM Alvaro Herrera wrote:
>>>>>
>>>>> Do we have an actual patch here?
>>>>
>>>> We have a patch, but it needs some more work like finding similar
>>>> places and change all of them at the same time and then change the
>>>> tests to adapt the same.
>>>
>>> Hi all,
>>> Based on above discussion, I tried to find out all the places where we
>>> need to change error for "not null constraint". As Amit Kapila pointed
>>> out 1 place, I changed the error and adding modified patch.
>>
>> It seems you have not used the two error codes
>> (ERRCODE_NOT_NULL_VIOLATION and ERRCODE_CHECK_VIOLATION) pointed by me
>> above.
>
> Thanks Amit and Beena for reviewing the patch.
>
> Yes, you are correct. I searched using error messages, not error codes.
> That was my mistake. Now, I grepped using the above error codes and found
> that they are used in 19 places. Below are the code parts for those 19
> places.
>
> 1. src/backend/utils/adt/domains.c
>
> 146         if (isnull)
> 147             ereport(ERROR,
> 148                     (errcode(ERRCODE_NOT_NULL_VIOLATION),
> 149                      errmsg("domain %s does not allow null values",
> 150                             format_type_be(my_extra->domain_type)),
> 151                      errdatatype(my_extra->domain_type)));
> 152         break;
>
> I think the above error is for a domain, so there is no need to add
> anything to the error message.
> -
> 2. src/backend/utils/adt/domains.c
>
> 181         if (!ExecCheck(con->check_exprstate, econtext))
> 182             ereport(ERROR,
> 183                     (errcode(ERRCODE_CHECK_VIOLATION),
> 184                      errmsg("value for domain %s violates check constraint \"%s\"",
> 185                             format_type_be(my_extra->domain_type),
> 186                             con->name),
> 187                      errdomainconstraint(my_extra->domain_type,
> 188                                          con->name)));
>
> I think the above error is for a domain, so there is no need to add
> anything to the error message.
> -
> 3. src/backend/partitioning/partbounds.c
>
> 1330        if (part_rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
> 1331            ereport(WARNING,
> 1332                    (errcode(ERRCODE_CHECK_VIOLATION),
> 1333                     errmsg("skipped scanning foreign table \"%s\" which is a partition of default partition \"%s\"",
> 1334                            RelationGetRelationName(part_rel),
> 1335                            RelationGetRelationName(default_rel))));
>
> Relation name is already appended in the error message.
> -
> 4. src/backend/partitioning/partbounds.c
>
> 1363        if (!ExecCheck(partqualstate, econtext))
> 1364            ereport(ERROR,
> 1365                    (errcode(ERRCODE_CHECK_VIOLATION),
> 1366                     errmsg("updated partition constraint for default partition \"%s\" would be violated by some row",
> 1367                            RelationGetRelationName(default_rel))));
>
> Relation name is already appended in the error message.
> -
> 5. src/backend/executor/execPartition.c
>
> 342         ereport(ERROR,
> 343                 (errcode(ERRCODE_CHECK_VIOLATION),
> 344                  errmsg("no partition of relation \"%s\" found for row",
> 345                         RelationGetRelationName(rel)),
> 346                  val_desc ?
> 347                  errdetail("Partition key of the failing row contains %s.",
> 348                            val_desc) : 0));
>
> Relation name is already appended in the error message.
> -
> 6. src/backend/executor/execMain.c
>
> 1877        ereport(ERROR,
> 1878                (errcode(ERRCODE_CHECK_VIOLATION),
> 1879                 errmsg("new row for relation \"%s\" violates partition constraint",
> 1880                        RelationGetRelationName(resultRelInfo->ri_RelationDesc)),
>
Re: Do we need to handle orphaned prepared transactions in the server?
On Wed, 22 Jan 2020 at 09:02, Hamid Akhtar wrote:
>
> At this stage, I'm not sure of the scale of changes this will require,
> however, I wanted to get an understanding and consensus on whether (a)
> this is something we should work on, and (b) whether an approach to
> implementing a timeout makes sense.
>
> Please feel free to share your thoughts here.

The intended use case of two phase transactions is ensuring atomic
durability of transactions across multiple database systems. This
necessarily means that there needs to be a failure-tolerant agent that
ensures there is consensus about the status of the transaction and then
executes that consensus across all systems. In other words, there needs
to be a transaction manager for prepared transactions to actually
fulfil their purpose.

Therefore I think that unilaterally timing out prepared transactions is
just shifting the consequences of a broken client from availability to
durability. But if durability was never a concern, why is the client
even using prepared transactions? Citing the documentation:

> PREPARE TRANSACTION is not intended for use in applications or
> interactive sessions. Its purpose is to allow an external transaction
> manager to perform atomic global transactions across multiple databases
> or other transactional resources. Unless you're writing a transaction
> manager, you probably shouldn't be using PREPARE TRANSACTION.

Regards,
Ants Aasma
Re: error context for vacuum to include block number
On Wed, 22 Jan 2020 at 10:17, Justin Pryzby wrote:
>
> On Tue, Jan 21, 2020 at 05:54:59PM -0300, Alvaro Herrera wrote:
>>> On Tue, Jan 21, 2020 at 03:11:35PM +0900, Masahiko Sawada wrote:
>>>> Some of them conflict with the current HEAD (62c9b52231). Please
>>>> rebase them.
>>>
>>> Sorry, it's due to another vacuum patch in this branch.
>>>
>>> New patches won't conflict, except for the 0005, so I renamed it for
>>> cfbot.  If it's deemed to be useful, I'll make a separate branch for
>>> the others.
>>
>> I think you have to have some other patches applied before these,
>> because in the context lines for some of the hunks there are
>> double-slash comments.
>
> And I knew that, so (re)tested that the extracted patches would apply,
> but it looks like maybe the patch checker is less smart (or more strict)
> than git, so it didn't work after all.

Thank you for updating the patches! I'm not sure it's worth having the
patches separately, but I could apply all patches cleanly. Here are my
comments on the code with all patches applied:

1.
+   /* Used by the error callback */
+   char       *relname;
+   char       *relnamespace;
+   BlockNumber blkno;
+   int         stage;  /* 0: scan heap; 1: vacuum heap; 2: vacuum index */
+} LVRelStats;

* The comment should be updated, as we use both relname and relnamespace
  for ereporting.
* We can leave blkno and stage, which are used only for error context
  reporting, where they are, but put both relname and relnamespace at
  the top of LVRelStats.
* The 'stage' is missing support for index cleanup.
* Maybe we need a comment for 'blkno'.

2.
@@ -748,8 +742,31 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
    vacrelstats->scanned_pages = 0;
    vacrelstats->tupcount_pages = 0;
    vacrelstats->nonempty_pages = 0;
+
+   /* Setup error traceback support for ereport() */
+   vacrelstats->relnamespace = get_namespace_name(RelationGetNamespace(onerel));
+   vacrelstats->relname = RelationGetRelationName(onerel);
+   vacrelstats->blkno = InvalidBlockNumber;    /* Not known yet */
+   vacrelstats->stage = 0;
+
+   errcallback.callback = vacuum_error_callback;
+   errcallback.arg = (void *) vacrelstats;
+   errcallback.previous = error_context_stack;
+   error_context_stack = &errcallback;
+
    vacrelstats->latestRemovedXid = InvalidTransactionId;
+
+   if (aggressive)
+       ereport(elevel,
+               (errmsg("aggressively vacuuming \"%s.%s\"",
+                       vacrelstats->relnamespace,
+                       vacrelstats->relname)));
+   else
+       ereport(elevel,
+               (errmsg("vacuuming \"%s.%s\"",
+                       vacrelstats->relnamespace,
+                       vacrelstats->relname)));

* It seems strange to me that only the initialization of
  latestRemovedXid is done after the error callback initialization.
* Maybe we can initialize relname and relnamespace in heap_vacuum_rel
  rather than in lazy_scan_heap. BTW, the variables of vacrelstats are
  initialized in different places: some of them in heap_vacuum_rel and
  others in lazy_scan_heap. I think we can gather those that can be
  initialized at that time into heap_vacuum_rel.

3.
    /* Work on all the indexes, then the heap */
+   /* Don't use the errcontext handler outside this function */
+   error_context_stack = errcallback.previous;
    lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
                            lps, nindexes);
-
    /* Remove tuples from heap */
    lazy_vacuum_heap(onerel, vacrelstats);
+   error_context_stack = &errcallback;

Maybe we can do it like:

    /* Pop the error context stack */
    error_context_stack = errcallback.previous;

    /* Work on all the indexes, then the heap */
    lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
                            lps, nindexes);

    /* Remove tuples from heap */
    lazy_vacuum_heap(onerel, vacrelstats);

    /* Push the error context of the heap scan again */
    error_context_stack = &errcallback;

4.
+   /* Setup error traceback support for ereport() */
+   /* vacrelstats->relnamespace already set */
+   /* vacrelstats->relname already set */

These comments can be merged like:

    /*
     * Setup error traceback for ereport(). Both relnamespace and
     * relname are already set.
     */

5.
+   /* Setup error traceback support for ereport() */
+   vacrelstats.relnamespace = get_namespace_name(RelationGetNamespace(indrel));
+   vacrelstats.relname = RelationGetRelationName(indrel);
+   vacrelstats.blkno = InvalidBlockNumber; /* Not used */

Why do we need to initialize blkno in spite of not using it?

6.
+/*
+ * Error context callback for errors occurring during vacuum.
+ */
+static void
+vacuum_error_callback(void *arg)
+{
+   LVRelStats *cbarg = arg;
+
+   if (cbarg->stage == 0)
+       errcontext(_("while scanning block %u of relation \"%s.%s\""),
+                  cbarg->blkno, cbarg->relnamespace,
Re: Increase psql's password buffer size
> On 22 Jan 2020, at 01:41, David Fetter wrote:
>
> On Tue, Jan 21, 2020 at 07:05:47PM +0100, David Fetter wrote:
>> On Tue, Jan 21, 2020 at 10:23:59AM -0500, Bruce Momjian wrote:
>>> On Tue, Jan 21, 2020 at 04:19:13PM +0100, David Fetter wrote:
>>>> On Tue, Jan 21, 2020 at 10:12:52AM -0500, Bruce Momjian wrote:
>>>>> I think we should be using a macro to define the maximum length,
>>>>> rather than have 100 used in various places.
>>>>
>>>> It's not just 100 in some places. It's different in different places,
>>>> which goes to your point.
>>>>
>>>> How about using a system that doesn't meaningfully impose a maximum
>>>> length? The shell variable is a const char *, so why not just
>>>> re(p)alloc as needed?
>>>
>>> Uh, how do you know how big to make the buffer that receives the read?
>>
>> You can start at any size, possibly even 100, and then increase the
>> size in a loop along the lines of (untested)

It doesn't seem like a terribly safe pattern to have the client decide
the read buffer without disclosing the size, and have the server resize
the input buffer to an arbitrary size as input comes in.

> my_size *= 2;
> buf = (char *) realloc(buf, my_size);

I know it's just example code, but using buf as the input to realloc
like this risks a memleak when realloc fails, as the original buf
pointer is overwritten. Using a temporary pointer for the return value
avoids that, which is how pg_repalloc and repalloc do it.

cheers ./daniel