Re: pglz performance
On 02.08.2019 21:20, Andres Freund wrote:
> Another question is whether we'd actually want to include the code in core directly, or use system libraries (and if some packagers might decide to disable that, for whatever reason). I'd personally say we should have an included version, and a --with-system-... flag that uses the system one.

+1
Re: POC: Cleaning up orphaned files using undo logs
On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar wrote:
> On Mon, Jul 22, 2019 at 2:21 PM Amit Kapila wrote:
> >
> > I have reviewed 0012-Infrastructure-to-execute-pending-undo-actions,
>
> Please find my comments so far.
..
> 4.
> +void
> +undoaction_redo(XLogReaderState *record)
> +{
> +	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
> +
> +	switch (info)
> +	{
> +		case XLOG_UNDO_APPLY_PROGRESS:
> +			undo_xlog_apply_progress(record);
> +			break;
>
> For HotStandby it doesn't make sense to apply this WAL, as this progress is only required when we try to apply the undo action after restart, but in HotStandby we never apply undo actions.

Hmm, I think it is required. Think what happens if Hotstandby is later promoted to master and a large part of the undo is already applied? In such a case, we can skip the already applied undo.

> 6.
> +	if ((slot == NULL) || (UndoRecPtrGetLogNo(urecptr) != slot->logno))
> +		slot = UndoLogGetSlot(UndoRecPtrGetLogNo(urecptr), false);
> +
> +	Assert(slot != NULL);
>
> We are passing missing_ok as false in UndoLogGetSlot. But, not sure why we are expecting that the undo log can not be dropped. In a multi-log transaction it's possible that the tablespace containing the next undo log is already dropped?

If the transaction spans multiple logs, then both the logs should be in the same tablespace. So, how is it possible to drop the tablespace when part of the undo is still pending? AFAICS, the code in choose_undo_tablespace() doesn't seem to allow switching tablespace for the same transaction, but I agree that if someone used a different algorithm, then it might be possible. I think the important question is whether we should allow the same transaction's undo to span across tablespaces? If so, then what you are telling makes sense and we should handle that; if not, then I think we are fine here.

One might argue that there should be some stronger checks to ensure that the same transaction will always get the undo logs from the same tablespace, but I think that is a different thing than what you are raising here. Thomas, others, do you have any opinion on this matter?

In FindUndoEndLocationAndSize, there is a check if the next log is discarded (Case 4: If the transaction is overflowed to ...), won't this case (considering it is possible) get detected by that check?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Avoid full GIN index scan when possible
Attached 6th version of the patches.

On 01.08.2019 22:28, Tom Lane wrote:
> Alexander Korotkov writes:
> > On Thu, Aug 1, 2019 at 9:59 PM Tom Lane wrote:
> > > While I've not attempted to fix that here, I wonder whether we shouldn't fix it by just forcing forcedRecheck to true in any case where we discard an ALL qualifier.
> >
> > +1 for setting forcedRecheck in any case we discard an ALL qualifier. ISTM the real-life number of cases where we can skip the recheck here is negligible, and it doesn't justify the complexity.
>
> Yeah, that was pretty much what I was thinking --- by the time we got it fully right considering nulls and multicolumn indexes, the cases where not rechecking could actually do something useful would be pretty narrow. And a bitmap heap scan is always going to have to visit the heap, IIRC, so how much could skipping the recheck really save?

I have simplified patch #1, setting forcedRecheck for all discarded ALL quals. (This solution is very close to the earliest unpublished version of the patch.) More accurate recheck-forcing logic was moved into patch #2 (multicolumn indexes were fixed). This patch also contains ginFillScanKey() refactoring and a new internal mode GIN_SEARCH_MODE_NOT_NULL that is used only for GinScanKey.xxxConsistentFn initialization and is transformed into GIN_SEARCH_MODE_ALL before GinScanEntry initialization. The cost estimation seems to be correct for both patch #1 and patch #2 and has been left untouched since v05.

> BTW, it's not particularly the fault of this patch, but: what does it even mean to specify GIN_SEARCH_MODE_ALL with a nonzero number of keys?

It might mean we would like to see all the results which don't contain the given key.

> Ah, right, I forgot that the consistent-fn might look at the match results.

Also I decided to go further and tried to optimize (patch #3) the case of GIN_SEARCH_MODE_ALL with a nonzero number of keys.
Full GIN scan can be avoided in queries like this contrib/intarray query: "arr @@ '1' AND arr @@ '!2'" (search arrays containing 1 and not containing 2). Here we have two keys:
- key '1' with GIN_SEARCH_MODE_DEFAULT
- key '2' with GIN_SEARCH_MODE_ALL

Key '2' requires a full scan, which can be avoided with the forced recheck. This query is equivalent to the single-qual query "a @@ '1 & !2'", which emits only one GIN key '1' with recheck.

Below is an example for the contrib/intarray operator @@:

=# CREATE EXTENSION intarray;
=# CREATE TABLE t (a int[]);
=# INSERT INTO t SELECT ARRAY[i] FROM generate_series(1, 100) i;
=# CREATE INDEX ON t USING gin (a gin__int_ops);
=# SET enable_seqscan = OFF;

-- master
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE a @@ '1' AND a @@ '!2';
                              QUERY PLAN
 Bitmap Heap Scan on t  (cost=1695.45..16007168.16 rows=5019 width=24) (actual time=66.955..66.956 rows=1 loops=1)
   Recheck Cond: ((a @@ '1'::query_int) AND (a @@ '!2'::query_int))
   Heap Blocks: exact=1
   Buffers: shared hit=6816
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..1694.19 rows=5019 width=0) (actual time=66.950..66.950 rows=1 loops=1)
         Index Cond: ((a @@ '1'::query_int) AND (a @@ '!2'::query_int))
         Buffers: shared hit=6815
 Planning Time: 0.086 ms
 Execution Time: 67.076 ms
(9 rows)

-- equivalent single-qual query
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE a @@ '1 & !2';
                              QUERY PLAN
 Bitmap Heap Scan on t  (cost=78.94..7141.57 rows=5025 width=24) (actual time=0.019..0.019 rows=1 loops=1)
   Recheck Cond: (a @@ '1 & !2'::query_int)
   Heap Blocks: exact=1
   Buffers: shared hit=8
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..77.68 rows=5025 width=0) (actual time=0.014..0.014 rows=1 loops=1)
         Index Cond: (a @@ '1 & !2'::query_int)
         Buffers: shared hit=7
 Planning Time: 0.056 ms
 Execution Time: 0.039 ms
(9 rows)

-- with patch #3
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE a @@ '1' AND a @@ '!2';
                              QUERY PLAN
 Bitmap Heap Scan on t  (cost=75.45..7148.16 rows=5019 width=24) (actual time=0.019..0.020 rows=1 loops=1)
   Recheck Cond: ((a @@ '1'::query_int) AND (a @@ '!2'::query_int))
   Heap Blocks: exact=1
   Buffers: shared hit=6
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..74.19 rows=5019 width=0) (actual time=0.011..0.012 rows=1 loops=1)
         Index Cond: ((a @@ '1'::query_int) AND (a @@ '!2'::query_int))
         Buffers: shared hit=5
 Planning Time: 0.059 ms
 Execution Time: 0.040 ms
(9 rows)

Patch #3 again contains a similar ugly solution -- we have to remove already initialized
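The idea behind patch #3, as applied to the contrib/intarray example above, can be sketched in a toy model (Python, not PostgreSQL code; all names here are invented for illustration): instead of executing the GIN_SEARCH_MODE_ALL key '!2' as a full index scan, drop it from the index scan entirely and compensate by forcing a recheck of the complete qualifier against each heap tuple returned by the remaining key.

```python
# Toy sketch of "avoid the full GIN scan for an ALL key by forcing recheck".
# The index maps array elements to sets of TIDs, mimicking a GIN posting list.

def index_scan(index, key):
    """Simulate a GIN index scan for one key: return candidate TIDs."""
    return set(index.get(key, ()))

def recheck(tuple_arr, required, forbidden):
    """Recheck the full qualifier ('required AND NOT forbidden') on the heap tuple."""
    return required in tuple_arr and forbidden not in tuple_arr

heap = {tid: [tid] for tid in range(1, 101)}   # rows ARRAY[1] .. ARRAY[100]
index = {}                                     # element -> set of TIDs
for tid, arr in heap.items():
    for elem in arr:
        index.setdefault(elem, set()).add(tid)

# "a @@ '1' AND a @@ '!2'": scan only key '1'; key '!2' never touches the index.
candidates = index_scan(index, 1)
results = [tid for tid in candidates if recheck(heap[tid], 1, 2)]
```

Only the tuples matching key '1' are visited, and the '!2' condition is verified during the (forced) recheck, matching the buffer counts in the patched EXPLAIN output above.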
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi,

On 2019-08-02 20:57:20 -0400, Stephen Frost wrote:
> No, I’m saying that we already *have* a library and we can add a few functions to it and if people want to leverage those functions then they can write glue code to do so, just like was done with libpq. The argument that “we shouldn’t put code into the common library because only tools written in C can use the common library” is what I was specifically taking exception with and your response doesn’t change my opinion of that argument one bit.

Wait, which library is this? And which code is suitable for being put in a library right now? We're WAY WAY past feature freeze. This isn't the time to rewrite guc.c, guc-file.l to be suitable for running outside of a backend environment.

> Apparently I don’t have the experiences that you do as I’ve not seen a lot of systems which are constantly rewriting the conf file to the point where keeping the versions would be likely to add up to anything interesting.

Shrug. I've e.g. seen people continuously (every few minutes or so) change autovacuum settings depending on load and observed response times. Which isn't even a crazy thing to do.

> Designing the system around “well, we don’t think you’ll modify the file very much from an external tool, so we just won’t worry about it, but if you use alter system then we will clean things up” certainly doesn’t strike me as terribly principled.

Well. You shouldn't change postgresql.conf.auto while the server is running, for fairly obvious reasons. Therefore external tools not using ALTER SYSTEM only make sense when the server is not running. And I don't think it's crazy to assume that PG servers where you'd regularly change the config are running most of the time.

And again, we're talking about v12 here. I don't think anybody is arguing that we shouldn't provide library/commandline tools to make changes to postgresql.auto.conf conveniently possible without duplicating lines. BUT not for v12, especially not because as the person arguing for this, you've not provided a patch providing such a library.

> > > Personally, I don’t buy the “run out of disk space” argument but if we are going to go there then we should apply it appropriately.
> > >
> > > Having the history of changes to auto.conf would actually be quite useful, imv, and worth a bit of disk space (heck, it’s not exactly uncommon for people to keep their config files in git repos..). I’d suggest we also include the date/time of when the modification was made.
> >
> > That just seems like an entirely different project. It seems blindingly obvious that we can't keep the entire history in the file that we're going to be parsing on a regular basis. Having some form of config history tracking might be interesting, but I think it's utterly and completely independent from what we need to fix for v12.
>
> We don’t parse the file on anything like a “regular” basis.

Well, every time somebody does pg_reload_conf(), which for systems that do frequent ALTER SYSTEMs, is kinda frequent too...

Greetings,

Andres Freund
Re: Optimize single tuple fetch from nbtree index
On Fri, Aug 2, 2019 at 5:34 PM Peter Geoghegan wrote:
> I wonder if some variety of block nested loop join would be helpful here. I'm not aware of any specific design that would help with Floris' case, but the idea of reducing the number of scans required on the inner side by buffering outer side tuples (say based on the "k=:val" constant) seems like it might generalize well enough. I suggest Floris look into that possibility. This paper might be worth a read:
>
> https://dl.acm.org/citation.cfm?id=582278

Actually, having looked at the test case in more detail, that now seems less likely. The test case seems designed to reward making it cheaper to access one specific tuple among a fairly large group of related tuples -- reducing the number of inner scans is not going to be possible there.

If this really is totally representative of the case that Floris cares about, I suppose that the approach taken more or less makes sense. Unfortunately, it doesn't seem like an optimization that many other users would find compelling, partly because it's only concerned with fixed overheads, and partly because most queries don't actually look like this.

--
Peter Geoghegan
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings,

On Fri, Aug 2, 2019 at 20:46 Andres Freund wrote:
> Hi,
>
> On 2019-08-02 20:27:25 -0400, Stephen Frost wrote:
> > On Fri, Aug 2, 2019 at 18:47 Tom Lane wrote:
> > > Stephen Frost writes:
> > > > On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote:
> > > > > > The proposal seems to be to run through the .auto.conf file, remove any duplicates, and append the new entry at the end. That seems reasonable.
> > > > >
> > > > > +1
> > > >
> > > > I disagree that this should only be addressed in alter system, as I’ve said before and as others have agreed with. Having one set of code that can be used to update parameters in the auto.conf and then have that be used by pg_basebackup, alter system, and external tools, is the right approach.
> > >
> > > I don't find that to be necessary or even desirable. Many (most?) of the situations where this would be important wouldn't have access to a running backend, and maybe not to any PG code at all --- what if your tool isn't written in C?
> >
> > What if you want to access PG and your tool isn’t written in C? You use a module, extension, package, whatever, that provides the glue between what your language wants and what the C library provides. There’s psycopg2 for python, DBD::Pg for Perl, et al, and they use libpq. There’s languages that like to write their own too, like the JDBC driver, the Golang driver, but that doesn’t mean we shouldn’t provide libpq or that non-C tools can’t leverage libpq. This argument is just not sensible.
>
> Oh, come on. Are you seriously suggesting that a few commands to add a new config setting to postgresql.auto.conf will cause a lot of people to write wrappers around $new_config_library in their language of choice, because they did the same for libpq? And that we should design such a library, for v12?

No, I’m saying that we already *have* a library and we can add a few functions to it and if people want to leverage those functions then they can write glue code to do so, just like was done with libpq. The argument that “we shouldn’t put code into the common library because only tools written in C can use the common library” is what I was specifically taking exception with and your response doesn’t change my opinion of that argument one bit.

> > > I think it's perfectly fine to say that external tools need only append to the file, which will require no special tooling. But then we need ALTER SYSTEM to be willing to clean out duplicates, if only so you don't run out of disk space after awhile.
> >
> > Uh, if you don’t ever run ALTER SYSTEM then you could also “run out of disk space” due to external tools modifying by just adding to the file.
>
> That was commented upon in the emails you're replying to? It seems hardly likely that you'd get enough config entries to make that problematic while postgres is not running. While running it's a different story.

Apparently I don’t have the experiences that you do as I’ve not seen a lot of systems which are constantly rewriting the conf file to the point where keeping the versions would be likely to add up to anything interesting.

Designing the system around “well, we don’t think you’ll modify the file very much from an external tool, so we just won’t worry about it, but if you use alter system then we will clean things up” certainly doesn’t strike me as terribly principled.

> > Personally, I don’t buy the “run out of disk space” argument but if we are going to go there then we should apply it appropriately.
> >
> > Having the history of changes to auto.conf would actually be quite useful, imv, and worth a bit of disk space (heck, it’s not exactly uncommon for people to keep their config files in git repos..). I’d suggest we also include the date/time of when the modification was made.
>
> That just seems like an entirely different project. It seems blindingly obvious that we can't keep the entire history in the file that we're going to be parsing on a regular basis. Having some form of config history tracking might be interesting, but I think it's utterly and completely independent from what we need to fix for v12.

We don’t parse the file on anything like a “regular” basis.

Thanks,

Stephen
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi,

On 2019-08-02 20:27:25 -0400, Stephen Frost wrote:
> On Fri, Aug 2, 2019 at 18:47 Tom Lane wrote:
> > Stephen Frost writes:
> > > On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote:
> > > > > The proposal seems to be to run through the .auto.conf file, remove any duplicates, and append the new entry at the end. That seems reasonable.
> > > >
> > > > +1
> > >
> > > I disagree that this should only be addressed in alter system, as I’ve said before and as others have agreed with. Having one set of code that can be used to update parameters in the auto.conf and then have that be used by pg_basebackup, alter system, and external tools, is the right approach.
> >
> > I don't find that to be necessary or even desirable. Many (most?) of the situations where this would be important wouldn't have access to a running backend, and maybe not to any PG code at all --- what if your tool isn't written in C?
>
> What if you want to access PG and your tool isn’t written in C? You use a module, extension, package, whatever, that provides the glue between what your language wants and what the C library provides. There’s psycopg2 for python, DBD::Pg for Perl, et al, and they use libpq. There’s languages that like to write their own too, like the JDBC driver, the Golang driver, but that doesn’t mean we shouldn’t provide libpq or that non-C tools can’t leverage libpq. This argument is just not sensible.

Oh, come on. Are you seriously suggesting that a few commands to add a new config setting to postgresql.auto.conf will cause a lot of people to write wrappers around $new_config_library in their language of choice, because they did the same for libpq? And that we should design such a library, for v12?

> > I think it's perfectly fine to say that external tools need only append to the file, which will require no special tooling. But then we need ALTER SYSTEM to be willing to clean out duplicates, if only so you don't run out of disk space after awhile.
>
> Uh, if you don’t ever run ALTER SYSTEM then you could also “run out of disk space” due to external tools modifying by just adding to the file.

That was commented upon in the emails you're replying to? It seems hardly likely that you'd get enough config entries to make that problematic while postgres is not running. While running it's a different story.

> Personally, I don’t buy the “run out of disk space” argument but if we are going to go there then we should apply it appropriately.
>
> Having the history of changes to auto.conf would actually be quite useful, imv, and worth a bit of disk space (heck, it’s not exactly uncommon for people to keep their config files in git repos..). I’d suggest we also include the date/time of when the modification was made.

That just seems like an entirely different project. It seems blindingly obvious that we can't keep the entire history in the file that we're going to be parsing on a regular basis. Having some form of config history tracking might be interesting, but I think it's utterly and completely independent from what we need to fix for v12.

It seems pretty clear that there's more people disagreeing with your position than agreeing with you. Because of this conflict there's not been progress on this for weeks. I think it's beyond time that we just do the minimal thing for v12, and then continue from there in v13.

- Andres
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 3:52 PM Tom Lane wrote:
> Better ... but I'm the world's second worst Perl programmer, so I have little to say about whether it's idiomatic.

Perhaps Michael can weigh in here? I'd rather hear a second opinion on v4 of the patch before proceeding.

--
Peter Geoghegan
Re: Optimize single tuple fetch from nbtree index
On Fri, Aug 2, 2019 at 1:43 PM Tom Lane wrote:
> Meh. I think the odds that you got this 100% right are small, and the odds that it would be maintainable are smaller. There's too much that can happen if you're not holding any lock --- and there's a lot of active work on btree indexes, which could break whatever assumptions you might make today.

I agree that this sounds very scary.

> BTW, you haven't even really made the case that optimizing a query that behaves this way is the right thing to be doing ... maybe some other plan shape that isn't a nestloop around a LIMIT query would be a better solution.

I wonder if some variety of block nested loop join would be helpful here. I'm not aware of any specific design that would help with Floris' case, but the idea of reducing the number of scans required on the inner side by buffering outer side tuples (say based on the "k=:val" constant) seems like it might generalize well enough. I suggest Floris look into that possibility. This paper might be worth a read:

https://dl.acm.org/citation.cfm?id=582278

(Though it also might not be worth a read -- I haven't actually read it myself.)

--
Peter Geoghegan
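For readers unfamiliar with the technique being suggested, the core idea of a block nested loop join can be sketched as follows (a toy Python illustration of the textbook algorithm, not anything from the thread's patch): buffer a block of outer-side tuples and let a single pass over the inner side serve the whole block, rather than one inner scan per outer tuple.

```python
# Textbook block nested loop join: the inner side is scanned once per
# *block* of outer tuples instead of once per outer tuple.

def block_nested_loop_join(outer, inner, join_key, block_size=2):
    results = []
    for start in range(0, len(outer), block_size):
        block = outer[start:start + block_size]   # buffered outer tuples
        keys = {join_key(o) for o in block}
        # One pass over the inner side serves every tuple in the block.
        for i in inner:
            if join_key(i) in keys:
                for o in block:
                    if join_key(o) == join_key(i):
                        results.append((o, i))
    return results

outer = [(1, 'a'), (2, 'b'), (1, 'c')]
inner = [(1, 'x'), (3, 'y')]
pairs = block_nested_loop_join(outer, inner, lambda t: t[0])
```

With block_size=2, the three outer tuples cost two inner scans rather than three; larger blocks reduce the inner-scan count further at the price of buffering memory.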
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings,

On Fri, Aug 2, 2019 at 18:47 Tom Lane wrote:
> Stephen Frost writes:
> > On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote:
> > > > The proposal seems to be to run through the .auto.conf file, remove any duplicates, and append the new entry at the end. That seems reasonable.
> > >
> > > +1
> >
> > I disagree that this should only be addressed in alter system, as I’ve said before and as others have agreed with. Having one set of code that can be used to update parameters in the auto.conf and then have that be used by pg_basebackup, alter system, and external tools, is the right approach.
>
> I don't find that to be necessary or even desirable. Many (most?) of the situations where this would be important wouldn't have access to a running backend, and maybe not to any PG code at all --- what if your tool isn't written in C?

What if you want to access PG and your tool isn’t written in C? You use a module, extension, package, whatever, that provides the glue between what your language wants and what the C library provides. There’s psycopg2 for python, DBD::Pg for Perl, et al, and they use libpq. There’s languages that like to write their own too, like the JDBC driver, the Golang driver, but that doesn’t mean we shouldn’t provide libpq or that non-C tools can’t leverage libpq. This argument is just not sensible.

I agree entirely that we want to be able to modify auto.conf without having PG running (and without using single mode, bleh, that’s horrid..). I think we can accept that we can’t necessarily *validate* that every option is acceptable, but that’s not the same as being able to simply parse the file and modify a value.

> I think it's perfectly fine to say that external tools need only append to the file, which will require no special tooling. But then we need ALTER SYSTEM to be willing to clean out duplicates, if only so you don't run out of disk space after awhile.

Uh, if you don’t ever run ALTER SYSTEM then you could also “run out of disk space” due to external tools modifying by just adding to the file.

Personally, I don’t buy the “run out of disk space” argument but if we are going to go there then we should apply it appropriately.

Having the history of changes to auto.conf would actually be quite useful, imv, and worth a bit of disk space (heck, it’s not exactly uncommon for people to keep their config files in git repos..). I’d suggest we also include the date/time of when the modification was made.

Thanks,

Stephen
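The proposal under discussion (run through the .auto.conf file, remove any duplicates, and append the new entry at the end) can be sketched in a few lines. This is a deliberately minimal Python toy, not the backend's C implementation: real GUC parsing handles quoting, escaping, and validation, none of which is modeled here.

```python
# Minimal "last entry wins" sketch of the proposed ALTER SYSTEM behaviour:
# drop earlier occurrences of the setting (e.g. duplicates appended by
# external tools), then append the new value at the end of the file.

def set_auto_conf(lines, name, value):
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            kept.append(line)                 # keep blank/comment lines
            continue
        key = stripped.split('=', 1)[0].strip()
        if key != name:                       # remove duplicates of `name`
            kept.append(line)
    kept.append(f"{name} = '{value}'")        # new entry goes last
    return kept

conf = ["# Do not edit this file manually!",
        "work_mem = '4MB'",
        "work_mem = '8MB'",                   # appended later by an external tool
        "shared_buffers = '128MB'"]
new_conf = set_auto_conf(conf, 'work_mem', '16MB')
```

Because the parser takes the last occurrence of a setting as authoritative, external tools that only append remain correct, and ALTER SYSTEM's rewrite keeps the file from growing without bound.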
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings,

On Fri, Aug 2, 2019 at 19:36 Ian Barwick wrote:
> On 8/3/19 8:24 AM, Andres Freund wrote:
> > Hi,
> >
> > On 2019-08-03 08:22:29 +0900, Ian Barwick wrote:
> > > What I came up with shoehorned a stripped-down version of the backend config parser into fe_utils and provides a function to modify pg.auto.conf in much the same way ALTER SYSTEM does, but with only the basic syntax checking provided by the parser of course. And for completeness a client utility which can be called by scripts etc.
> > >
> > > I can clean it up and submit it later for reference (got distracted by other things recently) though I don't think it's a particularly good solution due to the lack of actual checks for the provided GUCs (and the implementation is ugly anyway); something like what Andres suggests below would be far better.
> >
> > I think my main problem with that is that it duplicates a nontrivial amount of code.
>
> That is indeed part of the ugliness of the implementation.

I agree that duplicate code isn’t good -- the goal would be to eliminate the duplication by having it be common code instead of duplicated. We have other code that’s common to the frontend and backend, and I don’t doubt that we will have more going forward...

Thanks,

Stephen
Re: Remove HeapTuple and Buffer dependency for predicate locking functions
On Wed, Jul 31, 2019 at 2:06 PM Andres Freund wrote:
> Looking at the code as of master, we currently have:

Super awesome feedback and insights, thank you!

> - PredicateLockTuple() calls SubTransGetTopmostTransaction() to figure out whether the tuple has been locked by the current transaction. That check afaict just should be TransactionIdIsCurrentTransactionId(), without all the other stuff that's done today.

Agree. The attached v1-0002 patch does that now. Please let me know if that's what you meant.

> TransactionIdIsCurrentTransactionId() imo ought to be optimized to always check for the top level transactionid first - that's a good bet today, but even more so for the upcoming AMs that won't have separate xids for subtransactions. Alternatively we shouldn't make that a binary search for each subtrans level, but just have a small simplehash hashtable for xids.

The v1-0001 patch checks GetTopTransactionIdIfAny() first in TransactionIdIsCurrentTransactionId(), which seems better in general and more so for the future. That mostly meets the needs of the current discussion. The alternative of not using binary search seems like a bigger refactoring and should be handled as a separate optimization exercise outside of this thread.

> - CheckForSerializableConflictOut() wants to get the toplevel xid for the tuple, because that's the one the predicate hashtable stores.
>
> In your patch you've already moved the HTSV() call etc out of CheckForSerializableConflictOut(). I'm somewhat inclined to think that the SubTransGetTopmostTransaction() call ought to go along with that. I don't really think that belongs in predicate.c, especially if most/all new AMs don't use subtransaction ids.
>
> The only downside is that currently the TransactionIdEquals(xid, GetTopTransactionIdIfAny()) check avoids the SubTransGetTopmostTransaction() check.
>
> But again, the better fix for that seems to be to improve the generic code. As written the check won't prevent a subtrans lookup for heap when subtransactions are in use, and it's IME pretty common for tuples to get looked at again in the transaction that has created them. So I'm somewhat inclined to think that SubTransGetTopmostTransaction() should have a fast-path for the current transaction - probably just employing TransactionIdIsCurrentTransactionId().

That optimization, as Kuntal also mentioned, seems like something which can be done on top of the current patch afterwards.

> I don't really see what we gain by having the subtrans handling in the predicate code. Especially given that we've already moved the HTSV() handling out, it seems architecturally the wrong place to me - but I admit that that's a fuzzy argument. The relevant mapping should be one line in the caller.

Okay, I moved the subtransaction handling out of CheckForSerializableConflictOut() and have it alongside HTSV() now.

The reason I felt like leaving the subtransaction handling in a generic place was that it might be premature to think no future AM will need it. Plus, all the serializable function APIs having the same expectations is easier: PredicateLockTuple() can be passed a top or subtransaction id and can handle it, but with the change CheckForSerializableConflictOut() can only be fed the top transaction ID. But it's fine, and I can see the point that an AM needing it can easily get the top transaction ID and feed it, as heap does.

> I wonder if it'd be worth combining the TransactionIdIsCurrentTransactionId() calls in the heap cases that currently do both, PredicateLockTuple() and HeapCheckForSerializableConflictOut(). The heap_fetch() case probably isn't commonly that hot a path, but heap_hot_search_buffer() is.

Maybe; will give thought to it separately from the current patch.

> Minor notes:
> - I don't think 'insert_xid' is necessarily great - it could also be the updating xid etc. And while you can argue that an update is an insert in the current heap, that's not the case for future AMs.
> - to me
> @@ -1621,7 +1622,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
>  			if (valid)
>  			{
>  				ItemPointerSetOffsetNumber(tid, offnum);
> -				PredicateLockTuple(relation, heapTuple, snapshot);
> +				PredicateLockTID(relation, &(heapTuple)->t_self, snapshot,
> +								 HeapTupleHeaderGetXmin(heapTuple->t_data));
>  				if (all_dead)
>  					*all_dead = false;
>  				return true;
>
> What are those parens - as placed they can't do anything. Did you intend to write &(heapTuple->t_self)? Even that is pretty superfluous, but it at least clarifies the precedence.

Fixed. No idea what I was thinking there; I mostly feel I intended to have it as &(heapTuple->t_self).

I'm also a bit co
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On 8/3/19 8:24 AM, Andres Freund wrote:
> Hi,
>
> On 2019-08-03 08:22:29 +0900, Ian Barwick wrote:
> > What I came up with shoehorned a stripped-down version of the backend config parser into fe_utils and provides a function to modify pg.auto.conf in much the same way ALTER SYSTEM does, but with only the basic syntax checking provided by the parser of course. And for completeness a client utility which can be called by scripts etc.
> >
> > I can clean it up and submit it later for reference (got distracted by other things recently) though I don't think it's a particularly good solution due to the lack of actual checks for the provided GUCs (and the implementation is ugly anyway); something like what Andres suggests below would be far better.
>
> I think my main problem with that is that it duplicates a nontrivial amount of code.

That is indeed part of the ugliness of the implementation.

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi,

On 2019-08-03 08:22:29 +0900, Ian Barwick wrote:
> What I came up with shoehorned a stripped-down version of the backend config parser into fe_utils and provides a function to modify pg.auto.conf in much the same way ALTER SYSTEM does, but with only the basic syntax checking provided by the parser of course. And for completeness a client utility which can be called by scripts etc.
>
> I can clean it up and submit it later for reference (got distracted by other things recently) though I don't think it's a particularly good solution due to the lack of actual checks for the provided GUCs (and the implementation is ugly anyway); something like what Andres suggests below would be far better.

I think my main problem with that is that it duplicates a nontrivial amount of code.

Greetings,

Andres Freund
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On 8/3/19 7:56 AM, Andres Freund wrote:
> Hi,
>
> On 2019-08-02 18:47:07 -0400, Tom Lane wrote:
> > Stephen Frost writes:
> > > I disagree that this should only be addressed in alter system, as I’ve said before and as others have agreed with. Having one set of code that can be used to update parameters in the auto.conf and then have that be used by pg_basebackup, alter system, and external tools, is the right approach.
> >
> > I don't find that to be necessary or even desirable. Many (most?) of the situations where this would be important wouldn't have access to a running backend, and maybe not to any PG code at all --- what if your tool isn't written in C?
>
> I think a commandline tool to perform the equivalent of ALTER SYSTEM on a shutdown cluster would be a great tool. It's easy enough to add something with broken syntax, and further down the road such a tool could not only ensure the syntax is correct, but also validate individual settings as much as possible (obviously there's some hairy issues here).

What I came up with shoehorned a stripped-down version of the backend config parser into fe_utils and provides a function to modify pg.auto.conf in much the same way ALTER SYSTEM does, but with only the basic syntax checking provided by the parser of course. And for completeness a client utility which can be called by scripts etc.

I can clean it up and submit it later for reference (got distracted by other things recently) though I don't think it's a particularly good solution due to the lack of actual checks for the provided GUCs (and the implementation is ugly anyway); something like what Andres suggests below would be far better.

> Quite possibly the most realistic way to implement something like that would be a postgres commandline switch, which'd start up far enough to perform GUC checks and execute AlterSystem(), and then shut down again. We already have -C, I think such an option could reasonably be implemented alongside it. Obviously this is widely out of scope for v12.

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Hi, On 2019-08-02 10:54:35 +0200, Julien Rouhaud wrote: > On Thu, Aug 1, 2019 at 11:05 PM Andres Freund wrote: > > > > I'm actually quite unconvinced that it's sensible to update the global > > value for nested queries. That'll mean e.g. the log_line_prefix and > > pg_stat_activity values are most of the time going to be bogus while > > nested, because the querystring that's associated with those will *not* > > be the value that the queryid corresponds to. elog.c uses > > debug_query_string to log the statement, which is only updated for > > top-level queries (outside of some exceptions like parallel workers for > > parallel queries in a function or stuff like that). And pg_stat_activity > > is also only updated for top level queries. > > Having the nested queryid seems indeed quite broken for > log_line_prefix. However having the nested queryid in > pg_stat_activity would be convenient to track what is a long stored > functions currently doing. Maybe we could expose something like > top_level_queryid and current_queryid instead? Given that the query string is the toplevel one, I think that'd just be confusing. And given the fact that it adds *substantial* additional complexity, I'd just rip the subcommand bits out. Greetings, Andres Freund
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On 8/3/19 8:09 AM, Tom Lane wrote: I wrote: Andres Freund writes: I think a commandline tool to perform the equivalent of ALTER SYSTEM on a shutdown cluster would be a great tool. Perhaps, but ... Obviously this is widely out of scope for v12. ... this. Although, there's always echo "alter system set work_mem = 4242;" | postgres --single Maybe we could recommend that to tools that need to do potentially-repetitive modifications? The slight problem with that, particularly with the use case I am concerned with (writing replication configuration), is: [2019-08-03 08:14:21 JST] FATAL: 0A000: standby mode is not supported by single-user servers (I may be missing something obvious of course) Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
I wrote: > Andres Freund writes: >> I think a commandline tool to perform the equivalent of ALTER SYSTEM on >> a shutdown cluster would be a great tool. > Perhaps, but ... >> Obviously this is widely out of scope for v12. > ... this. Although, there's always echo "alter system set work_mem = 4242;" | postgres --single Maybe we could recommend that to tools that need to do potentially-repetitive modifications? regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Andres Freund writes: > On 2019-08-02 18:47:07 -0400, Tom Lane wrote: >> I don't find that to be necessary or even desirable. Many (most?) of the >> situations where this would be important wouldn't have access to a running >> backend, and maybe not to any PG code at all --- what if your tool isn't >> written in C? > I think a commandline tool to perform the equivalent of ALTER SYSTEM on > a shutdown cluster would be a great tool. Perhaps, but ... > Obviously this is widely out of scope for v12. ... this. It's entirely insane to think we're going to produce any such thing for v12 (much less back-patch it into prior versions). In the short term I don't think there's any workable alternative except to decree that "just append to the end" is a supported way to alter pg.auto.conf. But, as you said, it's also not sane for ALTER SYSTEM to behave that way, because it won't cope for long with repetitive modifications. I think we can get away with the "just append" recommendation for most external drivers because they won't be doing that. If they are, they'll need to be smarter, and maybe some command-line tool would make their lives simpler down the line. But we aren't providing that in this cycle. regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Fri, Aug 02, 2019 at 06:38:46PM -0400, Stephen Frost wrote: Greetings, On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote: Tomas Vondra writes: > There seems to be a consensus that this this not a pg_basebackup issue > (i.e. duplicate values don't make the file invalid), and it should be > handled in ALTER SYSTEM. Yeah. I doubt pg_basebackup is the only actor that can create such situations. > The proposal seems to be to run through the .auto.conf file, remove any > duplicates, and append the new entry at the end. That seems reasonable. +1 I disagree that this should only be addressed in alter system, as I’ve said before and as others have agreed with. Having one set of code that can be used to update parameters in the auto.conf and then have that be used by pg_basebackup, alter system, and external tools, is the right approach. The idea that alter system should be the only thing that doesn’t just append changes to the file is just going to lead to confusion and bugs down the road. I don't remember any suggestions ALTER SYSTEM should be the only thing that can rewrite the config file, but maybe it's buried somewhere in the thread history. The current proposal certainly does not prohibit any external tool from doing so, it just says we should expect duplicates. As I said before, an alternative could be to make alter system simply always append and declare that to be the way to update parameters in the auto.conf. That just seems strange, TBH. There was a discussion whether to print warnings about the duplicates. I > personally see not much point in doing that - if we consider duplicates > to be expected, and if ALTER SYSTEM has the license to rework the config > file any way it wants, why warn about it? Personally I agree that warnings are unnecessary. And at least Magnus and I disagree with that, as I recall from this thread. Let’s have a clean and clear way to modify the auto.conf and have everything that touches the file update it in a consistent way. 
Well, I personally don't feel very strongly about it. I think the warnings will be a nuisance, bothering people about expected stuff, but I'm not willing to fight against it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi, On 2019-08-02 18:47:07 -0400, Tom Lane wrote: > Stephen Frost writes: > > I disagree that this should only be addressed in alter system, as I’ve said > > before and as others have agreed with. Having one set of code that can be > > used to update parameters in the auto.conf and then have that be used by > > pg_basebackup, alter system, and external tools, is the right approach. > > I don't find that to be necessary or even desirable. Many (most?) of the > situations where this would be important wouldn't have access to a running > backend, and maybe not to any PG code at all --- what if your tool isn't > written in C? I think a commandline tool to perform the equivalent of ALTER SYSTEM on a shutdown cluster would be a great tool. It's easy enough to add something with broken syntax, and further down the road such a tool could not only ensure the syntax is correct, but also validate individual settings as much as possible (obviously there's some hairy issues here). Quite possibly the most realistic way to implement something like that would be a postgres commandline switch, which'd start up far enough to perform GUC checks and execute AlterSystem(), and then shut down again. We already have -C, I think such an option could reasonably be implemented alongside it. Obviously this is widely out of scope for v12. Greetings, Andres Freund
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
Peter Geoghegan writes: > How about the attached? I've simply removed the "if ($oid > $prev_oid > + 2)" test. Better ... but I'm the world's second worst Perl programmer, so I have little to say about whether it's idiomatic. regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi, On 2019-08-02 18:38:46 -0400, Stephen Frost wrote: > On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote: > > > Tomas Vondra writes: > > > There seems to be a consensus that this this not a pg_basebackup issue > > > (i.e. duplicate values don't make the file invalid), and it should be > > > handled in ALTER SYSTEM. > > > > Yeah. I doubt pg_basebackup is the only actor that can create such > > situations. > > > > > The proposal seems to be to run through the .auto.conf file, remove any > > > duplicates, and append the new entry at the end. That seems reasonable. > > > > +1 > I disagree that this should only be addressed in alter system, as I’ve said > before and as others have agreed with. Having one set of code that can be > used to update parameters in the auto.conf and then have that be used by > pg_basebackup, alter system, and external tools, is the right approach. > > The idea that alter system should be the only thing that doesn’t just > append changes to the file is just going to lead to confusion and bugs down > the road. To me that seems like an alternative that needs a good chunk more work than just having ALTER SYSTEM fix things up, and isn't actually likely to prevent such scenarios from occurring in practice. Providing a decent API to change config files from various places, presumably including a commandline utility to do so, would be a nice feature, but it seems vastly out of scope for v12. My vote is to fix this via ALTER SYSTEM in v12, and then for whoever is interested enough to provide better tools down the road. > As I said before, an alternative could be to make alter system simply > always append and declare that to be the way to update parameters in the > auto.conf. Why would that be a good idea? We'd just take longer and longer to parse it. There are people that change database settings on a regular and automated basis using ALTER SYSTEM. > > There was a discussion whether to print warnings about the duplicates. 
I > > > personally see not much point in doing that - if we consider duplicates > > > to be expected, and if ALTER SYSTEM has the license to rework the config > > > file any way it wants, why warn about it? > > > > Personally I agree that warnings are unnecessary. +1 Greetings, Andres Freund
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Stephen Frost writes: > On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote: >>> The proposal seems to be to run through the .auto.conf file, remove any >>> duplicates, and append the new entry at the end. That seems reasonable. >> +1 > I disagree that this should only be addressed in alter system, as I’ve said > before and as others have agreed with. Having one set of code that can be > used to update parameters in the auto.conf and then have that be used by > pg_basebackup, alter system, and external tools, is the right approach. I don't find that to be necessary or even desirable. Many (most?) of the situations where this would be important wouldn't have access to a running backend, and maybe not to any PG code at all --- what if your tool isn't written in C? I think it's perfectly fine to say that external tools need only append to the file, which will require no special tooling. But then we need ALTER SYSTEM to be willing to clean out duplicates, if only so you don't run out of disk space after awhile. regards, tom lane
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 3:19 PM Tom Lane wrote: > The "if ($oid > $prev_oid + 2)" test seems unnecessary. > It's certainly wrong to keep iterating beyond the first > oid that's > $suggestion. Sorry. That was just carelessness on my part. (Being the world's worst Perl programmer is no excuse.) How about the attached? I've simply removed the "if ($oid > $prev_oid + 2)" test. -- Peter Geoghegan v4-0001-unused_oids-suggestion.patch Description: Binary data
Re: tableam vs. TOAST
Hi, On 2019-08-01 12:23:42 -0400, Robert Haas wrote: > Roughly, on both master and with the patches, the first one takes > about 4.2 seconds, the second 7.5, and the third 1.2. The third one > is the fastest because it doesn't do any compression. Since it does > less irrelevant work than the other two cases, it has the best chance > of showing up any performance regression that the patch has caused -- > if any regression existed, I suppose that it would be an increased > per-toast-fetch or per-toast-chunk overhead. However, I can't > reproduce any such regression. My first attempt at testing that case > showed the patch about 1% slower, but that wasn't reliably > reproducible when I did it a bunch more times. So as far as I can > figure, this patch does not regress performance in any > easily-measurable way. Hm, those all include writing, right? And for read-only we don't expect any additional overhead, correct? The write overhead is probably too large to show a bit of function call overhead - but if so, it'd probably be on unlogged tables? And with COPY, because that utilizes multi_insert, which means more toasting in a shorter amount of time? .oO(why does everyone attach attachments out of order? Is that a gmail thing?) > From a4c858c75793f0f8aff7914c572a6615ea5babf8 Mon Sep 17 00:00:00 2001 > From: Robert Haas > Date: Mon, 8 Jul 2019 11:58:05 -0400 > Subject: [PATCH 1/4] Split tuptoaster.c into three separate files. > > detoast.c/h contain functions required to detoast a datum, partially > or completely, plus a few other utility functions for examining the > size of toasted datums. > > toast_internals.c/h contain functions that are used internally to the > TOAST subsystem but which (mostly) do not need to be accessed from > outside. > > heaptoast.c/h contains code that is intrinsically specific to the > heap AM, either because it operates on HeapTuples or is based on the > layout of a heap page. 
> > detoast.c and toast_internals.c are placed in > src/backend/access/common rather than src/backend/access/heap. At > present, both files still have dependencies on the heap, but that will > be improved in a future commit. I wonder if toasting.c should be moved too? If anybody doesn't know git's --color-moved, it makes patches like this a lot easier to review. > index 000..582af147ea1 > --- /dev/null > +++ b/src/include/access/detoast.h > @@ -0,0 +1,92 @@ > +/*- > + * > + * detoast.h > + *Access to compressed and external varlena values. > > Hm. Does it matter that that also includes stuff like expanded datums? > > + * Copyright (c) 2000-2019, PostgreSQL Global Development Group > + * > + * src/include/access/detoast.h > + * > + *- > + */ > +#ifndef DETOAST_H > +#define DETOAST_H trailing whitespace after "#ifndef DETOAST_H ". > commit 60d51e6510c66f79c51e43fe22730fe050d87854 > Author: Robert Haas > Date: 2019-07-08 12:02:16 -0400 > > Create an API for inserting and deleting rows in TOAST tables. > > This moves much of the non-heap-specific logic from toast_delete and > toast_insert_or_update into a helper functions accessible via a new > header, toast_helper.h. Using the functions in this module, a table > AM can implement creation and deletion of TOAST table rows with > much less code duplication than was possible heretofore. Some > table AMs won't want to use the TOAST logic at all, but for those > that do this will make that easier. > > Discussion: > http://postgr.es/m/CA+TgmoZv-=2iwm4jcw5zhjel18hf96+w1yjeyrngmydkffn...@mail.gmail.com Hm. This leaves toast_insert_or_update() as a name. That makes it sound like it's generic toast code, rather than heap specific? It's definitely nice how a lot of repetitive code has been deduplicated. Also makes it easier to see how algorithmically expensive toast_insert_or_update() is :(. > /* >* Second we look for attributes of attstorage 'x' or 'e' that are still >* inline, and make them external. 
But skip this if there's no toast >* table to push them to. >*/ > while (heap_compute_data_size(tupleDesc, > toast_values, > toast_isnull) > maxDataLen && > rel->rd_rel->reltoastrelid != InvalidOid) Shouldn't this condition be the other way round? > if (for_compression) > skip_colflags |= TOASTCOL_INCOMPRESSIBLE; > > for (i = 0; i < numAttrs; i++) > { > Form_pg_attribute att = TupleDescAttr(tupleDesc, i); > > if ((ttc->ttc_attr[i].tai_colflags & skip_colflags) != 0) > continue; > if (VARATT_IS_EXTERNAL(DatumGetPointer(ttc->ttc_values[i]))) > continue; /* can't happen, > toast_action would be 'p' */
Re: jsonb_plperl bug
Ivan Panchenko writes: > I have found a bug in jsonb_plperl extension. A possible fix is proposed > below. > ... > + /* SVt_PV without POK flag is also NULL */ > + if(SvTYPE(in) == SVt_PV) Ugh. Doesn't Perl provide some saner way to determine the type of a SV? The core code seems to think that SvOK() is a sufficient test for an undef. Should we be doing that before the switch, perhaps? (My underlying concern here is mostly about whether we have other similar bugs. There are a lot of places checking SvTYPE.) regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On 8/3/19 7:27 AM, Tom Lane wrote: Tomas Vondra writes: There seems to be a consensus that this this not a pg_basebackup issue (i.e. duplicate values don't make the file invalid), and it should be handled in ALTER SYSTEM. Yeah. I doubt pg_basebackup is the only actor that can create such situations. The proposal seems to be to run through the .auto.conf file, remove any duplicates, and append the new entry at the end. That seems reasonable. +1 There was a discussion whether to print warnings about the duplicates. I personally see not much point in doing that - if we consider duplicates to be expected, and if ALTER SYSTEM has the license to rework the config file any way it wants, why warn about it? Personally I agree that warnings are unnecessary. Having played around with the pg.auto.conf stuff for a while, my feeling is that ALTER SYSTEM does indeed have a license to rewrite it (which is what currently happens anyway, with comments and include directives [1] being silently removed), so it seems reasonable to remove duplicate entries and ensure the correct one is processed. [1] surprisingly, any include directives present are honoured, which seems crazy to me, see: https://www.postgresql.org/message-id/flat/8c8bcbca-3bd9-dc6e-8986-04a5abdef142%402ndquadrant.com The main issue however is that no code was written yet. Seems like it ought to be relatively simple ... but I didn't look. The patch I originally sent does exactly this. The thread then drifted off into a discussion about providing ways for applications to properly write to pg.auto.conf while PostgreSQL is not running; I have a patch for that which I can submit later (though it is a thing of considerable ugliness). Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote: > Tomas Vondra writes: > > There seems to be a consensus that this this not a pg_basebackup issue > > (i.e. duplicate values don't make the file invalid), and it should be > > handled in ALTER SYSTEM. > > Yeah. I doubt pg_basebackup is the only actor that can create such > situations. > > > The proposal seems to be to run through the .auto.conf file, remove any > > duplicates, and append the new entry at the end. That seems reasonable. > > +1 I disagree that this should only be addressed in alter system, as I’ve said before and as others have agreed with. Having one set of code that can be used to update parameters in the auto.conf and then have that be used by pg_basebackup, alter system, and external tools, is the right approach. The idea that alter system should be the only thing that doesn’t just append changes to the file is just going to lead to confusion and bugs down the road. As I said before, an alternative could be to make alter system simply always append and declare that to be the way to update parameters in the auto.conf. > There was a discussion whether to print warnings about the duplicates. I > > personally see not much point in doing that - if we consider duplicates > > to be expected, and if ALTER SYSTEM has the license to rework the config > > file any way it wants, why warn about it? > > Personally I agree that warnings are unnecessary. And at least Magnus and I disagree with that, as I recall from this thread. Let’s have a clean and clear way to modify the auto.conf and have everything that touches the file update it in a consistent way. Thanks, Stephen
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Tomas Vondra writes: > There seems to be a consensus that this this not a pg_basebackup issue > (i.e. duplicate values don't make the file invalid), and it should be > handled in ALTER SYSTEM. Yeah. I doubt pg_basebackup is the only actor that can create such situations. > The proposal seems to be to run through the .auto.conf file, remove any > duplicates, and append the new entry at the end. That seems reasonable. +1 > There was a discussion whether to print warnings about the duplicates. I > personally see not much point in doing that - if we consider duplicates > to be expected, and if ALTER SYSTEM has the license to rework the config > file any way it wants, why warn about it? Personally I agree that warnings are unnecessary. > The main issue however is that no code was written yet. Seems like it ought to be relatively simple ... but I didn't look. regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi, This thread discusses an issue that's tracked as an open item for pg12, but it's been quiet for the last ~1 month. I think it's probably time to decide what to do with it. The thread is a bit long, so let me sum up what the issue is and what options we have. The problem is that ALTER SYSTEM does not handle duplicate entries in the postgresql.auto.conf file correctly: it simply modifies the first item, but the value is then overridden by the duplicate items. This contradicts the idea that duplicate GUCs are allowed and that we should use the last item. This bug seems to have existed since ALTER SYSTEM was introduced, so it's not a clear PG12 item. But it was made more prominent by the removal of recovery.conf in PG12, because pg_basebackup now appends stuff to postgresql.auto.conf and may easily create duplicate items. There seems to be a consensus that this is not a pg_basebackup issue (i.e. duplicate values don't make the file invalid), and it should be handled in ALTER SYSTEM. The proposal seems to be to run through the .auto.conf file, remove any duplicates, and append the new entry at the end. That seems reasonable. There was a discussion whether to print warnings about the duplicates. I personally see not much point in doing that - if we consider duplicates to be expected, and if ALTER SYSTEM has the license to rework the config file any way it wants, why warn about it? The main issue however is that no code was written yet. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
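[Editor's sketch] The proposed behaviour — run through postgresql.auto.conf, drop duplicate entries, and append the new setting last so that "last entry wins" semantics hold — can be sketched as follows. This is an illustrative Python sketch, not the actual C implementation in guc.c, and the simplistic `name = value` parsing is an assumption:

```python
def rewrite_auto_conf(lines, name, value):
    """Deduplicate existing entries (keeping the last occurrence of each,
    matching GUC last-wins semantics), then append the new setting at the
    end so it takes effect."""
    settings = {}  # dict preserves insertion order; re-insert moves to end
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # drop comments, much as ALTER SYSTEM already does
        key, _, val = line.partition("=")
        key = key.strip()
        settings.pop(key, None)   # remove earlier duplicate, if any
        settings[key] = val.strip()
    settings.pop(name, None)
    settings[name] = f"'{value}'"
    return [f"{k} = {v}" for k, v in settings.items()]

lines = [
    "work_mem = '4MB'",
    "primary_conninfo = 'host=old'",
    "primary_conninfo = 'host=new'",  # duplicate appended by an external tool
]
print(rewrite_auto_conf(lines, "work_mem", "8MB"))
```

Note how the duplicate `primary_conninfo` collapses to its last value, so an external tool that merely appends to the file stays compatible with this rewrite.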
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
Peter Geoghegan writes: > Attached is v3, which implements your suggestion, generating output > like the above. I haven't written a line of Perl in my life prior to > today, so basic code review would be helpful. The "if ($oid > $prev_oid + 2)" test seems unnecessary. It's certainly wrong to keep iterating beyond the first oid that's > $suggestion. Perhaps you meant to go back and try a different suggestion if there's not at least 2 free OIDs? But then there needs to be an outer loop around both of these loops. regards, tom lane
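[Editor's sketch] The intended behaviour — take a random suggestion in the 8000-8999 range, then return the first run of consecutive free OIDs at or after it, without iterating past that first run — can be sketched like this. This is a Python illustration of the algorithm under discussion, not the actual Perl in the unused_oids script:

```python
import random

def suggest_oid_block(used, n_needed, lo=8000, hi=8999, rng=random):
    """Pick a random suggestion in [lo, hi], then scan forward for the
    first run of n_needed consecutive unused OIDs starting at or after it.
    Stop at the first suitable run -- iterating beyond it would defeat
    the point of the random starting position."""
    suggestion = rng.randint(lo, hi)
    oid = suggestion
    while oid + n_needed - 1 <= hi:
        run = [oid + i for i in range(n_needed)]
        if not any(o in used for o in run):
            return run
        oid += 1
    return None  # no room above the suggestion; a caller could retry

used = {8000, 8001, 8005}
print(suggest_oid_block(used, 2))
```

For simplicity this sketch does not wrap around below the suggestion when the top of the range is full; a retry loop (or Tom's "outer loop" suggestion) would handle that case.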
Re: Optimize single tuple fetch from nbtree index
Hi Tom, Thanks for your quick reply! > Regardless of whether there's actually a LIMIT 1? That seems disastrous > for every other case than the narrow one where the optimization wins. > Because every other case is now going to have to touch the index page > twice. That's more CPU and about double the contention --- if you could > not measure any degradation from that, you're not measuring the right > thing. I thought the same as well at first. Note that I did measure degradation of 2-3% as mentioned on some cases, but initially I also expected worse. Do you have any ideas on cases that would suffer the most? I thought the tight inner nested loop that I posted in my performance tests would have this index lookup as bottleneck. I know they are the bottleneck for the LIMIT 1 query (because these improve by a factor 2-3 with the patch). And my theory is that for a LIMIT 3, the price paid for this optimization is highest, because it would touch the page twice and read all items from it, while only returning three of them. > In principle, you could pass down knowledge of whether there's a LIMIT, > using the same mechanism used to enable top-N sorting. But it'd have to > also go through the AM interface layer, so I'm not sure how messy this > would be. This was an idea I had as well and I would be willing to implement such a thing if this is deemed interesting enough by the community. However, I didn't want to do this for the first version of this patch, as it would be quite some extra work, which would be useless if the idea of the patch itself gets rejected already. :-) I'd appreciate any pointers in the right direction - I can take a look at how top-N sorting pushes the LIMIT down. Given enough interest for the basic idea of this patch, I will implement it. >> This calls _bt_first in nbtsearch.c, which will, if there are scankeys, >> descend the tree to a leaf page and read just the first (or possibly two) >> tuples. It won't touch the rest of the page yet. 
If indeed just one tuple >> was required, there won't be a call to _bt_next and we're done. If we do >> need more than one tuple, _bt_next will resume reading tuples from the index >> page at the point where we left off. > How do you know how many index entries you have to fetch to get a tuple that's live/visible to the query? Indeed we don't know that - that's why this initial patch does not make any assumptions about this and just assumes the good-weather scenario that everything is visible. I'm not sure if it's possible to give an estimation of this and whether or not that would be useful. Currently, if it turns out that the tuple is not visible, there'll just be another call to _bt_next again which will resume reading the page as normal. I'm open to implement any suggestions that may improve this. >> - We need to take into account page splits, insertions and vacuums while we >> do not have the read-lock in between _bt_first and the first call to >> _bt_next. This made my patch quite a bit more complicated than my initial >> implementation. > Meh. I think the odds that you got this 100% right are small, and the > odds that it would be maintainable are smaller. There's too much that > can happen if you're not holding any lock --- and there's a lot of active > work on btree indexes, which could break whatever assumptions you might > make today. Agreed, which is also why I posted this initial version of the patch here already, to get some input from the experts on this topic what assumptions can be made now and in the future. If it turns out that it's completely not feasible to do an optimization like this, because of other constraints in the btree implementation, then we're done pretty quickly here. :-) For what it's worth: the patch at least passes make check consistently - I caught a lot of these edge cases related to page splits and insertions while running the regression tests, which runs the modified bits of code quite often and in parallel. 
There may be plenty of edge cases left however... > I'm not unalterably opposed to doing something like this, but my sense > is that the complexity and probable negative performance impact on other > cases are not going to look like a good trade-off for optimizing the > case at hand. I do think it could be a big win if we could get something like this working. Cases with a LIMIT seem common enough to me to make it possible to add some extra optimizations, especially if that could lead to 2-3x the TPS for these kind of queries. However, it indeed needs to be within a reasonable complexity. If it turns out that in order for us to optimize this, we need to add a lot of extra complexity, it may not be worth it to add it. > BTW, you haven't even really made the case that optimizing a query that > behaves this way is the right thing to be doing ... maybe some other > plan shape that isn't a nestloop around a LIMIT query would be a better > solution. It is prett
jsonb_plperl bug
Hi, I have found a bug in the jsonb_plperl extension. A possible fix is proposed below. jsonb_plperl is the contrib module which defines TRANSFORM functions for the jsonb data type and the PL/Perl procedural language. The bug can be reproduced as follows: CREATE EXTENSION plperl; CREATE EXTENSION jsonb_plperl; CREATE OR REPLACE FUNCTION text2jsonb (text) RETURNS jsonb LANGUAGE plperl TRANSFORM FOR TYPE jsonb AS $$ my $x = shift; my $ret = {a=>$x}; return $ret; $$; SELECT text2jsonb(NULL); SELECT text2jsonb('11'); SELECT text2jsonb(NULL); The last SELECT produces a strange error: ERROR: cannot transform this Perl type to jsonb Brief research has shown that the problem is incomplete logic inside the transform function. The reason can be illustrated by the following Perl one-liner: perl -MDevel::Peek -e 'sub x { my $x = shift; Dump $x; warn "\n\n"; }; x(undef); x("a"); x(undef); ' It outputs: SV = NULL(0x0) at 0x73a1b8 REFCNT = 1 FLAGS = (PADMY) SV = PV(0x71da50) at 0x73a1b8 REFCNT = 1 FLAGS = (PADMY,POK,pPOK) PV = 0x7409a0 "a"\0 CUR = 1 LEN = 16 SV = PV(0x71da50) at 0x73a1b8 REFCNT = 1 FLAGS = (PADMY) PV = 0x7409a0 "a"\0 CUR = 1 LEN = 16 This shows that the internal representation of the same undef in Perl differs between the first and third function calls: Perl reuses the lexical variable, probably for optimization reasons. The current jsonb_plperl implementation works well for the first (most evident) case, but does not work at all for the third, which results in the abovementioned error. The attached patch solves this issue and defines corresponding tests. 
Regards, Ivan diff --git a/contrib/jsonb_plperl/expected/jsonb_plperl.out b/contrib/jsonb_plperl/expected/jsonb_plperl.out index 6dc090a..b784ca1 100644 --- a/contrib/jsonb_plperl/expected/jsonb_plperl.out +++ b/contrib/jsonb_plperl/expected/jsonb_plperl.out @@ -228,6 +228,31 @@ SELECT roundtrip('{"1": {"2": [3, 4, 5]}, "2": 3}', 'HASH'); {"1": {"2": [3, 4, 5]}, "2": 3} (1 row) +CREATE FUNCTION text2jsonb (text) RETURNS jsonb +LANGUAGE plperl +TRANSFORM FOR TYPE jsonb +AS $$ +my $x = shift; +return {a=>$x}; +$$; +SELECT text2jsonb(NULL); + text2jsonb +- + {"a": null} +(1 row) + +SELECT text2jsonb('11'); + text2jsonb +- + {"a": "11"} +(1 row) + +SELECT text2jsonb(NULL); + text2jsonb +- + {"a": null} +(1 row) + \set VERBOSITY terse \\ -- suppress cascade details DROP EXTENSION plperl CASCADE; -NOTICE: drop cascades to 7 other objects +NOTICE: drop cascades to 8 other objects diff --git a/contrib/jsonb_plperl/expected/jsonb_plperlu.out b/contrib/jsonb_plperl/expected/jsonb_plperlu.out index 434327b..7fe0721 100644 --- a/contrib/jsonb_plperl/expected/jsonb_plperlu.out +++ b/contrib/jsonb_plperl/expected/jsonb_plperlu.out @@ -255,6 +255,31 @@ INFO: $VAR1 = {'1' => {'2' => ['3','4','5']},'2' => '3'}; {"1": {"2": [3, 4, 5]}, "2": 3} (1 row) +CREATE FUNCTION text2jsonb (text) RETURNS jsonb +LANGUAGE plperlu +TRANSFORM FOR TYPE jsonb +AS $$ +my $x = shift; +return {a=>$x}; +$$; +SELECT text2jsonb(NULL); + text2jsonb +- + {"a": null} +(1 row) + +SELECT text2jsonb('11'); + text2jsonb +- + {"a": "11"} +(1 row) + +SELECT text2jsonb(NULL); + text2jsonb +- + {"a": null} +(1 row) + \set VERBOSITY terse \\ -- suppress cascade details DROP EXTENSION plperlu CASCADE; -NOTICE: drop cascades to 7 other objects +NOTICE: drop cascades to 8 other objects diff --git a/contrib/jsonb_plperl/jsonb_plperl.c b/contrib/jsonb_plperl/jsonb_plperl.c index 79c5f57..5244ab7 100644 --- a/contrib/jsonb_plperl/jsonb_plperl.c +++ b/contrib/jsonb_plperl/jsonb_plperl.c @@ -257,6 +257,12 @@ 
SV_to_JsonbValue(SV *in, JsonbParseState **jsonb_state, bool is_elem) } else { +/* SVt_PV without POK flag is also NULL */ +if(SvTYPE(in) == SVt_PV) +{ + out.type = jbvNull; + break; +} /* * XXX It might be nice if we could include the Perl type in * the error message. diff --git a/contrib/jsonb_plperl/sql/jsonb_plperl.sql b/contrib/jsonb_plperl/sql/jsonb_plperl.sql index 8b062df..622141b 100644 --- a/contrib/jsonb_plperl/sql/jsonb_plperl.sql +++ b/contrib/jsonb_plperl/sql/jsonb_plperl.sql @@ -99,6 +99,17 @@ SELECT roundtrip('{"1": "string1"}', 'HASH'); SELECT roundtrip('{"1": {"2": [3, 4, 5]}, "2": 3}', 'HASH'); +CREATE FUNCTION text2jsonb (text) RETURNS jsonb +LANGUAGE plperl +TRANSFORM FOR TYPE jsonb +AS $$ +my $x = shift; +return {a=>$x}; +$$; + +SELECT text2jsonb(NULL); +SELECT text2jsonb('11'); +SELECT text2jsonb(NULL); \set VERBOSITY terse \\ -- suppress cascade details DROP EXTENSION plperl CASCADE; diff --git a/contrib/jsonb_plperl/sql/jsonb_plperlu.sql b/contrib/jsonb_plperl/sql/jsonb_plperlu.sql index 8d8e841..9981c37 100644 --- a/contrib/jsonb_plperl/sql/jsonb_plperlu.sql +++ b/contrib/jsonb_plperl/sql
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 2:52 PM Isaac Morland wrote: > Noob question here: why not start with the next unused OID in the range, and > on the other hand reserve the range for sequentially-assigned values? The general idea is to avoid OID collisions while a patch is under development. Choosing a value that aligns nicely with already-allocated OIDs makes these collisions much more likely, which commit a6417078 addressed back in March. We want a random choice among patches, but OIDs used within a patch should be consecutive. (There is still some chance of a collision, but you have to be fairly unlucky to have that happen under the system introduced by commit a6417078.) It's probably the case that most patches that create a new pg_proc entry only create one. The question of consecutive OIDs only comes up with a fairly small number of patches. -- Peter Geoghegan
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, 2 Aug 2019 at 16:49, Tom Lane wrote: > Peter Geoghegan writes: > > I've taken your patch, and changed the wording a bit. I think that > > it's worth being a bit more explicit. The attached revision produces > > output that looks like this: > > > Patches should use a more-or-less consecutive range of OIDs. > > Best practice is to make a random choice in the range 8000-. > > Suggested random unused OID: 9099 > Noob question here: why not start with the next unused OID in the range, and on the other hand reserve the range for sequentially-assigned values?
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 1:49 PM Tom Lane wrote: > Maybe s/make a/start with/ ? > Also, once people start doing this, it'd be unfriendly to suggest > 9099 if 9100 is already committed. There should be some attention > to *how many* consecutive free OIDs will be available if one starts > at the suggestion. How about this wording?: Patches should use a more-or-less consecutive range of OIDs. Best practice is to start with a random choice in the range 8000-. Suggested random unused OID: 9591 (409 consecutive OID(s) available starting here) Attached is v3, which implements your suggestion, generating output like the above. I haven't written a line of Perl in my life prior to today, so basic code review would be helpful. -- Peter Geoghegan v3-0001-unused_oids-suggestion.patch Description: Binary data
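The selection rule the v3 patch describes can be sketched in a few lines of Python (suggest_oid is an invented name, and the 8000-9999 bounds are assumed from the suggested OIDs shown in this thread, not taken from the patch itself):

```python
import random

def suggest_oid(used, lo=8000, hi=9999, seed=None):
    """Pick a random unused OID in [lo, hi] and report how many
    consecutive OIDs are free starting from it."""
    rng = random.Random(seed)
    free = [oid for oid in range(lo, hi + 1) if oid not in used]
    start = rng.choice(free)
    # Count the run of consecutive free OIDs beginning at the suggestion.
    run = 0
    while start + run <= hi and (start + run) not in used:
        run += 1
    return start, run
```

Running it again with a different seed gives a different random starting point, matching the "just run it again for a different suggestion" behaviour discussed in this thread.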
Re: partition routing layering in nodeModifyTable.c
Hi, On 2019-08-01 18:38:09 +0900, Etsuro Fujita wrote: > On Thu, Aug 1, 2019 at 10:33 AM Amit Langote wrote: > > If it's the approach of adding a resultRelation Index field to > > ForeignScan node, I tried and had to give up, realizing that we don't > > maintain ResultRelInfos in an array that is indexable by RT indexes. > > It would've worked if es_result_relations had mirrored es_range_table, > > although that probably complicates how the individual ModifyTable > > nodes attach to that array. We know at plan time what the resultRelation offset for a ModifyTable node is. We just need to transport that to the respective foreign scan node, and update it properly in setrefs? Then we can index es_result_relations without any additional mapping? Maybe I'm missing something? I think all we need to do is to have setrefs.c:set_plan_refs() iterate over ->fdwDirectModifyPlans or such, and set the respective node's result_relation_offset or whatever we're naming it to splan->resultRelIndex + offset from fdwDirectModifyPlans? > > In any case, given this discussion, further hacking on a global > > variable like es_result_relations may be a course we might not want > > to pursue. I don't think es_result_relations really is a problem - it doesn't have to change while processing individual subplans / partitions / whatnot. If we needed a mapping between rtis and result indexes, I'd not see a problem. Doubtful it's needed though. There's a fundamental difference between EState->es_result_relations and EState->es_result_relation_info. The former stays static during the whole query once initialized, whereas es_result_relation_info changes depending on which relation we're processing. The latter is what makes the code more complicated, because we cannot ever return early etc. Similarly, ModifyTableState->mt_per_subplan_tupconv_maps is not a problem, it stays static, but e.g. 
mtstate->mt_transition_capture is a problem, because we have to change for each subplan / routing / partition movement. Greetings, Andres Freund
Re: [PROPOSAL] Temporal query processing with range types
Hi, I have rebased the patch and am currently reviewing it on master (1e2fddfa33d3c7cc93ca3ee0f32852699bd3e012). On Mon, Jul 1, 2019 at 4:45 PM Thomas Munro wrote: > On Wed, Apr 3, 2019 at 2:12 AM Ibrar Ahmed wrote: > > I started looking at the patch; there are a couple of problems with the > patch. The first one is the OID conflict, which I fixed on my machine. The > second problem is an assertion failure. I think you have not compiled the > PostgreSQL code with assertions enabled. > > Hi Peter, > > Looks like there was some good feedback for this WIP project last time > around. It's currently in "Needs Review" status in the July > Commitfest. To encourage more review and see some automated compile > and test results, could we please have a fresh rebase? The earlier > patches no longer apply. > > Thanks, > > -- > Thomas Munro > https://enterprisedb.com > -- Ibrar Ahmed 001_temporal_query_processing_with_range_types_v4.patch Description: Binary data
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
Peter Geoghegan writes: > I've taken your patch, and changed the wording a bit. I think that > it's worth being a bit more explicit. The attached revision produces > output that looks like this: > Patches should use a more-or-less consecutive range of OIDs. > Best practice is to make a random choice in the range 8000-. > Suggested random unused OID: 9099 Maybe s/make a/start with/ ? Also, once people start doing this, it'd be unfriendly to suggest 9099 if 9100 is already committed. There should be some attention to *how many* consecutive free OIDs will be available if one starts at the suggestion. You could perhaps print "9099 (42 OIDs available starting here)", and if the user doesn't like the amount of headroom in that, they could just run it again for a different suggestion. regards, tom lane
Re: Optimize single tuple fetch from nbtree index
Floris Van Nee writes: > Every time the index scan is done, all tuples from the leaf page are > read in nbtsearch.c:_bt_readpage. The idea of this patch is to make an > exception for this *only* the first time amgettuple gets called. Regardless of whether there's actually a LIMIT 1? That seems disastrous for every other case than the narrow one where the optimization wins. Because every other case is now going to have to touch the index page twice. That's more CPU and about double the contention --- if you could not measure any degradation from that, you're not measuring the right thing. In principle, you could pass down knowledge of whether there's a LIMIT, using the same mechanism used to enable top-N sorting. But it'd have to also go through the AM interface layer, so I'm not sure how messy this would be. > This calls _bt_first in nbtsearch.c, which will, if there are scankeys, > descend the tree to a leaf page and read just the first (or possibly two) > tuples. It won't touch the rest of the page yet. If indeed just one tuple was > required, there won't be a call to _bt_next and we're done. If we do need > more than one tuple, _bt_next will resume reading tuples from the index page > at the point where we left off. How do you know how many index entries you have to fetch to get a tuple that's live/visible to the query? > - We need to take into account page splits, insertions and vacuums while we > do not have the read-lock in between _bt_first and the first call to > _bt_next. This made my patch quite a bit more complicated than my initial > implementation. Meh. I think the odds that you got this 100% right are small, and the odds that it would be maintainable are smaller. There's too much that can happen if you're not holding any lock --- and there's a lot of active work on btree indexes, which could break whatever assumptions you might make today. 
I'm not unalterably opposed to doing something like this, but my sense is that the complexity and probable negative performance impact on other cases are not going to look like a good trade-off for optimizing the case at hand. BTW, you haven't even really made the case that optimizing a query that behaves this way is the right thing to be doing ... maybe some other plan shape that isn't a nestloop around a LIMIT query would be a better solution. regards, tom lane
Re: partition routing layering in nodeModifyTable.c
Hi, On 2019-08-03 05:20:35 +0900, Etsuro Fujita wrote: > On Sat, Aug 3, 2019 at 3:01 AM Andres Freund wrote: > > On 2019-07-31 17:04:38 +0900, Amit Langote wrote: > > > I looked into trying to do the things I mentioned above and it seems > > > to me that revising BeginDirectModify()'s API to receive the > > > ResultRelInfo directly as Andres suggested might be the best way > > > forward. I've implemented that in the attached 0001. > > > Fujita-san, do you have any comments on the FDW API change? Or anybody > > else? > > > > I'm a bit worried about the move of BeginDirectModify() into > > nodeModifyTable.c - it just seems like an odd control flow to me. Not > > allowing any intermediate nodes between ForeignScan and ModifyTable also > > seems like an undesirable restriction for the future. I realize that we > > already do that for BeginForeignModify() (just btw, that already accepts > > resultRelInfo as a parameter, so being symmetrical for BeginDirectModify > > makes sense), but it still seems like the wrong direction to me. > > > > The need for that move, I assume, comes from needing to know the correct > > ResultRelInfo, correct? I wonder if we shouldn't instead determine that > > at plan time (in setrefs.c), somewhat similar to how we determine > > ModifyTable.resultRelIndex. Doesn't look like that'd be too hard? > > I'd vote for that; I created a patch for that [1]. > > [1] > https://www.postgresql.org/message-id/CAPmGK15%3DoFHmWNND5vopfokSGfn6jMXVvnHa7K7P49F7k1hWPQ%40mail.gmail.com Oh, missed that. But that's not quite what I'm proposing. I don't like ExecFindResultRelInfo at all. What's the point of it? Its introduction is still an API break - I don't understand why that break is better than just passing the ResultRelInfo directly to BeginDirectModify()? I want to again remark that BeginForeignModify() does get the ResultRelInfo - it should have been done the same when adding direct modify. 
Even if you need the loop - which I don't think is right - it should live somewhere that individual FDWs don't have to care about. - Andres
Re: SQL:2011 PERIODS vs Postgres Ranges?
The patch does not work. postgres=# CREATE TABLE foo (id int,r int4range, valid_at tsrange, CONSTRAINT bar_pk PRIMARY KEY (r, valid_at WITHOUT OVERLAPS)); CREATE TABLE postgres=# CREATE TABLE bar (id int,r int4range, valid_at tsrange, CONSTRAINT bar_fk FOREIGN KEY (r, PERIOD valid_at) REFERENCES foo); ERROR: cache lookup failed for type 0 The new status of this patch is: Waiting on Author
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 20:12, Peter Geoghegan wrote: > On Fri, Aug 2, 2019 at 1:42 AM Julien Rouhaud wrote: > > Trivial patch for that attached. > > Thanks! > > > The output is now like: > > > > [...] > > Using an oid in the 8000- range is recommended. > > For instance: 9427 > > > > (checking that the suggested random oid is not used yet.) > > I've taken your patch, and changed the wording a bit. I think that > it's worth being a bit more explicit. The attached revision produces > output that looks like this: > > Patches should use a more-or-less consecutive range of OIDs. > Best practice is to make a random choice in the range 8000-. > Suggested random unused OID: 9099 > > I would like to push this patch shortly. How do people feel about this > wording? (It's based on the documentation added by commit a6417078.) > I'm fine with it! >
First draft of back-branch release notes is done
See https://git.postgresql.org/pg/commitdiff/082c9f5f761ced18a6f014f2638096f6a8228164 Please send comments/corrections before Sunday. regards, tom lane
Re: partition routing layering in nodeModifyTable.c
Hi, On Sat, Aug 3, 2019 at 3:01 AM Andres Freund wrote: > On 2019-07-31 17:04:38 +0900, Amit Langote wrote: > > I looked into trying to do the things I mentioned above and it seems > > to me that revising BeginDirectModify()'s API to receive the > > ResultRelInfo directly as Andres suggested might be the best way > > forward. I've implemented that in the attached 0001. > Fujita-san, do you have any comments on the FDW API change? Or anybody > else? > > I'm a bit worried about the move of BeginDirectModify() into > nodeModifyTable.c - it just seems like an odd control flow to me. Not > allowing any intermediate nodes between ForeignScan and ModifyTable also > seems like an undesirable restriction for the future. I realize that we > already do that for BeginForeignModify() (just btw, that already accepts > resultRelInfo as a parameter, so being symmetrical for BeginDirectModify > makes sense), but it still seems like the wrong direction to me. > > The need for that move, I assume, comes from needing to know the correct > ResultRelInfo, correct? I wonder if we shouldn't instead determine that > at plan time (in setrefs.c), somewhat similar to how we determine > ModifyTable.resultRelIndex. Doesn't look like that'd be too hard? I'd vote for that; I created a patch for that [1]. > Then we could just have BeginForeignModify, BeginDirectModify, > BeginForeignScan all be called from ExecInitForeignScan(). I think so too. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAPmGK15%3DoFHmWNND5vopfokSGfn6jMXVvnHa7K7P49F7k1hWPQ%40mail.gmail.com
Re: using index or check in ALTER TABLE SET NOT NULL
Tomas Vondra writes: > I think there's a consensus to change INFO to DEBUG1 in pg12, and then > maybe implement something like VERBOSE mode in the future. Objections? ISTM the consensus is "we'd rather reduce the verbosity, but we don't want to give up test coverage". So what's blocking this is lack of a patch to show that there's another way to verify what code path was taken. > As for the reduction of test coverage, can't we deduce whether a > constraint was used from data in pg_stats or something like that? Not sure how exactly ... and we've already learned that pg_stats isn't too reliable. regards, tom lane
Re: using index or check in ALTER TABLE SET NOT NULL
On Wed, Jun 12, 2019 at 08:34:57AM +1200, David Rowley wrote: On Tue, 11 Jun 2019 at 03:35, Sergei Kornilov wrote: > Does anyone think we shouldn't change the INFO message in ATTACH > PARTITION to a DEBUG1 in PG12? Seems no one wants to vote against this change. I'm concerned about two things: 1. The patch reduces the test coverage of ATTACH PARTITION. We now have no way to ensure the constraint was used to validate the rows in the partition. 2. We're inconsistent with what we do for SET NOT NULL and ATTACH PARTITION. We raise an INFO message when we use a constraint for ATTACH PARTITION and only a DEBUG1 for SET NOT NULL. Unfortunately, my two concerns conflict with each other. We're getting close to beta3/rc1, and this thread was idle for ~1 month. I think there's a consensus to change INFO to DEBUG1 in pg12, and then maybe implement something like VERBOSE mode in the future. Objections? As for the reduction of test coverage, can't we deduce whether a constraint was used from data in pg_stats or something like that? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pglz performance
On Fri, Aug 02, 2019 at 11:20:03AM -0700, Andres Freund wrote: Hi, On 2019-08-02 19:52:39 +0200, Tomas Vondra wrote: Hmmm, I don't remember the details of those patches so I didn't realize it allows incremental recompression. If that's possible, that would mean existing systems can start using it. Which is good. That depends on what you mean by "incremental"? A single toasted datum can only have one compression type, because we only update them all in one anyway. But different datums can be compressed differently. I meant different toast values using different compression algorithms, sorry for the confusion. Another question is whether we'd actually want to include the code in core directly, or use system libraries (and if some packagers might decide to disable that, for whatever reason). I'd personally say we should have an included version, and a --with-system-... flag that uses the system one. OK. I'd say to require a system library, but that's a minor detail. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgbench - implement strict TPC-B benchmark
Hi, On 2019-08-02 10:34:24 +0200, Fabien COELHO wrote: > > Hello Andres, > > Thanks a lot for these feedbacks and comments. > > > Using pgbench -Mprepared -n -c 8 -j 8 -S pgbench_100 -T 10 -r -P1 > > e.g. shows pgbench to use 189% CPU in my 4/8 core/thread laptop. That's > > a pretty significant share. > > Fine, but what is the corresponding server load? 211%? 611%? And what actual > time is spent in pgbench itself, vs libpq and syscalls? System wide pgbench, including libpq, is about 22% of the whole system. As far as I can tell there's a number of things that are wrong: - prepared statement names are recomputed for every query execution - variable name lookup is done for every command, rather than once, when parsing commands - a lot of string->int->string type back and forths > Conclusion: pgbench-specific overheads are typically (much) below 10% of the > total client-side cpu cost of a transaction, while over 90% of the cpu cost > is spent in libpq and system, for the worst case do-nothing query. I don't buy that that's the actual worst case, or even remotely close to it. I e.g. see higher pgbench overhead for the *modify* case than for the pgbench's readonly case. And that's because some of the meta commands are slow, in particular everything related to variables. And the modify case just has more variables. > > > + 12.35% pgbench pgbench[.] threadRun > > +3.54% pgbench pgbench[.] dopr.constprop.0 > > +3.30% pgbench pgbench[.] fmtint > > +1.93% pgbench pgbench[.] getVariable > > ~ 21%, probably some inlining has been performed, because I would have > expected to see significant time in "advanceConnectionState". Yea, there's plenty inlining. Note dopr() is string processing. > > +2.95% pgbench libpq.so.5.13 [.] PQsendQueryPrepared > > +2.15% pgbench libpq.so.5.13 [.] pqPutInt > > +4.47% pgbench libpq.so.5.13 [.] pqParseInput3 > > +1.66% pgbench libpq.so.5.13 [.] pqPutMsgStart > > +1.63% pgbench libpq.so.5.13 [.] pqGetInt > > ~ 13% A lot of that is really stupid. 
We need to improve libpq. That PQsendQueryGuts (attributed to PQsendQueryPrepared here) builds the command with many separate pqPut* calls, which reside in another translation unit, is pretty sad. > > +3.16% pgbench libc-2.28.so [.] __strcmp_avx2 > > +2.95% pgbench libc-2.28.so [.] malloc > > +1.85% pgbench libc-2.28.so [.] ppoll > > +1.85% pgbench libc-2.28.so [.] __strlen_avx2 > > +1.85% pgbench libpthread-2.28.so [.] __libc_recv > > ~ 11%, str is a pain… Not sure who is calling though, pgbench or > libpq. Both. Most of the strcmp is from getQueryParams()/getVariable(). The dopr() is from pg_*printf, which is mostly attributable to preparedStatementName() and getVariable(). > This is basically 47% pgbench, 53% lib*, on the sample provided. I'm unclear > about where system time is measured. It was excluded in this profile, both to reduce profiling costs, and to focus on pgbench. Greetings, Andres Freund
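Two of the overheads listed earlier in this mail, recomputing prepared-statement names and looking up variables by name on every execution, are both fixable by doing the work once at parse time. A toy Python model of that refactoring (Script, Command, and the slot scheme are invented for illustration; the real code is pgbench's C):

```python
class Script:
    """Toy model: resolve variable names to integer slots once, at parse
    time, so the per-execution hot path never does a string lookup."""

    def __init__(self):
        self.var_names = []    # parallel arrays: slot -> name
        self.var_values = []   # slot -> value

    def define_var(self, name, value):
        self.var_names.append(name)
        self.var_values.append(value)
        return len(self.var_names) - 1        # slot index, cacheable

    def resolve(self, name):
        # O(n) string search, done once per command at parse time.
        return self.var_names.index(name)

    def get(self, slot):
        # O(1), what the per-transaction hot path should use.
        return self.var_values[slot]


class Command:
    """Toy model: the prepared-statement name is formatted once when the
    command is parsed, not on every execution."""

    def __init__(self, command_id, sql):
        self.sql = sql
        self.prep_name = "P_%d" % command_id  # cached, never recomputed

    def execute_args(self):
        return (self.prep_name, self.sql)     # hot path: no formatting
```

In pgbench terms this corresponds to resolving names while parsing the script rather than inside the per-transaction state machine, though the mapping here is only a sketch.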
Re: pglz performance
Hi, On 2019-08-02 19:52:39 +0200, Tomas Vondra wrote: > Hmmm, I don't remember the details of those patches so I didn't realize > it allows incremental recompression. If that's possible, that would mean > existing systems can start using it. Which is good. That depends on what you mean by "incremental"? A single toasted datum can only have one compression type, because we only update them all in one anyway. But different datums can be compressed differently. > Another question is whether we'd actually want to include the code in > core directly, or use system libraries (and if some packagers might > decide to disable that, for whatever reason). I'd personally say we should have an included version, and a --with-system-... flag that uses the system one. Greetings, Andres Freund
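The per-datum compression type this gradual switch-over relies on can be modelled simply: each stored value carries a one-byte method tag, so readers can still decompress old pglz values while new writes use the new method. A toy Python sketch (pglz_like, lz4_like, and ToastStore are invented stand-ins, not the actual TOAST format):

```python
# Stand-ins for real compressors; the leading byte is the method tag.
def pglz_like(data):
    return b"P" + data

def lz4_like(data):
    return b"L" + data

DECOMPRESSORS = {
    b"P"[0]: lambda c: c[1:],
    b"L"[0]: lambda c: c[1:],
}

class ToastStore:
    """Toy model: each stored datum carries its own method tag, so a
    store can switch its default compressor without rewriting old rows."""

    def __init__(self, compressor):
        self.compressor = compressor  # method used for new writes
        self.rows = {}

    def write(self, key, data):
        self.rows[key] = self.compressor(data)

    def read(self, key):
        compressed = self.rows[key]
        # Dispatch on the tag, independent of the current default.
        return DECOMPRESSORS[compressed[0]](compressed)
```

Old rows stay readable after the default is flipped, which is the sense in which existing systems could "start using" a new algorithm without a bulk recompression pass.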
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 1:42 AM Julien Rouhaud wrote: > Trivial patch for that attached. Thanks! > The output is now like: > > [...] > Using an oid in the 8000- range is recommended. > For instance: 9427 > > (checking that the suggested random oid is not used yet.) I've taken your patch, and changed the wording a bit. I think that it's worth being a bit more explicit. The attached revision produces output that looks like this: Patches should use a more-or-less consecutive range of OIDs. Best practice is to make a random choice in the range 8000-. Suggested random unused OID: 9099 I would like to push this patch shortly. How do people feel about this wording? (It's based on the documentation added by commit a6417078.) -- Peter Geoghegan v2-0001-unused_oids-suggestion.patch Description: Binary data
Re: partition routing layering in nodeModifyTable.c
Hi, On 2019-07-31 17:04:38 +0900, Amit Langote wrote: > I looked into trying to do the things I mentioned above and it seems > to me that revising BeginDirectModify()'s API to receive the > ResultRelInfo directly as Andres suggested might be the best way > forward. I've implemented that in the attached 0001. Patches that > were previously 0001, 0002, and 0003 are now 0002, 0003, and 0004, > respectively. 0002 is now a patch to "remove" > es_result_relation_info. Thanks! Some minor quibbles aside, the non-FDW patches look good to me. Fujita-san, do you have any comments on the FDW API change? Or anybody else? I'm a bit worried about the move of BeginDirectModify() into nodeModifyTable.c - it just seems like an odd control flow to me. Not allowing any intermediate nodes between ForeignScan and ModifyTable also seems like an undesirable restriction for the future. I realize that we already do that for BeginForeignModify() (just btw, that already accepts resultRelInfo as a parameter, so being symmetrical for BeginDirectModify makes sense), but it still seems like the wrong direction to me. The need for that move, I assume, comes from needing to know the correct ResultRelInfo, correct? I wonder if we shouldn't instead determine that at plan time (in setrefs.c), somewhat similar to how we determine ModifyTable.resultRelIndex. Doesn't look like that'd be too hard? Then we could just have BeginForeignModify, BeginDirectModify, BeginForeignScan all be called from ExecInitForeignScan(). Patch 04 is such a nice improvement. Besides getting rid of a substantial amount of code, it also makes the control flow a lot easier to read. > @@ -4644,9 +4645,7 @@ GetAfterTriggersTableData(Oid relid, CmdType cmdType) > * If there are no triggers in 'trigdesc' that request relevant transition > * tables, then return NULL. > * > - * The resulting object can be passed to the ExecAR* functions. 
The caller > - * should set tcs_map or tcs_original_insert_tuple as appropriate when > dealing > - * with child tables. > + * The resulting object can be passed to the ExecAR* functions. > * > * Note that we copy the flags from a parent table into this struct (rather > * than subsequently using the relation's TriggerDesc directly) so that we > can > @@ -5750,14 +5749,26 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo > *relinfo, >*/ > if (row_trigger && transition_capture != NULL) > { > - TupleTableSlot *original_insert_tuple = > transition_capture->tcs_original_insert_tuple; > - TupleConversionMap *map = transition_capture->tcs_map; > + TupleTableSlot *original_insert_tuple; > + PartitionRoutingInfo *pinfo = relinfo->ri_PartitionInfo; > + TupleConversionMap *map = pinfo ? > + > pinfo->pi_PartitionToRootMap : > + > relinfo->ri_ChildToRootMap; > booldelete_old_table = > transition_capture->tcs_delete_old_table; > boolupdate_old_table = > transition_capture->tcs_update_old_table; > boolupdate_new_table = > transition_capture->tcs_update_new_table; > boolinsert_new_table = > transition_capture->tcs_insert_new_table; > > /* > + * Get the originally inserted tuple from the global variable > and set > + * the latter to NULL because any given tuple must be read only > once. > + * Note that the TransitionCaptureState is shared across many > calls > + * to this function. > + */ > + original_insert_tuple = > transition_capture->tcs_original_insert_tuple; > + transition_capture->tcs_original_insert_tuple = NULL; Maybe I'm missing something, but original_insert_tuple is not a global variable? 
> @@ -888,7 +889,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate, > PartitionTupleRouting *proute, > PartitionDispatch dispatch, > ResultRelInfo *partRelInfo, > - int partidx) > + int partidx, > + bool is_update_result_rel) > { > MemoryContext oldcxt; > PartitionRoutingInfo *partrouteinfo; > @@ -935,10 +937,15 @@ ExecInitRoutingInfo(ModifyTableState *mtstate, > if (mtstate && > (mtstate->mt_transition_capture || > mtstate->mt_oc_transition_capture)) > { > - partrouteinfo->pi_PartitionToRootMap = > - > convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc), > - > RelationGetDescr(partRelInfo->ri_PartitionRoot), > -
Re: pglz performance
On Fri, Aug 02, 2019 at 10:12:58AM -0700, Andres Freund wrote: Hi, On 2019-08-02 19:00:39 +0200, Tomas Vondra wrote: On Fri, Aug 02, 2019 at 09:39:48AM -0700, Andres Freund wrote: > Hi, > > On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote: > > We have some kind of "roadmap" of "extensible pglz". We plan to > > provide an implementation in the November CF. > > I don't understand why it's a good idea to improve the compression side > > of pglz. There are plenty of other people who spent a lot of time > > developing better compression algorithms. > Isn't it beneficial for existing systems, which will be stuck with pglz even if we end up adding other algorithms? Why would they be stuck continuing to *compress* with pglz? As we fully retoast on write anyway we can just gradually switch over to the better algorithm. Decompression speed is another story, of course. Hmmm, I don't remember the details of those patches so I didn't realize it allows incremental recompression. If that's possible, that would mean existing systems can start using it. Which is good. Another question is whether we'd actually want to include the code in core directly, or use system libraries (and if some packagers might decide to disable that, for whatever reason). But yeah, I agree you may have a point about optimizing pglz compression. Here's what I had a few years back: https://www.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de see also https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de I think we should refresh something like that patch, and: - make the compression algorithm GUC an enum, rename - add --with-system-lz4 - obviously refresh the copy of lz4 - drop snappy That's a reasonable plan, I guess. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pglz performance
Hi, On 2019-08-02 19:00:39 +0200, Tomas Vondra wrote: > On Fri, Aug 02, 2019 at 09:39:48AM -0700, Andres Freund wrote: > > Hi, > > > > On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote: > > > We have some kind of "roadmap" of "extensible pglz". We plan to > > > provide an implementation in the November CF. > > > > I don't understand why it's a good idea to improve the compression side > > > of pglz. There are plenty of other people who spent a lot of time > > > developing better compression algorithms. > > > > Isn't it beneficial for existing systems, which will be stuck with pglz > even if we end up adding other algorithms? Why would they be stuck continuing to *compress* with pglz? As we fully retoast on write anyway we can just gradually switch over to the better algorithm. Decompression speed is another story, of course. Here's what I had a few years back: https://www.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de see also https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de I think we should refresh something like that patch, and: - make the compression algorithm GUC an enum, rename - add --with-system-lz4 - obviously refresh the copy of lz4 - drop snappy Greetings, Andres Freund
Re: pglz performance
On Fri, Aug 02, 2019 at 09:39:48AM -0700, Andres Freund wrote: Hi, On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote: We have some kind of "roadmap" of "extensible pglz". We plan to provide an implementation in the November CF. I don't understand why it's a good idea to improve the compression side of pglz. There are plenty of other people who spent a lot of time developing better compression algorithms. Isn't it beneficial for existing systems, which will be stuck with pglz even if we end up adding other algorithms? Currently, pglz starts with an empty cache map: there are no prior 4k bytes before the start. We can add an imaginary prefix to any data with common substrings: this will enhance the compression ratio. It is hard to decide on a training data set for this "common prefix". So we want to produce an extension with an aggregate function which produces some "adapted common prefix" from the user's data. Then we can "reserve" a few negative bytes for "decompression commands". Such a command can instruct the database on which common prefix to use. But also a system command can say "invoke decompression from extension". Thus, the user will be able to train database compression on his data and substitute pglz compression with a custom compression method seamlessly. This will make hard-chosen compression unneeded, but seems overly hacky. But there will be no need to have lz4, zstd, brotli, lzma and others in core. Why not provide e.g. "time series compression"? Or "DNA compression"? Whatever gun the user wants for his foot. I think this is way too complicated, and will provide not particularly much benefit for the majority of users. I agree with this. I do see value in the feature, but probably not as a drop-in replacement for the default compression algorithm. I'd compare it to the "custom compression methods" patch that was submitted some time ago. In fact, I'll argue that we should flat out reject any such patch until we have at least one decent default compression algorithm in core. 
You're trying to work around a poor compression algorithm with complicated dictionary improvement, that require user interaction, and only will work in a relatively small subset of the cases, and will very often increase compression times. I wouldn't be so strict I guess. But I do agree an algorithm that requires additional steps (training, ...) is unlikely to be a good candidate for default instance compression alorithm. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pglz performance
Hi,

On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:
> We have some kind of "roadmap" of "extensible pglz". We plan to provide
> an implementation in November's CF.

I don't understand why it's a good idea to improve the compression side of pglz. There are plenty of other people that have spent a lot of time developing better compression algorithms.

> Currently, pglz starts with an empty cache map: there are no prior 4k bytes
> before the start. We can add an imaginary prefix to any data with common
> substrings: this will enhance the compression ratio.
> It is hard to decide on a training data set for this "common prefix". So we
> want to produce an extension with an aggregate function which produces some
> "adapted common prefix" from the user's data.
> Then we can "reserve" a few negative bytes for "decompression commands". Such
> a command can instruct the database on which common prefix to use.
> But a system command can also say "invoke decompression from an extension".
>
> Thus, the user will be able to train database compression on his data and
> substitute pglz compression with a custom compression method seamlessly.
>
> This would make a hard-chosen compression algorithm unneeded, but seems
> overly hacky. But there would be no need to have lz4, zstd, brotli, lzma and
> others in core. Why not provide e.g. "time series compression"? Or "DNA
> compression"? Whatever gun the user wants for his foot.

I think this is way too complicated, and will provide not particularly much benefit for the majority of users. In fact, I'll argue that we should flat out reject any such patch until we have at least one decent default compression algorithm in core.

You're trying to work around a poor compression algorithm with a complicated dictionary improvement that requires user interaction, will only work in a relatively small subset of cases, and will very often increase compression times.

Greetings,

Andres Freund
Re: Add client connection check during the execution of the query
On 18.07.2019 6:19, Tatsuo Ishii wrote:
> I noticed that there are some confusions in the doc and code regarding what
> the new configuration parameter means. According to the doc:
>
> +Default value is zero - it disables connection
> +checks, so the backend will detect client disconnection only when trying
> +to send a response to the query.
>
> But guc.c comment says:
>
> + gettext_noop("Sets the time interval for checking connection with the client."),
> + gettext_noop("A value of -1 disables this feature. Zero selects a suitable default value."),
>
> Probably the doc is correct since the actual code does so.

Yes, the value -1 is not even accepted due to the specified range.

> tps = 67715.993428 (including connections establishing)
> tps = 67717.251843 (excluding connections establishing)
>
> So the performance is about 5% down with the feature enabled in this case.
> For me, 5% down is not subtle. Probably we should warn about this in the doc.

I also see some performance degradation, although it is not so large in my case (I used pgbench with scale factor 10 and ran the same command as you). In my case the difference, 103k vs. 105k TPS, is smaller than 2%.

It seems to me that it is not necessary to enable the timeout for each command:

@@ -4208,6 +4210,9 @@ PostgresMain(int argc, char *argv[],
  */
  CHECK_FOR_INTERRUPTS();
  DoingCommandRead = false;
+ if (client_connection_check_interval)
+  enable_timeout_after(SKIP_CLIENT_CHECK_TIMEOUT,
+       client_connection_check_interval);

 /*
  * (5) turn off the idle-in-transaction timeout

Unlike the statement timeout or the idle-in-transaction timeout, the precise start of the measured interval is not important here. So it is possible to enable it once before the main backend loop:

@@ -3981,6 +3983,10 @@ PostgresMain(int argc, char *argv[],
  if (!IsUnderPostmaster)
   PgStartTime = GetCurrentTimestamp();

+ if (client_connection_check_interval)
+  enable_timeout_after(SKIP_CLIENT_CHECK_TIMEOUT,
+       client_connection_check_interval);
+
 /*
  * POSTGRES main processing loop begins here
  *

But actually I do not see much difference from moving the timeout-enabling code. Moreover, the difference in performance almost does not depend on the value of the timeout. I set it to 100 seconds with a 30-second pgbench run (so the timeout never fired and recv() was never called) and there is still a small difference in performance. After some experiments I found out that the mere presence of an active timer incurs a small performance penalty. You can easily check this: set, for example, statement_timeout to the same large value (100 seconds) and you will get the same small slowdown. So recv() itself is not the source of the problem. Actually, any system call (except perhaps fsync) performed less often than once per second cannot have a noticeable impact on performance. So I do not think that recv(MSG_PEEK) can cause any performance problem on Windows or any other platform.

But I wonder why we cannot just perform poll() with the POLLOUT flag and a zero timeout. If the OS has detected a closed connection, it should return POLLHUP, shouldn't it? I am not sure whether that is a more portable or more efficient way - it just seems (to me) a slightly more natural way to check whether the connection is still alive.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Index Skip Scan
Hi, On 8/2/19 8:14 AM, Dmitry Dolgov wrote: And this too (slightly rewritten:). We will soon post the new version of patch with updates about UniqueKey from Jesper. Yes. We decided to send this now, although there is still feedback from David that needs to be considered/acted on. The patches can be reviewed independently, but we will send them as a set from now on. Development of UniqueKey will be kept separate though [1]. Note, that while UniqueKey can form the foundation of optimizations for GROUP BY queries it isn't the focus for this patch series. Contributions are very welcomed of course. [1] https://github.com/jesperpedersen/postgres/tree/uniquekey Best regards, Jesper >From 35018a382d792d6ceeb8d0e9d16bc14ea2e3f148 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 2 Aug 2019 07:52:08 -0400 Subject: [PATCH 1/2] Unique key Design by David Rowley. Author: Jesper Pedersen --- src/backend/nodes/outfuncs.c | 14 +++ src/backend/nodes/print.c | 39 +++ src/backend/optimizer/path/Makefile| 2 +- src/backend/optimizer/path/allpaths.c | 8 ++ src/backend/optimizer/path/costsize.c | 5 + src/backend/optimizer/path/indxpath.c | 41 +++ src/backend/optimizer/path/pathkeys.c | 72 ++-- src/backend/optimizer/path/uniquekey.c | 147 + src/backend/optimizer/plan/planner.c | 18 ++- src/backend/optimizer/util/pathnode.c | 12 ++ src/backend/optimizer/util/tlist.c | 1 - src/include/nodes/nodes.h | 1 + src/include/nodes/pathnodes.h | 18 +++ src/include/nodes/print.h | 2 +- src/include/optimizer/pathnode.h | 1 + src/include/optimizer/paths.h | 11 ++ 16 files changed, 377 insertions(+), 15 deletions(-) create mode 100644 src/backend/optimizer/path/uniquekey.c diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index e6ce8e2110..9a4f3e8e4b 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -1720,6 +1720,7 @@ _outPathInfo(StringInfo str, const Path *node) WRITE_FLOAT_FIELD(startup_cost, "%.2f"); WRITE_FLOAT_FIELD(total_cost, "%.2f"); 
WRITE_NODE_FIELD(pathkeys); + WRITE_NODE_FIELD(uniquekeys); } /* @@ -2201,6 +2202,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(eq_classes); WRITE_BOOL_FIELD(ec_merging_done); WRITE_NODE_FIELD(canon_pathkeys); + WRITE_NODE_FIELD(canon_uniquekeys); WRITE_NODE_FIELD(left_join_clauses); WRITE_NODE_FIELD(right_join_clauses); WRITE_NODE_FIELD(full_join_clauses); @@ -2210,6 +2212,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(placeholder_list); WRITE_NODE_FIELD(fkey_list); WRITE_NODE_FIELD(query_pathkeys); + WRITE_NODE_FIELD(query_uniquekeys); WRITE_NODE_FIELD(group_pathkeys); WRITE_NODE_FIELD(window_pathkeys); WRITE_NODE_FIELD(distinct_pathkeys); @@ -2397,6 +2400,14 @@ _outPathKey(StringInfo str, const PathKey *node) WRITE_BOOL_FIELD(pk_nulls_first); } +static void +_outUniqueKey(StringInfo str, const UniqueKey *node) +{ + WRITE_NODE_TYPE("UNIQUEKEY"); + + WRITE_NODE_FIELD(eq_clause); +} + static void _outPathTarget(StringInfo str, const PathTarget *node) { @@ -4073,6 +4084,9 @@ outNode(StringInfo str, const void *obj) case T_PathKey: _outPathKey(str, obj); break; + case T_UniqueKey: +_outUniqueKey(str, obj); +break; case T_PathTarget: _outPathTarget(str, obj); break; diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c index 4ecde6b421..62c9d4ef7a 100644 --- a/src/backend/nodes/print.c +++ b/src/backend/nodes/print.c @@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable) printf(")\n"); } +/* + * print_uniquekeys - + * unique_key an UniqueKey + */ +void +print_uniquekeys(const List *uniquekeys, const List *rtable) +{ + ListCell *l; + + printf("("); + foreach(l, uniquekeys) + { + UniqueKey *unique_key = (UniqueKey *) lfirst(l); + EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause; + ListCell *k; + bool first = true; + + /* chase up */ + while (eclass->ec_merged) + eclass = eclass->ec_merged; + + printf("("); + foreach(k, eclass->ec_members) + { + 
EquivalenceMember *mem = (EquivalenceMember *) lfirst(k); + + if (first) +first = false; + else +printf(", "); + print_expr((Node *) mem->em_expr, rtable); + } + printf(")"); + if (lnext(uniquekeys, l)) + printf(", "); + } + printf(")\n"); +} + /* * print_tl * print targetlist in a more legible way. diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile index 6864a62132..8249a6b6db 100644 --- a/src/backend/optimizer/path/Makefile +++ b/src/backend/optimizer/path/Makefile @@ -13,6 +13,6 @@ top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \ - joinpath.o
Re: pglz performance
Thanks for looking into this!

> On 2 Aug 2019, at 19:43, Tomas Vondra wrote:
>
> On Fri, Aug 02, 2019 at 04:45:43PM +0300, Konstantin Knizhnik wrote:
>>
>> It took me some time to understand that your memcpy optimization is
>> correct ;) Seems that the comments are not explanatory enough... will
>> try to fix. I have tested different ways of optimizing this fragment of
>> code, but failed to outperform your implementation!

JFYI, we tried optimizations with constant-size memcpy (optimized into inline assembly instead of a call), unrolling the literal loop, and some others. None of these worked better.

>> But ... below are results for lz4:
>>
>> Decompressor score (sum of all times):
>> NOTICE: Decompressor lz4_decompress result 3.660066
>> Compressor score (sum of all times):
>> NOTICE: Compressor lz4_compress result 10.288594
>>
>> There is a 2x advantage in decompression speed and a 10x advantage in
>> compression speed. So maybe instead of "hacking" the pglz algorithm we
>> should better switch to lz4?
>
> I think we should just bite the bullet and add an initdb option to pick
> the compression algorithm. That's been discussed repeatedly, but we never
> ended up actually doing that. See for example [1].
>
> If there's anyone willing to put some effort into getting this feature
> over the line, I'm willing to do reviews & commit. It's a seemingly
> small change with rather insane potential impact.
>
> But even if we end up doing that, it still makes sense to optimize the
> hell out of pglz, because existing systems will still use it
> (pg_upgrade can't switch from one compression algorithm to another).

We have some kind of "roadmap" of "extensible pglz". We plan to provide an implementation in November's CF.

Currently, pglz starts with an empty cache map: there are no prior 4k bytes before the start. We can add an imaginary prefix to any data with common substrings: this will enhance the compression ratio. It is hard to decide on a training data set for this "common prefix". So we want to produce an extension with an aggregate function which produces some "adapted common prefix" from the user's data. Then we can "reserve" a few negative bytes for "decompression commands". Such a command can instruct the database on which common prefix to use. But a system command can also say "invoke decompression from an extension".

Thus, the user will be able to train database compression on his data and substitute pglz compression with a custom compression method seamlessly.

This would make a hard-chosen compression algorithm unneeded, but seems overly hacky. But there would be no need to have lz4, zstd, brotli, lzma and others in core. Why not provide e.g. "time series compression"? Or "DNA compression"? Whatever gun the user wants for his foot.

Best regards, Andrey Borodin.
[PATCH] Improve performance of NOTIFY over many databases (v2)
Hoi hackers,

Here is a reworked version of the previous patches. The original three patches have been collapsed into one, as given the changes discussed it didn't make sense to keep them separate. There are now two patches (the third is just to help with testing):

Patch 1: Tracks the listening backends in a list so non-listening backends can be quickly skipped over. This is separate because it's orthogonal to the rest of the changes and there are other ways to do this.

Patch 2: This is the meat of the change. It implements all the suggestions discussed:

- The queue tail is now only updated lazily, whenever the notify queue moves to a new page. This did require a new global to track this state through the transaction commit, but it seems worth it.
- Only backends for the current database are signalled when a notification is made.
- Slow backends are woken up one at a time rather than all at once.
- A backend is allowed to lag up to 4 SLRU pages behind before being signalled. This is a tradeoff between how often to get woken up versus how much work to do once woken up.
- All the relevant comments have been updated to describe the new algorithm. Locking should also be correct now.

This means that in the normal case, where listening backends get a notification occasionally, no-one will ever be considered slow. An exclusive lock for cleanup will happen about once per SLRU page. There are still the exclusive locks on adding notifications, but that's unavoidable.

One minor issue is that pg_notification_queue_usage() will now return a small but non-zero number (about 3e-6) even when nothing is really going on. This could be fixed by having it take an exclusive lock instead and updating to the latest values, but that barely seems worth it.

Performance-wise it's even better than my original patches, with about a 20-25% reduction in CPU usage in my test setup (using the test script sent previously).
Here is the log output from my postgres, where you see the signalling in action:

--
16:42:48.673 [10188] martijn@test_131 DEBUG: PreCommit_Notify
16:42:48.673 [10188] martijn@test_131 DEBUG: NOTIFY QUEUE = (74,896)...(79,0)
16:42:48.673 [10188] martijn@test_131 DEBUG: backendTryAdvanceTail -> true
16:42:48.673 [10188] martijn@test_131 DEBUG: AtCommit_Notify
16:42:48.673 [10188] martijn@test_131 DEBUG: ProcessCompletedNotifies
16:42:48.673 [10188] martijn@test_131 DEBUG: backendTryAdvanceTail -> false
16:42:48.673 [10188] martijn@test_131 DEBUG: asyncQueueAdvanceTail
16:42:48.673 [10188] martijn@test_131 DEBUG: waking backend 137 (pid 10055)
16:42:48.673 [10055] martijn@test_067 DEBUG: ProcessIncomingNotify
16:42:48.673 [10187] martijn@test_131 DEBUG: ProcessIncomingNotify
16:42:48.673 [10055] martijn@test_067 DEBUG: asyncQueueAdvanceTail
16:42:48.673 [10055] martijn@test_067 DEBUG: waking backend 138 (pid 10056)
16:42:48.673 [10187] martijn@test_131 DEBUG: ProcessIncomingNotify: done
16:42:48.673 [10055] martijn@test_067 DEBUG: ProcessIncomingNotify: done
16:42:48.673 [10056] martijn@test_067 DEBUG: ProcessIncomingNotify
16:42:48.673 [10056] martijn@test_067 DEBUG: asyncQueueAdvanceTail
16:42:48.673 [10056] martijn@test_067 DEBUG: ProcessIncomingNotify: done
16:42:48.683 [9991] martijn@test_042 DEBUG: Async_Notify(changes)
16:42:48.683 [9991] martijn@test_042 DEBUG: PreCommit_Notify
16:42:48.683 [9991] martijn@test_042 DEBUG: NOTIFY QUEUE = (75,7744)...(79,32)
16:42:48.683 [9991] martijn@test_042 DEBUG: AtCommit_Notify
-

Have a nice weekend.

--
Martijn van Oosterhout http://svana.org/kleptog/

From 82366f1dbc0fc234fdd10dbc15519b3cf7104684 Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout
Date: Tue, 23 Jul 2019 16:49:30 +0200
Subject: [PATCH 1/3] Maintain queue of listening backends to speed up loops

Especially the loop to advance the tail pointer is called fairly often while holding an exclusive lock.
--- src/backend/commands/async.c | 45 +++- 1 file changed, 40 insertions(+), 5 deletions(-) diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c index 6e9c580ec6..ba0b1baecc 100644 --- a/src/backend/commands/async.c +++ b/src/backend/commands/async.c @@ -217,6 +217,7 @@ typedef struct QueueBackendStatus { int32 pid; /* either a PID or InvalidPid */ Oid dboid; /* backend's database OID, or InvalidOid */ + int nextListener; /* backendid of next listener, 0=last */ QueuePosition pos; /* backend has read queue up to here */ } QueueBackendStatus; @@ -247,6 +248,7 @@ typedef struct AsyncQueueControl QueuePosition tail; /* the global tail is equivalent to the pos of * the "slowest" backend */ TimestampTz lastQueueFillWarn; /* time of last queue-full msg */ + int firstListener; /* backendId of first listener, 0=none */ QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER]; /* backend[0] is not used; used entries are from [1] to [MaxBackends] */ } AsyncQueueControl; @@ -257,8 +259,11 @@ static Asy
Optimize single tuple fetch from nbtree index
Hi hackers,

While I was reviewing some code in another patch, I stumbled upon a possible optimization in the btree index code in nbtsearch.c for queries using 'LIMIT 1'. I have written a small patch that implements this optimization, but I'd like some feedback on the design of the patch, whether it is correct at all to use this optimization, and whether the performance tradeoffs are deemed worth it by the community.

Basically, an example of the case I'd like to optimize is the following. Given a table 'tbl' with an index on columns (k, ts DESC):

SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1;

And, even more importantly, when this query gets called in an inner loop like:

SELECT *
FROM generate_series(:start_ts, :end_ts, :interval) ts
  -- perhaps thousands of iterations; could also be a loop over values of
  -- 'k' rather than timestamps, this is just an example
CROSS JOIN LATERAL (
  SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1
) _;

With time-series data this case often arises, as you have a certain natural key for which you store updates as they occur. Getting the state of k at a specific time then boils down to the given query, which is almost always the fastest way to get this information, since the index scan with LIMIT 1 is very fast already. However, there seems to be a possibility to make this even faster (up to nearly 3x faster in test cases that use this nested loop of index scans).

Every time the index scan is done, all tuples from the leaf page are read in nbtsearch.c:_bt_readpage. The idea of this patch is to make an exception for this *only* the first time amgettuple gets called. This calls _bt_first in nbtsearch.c, which will, if there are scankeys, descend the tree to a leaf page and read just the first (or possibly two) tuples. It won't touch the rest of the page yet. If indeed just one tuple was required, there won't be a call to _bt_next and we're done.
If we do need more than one tuple, _bt_next will resume reading tuples from the index page at the point where we left off.

There are a few caveats:

- Possible performance decrease for queries that need a small number of tuples (but more than one), because they now need to lock the same page twice. This can happen in several cases, for example: LIMIT 3; LIMIT 1 but the first tuple returned does not match other scan conditions; LIMIT 1 but the tuple returned is not visible; no LIMIT at all but only a few matching rows.
- We need to take into account page splits, insertions and vacuums while we do not hold the read lock between _bt_first and the first call to _bt_next. This made my patch quite a bit more complicated than my initial implementation.

I did performance tests for some best-case and worst-case test scenarios. TPS results were stable and reproducible in re-runs on my otherwise idle server. Attached are the full results and how to reproduce them. I picked test cases that show the best performance as well as the worst performance compared to master. Summary: the greatest performance improvement can be seen for the cases with the subquery in a nested loop. In a nested loop of 100 times, the performance is roughly two times better; for 1 times the performance is roughly three times better. For most test cases that don't use LIMIT 1, I couldn't find a noticeable difference, except for the nested loop with LIMIT 3 (or similarly, a nested loop without any LIMIT clause that returns just three tuples). This is also theoretically the worst-case test case, because it has to lock the page again and then read it, just for one tuple. In this case, I saw TPS decrease by 2-3% in a few cases (details in the attached file), due to having to lock/unlock the same page in both _bt_first and _bt_next.

A few concrete questions for the community:

- Does the community also see this as a useful optimization?
- Is the way it is currently implemented safe? I struggled quite a bit to get everything working with respect to page splits and insertions. In particular, I don't know why, in my patch, _bt_find_offset_for_tid needs to consider searching for items with an offset *before* the passed offset. As far as my understanding goes, this could only happen when the index gets vacuumed in the meantime. However, we hold a pin on the buffer the whole time (we even assert this), so vacuum should not have been possible. Still, this code gets triggered sometimes, so it seems to be necessary. Perhaps someone in the community who's more of an expert on this can shed some light on it.
- What are considered acceptable performance tradeoffs in this case? Is a performance degradation in any part generally not acceptable at all?

I'd also welcome any feedback on the process - this is my first patch, and while I tried to follow the guidelines, I may have missed something along the way.

Attachments:
- out_limit.txt: pgbench resu
Re: Memory-Bounded Hash Aggregation
On Fri, Aug 02, 2019 at 08:11:19AM -0700, Jeff Davis wrote:
> On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote:
> > I'm late to the party.
>
> You are welcome to join any time!
>
> > These two approaches both spill the input tuples, what if the skewed
> > groups are not encountered before the hash table fills up? The spill
> > files' size and disk I/O could be downsides.
>
> Let's say the worst case is that we encounter 10 million groups of size
> one first; just enough to fill up memory. Then, we encounter a single
> additional group of size 20 million, and need to write out all of those
> 20 million raw tuples. That's still not worse than Sort+GroupAgg, which
> would need to write out all 30 million raw tuples (in practice Sort is
> pretty fast so may still win in some cases, but not by any huge amount).
>
> > Greenplum spills all the groups by writing the partial aggregate
> > states, resets the memory context, processes incoming tuples and
> > builds an in-memory hash table, then reloads and combines the spilled
> > partial states at last. How does this sound?
>
> That can be done as an add-on to approach #1 by evicting the entire hash
> table (writing out the partial states), then resetting the memory
> context. It does add to the complexity though, and would only work for
> the aggregates that support serializing and combining partial states. It
> also might be a net loss to do the extra work of initializing and
> evicting a partial state if we don't have large enough groups to
> benefit.
>
> Given that the worst case isn't worse than Sort+GroupAgg, I think it
> should be left as a future optimization. That would give us time to tune
> the process to work well in a variety of cases.

+1 to leaving that as a future optimization.

I think it's clear there's no perfect eviction strategy - for every algorithm we came up with we can construct a data set on which it performs terribly (I'm sure we could do that for the approach used by Greenplum, for example).

So I think it makes sense to do what Jeff proposed, and then maybe try improving that in the future with a switch to a different eviction strategy based on some heuristics.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Memory-Bounded Hash Aggregation
On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote:
> I'm late to the party.

You are welcome to join any time!

> These two approaches both spill the input tuples, what if the skewed
> groups are not encountered before the hash table fills up? The spill
> files' size and disk I/O could be downsides.

Let's say the worst case is that we encounter 10 million groups of size one first; just enough to fill up memory. Then, we encounter a single additional group of size 20 million, and need to write out all of those 20 million raw tuples. That's still not worse than Sort+GroupAgg, which would need to write out all 30 million raw tuples (in practice Sort is pretty fast so may still win in some cases, but not by any huge amount).

> Greenplum spills all the groups by writing the partial aggregate states,
> resets the memory context, processes incoming tuples and builds an
> in-memory hash table, then reloads and combines the spilled partial
> states at last. How does this sound?

That can be done as an add-on to approach #1 by evicting the entire hash table (writing out the partial states), then resetting the memory context. It does add to the complexity though, and would only work for the aggregates that support serializing and combining partial states. It also might be a net loss to do the extra work of initializing and evicting a partial state if we don't have large enough groups to benefit.

Given that the worst case isn't worse than Sort+GroupAgg, I think it should be left as a future optimization. That would give us time to tune the process to work well in a variety of cases.

Regards,
Jeff Davis
Re: Patch to document base64 encoding
On Fri, 02 Aug 2019 10:44:43 -0400 Tom Lane wrote:
> I don't really have a problem with repeating the entries for other
> functions that exist in both text and bytea variants, either.

Ok. Thanks. I'll repeat entries then.

Regards,
Karl
Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
Re: Patch to document base64 encoding
On 8/2/19 10:32 AM, Karl O. Pinc wrote:
> But I'm not happy with putting any function that works with
> bytea into the binary string section. This would mean moving,
> say, length() out of the regular string section.

I'm not sure why. The bytea section already has an entry for its length() function. There are also length() functions for bit, character, lseg, path, tsvector. I don't really think of those as "a length() function" that works on all those types. I think of a variety of types, many of which offer a length() function. That seems to be reflected in the way the docs are arranged.

Regards,
-Chap
Re: Patch to document base64 encoding
"Karl O. Pinc" writes: > But I'm not happy with putting any function that works with > bytea into the binary string section. This would mean moving, > say, length() out of the regular string section. There's a > lot of functions that work on both string and bytea inputs > and most (not all, see below) are functions that people > typically associate with string data. Well, there are two different length() functions --- length(text) and length(bytea) are entirely different things, they don't even measure in the same units. I think documenting them separately is the right thing to do. I don't really have a problem with repeating the entries for other functions that exist in both text and bytea variants, either. There aren't that many. > What I think I'd like to do is add a column to the table > in the string section that says whether or not the function > works with both string and bytea. Meh. Seems like what that would mostly do is ensure that neither page is understandable on its own. regards, tom lane
Re: pglz performance
On Fri, Aug 02, 2019 at 04:45:43PM +0300, Konstantin Knizhnik wrote:
> On 27.06.2019 21:33, Andrey Borodin wrote:
> > > On 13 May 2019, at 12:14, Michael Paquier wrote:
> > > Decompression can matter a lot for mostly-read workloads and
> > > compression can become a bottleneck for heavy-insert loads, so
> > > improving compression or decompression should be two separate
> > > problems, not two problems linked. Any improvement in one or the
> > > other, or even both, is nice to have.
> >
> > Here's a patch hacked by Vladimir for compression. Key differences (as
> > far as I see; maybe Vladimir will post a more complete list of
> > optimizations):
> > 1. Use functions instead of macro-functions: not surprisingly, they
> > are easier to optimize and place fewer constraints on the compiler.
> > 2. More compact hash table: use indexes instead of pointers.
> > 3. More robust segment comparison: like memcmp, but returns the index
> > of the first differing byte.
> >
> > In a weighted mix of different data (same as for compression), the
> > overall speedup is x1.43 on my machine. The current implementation is
> > integrated into the test_pglz suite for benchmarking purposes [0].
> >
> > Best regards, Andrey Borodin.
> > [0] https://github.com/x4m/test_pglz
>
> It took me some time to understand that your memcpy optimization is
> correct ;) Seems that the comments are not explanatory enough...
> I have tested different ways of optimizing this fragment of code, but
> failed to outperform your implementation! Results on my computer are
> similar to yours:
>
> Decompressor score (sum of all times):
> NOTICE: Decompressor pglz_decompress_hacked result 6.627355
> NOTICE: Decompressor pglz_decompress_hacked_unrolled result 7.497114
> NOTICE: Decompressor pglz_decompress_hacked8 result 7.412944
> NOTICE: Decompressor pglz_decompress_hacked16 result 7.792978
> NOTICE: Decompressor pglz_decompress_vanilla result 10.652603
>
> Compressor score (sum of all times):
> NOTICE: Compressor pglz_compress_vanilla result 116.970005
> NOTICE: Compressor pglz_compress_hacked result 89.706105
>
> But ... below are results for lz4:
>
> Decompressor score (sum of all times):
> NOTICE: Decompressor lz4_decompress result 3.660066
> Compressor score (sum of all times):
> NOTICE: Compressor lz4_compress result 10.288594
>
> There is a 2x advantage in decompression speed and a 10x advantage in
> compression speed. So maybe instead of "hacking" the pglz algorithm we
> should better switch to lz4?

I think we should just bite the bullet and add an initdb option to pick the compression algorithm. That's been discussed repeatedly, but we never ended up actually doing that. See for example [1].

If there's anyone willing to put some effort into getting this feature over the line, I'm willing to do reviews & commit. It's a seemingly small change with rather insane potential impact.

But even if we end up doing that, it still makes sense to optimize the hell out of pglz, because existing systems will still use it (pg_upgrade can't switch from one compression algorithm to another).

regards

[1] https://www.postgresql.org/message-id/flat/55341569.1090107%402ndquadrant.com

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Patch to document base64 encoding
On Tue, 30 Jul 2019 23:44:49 +0200 (CEST) Fabien COELHO wrote:
> Personally, I'd be ok with having a separate "conversion function" table, and also with Tom's suggestion to have the string section hold "only simple string" functions, and if any binary appears it is moved into the binary section.
I'll make a "conversion function" table and put it in the binary section. But I'm not happy with putting every function that works with bytea into the binary string section. This would mean moving, say, length() out of the regular string section. There are a lot of functions that work on both string and bytea inputs, and most (not all, see below) are functions that people typically associate with string data. What I think I'd like to do is add a column to the table in the string section that says whether or not the function works with both string and bytea. The result would be: The hash functions (md5(), sha256(), etc.) would move to the string section, because they work on both strings and binary data. So the binary function table would shorten considerably. There would be a new table for conversions between bytea and string (both directions), placed in the binary string section. So the binary string section would be "just simple bytea", plus conversion functions. Kind of the opposite of Tom's suggestion. Please let me know what you think. Thanks. Regards, Karl
Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
Re: Recent failures in IsolationCheck deadlock-hard
Thomas Munro writes: > There have been five failures on three animals like this, over the > past couple of months: Also worth noting is that anole failed its first try at the new deadlock-parallel isolation test: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-08-01%2015%3A48%3A16 What that looks like is the queries got stuck and eventually isolationtester gave up and canceled the test. So I'm suspicious that there's a second bug in the parallel deadlock detection code. Possibly relevant factoids: all three of the animals in question run HEAD with force_parallel_mode = regress, and there's reason to think that their timing behavior could be different from other animals (anole and gharial run on HPUX, while hyrax uses CLOBBER_CACHE_ALWAYS). regards, tom lane
Re: pglz performance
On 27.06.2019 21:33, Andrey Borodin wrote:
On 13 May 2019, at 12:14, Michael Paquier wrote:
Decompression can matter a lot for mostly-read workloads and compression can become a bottleneck for heavy-insert loads, so improving compression or decompression should be treated as two separate problems, not two linked ones. Any improvement in one or the other, or even both, is nice to have.
Here's a patch hacked up by Vladimir for compression. Key differences (as far as I can see; maybe Vladimir will post a more complete list of optimizations):
1. Use functions instead of macros: not surprisingly, they are easier to optimize and impose fewer constraints on the compiler.
2. More compact hash table: use indexes instead of pointers.
3. More robust segment comparison: like memcmp, but returning the index of the first differing byte.
In a weighted mix of different data (same as for compression), the overall speedup is x1.43 on my machine. The current implementation is integrated into the test_pglz suite for benchmarking purposes [0]. Best regards, Andrey Borodin. [0] https://github.com/x4m/test_pglz
It took me some time to understand that your memcpy optimization is correct ;) I have tested different ways of optimizing this fragment of code, but failed to outperform your implementation! Results on my computer are similar to yours:
Decompressor score (summ of all times):
NOTICE: Decompressor pglz_decompress_hacked result 6.627355
NOTICE: Decompressor pglz_decompress_hacked_unrolled result 7.497114
NOTICE: Decompressor pglz_decompress_hacked8 result 7.412944
NOTICE: Decompressor pglz_decompress_hacked16 result 7.792978
NOTICE: Decompressor pglz_decompress_vanilla result 10.652603
Compressor score (summ of all times):
NOTICE: Compressor pglz_compress_vanilla result 116.970005
NOTICE: Compressor pglz_compress_hacked result 89.706105
But ...
below are results for lz4:
Decompressor score (summ of all times):
NOTICE: Decompressor lz4_decompress result 3.660066
Compressor score (summ of all times):
NOTICE: Compressor lz4_compress result 10.288594
There is a 2x advantage in decompression speed and a 10x advantage in compression speed. So maybe instead of "hacking" the pglz algorithm we should switch to lz4? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
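Scores like the ones quoted above come down to timing a codec round-trip. A minimal harness for such an experiment is sketched below; Python's standard library has no lz4 binding, so zlib stands in for the compressor, and the scoring idea (sum the per-run times after checking correctness) only mirrors the test_pglz approach rather than reusing its code:

```python
import time
import zlib

def score(compress, decompress, payload, loops=50):
    """Return (compress_seconds, decompress_seconds) over `loops` runs,
    after verifying the codec round-trips the payload correctly."""
    packed = compress(payload)
    assert decompress(packed) == payload  # correctness before speed
    t0 = time.perf_counter()
    for _ in range(loops):
        compress(payload)
    c_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    for _ in range(loops):
        decompress(packed)
    d_time = time.perf_counter() - t0
    return c_time, d_time

payload = b"mostly repetitive heap page content " * 1024
c_time, d_time = score(zlib.compress, zlib.decompress, payload)
```

Swapping in pglz or lz4 bindings for the two callables would reproduce the kind of head-to-head numbers shown in the thread.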
Re: block-level incremental backup
On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke wrote:
> On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke wrote:
>> I am almost done writing the patch for pg_combinebackup and will post soon.
>
> Attached is a patch which implements the pg_combinebackup utility used to combine a full basebackup with one or more incremental backups.
> I have tested it manually and it works for all best cases.
> Let me know if you have any inputs/suggestions/review comments?

Some comments:

1) There will be some link files created for tablespaces; we might require some special handling for them.

2)
+ while (numretries <= maxretries)
+ {
+ rc = system(copycmd);
+ if (rc == 0)
+ return;
+
+ pg_log_info("could not copy, retrying after %d seconds",
+ sleeptime);
+ pg_usleep(numretries++ * sleeptime * 100L);
+ }
Retry functionality is handled only for copying of full files; should we handle retries for copying of partial files too?

3)
+ maxretries = atoi(optarg);
+ if (maxretries < 0)
+ {
+ pg_log_error("invalid value for maxretries");
+ fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
+ exit(1);
+ }
+ break;
+ case 's':
+ sleeptime = atoi(optarg);
+ if (sleeptime <= 0 || sleeptime > 60)
+ {
+ pg_log_error("invalid value for sleeptime");
+ fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"), progname);
+ exit(1);
+ }
+ break;
We could have some range for maxretries similar to sleeptime.

4)
+ fp = fopen(filename, "r");
+ if (fp == NULL)
+ {
+ pg_log_error("could not read file \"%s\": %m", filename);
+ exit(1);
+ }
+
+ labelfile = malloc(statbuf.st_size + 1);
+ if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
+ {
+ pg_log_error("corrupted file \"%s\": %m", filename);
+ free(labelfile);
+ exit(1);
+ }
Should we check for malloc failure?

5) Should we add a display of progress, as the backup may take some time? This can be added as an enhancement. We can get others' opinions on this.
6)
+ if (nIncrDir == MAX_INCR_BK_COUNT)
+ {
+ pg_log_error("too many incremental backups to combine");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+ exit(1);
+ }
+
+ IncrDirs[nIncrDir] = optarg;
+ nIncrDir++;
+ break;
If the backup count increases, providing the input may be difficult. Should the user instead provide all the incremental backups from a parent folder, with the ordering of incremental backups handled internally?

7)
+ if (isPartialFile)
+ {
+ if (verbose)
+ pg_log_info("combining partial file \"%s.partial\"", fn);
+
+ combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
+ }
+ else
+ copy_whole_file(infn, outfn);
Add verbose output for copying a whole file too.

8) We could also check whether approximately enough disk space is available before starting to combine the backup; this can be added as an enhancement. We can get others' opinions on this.

9)
+ printf(_(" -i, --incr-backup=DIRECTORY incremental backup directory (maximum %d)\n"), MAX_INCR_BK_COUNT);
+ printf(_(" -o, --output-dir=DIRECTORY combine backup into directory\n"));
+ printf(_("\nGeneral options:\n"));
+ printf(_(" -n, --no-clean do not clean up after errors\n"));
The help text "combine backup into directory" could be "combine backup directory".

10)
+/* Max number of incremental backups to be combined. */
+#define MAX_INCR_BK_COUNT 10
+
+/* magic number in incremental backup's .partial file */
MAX_INCR_BK_COUNT could be increased a little; some applications use 1 full backup at the beginning of the month and 30 incremental backups for the rest of the days in the month.

Regards, Vignesh
EnterpriseDB: http://www.enterprisedb.com
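The retry loop reviewed above boils down to a bounded number of copy attempts with a sleep between them. A rough stand-alone sketch of that shape (hypothetical names, not the patch's code):

```python
import shutil
import time

def copy_with_retry(src, dst, maxretries=3, sleeptime=1):
    """Try the copy up to maxretries+1 times, sleeping a growing
    multiple of sleeptime seconds between attempts; return success."""
    for attempt in range(maxretries + 1):
        try:
            shutil.copyfile(src, dst)
            return True
        except OSError:
            time.sleep(attempt * sleeptime)
    return False
```

One detail worth flagging in review: pg_usleep() takes microseconds, so the quoted numretries * sleeptime * 100L sleeps far less than the "retrying after %d seconds" message suggests; the units look inconsistent.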
Recent failures in IsolationCheck deadlock-hard
Hi,

There have been five failures on three animals like this, over the past couple of months:

 step s6a7: LOCK TABLE a7;
 step s7a8: LOCK TABLE a8;
 step s8a1: LOCK TABLE a1;
-step s8a1: <... completed>
 step s7a8: <... completed>
-error in steps s8a1 s7a8: ERROR: deadlock detected
+step s8a1: <... completed>
+ERROR: deadlock detected
 step s8c: COMMIT;
 step s7c: COMMIT;
 step s6a7: <... completed>

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-07-18%2021:57:59
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-07-10%2005:59:16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2019-07-08%2015:02:17
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-06-23%2004:17:09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-06-12%2021:46:24

Before that there were some like that a couple of years back:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-09%2021:58:03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-08%2021:58:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-08%2005:19:17
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-07%2000:23:39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-05%2018:58:04

-- Thomas Munro https://enterprisedb.com
Re: Index Skip Scan
> On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi wrote:
> > I feel uncomfortable looking into indexinfo there. Couldn't we use indexskipprefix == -1 to signal !amcanskip from create_index_path?
Looks like it's not that straightforward to do this only in create_index_path, since to make this decision we need to have both parts, indexinfo and the distinct keys.
> Yeah, your explanation was perfect for me. What I failed to understand was what is expected to be done in that case. I reconsidered and understood that:
> For example, for the following query:
> select distinct (a, b) a, b, c from t where c < 100;
> skip scan returns one tuple for each distinct set of (a, b) with an arbitrary one of c. If the chosen c doesn't match the qual and there is some c that does match the qual, we miss that tuple.
> If this is correct, an explanation like the above might help.
Yes, that's correct; I've added this to the comments.
> Maybe something like the following will work *for me*:p
> | When we are fetching a cursor in the backward direction, return the tuples that forward fetching should have returned. In other words, we return the last scanned tuple in a DISTINCT set. Skip to that tuple before returning the first tuple.
And this too (slightly rewritten:). We will soon post the new version of the patch with updates about UniqueKey from Jesper.
Re: Commitfest 2019-07, the first of five* for PostgreSQL 13
I guess the CF app could show those kinds of metrics, but having a written report from a human seems to be a good idea (I got it from Alvaro's blog [1]). The CF is now closed, and here are the final numbers:

         status         | w1 | w2 | w3 | w4 | final
------------------------+----+----+----+----+-------
 Committed              | 32 | 41 | 49 | 59 |    64
 Moved to next CF       |  5 |  6 |  6 |  6 |   145
 Rejected               |  2 |  2 |  2 |  2 |     2
 Returned with feedback |  2 |  2 |  2 |  2 |     9
 Withdrawn              |  8 |  8 |  9 | 10 |    11

In percentages, we returned and rejected 5%, withdrew 5%, committed 28%, and pushed 62% to the next 'fest. That's a wrap. Thanks everyone. [1] https://www.2ndquadrant.com/en/blog/managing-a-postgresql-commitfest/
Thanks. Attached is a small graphical display of CF results over time. -- Fabien.
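The quoted percentages can be checked against the final column; small differences come down to rounding. This is just the arithmetic, not CF app code:

```python
final = {
    "Committed": 64,
    "Moved to next CF": 145,
    "Rejected": 2,
    "Returned with feedback": 9,
    "Withdrawn": 11,
}
total = sum(final.values())  # 231 patches in the final tally
committed_pct = round(100 * final["Committed"] / total)      # ~28%
moved_pct = round(100 * final["Moved to next CF"] / total)   # ~63%, quoted as 62%
returned_rejected_pct = round(
    100 * (final["Rejected"] + final["Returned with feedback"]) / total)  # ~5%
withdrawn_pct = round(100 * final["Withdrawn"] / total)      # ~5%
```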
Re: Store FullTransactionId in TwoPhaseFileHeader/GlobalTransactionData
On Fri, Aug 2, 2019 at 6:37 AM Thomas Munro wrote: > Thanks. This looks pretty reasonable to me, and I don't think we need > to worry about the subxid list for now. Why not just do them all at once? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Built-in connection pooler
On 02.08.2019 12:57, DEV_OPS wrote: Hello Konstantin would you please re-base this patch? I'm going to test it, and back port into PG10 stable and PG9 stable thank you very much Thank you. Rebased patch is attached. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index c91e3e1..119daac 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -719,6 +719,137 @@ include_dir 'conf.d' + + max_sessions (integer) + + max_sessions configuration parameter + + + + + The maximum number of client sessions that can be handled by + one connection proxy when session pooling is enabled. + This parameter does not add any memory or CPU overhead, so + specifying a large max_sessions value + does not affect performance. + If the max_sessions limit is reached new connections are not accepted. + + + The default value is 1000. This parameter can only be set at server start. + + + + + + session_pool_size (integer) + + session_pool_size configuration parameter + + + + + Enables session pooling and defines the maximum number of + backends that can be used by client sessions for each database/user combination. + Launched non-tainted backends are never terminated even if there are no active sessions. + Backend is considered as tainted if client updates GUCs, creates temporary table or prepared statements. + Tainted backend can server only one client. + + + The default value is 10, so up to 10 backends will serve each database, + + + + + + proxy_port (integer) + + proxy_port configuration parameter + + + + + Sets the TCP port for the connection pooler. + Clients connected to main "port" will be assigned dedicated backends, + while client connected to proxy port will be connected to backends through proxy which + performs transaction level scheduling. + + + The default value is 6543. 
+ + + + + + connection_proxies (integer) + + connection_proxies configuration parameter + + + + + Sets number of connection proxies. + Postmaster spawns separate worker process for each proxy. Postmaster scatters connections between proxies using one of scheduling policies (round-robin, random, load-balancing). + Each proxy launches its own subset of backends. + So maximal number of non-tainted backends is session_pool_size*connection_proxies*databases*roles. + + + The default value is 0, so session pooling is disabled. + + + + + + session_schedule (enum) + + session_schedule configuration parameter + + + + + Specifies scheduling policy for assigning session to proxies in case of + connection pooling. Default policy is round-robin. + + + With round-robin policy postmaster cyclicly scatter sessions between proxies. + + + With random policy postmaster randomly choose proxy for new session. + + + With load-balancing policy postmaster choose proxy with lowest load average. + Load average of proxy is estimated by number of clients connection assigned to this proxy with extra weight for SSL connections. + + + + + + idle_pool_worker_timeout (integer) + + idle_pool_worker_timeout configuration parameter + + + + + Terminate an idle connection pool worker after the specified number of milliseconds. + The default value is 0, so pool workers are never terminated. + + + + + + restart_pooler_on_reload (string) + + restart_pooler_on_reload configuration parameter + + + + + Restart session pool workers once pg_reload_conf() is called. + The default value is false. + + + + unix_socket_directories (string) diff --git a/doc/src/sgml/connpool.sgml b/doc/src/sgml/connpool.sgml new file mode 100644 index 000..bc6547b --- /dev/null +++ b/doc/src/sgml/connpool.sgml @@ -0,0 +1,174 @@ + + + + Connection pooling + + + built-in connection pool proxy + + + +PostgreSQL spawns a separate process (backend) for each client. 
+For large number of clients this model can consume a large number of system +resources and lead to significant performance degradation, especially on computers with large +number of CPU cores. The
Re: Fix typos
On Fri, Aug 2, 2019 at 12:11 AM Michael Paquier wrote: > On Thu, Aug 01, 2019 at 11:01:59PM -0400, Alvaro Herrera wrote: > > I think slight variations don't really detract from the value of the > > product, and consider the odd variation a reminder of the diversity of > > the project. I don't suggest that we purposefully introduce spelling > > variations, or that we refrain from fixing ones that appear in code > > we're changing, but I don't see the point in changing a line for the > > sole reason of standardising the spelling of a word. > > Agreed. This always reminds me of ANALYZE vs. ANALYSE where we don't > actually document the latter :) > I didn't know about that. That's a fun one! > > > That said, I'm not a native English speaker. > > Neither am I. > I am. Consistency is nice but either reads fine to me. Only brought it up as I didn't see many other usages so seemed out of place. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Should we add xid_current() or a int8->xid cast?
On Thu, Jul 25, 2019 at 1:11 PM Thomas Munro wrote: > On Thu, Jul 25, 2019 at 12:42 PM Tom Lane wrote: > > Andres Freund writes: > > > On 2019-07-24 20:34:39 -0400, Tom Lane wrote: > > >> Yeah, I would absolutely NOT recommend that you open that can of worms > > >> right now. We have looked at adding unsigned integer types in the past > > >> and it looked like a mess. > > > > > I assume Thomas was thinking more of another bespoke type like xid, just > > > wider. There's some notational advantage in not being able to > > > immediately do math etc on xids. > > > > Well, we could invent an xid8 type if we want, just don't try to make > > it part of the numeric hierarchy (as indeed xid isn't). > > Yeah, I meant an xid64/xid8/fxid/pg_something/... type that isn't a > kind of number. I played around with an xid8 type over here (not tested much yet, in particular not tested on 32 bit box): https://www.postgresql.org/message-id/CA%2BhUKGKbQtX8E5TEdcZaYhTxqLqrvcpN1Vjb7eCu2bz5EACZbw%40mail.gmail.com -- Thomas Munro https://enterprisedb.com
Re: Fix typos
On Thu, Aug 1, 2019 at 10:18 PM Tom Lane wrote: > It's British vs. American spelling. For the most part, Postgres > follows American spelling, but there's the odd Briticism here and > there. Thanks for the explanation. I thought that might be the case but didn't find any other usages of "serialise" so was not sure. > I'm not sure whether it's worth trying to standardize. > I think the most recent opinion on this was Munro's: > > > https://www.postgresql.org/message-id/ca+hukgjz-pdmgwxroiwvn-aeg4-ajdwj3gwdqkosa8g65sp...@mail.gmail.com Either reads fine to me and the best rationale I can think of for going with one spelling is to not have the same "fix" come up again. If there is a desire to change this, attached is updated to include one more instance of "materialise" and a change to the commit message to match some similar ones I found in the past. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/ From 20c018b9350cb17e87930cebb66b715bc4b9fcf4 Mon Sep 17 00:00:00 2001 From: Sehrope Sarkuni Date: Fri, 2 Aug 2019 05:58:35 -0400 Subject: [PATCH] Use American spelling for "serialize" and "materalize" --- contrib/pgstattuple/pgstatapprox.c | 2 +- contrib/xml2/xpath.c| 2 +- src/backend/access/transam/README | 4 ++-- src/backend/storage/buffer/bufmgr.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c index 636c8d40ac..83b3270554 100644 --- a/contrib/pgstattuple/pgstatapprox.c +++ b/contrib/pgstattuple/pgstatapprox.c @@ -278,7 +278,7 @@ pgstattuple_approx_internal(Oid relid, FunctionCallInfo fcinfo) errmsg("cannot access temporary tables of other sessions"))); /* - * We support only ordinary relations and materialised views, because we + * We support only ordinary relations and materialized views, because we * depend on the visibility map and free space map for our estimates about * unscanned pages. 
*/ diff --git a/contrib/xml2/xpath.c b/contrib/xml2/xpath.c index 1e5b71d9a0..11749b6c47 100644 --- a/contrib/xml2/xpath.c +++ b/contrib/xml2/xpath.c @@ -573,7 +573,7 @@ xpath_table(PG_FUNCTION_ARGS) errmsg("xpath_table must be called as a table function"))); /* - * We want to materialise because it means that we don't have to carry + * We want to materialize because it means that we don't have to carry * libxml2 parser state between invocations of this function */ if (!(rsinfo->allowedModes & SFRM_Materialize)) diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index ad4083eb6b..49b6809c82 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -609,8 +609,8 @@ or more buffer locks be held concurrently, you must lock the pages in appropriate order, and not release the locks until all the changes are done. Note that we must only use PageSetLSN/PageGetLSN() when we know the action -is serialised. Only Startup process may modify data blocks during recovery, -so Startup process may execute PageGetLSN() without fear of serialisation +is serialized. Only Startup process may modify data blocks during recovery, +so Startup process may execute PageGetLSN() without fear of serialization problems. All other processes must only call PageSet/GetLSN when holding either an exclusive buffer lock or a shared lock plus buffer header lock, or be writing the data block directly rather than through shared buffers diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 6f3a402854..3d9e0a3488 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -3519,7 +3519,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std) /* * Set the page LSN if we wrote a backup block. We aren't supposed * to set this when only holding a share lock but as long as we - * serialise it somehow we're OK. We choose to set LSN while + * serialize it somehow we're OK. 
We choose to set LSN while * holding the buffer header lock, which causes any reader of an * LSN who holds only a share lock to also obtain a buffer header * lock before using PageGetLSN(), which is enforced in -- 2.17.1
Re: Store FullTransactionId in TwoPhaseFileHeader/GlobalTransactionData
On Fri, Aug 2, 2019 at 1:38 AM vignesh C wrote:
> On Thu, Aug 1, 2019 at 5:36 PM Thomas Munro wrote:
> > On Thu, Aug 1, 2019 at 9:32 PM vignesh C wrote:
> > > There is also one more comment which is yet to be concluded. The comment discusses changing subxids, which are of TransactionId type, to FullTransactionId type when written to the two-phase transaction file. We could not conclude this, as the data is stored similarly in TransactionStateData.
> > No comment on that question or the patch yet, but could you please add this to the next Commitfest so that cfbot starts testing it?
> I have added it to the commitfest.

Thanks. This looks pretty reasonable to me, and I don't think we need to worry about the subxid list for now. Here's a version I ran through pgindent to fix some whitespace problems. The pg_prepared_xacts view seems like as good a place as any to start showing FullTransactionId values to users. Here's an experimental patch on top to do that, introducing a new "xid8" type. After pg_resetwal -e 10 -D pgdata you can make transactions that look like this:

postgres=# select transaction, gid from pg_prepared_xacts ;
   transaction   | gid
-----------------+-----
 429496729600496 | tx1
(1 row)

-- Thomas Munro https://enterprisedb.com
0001-Use-FullTransactionId-for-two-phase-commit.patch Description: Binary data
0002-Add-SQL-type-xid8-to-expose-FullTransactionId-to-use.patch Description: Binary data
0003-Use-xid8-in-the-pg_prepared_xacts-view.patch Description: Binary data
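The sample value above decomposes cleanly if one models a FullTransactionId as a 32-bit epoch packed above a 32-bit xid: 429496729600496 splits into epoch 100000 and xid 496 under that packing. A quick check of the arithmetic (this models only the packing, not the server code):

```python
XID_BITS = 32

def full_xid(epoch, xid):
    """Pack a 32-bit epoch above a 32-bit xid, FullTransactionId-style."""
    assert 0 <= epoch < 2**XID_BITS and 0 <= xid < 2**XID_BITS
    return (epoch << XID_BITS) | xid

def split_xid(fxid):
    """Recover (epoch, xid) from the packed 64-bit value."""
    return fxid >> XID_BITS, fxid & (2**XID_BITS - 1)
```

full_xid(100000, 496) reproduces the 429496729600496 shown in the pg_prepared_xacts sample.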
Re: Partial join
On Thu, Aug 1, 2019 at 7:46 PM Arne Roland wrote:
> Hello Richard,
> thanks for your quick reply.
> > We need to fix this.
> Do you have a better idea than just keeping the old quals - possibly just the ones that get eliminated - in a separate data structure? Is the push down of quals the only case of elimination of quals, counting only the ones which happen before the restrict lists are generated?

In your case, the restriction 'sl = sl' is just not generated for the join, because it forms an EC with a const, which is not considered when generating join clauses. Please refer to the code snippet below:

@@ -1164,8 +1164,8 @@ generate_join_implied_equalities(PlannerInfo *root,
 List *sublist = NIL;
 /* ECs containing consts do not need any further enforcement */
 if (ec->ec_has_const)
 continue;

Thanks
Richard
Re: Partial join
On Thu, Aug 1, 2019 at 10:15 PM Tom Lane wrote:
> Richard Guo writes:
> > For the third query, a rough investigation shows that the quals 'sl = 5' and 'sc.sl = sg.sl' will form an equivalence class and generate two implied equalities: 'sc.sl = 5' and 'sg.sl = 5', which can be pushed down to the base rels. One consequence of the deduction is that when constructing restrict lists for the joinrel, we lose the original restriction 'sc.sl = sg.sl', and this would fail the check have_partkey_equi_join(), which checks whether there exists an equi-join condition for each pair of partition keys. As a result, this joinrel would not be considered as an input to further partitionwise joins.
> > We need to fix this.
> Uh ... why? The pushed-down restrictions should result in pruning away any prunable partitions at the scan level, leaving nothing for the partitionwise join code to do.

Hmm.. in the case of multiple partition keys, for range partitioning, if we have no clauses for a given key, any later keys would not be considered for partition pruning. That is to say, for a table 'p partition by range (k1, k2)', quals like 'k2 = Const' would not prune partitions. For the query:

select * from p as t1 join p as t2 on t1.k1 = t2.k1 and t1.k2 = t2.k2 and t1.k2 = 2;

since we don't consider ECs containing consts when generating join clauses, we don't have the restriction 't1.k2 = t2.k2' when building the joinrel. As a result, partitionwise join is not considered, as it requires an equi-join condition for each pair of partition keys. Is this a problem? What's your opinion? Thanks Richard
A couple of random BF failures in kerberosCheck
Hello, https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2019-07-24%2003%3A22%3A17 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-08-02%2007%3A17%3A25 I wondered if this might be like the recently fixed problem with slapd not being ready to handle requests yet, since we start up krb5kdc first and then don't do anything explicit to wait for it, but it doesn't look like an obvious failure to reach it. It looks like test 3 on elver connected successfully but didn't like the answer it got for this query: SELECT gss_authenticated AND encrypted from pg_stat_gssapi where pid = pg_backend_pid(); -- Thomas Munro https://enterprisedb.com
Re: [HACKERS] Cached plans and statement generalization
On 02.08.2019 at 10:57, Konstantin Knizhnik wrote:
On 02.08.2019 11:25, Daniel Migowski wrote:
I have two suggestions however:
1. Please allow gathering information about the autoprepared statements by returning them in the pg_prepared_statements view. I would love to monitor their usage as well as the memory consumption that occurs. I suggested a patch to display that in https://www.postgresql.org/message-id/41ed3f5450c90f4d8381bc4d8df6bbdcf02e1...@exchangeserver.ikoffice.de
Sorry, but there is a pg_autoprepared_statements view. I think it will be better to distinguish explicitly and implicitly prepared statements, won't it? Do you want to add more information to this view? Right now it shows the query string, the types of parameters, and the number of executions of this query.
Sorry, I didn't notice it in the patch. If there is another view for those, that's perfectly fine.
2. Please not only use an LRU list, but maybe something which would prefer statements that get reused at all. We often create ad hoc statements with parameters which don't really need to be kept. Maybe I can suggest an implementation of an LRU list where reuse of a statement not only pulls it to the top of the list but also increases a reuse counter. When a statement would then drop off the end of the list, one checks whether the reuse count is non-zero, and only really drops it if the reuse count is zero. Otherwise the reuse count is decremented (or halved) and the element is also placed at the start of the LRU list, so it is kept a bit longer.
There are many variations of LRU and alternatives: clock, segmented LRU, ... I agree that classical LRU may not be the best replacement policy for autoprepare. An application either uses static queries or constructs them dynamically. It will be better to distinguish queries executed at least twice from one-shot queries. So maybe SLRU would be the best choice.
But actually the situation you have described (we often create ad hoc statements with parameters which don't really need to be kept) is not applicable to autoprepare. Autoprepare deals only with statements which are actually executed. We have another patch which limits the number of stored prepared plans to avoid memory overflow (in case of using stored procedures which implicitly prepare all issued queries). Here the choice of replacement policy is more critical.
Alright, I just wanted to have something better than an LRU so that it is not a step back from the implementation in the JDBC driver for us. Kind regards, Daniel Migowski
Re: [HACKERS] Cached plans and statement generalization
On 02.08.2019 11:25, Daniel Migowski wrote:
On 01.08.2019 at 18:56, Konstantin Knizhnik wrote:
I decided to implement your proposal. A much simpler version of the autoprepare patch is attached. On my computer I got the following results:
pgbench -M simple -S: 22495 TPS
pgbench -M extended -S: 47633 TPS
pgbench -M prepared -S: 47683 TPS
So autoprepare speeds up execution of queries sent using the extended protocol more than twice, and it is almost the same as with explicitly prepared statements. I failed to create a regression test for it because I do not know how to force psql to use the extended protocol. Any advice is welcome.
I am very interested in such a patch, because currently I use the same functionality within my JDBC driver, and having this directly in PostgreSQL would surely speed things up. I have two suggestions however:
1. Please allow gathering information about the autoprepared statements by returning them in the pg_prepared_statements view. I would love to monitor their usage as well as the memory consumption that occurs. I suggested a patch to display that in https://www.postgresql.org/message-id/41ed3f5450c90f4d8381bc4d8df6bbdcf02e1...@exchangeserver.ikoffice.de
Sorry, but there is a pg_autoprepared_statements view. I think it will be better to distinguish explicitly and implicitly prepared statements, won't it? Do you want to add more information to this view? Right now it shows the query string, the types of parameters, and the number of executions of this query.
2. Please not only use an LRU list, but maybe something which would prefer statements that get reused at all. We often create ad hoc statements with parameters which don't really need to be kept. Maybe I can suggest an implementation of an LRU list where reuse of a statement not only pulls it to the top of the list but also increases a reuse counter. When a statement would then drop off the end of the list, one checks whether the reuse count is non-zero, and only really drops it if the reuse count is zero.
Else the reuse count is decremented (or halved) and the element is also placed at the start of the LRU list, so it is kept a bit longer.
There are many variations of LRU and alternatives: clock, segmented LRU, ... I agree that classical LRU may not be the best replacement policy for autoprepare. An application either uses static queries or constructs them dynamically. It will be better to distinguish queries executed at least twice from one-shot queries. So maybe SLRU would be the best choice. But actually the situation you have described (we often create ad hoc statements with parameters which don't really need to be kept) is not applicable to autoprepare. Autoprepare deals only with statements which are actually executed. We have another patch which limits the number of stored prepared plans to avoid memory overflow (in case of using stored procedures which implicitly prepare all issued queries). Here the choice of replacement policy is more critical.
Thanks, Daniel Migowski -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
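The eviction policy proposed above can be made concrete with a small sketch. This is an assumed design based on the description in the thread, not the patch's code: a hit moves an entry to the MRU end and bumps its reuse count; an entry falling off the LRU end is evicted only if its count is zero, otherwise the count is halved and the entry gets another trip through the list.

```python
from collections import OrderedDict

class ReuseAwareLRU:
    """LRU list with a per-entry reuse counter and second chances."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> reuse count, LRU end first

    def access(self, key):
        if key in self.entries:
            self.entries[key] += 1
            self.entries.move_to_end(key)  # pull to the MRU end
            return
        while len(self.entries) >= self.capacity:
            victim, count = next(iter(self.entries.items()))
            del self.entries[victim]
            if count == 0:
                break  # a true one-shot entry: evict it and stop
            # reused entries get a second chance with a decayed count
            self.entries[victim] = count // 2
            self.entries.move_to_end(victim)
        self.entries[key] = 0
```

This behaves like a crude segmented LRU: one-shot statements fall out quickly, while statements reused even once survive at least one extra round, which is exactly the distinction between one-shot and repeated queries discussed above.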
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Thu, Aug 1, 2019 at 11:05 PM Andres Freund wrote: > > I'm actually quite unconvinced that it's sensible to update the global > value for nested queries. That'll mean e.g. the log_line_prefix and > pg_stat_activity values are most of the time going to be bogus while > nested, because the querystring that's associated with those will *not* > be the value that the queryid corresponds to. elog.c uses > debug_query_string to log the statement, which is only updated for > top-level queries (outside of some exceptions like parallel workers for > parallel queries in a function or stuff like that). And pg_stat_activity > is also only updated for top level queries. Having the nested queryid does indeed seem quite broken for log_line_prefix. However, having the nested queryid in pg_stat_activity would be convenient to track what a long-running stored function is currently doing. Maybe we could expose something like top_level_queryid and current_queryid instead?
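The top_level_queryid / current_queryid split proposed above can be modeled as a stack that nested executor entry/exit pushes and pops. This is a hypothetical sketch of the idea only; the class and method names are not PostgreSQL APIs:

```python
class QueryIdTracker:
    """Tracks query IDs across nested execution, exposing both the
    top-level ID (stable, matching the debug_query_string used for
    log_line_prefix-style reporting) and the currently executing one
    (useful for a pg_stat_activity-style view)."""

    def __init__(self):
        self._stack = []

    def executor_start(self, queryid):
        # Called when a (possibly nested) query starts executing.
        self._stack.append(queryid)

    def executor_end(self):
        # Called when the innermost query finishes.
        self._stack.pop()

    @property
    def top_level_queryid(self):
        return self._stack[0] if self._stack else None

    @property
    def current_queryid(self):
        return self._stack[-1] if self._stack else None
```

With this split, log output keyed to the top-level ID stays consistent with the logged query string, while a monitoring view can still show what a stored function is doing internally.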
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 6:21 AM Michael Paquier wrote: > > On Thu, Aug 01, 2019 at 06:59:06PM -0700, Peter Geoghegan wrote: > > Seems like I should propose a patch this time around. I don't do Perl, > > but I suppose I could manage something as trivial as this. > > Well, that new project policy is not that well-advertised then, see > for example the recent 5925e55, c085e1c and 313f87a. So having some > kind of safety net would be nice. Trivial patch for that attached. The output is now like: [...] Using an OID in the 8000-8999 range is recommended. For instance: 9427 (checking that the suggested random OID is not used yet.) suggest_unused_oid.diff Description: Binary data
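The suggestion logic described here — pick a random OID in the development range and verify it is not already used — can be sketched as follows. The real unused_oids script is Perl and scans the catalog headers; this standalone model simply takes a set of used OIDs as input, and the function name is a hypothetical illustration:

```python
import random

def suggest_unused_oid(used_oids, low=8000, high=8999, seed=None):
    """Return a random unused OID in [low, high], mirroring the idea of
    suggesting non-colliding development OIDs to patch authors."""
    rng = random.Random(seed)
    free = [oid for oid in range(low, high + 1) if oid not in used_oids]
    if not free:
        raise RuntimeError("no free OIDs left in the development range")
    return rng.choice(free)
```

Building the free list first (rather than retrying random picks) guarantees termination and makes it trivial to detect an exhausted range.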
Re: pgbench - implement strict TPC-B benchmark
Hello Andres, Thanks a lot for this feedback and these comments. "Using pgbench -Mprepared -n -c 8 -j 8 -S pgbench_100 -T 10 -r -P1 e.g. shows pgbench to use 189% CPU in my 4/8 core/thread laptop. That's a pretty significant share." Fine, but what is the corresponding server load? 211%? 611%? And what actual time is spent in pgbench itself, vs libpq and syscalls? Figures and discussion below. "And before you argue that that's just about a read-only workload:" I'm fine with worst case scenarios :-) Let's do the worst on my laptop with 2 cores running at 2.2 GHz:

(0) we can run a script that really does nearly nothing:

sh> cat nope.sql
\sleep 0 # do not sleep, so stay awake…

sh> time pgbench -f nope.sql -T 10 -r
latency average = 0.000 ms
tps = 12569499.226367 (excluding connections establishing) # 12.6M
statement latencies in milliseconds:
0.000 \sleep 0
real 0m10.072s, user 0m10.027s, sys 0m0.012s

Unsurprisingly pgbench is at about 100% CPU load, and the transaction cost (transaction loop and stat collection) is 0.080 µs (1/12.6M) per script execution (one client on one thread).

(1) a pgbench complex-commands-only script:

sh> cat set.sql
\set x random_exponential(1, :scale * 10, 2.5) + 2.1
\set y random(1, 9) + 17.1 * :x
\set z case when :x > 7 then 1.0 / ln(:y) else 2.0 / sqrt(:y) end

sh> time pgbench -f set.sql -T 10 -r
latency average = 0.001 ms
tps = 1304989.729560 (excluding connections establishing) # 1.3M
statement latencies in milliseconds:
0.000 \set x random_exponential(1, :scale * 10, 2.5) + 2.1
0.000 \set y random(1, 9) + 17.1 * :x
0.000 \set z case when :x > 7 then 1.0 / ln(:y) else 2.0 / sqrt(:y) end
real 0m10.038s, user 0m10.003s, sys 0m0.000s

Again the pgbench load is near 100%, with only pgbench stuff (thread loop, expression evaluation, variables, stat collection) costing about 0.766 µs of CPU per script execution. This is about 10 times the previous case; 90% of the pgbench CPU cost is in expressions and variables, which is not a surprise. 
Probably this under-a-µs could be reduced… but what overall improvement would it provide? An answer with the last test:

(2) a ridiculously small SQL query, tested through a local unix socket:

sh> cat empty.sql
; # yep, an empty query!

sh> time pgbench -f empty.sql -T 10 -r
latency average = 0.016 ms
tps = 62206.501709 (excluding connections establishing) # 62.2K
statement latencies in milliseconds:
0.016 ;
real 0m10.038s, user 0m1.754s, sys 0m3.867s

We are adding minimal libpq and underlying system calls to pgbench's internal CPU costs, with the most favorable (or worst :-) SQL query over the most favorable postgres connection. The apparent load is about (1.754+3.867)/10.038 = 56%, so the CPU cost per script is 0.56 / 62206.5 = 9 µs, over 100 times the cost of the do-nothing script (0), and over 10 times the cost of the complex expression command script (1). Conclusion: pgbench-specific overheads are typically (much) below 10% of the total client-side CPU cost of a transaction, while over 90% of the CPU cost is spent in libpq and the system, even for the worst-case do-nothing query. A perfect bench driver with zero overhead would reduce the CPU cost by at most 10%, because you still have to talk to the database through the system. If the pgbench cost were divided by two, which would be a reasonable achievement, the benchmark client cost would be reduced by 5%. Wow? I have already given some thought in the past to optimizing pgbench, especially to avoid long switches (eg in expression evaluation) and maybe to improve variable management, but as shown above I would not expect a gain worth the effort, and assume that such a patch would probably be justly rejected, because for a realistic benchmark script these costs are already much less than the other inevitable libpq/syscall costs. That does not mean that nothing needs to be done, but the situation is currently quite good. 
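The overhead arithmetic above can be checked mechanically — the per-script CPU cost is the apparent CPU load share divided by the TPS (all numbers below are taken from the transcripts in this message; the helper function is just for illustration):

```python
def per_script_cpu_us(user_s, sys_s, wall_s, tps):
    """Client-side CPU microseconds spent per script execution:
    apparent CPU load share divided by throughput."""
    load = (user_s + sys_s) / wall_s          # apparent CPU load share
    return load / tps * 1_000_000, load

# Case (2): empty query over a local unix socket
cost2_us, load2 = per_script_cpu_us(1.754, 3.867, 10.038, 62206.5)
# load2 ≈ 0.56 (56%), cost2_us ≈ 9 µs per script

# Case (0): the do-nothing \sleep 0 script
cost0_us, load0 = per_script_cpu_us(10.027, 0.012, 10.072, 12569499.2)
# cost0_us ≈ 0.08 µs per script, i.e. over 100x cheaper than case (2)
```

This confirms the conclusion drawn in the message: the interpreter-only cost (case 0) is roughly two orders of magnitude below the cost of even an empty query once libpq and syscalls are involved.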
In conclusion, ISTM that current pgbench allows saturating a postgres server with a client significantly smaller than the server, which seems like a reasonable benchmarking situation. Any other driver in any other language would necessarily incur the same kind of costs. [...] "And the largest part of the overhead is in pgbench's interpreter loop:" Indeed, the figures below are very interesting! Thanks for collecting them.

+ 12.35%  pgbench  pgbench        [.] threadRun
+  3.54%  pgbench  pgbench        [.] dopr.constprop.0
+  3.30%  pgbench  pgbench        [.] fmtint
+  1.93%  pgbench  pgbench        [.] getVariable

~ 21%; probably some inlining has been performed, because I would have expected to see significant time in "advanceConnectionState".

+  2.95%  pgbench  libpq.so.5.13  [.] PQsendQueryPrepared
+  2.15%  pgbench  libpq.so.5.13  [.] pqPutInt
+  4.47%  pgbench  libpq.so.5.13  [.] pqParseInput3
+  1.6
Re: [HACKERS] Cached plans and statement generalization
Am 01.08.2019 um 18:56 schrieb Konstantin Knizhnik: I decided to implement your proposal. A much simpler version of the autoprepare patch is attached. On my computer I got the following results:
pgbench -M simple -S 22495 TPS
pgbench -M extended -S 47633 TPS
pgbench -M prepared -S 47683 TPS
So autoprepare speeds up the execution of queries sent using the extended protocol by more than a factor of two, making it almost as fast as with explicitly prepared statements. I failed to create a regression test for it because I do not know how to force psql to use the extended protocol. Any advice is welcome. I am very interested in such a patch, because currently I implement the same functionality within my JDBC driver, and having this directly in PostgreSQL would surely speed things up. I have two suggestions however: 1. Please allow gathering information about the autoprepared statements by returning them in the pg_prepared_statements view. I would love to monitor their usage as well as the memory consumption that occurs. I suggested a patch to display that in https://www.postgresql.org/message-id/41ed3f5450c90f4d8381bc4d8df6bbdcf02e1...@exchangeserver.ikoffice.de 2. Please do not use only an LRU list, but maybe something which would prefer statements that get reused at all. We often create ad hoc statements with parameters which don't really need to be kept. Maybe I can suggest an implementation of an LRU list where reuse of a statement not only pulls it to the top of the list but also increases a reuse counter. When a statement would then drop off the end of the list, one checks if the reuse count is non-zero, and only really drops it if the reuse count is zero. Otherwise the reuse count is decremented (or halved) and the element is also placed at the start of the LRU list, so it is kept a bit longer. Thanks, Daniel Migowski
Re: idea: log_statement_sample_rate - bottom limit for sampling
On 8/1/19 12:04 PM, Tomas Vondra wrote: > On Thu, Aug 01, 2019 at 11:47:46AM +0200, Adrien Nayrat wrote: >> Hi, >> >> As we are at the end of this CF and there are still discussions about whether >> we should revert log_statement_sample_limit and log_statement_sample_rate, or >> try to fix it in v12, I moved this patch to the next commit fest and changed its >> status from "ready for committer" to "need review". I hope I didn't make a mistake. >> > > Thanks. The RFC status was clearly stale, so thanks for updating. I should > have done that after my review. I think the patch would be moved to the > next CF at the end, but I might be wrong. In any case, I don't think > you've made any mistake. > > As for the sampling patch - I think we'll end up reverting the feature for > v12 - it's far too late to rework it at this point. Sorry about that, I > know it's not a warm feeling when you get something done, and then it gets > reverted at the last minute. :-( > Don't worry, I understand. It is better to add a straightforward GUC in v13 than a confusing one in v12 that we will regret.
Re: complier warnings from ecpg tests
Hi >> Thanks for the set of flags. So this comes from the use of -Og, and >> the rest of the tree does not complain. The issue is that gcc >> complains about the buffer not being large enough, but %d only uses up >> to 2 characters so there is no overflow. In order to fix the issue it >> is fine enough to increase the buffer size to 28 bytes, so I would >> recommend just doing that. This is similar to the business done in >> 3a4b891. > > And fixed with a9f301d. Thank you! My compiler is now quiet. Regards, Sergei
Re: [proposal] de-TOAST'ing using a iterator
On Tue, Jul 30, 2019 at 8:20 PM Binguo Bao wrote: > > John Naylor wrote on Monday, 2019-07-29 at 11:49: >> >> 1). For every needle comparison, text_position_next_internal() >> calculates how much of the value is needed and passes that to >> detoast_iterate(), which then calculates if it has to do something or >> not. This is a bit hard to follow. There might also be a performance >> penalty -- the following is just a theory, but it sounds plausible: >> The CPU can probably correctly predict that detoast_iterate() will >> usually return the same value it did last time, but it still has to >> call the function and make sure, which I imagine is more expensive >> than advancing the needle. Ideally, we want to call the iterator only >> if we have to. >> >> In the attached patch (applies on top of your v5), >> text_position_next_internal() simply compares hptr to the detoast >> buffer limit, and calls detoast_iterate() until it can proceed. I >> think this is clearer. > > > Yes, I think this is a general scenario where the caller continually > calls detoast_iterate until it gets enough data, so I think such operations can > be extracted as a macro, as I did in patch v6. In the macro, the > detoast_iterate > function is called only when the data requested by the caller is greater than > the > buffer limit. I like the use of a macro here. However, I think we can find a better location for the definition. See the header comment of fmgr.h: "Definitions for the Postgres function manager and function-call interface." Maybe tuptoaster.h is as good a place as any? >> I also noticed that no one updates or looks at >> "toast_iter.done" so I removed that as well. > > > toast_iter.done is now updated when the buffer limit reaches the buffer capacity. > So, I added it back. Okay. >> 2). detoast_iterate() and fetch_datum_iterate() return a value but we >> don't check it or do anything with it. Should we do something with it? 
>> It's also not yet clear if we should check the iterator state instead >> of return values. I've added some XXX comments as a reminder. We >> should also check the return value of pglz_decompress_iterate(). > > > IMO, we need to provide users with a simple iterative interface. > Using the required data pointer to compare with the buffer limit is an easy > way. > And the application scenarios of the iterator are mostly read operations. > So I think there is no need to return a value, and the iterator needs to > throw an > exception for some wrong calls, such as when all the data has been iterated > but the user still calls the iterator. Okay, I see these functions now return void. The original pglz_decompress() returned a value that was checked to detect corruption. Is there a similar check we can do for the iterative version? >> 3). Speaking of pglz_decompress_iterate(), I diff'd it with >> pglz_decompress(), and I have some questions on it: >> >> a). >> + srcend = (const unsigned char *) (source->limit == source->capacity >> ? source->limit : (source->limit - 4)); >> >> What does the 4 here mean in this expression? > > > Since we fetch chunks one by one, if we make srcend equal to the source > buffer limit, then in the while loop "while (sp < srcend && dp < destend)", sp may exceed the > source buffer limit and read unallocated bytes. Why is this? That tells me the limit is incorrect. Can the setter not determine the right value? > Giving a four-byte buffer can prevent sp from exceeding the source buffer > limit. Why 4? That's a magic number. Why not 2, or 27? > If we have read all the chunks, we don't need to be careful about crossing the > border, > and can just make srcend equal to the source buffer limit. I've added comments to explain > it in patch v6. That's a good thing to comment on, but it doesn't explain why. This logic seems like a band-aid and I think a committer would want this to be handled in a more principled way. 
>> Is it possible it's >> compensating for this bit in init_toast_buffer()? >> >> + buf->limit = VARDATA(buf->buf); >> >> It seems the initial limit should also depend on whether the datum is >> compressed, right? Can we just do this: >> >> + buf->limit = buf->position; > > > I'm afraid not. buf->position points to the data portion of the buffer, but > the beginning of > the chunks we read may contain header information. For example, for > compressed data chunks, > the first four bytes record the size of raw data, this means that limit is > four bytes ahead of position. > This initialization doesn't cause errors, although the position is less than > the limit in other cases. > Because we always fetch chunks first, then decompress it. I see what you mean now. This could use a comment or two to explain the stated constraints may not actually be satisfied at initialization. >> b). >> - while (sp < srcend && dp < destend) >> ... >> + while (sp + 1 < srcend && dp < destend && >> ... >> >> Why is it here "sp + 1"? > > > Ignore it, I set the inactive state of detoast_iter->ctrl to 8 in patch v
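The consumption pattern discussed in this thread — the caller compares its read position against the detoast buffer limit and calls the iterator only while more data is actually needed — can be sketched independently of the C macro. All names here are hypothetical stand-ins, not the patch's actual symbols:

```python
class DetoastBuffer:
    """Toy stand-in for an incremental de-TOAST buffer: iterate()
    appends one chunk; 'limit' marks how far the data is usable."""

    def __init__(self, chunks):
        self._pending = list(chunks)   # TOAST chunks not yet fetched
        self.data = bytearray()

    @property
    def limit(self):
        return len(self.data)

    @property
    def done(self):
        return not self._pending

    def iterate(self):
        # Corresponds to one detoast_iterate() call: make more bytes usable.
        self.data.extend(self._pending.pop(0))

def read_upto(buf, needed):
    """Caller-side loop: iterate only while the requested offset lies
    beyond the buffer limit, mirroring the macro's cheap comparison."""
    while buf.limit < needed and not buf.done:
        buf.iterate()
    return bytes(buf.data[:needed])
```

The cheap `limit < needed` comparison is the hot path; the iterator call happens only on the rare occasions when the buffer actually has to grow, which is the branch-prediction argument made earlier in the thread.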
Re: Hash join explain is broken
Hi, On 2019-07-02 10:50:02 -0400, Tom Lane wrote: > I wrote: > > Andres Freund writes: > >> Tom, any comments? Otherwise I'll go ahead, and commit after a round or > >> two of polishing. > > > Sorry for not getting to this sooner --- I'll try to look tomorrow. > > I took a look, and I think this is going in the right direction. > We definitely need a test case corresponding to the live bug, > and I think the comments could use more work, and there are some > cosmetic things (I wouldn't add the new struct Hash field at the > end, for instance). I finally pushed a substantially polished version of this. I ended up moving, as I had wondered about, hashoperator and hashcollation computation to the planner too - without that we would end up with two very similar loops during plan and execution time. I've added a test that puts subplans just about everywhere possible in a hash join - it's the only reliable way I found to trigger errors (only during EXPLAIN, as deparsing there tries to find the associated node, for column names etc, and failed because the subplan referenced an INNER_VAR, even though Hash doesn't have an inner plan). Makes the test query a bit hard to read, but I didn't get any better ideas, and it doesn't seem too bad. Thanks Tom for the review, thanks Alexander and Nikita for the report. Sorry that it took this long. Greetings, Andres Freund
Proposal: Clean up RangeTblEntry nodes after query preparation
Hello, I currently have the problem that a simple prepared statement for a query like select * from vw_report_invoice where id = $1 results in 33MB memory consumption on my side. It is a query that does about >20 joins over partially wide tables, but only a very small subset of columns is really needed. I already suspected that PreparedStatements contain all the metadata about all used tables, and it turns out it is even worse (that metadata is even copied into memory multiple times). The reason for the absurd memory consumption is the RangeTblEntry nodes, which are created for every touched table and for every join performed. I printed the query lists created after preparation and found multiple RangeTblEntry nodes containing >4000 columns, including the names of the columns as well as a list of all column types, even when only a very small subset is really required. The minimum memory consumption is 46 bytes + the name of the column, guessing it will be 64+32? bytes when palloced, resulting in ~400kB of memory for just one of the many RTEs in the large query, where maybe 50 columns are really used. One can test this easily. 
When I create two simple tables:

CREATE TABLE invoice (
  id serial PRIMARY KEY,
  number varchar(10),
  amount float8,
  customer_id int4
);

CREATE TABLE invoiceitem (
  id serial PRIMARY KEY,
  invoice_id int4,
  position int4,
  quantity float8,
  priceperunit float8,
  amount float8,
  description text
);

ALTER TABLE invoiceitem ADD CONSTRAINT fk_invoiceitem_invoice_id
  FOREIGN KEY (invoice_id) REFERENCES invoice(id);

And now I prepare a simple join over these tables:

PREPARE invoicequerytest AS
  SELECT inv.id, inv.number, item.id, item.description
  FROM invoice inv
  LEFT JOIN invoiceitem item ON item.invoice_id = inv.id
  WHERE inv.id = $1;

The pprint-ed RTE for the join alone is this:

{RTE
  :alias <>
  :eref {ALIAS :aliasname unnamed_join :colnames ("id" "number" "amount" "customer_id" "id" "invoice_id" "position" "quantity" "priceperunit" "amount" "description")}
  :rtekind 2
  :jointype 1
  :joinaliasvars (
    {VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location -1}
    {VAR :varno 1 :varattno 2 :vartype 1043 :vartypmod 14 :varcollid 100 :varlevelsup 0 :varnoold 1 :varoattno 2 :location -1}
    {VAR :varno 1 :varattno 3 :vartype 701 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 3 :location -1}
    {VAR :varno 1 :varattno 4 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 4 :location -1}
    {VAR :varno 2 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 1 :location -1}
    {VAR :varno 2 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 2 :location -1}
    {VAR :varno 2 :varattno 3 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 3 :location -1}
    {VAR :varno 2 :varattno 4 :vartype 701 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 4 :location -1}
    {VAR :varno 2 :varattno 5 :vartype 701 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 5 :location -1}
    {VAR :varno 2 :varattno 6 :vartype 701 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 6 :location -1}
    {VAR :varno 2 :varattno 7 :vartype 25 :vartypmod -1 :varcollid 100 :varlevelsup 0 :varnoold 2 :varoattno 7 :location -1}
  )
  :lateral false
  :inh false
  :inFromCl true
  :requiredPerms 0
  :checkAsUser 0
  :selectedCols (b)
  :insertedCols (b)
  :updatedCols (b)
  :securityQuals <>
}

It contains every col from both tables. Useful cols: "id" "number" "id" "invoice_id" "description" Useless because completely unreferenced cols: "amount" "customer_id" "position