Re: Adminpack removal
Le 27/06/2024 à 10:38, Matthias van de Meent a écrit : On Thu, 27 Jun 2024, 07:34 Philippe BEAUDOIN, wrote: Hi, I have just tested PG17 beta1 with the E-Maj solution I maintain. The only issue I found is the removal of the adminpack contrib. In the emaj extension, which is the heart of the solution, and which is written in plpgsql, the code just uses the pg_file_unlink() function to automatically remove files produced by COPY TO statements when no rows have been written. In some specific use cases, it spares the user from having to pick out a few interesting files among numerous empty files in a directory. I have found a workaround. It is a little bit ugly, but it works. So this is not blocking for me. FYI, the project's repo is on github (https://github.com/dalibo/emaj), which was supposed to be scanned to detect potential adminpack usages. The extension at first glance doesn't currently seem to depend on adminpack: it is not included in the control file as a dependency, and has not been included as a dependency since the creation of that file. You are right. Even before the adminpack usage removal, the extension was not listed as a prerequisite in the control file. In fact, I introduced a new E-Maj feature in last autumn's version that used the adminpack extension in one specific case. But the user may not install adminpack. In such a case, the feature was limited and a warning message told the user why it reached the limitation. I was waiting for some feedback before possibly adding adminpack as a real prerequisite. Where else would you expect us to search for dependencies? The word "adminpack" can be found in the sql source file (sql/emaj--4.4.0.sql), and in 2 documentation source files (in docs/en/*.rst). The pg_file_unlink() function name can be found in the same sql source file. But I understand that looking for simple strings in all types of files across a lot of repos is costly and may report a lot of noise.
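As a client-side illustration only (emaj does this server-side in plpgsql via adminpack's pg_file_unlink(); the function and file names below are made up for the sketch), the cleanup pattern described above amounts to:

```python
import os

def prune_empty_outputs(paths):
    """Delete zero-byte files left behind by a bulk export (e.g. COPY TO
    statements that wrote no rows), returning the files that actually
    contain data."""
    kept = []
    for path in paths:
        if os.path.getsize(path) == 0:
            os.unlink(path)  # analogous to pg_file_unlink() on the server
        else:
            kept.append(path)
    return kept
```

This is just a sketch of the logic; the extension itself has to do it server-side, which is why the loss of pg_file_unlink() needed a workaround.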
More broadly, my feeling is that just looking at public repositories is not enough. Usage of Postgres features can be found in: - public tools, visible in repos (on github, gitlab and some other platforms) ; - software from commercial vendors, so in closed source ; - and a huge number of applications developed inside all kinds of organizations, which are not public. So just looking at public repos probably covers less than 1% of the code. However, this may give a first idea, especially if a feature's use is already detected. In this "adminpack" case, it may be interesting to distinguish the pg_logdir_ls() function, which covers a very specific administration feature, from the other functions, which are of general interest. It wouldn't be surprising if pg_logdir_ls() were really obsolete now that it is not used by pgAdmin anymore, and thus it could be removed if nobody complains. Maybe the other functions could be integrated directly into core (or left in adminpack, with the pgAdmin reference removed from the documentation). Kind Regards. Philippe.
Re: Document use of ldapurl with LDAP simple bind
On 24.05.24 20:54, Jacob Champion wrote: Our documentation implies that the ldapurl setting in pg_hba is used for search+bind mode only. It was pointed out to me recently that this is not true, and if you're dealing with simple bind on a non-standard scheme or port, then ldapurl makes the HBA easier to read: ... ldap ldapurl="ldaps://ldap.example.net:49151" ldapprefix="cn=" ldapsuffix=", dc=example, dc=net" 0001 tries to document this helpful behavior a little better, and 0002 pins it with a test. WDYT? Yes, this looks correct. Since ldapurl is really just a shorthand that is expanded to various other parameters, it makes sense that it would work for simple bind as well. hba.c has this error message: "cannot use ldapbasedn, ldapbinddn, ldapbindpasswd, ldapsearchattribute, ldapsearchfilter, or ldapurl together with ldapprefix" This appears to imply that specifying ldapurl is only applicable for search+bind. Maybe that whole message should be simplified to something like "configuration mixes arguments for simple bind and search+bind" (The old wording also ignores that the error might arise via "ldapsuffix".)
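For illustration, a sketch of the two spellings side by side (the host/database/user fields are placeholders, and the LDAP host and port come from Jacob's example above; the hand-expanded form is my reading of how ldapurl decomposes into the individual settings):

```
# simple bind, with scheme and port folded into ldapurl
host all all 0.0.0.0/0 ldap ldapurl="ldaps://ldap.example.net:49151" ldapprefix="cn=" ldapsuffix=", dc=example, dc=net"

# presumed equivalent using the individual settings
host all all 0.0.0.0/0 ldap ldapscheme=ldaps ldapserver=ldap.example.net ldapport=49151 ldapprefix="cn=" ldapsuffix=", dc=example, dc=net"
```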
Re: Support "Right Semi Join" plan shapes
On Fri, Jun 28, 2024 at 2:54 PM Richard Guo wrote: > On Mon, Jun 24, 2024 at 5:59 PM Richard Guo wrote: > > I noticed that this patch changes the plan of a query in join.sql from > > a semi join to right semi join, compromising the original purpose of > > this query, which was to test the fix for neqjoinsel's behavior for > > semijoins (see commit 7ca25b7d). > > > > -- > > -- semijoin selectivity for <> > > -- > > explain (costs off) > > select * from int4_tbl i4, tenk1 a > > where exists(select * from tenk1 b > > where a.twothousand = b.twothousand and a.fivethous <> > > b.fivethous) > > and i4.f1 = a.tenthous; > > > > So I've changed this test case a bit so that it is still testing what it > > is supposed to test. > > I've refined this test case further to make it more stable by using an > additional filter 'a.tenthous < 5000'. Besides, I noticed a surplus > blank line in ExecHashJoinImpl(). I've removed it in the v7 patch. BTW, I've also verified the empty-rel optimization for hash join and AFAICT it works correctly for the new right-semi join. Thanks Richard
Re: walsender.c comment with no context is hard to understand
On Fri, Jun 28, 2024 at 4:18 PM Amit Kapila wrote: > > On Fri, Jun 28, 2024 at 5:15 AM Peter Smith wrote: > > > > On Thu, Jun 27, 2024 at 3:44 PM Michael Paquier wrote: > > > > > > On Wed, Jun 26, 2024 at 02:30:26PM +0530, vignesh C wrote: > > > > On Mon, 3 Jun 2024 at 11:21, Peter Smith wrote: > > > >> Perhaps the comment should say something like it used to: > > > >> /* Fail if there is not enough WAL available. This can happen during > > > >> shutdown. */ > > > > > > > > Agree with this, +1 for this change. > > > > > > That would be an improvement. Would you like to send a patch with all > > > the areas you think could stand for improvements? > > > -- > > > > OK, I attached a patch equivalent to the suggestion in this thread. > > > > Shouldn't the check for flushptr (if (flushptr < targetPagePtr + > reqLen)) be moved immediately after the call to WalSndWaitForWal()? > The comment seems to suggest the same: "Make sure we have enough WAL > available before retrieving the current timeline. .." > Yes, as I wrote in the first post, those lines were once adjacent in logical_read_xlog_page. I also wondered if they still belonged together, but I opted for the safest option of fixing only the comment instead of refactoring old code when no problem had been reported. AFAICT these lines became separated due to a 2023 patch [1], and you were one of the reviewers of that patch, so I assumed the code separation was deemed OK at that time. Unless it was some mistake that slipped past multiple reviewers? == [1] https://github.com/postgres/postgres/commit/0fdab27ad68a059a1663fa5ce48d76333f1bd74c#diff-da7052ce339ec037f5c721e08a9532b1fcfb4405966cf6a78d0c811550fd7b43 Kind Regards, Peter Smith. Fujitsu Australia
Re: Flush pgstats file during checkpoints
On 18/06/2024 9:01 am, Michael Paquier wrote: Hi all, On HEAD, xlog.c has the following comment, which has been on my own TODO list for a couple of weeks now: * TODO: With a bit of extra work we could just start with a pgstat file * associated with the checkpoint redo location we're starting from. Please find a patch series to implement that, giving the possibility to keep statistics after a crash rather than discard them. I have been looking at the code for a while, before settling down on: - Forcing the flush of the pgstats file to happen during non-shutdown checkpoints and restart points, after updating the control file's redo LSN and the critical sections in the area. - Leaving the current before_shmem_exit() callback around, as a matter of delaying the flush of the stats for as long as possible in a shutdown sequence. This also makes the single-user mode shutdown simpler. - Adding the redo LSN to the pgstats file, with a bump of PGSTAT_FILE_FORMAT_ID, cross-checked with the redo LSN. This change is independently useful when loading stats after a clean startup, as well. - The crash recovery case is simplified, as there is no more need for the "discard" code path. - Not using logic where I would need to stick an LSN into the stats file name, implying that we would need a potential lookup at the contents of pg_stat/ to clean up past files at crash recovery. These should not be costly, but I'd rather not add more of these. 7ff23c6d277d, which removed the last call of CreateCheckPoint() from the startup process, is older than 5891c7a8ed8f; still, it seems to me that pgstats relies on some areas of the code that don't make sense on HEAD (see locking mentioned at the top of the write routine, for example).
The logic gets straightforward to think about, as restart points and checkpoints always run from the checkpointer, implying that pgstat_write_statsfile() is already called only from the postmaster in single-user mode or the checkpointer itself, at shutdown. Attached is a patch set, with the last one being the actual feature, with some stuff prior to that: - 0001 adds the redo LSN to the pgstats file flushed. - 0002 adds an assertion in pgstat_write_statsfile(), to check from where it is called. - 0003 with more debugging. - 0004 is the meat of the thread. I am adding that to the next CF. Thoughts and comments are welcome. Thanks, -- Michael Hi Michael. I am working mostly on the same problem - persisting pgstat state in Neon (because of the separation of storage and compute, it has no local files). I have two questions concerning this PR and the whole strategy for saving pgstat state between sessions. 1. Right now pgstat.stat is discarded after abnormal Postgres termination. And in your PR we store the LSN in pgstat.stat to check that it matches the checkpoint redo LSN. I wonder if having an outdated version of pgstat.stat is worse than not having it at all? The comment in xlog.c says: "When starting with crash recovery, reset pgstat data - it might not be valid." But why might it be invalid? We write it first to a temp file and then rename it. Maybe fsync() should be added here and durable_rename() should be used instead of rename(). But that seems better than losing statistics, and it should not add significant overhead (because it happens only at shutdown). In your case we check the LSN of the pgstat.stat file. But once again - why is it better to discard the file than to load the version from the previous checkpoint? 2. The second question is related to 1). So we saved pgstat.stat on a checkpoint, then did some updates, and then Postgres crashed. As far as I understand, with your patch we will successfully restore the pgstat.stat file.
But it is not actually up-to-date: it doesn't reflect recent updates. If it were so critical to keep pgstat.stat up-to-date, then why do we allow restoring state from the most recent checkpoint? Thanks, Konstantin
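For illustration, the write-temp-then-rename sequence Konstantin describes only becomes crash-safe once both the file and its containing directory are fsync'ed, which is roughly what durable_rename() adds over a bare rename(). A minimal sketch of that pattern in Python (names are illustrative, not the server code):

```python
import os

def durable_rename(tmp_path: str, final_path: str) -> None:
    """Write-then-rename that survives a crash: fsync the data file
    before the rename, then fsync the containing directory so the
    directory entry itself reaches stable storage."""
    # flush the temp file's contents to disk
    fd = os.open(tmp_path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    # atomically replace the destination
    os.replace(tmp_path, final_path)
    # persist the rename by fsyncing the parent directory
    dirfd = os.open(os.path.dirname(final_path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Without the two fsync calls, a crash shortly after rename() can leave either an empty/partial file or no directory entry at all, which is exactly why an un-fsynced stats file cannot be trusted after crash recovery.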
Re: Converting README documentation to Markdown
On 15.05.24 14:26, Daniel Gustafsson wrote: Another aspect of platform/flavour was to make the markdown version easy to maintain for hackers writing content. Requiring the minimum amount of markup seems like the developer-friendly way here to keep productivity as well as document quality high. Most importantly though, I targeted reading the files as plain text without any rendering. We keep these files in text format close to the code for a reason, and maintaining readability as text was a north star. I've been thinking about this some more. I think the most value here would be to just improve the plain-text formatting, so that there are consistent list styles, header styles, indentation, some of the ambiguities cleared up -- much of which your 0001 patch does. You might as well be targeting markdown-like conventions with this; they are mostly reasonable. I tend to think that actually converting all the README files to README.md could be a net negative for maintainability. Because now you are requiring everyone who potentially wants to edit those to be aware of Markdown syntax and manually check the rendering. With things like DocBook, if you make a mess, you get error messages from the build step. If you make a mess in Markdown, you have to visually find it yourself. There are many READMEs that contain nested lists and code snippets and diagrams and such all mixed together. Getting that right in Markdown can be quite tricky. I'm also foreseeing related messes of trailing whitespace, spaces-vs-tab confusion, gitattributes violations, etc. It can be a lot of effort. It's okay to do this for prominent files like the top-level one, but I suggest that for the rest we can keep it simple and just use plain text.
Re: remaining sql/json patches
Hi Amit, 28.06.2024 09:15, Amit Langote wrote: Hi Alexander, Thanks for the report. Yeah, those comments that got added in 7081ac46ace are obsolete. Thanks for paying attention to that! Could you also look at the comments for transformJsonObjectAgg() and transformJsonArrayAgg()? Aren't they obsolete too? Best regards, Alexander
Re: Improve EXPLAIN output for multicolumn B-Tree Index
On Fri, 28 Jun 2024 at 00:41, Peter Geoghegan wrote: > Typically, no, it won't be. But there's really no telling for sure. > The access patterns for a composite index on '(a, b)' with a qual > "WHERE b = 5" are identical to a qual explicitly written "WHERE a = > any() AND b = 5". Hmm, that's true. But in that case the explain plan gives a clear hint that something like that might be going on, because you'll see: Index Cond: a = any() AND b = 5 That does make me think of another, maybe more "correct", way of representing Masahiro's situation than adding a new "Skip Scan Cond" row to the EXPLAIN output. We could explicitly include a comparison to all prefix columns in the Index Cond: Index Cond: ((test.id1 = 1) AND (test.id2 = ANY) AND (test.id3 = 101)) Or, if you want to go with syntactically correct SQL, we could do the following: Index Cond: ((test.id1 = 1) AND ((test.id2 IS NULL) OR (test.id2 IS NOT NULL)) AND (test.id3 = 101)) An additional benefit this provides is that you now know which additional column you should use a more specific filter on to speed up the query. In this case, test.id2. OTOH, maybe it's not a great way, because actually running that puts the IS NULL + IS NOT NULL condition in the Filter clause (which honestly surprises me, because I had expected this "always true expression" to be optimized away by the planner).

> EXPLAIN (VERBOSE, ANALYZE) SELECT id1, id2, id3 FROM test WHERE id1 = 1 AND
> (id2 IS NULL OR id2 IS NOT NULL) AND id3 = 101;
>                              QUERY PLAN
> ─
> Index Only Scan using test_idx on public.test (cost=0.42..12809.10 rows=1 width=12) (actual time=0.027..11.234 rows=1 loops=1)
>   Output: id1, id2, id3
>   Index Cond: ((test.id1 = 1) AND (test.id3 = 101))
>   Filter: ((test.id2 IS NULL) OR (test.id2 IS NOT NULL))

> What about cases where we legitimately have to vary our strategy > during the same index scan? Would my new suggestion above address this?
> In fact, that'd make sense even today, without skip scan (just with > the 17 work on nbtree SAOP scans). Even with regular SAOP nbtree index > scans, the number of primitive scans is hard to predict, and quite > indicative of what's really going on with the scan. *googles nbtree SAOP scans and finds the very helpful [1]* Yes, I feel like this should definitely be part of the ANALYZE output. Seeing how Lukas has to look at pg_stat_user_tables to get this information seems quite annoying [2], and that is only possible on systems that have no concurrent queries. So it sounds like we'd want a "Primitive Index Scans" counter in ANALYZE too, in addition to the number of rows filtered, which, if we go with my proposal above, should probably be called "Rows Removed by Index Cond". [1]: https://www.youtube.com/watch?v=jg2KeSB5DR8 [2]: https://youtu.be/jg2KeSB5DR8?t=188
Re: walsender.c comment with no context is hard to understand
On Fri, Jun 28, 2024 at 12:55 PM Peter Smith wrote: > > On Fri, Jun 28, 2024 at 4:18 PM Amit Kapila wrote: > > > > On Fri, Jun 28, 2024 at 5:15 AM Peter Smith wrote: > > > > > > On Thu, Jun 27, 2024 at 3:44 PM Michael Paquier > > > wrote: > > > > > > > > On Wed, Jun 26, 2024 at 02:30:26PM +0530, vignesh C wrote: > > > > > On Mon, 3 Jun 2024 at 11:21, Peter Smith > > > > > wrote: > > > > >> Perhaps the comment should say something like it used to: > > > > >> /* Fail if there is not enough WAL available. This can happen during > > > > >> shutdown. */ > > > > > > > > > > Agree with this, +1 for this change. > > > > > > > > That would be an improvement. Would you like to send a patch with all > > > > the areas you think could stand for improvements? > > > > -- > > > > > > OK, I attached a patch equivalent to the suggestion in this thread. > > > > > > > Shouldn't the check for flushptr (if (flushptr < targetPagePtr + > > reqLen)) be moved immediately after the call to WalSndWaitForWal()? > > The comment seems to suggest the same: "Make sure we have enough WAL > > available before retrieving the current timeline. .." > > > > Yes, as I wrote in the first post, those lines were once > adjacent in logical_read_xlog_page. > > I also wondered if they still belonged together, but I opted for the > safest option of fixing only the comment instead of refactoring old > code when no problem had been reported. > > AFAICT these lines became separated due to a 2023 patch [1], and you > were one of the reviewers of that patch, so I assumed the code > separation was deemed OK at that time. Unless it was some mistake that > slipped past multiple reviewers? > I don't know whether your assumption is correct. AFAICS, those two lines should be together. Let us see if Bertrand remembers anything. -- With Regards, Amit Kapila.
Re: Converting README documentation to Markdown
On Fri, 28 Jun 2024 at 09:38, Peter Eisentraut wrote: > Getting that right in Markdown can be quite tricky. I agree that in some cases it's tricky. But what's the worst case that can happen when you get it wrong? It renders weird on github.com. Luckily there's a "code" button to go to the plain text format[1]. In all other cases (which I expect will be most) the doc will be easier to read. Forcing plaintext, just because sometimes we might make a mistake in the syntax seems like an overcorrection imho. Especially because these docs are (hopefully) read more often than written. [1]: https://github.com/postgres/postgres/blob/master/README.md?plain=1
Re: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
On Sun, 23 Jun 2024 at 22:30, Alexander Lakhin wrote: > Unfortunately, the buildfarm log doesn't contain regress_log_002_limits, > but I managed to reproduce the failure on my device, when its storage > is as slow as: > $ dd if=/dev/zero of=./test count=1024 oflag=dsync bs=128k > 1024+0 records in > 1024+0 records out > 134217728 bytes (134 MB, 128 MiB) copied, 33.9446 s, 4.0 MB/s > > For the past ~1 week, I have tried to space out all other tasks on the machine, so as to ensure that 1-min CPU is mostly <2 (and thus not many things hammering the disk), and with that I see 0 failures these past few days. This isn't conclusive by any means, but it does seem that reducing IO contention has helped remove the errors, much as Alexander suspects / reproduces here. Just a note that I've reverted some of those recent changes now, so if the theory holds true, I wouldn't be surprised if some of these errors restarted on dodo. - robins
Re: pg_createsubscriber: drop pre-existing subscriptions from the converted node
On Thu, Jun 27, 2024 at 11:47 AM Hayato Kuroda (Fujitsu) wrote: > > > It seems disabling subscriptions on the primary can make the primary > > stop functioning for some duration of time. I feel we need some > > solution where after converting to subscriber, we disable and drop > > pre-existing subscriptions. One idea could be that we use the list of > > new subscriptions created by the tool such that any subscription not > > existing in that list will be dropped. > > Previously I avoided coding like yours, because there was a possibility that the > converted node could connect to another publisher. But per off-list discussion, we > can skip it by setting max_logical_replication_workers = 0. I refactored with that > approach. > Note that the GUC is checked at the verification phase, so an attribute is added to > start_standby_server() to select the workload. > Thanks, this is a better approach. I have changed a few comments and made some other cosmetic changes. See attached. Euler, Peter E., and others, do you have any comments/suggestions? BTW, why have you created a separate test file for this test? I think we should add a new test to one of the existing tests in 040_pg_createsubscriber. You can create a dummy subscription on node_p and do a test similar to what we are doing in "# Create failover slot to test its removal". -- With Regards, Amit Kapila. v3-0001-pg_createsubscriber-Drop-pre-existing-subscriptio.patch Description: Binary data
Re: Showing applied extended statistics in explain Part 2
Hi Tomas and All, The attached file is a new patch including: 6) Add stats option to explain command 7) The patch really needs some docs (partly) > 4) Add new node (resolve errors in cfbot and prepared statement) I tried adding a new node in pathnode.h, but it doesn't work well. So it needs more time to implement successfully, because this is my first time adding a new node there. > 8) Add regression test (stats_ext.sql) Actually, I am not yet able to add new test cases to stats_ext.sql. Instead, I created a simple test (test.sql) and have attached it. Also, output.txt is the test result. To add new test cases to stats_ext.sql, I'd like to decide on a strategy for modifying it. In particular, there are 381 places where the check_estimated_rows function is used, so should I include the same number of tests, or should we include the bare minimum of tests that cover the code path? I think only the latter would be fine. Any advice is appreciated. :-D P.S. I'm going to investigate how to use CI this weekend, hopefully.
Regards,
Tatsuro Yamada

CREATE FUNCTION
CREATE FUNCTION
CREATE TABLE
INSERT 0 10

QUERY PLAN
---
Seq Scan on t  (cost=0.00..1944.77 rows=3 width=8) (actual time=0.049..37.182 rows=1 loops=1)
  Filter: ((a = 1) AND (b = 1))
  Rows Removed by Filter: 9
Planning Time: 0.380 ms
Execution Time: 37.751 ms
(5 rows)

clause | extstats
+--
(0 rows)

QUERY PLAN
---
HashAggregate  (cost=1944.77..2044.89 rows=10012 width=12)
  Group Key: a, b
  ->  Seq Scan on t  (cost=0.00..1444.18 rows=100118 width=8)
(3 rows)

clause | extstats
+--
(0 rows)

CREATE STATISTICS
ANALYZE

List of extended statistics
Schema | Name | Definition  | Ndistinct | Dependencies | MCV
+--+-+---+--+-
public | s    | a, b FROM t | defined   | defined      | defined
(1 row)

QUERY PLAN
--
Seq Scan on t  (cost=0.00..1943.00 rows=9947 width=8) (actual time=0.021..26.203 rows=1 loops=1)
  Filter: ((a = 1) AND (b = 1))
  Rows Removed by Filter: 9
  Ext Stats: public.s Clauses: ((a = 1) AND (b = 1))
Planning Time: 0.169 ms
Execution Time: 26.958 ms
(6 rows)

clause | extstats
---+--
Filter: ((a = 1) AND (b = 1)) | public.s Clauses: ((a = 1) AND (b = 1))
(1 row)

QUERY PLAN
---
HashAggregate  (cost=1943.00..1943.10 rows=10 width=12) (actual time=96.912..96.918 rows=10 loops=1)
  Group Key: a, b
  Batches: 1  Memory Usage: 24kB
  ->  Seq Scan on t  (cost=0.00..1443.00 rows=10 width=8) (actual time=0.026..24.280 rows=10 loops=1)
  Ext Stats: public.s Clauses: a, b
Planning Time: 0.353 ms
Execution Time: 96.974 ms
(7 rows)

clause | extstats
-+-
Group Key: a, b | public.s Clauses: a, b
(1 row)

SET

QUERY PLAN
---
HashAggregate  (cost=1943.00..1943.10 rows=10 width=12) (actual time=89.360..89.366 rows=10 loops=1)
  Group Key: a, b
  Batches: 1  Memory Usage: 24kB
  ->  Seq Scan on t  (cost=0.00..1443.00 rows=10 width=8) (actual time=0.017..19.903 rows=10 loops=1)
  Ext Stats: public.s Clauses: a, b
  Ext Stats: public.s Clauses: a, b
Planning Time: 0.094 ms
Execution Time: 89.409 ms
(8 rows)

clause | extstats
-+-
Group Key: a, b | public.s Clauses: a, b
(1 row)

PREPARE

QUERY PLAN
--
Seq Scan on t  (cost=0.00..1943.00 rows=9850 width=8) (actual time=0.024..27.818 rows=1 loops=1)
  Filter: ((a = 5) AND (b = 5))
  Rows Removed by Filter: 9
Planning Time: 0.108 ms
Execution Time: 28.510 ms
(5 rows)

DROP TABLE
DROP FUNCTION

List of extended statistics
Schema
Re: pgsql: Add more SQL/JSON constructor functions
On Fri, Jun 28, 2024 at 3:14 PM jian he wrote: > On Thu, Jun 27, 2024 at 7:48 PM Amit Langote wrote: > > > > > > I've attempted that in the attached 0001, which removes > > > JsonExpr.coercion_expr and a bunch of code around it. > > > > > > 0002 is now the original patch minus the changes to make > > > JSON_EXISTS(), JSON_QUERY(), and JSON_VALUE() behave as we would like, > > > because the changes in 0001 cover them. The changes for JsonBehavior > > > expression coercion as they were in the last version of the patch are > > > still needed, but I decided to move those into 0001 so that the > > > changes for query functions are all in 0001 and those for constructors > > > in 0002. It would be nice to get rid of that coerce_to_target_type() > > > call to coerce the "behavior expression" to RETURNING type, but I'm > > > leaving that as a task for another day. > > > > Updated 0001 to remove outdated references, remove some more unnecessary > > code. > > > > I found some remaining references to "coercion_expr" that should be removed. > > src/include/nodes/primnodes.h > /* JsonExpr's collation, if coercion_expr is NULL. */ > > > src/include/nodes/execnodes.h > /* > * Address of the step to coerce the result value of jsonpath evaluation > * to the RETURNING type using JsonExpr.coercion_expr. -1 if no coercion > * is necessary or if either JsonExpr.use_io_coercion or > * JsonExpr.use_json_coercion is true. > */ > int jump_eval_coercion; > > src/backend/jit/llvm/llvmjit_expr.c > /* coercion_expr code */ > LLVMPositionBuilderAtEnd(b, b_coercion); > if (jsestate->jump_eval_coercion >= 0) > LLVMBuildBr(b, opblocks[jsestate->jump_eval_coercion]); > else > LLVMBuildUnreachable(b); > > > src/backend/executor/execExprInterp.c > /* > * Checks if an error occurred either when evaluating JsonExpr.coercion_expr > or > * in ExecEvalJsonCoercion(). If so, this sets JsonExprState.error to trigger > * the ON ERROR handling steps.
> */ > void > ExecEvalJsonCoercionFinish(ExprState *state, ExprEvalStep *op) > { > } > > if (jbv == NULL) > { > /* Will be coerced with coercion_expr, if any. */ > *op->resvalue = (Datum) 0; > *op->resnull = true; > } > > > src/backend/executor/execExpr.c > /* > * Jump to coerce the NULL using coercion_expr if present. Coercing NULL > * is only interesting when the RETURNING type is a domain whose > * constraints must be checked. jsexpr->coercion_expr containing a > * CoerceToDomain node must have been set in that case. > */ > > /* > * Jump to coerce the NULL using coercion_expr if present. Coercing NULL > * is only interesting when the RETURNING type is a domain whose > * constraints must be checked. jsexpr->coercion_expr containing a > * CoerceToDomain node must have been set in that case. > */ Thanks for checking. Will push the attached 0001 and 0002 shortly. -- Thanks, Amit Langote v5-0001-SQL-JSON-Fix-coercion-of-constructor-outputs-to-t.patch Description: Binary data v5-0002-SQL-JSON-Always-coerce-JsonExpr-result-at-runtime.patch Description: Binary data
Re: Improve EXPLAIN output for multicolumn B-Tree Index
On Fri, 28 Jun 2024 at 10:59, Jelte Fennema-Nio wrote: > > On Fri, 28 Jun 2024 at 00:41, Peter Geoghegan wrote: > > Typically, no, it won't be. But there's really no telling for sure. > > The access patterns for a composite index on '(a, b)' with a qual > > "WHERE b = 5" are identical to a qual explicitly written "WHERE a = > > any() AND b = 5". > > Hmm, that's true. But in that case the explain plan gives a clear hint > that something like that might be going on, because you'll see: > > Index Cond: a = any() AND b = 5 > > That does make me think of another way, and maybe more "correct" way, > of representing Masahiros situation than adding a new "Skip Scan Cond" > row to the EXPLAIN output. We could explicitly include a comparison to > all prefix columns in the Index Cond: > > Index Cond: ((test.id1 = 1) AND (test.id2 = ANY) AND (test.id3 = 101)) > > Or if you want to go with syntactically correct SQL we could do the following: > > Index Cond: ((test.id1 = 1) AND ((test.id2 IS NULL) OR (test.id2 IS > NOT NULL)) AND (test.id3 = 101)) > > An additional benefit this provides is that you now know which > additional column you should use a more specific filter on to speed up > the query. In this case test.id2 > > OTOH, maybe it's not a great way because actually running that puts > the IS NULL+ IS NOT NULL query in the Filter clause (which honestly > surprises me because I had expected this "always true expression" > would have been optimized away by the planner). > > > EXPLAIN (VERBOSE, ANALYZE) SELECT id1, id2, id3 FROM test WHERE id1 = 1 AND > > (id2 IS NULL OR id2 IS NOT NULL) AND id3 = 101; >QUERY PLAN > ─ > Index Only Scan using test_idx on public.test (cost=0.42..12809.10 > rows=1 width=12) (actual time=0.027..11.234 rows=1 loops=1) >Output: id1, id2, id3 >Index Cond: ((test.id1 = 1) AND (test.id3 = 101)) >Filter: ((test.id2 IS NULL) OR (test.id2 IS NOT NULL)) > > > What about cases where we legitimately have to vary our strategy > > during the same index scan? 
> > Would my new suggestion above address this? > > > In fact, that'd make sense even today, without skip scan (just with > > the 17 work on nbtree SAOP scans). Even with regular SAOP nbtree index > > scans, the number of primitive scans is hard to predict, and quite > > indicative of what's really going on with the scan. > > *googles nbtree SAOP scans and finds the very helpful[1]* > > Yes, I feel like this should definitely be part of the ANALYZE output. > Seeing how Lukas has to look at pg_stat_user_tables to get this > information seems quite annoying[2] and only possible on systems that > have no concurrent queries. > > So it sounds like we'd want a "Primitive Index Scans" counter in > ANALYZE too. In addition to the number of filtered rows by, which if > we go with my proposal above should probably be called "Rows Removed > by Index Cond". This all just made me more confident that this shows a need to enable index AMs to provide output for EXPLAIN: The knowledge about how index scankeys are actually used is exclusively known to the index AM, because the strategies are often unique to the index AM (or even chosen operator classes), and sometimes can only be applied at runtime: while the index scankeys' sizes, attribute numbers and operators are known in advance (even if not all arguments are filled in; `FROM a JOIN b ON b.id = ANY (a.ref_array)`), the AM can at least show what strategy it is likely going to choose, and how (in)efficient that strategy could be. Right now, Masahiro-san's patch tries to do that with an additional field in IndexPath populated (by proxy) exclusively in btcostestimate. I think that design is wrong, because it wires explain- and btree-specific data through the planner, adding overhead everywhere which is only used for btree- and btree-compatible indexes. I think the better choice would be adding an IndexAmRoutine->amexplain support function, which would get called in e.g. 
explain.c's ExplainIndexScanDetails to populate a new "Index Scan Details" (name to be bikeshed) subsection of explain plans. This would certainly be possible, as the essentials for outputting things to EXPLAIN are readily available in the explain.h header. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Parallel heap vacuum
On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada wrote: > > # Benchmark results > > * Test-1: parallel heap scan on the table without indexes > > I created a 20GB table, made garbage on the table, and ran vacuum while > changing the parallel degree: > > create unlogged table test (a int) with (autovacuum_enabled = off); > insert into test select generate_series(1, 6); --- 20GB table > delete from test where a % 5 = 0; > vacuum (verbose, parallel 0) test; > > Here are the results (total time and heap scan time): > > PARALLEL 0: 21.99 s (single process) > PARALLEL 1: 11.39 s > PARALLEL 2: 8.36 s > PARALLEL 3: 6.14 s > PARALLEL 4: 5.08 s > > * Test-2: parallel heap scan on the table with one index > > I used a similar table to test case 1 but created one btree index on it: > > create unlogged table test (a int) with (autovacuum_enabled = off); > insert into test select generate_series(1, 6); --- 20GB table > create index on test (a); > delete from test where a % 5 = 0; > vacuum (verbose, parallel 0) test; > > I've measured the total execution time as well as the time of each > vacuum phase (from left: heap scan time, index vacuum time, and heap > vacuum time): > > PARALLEL 0: 45.11 s (21.89, 16.74, 6.48) > PARALLEL 1: 42.13 s (12.75, 22.04, 7.23) > PARALLEL 2: 39.27 s (8.93, 22.78, 7.45) > PARALLEL 3: 36.53 s (6.76, 22.00, 7.65) > PARALLEL 4: 35.84 s (5.85, 22.04, 7.83) > > Overall, I can see that the parallel heap scan in lazy vacuum has decent > scalability; in both test-1 and test-2, the execution time of the heap > scan got ~4x faster with 4 parallel workers. On the other hand, when > it comes to the total vacuum execution time, I could not see much > performance improvement in test-2 (45.11 vs. 35.84). Looking at the > results of PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster > (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04), > and the heap scan in case 2 was not as fast as in case 1 with 1 parallel > worker (12.75 vs. 11.39).
> > I think the reason is the shared TidStore is not very scalable since > we have a single lock on it. In all cases in test-1, we don't use > the shared TidStore since all dead tuples are removed during heap > pruning. So the scalability was better overall than in test-2. In > the parallel 0 case in test-2, we use the local TidStore, and from > parallel degree 1 in test-2, we use the shared TidStore and > parallel workers concurrently update it. Also, I guess that the lookup > performance of the local TidStore is better than the shared TidStore's > lookup performance because of the differences between a bump context > and a DSA area. I think that this difference contributed to the fact > that index vacuuming got slower (16.74 vs. 22.04). > > There are two obvious improvement ideas to improve overall vacuum > execution time: (1) improve the shared TidStore scalability and (2) > support parallel heap vacuum. For (1), several ideas are proposed by > the ART authors[1]. I've not tried these ideas but they might be > applicable to our ART implementation. But I prefer to start with (2) > since it would be easier. Feedback is very welcome. > Starting with (2) sounds like a reasonable approach. We should study a few more things, like (a) the performance results where there are 3-4 indexes, and (b) the reason for the performance improvement seen with only heap scans. We normally get benefits from parallelism by using multiple CPUs, but parallelizing scans (I/O) shouldn't give much benefit. Is it possible that you are seeing benefits because most of the data is either in shared_buffers or in memory? We can probably try vacuuming tables after restarting the nodes to ensure the data is not in memory. -- With Regards, Amit Kapila.
Re: SQL:2011 application time
On Thu, Jun 27, 2024 at 5:56 PM Paul Jungwirth wrote: > I did add a relperiods column, but I have a mostly-complete branch here (not > included in the > patches) that does without. Not maintaining that new column is simpler for > sure. The consequence is > that the relcache must scan for WITHOUT OVERLAPS constraints on every table. > That seems like a high > performance cost for a feature most databases won't use. Since we try hard to > avoid that kind of > thing (e.g. [1]), I thought adding relperiods would be preferred. If that's > the wrong tradeoff I can > change it. I'm sure that you are right that nobody is going to like an extra index scan just to find periods. So, suppose we do as you propose and add relperiods. In the situation where we are adding the first period (or whatever the right term is) to the table, what kind of lock are we holding on the table? Conversely, when we drop the last period, what kind of lock are we holding on the table? If, hypothetically, both answers were AccessExclusiveLock, this might not be too bad, but if you say "ShareLock" then we've got a lot of problems; that's not even self-exclusive. > These patches still add some if-clauses to psql and pg_dump that say `if > (fout->remoteVersion >= > 17)`. But if I change them to 18 I get failures in e.g. the pg_dump > tests. What do other > people do here before a release is cut? Sometimes I make a commit that bumps the version number (update major version in src/tools/version_stamp.pl, then run it, then run autoconf, then commit). Then I build my patch set on top of that. Once the actual major release bump happens, I just drop that commit from the stack. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Test to dump and restore objects left behind by regression
Sorry for delay, but here's next version of the patchset for review. On Thu, Jun 6, 2024 at 5:07 AM Michael Paquier wrote: > On Wed, Jun 05, 2024 at 05:09:58PM +0530, Ashutosh Bapat wrote: > > Thanks for the suggestion. I didn't understand the dependency with the > > buildfarm module. Will the new module be used in buildfarm separately? I > > will work on this soon. Thanks for changing CF entry to WoA. > > I had some doubts about PGBuild/Modules/TestUpgradeXversion.pm, but > after double-checking it loads dynamically AdjustUpgrade from the core > tree based on the base path where all the modules are: > # load helper module from source tree > unshift(@INC, "$srcdir/src/test/perl"); > require PostgreSQL::Test::AdjustUpgrade; > PostgreSQL::Test::AdjustUpgrade->import; > shift(@INC); > It would be annoying to tweak the buildfarm code more to have a > different behavior depending on the branch of Postgres tested. > Anyway, from what I can see, you could create a new module with the > dump filtering rules that AdjustUpgrade requires without having to > update the buildfarm code. > The two filtering rules that I picked from AdjustUpgrade() are a. use unix style newline b. eliminate blank lines. I think we could copy those rule into the new module (as done in the patch) without creating any dependency between modules. There's little gained by creating another perl function just for those two sed commands. There's no way to do that otherwise. If we keep those two modules independent, we will be free to change each module as required in future. Do we need to change buildfarm code to load the AdjustDump module like above? I am not familiar with the buildfarm code. Here's a description of patches and some notes 0001 --- 1. Per your suggestion the logic to handle dump output differences is externalized in PostgreSQL::Test::AdjustDump. 
Instead of eliminating those differences altogether from both dump outputs, the corresponding DDL in the original dump output is adjusted to look like that from the restored database. Thus we retain full knowledge of what differences to expect.
2. I have changed the name filter_dump to filter_dump_for_upgrade so as to differentiate between the two adjustments: 1. for upgrade and 2. for dump/restore. Ideally the name should have been adjust_dump_for_upgrade(), since it's more of an adjustment than filtering, as indicated by the function it calls. But I haven't changed that. The new function to adjust dumps for the dump and restore test is named adjust_dump_for_restore(), however.
3. As suggested by Daniel upthread, the test for dump and restore happens before the upgrade, which might change the old cluster, thus changing the state of objects left behind by regression. The test is not executed if regression is not used to create the old cluster.
4. The code to compare two dumps and report any differences is moved to its own function compare_dumps(), which is used for both the upgrade and dump/restore tests. The test uses the custom dump format for dumping and restoring the database.
0002 -- This commit expands the previous test to cover all dump formats. But as expected that increases the time taken by this test. On my laptop 0001 takes approx 28 seconds to run the test and with 0002 it takes approx 35 seconds. But there's not much impact on the duration of running all the tests (2m30s vs 2m40s). The code which creates the DDL statements in the dump is independent of the dump format, so usually we shouldn't need to test all the formats in this test. But each format stores the dependencies between dumped objects in a different manner, which would be tested with the changes in this patch. I think this patch is also useful. If we decide to keep this test, the patch is intended to be merged into 0001.
-- Best Wishes, Ashutosh Bapat From dea7d55a8b938c1b670eebe7662a0dace5077a0d Mon Sep 17 00:00:00 2001 From: Ashutosh Bapat Date: Thu, 27 Jun 2024 15:17:29 +0530 Subject: [PATCH 3/3] Test dump and restore in all formats Expanding on the previous commit, this commit modifies the test to dump and restore regression database using all dump formats one by one. But this changes increases the time to run test from 51s to 78s on my laptop. If that's acceptable this commit should be squashed into the previous commit before committing upstream. Ashutosh Bapat --- src/bin/pg_upgrade/t/002_pg_upgrade.pl | 84 +- 1 file changed, 54 insertions(+), 30 deletions(-) diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl index 0e181b294d..5093b2bcaa 100644 --- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl +++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl @@ -576,41 +576,65 @@ sub test_regression_dump_restore { my ($src_node, %node_params) = @_; my $dst_node = PostgreSQL::Test::Cluster->new('dst_node'); - my $dump3_file = "$tempdir/dump3.custom"; - my $dump4_file = "$tempdir/dump4.sql"; - my $dump5_file = "$tempdir/dump5.sql"; - - - #
Re: pgindent exit status if a file encounters an error
On Wed, Jun 26, 2024 at 8:54 PM Tom Lane wrote: > Ashutosh Bapat writes: > > The usage help mentions exit code 2 specifically while describing --check > > option but it doesn't mention exit code 1. Neither does the README. So I > > don't think we need to document exit code 3 anywhere. Please let me know > if > > you think otherwise. > > I think we should have at least a code comment summarizing the > possible exit codes, along the lines of > > # Exit codes: > # 0 -- all OK > # 1 -- could not invoke pgindent, nothing done > # 2 -- --check mode and at least one file requires changes > # 3 -- pgindent failed on at least one file > Thanks. Here's a patch with these lines. In an offline chat Andrew mentioned that he expects more changes in the patch and he would take care of those. Will review and test his patch. -- Best Wishes, Ashutosh Bapat From 2a61830f0a2b6559d04bef75c6aa224e30ed1609 Mon Sep 17 00:00:00 2001 From: Ashutosh Bapat Date: Wed, 26 Jun 2024 15:46:53 +0530 Subject: [PATCH 1/2] pgindent exit status on error When pg_bsd_indent exits with a non-zero status or reports an error, make pgindent exit with non-zero status 3. The program does not exit on the first instance of the error. Instead it continues to process remaining files as long as some other exit condition is encountered, in which case exit code 3 is reported. This helps to detect errors automatically when pgindent is run in shells which interpret non-zero exit status as failure. Author: Ashutosh Bapat Reviewed by: Andrew Dunstan Discussion: https://www.postgresql.org/message-id/caexhw5sprsifeldp-u1fa5ba7ys2f0gvljmkoobopkadjwq...@mail.gmail.com --- src/tools/pgindent/pgindent | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/src/tools/pgindent/pgindent b/src/tools/pgindent/pgindent index 48d83bc434..03b319c9c3 100755 --- a/src/tools/pgindent/pgindent +++ b/src/tools/pgindent/pgindent @@ -1,5 +1,12 @@ #!/usr/bin/perl +# Program to maintain uniform layout style in our C and Perl code. 
+# Exit codes: +# 0 -- all OK +# 1 -- could not invoke pgindent, nothing done +# 2 -- --check mode and at least one file requires changes +# 3 -- pgindent failed on at least one file + # Copyright (c) 2021-2024, PostgreSQL Global Development Group use strict; @@ -409,6 +416,7 @@ foreach my $source_filename (@files) if ($source eq "") { print STDERR "Failure in $source_filename: " . $error_message . "\n"; + $status = 3; next; } @@ -429,7 +437,7 @@ foreach my $source_filename (@files) if ($check) { -$status = 2; +$status ||= 2; last unless $diff; } } -- 2.34.1
Re: Converting README documentation to Markdown
> I've been thinking about this some more. I think the most value here > would be to just improve the plain-text formatting, so that there are > consistent list styles, header styles, indentation, some of the > ambiguities cleared up -- much of which your 0001 patch does. You > might as well be targeting markdown-like conventions with this; they > are mostly reasonable. > > I tend to think that actually converting all the README files to > README.md could be a net negative for maintainability. Because now > you are requiring everyone who potentially wants to edit those to be > aware of Markdown syntax and manually check the rendering. With > things like DocBook, if you make a mess, you get error messages from > the build step. If you make a mess in Markdown, you have to visually > find it yourself. There are many READMEs that contain nested lists > and code snippets and diagrams and such all mixed together. Getting > that right in Markdown can be quite tricky. I'm also foreseeing > related messes of trailing whitespace, spaces-vs-tab confusion, > gitattributes violations, etc. It can be a lot of effort. It's okay > to do this for prominent files like the top-level one, but I suggest > that for the rest we can keep it simple and just use plain text. Agreed. Best regards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese: http://www.sraoss.co.jp
Re: walsender.c comment with no context is hard to understand
Hi, On Fri, Jun 28, 2024 at 03:15:22PM +0530, Amit Kapila wrote: > On Fri, Jun 28, 2024 at 12:55 PM Peter Smith wrote: > > > > On Fri, Jun 28, 2024 at 4:18 PM Amit Kapila wrote: > > > > > > On Fri, Jun 28, 2024 at 5:15 AM Peter Smith wrote: > > > > > > > > On Thu, Jun 27, 2024 at 3:44 PM Michael Paquier > > > > wrote: > > > > > > > > > > On Wed, Jun 26, 2024 at 02:30:26PM +0530, vignesh C wrote: > > > > > > On Mon, 3 Jun 2024 at 11:21, Peter Smith > > > > > > wrote: > > > > > >> Perhaps the comment should say something like it used to: > > > > > >> /* Fail if there is not enough WAL available. This can happen > > > > > >> during > > > > > >> shutdown. */ > > > > > > > > > > > > Agree with this, +1 for this change. > > > > > > > > > > That would be an improvement. Would you like to send a patch with all > > > > > the areas you think could stand for improvements? > > > > > -- > > > > > > > > OK, I attached a patch equivalent to the suggestion in this thread. > > > > > > > > > > Shouldn't the check for flushptr (if (flushptr < targetPagePtr + > > > reqLen)) be moved immediately after the call to WalSndWaitForWal()? > > > The comment seems to suggest the same: "Make sure we have enough WAL > > > available before retrieving the current timeline. .." > > > > > > > Yes, as I wrote in the first post, those lines once used to be > > adjacent in logical_read_xlog_page. > > > > I also wondered if they still belonged together, but I opted for the > > safest option of fixing only the comment instead of refactoring old > > code when no problem had been reported. > > > > AFAICT these lines became separated due to a 2023 patch [1], and you > > were one of the reviewers of that patch, so I assumed the code > > separation was deemed OK at that time. Unless it was some mistake that > > slipped past multiple reviewers? > > > > I don't know whether your assumption is correct. AFAICS, those two > lines should be together. Let us see if Bertrand remembers anything? 
> IIRC the WalSndWaitForWal() call has been moved to ensure that we can determine the timeline accurately. I agree with Amit that it would make more sense to move the (flushptr < targetPagePtr + reqLen) check just after the flushptr assignment. I don't recall that we discussed any reason for not doing so. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
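Concretely, the reordering under discussion would look roughly like this (a sketch pieced together from the fragments quoted in the thread, not the actual walsender.c code):

```
/* Make sure we have enough WAL available before retrieving the
 * current timeline. */
flushptr = WalSndWaitForWal(targetPagePtr + reqLen);

/* Fail if there is not enough WAL available. This can happen
 * during shutdown. */
if (flushptr < targetPagePtr + reqLen)
	return -1;

/* ... only afterwards retrieve the current timeline ... */
```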
Re: cfbot update: Using GitHub for patch review
On Fri, Jun 21, 2024 at 8:06 PM Jelte Fennema-Nio wrote: > I recently got write access to the cfbot repo[1] and machine from > Thomas. And I deployed a few improvements this week. The most > significant one is that it is now much easier to use GitHub as part of > your patch review workflow. > > On the cfbot website[2] there's now a "D" (diff) link next to each > commit fest entry. A good example of such a link would be the one for > my most recent commitfest entry[3]. There is a separate commit for > each patch file and those commits contain the "git format-patch" > metadata. (this is not done using git am, but using git mailinfo + > patch + sed, because git am is horrible at resolving conflicts) > > The killer feature (imho) of GitHub diffs over looking at patch files: > You can press the "Expand up"/"Expand down" buttons on the left of the > diff to see some extra context that the patch file doesn't contain. > > You can also add the cfbot repo as a remote to your local git > repository. That way you don't have to manually download patches and > apply them to your local checkout anymore: > > # Add the remote > git remote add -f cfbot https://github.com/postgresql-cfbot/postgresql.git > # make future git pulls much quicker (optional) > git maintenance start > # check out a commitfest entry > git checkout cf/5065 > > P.S. Suggestions for further improvements are definitely appreciated. > We're currently already working on better integration between the > commitfest app website and the cfbot website. > > P.P.S The "D" links don't work for patches that need to be rebased > since before I deployed this change, but that problem should fix > itself with time. > Thanks. Very helpful. Will it be possible to make it send an email containing the review comments? Better even if a reply to that email adds comments/responses back to PR. I need to sign in to github to add my review comments. So those who do not have a github account can not use it for review. 
But I don't think that can be fixed. We need a way to know who left review comments. There was some discussion at pgconf.dev about using gitlab instead of github. How easy is it to use gitlab if we decide to go that way? -- Best Wishes, Ashutosh Bapat
Re: remaining sql/json patches
Hi Alexander, On Fri, Jun 28, 2024 at 5:00 PM Alexander Lakhin wrote: > > Hi Amit, > > 28.06.2024 09:15, Amit Langote wrote: > > Hi Alexander, > > > > > > Thanks for the report. Yeah, those comments that got added in > > 7081ac46ace are obsolete. > > > > Thanks for paying attention to that! > > Could you also look at comments for transformJsonObjectAgg() and > transformJsonArrayAgg(), aren't they obsolete too? You're right. I didn't think they needed to be similarly fixed, because I noticed code like the following in transformJsonObjectAgg(), which sets the OID of the function to call from, again, JsonConstructorExpr:

{
    if (agg->absent_on_null)
        if (agg->unique)
            aggfnoid = F_JSONB_OBJECT_AGG_UNIQUE_STRICT;
        else
            aggfnoid = F_JSONB_OBJECT_AGG_STRICT;
    else if (agg->unique)
        aggfnoid = F_JSONB_OBJECT_AGG_UNIQUE;
    else
        aggfnoid = F_JSONB_OBJECT_AGG;
    aggtype = JSONBOID;
}

So, yes, the comments for them should be fixed too, like the other two, to also mention JsonConstructorExpr. Updated patch attached. Wonder if Alvaro has any thoughts on this. -- Thanks, Amit Langote v2-0001-SQL-JSON-Fix-some-obsolete-comments.patch Description: Binary data
Re: Why is infinite_recurse test suddenly failing?
Hello hackers, 15.08.2019 10:17, Thomas Munro wrote: On Thu, Aug 15, 2019 at 5:49 PM Tom Lane wrote: So that leads to the thought that "the infinite_recurse test is fine if it runs by itself, but it tends to fall over if there are concurrently-running backends". I have absolutely no idea how that would happen on anything that passes for a platform built in this century. Still, it's a place to start, which we hadn't before. Hmm. mereswin's recent failure on REL_11_STABLE was running the serial schedule. I read about 3 ways to get SEGV from stack-related faults: you can exceed RLIMIT_STACK (the total mapping size) and then you'll get SEGV (per man pages), you can access a page that is inside the mapping but is beyond the stack pointer (with some tolerance, exact details vary by arch), and you can fail to allocate a page due to low memory. The first kind of failure doesn't seem right -- we carefully set max_stack_size based on RLIMIT_STACK minus some slop, so that theory would require child processes to have different stack limits than the postmaster as you said (perhaps OpenStack, Docker, related tooling or concurrent activity on the host system is capable of changing it?), or a bug in our slop logic. The second kind of failure would imply that we have a bug -- we're accessing something below the stack pointer -- but that doesn't seem right either -- I think various address sanitising tools would have told us about that, and it's hard to believe there is a bug in the powerpc and arm implementation of the stack pointer check (see Linux arch/{powerpc,arm}/mm/fault.c). That leaves the third explanation, except then I'd expect to see other kinds of problems like OOM etc before you get to that stage, and why just here? Confused. Also notable is that we now have a couple of hits on ARM, not only ppc64. Don't know what to make of that. Yeah, that is indeed interesting. 
Excuse me for reviving this ancient thread, but the aforementioned mereswine animal has failed again recently [1]: 002_pg_upgrade_old_node.log contains:

2024-06-26 02:49:06.742 PDT [29121:4] LOG: server process (PID 30908) was terminated by signal 9: Killed
2024-06-26 02:49:06.742 PDT [29121:5] DETAIL: Failed process was running: select infinite_recurse();

I believe this time it's caused by an OOM condition, and I think this issue occurs on the armv7 mereswine because 1) armv7 uses the stack very efficiently (thanks to 32-bitness and maybe also the Link Register) and 2) such old machines are usually tight on memory. I've analyzed buildfarm logs and found from the check stage of that failed run:

wget [2] -O log
grep 'SQL function "infinite_recurse" statement 1' log | wc -l
5818

(that is, the nesting depth is 5818 levels for a successful run of the test) For comparison, mereswine on HEAD [3], [4] shows 5691 levels; alimoche (aarch64) on HEAD [5] — 3535; lapwing (i686) on HEAD [6] — 5034; alligator (x86_64) on HEAD [7] — 3965; So it seems to me that unlike [9] this failure may be explained by reaching an OOM condition. I have an armv7 device with 2GB RAM that doesn't pass `make check`, nor even `TESTS=infinite_recurse make -s check-tests`, from time to time, due to:

2024-06-28 12:40:49.947 UTC postmaster[20019] LOG: server process (PID 20078) was terminated by signal 11: Segmentation fault
2024-06-28 12:40:49.947 UTC postmaster[20019] DETAIL: Failed process was running: select infinite_recurse();
...
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `postgres: android regression [local] SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 downcase_identifier (ident=0xa006d837 "infinite_recurse", len=16, warn=true, truncate=truncate@entry=true) at scansup.c:52
52 result = palloc(len + 1);
(gdb) p $sp
$1 = (void *) 0xbe9b0020

It looks more like [9], but I think the OOM effect is OS/kernel dependent. 
The test passes reliably with lower max_stack_depth values, though, so I've analyzed how much memory the backend consumes (total size and the size of its largest segment) depending on the value:

1500kB:           adfe1000 220260K rw--- [ anon ]   total 419452K
1600kB:           ac7e5000 234748K rw--- [ anon ]   total 434040K
1700kB:           acf61000 249488K rw--- [ anon ]   total 448880K
default (2048kB): aac65000 300528K rw--- [ anon ]   total 501424K

(roughly, increasing max_stack_depth by 100kB increases the backend memory consumption by 15MB during the test) So I think reducing max_stack_depth for mereswine, say to 1000kB, should prevent such failures in the future. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2024-06-26%2002%3A10%3A45 [2] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mereswine&dt=2024-06-26%2002%3A10%3A45&stg=check&raw=1 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2024-06-26%2016%3A48%3A07 [4] https://buildfarm.postgresql.o
Re: Check lateral references within PHVs for memoize cache keys
On 6/18/24 08:47, Richard Guo wrote: On Mon, Mar 18, 2024 at 4:36 PM Richard Guo wrote: Here is another rebase over master so it applies again. I also added a commit message to help review. Nothing else has changed. AFAIU currently we do not add Memoize nodes on top of join relation paths. This is because the ParamPathInfos for join relation paths do not maintain ppi_clauses, as the set of relevant clauses varies depending on how the join is formed. In addition, joinrels do not maintain lateral_vars. So we do not have a way to extract cache keys from joinrels. (Besides, there are places where the code doesn't cope with a Memoize path on top of a joinrel path, such as in get_param_path_clause_serials.) Therefore, when extracting lateral references within PlaceHolderVars, there is no need to consider those that are due to be evaluated at joinrels. Hence, here is the v7 patch for that. In passing, this patch also includes a comment explaining that Memoize nodes are currently not added on top of join relation paths (maybe we should have a separate patch for this?). Hi, I have reviewed v7 of the patch. This improvement is good enough to be applied, though. Here are some notes: The comment may be rewritten for clarity: "Determine if the clauses in param_info and innerrel's lateral_vars" - I'd replace lateral_vars with 'lateral references' to combine in one phrase the PHVs coming from both the rel and root->placeholder_list sources. I wonder if we can add the whole PHV expression instead of the Var (as discussed above) just under some condition:

if (!bms_intersect(pull_varnos(root, (Node *) phinfo->ph_var->phexpr),
                   innerrelids))
{
    // Add whole PHV
}
else
{
    // Add only pulled vars
}

I got the point about Memoize over a join, but as a join still calls replace_nestloop_params to replace parameters in its clauses, why not invent something similar to find Memoize keys inside a specific JoinPath node? It is not an issue for this patch, though - but is it doable? 
IMO, the code: if (bms_nonempty_difference(outerPlan->chgParam, node->keyparamids)) cache_purge_all(node); is a good place to check an assertion: is it really the parent query parameters that make a difference between memoize keys and node list of parameters? Generally, this patch looks good for me to be committed. -- regards, Andrei Lepikhov
Re: Improve EXPLAIN output for multicolumn B-Tree Index
On Thu, Jun 27, 2024 at 11:06 PM wrote: > OK. I would like to understand more about your proposed patch. I > have also registered as a reviewer in the commitfests entry. Great! > Although I haven't looked on your patch yet, if it's difficult to know > how it can optimize during the planning phase, it's enough for me to just > show "Skip Scan Cond (or Non-Key Filter)". This is because users can > understand that inefficient index scans *may* occur. That makes sense. The goal of your patch is to highlight when an index scan is using an index that is suboptimal for a particular query (a query that the user runs through EXPLAIN or EXPLAIN ANALYZE). The underlying rules that determine "access predicate vs. filter predicate" are not very complicated -- they're intuitive, even. But even an expert can easily make a mistake on a bad day. It seems to me that all your patch really needs to do is to give the user a friendly nudge in that direction, when it makes sense to. You want to subtly suggest to the user "hey, are you sure that the index the plan uses is exactly what you expected?". Fortunately, even when skip scan works well that should still be a useful nudge. If we assume that the query that the user is looking at is much more important than other queries, then the user really shouldn't be using skip scan in the first place. Even a good skip scan is a little suspicious (it's okay if it "stands out" a bit). > In terms of the concept of EXPLAIN output, I thought that runtime partition > pruning is similar. "EXPLAIN without ANALYZE" only shows the possibilities and > "EXPLAIN ANALYZE" shows the actual results. That seems logical. -- Peter Geoghegan
Re: Converting README documentation to Markdown
On 28.06.24 11:56, Jelte Fennema-Nio wrote: On Fri, 28 Jun 2024 at 09:38, Peter Eisentraut wrote: Getting that right in Markdown can be quite tricky. I agree that in some cases it's tricky. But what's the worst case that can happen when you get it wrong? It renders weird on github.com. I have my "less" set up so that "less somefile.md" automatically renders the markdown. That's been pretty useful. But if that now keeps making a mess out of PostgreSQL's README files, then I'm going to have to keep fixing things, and I might get really mad. That's the worst that could happen. ;-) So I don't agree with "aspirational markdown". If we're going to do it, then I expect that the files are marked up correctly at all times. Conversely, what's the best that could happen?
Restart pg_usleep when interrupted
Attachment protected by Amazon: [0001-Handle-Sleep-interrupts.patch] https://us-east-1.secure-attach.amazon.com/fcdc82ce-7887-4aa1-af9e-c1161a6b1d6f/bc81fa24-41de-48f9-a767-a6d15801754b Amazon has replaced the attachment with a download link. Downloads will be available until July 28, 2024, 19:59 (UTC+00:00). Hi, In the proposal by Bertrand [1] to implement vacuum cost delay tracking in pg_stat_progress_vacuum, it was discovered that the vacuum cost delay ends early on the leader process of a parallel vacuum due to parallel workers reporting progress on index vacuuming, which was introduced in 17 with commit [2]. With that change, every time a parallel worker completes an index vacuum, it sends a completion message to the leader. The facility that allows a parallel worker to report progress to the leader was introduced in commit [3]. In the case of the patch being proposed by Bertrand, the interrupts will be much more frequent, as parallel workers would send a message to the leader to update the vacuum delay counters on every vacuum_delay_point call. Looking into this, the ideal solution appears to be to provide the ability for a pg_usleep call to restart after being interrupted. Thanks to the usage of nanosleep as of commit [4], this should be trivial to do, as nanosleep provides a remaining time, which can be used to restart the sleep. This idea was also mentioned in [5]. I am attaching a POC patch that does the $SUBJECT. Rather than changing the existing pg_usleep, the patch introduces a function, pg_usleep_handle_interrupt, that takes in the sleep duration and a boolean flag to force handling of the interrupt, or not. This function can replace pg_usleep inside vacuum_delay_point, and could be useful in other cases in which it's important to handle interrupts. 
For Windows, the “force” = “true” flag of the new function uses SleepEx which from what I can tell based on the code comments is a non-interruptible sleep. Thoughts? [1] https://www.postgresql.org/message-id/ZnbIIUQpFJKAJx2W%40ip-10-97-1-34.eu-west-3.compute.internal [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=46ebdfe164c61fbac961d1eb7f40e9a684289ae6 [3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=f1889729dd3ab0352dc0ccc2ffcc1b1901f8e39f [4] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a948e49e2ef11815be0b211723bfc5b53b7f75a8 [5] https://www.postgresql.org/message-id/24848.1473607575%40sss.pgh.pa.us Regards, Sami Imseih Amazon Web Services (AWS)
Re: Track the amount of time waiting due to cost_delay
> 46ebdfe164 will interrupt the leader's sleep every time a parallel worker > reports > progress, and we currently don't handle interrupts by restarting the sleep > with > the remaining time. nanosleep does provide the ability to restart with the > remaining > time [1], but I don't think it's worth the effort to ensure more accurate > vacuum delays for the leader process. After discussing offline with Bertrand, it may be better to have a solution that deals with the interrupts and allows the sleep to continue to completion. This will simplify this patch and will be useful for other cases in which parallel workers need to send a message to the leader. This is the thread [1] for that discussion. [1] https://www.postgresql.org/message-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-00%40email.amazonses.com Regards, Sami
Re: Restart pg_usleep when interrupted
Hi, I think you need to find a way to disable this "Attachment protected by Amazon" thing: http://postgr.es/m/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000...@email.amazonses.com We want patches to be in the pgsql-hackers archives, not temporarily accessible via some link. ...Robert
Re: Restart pg_usleep when interrupted
> I think you need to find a way to disable this "Attachment protected > by Amazon" thing: Yes, I am looking into that. I only noticed after sending the email. Sorry about that. Sami
Doc: Move standalone backup section, mention -X argument
A documentation comment came in [1], prompting me to review some of our backup documentation, and I felt the current content and location of the standalone backups section were odd. I propose to move it to a better place, under file system backups. Adding to commitfest. David J. [1] https://www.postgresql.org/message-id/CAKFQuwZ%3DWxdWJ6O66yQ9dnWTLO12p7h3HpfhowCj%2B0U_bNrzdg%40mail.gmail.com v1-0001-docs-move-standalone-pg_basebackup-docs-to-file-syst.patch Description: Binary data
Re: Restart pg_usleep when interrupted
> We want patches to be in the pgsql-hackers archives, not temporarily > accessible via some link. > > …Robert > Moving to another email going forward. Reattaching the patch. 0001-Handle-Sleep-interrupts.patch Description: Binary data Regards, Sami Imseih Amazon Web Services
Re: Restart pg_usleep when interrupted
Sami Imseih writes: > Reattaching the patch. I feel like this is fundamentally a wrong solution, for the reasons cited in the comment for pg_usleep: long sleeps are a bad idea because of the resulting uncertainty about whether we'll respond to interrupts and such promptly. An example here is that if we get a query cancel interrupt, we should probably not insist on finishing out the current sleep before responding. Therefore, rather than "improving" pg_usleep (and uglifying its API), the right answer is to fix parallel vacuum leaders to not depend on pg_usleep in the first place. A better idea might be to use pg_sleep() or equivalent code. regards, tom lane
Re: Confine vacuum skip logic to lazy_scan_skip
On Sun, Mar 17, 2024 at 2:53 AM Thomas Munro wrote: > > On Tue, Mar 12, 2024 at 10:03 AM Melanie Plageman > wrote: > > I've rebased the attached v10 over top of the changes to > > lazy_scan_heap() Heikki just committed and over the v6 streaming read > > patch set. I started testing them and see that you are right, we no > > longer pin too many buffers. However, the uncached example below is > > now slower with streaming read than on master -- it looks to be > > because it is doing twice as many WAL writes and syncs. I'm still > > investigating why that is. --snip-- > 4. For learning/exploration only, I rebased my experimental vectored > FlushBuffers() patch, which teaches the checkpointer to write relation > data out using smgrwritev(). The checkpointer explicitly sorts > blocks, but I think ring buffers should naturally often contain > consecutive blocks in ring order. Highly experimental POC code pushed > to a public branch[2], but I am not proposing anything here, just > trying to understand things. The nicest looking system call trace was > with BUFFER_USAGE_LIMIT set to 512kB, so it could do its writes, reads > and WAL writes 128kB at a time: > > pwrite(32,...,131072,0xfc6000) = 131072 (0x2) > fdatasync(32) = 0 (0x0) > pwrite(27,...,131072,0x6c) = 131072 (0x2) > pread(27,...,131072,0x73e000) = 131072 (0x2) > pwrite(27,...,131072,0x6e) = 131072 (0x2) > pread(27,...,131072,0x75e000) = 131072 (0x2) > pwritev(27,[...],3,0x77e000) = 131072 (0x2) > preadv(27,[...],3,0x77e000) = 131072 (0x2) > > That was a fun experiment, but... I recognise that efficient cleaning > of ring buffers is a Hard Problem requiring more concurrency: it's > just too late to be flushing that WAL. But we also don't want to > start writing back data immediately after dirtying pages (cf. 
OS > write-behind for big sequential writes in traditional Unixes), because > we're not allowed to write data out without writing the WAL first and > we currently need to build up bigger WAL writes to do so efficiently > (cf. some other systems that can write out fragments of WAL > concurrently so the latency-vs-throughput trade-off doesn't have to be > so extreme). So we want to defer writing it, but not too long. We > need something cleaning our buffers (or at least flushing the > associated WAL, but preferably also writing the data) not too late and > not too early, and more in sync with our scan than the WAL writer is. > What that machinery should look like I don't know (but I believe > Andres has ideas). I've attached a WIP v11 streaming vacuum patch set here that is rebased over master (by Thomas), so that I could add a CF entry for it. It still has the problem with the extra WAL write and fsync calls investigated by Thomas above. Thomas has some work in progress doing streaming write-behind to alleviate the issues with the buffer access strategy and streaming reads. When he gets a version of that ready to share, he will start a new "Streaming Vacuum" thread. - Melanie From 050b6c3fa73c9153aeef58fcd306533c1008802e Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 26 Apr 2024 08:32:44 +1200 Subject: [PATCH v11 2/3] Refactor tidstore.c memory management. Previously, TidStoreIterateNext() would expand the set of offsets for each block into a buffer that it overwrote each time. In order to be able to collect the offsets for multiple blocks before working with them, change the contract. Now, the offsets are obtained by a separate call to TidStoreGetBlockOffsets(), which can be called at a later time, and TidStoreIteratorResult objects are safe to copy and store in a queue. This will be used by a later patch, to avoid the need for expensive extra copies of offset array and associated memory management. 
--- src/backend/access/common/tidstore.c | 68 +-- src/backend/access/heap/vacuumlazy.c | 9 ++- src/include/access/tidstore.h | 12 ++-- .../modules/test_tidstore/test_tidstore.c | 9 ++- 4 files changed, 53 insertions(+), 45 deletions(-) diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c index fb3949d69f6..c3c1987204b 100644 --- a/src/backend/access/common/tidstore.c +++ b/src/backend/access/common/tidstore.c @@ -147,9 +147,6 @@ struct TidStoreIter TidStoreIterResult output; }; -static void tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno, - BlocktableEntry *page); - /* * Create a TidStore. The TidStore will live in the memory context that is * CurrentMemoryContext at the time of this call. The TID storage, backed @@ -486,13 +483,6 @@ TidStoreBeginIterate(TidStore *ts) iter = palloc0(sizeof(TidStoreIter)); iter->ts = ts; - /* - * We start with an array large enough to contain at least the offsets - * from one completely full bitmap element. - */ - iter->output.max_offset = 2 * BITS_PER_BITMAPWORD; - iter->output.offsets = palloc(sizeof(OffsetNumber) * iter
Re: Add LSN <-> time conversion functionality
On Wed, Jun 26, 2024 at 10:04 PM Melanie Plageman wrote: > > I've implemented these review points in the attached v4. I realized the docs had a compilation error. Attached v5 fixes that as well as three bugs I found while using this patch set more intensely today. I see Michael has been working on some crash safety for stats here [1]. I wonder if that would be sufficient for the LSNTimeStream. I haven't examined his patch functionality yet, though. I also had an off-list conversation with Robert where he suggested I could perhaps change the user-facing functions for estimating an LSN/time conversion to instead return a floor and a ceiling -- instead of linearly interpolating a guess. This would be a way to keep users from misunderstanding the accuracy of the functions to translate LSN <-> time. I'm interested in what others think of this. I like this idea a lot because it allows me to worry less about how I decide to compress the data and whether or not it will be accurate for use cases different than my own (the opportunistic freezing heuristic). If I can provide a floor and a ceiling that are definitely accurate, I don't have to worry about misleading people. - Melanie [1] https://www.postgresql.org/message-id/ZnEiqAITL-VgZDoY%40paquier.xyz From f492af31c1b9917aa27ba3ad76560e59f3fd5c9b Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Wed, 27 Dec 2023 16:40:27 -0500 Subject: [PATCH v5 2/6] Add LSNTimeStream for converting LSN <-> time Add a new structure, LSNTimeStream, consisting of LSNTimes -- each an LSN, time pair. The LSNTimeStream is fixed size, so when a new LSNTime is inserted to a full LSNTimeStream, an LSNTime is dropped and the new LSNTime is inserted. We drop the LSNTime whose absence would cause the least error when interpolating between its adjoining points. LSN <-> time conversions can be done using linear interpolation with two LSNTimes on the LSNTimeStream. This commit does not add a global instance of LSNTimeStream. 
It adds the structures and functions needed to maintain and access such a stream. --- src/backend/utils/activity/pgstat_wal.c | 233 src/include/pgstat.h| 32 src/tools/pgindent/typedefs.list| 2 + 3 files changed, 267 insertions(+) diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c index 0e374f133a9..cef9429994c 100644 --- a/src/backend/utils/activity/pgstat_wal.c +++ b/src/backend/utils/activity/pgstat_wal.c @@ -17,8 +17,11 @@ #include "postgres.h" +#include "access/xlog.h" #include "executor/instrument.h" +#include "utils/builtins.h" #include "utils/pgstat_internal.h" +#include "utils/timestamp.h" PgStat_PendingWalStats PendingWalStats = {0}; @@ -32,6 +35,11 @@ PgStat_PendingWalStats PendingWalStats = {0}; static WalUsage prevWalUsage; +static void lsntime_insert(LSNTimeStream *stream, TimestampTz time, XLogRecPtr lsn); + +XLogRecPtr estimate_lsn_at_time(const LSNTimeStream *stream, TimestampTz time); +TimestampTz estimate_time_at_lsn(const LSNTimeStream *stream, XLogRecPtr lsn); + /* * Calculate how much WAL usage counters have increased and update * shared WAL and IO statistics. @@ -184,3 +192,228 @@ pgstat_wal_snapshot_cb(void) sizeof(pgStatLocal.snapshot.wal)); LWLockRelease(&stats_shmem->lock); } + +/* + * Given three LSNTimes, calculate the area of the triangle they form were they + * plotted with time on the X axis and LSN on the Y axis. + */ +static int +lsn_ts_calculate_error_area(LSNTime *left, LSNTime *mid, LSNTime *right) +{ + int rectangle_all = (right->time - left->time) * (right->lsn - left->lsn); + int triangle1 = rectangle_all / 2; + int triangle2 = (mid->lsn - left->lsn) * (mid->time - left->time) / 2; + int triangle3 = (right->lsn - mid->lsn) * (right->time - mid->time) / 2; + int rectangle_part = (right->lsn - mid->lsn) * (mid->time - left->time); + + return rectangle_all - triangle1 - triangle2 - triangle3 - rectangle_part; +} + +/* + * Determine which LSNTime to drop from a full LSNTimeStream. 
Once the LSNTime + * is dropped, points between it and either of its adjacent LSNTimes will be + * interpolated between those two LSNTimes instead. To keep the LSNTimeStream + * as accurate as possible, drop the LSNTime whose absence would have the least + * impact on future interpolations. + * + * We determine the error that would be introduced by dropping a point on the + * stream by calculating the area of the triangle formed by the LSNTime and its + * adjacent LSNTimes. We do this for each LSNTime in the stream (except for the + * first and last LSNTimes) and choose the LSNTime with the smallest error + * (area). We avoid extrapolation by never dropping the first or last points. + */ +static int +lsntime_to_drop(LSNTimeStream *stream) +{ + int min_area = INT_MAX; + int target_point = stream->length - 1; + + /* Don't drop points if free space available */ + Assert(stream->length == LSNTIMESTREA
Re: Restart pg_usleep when interrupted
On Sat, Jun 29, 2024 at 9:34 AM Tom Lane wrote: > Therefore, rather than "improving" pg_usleep (and uglifying its API), > the right answer is to fix parallel vacuum leaders to not depend on > pg_usleep in the first place. A better idea might be to use > pg_sleep() or equivalent code. In case it's useful for someone looking into that, in earlier discussions we figured out that it is possible to have high resolution timeouts AND support latch multiplexing on Linux, FreeBSD, macOS: https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BhC9mFx8tEcBsyo7-cAfWgtbRy1eDizeFuff2K7T%3D4bA%40mail.gmail.com
Re: Restart pg_usleep when interrupted
Thanks for the feedback! > On Jun 28, 2024, at 4:34 PM, Tom Lane wrote: > > Sami Imseih writes: >> Reattaching the patch. > > I feel like this is fundamentally a wrong solution, for the reasons > cited in the comment for pg_usleep: long sleeps are a bad idea > because of the resulting uncertainty about whether we'll respond to > interrupts and such promptly. An example here is that if we get > a query cancel interrupt, we should probably not insist on finishing > out the current sleep before responding. The case that brought up this discussion is the pg_usleep called within vacuum_delay_point being interrupted. When I read the same code comment you cited, it sounded to me that "long sleeps" are those that run for seconds or minutes. The longest vacuum delay allowed is 100ms. > Therefore, rather than "improving" pg_usleep (and uglifying its API), > the right answer is to fix parallel vacuum leaders to not depend on > pg_usleep in the first place. A better idea might be to use > pg_sleep() or equivalent code. Yes, that is a good idea to explore, and it will not require introducing an awkward new API. I will look into using something similar to pg_sleep. Regards, Sami
Re: cfbot update: Using GitHub for patch review
On Sat, Jun 29, 2024 at 1:10 AM Ashutosh Bapat wrote: > I need to sign in to github to add my review comments. So those who do not > have a github account can not use it for review. But I don't think that can > be fixed. We need a way to know who left review comments. I don't think Jelte was talking about moving review discussion to Github, just providing a link to *view* the patches there. Now I'm wondering if there is a way to disable comments on commits in the postgresql-cfbot GH account. I guess they'd be lost after 48 hours anyway when the branch gets force-pushed and commit hash changes? I don't want people to start posting comments there that no one is looking at. > There was some discussion at pgconf.dev about using gitlab instead of github. > How easy is it to use gitlab if we decide to go that way? cfbot could certainly be configured to push (ie mirror) the same branches to gitlab too (I don't have much experience with Gitlab, but if it's just a matter of registering an account + ssh key, adding it as a remote and pushing...). Then there could be [View on Github] [View on Gitlab] buttons, if people think that's useful (note "View", not "Review"!). The Cirrus CI system is currently only capable of testing stuff pushed to Github, though, so cfbot would continue to push stuff there. If memory serves, Cirrus used to say that they were planning to add support for testing code in public Gitlab next, but now their FAQ says their next public git host will be Bitbucket: https://cirrus-ci.org/faq/#only-github-support Given that cfbot is currently only using Github because we have to to reach Cirrus CI, not because we actually want Github features like issue tracking or pull requests with review discussion, it hardly matters if it's Github, Gitlab or any other public git host. And if we eventually decide to move our whole workflow to one of those systems and shut down the CF app, then cfbot will be retired, and you'll just create PRs on that system. 
But so far, we continue to prefer the CF app + email. The reason we liked Cirrus so much despite the existence of many other CI systems, including the ones built into GH, GL, etc., and many 3rd party ones, was because it was the only provider that allowed enough compute minutes for our needs, supported lots of operating systems, and had public links to log files suitable for sharing on our mailing list or cfbot's web interface (if you click to see the log, it doesn't say "Roll up, roll up, welcome to Foo Corporation, get your tickets here!"). I still don't know of any other CI system that would be as good for us, other than building our own. I would say it's been a very good choice so far. The original cfbot goal was "feed the mailing list to a CI system", with Github just a necessary part of the plumbing. It is a nice way to view patches though.
Re: race condition in pg_class
On Fri, Jun 28, 2024 at 01:17:22AM -0400, Tom Lane wrote: > Noah Misch writes: > > Pushed. Buildfarm member prion is failing the new inplace-inval.spec, > > almost > > surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is > > testing an extant failure to inval a cache entry. Naturally, inexorable > > inval > > masks the extant bug. Ideally, I'd just skip the test under any kind of > > cache > > clobber option. I don't know a pleasant way to do that, so these are > > known-feasible things I'm considering: > > > 1. Neutralize the test in all branches, probably by having it just not > > report > >the final answer. Undo in the later fix patch. > > > 2. v14+ has pg_backend_memory_contexts. In the test, run some plpgsql that > >uses heuristics on that to deduce whether caches are getting released. > >Have a separate expected output for the cache-release scenario. Perhaps > >also have the test treat installcheck like cache-release, since > >installcheck could experience sinval reset with similar consequences. > >Neutralize the test in v12 & v13. > > > 3. Add a test module with a C function that reports whether any kind of > > cache > >clobber is active. Call it in this test. Have a separate expected > > output > >for the cache-release scenario. > > > Preferences or other ideas? I'm waffling between (1) and (2). I'll give it > > more thought over the next day. > > I'd just go for (1). We were doing fine without this test case. > I can't see expending effort towards hiding its result rather > than actually fixing anything. Good point, any effort on (2) would be wasted once the fixes get certified. I pushed (1). I'm attaching the rebased fix patches. Author: Noah Misch Commit: Noah Misch Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions. The current use always releases this locktag. A planned use will continue that intent. It will involve more areas of code, making unlock omissions easier. 
Warn under debug_assertions, like we do for various resource leaks. Back-patch to v12 (all supported versions), the plan for the commit of the new use. Reviewed by FIXME. Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index 0400a50..461d925 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -2256,6 +2256,11 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks) locallock->numLockOwners = 0; } +#ifdef USE_ASSERT_CHECKING + if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks) + elog(WARNING, "tuple lock held at commit"); +#endif + /* * If the lock or proclock pointers are NULL, this lock was taken via * the relation fast-path (and is not known to have been transferred). Author: Noah Misch Commit: Noah Misch Fix data loss at inplace update after heap_update(). As previously-added tests demonstrated, heap_inplace_update() could instead update an unrelated tuple of the same catalog. It could lose the update. Losing relhasindex=t was a source of index corruption. Inplace-updating commands like VACUUM will now wait for heap_update() commands like GRANT TABLE and GRANT DATABASE. That isn't ideal, but a long-running GRANT already hurts VACUUM progress more just by keeping an XID running. The VACUUM will behave like a DELETE or UPDATE waiting for the uncommitted change. For implementation details, start at the heap_inplace_update_scan() header comment and README.tuplock. Back-patch to v12 (all supported versions). In back branches, retain a deprecated heap_inplace_update(), for extensions. Reviewed by FIXME. 
Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock index 6441e8b..dbfa2b7 100644 --- a/src/backend/access/heap/README.tuplock +++ b/src/backend/access/heap/README.tuplock @@ -153,3 +153,56 @@ The following infomask bits are applicable: We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit is set. + +Locking to write inplace-updated tables +--- + +[This is the plan, but LOCKTAG_TUPLE acquisition is not yet here.] + +If IsInplaceUpdateRelation() returns true for a table, the table is a system +catalog that receives heap_inplace_update_scan() calls. Preparing a +heap_update() of these tables follows additional locking rules, to ensure we +don't lose the effects of an inplace update. In particular, consider a moment +when a backend has fetched the old tuple to modify, not yet having called +heap_up
Re: Inval reliability, especially for inplace updates
On Mon, Jun 17, 2024 at 04:58:54PM -0700, Noah Misch wrote: > attached v2 patch stack. Rebased. This applies on top of three patches from https://postgr.es/m/20240629024251.03.nmi...@google.com. I'm attaching those to placate cfbot, but this thread is for review of the last patch only. Author: Noah Misch Commit: Noah Misch Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions. The current use always releases this locktag. A planned use will continue that intent. It will involve more areas of code, making unlock omissions easier. Warn under debug_assertions, like we do for various resource leaks. Back-patch to v12 (all supported versions), the plan for the commit of the new use. Reviewed by FIXME. Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index 0400a50..461d925 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -2256,6 +2256,11 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks) locallock->numLockOwners = 0; } +#ifdef USE_ASSERT_CHECKING + if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks) + elog(WARNING, "tuple lock held at commit"); +#endif + /* * If the lock or proclock pointers are NULL, this lock was taken via * the relation fast-path (and is not known to have been transferred). Author: Noah Misch Commit: Noah Misch Fix data loss at inplace update after heap_update(). As previously-added tests demonstrated, heap_inplace_update() could instead update an unrelated tuple of the same catalog. It could lose the update. Losing relhasindex=t was a source of index corruption. Inplace-updating commands like VACUUM will now wait for heap_update() commands like GRANT TABLE and GRANT DATABASE. That isn't ideal, but a long-running GRANT already hurts VACUUM progress more just by keeping an XID running. The VACUUM will behave like a DELETE or UPDATE waiting for the uncommitted change. 
For implementation details, start at the heap_inplace_update_scan() header comment and README.tuplock. Back-patch to v12 (all supported versions). In back branches, retain a deprecated heap_inplace_update(), for extensions. Reviewed by FIXME. Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock index 6441e8b..dbfa2b7 100644 --- a/src/backend/access/heap/README.tuplock +++ b/src/backend/access/heap/README.tuplock @@ -153,3 +153,56 @@ The following infomask bits are applicable: We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit is set. + +Locking to write inplace-updated tables +--- + +[This is the plan, but LOCKTAG_TUPLE acquisition is not yet here.] + +If IsInplaceUpdateRelation() returns true for a table, the table is a system +catalog that receives heap_inplace_update_scan() calls. Preparing a +heap_update() of these tables follows additional locking rules, to ensure we +don't lose the effects of an inplace update. In particular, consider a moment +when a backend has fetched the old tuple to modify, not yet having called +heap_update(). Another backend's inplace update starting then can't conclude +until the heap_update() places its new tuple in a buffer. We enforce that +using locktags as follows. While DDL code is the main audience, the executor +follows these rules to make e.g. "MERGE INTO pg_class" safer. Locking rules +are per-catalog: + + pg_class heap_inplace_update_scan() callers: before the call, acquire + LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock + (VACUUM), or a mode with strictly more conflicts. If the update targets a + row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be + on the table. Locking the index rel is optional. (This allows VACUUM to + overwrite per-index pg_class while holding a lock on the table alone.) 
We + could allow weaker locks, in which case the next paragraph would simply call + for stronger locks for its class of commands. heap_inplace_update_scan() + acquires and releases LOCKTAG_TUPLE in InplaceUpdateTupleLock, an alias for + ExclusiveLock, on each tuple it overwrites. + + pg_class heap_update() callers: before copying the tuple to modify, take a + lock that conflicts with at least one of those from the preceding paragraph. + SearchSysCacheLocked1() is one convenient way to acquire LOCKTAG_TUPLE. + After heap_update(), release any LOCKTAG_TUPLE. Most of these callers opt + to acquire just the LOCKTAG_RELATION. + + pg_database: before copying the
Re: Backporting BackgroundPsql
On Thu, Jun 27, 2024 at 10:05 PM Heikki Linnakangas wrote: > > Ok, I pushed commits to backport BackgroundPsql down to v12. I used > "option 2", i.e. I changed background_psql() to return the new > BackgroundPsql object. > > Don't we need to add install and uninstall rules for the new module, like we did in https://git.postgresql.org/pg/commitdiff/a4c17c86176cfa712f541b81b2a026ae054b275e and https://git.postgresql.org/pg/commitdiff/7039c7cff6736780c3bbb41a90a6dfea0f581ad2 ? Thanks, Pavan
Incremental backup from a streaming replication standby
I played around with incremental backup yesterday and tried $subject. The WAL summarizer is running on the standby server, but when I try to take an incremental backup, I get an error that I understand to mean that WAL summarization hasn't caught up yet. I am not sure whether that is working as designed, but if it is, I think it should be documented. Yours, Laurenz Albe