Regarding candidate_restart_lsn in logical decoding.
Hi,

I have created my own output plugin for logical decoding. I am storing the decoded data in Apache Kafka via the pg_recvlogical utility. Using pg_recvlogical I am updating the slot's confirmed_flush_lsn based on the value that I'm storing in Kafka; this is done every 10 seconds.

In the case of a walsender shutdown I found that candidate_restart_lsn is not cleared from shared memory. I also found that just after restarting, candidate_restart_lsn is sometimes far greater than the slot's actual restart_lsn, and because of this, frequent checkpoints are deleting the required WAL files.

Can I clear candidate_restart_lsn in the plugin_start callback? Are there any consequences of doing this?

Thanks and Regards,
G R MANJUNATH.
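For context, the gap described above can be watched from SQL; here is a minimal sketch using the standard pg_replication_slots view and pg_wal_lsn_diff() (all names are existing catalog/view columns and functions):

    SELECT slot_name,
           restart_lsn,
           confirmed_flush_lsn,
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
    FROM pg_replication_slots
    WHERE slot_type = 'logical';

If retained_wal_bytes keeps growing after a walsender restart even though confirmed_flush_lsn advances, that matches the stale candidate_restart_lsn symptom described here.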
Invalid Assert while validating REPLICA IDENTITY?
While working on some other code I noticed that in FindReplTupleInLocalRel() there is an assert [1] that seems to be passing IndexRelation to GetRelationIdentityOrPK() whereas it should be passing the normal relation.

[1]
if (OidIsValid(localidxoid))
{
#ifdef USE_ASSERT_CHECKING
    Relation    idxrel = index_open(localidxoid, AccessShareLock);

    /* Index must be PK, RI, or usable for REPLICA IDENTITY FULL tables */
    Assert(GetRelationIdentityOrPK(idxrel) == localidxoid ||
           IsIndexUsableForReplicaIdentityFull(BuildIndexInfo(idxrel),
                                               edata->targetRel->attrmap));
    index_close(idxrel, AccessShareLock);
#endif

In the above code, we are passing idxrel to GetRelationIdentityOrPK(), whereas we should be passing localrel.

The fix should be:

@@ -2929,7 +2929,7 @@ FindReplTupleInLocalRel(ApplyExecutionData *edata, Relation localrel,
             Relation    idxrel = index_open(localidxoid, AccessShareLock);
 
             /* Index must be PK, RI, or usable for REPLICA IDENTITY FULL tables */
-            Assert(GetRelationIdentityOrPK(idxrel) == localidxoid ||
+            Assert(GetRelationIdentityOrPK(localrel) == localidxoid ||
                    IsIndexUsableForReplicaIdentityFull(BuildIndexInfo(idxrel),
                                                        edata->targetRel->attrmap));
             index_close(idxrel, AccessShareLock);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
Hello, Matthias! Just wanted to update you with some information about the next steps in work. > In heapam_index_build_range_scan, it seems like you're popping the > snapshot and registering a new one while holding a tuple from > heap_getnext(), thus while holding a page lock. I'm not so sure that's > OK, expecially when catalogs are also involved (specifically for > expression indexes, where functions could potentially be updated or > dropped if we re-create the visibility snapshot) I have returned to the solution with a dedicated catalog_xmin for backends [1]. Additionally, I have added catalog_xmin to pg_stat_activity [2]. > In heapam_index_build_range_scan, you pop the snapshot before the > returned heaptuple is processed and passed to the index-provided > callback. I think that's incorrect, as it'll change the visibility of > the returned tuple before it's passed to the index's callback. I think > the snapshot manipulation is best added at the end of the loop, if we > add it at all in that function. Now it's fixed, and the snapshot is reset between pages [3]. Additionally, I resolved the issue with potential duplicates in unique indexes. It looks a bit clunky, but it works for now [4]. Single commit from [5] also included, just for stable stress testing. Full diff is available at [6]. Best regards, Mikhail. [1]: https://github.com/michail-nikolaev/postgres/commit/01a47623571592c52c7a367f85b1cff9d8b593c0 [2]: https://github.com/michail-nikolaev/postgres/commit/d3345d60bd51fe2e0e4a73806774b828f34ba7b6 [3]: https://github.com/michail-nikolaev/postgres/commit/7d1dd4f971e8d03f38de95f82b730635ffe09aaf [4]: https://github.com/michail-nikolaev/postgres/commit/4ad56e14dd504d5530657069068c2bdf172e482d [5]: https://commitfest.postgresql.org/49/5160/ [6]: https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrently_approach?diff=split&w=
Re: scalability bottlenecks with (many) partitions (and more)
On Sun, Sep 1, 2024 at 3:30 PM Tomas Vondra wrote: > I don't think that's possible with hard-coded size of the array - that > allocates the memory for everyone. We'd need to make it variable-length, > and while doing those benchmarks I think we actually already have a GUC > for that - max_locks_per_transaction tells us exactly what we need to > know, right? I mean, if I know I'll need ~1000 locks, why not to make > the fast-path array large enough for that? I really like this idea. I'm not sure about exactly how many fast path slots you should get for what value of max_locks_per_transaction, but coupling the two things together in some way sounds smart. > Of course, the consequence of this would be making PGPROC variable > length, or having to point to a memory allocated separately (I prefer > the latter option, I think). I haven't done any experiments, but it > seems fairly doable - of course, not sure if it might be more expensive > compared to compile-time constants. I agree that this is a potential problem but it sounds like the idea works well enough that we'd probably still come out quite far ahead even with a bit more overhead. -- Robert Haas EDB: http://www.enterprisedb.com
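To make the coupling concrete, here is a rough sketch of deriving a power-of-two number of 16-slot fast-path groups from max_locks_per_transaction at backend startup. The helper name and the rounding policy are hypothetical illustrations, not code from the patch; the 16-slot group size matches the grouping discussed in this thread.

    #include <stdint.h>

    #define FP_LOCK_SLOTS_PER_GROUP 16      /* one group = 16 fast-path slots */

    /*
     * Round the GUC up to a power-of-two number of groups so that a
     * hash-based group lookup stays a cheap mask operation.
     */
    static uint32_t
    fast_path_groups(uint32_t max_locks_per_transaction)
    {
        uint32_t groups = (max_locks_per_transaction + FP_LOCK_SLOTS_PER_GROUP - 1)
                          / FP_LOCK_SLOTS_PER_GROUP;
        uint32_t pow2 = 1;

        while (pow2 < groups)
            pow2 <<= 1;
        return pow2;
    }

With that shape, a user who raises max_locks_per_transaction to ~1000 would automatically get a fast-path array large enough for those locks, while the default configuration keeps today's small footprint.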
Re: query_id, pg_stat_activity, extended query protocol
On Sat, Aug 31, 2024 at 09:47:41AM +0800, jian he wrote: > /* test \bind queryid exists */ > select query_id is not null as query_id_exist > from pg_stat_activity where pid = pg_backend_pid() \bind \g > > /* test that \parse \bind_named queryid exists */ > select pg_backend_pid() as current_pid \gset pref01_ > select query_id is not null as query_id_exist from pg_stat_activity > where pid = $1 \parse stmt11 > \bind_named stmt11 :pref01_current_pid \g I need to spend a bit more time with my head down for this thread, but couldn't we use these commands with various query patterns in pg_stat_statements and look at the shmem counters reported through its view? -- Michael signature.asc Description: PGP signature
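As a sketch of that kind of check (assuming the pg_stat_statements extension is installed and tracking is enabled), the shmem counters can be compared against the query_id reported in pg_stat_activity:

    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    SELECT queryid, calls, rows
    FROM pg_stat_statements
    WHERE queryid IN (SELECT query_id
                      FROM pg_stat_activity
                      WHERE query_id IS NOT NULL);

The exact query patterns to exercise with \bind, \parse and \bind_named would follow the examples quoted above.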
Re: Re: bt Scankey in another contradictory case
Hi,

I have reanalysed the code of the function _bt_first. I notice that, when using a multi-attribute index, if we can't identify the starting boundaries and the following attributes are marked not required, we need to start at the first or last page of the index and examine every tuple to check whether it satisfies the qual; meanwhile, the scan will be stopped once the first attribute fails to evaluate. For instance:

create table c_s(x int, y int);
insert into c_s select generate_series(1, 20000), generate_series(1, 20000);
create index my_c_s_idx on c_s using btree(x, y);
explain (analyze, buffers) select * from c_s where x > 4000 and y > 10 and y < 10 order by x desc;

 Index Only Scan Backward using my_c_s_idx on c_s  (cost=0.29..384.31 rows=1 width=8) (actual time=1.302..1.304 rows=0 loops=1)
   Index Cond: ((x > 4000) AND (y > 10) AND (y < 10))
   Heap Fetches: 0
   Buffers: shared read=46
 Planning:
   Buffers: shared hit=51 read=15
 Planning Time: 1.311 ms
 Execution Time: 1.435 ms
(8 rows)

This example is a little different from the description above due to the implied NOT NULL scan key, but that has no effect on the overall situation.

What's more, if I change the number 4000 to 1000:

 Sort  (cost=441.01..441.01 rows=1 width=8) (actual time=2.974..2.975 rows=0 loops=1)
   Sort Key: x DESC
   Sort Method: quicksort  Memory: 25kB
   Buffers: shared hit=89
   ->  Seq Scan on c_s  (cost=0.00..441.00 rows=1 width=8) (actual time=2.971..2.971 rows=0 loops=1)
         Filter: ((x > 1000) AND (y > 10) AND (y < 10))
         Rows Removed by Filter: 20000
         Buffers: shared hit=89
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.113 ms
 Execution Time: 2.990 ms
(12 rows)

The planner chose a Seq Scan, and the executor did the unnecessary work twice.

Let's not confine ourselves to plain attributes versus row comparisons, or Seq Scan versus Index Scan. We can treat a row comparison as a multi-attribute comparison. The qual is in implicit-AND form, which means that once one attribute is self-contradictory we can abandon the whole qual immediately.

Maybe we can do a more efficient job at planning time. For example, at the phase where deconstruct_recurse is invoked, we could identify a qual that is self-contradictory and flag it. With this information we would know which relation is what I'll arbitrarily call a dummy relation. For instance:

explain (analyze, buffers) select * from c_s a, c_s b where a.y > 10 and a.y < 10 and a.x = b.x;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.29..393.31 rows=1 width=16) (actual time=1.858..1.858 rows=0 loops=1)
   Buffers: shared hit=89
   ->  Seq Scan on c_s a  (cost=0.00..389.00 rows=1 width=8) (actual time=1.857..1.857 rows=0 loops=1)
         Filter: ((y > 10) AND (y < 10))
         Rows Removed by Filter: 20000
         Buffers: shared hit=89
   ->  Index Only Scan using my_c_s_idx on c_s b  (cost=0.29..4.30 rows=1 width=8) (never executed)
         Index Cond: (x = a.x)
         Heap Fetches: 0
 Planning:
   Buffers: shared hit=12
 Planning Time: 0.236 ms
 Execution Time: 1.880 ms
(13 rows)

After deconstruct_recurse has been invoked, the qual (a.y > 10 and a.y < 10) is distributed to relation "a", and relation "a" becomes a dummy relation. When "a" and "b" are inner-joined, "a" will supply nothing, which means the inner-join rel is a dummy relation too. If there is an outer join above the inner join that can commute with it, we can do an even better job, and so on. This also has a benefit in the path-competition phase, since the check doesn't cost anything.

Whether this idea is feasible or not needs discussion, and it will take a lot of effort to complete such a feature. Anyway, we can fix the issues we have already encountered first.
bigbro...@hotmail.com

From: Peter Geoghegan
Date: 2024-08-30 22:32
To: b ro
CC: pgsql-hackers
Subject: Re: bt Scankey in another contradictory case

On Fri, Aug 30, 2024 at 7:36 AM b ro wrote:
> this is the patch attachment.

We discussed this recently:

https://www.postgresql.org/message-id/80384.1715458896%40sss.pgh.pa.us

I think that we should do this. It doesn't make a huge difference in practice, because we'll still end the scan once the leaf level is reached. But it matters more when array keys are involved, where there might be more than one descent to the leaf level. Plus we might as well just be thorough about this stuff.

--
Peter Geoghegan
Re: Collect statistics about conflicts in logical replication
On Fri, Aug 30, 2024 at 4:24 PM shveta malik wrote: > > On Fri, Aug 30, 2024 at 10:53 AM Peter Smith wrote: > > ... > > 2. Arrange all the counts into an intuitive/natural order > > > > There is an intuitive/natural ordering for these counts. For example, > > the 'confl_*' count fields are in the order insert -> update -> > > delete, which LGTM. > > > > Meanwhile, the 'apply_error_count' and the 'sync_error_count' are not > > in a good order. > > > > IMO it makes more sense if everything is ordered as: > > 'sync_error_count', then 'apply_error_count', then all the 'confl_*' > > counts. > > > > This comment applies to lots of places, e.g.: > > - docs (doc/src/sgml/monitoring.sgml) > > - function pg_stat_get_subscription_stats (pg_proc.dat) > > - view pg_stat_subscription_stats (src/backend/catalog/system_views.sql) > > - TAP test SELECTs (test/subscription/t/026_stats.pl) > > > > As all those places are already impacted by this patch, I think it > > would be good if (in passing) we (if possible) also swapped the > > sync/apply counts so everything is ordered intuitively top-to-bottom > > or left-to-right. > > Not sure about this though. It does not seem to belong to the current patch. > Fair enough. But, besides being inappropriate to include in the current patch, do you think the suggestion to reorder them made sense? If it has some merit, then I will propose it again as a separate thread. == Kind Regards, Peter Smith. Fujitsu Australia
Re: JIT: Remove some unnecessary instructions.
On Fri, Aug 30, 2024 at 8:50 PM Andreas Karlsson wrote:
>
> On 8/30/24 5:55 AM, Xing Guo wrote:
> > I find there are some unnecessary load/store instructions being
> > emitted by the JIT compiler.
>
> Well spotted! All of these are obvious dead instructions and while LLVM
> might be able to optimize them away there is no reason to create extra
> work for the optimizer.
>
> The patch looks good, applies and the tests passes.

Thanks for testing it! I spotted another unnecessary store instruction and added it in my V2 patch.

Best Regards,
Xing

From ca30976ebf70ebed12ede7999d4ab637a9851641 Mon Sep 17 00:00:00 2001
From: Xing Guo
Date: Mon, 2 Sep 2024 10:21:37 +0800
Subject: [PATCH v2] JIT: Remove unnecessary load/store instructions.

---
 src/backend/jit/llvm/llvmjit_expr.c | 16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 48ccdb942a..7ece2e3d2e 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -720,11 +720,6 @@ llvm_compile_expr(ExprState *state)
 
                 v_boolnull = l_load(b, TypeStorageBool, v_resnullp, "");
                 v_boolvalue = l_load(b, TypeSizeT, v_resvaluep, "");
-                /* set resnull to boolnull */
-                LLVMBuildStore(b, v_boolnull, v_resnullp);
-                /* set revalue to boolvalue */
-                LLVMBuildStore(b, v_boolvalue, v_resvaluep);
-
                 /* check if current input is NULL */
                 LLVMBuildCondBr(b,
                                 LLVMBuildICmp(b, LLVMIntEQ, v_boolnull,
@@ -816,11 +811,6 @@ llvm_compile_expr(ExprState *state)
 
                 v_boolnull = l_load(b, TypeStorageBool, v_resnullp, "");
                 v_boolvalue = l_load(b, TypeSizeT, v_resvaluep, "");
-                /* set resnull to boolnull */
-                LLVMBuildStore(b, v_boolnull, v_resnullp);
-                /* set revalue to boolvalue */
-                LLVMBuildStore(b, v_boolvalue, v_resvaluep);
-
                 LLVMBuildCondBr(b,
                                 LLVMBuildICmp(b, LLVMIntEQ, v_boolnull,
                                               l_sbool_const(1), ""),
@@ -875,10 +865,8 @@ llvm_compile_expr(ExprState *state)
             case EEOP_BOOL_NOT_STEP:
                 {
                     LLVMValueRef v_boolvalue;
-                    LLVMValueRef v_boolnull;
                     LLVMValueRef v_negbool;
 
-                    v_boolnull = l_load(b, TypeStorageBool, v_resnullp, "");
                     v_boolvalue = l_load(b, TypeSizeT, v_resvaluep, "");
 
                     v_negbool = LLVMBuildZExt(b,
@@ -887,8 +875,7 @@ llvm_compile_expr(ExprState *state)
                                                          l_sizet_const(0), ""),
                                               TypeSizeT, "");
 
-                    /* set resnull to boolnull */
-                    LLVMBuildStore(b, v_boolnull, v_resnullp);
+
                     /* set revalue to !boolvalue */
                     LLVMBuildStore(b, v_negbool, v_resvaluep);
 
@@ -1615,7 +1602,6 @@ llvm_compile_expr(ExprState *state)
                 LLVMPositionBuilderAtEnd(b, b_argsequal);
                 LLVMBuildStore(b, l_sbool_const(1), v_resnullp);
                 LLVMBuildStore(b, l_sizet_const(0), v_resvaluep);
-                LLVMBuildStore(b, v_retval, v_resvaluep);
                 LLVMBuildBr(b, opblocks[opno + 1]);
 
                 break;
-- 
2.46.0
Re: ANALYZE ONLY
On 2024-09-01 10:31, Michael Harris wrote: Hello Atsushi & Melih Thank you both for your further feedback. On Thu, 29 Aug 2024 at 21:31, Melih Mutlu wrote: I believe moving "[ ONLY ]" to the definitions of table_and_columns below, as you did with "[ * ]", might be better to be consistent with other places (see [1]) Agreed. I have updated this. + if ((options & VACOPT_VACUUM) && is_partitioned_table && ! include_children) There are also some issues with coding conventions in some places (e.g. the space between "!" and "include_children" abode). I think running pgindent would resolve such issue in most places. I fixed that extra space, and then ran pgindent. It did not report any more problems. On Fri, 30 Aug 2024 at 16:45, torikoshia wrote: -- https://www.postgresql.org/docs/devel/progress-reporting.html > Note that when ANALYZE is run on a partitioned table, all of its > partitions are also recursively analyzed. Should we also note this is the default, i.e. not using ONLY option behavior here? -- https://www.postgresql.org/docs/devel/ddl-partitioning.html > If you are using manual VACUUM or ANALYZE commands, don't forget that > you need to run them on each child table individually. A command like: > > ANALYZE measurement; > will only process the root table. This part also should be modified, shouldn't it? Agreed. I have updated both of these. Thanks! When running ANALYZE VERBOSE ONLY on a partition table, the INFO message is like this: =# ANALYZE VERBOSE ONLY only_parted; INFO: analyzing "public.only_parted" inheritance tree I may be wrong, but 'inheritance tree' makes me feel it includes child tables. Removing 'inheritance tree' and just describing the table name as below might be better: INFO: analyzing "public.only_parted" I'm not sure about that one. If I understand the source correctly, that particular progress message tells the user that the system is gathering stats from the inheritance tree in order to update the stats of the given table, not that it is actually updating the stats of the descendant tables. That makes sense. When analyzing an inheritance structure with the ONLY you see something like this: => ANALYZE VERBOSE ONLY only_inh_parent; INFO: analyzing "public.only_inh_parent" INFO: "only_inh_parent": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows INFO: analyzing "public.only_inh_parent" inheritance tree INFO: "only_inh_child": scanned 1 of 1 pages, containing 3 live rows and 0 dead rows; 3 rows in sample, 3 estimated total rows ANALYZE The reason you don't see the first one for partitioned tables is that it corresponds to sampling the contents of the parent table itself, which in the case of a partitioned table is guaranteed to be empty, so it is skipped. I agree the text could be confusing, and in fact is probably confusing even today without the ONLY keyword, Yeah, it seems this isn't dependent on your proposal. (BTW I'm also wondering if the expression “inheritance" is appropriate when the target is a partitioned table, but this should be discussed in another thread) -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
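For readers following along, the behaviour being discussed boils down to the following (the ONLY syntax is what this thread proposes, so it is subject to change):

    -- default: analyze the partitioned table and, recursively, all of its partitions
    ANALYZE VERBOSE only_parted;

    -- with the proposed ONLY keyword: update only the parent's statistics by
    -- sampling the inheritance tree, leaving the partitions' own stats untouched
    ANALYZE VERBOSE ONLY only_parted;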
Re: optimizing pg_upgrade's once-in-each-database steps
On Sat, Aug 31, 2024 at 01:18:10AM +0100, Ilya Gladyshev wrote: > LGTM in general, but here are some final nitpicks: Thanks for reviewing. > + if (maxFd != 0) > + (void) select(maxFd + 1, &input_mask, &output_mask, > &except_mask, NULL); > > It´s a good idea to check for the return value of select, in case it > returns any errors. Done. > + dbs_complete++; > + (void) PQgetResult(slot->conn); > + PQfinish(slot->conn); > > Perhaps it´s rather for me to understand, what does PQgetResult call do > here? I believe I was trying to follow the guidance that you should always call PQgetResult() until it returns NULL, but looking again, I don't see any need to call it since we free the connection immediately afterwards. > + /* Check for connection failure. */ > + if (PQconnectPoll(slot->conn) == PGRES_POLLING_FAILED) > + pg_fatal("connection failure: %s", > PQerrorMessage(slot->conn)); > + > + /* Check whether the connection is still establishing. > */ > + if (PQconnectPoll(slot->conn) != PGRES_POLLING_OK) > + return; > > Are the two consecutive calls of PQconnectPoll intended here? Seems a bit > wasteful, but maybe that´s ok. I think we can actually just use PQstatus() here. But furthermore, I think the way I was initiating connections was completely bogus. IIUC before calling PQconnectPoll() the first time, we should wait for a write indicator from select(), and then we should only call PQconnectPoll() after subsequent indicators from select(). After spending quite a bit of time staring at the PQconnectPoll() code, I'm quite surprised I haven't seen any problems thus far. If I had to guess, I'd say that either PQconnectPoll() is more resilient than I think it is, or I've just gotten lucky because pg_upgrade establishes connections quickly. Anyway, to fix this, I've added some more fields to the slot struct to keep track of the information we need to properly establish connections, and we now pay careful attention to the return value of select() so that we know which slots are ready for processing. This seemed like a nice little optimization independent of fixing connection establishment. I was worried this was going to require a lot more code, but I think it only added ~50 lines or so. > We should probably mention this change in the docs as well, I found these > two places that I think can be improved: I've adjusted the documentation in v11. -- nathan >From 02a9eb5bb035f5fdaf1c165161d96695dec06d45 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 28 Aug 2024 10:45:59 -0500 Subject: [PATCH v11 01/11] Introduce framework for parallelizing various pg_upgrade tasks. A number of pg_upgrade steps require connecting to every database in the cluster and running the same query in each one. When there are many databases, these steps are particularly time-consuming, especially since these steps are performed sequentially in a single process. This commit introduces a new framework that makes it easy to parallelize most of these once-in-each-database tasks. Specifically, it manages a simple state machine of slots and uses libpq's asynchronous APIs to establish the connections and run the queries. The --jobs option is used to determine the number of slots to use. To use this new task framework, callers simply need to provide the query and a callback function to process its results, and the framework takes care of the rest. A more complete description is provided at the top of the new task.c file. None of the eligible once-in-each-database tasks are converted to use this new framework in this commit. 
That will be done via several follow-up commits. Reviewed-by: Jeff Davis, Robert Haas, Daniel Gustafsson, Ilya Gladyshev, Corey Huinker Discussion: https://postgr.es/m/20240516211638.GA1688936%40nathanxps13 --- doc/src/sgml/ref/pgupgrade.sgml | 6 +- src/bin/pg_upgrade/Makefile | 1 + src/bin/pg_upgrade/meson.build | 1 + src/bin/pg_upgrade/pg_upgrade.h | 21 ++ src/bin/pg_upgrade/task.c| 491 +++ src/tools/pgindent/typedefs.list | 5 + 6 files changed, 522 insertions(+), 3 deletions(-) create mode 100644 src/bin/pg_upgrade/task.c diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml index 9877f2f01c..fc2d0ff845 100644 --- a/doc/src/sgml/ref/pgupgrade.sgml +++ b/doc/src/sgml/ref/pgupgrade.sgml @@ -118,7 +118,7 @@ PostgreSQL documentation -j njobs --jobs=njobs - number of simultaneous processes or threads to use + number of simultaneous connections and processes/threads to use @@ -587,8 +587,8 @@ NET STOP postgresql-&majorversion; The --jobs option allows multiple CPU cores to be used - for copying/linking of files and to dump and restor
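For reference, the connection-establishment fix described above follows the documented libpq PQconnectStart()/PQconnectPoll() pattern: wait for the socket readiness that the previous poll asked for, then poll again. A minimal sketch (the connect_async() wrapper name is illustrative only, and error handling is trimmed):

    #include <sys/select.h>
    #include "libpq-fe.h"

    static PGconn *
    connect_async(const char *conninfo)
    {
        PGconn *conn = PQconnectStart(conninfo);
        PostgresPollingStatusType status = PGRES_POLLING_WRITING;

        while (conn != NULL &&
               status != PGRES_POLLING_OK &&
               status != PGRES_POLLING_FAILED)
        {
            fd_set  rmask, wmask;
            int     sock = PQsocket(conn);

            FD_ZERO(&rmask);
            FD_ZERO(&wmask);
            if (status == PGRES_POLLING_READING)
                FD_SET(sock, &rmask);
            else
                FD_SET(sock, &wmask);

            /* wait for the indicated readiness before polling again */
            if (select(sock + 1, &rmask, &wmask, NULL, NULL) < 0)
                break;

            status = PQconnectPoll(conn);
        }
        return conn;        /* caller checks PQstatus(conn) == CONNECTION_OK */
    }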
Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description
On Mon, Sep 2, 2024 at 5:47 AM Peter Smith wrote: > > Hi hackers. While reviewing another thread I had cause to look at the > docs for the pg_replication_slot 'inactive_since' field [1] for the > first time. > > I was confused by the description, which is as follows: > > inactive_since timestamptz > The time since the slot has become inactive. NULL if the slot is > currently being used. > > > Then I found the github history for the patch [2], and the > accompanying long thread discussion [3] about the renaming of that > field. I have no intention to re-open that can-of-worms, but OTOH I > feel the first sentence of the field description is wrong and needs > fixing. > > Specifically, IMO describing something as "The time since..." means > some amount of elapsed time since some occurrence, but that is not the > correct description for this timestamp field. > > This is not just a case of me being pedantic. For example, here is > what Chat-GPT had to say: > > I asked: > What does "The time since the slot has become inactive." mean? > > ChatGPT said: > "The time since the slot has become inactive" refers to the duration > that has passed from the moment a specific slot (likely a database > replication slot or a similar entity) stopped being active. In other > words, it measures how much time has elapsed since the slot > transitioned from an active state to an inactive state. > > For example, if a slot became inactive 2 hours ago, "the time since > the slot has become inactive" would be 2 hours. > > > To summarize, the current description wrongly describes the field as a > time duration: > "The time since the slot has become inactive." > > I suggest replacing it with: > "The slot has been inactive since this time." > +1 for the change. If I had read the document without knowing about the patch, I too would have interpreted it as a duration. thanks Shveta
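The distinction is easy to see with a query: inactive_since is a point in time, and any duration has to be computed from it. A small sketch against the PG17 pg_replication_slots view:

    SELECT slot_name,
           inactive_since,                      -- a timestamptz, not an interval
           now() - inactive_since AS inactive_for
    FROM pg_replication_slots
    WHERE NOT active;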
Re: Improving tracking/processing of buildfarm test failures
Hello hackers, Please take a look at the August report on buildfarm failures: # SELECT br, count(*) FROM failures WHERE dt >= '2024-08-01' AND dt < '2024-09-01' GROUP BY br; REL_12_STABLE: 2 REL_13_STABLE: 2 REL_14_STABLE: 12 REL_15_STABLE: 3 REL_16_STABLE: 5 REL_17_STABLE: 17 HEAD: 38 -- Total: 79 (Counting test failures only, excluding indent-check, Configure, Build errors.) # SELECT COUNT(*) FROM (SELECT DISTINCT issue_link FROM failures WHERE dt >= '2024-08-01' AND dt < '2024-09-01'); 21 # SELECT issue_link, count(*) FROM failures WHERE dt >= '2024-08-01' AND dt < '2024-09-01' GROUP BY issue_link ORDER BY 2 DESC LIMIT 6; https://www.postgresql.org/message-id/8ce8261a-bf3a-25e6-b473-4808f50a6ea7%40gmail.com: 13 -- An environmental issue; fixed https://www.postgresql.org/message-id/a9a97e83-9ec8-5de5-bf69-80e9560f5...@gmail.com: 9 -- An environmental issue?; probably fixed https://www.postgresql.org/message-id/4db099c8-4a52-3cc4-e970-14539a319...@gmail.com: 7 -- Fixed https://www.postgresql.org/message-id/c720cdc3-5ce0-c410-4854-70788175c...@gmail.com: 6 -- Expected to be fixed with Release 18 of the buildfarm client https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986b...@gmail.com: 5 https://www.postgresql.org/message-id/3618203.1722473...@sss.pgh.pa.us: 4 -- Fixed # SELECT count(*) FROM failures WHERE dt >= '2024-08-01' AND dt < '2024-09-01' AND issue_link IS NULL; -- Unsorted/unhelpful failures 13 Short-lived failures: 21 There were also two mysterious never-before-seen failures, both occurred on POWER animals: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kingsnake&dt=2024-08-19%2019%3A17%3A59 - REL_17_STABLE https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=iguana&dt=2024-08-29%2013%3A57%3A57 - REL_15_STABLE (I'm not sure yet, whether they should be considered "unhelpful". I'll wait for more information from these animals/buildfarm in general to determine what to do with these failures.) Best regards, Alexander
Re: ANALYZE ONLY
On Sun, 1 Sept 2024 at 13:41, Michael Harris wrote: > https://commitfest.postgresql.org/49/5226/ > > I was not sure what to put for some of the fields - for 'reviewer' should > I have put the people who have provided feedback on this thread? I think it's fairly common that only reviewers either add themselves or don't bother. I don't think it's common that patch authors add reviewers. Some reviewers like their reviews to be more informal and putting themselves as a reviewer can put other people off reviewing as they think it's already handled by someone else. That may result in the patch being neglected by reviewers. > Is there anything else I need to do? I see you set the target version as 17. That should be blank or "18". PG17 is about to go into release candidate, so it is too late for that. It's been fairly consistent that we open for new patches in early July and close before the 2nd week of April the following year. We're currently 2 months into PG18 dev. We don't back-patch new features. Nothing else aside from continuing to address reviews, as you are already. David
Re: In-place persistence change of a relation
On Sun, Sep 01, 2024 at 10:15:00PM +0300, Heikki Linnakangas wrote: > I wonder if the twophase state files and undo log files should be merged > into one file. They're similar in many ways: there's one file per > transaction, named using the XID. I haven't thought this fully through, just > a thought.. Hmm. It could be possible to extract some of this knowledge out of twophase.c and design some APIs that could be used for both, but would that be really necessary? The 2PC data and the LSNs used by the files to check if things are replayed or on disk rely on GlobalTransactionData that has its own idea of things and timings at recovery. Or perhaps your point is actually to do that and add one layer for the file handlings and their flush timings? I am not sure, TBH, what this thread is trying to fix is complicated enough that it may be better to live with two different code paths. But perhaps my gut feeling is just wrong reading your paragraph. -- Michael signature.asc Description: PGP signature
Re: Make printtup a bit faster
Andy Fan writes:
> The attached is a PoC of this idea; no matter which method is adopted
> (rewriting all the out functions or adding an optional print function), I think
> the benefit will be similar. In the test case below, it shows a 10%+
> improvement (0.134ms vs 0.110ms).

After working on more {type}_print functions, I'm thinking this is rather like a third I/O function, which would add confusing maintenance effort, so I agree that refactoring the existing out functions is the better idea.

I'd like to work on the _print function bodies first for easy review and testing. After all, if some common issues exist in these changes, it is better to know that before we work on the 700+ out functions.

--
Best Regards
Andy Fan
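To illustrate the shape being discussed (a backend-side sketch only; the function name and exact signature are placeholders, not a settled API), a per-type "print" routine would append the text form straight into the caller's buffer instead of returning a palloc'd cstring:

    /* Hypothetical counterpart to int4out() that avoids the cstring round-trip. */
    static void
    int4_print(StringInfo buf, Datum value)
    {
        appendStringInfo(buf, "%d", DatumGetInt32(value));
    }

printtup could then call such a routine with its per-row StringInfo, skipping the intermediate allocation and copy that the regular out function requires.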
Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
On Fri, Aug 30, 2024 at 04:10:32PM -0400, Andrew Dunstan wrote:
> On 2024-08-29 Th 4:44 PM, Jacob Champion wrote:
> > As for the other patches, I'll ping Andrew about 0001,
>
> Patch 0001 looks sane to me.

So does 0002 to me.

I'm not much a fan of the addition of pgstat_bestart_pre_auth(), which is just a shortcut to set a different state in the backend entry to tell that it is authenticating. Is "authenticating" the right term for this state of process startup, actually? Could it be more transparent to use a "startup" or "starting" state instead, which would also be used by pgstat_bestart() in the case of the patch where !pre_auth?

The addition of the new wait event states in 0004 is a good idea, indeed, and these can be seen in pg_stat_activity once we get out of PGSTAT_END_WRITE_ACTIVITY() (err.. Right?).
--
Michael

signature.asc
Description: PGP signature
Re: Collect statistics about conflicts in logical replication
On Fri, Aug 30, 2024 at 6:36 PM shveta malik wrote: > > On Fri, Aug 30, 2024 at 12:15 PM Zhijie Hou (Fujitsu) > wrote: > > > > > > Here is V5 patch which addressed above and Shveta's[1] comments. > > > > The patch looks good to me. > Patch v5 LGTM. == Kind Regards, Peter Smith. Fujitsu Australia
Re: Improving tracking/processing of buildfarm test failures
Hello everyone, I am a developer interested in this project. I had a little involvement with MariaDB and now I would like to work on Postgres. I have never worked with mailing lists before, so I am not sure if this is the way I should interact. I would like to be pointed to some tasks and documents to get started.
Add ExprState hashing for GROUP BY and hashed SubPlans
adf97c156 added support to allow ExprStates to support hashing and adjusted Hash Join to make use of that. That allowed a speedup in hash value generation as it allowed JIT compilation of hash values. It also allowed more efficient tuple deforming as all required attributes are deformed in one go rather than on demand when hashing each join key. The attached does the same for GROUP BY and hashed SubPlans. The win for the tuple deformation does not exist here, but there does seem to be some gains still to be had from JIT compilation. Using a scale=1 TPC-H lineitem table, I ran the attached script. The increase is far from impressive, but likely worth migrating these over to use ExprState too. master: alter system set jit = 0; latency average = 1509.116 ms latency average = 1502.496 ms latency average = 1507.560 ms alter system set jit = 1; latency average = 1396.015 ms latency average = 1392.138 ms latency average = 1396.476 ms alter system set jit_optimize_above_cost = 0; latency average = 1290.463 ms latency average = 1293.364 ms latency average = 1290.366 ms alter system set jit_inline_above_cost = 0; latency average = 1294.540 ms latency average = 1300.970 ms latency average = 1302.181 ms patched: alter system set jit = 0; latency average = 1500.183 ms latency average = 1500.911 ms latency average = 1504.150 ms (+0.31%) alter system set jit = 1; latency average = 1367.427 ms latency average = 1367.329 ms latency average = 1366.473 ms (+2.03%) alter system set jit_optimize_above_cost = 0; latency average = 1273.453 ms latency average = 1265.348 ms latency average = 1272.598 ms (+1.65%) alter system set jit_inline_above_cost = 0; latency average = 1264.657 ms latency average = 1272.661 ms latency average = 1273.179 ms (+2.29%) David #!/bin/bash nloops=3 rows=1000 dbname=postgres port=5432 seconds=30 psql -c "drop table if exists hjtbl;" -p $port $dbname psql -c "create table hjtbl (a int not null, b int not null, c int not null, d int not null, e int not null, f int not null);" -p $port $dbname psql -c "insert into hjtbl select a,a,a,a,a,a from generate_series(1,$rows) a;" -p $port $dbname psql -c "vacuum freeze analyze hjtbl;" -p $port $dbname psql -c "alter system set jit_above_cost = 0;" -p $port $dbname psql -c "alter system set jit_optimize_above_cost = 10;" -p $port $dbname psql -c "alter system set jit_inline_above_cost = 10;" -p $port $dbname psql -c "select pg_reload_conf();" -p $port $dbname for alt_sys in "alter system set jit = 0;" "alter system set jit = 1;" "alter system set jit_optimize_above_cost = 0;" "alter system set jit_inline_above_cost = 0;" do echo "$alt_sys" psql -c "$alt_sys" -p $port $dbname > /dev/null psql -c "select pg_Reload_conf();" -p $port $dbname > /dev/null q=1 for sql in "select count(*) c from (select a,b,c,d,e,f from hjtbl cross join generate_Series(1,$nloops) offset 0) h1 inner join hjtbl using(a);" "select count(*) c from (select a,b,c,d,e,f from hjtbl cross join generate_Series(1,$nloops) offset 0) h1 inner join hjtbl using(a,b);" "select count(*) c from (select a,b,c,d,e,f from hjtbl cross join generate_Series(1,$nloops) offset 0) h1 inner join hjtbl using(a,b,c);" "select count(*) c from (select a,b,c,d,e,f from hjtbl cross join generate_Series(1,$nloops) offset 0) h1 inner join hjtbl using(a,b,c,d);" "select count(*) c from (select a,b,c,d,e,f from hjtbl cross join generate_Series(1,$nloops) offset 0) h1 inner join hjtbl using(a,b,c,d,e);" "select count(*) c from (select a,b,c,d,e,f from hjtbl cross join generate_Series(1,$nloops) offset 0) h1 
inner join hjtbl using(a,b,c,d,e,f);" do echo $sql > bench.sql echo -n "Q$q " pgbench -n -f bench.sql -p $port -M prepared -T $seconds $dbname | grep latency q=$((q+1)) done done v1-0001-Use-ExprStates-for-hashing-in-GROUP-BY-and-SubPla.patch Description: Binary data
Re: Test 041_checkpoint_at_promote.pl failed in installcheck due to missing injection_points
On Fri, Aug 23, 2024 at 06:45:02PM -0400, Tom Lane wrote: > It exports it twice, though, which is pretty confusing. Right. I am not sure what was my state of mind back then, but this pattern has spread a bit. REL_17_STABLE is frozen for a few more days, so I'll address all the items of this thread that once the release of this week is tagged: the export duplicates and the installcheck issue. These are staged on a local branch for now. -- Michael signature.asc Description: PGP signature
Re: Collect statistics about conflicts in logical replication
On Mon, Sep 2, 2024 at 4:20 AM Peter Smith wrote: > > On Fri, Aug 30, 2024 at 4:24 PM shveta malik wrote: > > > > On Fri, Aug 30, 2024 at 10:53 AM Peter Smith wrote: > > > > ... > > > 2. Arrange all the counts into an intuitive/natural order > > > > > > There is an intuitive/natural ordering for these counts. For example, > > > the 'confl_*' count fields are in the order insert -> update -> > > > delete, which LGTM. > > > > > > Meanwhile, the 'apply_error_count' and the 'sync_error_count' are not > > > in a good order. > > > > > > IMO it makes more sense if everything is ordered as: > > > 'sync_error_count', then 'apply_error_count', then all the 'confl_*' > > > counts. > > > > > > This comment applies to lots of places, e.g.: > > > - docs (doc/src/sgml/monitoring.sgml) > > > - function pg_stat_get_subscription_stats (pg_proc.dat) > > > - view pg_stat_subscription_stats (src/backend/catalog/system_views.sql) > > > - TAP test SELECTs (test/subscription/t/026_stats.pl) > > > > > > As all those places are already impacted by this patch, I think it > > > would be good if (in passing) we (if possible) also swapped the > > > sync/apply counts so everything is ordered intuitively top-to-bottom > > > or left-to-right. > > > > Not sure about this though. It does not seem to belong to the current patch. > > > > Fair enough. But, besides being inappropriate to include in the > current patch, do you think the suggestion to reorder them made sense? > If it has some merit, then I will propose it again as a separate > thread. > Yes, I think it makes sense. With respect to internal code, it might be still okay as is, but when it comes to pg_stat_subscription_stats, I think it is better if user finds it in below order: subid | subname | sync_error_count | apply_error_count | confl_* rather than the existing one: subid | subname | apply_error_count | sync_error_count | confl_* thanks Shveta
Re: long-standing data loss bug in initial sync of logical replication
On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal wrote: > > Next I am planning to test solely on the logical decoding side and > will share the results. > Thanks, the next set of proposed tests makes sense to me. It will also be useful to generate some worst-case scenarios where the number of invalidations is more to see the distribution cost in such cases. For example, Truncate/Drop a table with 100 or 1000 partitions. -- With Regards, Amit Kapila.
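One way to generate that kind of worst case (a sketch; the table and partition names are arbitrary) is to build a table with 1000 partitions and then truncate or drop it while decoding is in progress, so that a single command emits a large batch of invalidations:

    CREATE TABLE inval_test (id int, val text) PARTITION BY RANGE (id);

    SELECT format('CREATE TABLE inval_test_%s PARTITION OF inval_test
                       FOR VALUES FROM (%s) TO (%s)',
                  i, i * 1000, (i + 1) * 1000)
    FROM generate_series(0, 999) AS i
    \gexec

    -- each of these produces a large number of relcache invalidations:
    TRUNCATE inval_test;
    DROP TABLE inval_test;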
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Thu, Aug 29, 2024 at 11:31 AM Bharath Rupireddy wrote: > > Thanks for looking into this. > > On Mon, Aug 26, 2024 at 4:35 PM Amit Kapila wrote: > > > > Few comments on 0001: > > 1. > > @@ -651,6 +651,13 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid > > > > + /* > > + * Skip the sync if the local slot is already invalidated. We do this > > + * beforehand to avoid slot acquire and release. > > + */ > > > > I was wondering why you have added this new check as part of this > > patch. If you see the following comments in the related code, you will > > know why we haven't done this previously. > > Removed. Can deal with optimization separately. > > > 2. > > + */ > > + InvalidateObsoleteReplicationSlots(RS_INVAL_INACTIVE_TIMEOUT, > > +0, InvalidOid, InvalidTransactionId); > > > > Why do we want to call this for shutdown case (when is_shutdown is > > true)? I understand trying to invalidate slots during regular > > checkpoint but not sure if we need it at the time of shutdown. > > Changed it to invalidate only for non-shutdown checkpoints. inactive_timeout > invalidation isn't critical for shutdown unlike wal_removed which can help > shutdown by freeing up some disk space. > > > The > > other point is can we try to check the performance impact with 100s of > > slots as mentioned in the code comments? > > I first checked how much does the wal_removed invalidation check add to the > checkpoint (see 2nd and 3rd column). I then checked how much inactive_timeout > invalidation check adds to the checkpoint (see 4th column), it is not more > than wal_remove invalidation check. I then checked how much the wal_removed > invalidation check adds for replication slots that have already been > invalidated due to inactive_timeout (see 5th column), looks like not much. > > | # of slots | HEAD (no invalidation) ms | HEAD (wal_removed) ms | PATCHED > (inactive_timeout) ms | PATCHED (inactive_timeout+wal_removed) ms | > |||---|---|--| > | 100| 18.591 | 370.586 | 359.299 > | 373.882 | > | 1000 | 15.722 | 4834.901 | 5081.751 > | 5072.128 | > | 1 | 19.261 | 59801.062 | 61270.406 > | 60270.099| > > Having said that, I'm okay to implement the optimization specified. Thoughts? > The other possibility is to try invalidating due to timeout along with wal_removed case during checkpoint. The idea is that if the slot can be invalidated due to WAL then fine, otherwise check if it can be invalidated due to timeout. This can avoid looping the slots and doing similar work multiple times during the checkpoint. -- With Regards, Amit Kapila.
Re: Minor refactor: Use more consistent names for the labels of PG_Locale_Strategy
On Thu, Aug 29, 2024 at 10:06:32AM +0900, Michael Paquier wrote: > +1 for your suggestion, as you are suggesting. The original intention > when PG_Locale_Strategy got introduced was to have everything named as > PG_REGEX_LOCALE_*, but with the built-in part coming in play in this > code adding "STRATEGY" is cleaner than just "LOCALE". Applied this one as 23138284cde4. -- Michael signature.asc Description: PGP signature
Re: Track the amount of time waiting due to cost_delay
Hi, On Tue, Aug 20, 2024 at 12:48:29PM +, Bertrand Drouvot wrote: > As it looks like we have a consensus not to wait on [0] (as reducing the > number > of interrupts makes sense on its own), then please find attached v4, a rebase > version (that also makes clear in the doc that that new field might show > slightly > old values, as mentioned in [1]). Please find attached v5, a mandatory rebase. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 1a14b708e0ee74c2f38835968d828c54022a5526 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Mon, 24 Jun 2024 08:43:26 + Subject: [PATCH v5] Report the total amount of time that vacuum has been delayed due to cost delay This commit adds one column: time_delayed to the pg_stat_progress_vacuum system view to show the total amount of time in milliseconds that vacuum has been delayed. This uses the new parallel message type for progress reporting added by f1889729dd. In case of parallel worker, to avoid the leader to be interrupted too frequently (while it might be sleeping for cost delay), the report is done only if the last report has been done more than 1 second ago. Having a time based only approach to throttle the reporting of the parallel workers sounds reasonable. Indeed when deciding about the throttling: 1. The number of parallel workers should not come into play: 1.1) the more parallel workers is used, the less the impact of the leader on the vacuum index phase duration/workload is (because the repartition is done on more processes). 1.2) the less parallel workers is, the less the leader will be interrupted ( less parallel workers would report their delayed time). 2. The cost limit should not come into play as that value is distributed proportionally among the parallel workers (so we're back to the previous point). 3. The cost delay does not come into play as the leader could be interrupted at the beginning, the midle or whatever part of the wait and we are more interested about the frequency of the interrupts. 3. A 1 second reporting "throttling" looks a reasonable threshold as: 3.1 the idea is to have a significant impact when the leader could have been interrupted say hundred/thousand times per second. 3.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum multiple times per second (so a one second reporting granularity seems ok). Bump catversion because this changes the definition of pg_stat_progress_vacuum. --- doc/src/sgml/monitoring.sgml | 13 src/backend/catalog/system_views.sql | 2 +- src/backend/commands/vacuum.c| 49 src/include/catalog/catversion.h | 2 +- src/include/commands/progress.h | 1 + src/test/regress/expected/rules.out | 3 +- 6 files changed, 67 insertions(+), 3 deletions(-) 23.5% doc/src/sgml/ 4.2% src/backend/catalog/ 63.4% src/backend/commands/ 4.6% src/include/ 4.0% src/test/regress/expected/ diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 55417a6fa9..d87604331a 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -6307,6 +6307,19 @@ FROM pg_stat_get_backend_idset() AS backendid; cleaning up indexes. + + + + time_delayed bigint + + + Total amount of time spent in milliseconds waiting due to vacuum_cost_delay + or autovacuum_vacuum_cost_delay. In case of parallel + vacuum the reported time is across all the workers and the leader. This + column is updated at a 1 Hz frequency (one time per second) so could show + slightly old values. 
+ + diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 19cabc9a47..875df7d0e4 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1218,7 +1218,7 @@ CREATE VIEW pg_stat_progress_vacuum AS S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count, S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes, S.param8 AS num_dead_item_ids, S.param9 AS indexes_total, -S.param10 AS indexes_processed +S.param10 AS indexes_processed, S.param11 AS time_delayed FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 7d8e9d2045..5bf2e37d3f 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -40,6 +40,7 @@ #include "catalog/pg_inherits.h" #include "commands/cluster.h" #include "commands/defrem.h" +#include "commands/progress.h" #include "commands/vacuum.h" #include "miscadmin.h" #include "nodes/makefuncs.h" @@ -60,6 +61,12 @@ #include "utils/snapmgr.h" #include "utils/syscache.h" +/* + * Minimum amount of time (in ms)
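With the patch applied, the new counter can be read alongside the existing progress columns, for example (time_delayed is the column added by the patch above; the other columns already exist in pg_stat_progress_vacuum):

    SELECT pid,
           datname,
           relid::regclass AS relation,
           phase,
           time_delayed             -- total ms spent sleeping due to cost delay
    FROM pg_stat_progress_vacuum;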
DOCS - pg_replication_slot . Fix the 'inactive_since' description
Hi hackers. While reviewing another thread I had cause to look at the docs for the pg_replication_slot 'inactive_since' field [1] for the first time. I was confused by the description, which is as follows: inactive_since timestamptz The time since the slot has become inactive. NULL if the slot is currently being used. Then I found the github history for the patch [2], and the accompanying long thread discussion [3] about the renaming of that field. I have no intention to re-open that can-of-worms, but OTOH I feel the first sentence of the field description is wrong and needs fixing. Specifically, IMO describing something as "The time since..." means some amount of elapsed time since some occurrence, but that is not the correct description for this timestamp field. This is not just a case of me being pedantic. For example, here is what Chat-GPT had to say: I asked: What does "The time since the slot has become inactive." mean? ChatGPT said: "The time since the slot has become inactive" refers to the duration that has passed from the moment a specific slot (likely a database replication slot or a similar entity) stopped being active. In other words, it measures how much time has elapsed since the slot transitioned from an active state to an inactive state. For example, if a slot became inactive 2 hours ago, "the time since the slot has become inactive" would be 2 hours. To summarize, the current description wrongly describes the field as a time duration: "The time since the slot has become inactive." I suggest replacing it with: "The slot has been inactive since this time." The attached patch makes this suggested change. == [1] docs - https://www.postgresql.org/docs/devel/view-pg-replication-slots.html [2] thread - https://www.postgresql.org/message-id/ca+tgmob_ta-t2ty8qrkhbgnnlrf4zycwhghgfsuuofraedw...@mail.gmail.com [3] push - https://github.com/postgres/postgres/commit/6d49c8d4b4f4a20eb5b4c501d78cf894fa13c0ea Kind Regards, Peter Smith. Fujitsu Australia v1-0001-fix-description-for-inactive_since.patch Description: Binary data
Re: scalability bottlenecks with (many) partitions (and more)
Hi,

While discussing this patch with Robert off-list, one of the questions he asked was whether there's some size threshold after which it starts to have a negative impact. I didn't have a good answer to that - I did have some intuition (that making it too large would not hurt), but I hadn't done any tests with "extreme" sizes of the fast-path structs.

So I ran some more tests, with up to 4096 "groups" (which means 64k fast-path slots). And no matter how I slice the results, there's no clear regression point beyond which the performance would start to decline (even just slowly). It's the same for all benchmarks, client counts, query modes, and so on.

I'm attaching two PDFs with results for the "join" benchmark I described earlier (query with a join on many partitions) from EPYC 7763 (64/128c). The first one is with "raw" data (throughput = tps), the second one is relative throughput to the first column (which is pretty much current master, with no optimizations applied).

The complete results including some nice .odp spreadsheets and scripts are available here:

https://github.com/tvondra/pg-lock-scalability-results

There's often a very clear point where the performance significantly improves - this is usually when all the relation locks start to fit into the fast-path array. With 1000 relations that's ~64 groups, and so on. But there's no point where it would start declining.

My explanation is that the PGPROC (where the fast-path array is) is so large already (close to 1kB) that making it larger does not really cause any additional cache misses, etc. And if it does, it's far outweighed by the cost of accessing (or not having to access) the shared lock table. So I don't think there's any point at which we'd start to regress, at least not because of cache misses, CPU etc. It stops improving, but that's just a sign that we've hit some other bottleneck - that's not a fault of this patch.

But that's not the whole story, of course. Because if there were no issues, why not just make the fast-path array insanely large? In another off-list discussion Andres asked me about the memory this would need, and after looking at the numbers I think that's a strong argument to keep the numbers reasonable.

I did a quick experiment to see the per-connection memory requirements, and how they would be affected by this patch. I simply logged the amount of shared memory from CalculateShmemSize(), started the server with 100 and 1000 connections, and did a bit of math to calculate how much memory we need "per connection".

For master and different numbers of fast-path groups I got this:

    master       64     1024     32765
     47668    52201   121324   2406892

So by default we need ~48kB / connection, with 64 groups we need ~52kB (which makes sense because that's 1024 x 4B slots), then with 1024 groups we get to ~120kB, and with 32k ~2.5MB.

I guess those higher values seem a bit insane - we don't want to just increase the per-connection memory requirements 50x for everyone, right? But what about the people who actually want this many locks? Let's bump max_locks_per_transaction from 64 to 1024, and we get this:

    master       64     1024     32765
    419367   423909   493022   2778590

Suddenly, the differences are much smaller, especially for the 64 groups, which is roughly the same number of fast-path slots as the max locks per transaction. That shrunk to ~1% difference. And even for 1024 groups it's now just ~20%, which I think is well worth the benefits. And it is likely something the system should have available - with 1000 connections that's ~80MB.
And if you run with 1000 connections, 80MB should be rounding error, IMO. Of course, it does not seem great to force everyone to pay this price, even if they don't need that many locks (and so there's no benefit). So how would we improve that? I don't think that's possible with hard-coded size of the array - that allocates the memory for everyone. We'd need to make it variable-length, and while doing those benchmarks I think we actually already have a GUC for that - max_locks_per_transaction tells us exactly what we need to know, right? I mean, if I know I'll need ~1000 locks, why not to make the fast-path array large enough for that? Of course, the consequence of this would be making PGPROC variable length, or having to point to a memory allocated separately (I prefer the latter option, I think). I haven't done any experiments, but it seems fairly doable - of course, not sure if it might be more expensive compared to compile-time constants. At this point I think it's fairly clear we have significant bottlenecks when having to lock many relations - and that won't go away, thanks to partitioning etc. We're already fixing various other bottlenecks for these workloads, which will just increase pressure on locking. Fundamentally, I think we'll need to either evolve the fast-path sys
Re: Pgstattuple on Sequences: Seeking Community Feedback on Potential Patch
On Fri, Aug 30, 2024 at 04:06:03PM -0500, Nathan Bossart wrote: > I suppose it would be difficult to argue that it is actually useful, given > it hasn't worked since v11 and apparently nobody noticed until recently. > If we're content to leave it unsupported, then sure, let's just remove the > "relkind == RELKIND_SEQUENCE" check in pgstat_relation(). But I also don't > have a great reason to _not_ support it. It used to work (which appears to > have been intentional, based on the code), it was unintentionally broken, > and it'd work again with a ~1 line change. "SELECT count(*) FROM > my_sequence" probably doesn't provide a lot of value, but I have no > intention of proposing a patch that removes support for that. IMO, it can be useful to check the state of the page used by a sequence. We have a few tweaks in sequence.c like manipulations of t_infomask, and I can be good to check for its status on corrupted systems. -- Michael signature.asc Description: PGP signature
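For the record, the call in question is simply the following (using the pgstattuple extension; whether it succeeds on a sequence is exactly what is being debated above):

    CREATE EXTENSION IF NOT EXISTS pgstattuple;

    CREATE SEQUENCE my_sequence;

    -- works again with the ~1 line change discussed above; currently rejected
    -- for sequences on releases since v11
    SELECT * FROM pgstattuple('my_sequence');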
Re: In-place persistence change of a relation
On 31/08/2024 19:09, Kyotaro Horiguchi wrote: - UNDO log(0002): This handles file deletion during transaction aborts, which was previously managed, in part, by the commit XLOG record at the end of a transaction. - Prevent orphan files after a crash (0005): This is another use-case of the UNDO log system. Nice, I'm very excited if we can fix that long-standing issue! I'll try to review this properly later, but at a quick 5 minute glance, one thing caught my eye: This requires fsync()ing the per-xid undo log file every time a relation is created. I fear that can be a pretty big performance hit for workloads that repeatedly create and drop small tables. Especially if they're otherwise running with synchronous_commit=off. Instead of flushing the undo log file after every write, I'd suggest WAL-logging the undo log like regular relations and SLRUs. So before writing the entry to the undo log, WAL-log it. And with a little more effort, you could postpone creating the files altogether until a checkpoint happens, similar to how twophase state files are checkpointed nowadays. I wonder if the twophase state files and undo log files should be merged into one file. They're similar in many ways: there's one file per transaction, named using the XID. I haven't thought this fully through, just a thought.. +static void +undolog_set_filename(char *buf, TransactionId xid) +{ + snprintf(buf, MAXPGPATH, "%s/%08x", SIMPLE_UNDOLOG_DIR, xid); +} I'd suggest using FullTransactionId. Doesn't matter much, but seems like a good future-proofing. -- Heikki Linnakangas Neon (https://neon.tech)
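A minimal sketch of that FullTransactionId variant, keeping SIMPLE_UNDOLOG_DIR from the patch; the 16-hex-digit file name format is an assumption, chosen only to make the 64-bit value unambiguous:

    static void
    undolog_set_filename(char *buf, FullTransactionId fxid)
    {
        snprintf(buf, MAXPGPATH, "%s/%016llX", SIMPLE_UNDOLOG_DIR,
                 (unsigned long long) U64FromFullTransactionId(fxid));
    }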