Re: Sorting regression of text function result since commit 586b98fdf1aae
On Mon, 11 Dec 2023 15:43:12 -0500 Tom Lane wrote:

> Jehan-Guillaume de Rorthais writes:
> > It looks like since 586b98fdf1aae, the result type collation of
> > "convert_from" is forced to "C", like the patch does for type "name",
> > instead of the "default" collation for type "text".
>
> Well, convert_from() inherits its result collation from the input,
> per the normal rules for collation assignment [1].
>
> > Looking at hints in the header comment of function "exprCollation", I
> > poked around and found that the result collation wrongly follows the
> > input collation in this case.
>
> It's not "wrong", it's what the SQL standard requires.

Mh, OK. This is at least surprising behavior. Having a non-data-related
argument impact the result collation seems counter-intuitive. But I
understand this is per the standard, so no need to discuss it.

> > I couldn't find anything explaining this behavior in the changelog. It
> > looks like a regression to me, but if this is actually expected, maybe
> > this deserves some documentation patch?
>
> The v12 release notes do say
>
>     Type name now behaves much like a domain over type text that has
>     default collation "C".

Sure, and I saw it, but reading this entry, I couldn't guess it would have
such an implication on the text result of a function call. That's why I
hunted for the precise commit and was surprised to find this was the actual
change.

> You'd have similar results from an expression involving such a domain,
> I believe.
>
> I'm less than excited about patching the v12 release notes four
> years later. Maybe, if this point had come up in a more timely
> fashion, we'd have mentioned it --- but it's hardly possible to
> cover every potential implication of such a change in the
> release notes.

This could have been documented in the collation concept page, as a trap to
be aware of. A link from the release notes to such a small paragraph would
have been enough to warn devs this might have implications when mixed with
other collatable types. But I understand we cannot document all the traps
paving the way to the standard anyway.

Thank you for your explanation!

Regards,
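For the record, Tom's domain analogy can be reproduced with a tiny sketch
(the domain name "name_like" and table "d" are purely illustrative; this
assumes a database whose default collation is linguistic, e.g. fr_FR.utf8):

```sql
-- A domain over text with collation "C", as the v12 release note
-- describes for type "name".
CREATE DOMAIN name_like AS text COLLATE "C";
CREATE TABLE d (v name_like);
INSERT INTO d VALUES ('aaa'), ('aa}');

-- Sorts in byte order ('aaa' before 'aa}'), not in the database
-- default linguistic order ('aa}' before 'aaa').
SELECT v FROM d ORDER BY v;
```

An expression involving the domain inherits its "C" collation, just like
expressions involving a "name" argument do since 586b98fdf1aae.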
Sorting regression of text function result since commit 586b98fdf1aae
Hi,

A customer found what looks like a sort regression while testing his code
from v11 on a higher version. We hunted this regression down to commit
586b98fdf1aae, introduced in v12.

Consider the following test case:

  createdb -l fr_FR.utf8 -T template0 reg
  psql reg <<<"
    BEGIN;
    CREATE TABLE IF NOT EXISTS reg (
      id bigint NOT NULL,
      reg bytea NOT NULL
    );
    INSERT INTO reg VALUES
      (1, convert_to('aaa', 'UTF8')),
      (2, convert_to('aa}', 'UTF8'));
    SELECT id FROM reg ORDER BY convert_from(reg, 'UTF8');"

With the parent commit 68f6f2b7395fe, the result is:

   id
  ----
    2
    1

And with 586b98fdf1aae:

   id
  ----
    1
    2

Looking at the plans, the sort nodes are different:

* 68f6f2b7395fe: Sort Key: (convert_from(reg, 'UTF8'::name))
* 586b98fdf1aae: Sort Key: (convert_from(reg, 'UTF8'::name)) COLLATE "C"

It looks like since 586b98fdf1aae, the result type collation of
"convert_from" is forced to "C", like the patch does for type "name",
instead of the "default" collation for type "text".

Looking at hints in the header comment of function "exprCollation", I poked
around and found that the result collation wrongly follows the input
collation in this case. With 586b98fdf1aae:

  -- 2nd parameter type resolved as "name" so collation forced to "C"
  SELECT id FROM reg ORDER BY convert_from(reg, 'UTF8');
  -- 1
  -- 2

  -- Collation of 2nd parameter is forced to something else
  SELECT id FROM reg ORDER BY convert_from(reg, 'UTF8' COLLATE "default");
  -- 2
  -- 1
  -- Sort
  --   Sort Key: (convert_from(reg, 'UTF8'::name COLLATE "default"))
  --   ->  Seq Scan on reg

It seems that because the second parameter type is "name", the result
collation becomes "C" instead of being the collation associated with the
"text" type: "default".

I couldn't find anything explaining this behavior in the changelog. It
looks like a regression to me, but if this is actually expected, maybe this
deserves some documentation patch?

Regards,
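For what it's worth, a sketch of a user-side workaround (using the "reg"
table from the test case above) is to attach an explicit collation to the
function result rather than to the argument; an explicit COLLATE clause
always overrides the derived collation:

```sql
-- Forces the ORDER BY to use the database default collation instead of
-- the "C" collation inherited from the 'UTF8'::name argument, restoring
-- the pre-586b98fdf1aae ordering.
SELECT id FROM reg ORDER BY convert_from(reg, 'UTF8') COLLATE "default";
```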
Re: Query execution in Perl TAP tests needs work
On Wed, 18 Oct 2023 18:25:01 +0200 Alvaro Herrera wrote:

> On 2023-Oct-18, Robert Haas wrote:
>
> > Without FFI::Platypus, we have to write Perl code that can speak the
> > wire protocol directly. Basically, we're writing our own PostgreSQL
> > driver for Perl, though we might need only a subset of the things a
> > real driver would need to handle, and we might add some extra things,
> > like code that can send intentionally botched protocol messages.
>
> We could revive the old src/interfaces/perl5 code, which was a libpq
> wrapper -- at least the subset of it that the tests need. It was moved
> to gborg by commit 9a0b4d7f8474 and a couple more versions were made
> there, which can be seen at
> https://www.postgresql.org/ftp/projects/gborg/pgperl/stable/,
> version 2.1.1 being apparently the latest. The complete driver was
> about 3000 lines, judging by the commit that removed it. Presumably we
> don't need the whole of that.

+1 to test this. I can give it some time to revive it and post results here
if you agree and no one thinks of a show stopper.
Re: Detoasting optionally to make Explain-Analyze less misleading
Hi Stepan & all,

On Tue, 12 Sep 2023 17:16:00 +0200 stepan rutz wrote:

...
> Attached a new patch. Hoping for feedback,

Nice addition to EXPLAIN!

On the feature front, what about adding the actual detoasting/serializing
time in the explain output? That could be:

  => explain (analyze, serialize, costs off, timing off)
     select * from test_detoast;
                          QUERY PLAN
  ──────────────────────────────────────────────────────────
   Seq Scan on public.test_detoast (actual rows=N loops=N)
   Planning Time: N ms
   Execution Time: N ms
   Serialize Time: N ms
Re: EBCDIC sorting as a use case for ICU rules
Hi,

Sorry to chime in so late, I was waiting for some customer feedback.

On Wed, 21 Jun 2023 15:28:38 +0200 "Daniel Verite" wrote:

> At a conference this week I was asked if ICU could be able to
> sort like EBCDIC [2].
> It turns out it has been already asked on
> -general a few years ago [3] with no satisfactory answer at the time,
> and that it can be implemented with rules in v16.

We worked with a customer a few months ago on this question and ended up
with a procedure to build a new locale/collation for glibc and load it in
PostgreSQL [1].

Our customer built the fr_ebcdic locale file themselves, based on the
EBCDIC IBM500 codepage (including about the same characters as ISO 8859-1)
and shared it under the CC-BY license. See the attachment.

The procedure is quite simple:

1. copy this file to "/usr/share/i18n/locales/fr_ebcdic"
2. build it using "localedef -c -i fr_ebcdic -f UTF-8 fr_ebcdic.UTF-8"
3. restart your PostgreSQL instance (because of localeset weird behavior)
4. call "pg_import_system_collations('schema')" or create the collation,
   eg.:

   CREATE COLLATION fr_ebcdic (
     PROVIDER = libc,
     LC_COLLATE = fr_ebcdic.utf8,
     LC_CTYPE = fr_ebcdic.utf8
   );

Now, same question as for ICU: do we want to provide documentation about
this? Online documentation about such features is quite arid. In fact, this
could be useful in various other ways than just EBCDIC.

Regards,

[1] https://www.postgresql.org/message-id/20230209144947.1dfad6c0%40karst

fr_ebcdic
Description: Binary data
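Once the collation exists, using it is the usual COLLATE clause. A minimal
sketch (the table and column names here are hypothetical, assuming the
fr_ebcdic collation created above):

```sql
-- Sort a query with the custom EBCDIC-order collation:
SELECT name FROM customers ORDER BY name COLLATE fr_ebcdic;

-- Or bake the collation into a column definition:
CREATE TABLE legacy_export (name text COLLATE fr_ebcdic);
```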
Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails
On Thu, 3 Aug 2023 11:02:43 +0200 Alvaro Herrera wrote:

> On 2023-Aug-03, tender wang wrote:
>
> > I think old "sub-FK" should not be dropped, that will be violates foreign
> > key constraint.
>
> Yeah, I've been playing more with the patch and it is definitely not
> doing the right things. Just eyeballing the contents of pg_trigger and
> pg_constraint for partitions added by ALTER...ATTACH shows that the
> catalog contents are inconsistent with those added by CREATE TABLE
> PARTITION OF.

Well, as stated in my original message, the patch helps understanding the
problem and sketches a possible solution. It definitely is not complete.

After DETACHing the table, we surely need to check everything again and
recreate what is needed to keep the FK consistent. But should we keep the
FK after DETACH? Did you check the two other discussions related to FK,
self-FK & partitions? Unfortunately, as Tender experienced, the more we dig
the more we find bugs. Moreover, the second one might seem unsolvable and
deserves a closer look. See:

* FK broken after DETACHing referencing part
  https://www.postgresql.org/message-id/20230420144344.40744130%40karst
* Issue attaching a table to a partitioned table with an auto-referenced
  foreign key
  https://www.postgresql.org/message-id/20230707175859.17c91538%40karst
Re: Issue attaching a table to a partitioned table with an auto-referenced foreign key
So I gave a look at this one... And it's a tricky one.

The current policy about DETACHing a partition is to keep/adjust all FK
referencing it or referenced by it. However, in this exact self-referencing
use case, we can have rows referencing rows from the same partition OR
another one. It seems like an impossible issue to solve.

Here is an example based on Guillaume's scenario
([c1_old, c2_old] -> [c1, c2]):

t1:

  t1_a:
   c1 | c1_old | c2 | c2_old
  ----+--------+----+--------
    1 |   NULL |  2 |   NULL
    1 |      1 |  3 |      2
    1 |      2 |  4 |      2

  t1_b:
   c1 | c1_old | c2 | c2_old
  ----+--------+----+--------
    2 |      1 |  2 |      3

Now, what happens with the FK when we DETACH t1_a?

* keeping only a self-FK on t1_a is not enough, as it references some rows
  from t1_b: (1, 2, 4, 2) -> (2, 1, 2, 3)
* and t1_a cannot keep only a FK referencing t1 either, as it references
  some rows from itself: (1, 1, 3, 2) -> (1, NULL, 2, NULL)

I'm currently not able to think of a constraint we could build to address
this situation after the DETACH. The only clean way out would be to drop
the FK between the old partition and the partitioned table. But then, it
breaks the current policy of keeping the constraint after DETACH. Not
mentioning the nightmare of detecting this situation among other ones.

Thoughts?

On Wed, 22 Mar 2023 11:14:19 +0100 Guillaume Lelarge wrote:

> One last ping, hoping someone will have more time now than in January.
>
> Perhaps my test is wrong, but I'd like to know why.
>
> Thanks.
>
> On Tue, 17 Jan 2023 at 16:53, Guillaume Lelarge wrote:
>
> > Quick ping, just to make sure someone can get a look at this issue :)
> > Thanks.
> >
> > On Fri, 6 Jan 2023 at 11:07, Guillaume Lelarge wrote:
> >
> >> Hello,
> >>
> >> One of our customers has an issue with partitions and foreign keys. He
> >> works on a v13, but the issue is also present on v15.
> >>
> >> I attach a SQL script showing the issue, and the results on 13.7, 13.9,
> >> and 15.1. But I'll explain the script here, and its behaviour on 13.9.
> >> There is one partitioned table, two partitions and a foreign key. The
> >> foreign key references the same table:
> >>
> >> create table t1 (
> >>   c1 bigint not null,
> >>   c1_old bigint null,
> >>   c2 bigint not null,
> >>   c2_old bigint null,
> >>   primary key (c1, c2)
> >> )
> >> partition by list (c1);
> >> create table t1_a partition of t1 for values in (1);
> >> create table t1_def partition of t1 default;
> >> alter table t1 add foreign key (c1_old, c2_old) references t1 (c1, c2)
> >>   on delete restrict on update restrict;
> >>
> >> I've a SQL function that shows me some information from pg_constraints
> >> (code of the function in the SQL script attached). Here is the result
> >> of this function after creating the table, its partitions, and its
> >> foreign key:
> >>
> >> select * from show_constraints();
> >>         conname         |   t    |  tref  |       coparent
> >> ------------------------+--------+--------+-----------------------
> >>  t1_c1_old_c2_old_fkey  | t1     | t1     |
> >>  t1_c1_old_c2_old_fkey  | t1_a   | t1     | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey  | t1_def | t1     | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey1 | t1     | t1_a   | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey2 | t1     | t1_def | t1_c1_old_c2_old_fkey
> >> (5 rows)
> >>
> >> The constraint works great:
> >>
> >> insert into t1 values(1, NULL, 2, NULL);
> >> insert into t1 values(2, 1, 2, 2);
> >> delete from t1 where c1 = 1;
> >> psql:ticket15010_v3.sql:34: ERROR: update or delete on table "t1_a"
> >> violates foreign key constraint "t1_c1_old_c2_old_fkey1" on table "t1"
> >> DETAIL: Key (c1, c2)=(1, 2) is still referenced from table "t1".
> >>
> >> This error is normal since the line I want to delete is referenced on
> >> the other line.
> >>
> >> If I try to detach the partition, it also gives me an error.
> >>
> >> alter table t1 detach partition t1_a;
> >> psql:ticket15010_v3.sql:36: ERROR: removing partition "t1_a" violates
> >> foreign key constraint "t1_c1_old_c2_old_fkey1"
> >> DETAIL: Key (c1_old, c2_old)=(1, 2) is still referenced from table "t1".
> >>
> >> Sounds good to me too (well, I'd like it to be smarter and find that
> >> the constraint is still good after the detach, but I can understand why
> >> it won't allow it).
> >>
> >> The pg_constraint didn't change of course:
> >>
> >> select * from show_constraints();
> >>         conname         |   t    |  tref  |       coparent
> >> ------------------------+--------+--------+-----------------------
> >>  t1_c1_old_c2_old_fkey  | t1     | t1     |
> >>  t1_c1_old_c2_old_fkey  | t1_a   | t1     | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey  | t1_def | t1     | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey1 | t1     | t1_a   | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey2 |
[BUG] Fix DETACH with FK pointing to a partitioned table fails
Hi,

(patch proposal below)

Consider a table with a FK pointing to a partitioned table:

  CREATE TABLE p (id bigint PRIMARY KEY) PARTITION BY list (id);
  CREATE TABLE p_1 PARTITION OF p FOR VALUES IN (1);

  CREATE TABLE r_1 (
    id bigint PRIMARY KEY,
    p_id bigint NOT NULL,
    FOREIGN KEY (p_id) REFERENCES p (id)
  );

Now, attach this table "r_1" as a partition of another one having the same
FK:

  CREATE TABLE r (
    id bigint PRIMARY KEY,
    p_id bigint NOT NULL,
    FOREIGN KEY (p_id) REFERENCES p (id)
  ) PARTITION BY list (id);

  ALTER TABLE r ATTACH PARTITION r_1 FOR VALUES IN (1);

The old sub-FKs (18289 below) created in this table to enforce the action
triggers on referenced partitions are not deleted when the table becomes a
partition. Because of this, we have additional and useless triggers on the
referenced partitions and we cannot DETACH this partition on the
referencing side anymore:

  => ALTER TABLE r DETACH PARTITION r_1;
  ERROR: could not find ON INSERT check triggers of foreign key constraint 18289

  => SELECT c.oid, conparentid, conrelid::regclass, confrelid::regclass,
            t.tgfoid::regproc
     FROM pg_constraint c
     JOIN pg_trigger t ON t.tgconstraint = c.oid
     WHERE confrelid::regclass = 'p_1'::regclass;

    oid  │ conparentid │ conrelid │ confrelid │         tgfoid
   ──────┼─────────────┼──────────┼───────────┼────────────────────────
   18289 │       18286 │ r_1      │ p_1       │ "RI_FKey_noaction_del"
   18289 │       18286 │ r_1      │ p_1       │ "RI_FKey_noaction_upd"
   18302 │       18299 │ r        │ p_1       │ "RI_FKey_noaction_del"
   18302 │       18299 │ r        │ p_1       │ "RI_FKey_noaction_upd"
  (4 rows)

The legitimate constraint and triggers here are 18302. The old sub-FK
18289, having 18286 as parent, should have gone during the ATTACH
PARTITION.

Please find in attachment a patch dropping these old "sub-FK"s during the
ATTACH PARTITION command and adding a regression test about it. At the very
least, it helps understanding the problem and sketches a possible solution.
Regards,

>From 5fc7997b9f9a17ee5a31f059c18e6c01fd716c04 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais
Date: Wed, 5 Jul 2023 19:19:40 +0200
Subject: [PATCH v1] Remove useless parted-FK constraints when attaching a
 partition

When a FK references a partitioned table, this FK parents as many other
"parted-FK" constraints as the referencing side has partitions. Each of
these sub-constraints enforces only the UPDATE/DELETE triggers on each
referenced partition, but not the check triggers on the referencing
partition, as this is already handled by the parent constraint. These
parted-FKs are half-baked on purpose.

However, when attaching such a standalone table to a partitioned table
having the same FK definition, these sub-constraints were not removed.
This leaves a useless action trigger on each referenced partition, the
legitimate one already checking against the root of the referencing
partition.

Moreover, when the partition is later detached, DetachPartitionFinalize()
looks for all existing FKs on the relation and tries to detach them from
the parent triggers and constraints. But because these parted-FKs are
half-baked, calling GetForeignKeyCheckTriggers() to later detach the check
triggers raises an ERROR:

  ERROR: could not find ON INSERT check triggers of foreign key constraint N.
--- src/backend/commands/tablecmds.c | 57 +-- src/test/regress/expected/foreign_key.out | 22 + src/test/regress/sql/foreign_key.sql | 27 +++ 3 files changed, 103 insertions(+), 3 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index fce5e6f220..7c1aa5d395 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -10566,9 +10566,11 @@ tryAttachPartitionForeignKey(ForeignKeyCacheInfo *fk, Form_pg_constraint parentConstr; HeapTuple partcontup; Form_pg_constraint partConstr; - ScanKeyData key; + ScanKeyData key[3]; SysScanDesc scan; HeapTuple trigtup; + HeapTuple contup; + Relation pg_constraint; Oid insertTriggerOid, updateTriggerOid; @@ -10631,12 +10633,12 @@ tryAttachPartitionForeignKey(ForeignKeyCacheInfo *fk, * in the partition. We identify them because they have our constraint * OID, as well as being on the referenced rel. */ - ScanKeyInit(, + ScanKeyInit(key, Anum_pg_trigger_tgconstraint, BTEqualStrategyNumber, F_OIDEQ, ObjectIdGetDatum(fk->conoid)); scan = systable_beginscan(trigrel, TriggerConstraintIndexId, true, - NULL, 1, ); + NULL, 1, key); while ((trigtup = systable_getnext(scan)) != NULL) { Form_pg_trigger trgform = (Form_pg_trigger) GETSTRUCT(trigtup); @@ -10684,6 +10686,55 @@ tryAttachPartitionForeignKey(ForeignKeyCa
Re: Memory leak from ExecutorState context?
On Fri, 19 May 2023 17:23:56 +0200 Tomas Vondra wrote:

> On 5/19/23 00:27, Melanie Plageman wrote:
> > v10 LGTM.
>
> Thanks!
>
> I've pushed 0002 and 0003, after some general bikeshedding and minor
> rewording (a bit audacious, admittedly).

Thank you Melanie and Tomas for your help, review and commit!
Re: Memory leak from ExecutorState context?
On Wed, 17 May 2023 13:46:35 -0400 Melanie Plageman wrote: > On Wed, May 17, 2023 at 07:10:08PM +0200, Jehan-Guillaume de Rorthais wrote: > > On Tue, 16 May 2023 16:00:52 -0400 > > Melanie Plageman wrote: > > > ... > > > There are some existing indentation issues in these files, but you can > > > leave those or put them in a separate commit. > > > > Reindented in v9. > > > > I put existing indentation issues in a separate commit to keep the actual > > patches clean from distractions. > > It is a matter of opinion, but I tend to prefer the commit with the fix > for the existing indentation issues to be first in the patch set. OK. moved in v10 patch set. > ... > > > > void > > > > ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue, > > > > - BufFile **fileptr) > > > > + BufFile **fileptr, > > > > MemoryContext cxt) { > > > > Note that I changed this to: > > > > ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue, > >BufFile **fileptr, HashJoinTable hashtable) { > > > > As this function must allocate BufFile buffer in spillCxt, I suppose > > we should force it explicitly in the function code. > > > > Plus, having hashtable locally could be useful for later patch to eg. fine > > track allocated memory in spaceUsed. > > I think if you want to pass the hashtable instead of the memory context, > I think you'll need to explain why you still pass the buffile pointer > (because ExecHashJoinSaveTuple() is called for inner and outer batch > files) instead of getting it from the hashtable's arrays of buffile > pointers. Comment added in v10 > > @@ -1310,22 +1311,38 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate) > ... > > Suggested small edit to the second paragraph: > > During the build phase, buffered files are created for inner batches. > Each batch's buffered file is closed (and its buffer freed) after the > batch is loaded into memory during the outer side scan. 
Therefore, it > is necessary to allocate the batch file buffer in a memory context > which outlives the batch itself. Changed. > I'd also mention the reason for passing the buffile pointer above the > function. Added. Regards, >From 25ef004b31e07956a018dd08170ee1588b7719b8 Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais Date: Wed, 17 May 2023 19:01:06 +0200 Subject: [PATCH v10 1/3] Run pgindent on nodeHash.c and nodeHashjoin.c --- src/backend/executor/nodeHash.c | 6 +++--- src/backend/executor/nodeHashjoin.c | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c index 5fd1c5553b..ac3eb32d97 100644 --- a/src/backend/executor/nodeHash.c +++ b/src/backend/executor/nodeHash.c @@ -1327,7 +1327,7 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable) else { size_t tuple_size = -MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len); + MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len); /* It belongs in a later batch. 
*/ hashtable->batches[batchno].estimated_size += tuple_size; @@ -1369,7 +1369,7 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable) for (i = 1; i < old_nbatch; ++i) { ParallelHashJoinBatch *shared = - NthParallelHashJoinBatch(old_batches, i); + NthParallelHashJoinBatch(old_batches, i); old_inner_tuples[i] = sts_attach(ParallelHashJoinBatchInner(shared), ParallelWorkerNumber + 1, @@ -3317,7 +3317,7 @@ ExecHashTableDetachBatch(HashJoinTable hashtable) while (DsaPointerIsValid(batch->chunks)) { HashMemoryChunk chunk = -dsa_get_address(hashtable->area, batch->chunks); + dsa_get_address(hashtable->area, batch->chunks); dsa_pointer next = chunk->next.shared; dsa_free(hashtable->area, batch->chunks); diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c index 0a3f32f731..b29a8ff48b 100644 --- a/src/backend/executor/nodeHashjoin.c +++ b/src/backend/executor/nodeHashjoin.c @@ -1170,7 +1170,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate) { SharedTuplestoreAccessor *inner_tuples; Barrier*batch_barrier = - >batches[batchno].shared->batch_barrier; +>batches[batchno].shared->batch_barrier; switch (BarrierAttach(batch_barrier)) { @@ -1558,7 +1558,7 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *pcxt) { int plan_node_id = state->js.ps.plan->plan_node_id; ParallelHashJoinState *pstate = - shm_toc_lookup(pcxt->toc, plan_node_id, false); + shm_toc_lookup(
Re: Memory leak from ExecutorState context?
On Tue, 16 May 2023 16:00:52 -0400 Melanie Plageman wrote: > On Tue, May 16, 2023 at 04:00:51PM +0200, Jehan-Guillaume de Rorthais wrote: > > > From e5ecd466172b7bae2f1be294c1a5e70ce2b43ed8 Mon Sep 17 00:00:00 2001 > > From: Melanie Plageman > > Date: Thu, 30 Apr 2020 07:16:28 -0700 > > Subject: [PATCH v8 1/3] Describe hash join implementation > > > > This is just a draft to spark conversation on what a good comment might > > be like in this file on how the hybrid hash join algorithm is > > implemented in Postgres. I'm pretty sure this is the accepted term for > > this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join > > I recommend changing the commit message to something like this: > > Describe hash join implementation > > Add details to the block comment in nodeHashjoin.c describing the > hybrid hash join algorithm at a high level. > > Author: Melanie Plageman > Author: Jehan-Guillaume de Rorthais > Reviewed-by: Tomas Vondra > Discussion: https://postgr.es/m/20230516160051.4267a800%40karst Done, but assigning myself as a reviewer as I don't remember having authored anything in this but the reference to the Hybrid hash page, which is questionable according to Tomas :) > > diff --git a/src/backend/executor/nodeHashjoin.c > > b/src/backend/executor/nodeHashjoin.c index 0a3f32f731..93a78f6f74 100644 > > --- a/src/backend/executor/nodeHashjoin.c > > ... > > + * TODO: should we discuss that tuples can only spill forward? > > I would just cut this for now since we haven't started on an agreed-upon > wording. Removed in v9. > > From 309ad354b7a9e4dfa01b2985bd883829f5e0eba0 Mon Sep 17 00:00:00 2001 > > From: Jehan-Guillaume de Rorthais > > Date: Tue, 16 May 2023 15:42:14 +0200 > > Subject: [PATCH v8 2/3] Allocate hash batches related BufFile in a dedicated > > context > > Here is a draft commit message for the second patch: > > ... Thanks. Adopted with some minor rewording... hopefully it's OK. 
> I recommend editing and adding links to the various discussions on this > topic from your research. Done in v9. > As for the patch itself, I noticed that there are few things needing > pgindenting. > > I usually do the following to run pgindent (in case you haven't done > this recently). > > ... Thank you for your recipe. > There are some existing indentation issues in these files, but you can > leave those or put them in a separate commit. Reindented in v9. I put existing indentation issues in a separate commit to keep the actual patches clean from distractions. > ... > Add a period at the end of this comment. > > > + /* > > +* Use hash join spill memory context to allocate accessors and > > their > > +* buffers > > +*/ Fixed in v9. > Add a period at the end of this comment. > > > + /* Use hash join spill memory context to allocate accessors */ Fixed in v9. > > diff --git a/src/backend/executor/nodeHashjoin.c > > b/src/backend/executor/nodeHashjoin.c @@ -1313,21 +1314,30 @@ > > ExecParallelHashJoinNewBatch(HashJoinState *hjstate) > > * The data recorded in the file for each tuple is its hash value, > > * then the tuple in MinimalTuple format. > > * > > - * Note: it is important always to call this in the regular executor > > - * context, not in a shorter-lived context; else the temp file buffers > > - * will get messed up. > > + * If this is the first write to the batch file, this function first > > + * create it. The associated BufFile buffer is allocated in the given > > + * context. It is important to always give the HashSpillContext > > + * context. First to avoid a shorter-lived context, else the temp file > > + * buffers will get messed up. Second, for a better accounting of the > > + * spilling memory consumption. > > + * > > */ > > Here is my suggested wording fot this block comment: > > The batch file is lazily created. If this is the first tuple written to > this batch, the batch file is created and its buffer is allocated in the > given context. 
It is important to pass in a memory context which will > live for the entirety of the lifespan of the batch. Reworded. The context must actually survive the batch itself, not just live during the lifespan of the batch. > > void > > ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue, > > - BufFile **fileptr) > > + BufFile **fileptr, MemoryContext > > cxt) { Note that I changed this to: ExecHashJoinSaveTuple(MinimalTuple tuple, u
Re: Memory leak from ExecutorState context?
Hi,

On Tue, 16 May 2023 12:01:51 +0200 Tomas Vondra wrote:

> On 5/16/23 00:15, Jehan-Guillaume de Rorthais wrote:
> > On Sat, 13 May 2023 23:47:53 +0200
> > Tomas Vondra wrote:
> ...
> >> I'm not really sure about calling this "hybrid hash-join". What does
> >> it even mean? The new comments simply describe the existing batching,
> >> and how we increment number of batches, etc.
> >
> > Unless I'm wrong, I believe it comes from the "Hybrid hash join
> > algorithm" as described here (+ see pdf in this page ref):
> >
> > https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
> >
> > I added the ref in the v7 documentation to avoid future confusion, is
> > it ok?
>
> Ah, I see. I'd just leave out the "hybrid" entirely. We've lived without
> it until now, we know what the implementation does ...

I changed the title, but kept the reference. There are still two other uses
of "hybrid hash join algorithm" in function and code comments. Keeping the
ref in this header doesn't cost much and helps newcomers.

> >> 0002
> >> ...
> >> The modified comment in hashjoin.h has some alignment issues.
> >
> > I see no alignment issue. I suppose what bothers you might be the
> > spaces before spillCxt and batchCxt to show they are children of
> > hashCxt? Should I remove them?
>
> It didn't occur to me this is intentional to show the contexts are
> children of hashCxt, so maybe it's not a good way to document that. I'd
> remove it, and if you think it's something worth mentioning, I'd add an
> explicit comment.

Changed.

Thanks,

>From e5ecd466172b7bae2f1be294c1a5e70ce2b43ed8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Thu, 30 Apr 2020 07:16:28 -0700
Subject: [PATCH v8 1/3] Describe hash join implementation

This is just a draft to spark conversation on what a good comment might be
like in this file on how the hybrid hash join algorithm is implemented in
Postgres.
I'm pretty sure this is the accepted term for this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join --- src/backend/executor/nodeHashjoin.c | 41 + 1 file changed, 41 insertions(+) diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c index 0a3f32f731..93a78f6f74 100644 --- a/src/backend/executor/nodeHashjoin.c +++ b/src/backend/executor/nodeHashjoin.c @@ -10,6 +10,47 @@ * IDENTIFICATION * src/backend/executor/nodeHashjoin.c * + * HASH JOIN + * + * This is based on the "hybrid hash join" algorithm described shortly in the + * following page, and in length in the pdf in page's notes: + * + * https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join + * + * If the inner side tuples of a hash join do not fit in memory, the hash join + * can be executed in multiple batches. + * + * If the statistics on the inner side relation are accurate, planner chooses a + * multi-batch strategy and estimates the number of batches. + * + * The query executor measures the real size of the hashtable and increases the + * number of batches if the hashtable grows too large. + * + * The number of batches is always a power of two, so an increase in the number + * of batches doubles it. + * + * Serial hash join measures batch size lazily -- waiting until it is loading a + * batch to determine if it will fit in memory. While inserting tuples into the + * hashtable, serial hash join will, if that tuple were to exceed work_mem, + * dump out the hashtable and reassign them either to other batch files or the + * current batch resident in the hashtable. + * + * Parallel hash join, on the other hand, completes all changes to the number + * of batches during the build phase. If it increases the number of batches, it + * dumps out all the tuples from all batches and reassigns them to entirely new + * batch files. Then it checks every batch to ensure it will fit in the space + * budget for the query. 
+ * + * In both parallel and serial hash join, the executor currently makes a best + * effort. If a particular batch will not fit in memory, it tries doubling the + * number of batches. If after a batch increase, there is a batch which + * retained all or none of its tuples, the executor disables growth in the + * number of batches globally. After growth is disabled, all batches that would + * have previously triggered an increase in the number of batches instead + * exceed the space allowed. + * + * TODO: should we discuss that tuples can only spill forward? + * * PARALLELISM * * Hash joins can participate in parallel query execution in several ways. A -- 2.40.1 >From 309ad354b7a9e4dfa01b2985bd883829f5e0eba0 Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais Date: Tue, 16 May 2023 15:42:14 +0200 Subject: [PATCH v8 2/3] Allocate hash batches related BufFile in
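The batching scheme described in the patch's block comment can be
illustrated with a small standalone sketch. This is Python for illustration
only (not PostgreSQL code, and the function names are invented): batch
numbers come from the low-order bits of a tuple's hash value, and because
nbatch is always a power of two, doubling it can only move a tuple to the
same or a higher-numbered batch, which is why tuples only "spill forward"
as the TODO in the comment asks about.

```python
def batch_no(hashvalue: int, nbatch: int) -> int:
    # nbatch is always a power of two, so the batch number is simply
    # the low-order bits of the hash value.
    return hashvalue & (nbatch - 1)

def batch_no_after_doubling(hashvalue: int, old_nbatch: int) -> int:
    # After doubling, one extra hash bit participates in the batch number.
    return batch_no(hashvalue, old_nbatch * 2)

# When the number of batches doubles, a tuple either stays in its current
# batch or moves to (current + old_nbatch) -- never to a lower batch.
for h in range(1024):
    old = batch_no(h, 4)
    new = batch_no_after_doubling(h, 4)
    assert new in (old, old + 4)  # tuples only ever spill forward
```

This also shows why a tuple already written to batch file N never needs to
be moved back to a batch the executor has finished processing.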
Re: Memory leak from ExecutorState context?
Hi,

Thanks for your review!

On Sat, 13 May 2023 23:47:53 +0200 Tomas Vondra wrote:

> Thanks for the patches. A couple mostly minor comments, to complement
> Melanie's review:
>
> 0001
>
> I'm not really sure about calling this "hybrid hash-join". What does it
> even mean? The new comments simply describe the existing batching, and
> how we increment number of batches, etc.

Unless I'm wrong, I believe it comes from the "Hybrid hash join algorithm"
as described here (+ see pdf in this page ref):

https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join

I added the ref in the v7 documentation to avoid future confusion, is it
ok?

> 0002
>
> I wouldn't call the new ExecHashJoinSaveTuple parameter spillcxt but
> just something generic (e.g. cxt). Yes, we're passing spillCxt, but from
> the function's POV it's just a pointer.

Changed in v7.

> I also wouldn't move the ExecHashJoinSaveTuple comment inside - it just
> needs to be reworded that we're expecting the context to be with the
> right lifespan. The function comment is the right place to document such
> expectations, people are less likely to read the function body.

Moved and reworded in v7.

> The modified comment in hashjoin.h has some alignment issues.

I see no alignment issue. I suppose what bothers you might be the spaces
before spillCxt and batchCxt to show they are children of hashCxt? Should I
remove them?

Regards,

>From 997a82a0ee9eda1774b5210401b2d9bf281318da Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Thu, 30 Apr 2020 07:16:28 -0700
Subject: [PATCH 1/3] Describe hybrid hash join implementation

This is just a draft to spark conversation on what a good comment might be
like in this file on how the hybrid hash join algorithm is implemented in
Postgres.
I'm pretty sure this is the accepted term for this algorithm
https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
---
 src/backend/executor/nodeHashjoin.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 0a3f32f731..a39164c15d 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -10,6 +10,47 @@
  * IDENTIFICATION
  *	  src/backend/executor/nodeHashjoin.c
  *
+ * HYBRID HASH JOIN
+ *
+ * This is based on the hybrid hash join algorithm described briefly in the
+ * following page, and at length in the PDF in the page's notes:
+ *
+ * https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
+ *
+ * If the inner side tuples of a hash join do not fit in memory, the hash join
+ * can be executed in multiple batches.
+ *
+ * If the statistics on the inner side relation are accurate, the planner
+ * chooses a multi-batch strategy and estimates the number of batches.
+ *
+ * The query executor measures the real size of the hashtable and increases the
+ * number of batches if the hashtable grows too large.
+ *
+ * The number of batches is always a power of two, so an increase in the number
+ * of batches doubles it.
+ *
+ * Serial hash join measures batch size lazily -- waiting until it is loading a
+ * batch to determine if it will fit in memory. While inserting tuples into the
+ * hashtable, serial hash join will, if a tuple were to exceed work_mem, dump
+ * out the hashtable's tuples and reassign them either to other batch files or
+ * to the current batch resident in the hashtable.
+ *
+ * Parallel hash join, on the other hand, completes all changes to the number
+ * of batches during the build phase. If it increases the number of batches, it
+ * dumps out all the tuples from all batches and reassigns them to entirely new
+ * batch files. Then it checks every batch to ensure it will fit in the space
+ * budget for the query.
+ *
+ * In both parallel and serial hash join, the executor currently makes a best
+ * effort. If a particular batch will not fit in memory, it tries doubling the
+ * number of batches. If after a batch increase, there is a batch which
+ * retained all or none of its tuples, the executor disables growth in the
+ * number of batches globally. After growth is disabled, all batches that would
+ * have previously triggered an increase in the number of batches instead
+ * exceed the space allowed.
+ *
+ * TODO: should we discuss that tuples can only spill forward?
+ *
  * PARALLELISM
  *
  * Hash joins can participate in parallel query execution in several ways. A
-- 
2.40.1

>From 52e989d5eb6405e86e0d460dffbfaca292bc4274 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais
Date: Mon, 27 Mar 2023 15:54:39 +0200
Subject: [PATCH 2/3] Allocate hash batches related BufFile in a dedicated
 context
---
 src/backend/executor/nodeHash.c           | 43 ---
 src/backend/executor/nodeHashjoin.c       | 22
 src/backend/utils/sort/sharedtuplestore.c |  8 +
 src/include/executor
Re: Memory leak from ExecutorState context?
On Sun, 14 May 2023 00:10:00 +0200 Tomas Vondra wrote:
> On 5/12/23 23:36, Melanie Plageman wrote:
> > Thanks for continuing to work on this.
> >
> > Are you planning to modify what is displayed for memory usage in
> > EXPLAIN ANALYZE?

Yes, I have already started working on this. Tracking spilling memory in spaceUsed/spacePeak changes the current behavior of the serialized HJ because it increases the number of batches much faster, so this is a no-go for v16.

I'll try to accumulate the total allocated (used + not used) spill context memory in instrumentation. This is gross, but it avoids tracking the spilling memory in its own structure entry.

> We could do that, but we can do that separately - it's a separate and
> independent improvement, I think.

+1

> Also, do you have a proposal how to change the explain output? In
> principle we already have the number of batches, so people can calculate
> the "peak" amount of memory (assuming they realize what it means).

We could add the batch memory consumption next to the number of batches. E.g.:

  Buckets: 4096 (originally 4096)
  Batches: 32768 (originally 8192) using 256MB
  Memory Usage: 192kB

> I think the main problem with adding this info to EXPLAIN is that I'm
> not sure it's very useful in practice. I've only really heard about this
> memory explosion issue when the query dies with OOM or takes forever,
> but EXPLAIN ANALYZE requires the query to complete.

It could be useful to help admins tuning their queries realize that the current number of batches is consuming much more memory than the join itself. This could help them fix the issue before an OOM happens.

Regards,
Re: Unlinking Parallel Hash Join inner batch files sooner
Hi,

Thanks for working on this!

On Wed, 10 May 2023 15:11:20 +1200 Thomas Munro wrote:
> One complaint about PHJ is that it can, in rare cases, use a
> surprising amount of temporary disk space where non-parallel HJ would
> not. When it decides that it needs to double the number of batches to
> try to fit each inner batch into memory, and then again and again
> depending on your level of bad luck, it leaves behind all the earlier
> generations of inner batch files to be cleaned up at the end of the
> query. That's stupid. Here's a patch to unlink them sooner, as a
> small improvement.

This patch can indeed save a decent amount of temporary disk space. Considering its complexity is (currently?) quite low, it's worth it.

> The reason I didn't do this earlier is that sharedtuplestore.c
> continues the pre-existing tradition where each parallel process
> counts what it writes against its own temp_file_limit. At the time I
> thought I'd need to have one process unlink all the files, but if a
> process were to unlink files that it didn't create, that accounting
> system would break. Without some new kind of shared temp_file_limit
> mechanism that doesn't currently exist, per-process counters could go
> negative, creating free money. In the attached patch, I realised
> something that I'd missed before: there is a safe point for each
> backend to unlink just the files that it created, and there is no way
> for a process that created files not to reach that point.

Indeed. For what it's worth, from my fresh and inexperienced understanding of the parallel mechanism, waiting for all workers to reach WAIT_EVENT_HASH_GROW_BATCHES_REPARTITION, after re-dispatching old batches into new ones, seems like a safe place to instruct each worker to clean up its old temp files.

> Here's an example query that tries 8, 16 and then 32 batches on my
> machine, because reltuples is clobbered with a bogus value.

Nice!

Regards,
Re: Memory leak from ExecutorState context?
Thank you for your review! On Mon, 8 May 2023 11:56:48 -0400 Melanie Plageman wrote: > ... > > 4. accessor->read_buffer is currently allocated in accessor->context as > > well. > > > >This buffer holds tuple read from the fileset. This is still a buffer, > > but not related to any file anymore... > > > > Because of 3 and 4, and the gross context granularity of SharedTuplestore > > related-code, I'm now wondering if "fileCxt" shouldn't be renamed "bufCxt". > > I think bufCxt is a potentially confusing name. The context contains > pointers to the batch files and saying those are related to buffers is > confusing. It also sounds like it could include any kind of buffer > related to the hashtable or hash join. > > Perhaps we could call this memory context the "spillCxt"--indicating it > is for the memory required for spilling to permanent storage while > executing hash joins. "Spilling" seems fair and a large enough net to grab everything around temp files and accessing them. > I discuss this more in my code review below. 
> > > From c5ed2ae2c2749af4f5058b012dc5e8a9e1529127 Mon Sep 17 00:00:00 2001 > > From: Jehan-Guillaume de Rorthais > > Date: Mon, 27 Mar 2023 15:54:39 +0200 > > Subject: [PATCH 2/3] Allocate hash batches related BufFile in a dedicated > > context > > > diff --git a/src/backend/executor/nodeHash.c > > b/src/backend/executor/nodeHash.c index 5fd1c5553b..a4fbf29301 100644 > > --- a/src/backend/executor/nodeHash.c > > +++ b/src/backend/executor/nodeHash.c > > @@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List > > *hashOperators, List *hashCollations, > > if (nbatch > 1 && hashtable->parallel_state == NULL) > > { > > + MemoryContext oldctx; > > + > > /* > > * allocate and initialize the file arrays in hashCxt (not > > needed for > > * parallel case which uses shared tuplestores instead of > > raw files) */ > > + oldctx = MemoryContextSwitchTo(hashtable->fileCxt); > > + > > hashtable->innerBatchFile = palloc0_array(BufFile *, > > nbatch); hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch); > > /* The files will not be opened until needed... */ > > /* ... but make sure we have temp tablespaces established > > for them */ > > I haven't studied it closely, but I wonder if we shouldn't switch back > into the oldctx before calling PrepareTempTablespaces(). > PrepareTempTablespaces() is explicit about what memory context it is > allocating in, however, I'm not sure it makes sense to call it in the > fileCxt. If you have a reason, you should add a comment and do the same > in ExecHashIncreaseNumBatches().

I had no reason. I caught it in ExecHashIncreaseNumBatches() while reviewing myself, but not here. Line moved in v6.
> > @@ -934,13 +943,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable) > >hashtable, nbatch, hashtable->spaceUsed); > > #endif > > > > - oldcxt = MemoryContextSwitchTo(hashtable->hashCxt); > > - > > if (hashtable->innerBatchFile == NULL) > > { > > + MemoryContext oldcxt = > > MemoryContextSwitchTo(hashtable->fileCxt); + > > /* we had no file arrays before */ > > hashtable->innerBatchFile = palloc0_array(BufFile *, > > nbatch); hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch); > > + > > + MemoryContextSwitchTo(oldcxt); > > + > > /* time to establish the temp tablespaces, too */ > > PrepareTempTablespaces(); > > } > > @@ -951,8 +963,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable) > > I don't see a reason to call repalloc0_array() in a different context > than the initial palloc0_array().

Unless I'm wrong, wrapping the whole if/else blocks in the memory context hashtable->fileCxt seemed useless, as repalloc() actually reallocates in the context where the original allocation occurred. So I only wrapped the palloc() calls.

> > diff --git a/src/backend/executor/nodeHashjoin.c > > ... > > @@ -1308,21 +1310,27 @@ ExecParallelHashJoinNewBatch(HashJoinState > > *hjstate) > > * The data recorded in the file for each tuple is its hash value, > > It doesn't sound accurate to me to say that it should be called *in* the > HashBatchFiles context. We now switch into that context, so you can > probably remove this comment and add a comment above the switch into the > filecxt which explains that the temp f
Re: Memory leak from ExecutorState context?
Hi, On Fri, 21 Apr 2023 16:44:48 -0400 Melanie Plageman wrote: > On Fri, Apr 7, 2023 at 8:01 PM Jehan-Guillaume de Rorthais > wrote: > > > > On Fri, 31 Mar 2023 14:06:11 +0200 > > Jehan-Guillaume de Rorthais wrote: > > > > > > [...] > > > > >> Hmmm, not sure is WARNING is a good approach, but I don't have a > > > > >> better idea at the moment. > > > > > > > > > > I stepped it down to NOTICE and added some more infos. > > > > > > > > > > [...] > > > > > NOTICE: Growing number of hash batch to 32768 is exhausting allowed > > > > > memory (137494656 > 2097152) > > > > [...] > > > > > > > > OK, although NOTICE that may actually make it less useful - the default > > > > level is WARNING, and regular users are unable to change the level. So > > > > very few people will actually see these messages. > > [...] > > > Anyway, maybe this should be added in the light of next patch, balancing > > > between increasing batches and allowed memory. The WARNING/LOG/NOTICE > > > message could appears when we actually break memory rules because of some > > > bad HJ situation. > > > > So I did some more minor editions to the memory context patch and start > > working on the balancing memory patch. Please, find in attachment the v4 > > patch set: > > > > * 0001-v4-Describe-hybrid-hash-join-implementation.patch: > > Adds documentation written by Melanie few years ago > > * 0002-v4-Allocate-hash-batches-related-BufFile-in-a-dedicated.patch: > > The batches' BufFile dedicated memory context patch > > This is only a review of the code in patch 0002 (the patch to use a more > granular memory context for multi-batch hash join batch files). I have > not reviewed the changes to comments in detail either. Ok. > I think the biggest change that is needed is to implement this memory > context usage for parallel hash join. Indeed. > To implement a file buffer memory context for multi-patch parallel hash > join [...] Thank you for your review and pointers! 
After (some days off and then) studying the parallel code, I ended up with this:

1. As explained by Melanie, the v5 patch sets accessor->context to fileCxt.

2. BufFile buffers were wrongly allocated in the ExecutorState context for accessor->read|write_file, from sts_puttuple and sts_parallel_scan_next, when they first need to work with the accessor FileSet. The v5 patch now allocates them in accessor->context, directly in sts_puttuple and sts_parallel_scan_next. This avoids wrapping each single call of these functions inside MemoryContextSwitchTo calls. I suppose this is correct as the comment about accessor->context says "/* Memory context for **buffers**. */" in struct SharedTuplestoreAccessor.

3. accessor->write_chunk is currently already allocated in accessor->context. In consequence, this buffer is now allocated in the fileCxt instead of the hashCxt context. This is a bit far-fetched, but I suppose this is OK as it acts as a second-level buffer, on top of the BufFile.

4. accessor->read_buffer is currently allocated in accessor->context as well. This buffer holds tuples read from the fileset. This is still a buffer, but not related to any file anymore...

Because of 3 and 4, and the gross context granularity of the SharedTuplestore-related code, I'm now wondering if "fileCxt" shouldn't be renamed "bufCxt".

Regards,

>From acaa94fab35ab966366a29a092780e54a68de0f7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Thu, 30 Apr 2020 07:16:28 -0700
Subject: [PATCH 1/3] Describe hybrid hash join implementation

This is just a draft to spark conversation on what a good comment might be like in this file on how the hybrid hash join algorithm is implemented in Postgres.
I'm pretty sure this is the accepted term for this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join --- src/backend/executor/nodeHashjoin.c | 36 + 1 file changed, 36 insertions(+) diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c index 0a3f32f731..920d1831c2 100644 --- a/src/backend/executor/nodeHashjoin.c +++ b/src/backend/executor/nodeHashjoin.c @@ -10,6 +10,42 @@ * IDENTIFICATION * src/backend/executor/nodeHashjoin.c * + * HYBRID HASH JOIN + * + * If the inner side tuples of a hash join do not fit in memory, the hash join + * can be executed in multiple batches. + * + * If the statistics on the inner side relation are accurate, planner chooses a + * multi-batch strategy and estimates the number of batches. + * + * The query executor measures the real siz
Re: Commitfest 2023-03 starting tomorrow!
Hi, After catching up with this thread, where pending bugs are listed and discussed, I wonder if the current patches trying to lower the HashJoin memory explosion[1] could be added to the "Older bugs affecting stable branches" list of https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items as I think they deserve some discussion/triage for v16? The patch as it stands is not invasive. As I wrote previously[2], if we couldn't handle such situation better in v16, and if this patch is not backpatch-able in a minor release, then we will keep living another year, maybe more, with this bad memory behavior. Thank you, [1] https://www.postgresql.org/message-id/20230408020119.32a0841b%40karst [2] https://www.postgresql.org/message-id/20230320151234.38b2235e%40karst
[BUG] FK broken after DETACHing referencing part
Hi,

Considering two partitioned tables with a FK between them:

  DROP TABLE IF EXISTS p, c, c_1 CASCADE;

  -- Parent table + partition + data
  CREATE TABLE p (
    id bigint PRIMARY KEY
  ) PARTITION BY list (id);
  CREATE TABLE p_1 PARTITION OF p FOR VALUES IN (1);
  INSERT INTO p VALUES (1);

  -- Child table + partition + data
  CREATE TABLE c (
    id   bigint PRIMARY KEY,
    p_id bigint NOT NULL,
    FOREIGN KEY (p_id) REFERENCES p (id)
  ) PARTITION BY list (id);
  CREATE TABLE c_1 PARTITION OF c FOR VALUES IN (1);
  INSERT INTO c VALUES (1,1);

After DETACHing the "c_1" partition, the current implementation makes sure it keeps the FK inherited from its previous top table "c":

  ALTER TABLE c DETACH PARTITION c_1;
  \d c_1
  -- outputs:
  -- [...]
  -- Foreign-key constraints:
  --   "c_p_id_fkey" FOREIGN KEY (p_id) REFERENCES p(id)

However, because the referenced side is partitioned, this FK is half-baked, with only the referencing (insert/update on c_1) side enforced, but not the referenced side (update/delete on p):

  INSERT INTO c_1 VALUES (2,2); -- fails as EXPECTED
  -- ERROR: insert or update on table "c_1" violates foreign key [...]

  DELETE FROM p; -- should actually fail
  -- DELETE 1

  SELECT * FROM c_1;
  --  id | p_id
  -- ----+------
  --   1 |    1
  -- (1 row)

  SELECT * FROM p;
  --  id
  -- ----
  -- (0 rows)

When detaching "c_1", the current implementation adds two triggers to enforce that UPDATE/DELETE on "p" are restricted if "c_1" keeps referencing the related rows... But it forgets to add them on the partitions of "p" (here "p_1"), where the triggers should actually fire.
To make it clear, the FK c_1 -> p constraint and triggers after DETACHing c_1 are:

  SELECT c.oid AS conid, c.conname, c.conparentid AS conparent,
         r2.relname AS pkrel,
         t.tgrelid::regclass AS tgrel,
         p.proname
  FROM pg_constraint c
  JOIN pg_class r ON c.conrelid = r.oid
  JOIN pg_class r2 ON c.confrelid = r2.oid
  JOIN pg_trigger t ON t.tgconstraint = c.oid
  JOIN pg_proc p ON p.oid = t.tgfoid
  WHERE r.relname = 'c_1' AND r2.relname LIKE 'p%'
  ORDER BY r.relname, c.conname, t.tgrelid::regclass::text, p.proname;

  --  conid |   conname   | conparent | pkrel | tgrel |       proname
  -- -------+-------------+-----------+-------+-------+----------------------
  --  18454 | c_p_id_fkey |         0 | p     | c_1   | RI_FKey_check_ins
  --  18454 | c_p_id_fkey |         0 | p     | c_1   | RI_FKey_check_upd
  --  18454 | c_p_id_fkey |         0 | p     | p     | RI_FKey_noaction_del
  --  18454 | c_p_id_fkey |         0 | p     | p     | RI_FKey_noaction_upd

Where they should be:

  --  conid |   conname    | conparent | pkrel | tgrel |       proname
  -- -------+--------------+-----------+-------+-------+----------------------
  --  18454 | c_p_id_fkey  |         0 | p     | c_1   | RI_FKey_check_ins
  --  18454 | c_p_id_fkey  |         0 | p     | c_1   | RI_FKey_check_upd
  --  18454 | c_p_id_fkey  |         0 | p     | p     | RI_FKey_noaction_del
  --  18454 | c_p_id_fkey  |         0 | p     | p     | RI_FKey_noaction_upd
  --  NEW!! | c_p_id_fkey1 |     18454 | p_1   | p_1   | RI_FKey_noaction_del
  --  NEW!! | c_p_id_fkey1 |     18454 | p_1   | p_1   | RI_FKey_noaction_upd

I poked around DetachPartitionFinalize() to try to find a way to fix this, but it looks like it would duplicate a bunch of code from other code paths (eg. from CloneFkReferenced).

Instead of tweaking the existing FK, keeping the old constraint name (wouldn't "c_1_p_id_fkey" be better after detach?) and duplicating some code around, what about cleaning up the FK constraints from the detached table and recreating a cleaner one using the known code path ATAddForeignKeyConstraint()?

Thanks for reading me down to here!

++
Re: OOM in hash join
On Fri, 14 Apr 2023 13:21:05 +0200 Matthias van de Meent wrote:
> On Fri, 14 Apr 2023 at 12:59, Konstantin Knizhnik wrote:
> >
> > Hi hackers,
> >
> > Too small value of work_mem cause memory overflow in parallel hash join
> > because of too much number batches.
> > There is the plan:
> > [...]
> > There is still some gap between size reported by memory context sump and
> > actual size of backend.
> > But is seems to be obvious, that trying to fit in work_mem
> > sharedtuplestore creates so much batches, that them consume much more
> > memory than work_mem.

Indeed. The memory consumed by batches is not accounted for, and the consumption reported in EXPLAIN ANALYZE is wrong.

Would you be able to test the latest patchset posted [1]? It does not fix the work_mem overflow, but it helps to keep the number of batches balanced and acceptable. Any feedback, comment or review would be useful.

[1] https://www.postgresql.org/message-id/flat/20230408020119.32a0841b%40karst#616c1f41fcc10e8f89d41e8e5693618c

Regards,
Re: Memory leak from ExecutorState context?
On Sat, 8 Apr 2023 02:01:19 +0200 Jehan-Guillaume de Rorthais wrote:
> On Fri, 31 Mar 2023 14:06:11 +0200
> Jehan-Guillaume de Rorthais wrote:
> > [...]
>
> After rebasing Tomas' memory balancing patch, I did some memory measures
> to answer some of my questions. Please, find in attachment the resulting
> charts "HJ-HEAD.png" and "balancing-v3.png" to compare memory consumption
> between HEAD and Tomas' patch. They shows an alternance of numbers
> before/after calling ExecHashIncreaseNumBatches (see the debug patch). I
> didn't try to find the exact last total peak of memory consumption during the
> join phase and before all the BufFiles are destroyed. So the last number
> might be underestimated.

I did some more analysis about the total memory consumption in the filecxt of HEAD and of the v3 and v4 patches. My previous debug numbers only print memory metrics during batch increments or hash table destruction. That means:

* for HEAD: we miss the batches consumed during the outer scan
* for v3: adds twice nbatch in spaceUsed, which is a rough estimation
* for v4: batches are tracked in spaceUsed, so they are reflected in spacePeak

Using a breakpoint in ExecHashJoinSaveTuple to print "filecxt->mem_allocated" from there, here are the maximum allocated memory for the BufFile context for each branch:

          batches  max bufFiles total  spaceAllowed  rise
  HEAD    16384    199966960           ~194MB
  v3       4096     65419456           ~78MB
  v4(*3)   2048     34273280           48MB          nbatch*sizeof(PGAlignedBlock)*3
  v4(*4)   1024     17170160           60.6MB        nbatch*sizeof(PGAlignedBlock)*4
  v4(*5)   2048     34273280           42.5MB        nbatch*sizeof(PGAlignedBlock)*5

It seems accounting for BufFiles in spaceUsed allows a better memory balancing and management. The precise factor to raise spaceAllowed is yet to be defined. *3 or *4 look good, but this is based on a single artificial test case.

Also, note that HEAD is currently reporting ~4MB of memory usage. This is far from the reality.
So even if we don't commit the balancing memory patch in v16, maybe we could account for filecxt in spaceUsed as a bugfix? Regards,
Re: Memory leak from ExecutorState context?
On Tue, 28 Mar 2023 17:25:49 +0200 Tomas Vondra wrote:
...
> * Note that BufFile structs are allocated with palloc(), and therefore
> * will go away automatically at query/transaction end. Since the underlying
> * virtual Files are made with OpenTemporaryFile, all resources for
> * the file are certain to be cleaned up even if processing is aborted
> * by ereport(ERROR). The data structures required are made in the
> * palloc context that was current when the BufFile was created, and
> * any external resources such as temp files are owned by the ResourceOwner
> * that was current at that time.
>
> which I take as confirmation that it's legal to allocate BufFile in any
> memory context, and that cleanup is handled by the cache in fd.c.

OK. I just over-interpreted the comments and was overly prudent.

> [...]
> >> Hmmm, not sure is WARNING is a good approach, but I don't have a better
> >> idea at the moment.
> >
> > I stepped it down to NOTICE and added some more infos.
> >
> > [...]
> > NOTICE: Growing number of hash batch to 32768 is exhausting allowed
> > memory (137494656 > 2097152)
> [...]
>
> OK, although NOTICE that may actually make it less useful - the default
> level is WARNING, and regular users are unable to change the level. So
> very few people will actually see these messages.

The main purpose of NOTICE was to notify users/devs, as client_min_messages=notice by default.

But while playing with it, I wonder if the message is in the right place anyway. It is probably pessimistic, as it shows memory consumption when increasing the number of batches, but before the actual BufFiles are (lazily) allocated. The message should probably pop up a bit sooner, with better numbers.

Anyway, maybe this should be added in the light of the next patch, balancing between increasing batches and allowed memory. The WARNING/LOG/NOTICE message could appear when we actually break memory rules because of some bad HJ situation.
Another way to expose the bad memory consumption would be to add memory infos to the HJ node in the explain output, or maybe collect some min/max/mean for pg_stat_statement, but I suspect tracking memory spikes by query is another challenge altogether... In the meantime, find in attachment v3 of the patch with debug and NOTICE messages removed. Given the same plan from my previous email, here is the memory contexts close to the query end: ExecutorState: 32768 total in 3 blocks; 15512 free (6 chunks); 17256 used HashTableContext: 8192 total in 1 blocks; 7720 free (0 chunks); 472 used HashBatchFiles: 28605680 total in 3256 blocks; 970128 free (38180 chunks); 27635552 used HashBatchContext: 960544 total in 23 blocks; 7928 free (0 chunks); 952616 used Regards, >From 6814994fa0576a8ba6458412ac5f944135fc3813 Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais Date: Mon, 27 Mar 2023 15:54:39 +0200 Subject: [PATCH] Allocate hash batches related BufFile in a dedicated context --- src/backend/executor/nodeHash.c | 27 ++- src/backend/executor/nodeHashjoin.c | 18 +- src/include/executor/hashjoin.h | 15 --- src/include/executor/nodeHashjoin.h | 2 +- 4 files changed, 48 insertions(+), 14 deletions(-) diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c index 748c9b0024..0c0d5b4a3c 100644 --- a/src/backend/executor/nodeHash.c +++ b/src/backend/executor/nodeHash.c @@ -484,7 +484,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations, * * The hashtable control block is just palloc'd from the executor's * per-query memory context. Everything else should be kept inside the - * subsidiary hashCxt or batchCxt. + * subsidiary hashCxt, batchCxt or fileCxt. 
*/ hashtable = palloc_object(HashJoinTableData); hashtable->nbuckets = nbuckets; @@ -538,6 +538,10 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations, "HashBatchContext", ALLOCSET_DEFAULT_SIZES); + hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt, + "HashBatchFiles", + ALLOCSET_DEFAULT_SIZES); + /* Allocate data that will live for the life of the hashjoin */ oldcxt = MemoryContextSwitchTo(hashtable->hashCxt); @@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations, if (nbatch > 1 && hashtable->parallel_state == NULL) { + MemoryContext oldctx; + /* * allocate and initialize the file arrays in hashCxt (not needed for * parallel case which uses shared tuplestores instead of raw files) */ + oldctx = MemoryContextSwitchTo(hashtable->fileCxt); + hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch); hashtable->outer
Re: Memory leak from ExecutorState context?
Hi, Sorry for the late answer, I was reviewing the first patch and it took me some time to study and dig around. On Thu, 23 Mar 2023 08:07:04 -0400 Melanie Plageman wrote: > On Fri, Mar 10, 2023 at 1:51 PM Jehan-Guillaume de Rorthais > wrote: > > > So I guess the best thing would be to go through these threads, see what > > > the status is, restart the discussion and propose what to do. If you do > > > that, I'm happy to rebase the patches, and maybe see if I could improve > > > them in some way. > > > > OK! It took me some time, but I did it. I'll try to sum up the situation as > > simply as possible. > > Wow, so many memories! > > I'm excited that someone looked at this old work (though it is sad that > a customer faced this issue). And, Jehan, I really appreciate your great > summarization of all these threads. This will be a useful reference. Thank you! > > 1. "move BufFile stuff into separate context" > > [...] > >I suppose this simple one has been forgotten in the fog of all other > >discussions. Also, this probably worth to be backpatched. > > I agree with Jehan-Guillaume and Tomas that this seems fine to commit > alone. This is a WIP. > > 2. "account for size of BatchFile structure in hashJoin" > > [...] > > I think I would have to see a modern version of a patch which does this > to assess if it makes sense. But, I probably still agree with 2019 > Melanie :) I volunteer to work on this after the memory context patch, unless someone grabs it in the meantime. > [...] > On Mon, Mar 20, 2023 at 10:12 AM Jehan-Guillaume de Rorthais > wrote: > > BNJL and/or other considerations are for 17 or even after. In the meantime, > > Melanie, who authored BNLJ, +1 the balancing patch as it can coexists with > > other discussed solutions. No one down vote since then. Melanie, what is > > your opinion today on this patch? Did you change your mind as you worked > > for many months on BNLJ since then?
> > So, in order to avoid deadlock, my design of adaptive hash join/block > nested loop hash join required a new parallelism concept not yet in > Postgres at the time -- the idea of a lone worker remaining around to do > work when others have left. > > See: BarrierArriveAndDetachExceptLast() > introduced in 7888b09994 > > Thomas Munro had suggested we needed to battle test this concept in a > more straightforward feature first, so I implemented parallel full outer > hash join and parallel right outer hash join with it. > > https://commitfest.postgresql.org/42/2903/ > > This has been stalled ready-for-committer for two years. It happened to > change timing such that it made an existing rarely hit parallel hash > join bug more likely to be hit. Thomas recently committed our fix for > this in 8d578b9b2e37a4d (last week). It is my great hope that parallel > full outer hash join goes in before the 16 feature freeze. This is really interesting to follow. I kinda feel/remember how this could be useful for your BNLJ patch. It's good to see things are moving, step by step. Thanks for the pointers. > If it does, I think it could make sense to try and find committable > smaller pieces of the adaptive hash join work. As it is today, parallel > hash join does not respect work_mem, and, in some sense, is a bit broken. > > I would be happy to work on this feature again, or, if you were > interested in picking it up, to provide review and any help I can if for > you to work on it. I don't think I would be able to pick up such a large and complex patch. But I'm interested to help, test and review, as far as I can! Regards,
Re: Memory leak from ExecutorState context?
On Tue, 28 Mar 2023 00:43:34 +0200 Tomas Vondra wrote: > On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote: > > Please, find in attachment a patch to allocate bufFiles in a dedicated > > context. I picked up your patch, backpatch'd it, went through it and did > > some minor changes to it. I have some comment/questions thought. > > > > 1. I'm not sure why we must allocate the "HashBatchFiles" new context > > under ExecutorState and not under hashtable->hashCxt? > > > > The only references I could find was in hashjoin.h:30: > > > >/* [...] > > * [...] (Exception: data associated with the temp files lives in the > > * per-query context too, since we always call buffile.c in that > > context.) > > > > And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this > > original comment in the patch): > > > >/* [...] > > * Note: it is important always to call this in the regular executor > > * context, not in a shorter-lived context; else the temp file buffers > > * will get messed up. > > > > > > But these are not explanation of why BufFile related allocations must be > > under a per-query context. > > > > Doesn't that simply describe the current (unpatched) behavior where > BufFile is allocated in the per-query context?

I wasn't sure. The first quote from hashjoin.h seems to describe a stronger rule about «**always** call buffile.c in per-query context». But maybe it ought to be «always call buffile.c from one of the sub-query contexts»? I assume the aim is to enforce the temp files' removal on query end/error?

> I mean, the current code calls BufFileCreateTemp() without switching the > context, so it's in the ExecutorState. But with the patch it very clearly is > not. > > And I'm pretty sure the patch should do > > hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt, > "HashBatchFiles", > ALLOCSET_DEFAULT_SIZES); > > and it'd still work. Or why do you think we *must* allocate it under > ExecutorState?
That was actually my very first patch and it indeed worked. But I was confused
by the previously quoted code comments. That's why I kept your original code
and decided to raise the discussion here. Fixed in the new patch, attached.

> FWIW The comment in hashjoin.h needs updating to reflect the change.

Done in the last patch. Is my rewording accurate?

> > 2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context
> > switch seems fragile as it could be forgotten in futur code path/changes.
> > So I added an Assert() in the function to make sure the current memory
> > context is "HashBatchFiles" as expected.
> > Another way to tie this up might be to pass the memory context as
> > argument to the function.
> > ... Or maybe I'm over precautionary.
> >
> I'm not sure I'd call that fragile, we have plenty other code that
> expects the memory context to be set correctly. Not sure about the
> assert, but we don't have similar asserts anywhere else.

I mostly stuck it there to stimulate the discussion around this, as I needed
to scratch that itch.

> But I think it's just ugly and overly verbose

+1. Your patch was just a demo/debug patch at the time. It needs some cleanup
now :)

> it'd be much nicer to e.g. pass the memory context as a parameter, and do
> the switch inside.

That was a suggestion in my previous mail, so I did it in the new patch. Let's
see what other reviewers think.

> > 3. You wrote:
> >
> >>> A separate BufFile memory context helps, although people won't see it
> >>> unless they attach a debugger, I think. Better than nothing, but I was
> >>> wondering if we could maybe print some warnings when the number of batch
> >>> files gets too high ...
> >
> > So I added a WARNING when batches memory are exhausting the memory size
> > allowed.
> > > >+ if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed) > >+ elog(WARNING, "Growing number of hash batch is exhausting > > memory"); > > > > This is repeated on each call of ExecHashIncreaseNumBatches when BufFile > > overflows the memory budget. I realize now I should probably add the > > memory limit, the number of current batch and their memory consumption. > > The message is probably too cryptic for a user. It could probably be > > reworded, but some doc or additionnal hint around this message might help. > > > > Hmmm, not sure is WARN
Re: Memory leak from ExecutorState context?
Hi,

On Mon, 20 Mar 2023 15:12:34 +0100 Jehan-Guillaume de Rorthais wrote:
> On Mon, 20 Mar 2023 09:32:17 +0100
> Tomas Vondra wrote:
> > >> * Patch 1 could be rebased/applied/backpatched
> > >
> > > Would it help if I rebase Patch 1 ("move BufFile stuff into separate
> > > context")?
> >
> > Yeah, I think this is something we'd want to do. It doesn't change the
> > behavior, but it makes it easier to track the memory consumption etc.
>
> Will do this week.

Please find attached a patch to allocate bufFiles in a dedicated context. I
picked up your patch, backpatch'd it, went through it and did some minor
changes to it. I have some comments/questions though.

1. I'm not sure why we must allocate the new "HashBatchFiles" context under
ExecutorState and not under hashtable->hashCxt?

The only references I could find were in hashjoin.h:30:

  /* [...]
   * [...] (Exception: data associated with the temp files lives in the
   * per-query context too, since we always call buffile.c in that context.)

And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this original
comment in the patch):

  /* [...]
   * Note: it is important always to call this in the regular executor
   * context, not in a shorter-lived context; else the temp file buffers
   * will get messed up.

But these do not explain why BufFile-related allocations must be under a
per-query context.

2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context switch
seems fragile, as it could be forgotten in future code paths/changes. So I
added an Assert() in the function to make sure the current memory context is
"HashBatchFiles" as expected. Another way to tie this up might be to pass the
memory context as an argument to the function. ... Or maybe I'm overly
cautious.

3. You wrote:

>> A separate BufFile memory context helps, although people won't see it
>> unless they attach a debugger, I think.
>> Better than nothing, but I was
>> wondering if we could maybe print some warnings when the number of batch
>> files gets too high ...

So I added a WARNING when batch memory exhausts the allowed memory size.

  + if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
  +     elog(WARNING, "Growing number of hash batch is exhausting memory");

This is repeated on each call of ExecHashIncreaseNumBatches when BufFile
overflows the memory budget. I realize now I should probably add the memory
limit, the current number of batches and their memory consumption. The message
is probably too cryptic for a user. It could probably be reworded, but some
doc or an additional hint around this message might help.

4. I left the debug messages in for some more review rounds.

Regards,

>From a0dd541ad4e47735fc388d0c553e935f0956f254 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais
Date: Mon, 27 Mar 2023 15:54:39 +0200
Subject: [PATCH] Allocate hash batches related BufFile in a dedicated context

---
 src/backend/executor/nodeHash.c     | 39 +++--
 src/backend/executor/nodeHashjoin.c | 14 ---
 src/include/executor/hashjoin.h     |  1 +
 3 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 748c9b0024..3f1ae9755e 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -484,7 +484,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
  *
  * The hashtable control block is just palloc'd from the executor's
  * per-query memory context. Everything else should be kept inside the
- * subsidiary hashCxt or batchCxt.
+ * subsidiary hashCxt, batchCxt or fileCxt.
  */
 	hashtable = palloc_object(HashJoinTableData);
 	hashtable->nbuckets = nbuckets;
@@ -538,6 +538,10 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 											   "HashBatchContext",
 											   ALLOCSET_DEFAULT_SIZES);
 
+	hashtable->fileCxt = AllocSetContextCreate(CurrentMemoryContext,
+											   "HashBatchFiles",
+											   ALLOCSET_DEFAULT_SIZES);
+
 	/* Allocate data that will live for the life of the hashjoin */
 	oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
@@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	if (nbatch > 1 && hashtable->parallel_state == NULL)
 	{
+		MemoryContext oldctx;
+
 		/*
 		 * allocate and initialize the file arrays in hashCxt (not needed for
 		 * parallel case which uses shared tuplestores instead of raw files)
 		 */
+		oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
+
 		hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
 		hashtable->outerBatchFile = palloc0_ar
Re: Memory leak from ExecutorState context?
On Mon, 20 Mar 2023 09:32:17 +0100 Tomas Vondra wrote:
> >> * Patch 1 could be rebased/applied/backpatched
> >
> > Would it help if I rebase Patch 1 ("move BufFile stuff into separate
> > context")?
>
> Yeah, I think this is something we'd want to do. It doesn't change the
> behavior, but it makes it easier to track the memory consumption etc.

Will do this week.

> I think focusing on the backpatchability is not particularly good
> approach. It's probably be better to fist check if there's any hope of
> restarting the work on BNLJ, which seems like a "proper" solution for
> the future.

Backpatching is worth some consideration. The balancing patch is almost there
and could help in 16. As time is flying, it might well miss release 16, but
maybe it could be backpatched in 16.1 or a later minor release? But what if it
is not eligible for backpatching? We would stall another year without having
at least an imperfect stop-gap solution (sorry for picking your words ;)).

BNLJ and/or other considerations are for 17 or even after. In the meantime,
Melanie, who authored BNLJ, +1'd the balancing patch as it can coexist with
the other discussed solutions. No one has down-voted it since then. Melanie,
what is your opinion today on this patch? Did you change your mind as you
worked for many months on BNLJ since then?

> On 3/19/23 20:31, Justin Pryzby wrote:
> > On Fri, Mar 17, 2023 at 05:41:11PM +0100, Tomas Vondra wrote:
> > > > * Patch 2 is worth considering to backpatch
> > >
> > > I'm not quite sure what exactly are the numbered patches, as some of the
> > > threads had a number of different patch ideas, and I'm not sure which
> > > one was/is the most promising one.
> >
> > patch 2 is referring to the list of patches that was compiled
> > https://www.postgresql.org/message-id/20230310195114.6d0c5406%40karst
>
> Ah, I see - it's just the "rebalancing" patch which minimizes the total
> amount of memory used (i.e. grow work_mem a bit, so that we don't
> allocate too many files).
> > Yeah, I think that's the best we can do without reworking how we spill
> > data (slicing or whatever).

Indeed, that's why I was speaking about backpatching for this one:

* it surely helps 16 (and maybe previous releases) in such skewed situations
* it's constrained
* it's not too invasive; it doesn't shake the whole algorithm all over the
  place
* Melanie was +1 for it, no one down-voted it.

What needs to be discussed/worked on:

* any regressions for existing queries running fine or without OOM?
* add the bufFile memory consumption to the planner considerations?

> I think the spilling is "nicer" in that it actually enforces work_mem
> more strictly than (a),

Sure, it enforces work_mem more strictly, but this is a discussion for 17 or
later in my humble opinion.

> but we should probably spend a bit more time on
> the exact spilling strategy. I only really tried two trivial ideas, but
> maybe we can be smarter about allocating / sizing the files? Maybe
> instead of slices of constant size we could/should make them larger,
> similarly to what log-structured storage does.
>
> For example we might double the number of batches a file represents, so
> the first file would be for batch 1, the next one for batches 2+3, then
> 4+5+6+7, then 8-15, then 16-31, ...
>
> We should have some model calculating (i) amount of disk space we would
> need for the spilled files, and (ii) the amount of I/O we do when
> writing the data between temp files.

And:

* what about Robert's discussion on uneven batch distribution? Why is it
  ignored? Maybe there were some IRL or off-list discussions? Or did I miss
  some mails?
* what about dealing with all normal batches first, then reprocessing the
  skewed ones in the freshly emptied batches, splitting them if needed, then
  rinse & repeat? At some point, we would probably still need something like
  slicing and/or BNLJ though...

> let's revive the rebalance and/or spilling patches. Imperfect but better than
> nothing.

+1 for rebalance.
I'm not even sure it could make it to 16 as we are running out of time, but it
is worth a try as it's the closest one that could be reviewed and marked
ready-for-committer.

I might lack ambition, but the spilling patches seem too much to make it for
16. They seem to belong with the other larger patches/ideas (patches 4 and 5
in my sum-up). But this is just my humble feeling.

> And then in the end we can talk about if/what can be backpatched.
>
> FWIW I don't think there's a lot of rush, considering this is clearly a
> matter for PG17. So the summer CF at the earliest, people are going to
> be busy until then.

100% agree, there's no rush for patches 3, 4 ... and 5.

Thanks!
Re: Memory leak from ExecutorState context?
Hi there,

On Fri, 10 Mar 2023 19:51:14 +0100 Jehan-Guillaume de Rorthais wrote:
> > So I guess the best thing would be to go through these threads, see what
> > the status is, restart the discussion and propose what to do. If you do
> > that, I'm happy to rebase the patches, and maybe see if I could improve
> > them in some way.
>
> [...]
>
> > I was hoping we'd solve this by the BNL, but if we didn't get that in 4
> > years, maybe we shouldn't stall and get at least an imperfect stop-gap
> > solution ...
>
> Indeed. So, to sum-up:
>
> * Patch 1 could be rebased/applied/backpatched

Would it help if I rebase Patch 1 ("move BufFile stuff into separate
context")?

> * Patch 2 is worth considering to backpatch

Same question.

> * Patch 3 seemed withdrawn in favor of BNLJ
> * Patch 4 is waiting for some more review and has some TODO
> * discussion 5 worth few minutes to discuss before jumping on previous topics

These other patches need more discussion and hacking. They have a low priority
compared to other discussions and the running commitfest. However, how can we
avoid losing them in limbo again?

Regards,
Re: Memory leak from ExecutorState context?
Hi,

> So I guess the best thing would be to go through these threads, see what
> the status is, restart the discussion and propose what to do. If you do
> that, I'm happy to rebase the patches, and maybe see if I could improve
> them in some way.

OK! It took me some time, but I did it. I'll try to sum up the situation as
simply as possible. I reviewed the following threads:

* Out of Memory errors are frustrating as heck!
  2019-04-14 -> 2019-04-28
  https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net

  This discussion stalled, waiting for the OP, but ideas there ignited all the
  other discussions.

* accounting for memory used for BufFile during hash joins
  2019-05-04 -> 2019-09-10
  https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh%40development

  This was supposed to push forward a patch discussed on the previous thread,
  but it actually took over, and more ideas popped up from there.

* Replace hashtable growEnable flag
  2019-05-15 -> 2019-05-16
  https://www.postgresql.org/message-id/flat/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com

  This one quickly merged into the next one.

* Avoiding hash join batch explosions with extreme skew and weird stats
  2019-05-16 -> 2020-09-24
  https://www.postgresql.org/message-id/flat/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

  Another thread discussing another facet of the problem, which eventually
  ended up discussing / reviewing the BNLJ implementation.

Five possible fixes/ideas were discussed all over these threads:

1. "move BufFile stuff into separate context"
   last found patch: 2019-04-21
   https://www.postgresql.org/message-id/20190421114618.z3mpgmimc3rmubi4%40development
   https://www.postgresql.org/message-id/attachment/100822/0001-move-BufFile-stuff-into-separate-context.patch

   This patch helps with observability/debugging by allocating the bufFiles in
   the appropriate context instead of the "ExecutorState" one.
I suppose this simple one has been forgotten in the fog of all the other
discussions. Also, this one is probably worth backpatching.

2. "account for size of BatchFile structure in hashJoin"
   last found patch: 2019-04-22
   https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
   https://www.postgresql.org/message-id/attachment/100951/v2-simple-rebalance.patch

   This patch seems like a good first step:

   * it definitely helps older versions, where the other patches discussed are
     way too invasive to be backpatched
   * it doesn't get in the way of the other discussed patches

While looking at the discussions around this patch, I was wondering if the
planner considers the memory allocation of bufFiles. But of course, Melanie
already noticed that long before I was aware of this problem and discussion:

2019-07-10: «I do think that accounting for Buffile overhead when estimating
the size of the hashtable during ExecChooseHashTableSize() so it can be used
during planning is a worthwhile patch by itself (though I know it is not even
part of this patch).»
https://www.postgresql.org/message-id/CAAKRu_Yiam-%3D06L%2BR8FR%2BVaceb-ozQzzMqRiY2pDYku1VdZ%3DEw%40mail.gmail.com

Tomas Vondra agreed with this in his answer, but no new version of the patch
was produced.

Finally, Melanie was pushing the idea to commit this patch regardless of the
other pending patches/ideas:

2019-09-05: «If Tomas or someone else has time to pick up and modify BufFile
accounting patch, committing that still seems like the nest logical step.»
https://www.postgresql.org/message-id/CAAKRu_b6%2BjC93WP%2BpWxqK5KAZJC5Rmxm8uquKtEf-KQ%2B%2B1Li6Q%40mail.gmail.com

Unless I'm wrong, no one down-voted this.

3.
"per slice overflow file" last found patch: 2019-05-08 https://www.postgresql.org/message-id/20190508150844.rij36rtuk4lhvztw%40development https://www.postgresql.org/message-id/attachment/101080/v4-per-slice-overflow-file-20190508.patch This patch has been withdraw after an off-list discussion with Thomas Munro because of a missing parallel hashJoin implementation. Plus, before any effort started on the parallel implementation, the BNLJ idea appeared and seemed more appealing. See: https://www.postgresql.org/message-id/20190529145517.sj2poqmb3cr4cg6w%40development By the time, it still seems to have some interest despite the BNLJ patch: 2019-07-10: «If slicing is made to work for parallel-aware hashjoin and the code is in a committable state (and probably has the threshold I mentioned above), then I think that this patch should go in.» https://www.postgresql.org/message-id/CAAKRu_Yiam-%3D06L%2BR8FR%2BVaceb-ozQzzMqRiY2pDYku1VdZ%3DEw%40mail.gmail.com But this might have been disapproved later by Tomas: 2019-09-10: «I have to admit I kinda lost track [...] My feeling is that we should get the BNLJ committed
Re: Memory leak from ExecutorState context?
On Thu, 2 Mar 2023 19:53:14 +0100 Tomas Vondra wrote:
> On 3/2/23 19:15, Jehan-Guillaume de Rorthais wrote:
...
> > There was some thoughts about how to make a better usage of the memory. As
> > memory is exploding way beyond work_mem, at least, avoid to waste it with
> > too many buffers of BufFile. So you expand either the work_mem or the
> > number of batch, depending on what move is smarter. TJis is explained and
> > tested here:
> >
> > https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
> > https://www.postgresql.org/message-id/20190422030927.3huxq7gghms4kmf4%40development
> >
> > And then, another patch to overflow each batch to a dedicated temp file and
> > stay inside work_mem (v4-per-slice-overflow-file.patch):
> >
> > https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
> >
> > Then, nothing more on the discussion about this last patch. So I guess it
> > just went cold.
>
> I think a contributing factor was that the OP did not respond for a
> couple months, so the thread went cold.
>
> > For what it worth, these two patches seems really interesting to me. Do you
> > need any help to revive it?
>
> I think another reason why that thread went nowhere were some that we've
> been exploring a different (and likely better) approach to fix this by
> falling back to a nested loop for the "problematic" batches.
>
> As proposed in this thread:
>
> https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development

Unless I'm wrong, you are linking to the same «frustrated as heck!»
discussion, for your patch
v2-0001-account-for-size-of-BatchFile-structure-in-hashJo.patch (balancing
between increasing batches *and* work_mem). No sign of turning "problematic"
batches into a nested loop. Did I miss something? Do you have a link at hand
to such an algo/patch by any chance?
> I was hoping we'd solve this by the BNL, but if we didn't get that in 4
> years, maybe we shouldn't stall and get at least an imperfect stop-gap
> solution ...

I'll keep searching tomorrow for existing BNL discussions (is it block-level
nested loops?).

Regards,
Re: Memory leak from ExecutorState context?
On Thu, 2 Mar 2023 19:15:30 +0100 Jehan-Guillaume de Rorthais wrote: [...] > For what it worth, these two patches seems really interesting to me. Do you > need any help to revive it? To avoid confusion, the two patches I meant were: * 0001-move-BufFile-stuff-into-separate-context.patch * v4-per-slice-overflow-file.patch Regards,
Re: Memory leak from ExecutorState context?
Hi!

On Thu, 2 Mar 2023 13:44:52 +0100 Tomas Vondra wrote:
> Well, yeah and no.
>
> In principle we could/should have allocated the BufFiles in a different
> context (possibly hashCxt). But in practice it probably won't make any
> difference, because the query will probably run all the hashjoins at the
> same time. Imagine a sequence of joins - we build all the hashes, and
> then tuples from the outer side bubble up through the plans. And then
> you process the last tuple and release all the hashes.
>
> This would not fix the issue. It'd be helpful for accounting purposes
> (we'd know it's the buffiles and perhaps for which hashjoin node). But
> we'd still have to allocate the memory etc. (so still OOM).

Well, accounting things in the correct context would already be worth a patch,
I suppose. At least, it helps to investigate faster. Plus, you already wrote a
patch about that[1]:

[1] https://www.postgresql.org/message-id/20190421114618.z3mpgmimc3rmubi4%40development

Note that I did reference the "Out of Memory errors are frustrating as heck!"
thread in my first email, pointing at a Tom Lane email explaining that
ExecutorState was not supposed to be so large[2].

[2] https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b

> There's only one thing I think could help - increase the work_mem enough
> not to trigger the explosive growth in number of batches. Imagine
> there's one very common value, accounting for ~65MB of tuples. With
> work_mem=64MB this leads to exactly the explosive growth you're
> observing here. With 128MB it'd probably run just fine.
>
> The problem is we don't know how large the work_mem would need to be :-(
> So you'll have to try and experiment a bit.
>
> I remembered there was a thread [1] about *exactly* this issue in 2019.
> [1]
> https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net
>
> I even posted a couple patches that try to address this by accounting
> for the BufFile memory, and increasing work_mem a bit instead of just
> blindly increasing the number of batches (ignoring the fact that means
> more memory will be used for the BufFile stuff).
>
> I don't recall why it went nowhere, TBH. But I recall there were
> discussions about maybe doing something like "block nestloop" at the
> time, or something. Or maybe the thread just went cold.

So I read the full thread now. I'm still not sure why we try so hard to avoid
hash collisions, and why a similar data subset barely larger than work_mem
makes the number of batches explode, but I think I have a better understanding
of the discussion and the proposed solutions.

There were some thoughts about how to make better use of the memory. As memory
is exploding way beyond work_mem, at least avoid wasting it on too many
BufFile buffers. So you expand either work_mem or the number of batches,
depending on which move is smarter. This is explained and tested here:

https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
https://www.postgresql.org/message-id/20190422030927.3huxq7gghms4kmf4%40development

And then, another patch to overflow each batch to a dedicated temp file and
stay inside work_mem (v4-per-slice-overflow-file.patch):

https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development

Then, nothing more on the discussion about this last patch. So I guess it just
went cold.

For what it's worth, these two patches seem really interesting to me. Do you
need any help to revive them?

Regards,
Re: Memory leak from ExecutorState context?
On Thu, 2 Mar 2023 01:30:27 +0100 Tomas Vondra wrote: > On 3/2/23 00:18, Jehan-Guillaume de Rorthais wrote: > >>> ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands > >>> of calls, always short-cut'ed to 1048576, I guess because of the > >>> conditional block «/* safety check to avoid overflow */» appearing early > >>> in this function. > >[...] But what about this other short-cut then? > > > > /* do nothing if we've decided to shut off growth */ > > if (!hashtable->growEnabled) > > return; > > > > [...] > > > > /* > > * If we dumped out either all or none of the tuples in the table, > > * disable > > * further expansion of nbatch. This situation implies that we have > > * enough tuples of identical hashvalues to overflow spaceAllowed. > > * Increasing nbatch will not fix it since there's no way to subdivide > > * the > > * group any more finely. We have to just gut it out and hope the server > > * has enough RAM. > > */ > > if (nfreed == 0 || nfreed == ninmemory) > > { > > hashtable->growEnabled = false; > > #ifdef HJDEBUG > > printf("Hashjoin %p: disabling further increase of nbatch\n", > >hashtable); > > #endif > > } > > > > If I guess correctly, the function is not able to split the current batch, > > so it sits and hopes. This is a much better suspect and I can surely track > > this from gdb. > > Yes, this would make much more sense - it'd be consistent with the > hypothesis that this is due to number of batches exploding (it's a > protection exactly against that). > > You specifically mentioned the other check earlier, but now I realize > you've been just speculating it might be that. Yes, sorry about that, I jumped on this speculation without actually digging it much... [...] > But I have another idea - put a breakpoint on makeBufFile() which is the > bit that allocates the temp files including the 8kB buffer, and print in > what context we allocate that. I have a hunch we may be allocating it in > the ExecutorState. 
> That'd explain all the symptoms.

That's what I was wondering yesterday night as well. So, on your advice, I set
a breakpoint on makeBufFile:

(gdb) info br
Num     Type       Disp Enb Address            What
1       breakpoint keep y   0x007229df in makeBufFile
        bt 10
        p CurrentMemoryContext.name

Then, I disabled it and ran the query up to this memory usage:

   VIRT    RES    SHR S  %CPU  %MEM
  20.1g   7.0g  88504 t   0.0  22.5

Then, I enabled the breakpoint and looked at around 600 backtraces and context
names before getting bored. They **all** looked like this:

Breakpoint 1, BufFileCreateTemp (...) at buffile.c:201
201     in buffile.c
#0  BufFileCreateTemp (...) buffile.c:201
#1  ExecHashJoinSaveTuple (tuple=0x1952c180, ...) nodeHashjoin.c:1238
#2  ExecHashJoinImpl (parallel=false, pstate=0x31a6418) nodeHashjoin.c:398
#3  ExecHashJoin (pstate=0x31a6418) nodeHashjoin.c:584
#4  ExecProcNodeInstr (node=) execProcnode.c:462
#5  ExecProcNode (node=0x31a6418)
#6  ExecSort (pstate=0x31a6308)
#7  ExecProcNodeInstr (node=)
#8  ExecProcNode (node=0x31a6308)
#9  fetch_input_tuple (aggstate=aggstate@entry=0x31a5ea0)
$421643 = 0x99d7f7 "ExecutorState"

These 600-ish 8kB buffers were all allocated in "ExecutorState". I could
probably log many more of them if more checks/stats need to be collected, but
it really slows down the query a lot, granting it only 1-5% of CPU time
instead of the usual 100%.

So it's not exactly a leak, as the memory would be released at the end of the
query, but I suppose these buffers should be allocated in a shorter-lived
context to avoid this memory bloat, am I right?

> BTW with how many batches does the hash join start?

* batches went from 32 to 1048576 before growEnabled was set to false, as
  suspected
* original and current nbuckets were set to 1048576 immediately
* allowed space is set to the work_mem, but current space usage is 1.3GB, as
  measured previously, shortly before the system refused more memory
  allocations.
Here are the full details about the hash table associated with the previous
backtrace:

(gdb) up
(gdb) up
(gdb) p *((HashJoinState*)pstate)->hj_HashTable
$421652 = {
  nbuckets = 1048576,
  log2_nbuckets = 20,
  nbuckets_original = 1048576,
  nbuckets_optimal = 1048576,
  log2_nbuckets_optimal = 20,
  buckets = {unshared = 0x68f12e8, shared = 0x68f12e8},
  keepNulls = true,
  skewEnabled = false,
  skewBucket = 0x0,
  skewBucketLen = 0,
Re: Memory leak from ExecutorState context?
Hi, On Wed, 1 Mar 2023 20:29:11 +0100 Tomas Vondra wrote: > On 3/1/23 18:48, Jehan-Guillaume de Rorthais wrote: > > On Tue, 28 Feb 2023 20:51:02 +0100 > > Tomas Vondra wrote: > >> On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote: > >>> * HashBatchContext goes up to 1441MB after 240s then stay flat until the > >>> end (400s as the last record) > >> > >> That's interesting. We're using HashBatchContext for very few things, so > >> what could it consume so much memory? But e.g. the number of buckets > >> should be limited by work_mem, so how could it get to 1.4GB? > >> > >> Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets > >> and print how many batches/butches are there? > > > > I did this test this morning. > > > > Batches and buckets increased really quickly to 1048576/1048576. > > OK. I think 1M buckets is mostly expected for work_mem=64MB. It means > buckets will use 8MB, which leaves ~56B per tuple (we're aiming for > fillfactor 1.0). > > But 1M batches? I guess that could be problematic. It doesn't seem like > much, but we need 1M files on each side - 1M for the hash table, 1M for > the outer relation. That's 16MB of pointers, but the files are BufFile > and we keep 8kB buffer for each of them. That's ~16GB right there :-( > > In practice it probably won't be that bad, because not all files will be > allocated/opened concurrently (especially if this is due to many tuples > having the same value). Assuming that's what's happening here, ofc. And I suppose they are close/freed concurrently as well? > > ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands > > of calls, always short-cut'ed to 1048576, I guess because of the > > conditional block «/* safety check to avoid overflow */» appearing early in > > this function. > > Hmmm, that's a bit weird, no? 
> I mean, the check is
>
>     /* safety check to avoid overflow */
>     if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
>         return;
>
> Why would it stop at 1048576? It certainly is not higher than INT_MAX/2
> and with MaxAllocSize = ~1GB the second value should be ~33M. So what's
> happening here?

Indeed, not the right suspect. But what about this other short-cut then?

    /* do nothing if we've decided to shut off growth */
    if (!hashtable->growEnabled)
        return;

    [...]

    /*
     * If we dumped out either all or none of the tuples in the table, disable
     * further expansion of nbatch. This situation implies that we have
     * enough tuples of identical hashvalues to overflow spaceAllowed.
     * Increasing nbatch will not fix it since there's no way to subdivide the
     * group any more finely. We have to just gut it out and hope the server
     * has enough RAM.
     */
    if (nfreed == 0 || nfreed == ninmemory)
    {
        hashtable->growEnabled = false;
#ifdef HJDEBUG
        printf("Hashjoin %p: disabling further increase of nbatch\n",
               hashtable);
#endif
    }

If I guess correctly, the function is not able to split the current batch, so
it sits and hopes. This is a much better suspect, and I can surely track this
from gdb. Being able to find out which fields are involved in the join could
help as well, to check or gather some stats about them, but I haven't had time
to dig into this yet...

[...]

> >> Investigating memory leaks is tough, especially for generic memory
> >> contexts like ExecutorState :-( Even more so when you can't reproduce it
> >> on a machine with custom builds.
> >>
> >> What I'd try is this: [...]
> > I couldn't print "size" as it is optimzed away, that's why I tracked
> > chunk->size... Is there anything wrong with my current run and gdb log?
>
> There's definitely something wrong. The size should not be 0, and
> neither it should be > 1GB.
> I suspect it's because some of the variables
> get optimized out, and gdb just uses some nonsense :-(
>
> I guess you'll need to debug the individual breakpoints, and see what's
> available. It probably depends on the compiler version, etc. For example
> I don't see the "chunk" for breakpoint 3, but "chunk_size" works and I
> can print the chunk pointer with a bit of arithmetics:
>
>     p (block->freeptr - chunk_size)
>
> I suppose similar gympastics could work for the other breakpoints.

OK, I'll give it a try tomorrow. Thank you!

NB: the query has been killed by the replication.
Re: Memory leak from ExecutorState context?
On Wed, 1 Mar 2023 20:34:08 +0100 Tomas Vondra wrote: > On 3/1/23 19:09, Jehan-Guillaume de Rorthais wrote: > > On Wed, 1 Mar 2023 18:48:40 +0100 > > Jehan-Guillaume de Rorthais wrote: > > ... > >> You'll find some intermediate stats I already collected in attachment: > >> > >> * break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree. > >> * most of the non-free'd chunk are allocated since the very beginning, > >> before the 5000's allocation call for almost 1M call so far. > >> * 3754 of them have a chunk->size of 0 > >> * it seems there's some buggy stats or data: > >># this one actually really comes from the gdb log > >>0x38a77b8: break=3 num=191 sz=4711441762604810240 (weird sz) > >># this one might be a bug in my script > >> 0x2: break=2 num=945346sz=2 (weird > >> address) > >> * ignoring the weird size requested during the 191st call, the total amount > >> of non free'd memory is currently 5488MB > > > > I forgot one stat. I don't know if this is expected, normal or not, but 53 > > chunks has been allocated on an existing address that was not free'd before. > > > > It's likely chunk was freed by repalloc() and not by pfree() directly. > Or maybe the whole context got destroyed/reset, in which case we don't > free individual chunks. But that's unlikely for the ExecutorState. Well, as all breakpoints were conditional on ExecutorState, I suppose this might be repalloc then. Regards,
Re: Memory leak from ExecutorState context?
On Wed, 1 Mar 2023 18:48:40 +0100 Jehan-Guillaume de Rorthais wrote: ...

> You'll find some intermediate stats I already collected in attachment:
>
> * break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
> * most of the non-free'd chunks were allocated at the very beginning,
>   before the 5000th allocation call, out of almost 1M calls so far.
> * 3754 of them have a chunk->size of 0
> * it seems there's some buggy stats or data:
>   # this one actually really comes from the gdb log
>   0x38a77b8: break=3 num=191 sz=4711441762604810240 (weird sz)
>   # this one might be a bug in my script
>   0x2: break=2 num=945346 sz=2 (weird address)
> * ignoring the weird size requested during the 191st call, the total amount
>   of non-free'd memory is currently 5488MB

I forgot one stat. I don't know if this is expected, normal or not, but 53
chunks have been allocated at an existing address that was not free'd before.

Regards,
Re: Memory leak from ExecutorState context?
Hi Tomas,

On Tue, 28 Feb 2023 20:51:02 +0100 Tomas Vondra wrote:

> On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
> > * HashBatchContext goes up to 1441MB after 240s then stays flat until
> >   the end (400s as the last record)
>
> That's interesting. We're using HashBatchContext for very few things, so
> what could consume so much memory? But e.g. the number of buckets
> should be limited by work_mem, so how could it get to 1.4GB?
>
> Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
> and print how many batches/buckets are there?

I did this test this morning. Batches and buckets increased really quickly
to 1048576/1048576.

ExecHashIncreaseNumBatches was really chatty, with hundreds of thousands of
calls, always short-circuited at 1048576, I guess because of the conditional
block «/* safety check to avoid overflow */» appearing early in this
function. I disabled the breakpoint on ExecHashIncreaseNumBatches a few
times to make the query run faster. Enabling it at 19.1GB of memory
consumption, it stayed silent till the memory exhaustion (around 21 or 22GB,
I don't remember exactly).

The breakpoint on ExecHashIncreaseNumBuckets triggered a few times at the
beginning, and a last time close to the end of the query execution.

> > Any idea? How could I help to have a better idea if a leak is actually
> > occurring and where exactly?
>
> Investigating memory leaks is tough, especially for generic memory
> contexts like ExecutorState :-( Even more so when you can't reproduce it
> on a machine with custom builds.
>
> What I'd try is this:
>
> 1) attach breakpoints to all returns in AllocSetAlloc(), printing the
> pointer and size for ExecutorState context, so something like
>
>     break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
>     commands
>     print MemoryChunkGetPointer(chunk) size
>     cont
>     end
>
> 2) do the same for AllocSetFree()
>
> 3) Match the palloc/pfree calls (using the pointer address), to
> determine which ones are not freed and do some stats on the size.
> Usually there's only a couple distinct sizes that account for most of
> the leaked memory.

So here is what I ended up with this afternoon, using files, lines and
macros from REL_11_18:

    set logging on
    set pagination off

    break aset.c:781 if strcmp("ExecutorState",context.name) == 0
    commands 1
    print (((char *)(chunk)) + sizeof(struct AllocChunkData))
    print chunk->size
    cont
    end

    break aset.c:820 if strcmp("ExecutorState",context.name) == 0
    commands 2
    print (((char *)(chunk)) + sizeof(struct AllocChunkData))
    print chunk->size
    cont
    end

    break aset.c:979 if strcmp("ExecutorState",context.name) == 0
    commands 3
    print (((char *)(chunk)) + sizeof(struct AllocChunkData))
    print chunk->size
    cont
    end

    break AllocSetFree if strcmp("ExecutorState",context.name) == 0
    commands 4
    print pointer
    cont
    end

So far, gdb has had more than 3h of CPU time and is eating 2.4GB of memory.
The backend had only 3'32" of CPU time:

    VIRT     RES    SHR    S %CPU %MEM    TIME+ COMMAND
    2727284  2.4g   17840  R 99.0  7.7 181:25.07 gdb
    9054688 220648 103056  t  1.3  0.7   3:32.05 postmaster

Interestingly, the RES memory of the backend has not exploded yet, but VIRT
is already high.

I suppose the query will run for some more hours; hopefully, gdb will not
exhaust the memory in the meantime...

You'll find some intermediate stats I already collected in attachment:

* break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
* most of the non-free'd chunks were allocated at the very beginning, before
  the 5000th allocation call, out of almost 1M calls so far.
* 3754 of them have a chunk->size of 0
* it seems there's some buggy stats or data:
  # this one actually really comes from the gdb log
  0x38a77b8: break=3 num=191 sz=4711441762604810240 (weird sz)
  # this one might be a bug in my script
  0x2: break=2 num=945346 sz=2 (weird address)
* ignoring the weird size requested during the 191st call, the total amount
  of non-free'd memory is currently 5488MB

I couldn't print "size" as it is optimized away, that's why I tracked
chunk->size... Is there anything wrong with my current run and gdb log?

The gdb log is 5MB compressed. I'll keep it off-list, but I can provide it
if needed.

Stay tuned... Thank you!

allocs-tmp-stats.txt.gz
Description: application/gzip
Re: Memory leak from ExecutorState context?
Hi Justin,

On Tue, 28 Feb 2023 12:25:08 -0600 Justin Pryzby wrote:

> On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote:
> > Hello all,
> >
> > A customer is facing an out-of-memory query which looks similar to this
> > situation:
> >
> > https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
> >
> > This PostgreSQL version is 11.18. Some settings:
>
> hash joins could exceed work_mem until v13:

Yes, I am aware of this. But as far as I understand Tom Lane's explanations
from the discussion mentioned upthread, it should not be ExecutorState.
ExecutorState (13GB) is at least ten times bigger than any other context,
including HashBatchContext (1.4GB) or HashTableContext (16MB). So maybe some
aggregate is walking toward the wall because of a bad estimation, but
something else is racing way faster to the wall. And presently it might be
something related to some JOIN node.

About your other points, you are right, there are numerous things we could
do to improve this query, and our customer is considering them as well. It's
just a matter of time now. But in the meantime, we are facing a query with a
memory behavior that seems suspect. Following the 4-year-old thread I
mentioned, my goal is to inspect and provide all possible information to
make sure it's either "normal" behavior or something that might/should be
fixed.

Thank you for your help!
Re: Transparent column encryption
On Wed, 23 Nov 2022 19:45:06 +0100 Peter Eisentraut wrote:

> On 28.10.22 12:16, Jehan-Guillaume de Rorthais wrote: [...]
> > * I wonder if encryption related fields in ParameterDescription and
> >   RowDescription could be optional somehow? The former might be quite
> >   large when using a lot of parameters (like, imagine a large and ugly
> >   "IN($1...$N)"). On the other hand, these messages are not sent at high
> >   frequency anyway...
>
> They are only used if you turn on the column_encryption protocol option.
> Or did you mean make them optional even then?

I meant even when column_encryption is turned on.

Regards,
Re: Fix proposal for comparaison bugs in PostgreSQL::Version
On Thu, 3 Nov 2022 13:11:18 -0500 Justin Pryzby wrote: > On Tue, Jun 28, 2022 at 06:17:40PM -0400, Andrew Dunstan wrote: > > Nice catch, but this looks like massive overkill. I think we can very > > simply fix the test in just a few lines of code, instead of a 190 line > > fix and a 130 line TAP test. > > > > It was never intended to be able to compare markers like rc1 vs rc2, and > > I don't see any need for it. If you can show me a sane use case I'll > > have another look, but right now it seems quite unnecessary. > > > > Here's my proposed fix. > > > > diff --git a/src/test/perl/PostgreSQL/Version.pm > > b/src/test/perl/PostgreSQL/Version.pm index 8f70491189..8d4dbbf694 100644 > > --- a/src/test/perl/PostgreSQL/Version.pm > > Is this still an outstanding issue ? The issue still exists on current HEAD: $ perl -Isrc/test/perl/ -MPostgreSQL::Version -le \ 'print "bug" if PostgreSQL::Version->new("9.6") <= 9.0' bug Regards,
Re: [BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)
On Thu, 3 Nov 2022 20:44:16 +0100 Alvaro Herrera wrote: > On 2022-Oct-05, Alvaro Herrera wrote: > > > I've been giving the patches a look and it caused me to notice two > > additional bugs in the same area: > > > > - FKs in partitions are sometimes marked NOT VALID. This is because of > > missing initialization when faking up a Constraint node in > > CloneFkReferencing. Easy to fix, have patch, running tests now. > > I have pushed the fix for this now. Thank you Alvaro!
Re: psql: Add command to use extended query protocol
On Wed, 02 Nov 2022 16:04:02 +0100 "Daniel Verite" wrote:

> Jehan-Guillaume de Rorthais wrote:
>
> > As I wrote in my TCE review, would it be possible to use psql vars to
> > set some named parameters for the prepared query? This would look like:
> >
> >     \set p1 foo
> >     \set p2 bar
> >     SELECT :'p1', :'p2' \gp
>
> As I understand the feature, variables would be passed like this:
>
>     \set var1 'foo bar'
>     \set var2 'baz''qux'
>
>     select $1, $2 \gp :var1 :var2
>
>      ?column? | ?column?
>     ----------+----------
>      foo bar  | baz'qux
>
> It appears to work fine with the current patch.

Indeed, nice.

> This is consistent with the fact that PQexecParams passes $N
> parameters out of the SQL query (versus injecting them in the text of
> the query)

I was not thinking about injecting them in the text of the query; that
would not be using the extended protocol anymore (or maybe with no
parameters, but there's no point). What I was thinking about is psql
replacing the variables in the query text with the $N notation before
sending it using PQprepare.

> which is also why no quoting is needed.

Indeed, the quotes were not needed in my example.

Thanks,
Re: psql: Add command to use extended query protocol
Hi,

On Fri, 28 Oct 2022 08:52:51 +0200 Peter Eisentraut wrote:

> This adds a new psql command \gp that works like \g (or semicolon) but
> uses the extended query protocol. Parameters can also be passed, like
>
>     SELECT $1, $2 \gp 'foo' 'bar'

As I wrote in my TCE review, would it be possible to use psql vars to set
some named parameters for the prepared query? This would look like:

    \set p1 foo
    \set p2 bar
    SELECT :'p1', :'p2' \gp

This seems useful when running a psql script, passing it some variables
using -v args. It helps with parameter positions, changing some between
executions, repeating them in the query, etc.

Thoughts?
Re: Commitfest documentation
Hi Aleksander,

Thank you for your help!

On Mon, 31 Oct 2022 16:51:23 +0300 Aleksander Alekseev wrote: [...]

> > In the commitfest application, I was wondering today what was the exact
> > meaning and difference between open/closed status (is it only for the
> > current commitfest?)
>
> Closed means that the CF was in the past. It is archived now. Open
> means that new patches are accepted to the given CF. If memory serves,
> when the CF starts the status changes to "In Progress".

Sorry, I was asking from a patch's point of view, not the whole
commitfest's. If you look at the "Change Status" list on a patch page,
there are two sublists of options: "Open statuses" and "Closed statuses".
But your answer below answered the question anyway.

> There are five CFs a year: in January, March, July, September, and
> November. November one is about to start.

This detail might have a place in the following page:
https://wiki.postgresql.org/wiki/CommitFest
But I'm not sure it's really worth it?

> > and between «waiting for author» and «Returned with feedback».
>
> RwF is almost the same as "Rejected". It means that some feedback was
> provided for the patch and the community wouldn't mind accepting a new
> patch when and if this feedback will be accounted for.
>
> WfA means that the patch awaits some (relatively small) actions from
> the author. Typically it happens after another round of code review.

Thank you for the disambiguation. Here is a proposal for all statuses:

* Needs review: waiting for a new review.
* WfA: the patch awaits some (relatively small) actions from the author,
  typically after another round of code review.
* Ready fC: no more comments from reviewers. The code is ready for a
  committer review.
* Rejected: the code is rejected. The community is not willing to accept
  a new patch about $subject.
* Withdrawn: the author decided to remove their patch from the commitfest.
* Returned wF: some feedback was provided for the patch and the community
  wouldn't mind accepting a new patch when and if this feedback is
  accounted for.
* Moved to next CF: the patch is still waiting for the author, the
  reviewers or a committer at the end of the current CF.
* Committed: the patch has been committed.

> Attached is a (!) simplified diagram of a typical patch lifecycle.
> Hopefully it will help a bit.

It misses a Withdrawn box :) I suppose it would be linked from the Waiting
on Author one.

> > I couldn't find a clear definition searching the wiki, the mailing list
> > (too many unrelated results) or in the app itself.
>
> Yes, this could be tribe knowledge to a certain degree at the
> moment. On the flip side this is also an opportunity to contribute an
> article to the Wiki.

I suppose these definitions might go in:
https://wiki.postgresql.org/wiki/Reviewing_a_Patch

However, I'm not strictly sure who is responsible for setting these
statuses. The reviewer? The author? The committer? The CF manager? I bet on
the reviewer, but it seems weird that a random reviewer can reject a patch
on their own behalf.

Regards,
Commitfest documentation
Hi,

In the commitfest application, I was wondering today what is the exact
meaning of, and difference between, the open/closed statuses (is it only
for the current commitfest?), and between «waiting for author» and
«Returned with feedback».

I couldn't find a clear definition searching the wiki, the mailing list
(too many unrelated results) or in the app itself.

Maybe the commitfest home page/menu could link to some documentation hosted
in the wiki? Eg. in https://wiki.postgresql.org/wiki/Reviewing_a_Patch

Thoughts? Pointers I missed?

Regards,
Re: Transparent column encryption
Hi,

I did a review of the documentation and usability.

# Applying patch

The patch applied on top of f13b2088fa2 without trouble. I noticed a small
warning during compilation:

    colenccmds.c:134:27: warning: ‘encval’ may be used uninitialized

A simple fix could be:

    +++ b/src/backend/commands/colenccmds.c
    @@ -119,2 +119,3
         encval = defGetString(encvalEl);
    +    *encval_p = encval;
     }
    @@ -132,4 +133,2
         *alg_p = alg;
    -    if (encval_p)
    -        *encval_p = encval;
     }

# Documentation

* In page "create_column_encryption_key.sgml", both encryption algorithms
  for a CMK are declared as the default one:

    +
    + The encryption algorithm that was used to encrypt the key material of
    + this column encryption key. Supported algorithms are:
    +
    + RSAES_OAEP_SHA_1 (default)
    +
    + RSAES_OAEP_SHA_256 (default)
    +

  As far as I understand the code, I suppose RSAES_OAEP_SHA_1 should be the
  default.

I believe two pieces of information should be clearly shown to the user
somewhere in chapter 5.5 instead of being buried deep in the documentation:

* «COPY does not support column decryption», currently buried in the
  pg_dump page;
* «When transparent column encryption is enabled, the client encoding must
  match the server encoding», currently buried in the protocol description
  page.

* In the libpq doc of PQexecPrepared2, "paramForceColumnEncryptions" might
  deserve a little more detail about the array format, like «0 means "don't
  enforce" and anything else enforces that encryption is enabled on this
  column». By the way, maybe this array could be an array of booleans?

* In chapter 55.2.5 (protocol-flow) it is stated: «when column encryption
  is used, the plaintext is always in text format (not binary format).»
  Does it mean parameter "resultFormat" in "PQexecPrepared2" should always
  be 0? If yes, is it worth keeping this argument? Moreover, this format
  constraint should probably be explained in the libpq page as well.
# Protocol

* In the ColumnEncryptionKey message, it seems the field holding the length
  of the key material is redundant with the message length itself, as all
  other fields have a known size. The key material length is the message
  length - (4+4+4+2). For comparison, the AuthenticationSASLContinue
  message has a variable data size but relies only on the message length,
  without an additional field.

* I wonder if encryption related fields in ParameterDescription and
  RowDescription could be optional somehow? The former might be quite large
  when using a lot of parameters (like, imagine a large and ugly
  "IN($1...$N)"). On the other hand, these messages are not sent at high
  frequency anyway...

# libpq

Would it be possible to have an encryption-ready PQexecParams() equivalent
of what PQprepare/describe/exec do?

# psql

* You already mark \d in the TODO. But having some psql command to list the
  known CEK/CMK might be useful as well.

* About write queries using psql, would it be possible to use psql
  parameters? Eg.:

    => \set c1 myval
    => INSERT INTO mytable VALUES (:'c1') \gencr

# Manual tests

* The lookup error message is shown twice for some reason:

    => select * from test_tce;
    no CMK lookup found for realm ""

    no CMK lookup found for realm ""

  It might be worth adding the column name and the CMK/CEK names related to
  each error message? Last, notice the useless empty line between messages.

* When using "DROP IF EXISTS" on a missing CEK or CMK, the command raises
  an "unrecognized object type" error:

    => DROP COLUMN MASTER KEY IF EXISTS noexists;
    ERROR: unrecognized object type: 10

    => DROP COLUMN ENCRYPTION KEY IF EXISTS noexists;
    ERROR: unrecognized object type: 8

* I was wondering what "pg_typeof" should return. It currently returns
  "pg_encrypted_*". If this is supposed to be transparent from the client's
  perspective, shouldn't it return "attrealtypid" when the field is
  encrypted?

* Any reason not to support altering the CMK realm?

This patch is really interesting and would be a nice addition to the core.
Thanks!
Regards,
Re: [BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)
On Fri, 30 Sep 2022 16:11:09 -0700 Zhihong Yu wrote: > On Fri, Sep 30, 2022 at 3:30 PM Jehan-Guillaume de Rorthais > wrote: ... > > +* Self-Foreign keys are ignored as the index was preliminary > created > > preliminary created -> primarily created Thank you! This is fixed and rebased on current master branch in patches attached. Regards, >From 34ee3a3737e36d440824db3e8eb7c8f802a7276a Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais Date: Fri, 30 Sep 2022 23:42:21 +0200 Subject: [PATCH 1/3] Fix bug between constraint parenting and self-FK Function get_relation_idx_constraint_oid() assumed an index can only be associated to zero or one constraint for a given relation. However, if the relation has a self-foreign key, the index could be referenced as enforcing two different constraints on the same relation: the PK and the FK. This lead to a bug with partitionned table where the self-FK could become the parent of a partition's PK. --- src/backend/catalog/pg_constraint.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c index bb65fb1e0a..ba7a8ff965 100644 --- a/src/backend/catalog/pg_constraint.c +++ b/src/backend/catalog/pg_constraint.c @@ -985,8 +985,8 @@ get_relation_constraint_attnos(Oid relid, const char *conname, } /* - * Return the OID of the constraint associated with the given index in the - * given relation; or InvalidOid if no such index is catalogued. + * Return the OID of the original constraint enforced by the given index + * in the given relation; or InvalidOid if no such index is catalogued. 
*/ Oid get_relation_idx_constraint_oid(Oid relationId, Oid indexId) @@ -1011,6 +1011,18 @@ get_relation_idx_constraint_oid(Oid relationId, Oid indexId) Form_pg_constraint constrForm; constrForm = (Form_pg_constraint) GETSTRUCT(tuple); + + /* + * Self-Foreign keys are ignored as the index was primarily created + * to enforce a PK or unique constraint in the first place. This means + * only primary key, unique constraint or an exlusion one can be + * returned. + */ + if (constrForm->contype != CONSTRAINT_PRIMARY + && constrForm->contype != CONSTRAINT_UNIQUE + && constrForm->contype != CONSTRAINT_EXCLUSION) + continue; + if (constrForm->conindid == indexId) { constraintId = constrForm->oid; -- 2.37.3 >From 6c8d9595dc1b2b62e3cc5ce3bc1c970b6eedcbc2 Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais Date: Fri, 30 Sep 2022 23:54:04 +0200 Subject: [PATCH 2/3] Fix bug where a self-FK was cloned twice on partitions A tbale with a self-foreign keys appears on both the referenced and referencing side of the constraint. Because of this, when creating or attaching a new partition to a table, the self-FK was cloned twice, by CloneFkReferenced() and CloneFkReferencing() functions. --- src/backend/commands/tablecmds.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 7d8a75d23c..d19d917571 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -9968,6 +9968,9 @@ CloneForeignKeyConstraints(List **wqueue, Relation parentRel, * clone those constraints to the given partition. This is to be called * when the partition is being created or attached. * + * Note that this ignore self-referenced FK. Those are handled in + * CloneFkReferencing(). + * * This recurses to partitions, if the relation being attached is partitioned. * Recursion is done by calling addFkRecurseReferenced. 
*/ @@ -10047,10 +10050,14 @@ CloneFkReferenced(Relation parentRel, Relation partitionRel) constrForm = (Form_pg_constraint) GETSTRUCT(tuple); /* - * As explained above: don't try to clone a constraint for which we're - * going to clone the parent. + * As explained above and in the function header: + * - don't try to clone a constraint for which we're going to clone + * the parent. + * - don't clone a self-referenced constraint here as it is handled + * on the CloneFkReferencing() side */ - if (list_member_oid(clone, constrForm->conparentid)) + if (list_member_oid(clone, constrForm->conparentid) || + constrForm->conrelid == RelationGetRelid(parentRel)) { ReleaseSysCache(tuple); continue; -- 2.37.3 >From a08ed953e3cd559ea1fec78301cb72e611f9387e Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais Date: Sat, 1 Oct 2022 00:00:28 +0200 Subject: [PATCH 3/3] Add regression tests about self-FK with partitions --- src/test/regress/expected/constraints.out | 37 +++ src/test/regress/sql/constraints.sql | 31 +++ 2 files changed, 68 insertions(+) diff --git a/src/test/regress/expected/constraints.out b/src/test/regress/expected/constraints.out index e6f6
Re: [BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)
Hi,

Please find in attachment a small series of patches:

0001 fixes the constraint parenting bug. Not much to say. It's basically
your patch we discussed, with some more comments and a check that contype
equals either primary, unique or exclusion.

0002 fixes the self-FK being cloned twice on partitions.

0003 adds a regression test validating both fixes.

I should confess that even with these fixes, I'm still wondering about this
code's sanity, as we can still end up with a PK on a partition being
parented with a simple unique constraint from the table, on a field that is
not even NOT NULL:

    DROP TABLE IF EXISTS parted_self_fk, part_with_pk;

    CREATE TABLE parted_self_fk (
        id bigint,
        id_abc bigint,
        FOREIGN KEY (id_abc) REFERENCES parted_self_fk(id),
        UNIQUE (id)
    )
    PARTITION BY RANGE (id);

    CREATE TABLE part_with_pk (
        id bigint PRIMARY KEY,
        id_abc bigint,
        CHECK ((id >= 0 AND id < 10))
    );

    ALTER TABLE parted_self_fk ATTACH PARTITION part_with_pk
        FOR VALUES FROM (0) TO (10);

    SELECT cr.relname, co.conname, co.contype, p.conname AS conparentrelname
    FROM pg_catalog.pg_constraint co
    JOIN pg_catalog.pg_class cr ON cr.oid = co.conrelid
    LEFT JOIN pg_catalog.pg_constraint p ON p.oid = co.conparentid
    WHERE cr.relname IN ('parted_self_fk', 'part_with_pk')
      AND co.contype IN ('u', 'p');

    DROP TABLE parted_self_fk;

    DROP TABLE
    CREATE TABLE
    CREATE TABLE
    ALTER TABLE
        relname     |        conname        | contype |   conparentrelname
    ----------------+-----------------------+---------+-----------------------
     parted_self_fk | parted_self_fk_id_key | u       |
     part_with_pk   | part_with_pk_pkey     | p       | parted_self_fk_id_key
    (2 rows)

Nothing forbids the partition from having stricter constraints than the
parent table, but it feels weird, so it might be worth noting it here.

I wonder if AttachPartitionEnsureConstraints() should exist and take care
of comparing/cloning constraints before calling
AttachPartitionEnsureIndexes(), which would then handle missing indexes
without paying attention to related constraints?
Regards, On Wed, 24 Aug 2022 12:49:13 +0200 Alvaro Herrera wrote: > On 2022-Aug-24, Jehan-Guillaume de Rorthais wrote: > > > I was naively wondering about such a patch, but was worrying about potential > > side effects on ATExecAttachPartitionIdx(), DetachPartitionFinalize() and > > DefineIndex() where I didn't had a single glance. Did you had a look? > > No. But AFAIR all the code there is supposed to worry about unique > constraints and PK only, not FKs. So if something changes, then most > likely it was wrong to begin with. > > > I did a quick ATTACH + DETACH test, and it seems DETACH partly fails with > > its housecleaning: > > Ugh. More fixes required, then. > > > Looking for few minutes in ATExecDetachPartitionFinalize(), it seems it only > > support removing the parental link on FK, not to clean the FKs added during > > the ATTACH DDL anyway. That explains the FK child1->parent left behind. But > > in fact, this let me wonder if this part of the code ever considered > > implication of self-FK during the ATTACH and DETACH process? > > No, or at least I don't remember thinking about self-referencing FKs. > If there are no tests for it, then that's likely what happened. > > > Why in the first place TWO FK are created during the ATTACH DDL? > > That's probably a bug too. > >From 146f40285ee5f773f68205f1cbafe1cbec46c5c4 Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais Date: Fri, 30 Sep 2022 23:42:21 +0200 Subject: [PATCH 1/3] Fix bug between constraint parenting and self-FK Function get_relation_idx_constraint_oid() assumed an index can only be associated to zero or one constraint for a given relation. However, if the relation has a self-foreign key, the index could be referenced as enforcing two different constraints on the same relation: the PK and the FK. This lead to a bug with partitionned table where the self-FK could become the parent of a partition's PK. 
--- src/backend/catalog/pg_constraint.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c index bb65fb1e0a..be26dbe81d 100644 --- a/src/backend/catalog/pg_constraint.c +++ b/src/backend/catalog/pg_constraint.c @@ -985,8 +985,8 @@ get_relation_constraint_attnos(Oid relid, const char *conname, } /* - * Return the OID of the constraint associated with the given index in the - * given relation; or InvalidOid if no such index is catalogued. + * Return the OID of the original constraint enforced by the given index + * in the given relation; or InvalidOid if no such index is catalogued. */ Oid get_relation_idx_constraint_oid(Oid relationId, Oid indexId) @@ -1011,6 +1011,18 @@ get_relation_idx_con
Re: [BUG] wrong FK constraint name when colliding name on ATTACH
On Thu, 8 Sep 2022 13:25:15 +0200 Alvaro Herrera wrote: > On 2022-Sep-08, Jehan-Guillaume de Rorthais wrote: > > > Hi there, > > > > I believe this very small bug and its fix are really trivial and could be > > push out of the way quite quickly. It's just about a bad constraint name > > fixed by moving one assignation after the next one. This could easily be > > fixed for next round of releases. > > > > Well, I hope I'm not wrong :) > > I think you're right, so pushed, and backpatched to 12. I added the > test case to regression also. Great, thank you for the additional work on the regression test and the commit! > For 11, I adjusted the test case so that it didn't depend on an FK > pointing to a partitioned table (which is not supported there); it turns > out that the old code is not smart enough to get into the problem in the > first place. [...] > It seems fair to say that this case, with pg11, is unsupported and > people should upgrade if they want better behavior. That works for me. Thanks!
Re: [BUG] wrong FK constraint name when colliding name on ATTACH
Hi there,

I believe this very small bug and its fix are really trivial and could be
pushed out of the way quite quickly. It's just about a bad constraint name,
fixed by moving one assignment after the next one. This could easily be
fixed for the next round of releases.

Well, I hope I'm not wrong :)

Regards,

On Thu, 1 Sep 2022 18:41:56 +0200 Jehan-Guillaume de Rorthais wrote:

> While studying and hacking on the parenting constraint issue, I found an
> incoherent piece of code leading to a badly chosen fk name. If a
> constraint name collision is detected, while choosing a new name for the
> constraint, the code uses fkconstraint->fk_attrs which is not yet
> populated:
>
>     /* No dice. Set up to create our own constraint */
>     fkconstraint = makeNode(Constraint);
>     if (ConstraintNameIsUsed(CONSTRAINT_RELATION,
>                              RelationGetRelid(partRel),
>                              NameStr(constrForm->conname)))
>         fkconstraint->conname =
>             ChooseConstraintName(RelationGetRelationName(partRel),
>                                  ChooseForeignKeyConstraintNameAddition(
>                                      fkconstraint->fk_attrs), // <= WOO000OPS
>                                  "fkey",
>                                  RelationGetNamespace(partRel), NIL);
>     else
>         fkconstraint->conname = pstrdup(NameStr(constrForm->conname));
>     fkconstraint->fk_upd_action = constrForm->confupdtype;
>     fkconstraint->fk_del_action = constrForm->confdeltype;
>     fkconstraint->deferrable = constrForm->condeferrable;
>     fkconstraint->initdeferred = constrForm->condeferred;
>     fkconstraint->fk_matchtype = constrForm->confmatchtype;
>     for (int i = 0; i < numfks; i++)
>     {
>         Form_pg_attribute att;
>
>         att = TupleDescAttr(RelationGetDescr(partRel),
>                             mapped_conkey[i] - 1);
>         fkconstraint->fk_attrs = lappend(fkconstraint->fk_attrs, // <= POPULATING
>                                          makeString(NameStr(att->attname)));
>     }
>
> The following SQL script showcases the bad constraint name:
>
>     DROP TABLE IF EXISTS parent, child1;
>
>     CREATE TABLE parent (
>         id bigint NOT NULL default 1,
>         no_part smallint NOT NULL,
>         id_abc bigint,
>         CONSTRAINT dummy_constr FOREIGN KEY (id_abc, no_part)
>             REFERENCES parent(id, no_part)
>             ON UPDATE RESTRICT ON DELETE RESTRICT,
>         PRIMARY KEY (id, no_part)
>     )
>     PARTITION BY LIST (no_part);
>
>     CREATE TABLE child1 (
>         id bigint NOT NULL default 1,
>         no_part smallint NOT NULL,
>         id_abc bigint,
>         PRIMARY KEY (id, no_part),
>         CONSTRAINT dummy_constr CHECK ((no_part = 1))
>     );
>
>     ALTER TABLE parent ATTACH PARTITION child1 FOR VALUES IN ('1');
>
>     SELECT conname
>     FROM pg_constraint
>     WHERE conrelid = 'child1'::regclass
>       AND contype = 'f';
>
>     DROP TABLE
>     CREATE TABLE
>     CREATE TABLE
>     ALTER TABLE
>
>         conname
>     --------------
>      child1__fkey
>     (1 row)
>
> The resulting constraint name "child1__fkey" is missing the attribute
> names the original code wanted to add. The expected name is
> "child1_id_abc_no_part_fkey".
>
> Find in attachment a simple fix, moving the name assignment after the
> FK attributes are populated.
>
> Regards,
[BUG] wrong FK constraint name when colliding name on ATTACH
Hi,

While studying and hacking on the parenting constraint issue, I found an
incoherent piece of code leading to a badly chosen FK name. If a constraint
name collision is detected, while choosing a new name for the constraint, the
code uses fkconstraint->fk_attrs, which is not yet populated:

  /* No dice. Set up to create our own constraint */
  fkconstraint = makeNode(Constraint);
  if (ConstraintNameIsUsed(CONSTRAINT_RELATION,
                           RelationGetRelid(partRel),
                           NameStr(constrForm->conname)))
      fkconstraint->conname =
          ChooseConstraintName(RelationGetRelationName(partRel),
                               ChooseForeignKeyConstraintNameAddition(
                                   fkconstraint->fk_attrs), // <= WOO000OPS
                               "fkey",
                               RelationGetNamespace(partRel), NIL);
  else
      fkconstraint->conname = pstrdup(NameStr(constrForm->conname));
  fkconstraint->fk_upd_action = constrForm->confupdtype;
  fkconstraint->fk_del_action = constrForm->confdeltype;
  fkconstraint->deferrable = constrForm->condeferrable;
  fkconstraint->initdeferred = constrForm->condeferred;
  fkconstraint->fk_matchtype = constrForm->confmatchtype;
  for (int i = 0; i < numfks; i++)
  {
      Form_pg_attribute att;

      att = TupleDescAttr(RelationGetDescr(partRel),
                          mapped_conkey[i] - 1);
      fkconstraint->fk_attrs = lappend(fkconstraint->fk_attrs, // <= POPULATING
                                       makeString(NameStr(att->attname)));
  }

The following SQL script showcases the bad constraint name:

  DROP TABLE IF EXISTS parent, child1;

  CREATE TABLE parent (
      id bigint NOT NULL default 1,
      no_part smallint NOT NULL,
      id_abc bigint,
      CONSTRAINT dummy_constr FOREIGN KEY (id_abc, no_part)
          REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT,
      PRIMARY KEY (id, no_part)
  )
  PARTITION BY LIST (no_part);

  CREATE TABLE child1 (
      id bigint NOT NULL default 1,
      no_part smallint NOT NULL,
      id_abc bigint,
      PRIMARY KEY (id, no_part),
      CONSTRAINT dummy_constr CHECK ((no_part = 1))
  );

  ALTER TABLE parent ATTACH PARTITION child1 FOR VALUES IN ('1');

  SELECT conname
  FROM pg_constraint
  WHERE conrelid = 'child1'::regclass
    AND contype = 'f';

  DROP TABLE
  CREATE TABLE
  CREATE TABLE
  ALTER TABLE

     conname
  --------------
   child1__fkey
  (1 row)

The resulting constraint name "child1__fkey" is missing the attribute names
the original code wanted to add. The expected name is
"child1_id_abc_no_part_fkey".

Find in attachment a simple fix, moving the name assignment after the FK
attributes are populated.

Regards,

>From f1eeacb1eb3face38d76325666aff57019ef84c9 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais
Date: Thu, 1 Sep 2022 17:41:55 +0200
Subject: [PATCH] Fix FK name when colliding during partition attachment

During ALTER TABLE ATTACH PARTITION, if the name of a parent's foreign key
constraint is already used on the partition, the code tries to choose another
one before the FK attributes list has been populated. The resulting
constraint name was a strange "<relname>__fkey" instead of
"<relname>_<attrs>_fkey".
---
 src/backend/commands/tablecmds.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index dacc989d85..53b0f3a9c1 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10304,16 +10304,6 @@ CloneFkReferencing(List **wqueue, Relation parentRel, Relation partRel)
 	/* No dice. Set up to create our own constraint */
 	fkconstraint = makeNode(Constraint);
-	if (ConstraintNameIsUsed(CONSTRAINT_RELATION,
-							 RelationGetRelid(partRel),
-							 NameStr(constrForm->conname)))
-		fkconstraint->conname =
-			ChooseConstraintName(RelationGetRelationName(partRel),
-								 ChooseForeignKeyConstraintNameAddition(fkconstraint->fk_attrs),
-								 "fkey",
-								 RelationGetNamespace(partRel), NIL);
-	else
-		fkconstraint->conname = pstrdup(NameStr(constrForm->conname));
 	fkconstraint->fk_upd_action = constrForm->confupdtype;
 	fkconstraint->fk_del_action = constrForm->confdeltype;
 	fkconstraint->deferrable = constrForm->condeferrable;
@@ -10328,6 +10318,16 @@ CloneFkReferencing(List **wqueue, Relation parentRel, Relation partRel)
 		fkconstraint->fk_attrs = lappend(fkconstraint->fk_attrs,
 										 makeString(NameStr(att->attname)));
 	}
+	if (ConstraintNameIsUsed(CONSTRAINT_RELATION,
+							 RelationGetRelid(partRel),
+							 NameStr(constrForm->conname)))
+		fkconstraint->conname =
+			ChooseConstraintName(RelationGetRelationName(partRel),
+
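To make the failure mode concrete, here is a small Python sketch. This is illustrative only: the helper names merely mimic the C functions ChooseForeignKeyConstraintNameAddition() and ChooseConstraintName(), ignoring NAMEDATALEN truncation and collision retries. It shows why choosing the name while fk_attrs is still empty produces "child1__fkey":

```python
def fk_name_addition(fk_attrs):
    # Mimics ChooseForeignKeyConstraintNameAddition(): concatenate the FK
    # column names, separated by underscores.
    return "_".join(fk_attrs)

def choose_constraint_name(relname, addition, label):
    # Mimics ChooseConstraintName()'s "name1_name2_label" pattern, without
    # the NAMEDATALEN truncation and uniqueness retry loop.
    return "_".join((relname, addition, label))

# Bug: the name is chosen while fk_attrs is still empty...
print(choose_constraint_name("child1", fk_name_addition([]), "fkey"))
# -> child1__fkey

# ...while choosing it after the population loop gives the expected name.
print(choose_constraint_name("child1",
                             fk_name_addition(["id_abc", "no_part"]), "fkey"))
# -> child1_id_abc_no_part_fkey
```

The sketch makes it obvious why the fix is purely a matter of statement ordering: the same name-building code runs correctly once it is moved after the loop that fills fk_attrs.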
Re: [BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)
On Tue, 23 Aug 2022 18:30:06 +0200
Alvaro Herrera wrote:

> On 2022-Aug-23, Jehan-Guillaume de Rorthais wrote:
> > Hi,
> > [...]
> > However, it seems get_relation_idx_constraint_oid(), introduced in
> > eb7ed3f3063, assumes there could be only ONE constraint depending on an
> > index. But in fact, multiple constraints can rely on the same index, eg.:
> > the PK and a self-referencing FK. In consequence, when looking for a
> > constraint depending on an index for the given relation, either the FK or
> > a PK can appear first depending on various conditions. It is then
> > possible to trick it into making a FK constraint a parent of a PK...
>
> Hmm, wow, that sounds extremely stupid. I think a sufficient fix might
> be to have get_relation_idx_constraint_oid ignore any constraints that
> are not unique or primary keys. I tried your scenario with the attached
> and it seems to work correctly. Can you confirm? (I only ran the
> pg_regress tests, not anything else for now.)
>
> If this is OK, we should make this API quirkiness very explicit in the
> comments, so the patch needs to be a few lines larger in order to be
> committable. Also, perhaps the check should be that contype equals
> either primary or unique, rather than it doesn't equal foreign.

I was naively wondering about such a patch, but was worrying about potential
side effects on ATExecAttachPartitionIdx(), DetachPartitionFinalize() and
DefineIndex(), where I hadn't taken a single glance. Did you have a look?
I did a quick ATTACH + DETACH test, and it seems DETACH partly fails with its
housecleaning:

  DROP TABLE IF EXISTS parent, child1;

  CREATE TABLE parent (
      id bigint NOT NULL default 1,
      no_part smallint NOT NULL,
      id_abc bigint,
      FOREIGN KEY (id_abc, no_part) REFERENCES parent(id, no_part)
          ON UPDATE RESTRICT ON DELETE RESTRICT,
      PRIMARY KEY (id, no_part)
  )
  PARTITION BY LIST (no_part);

  CREATE TABLE child1 (
      id bigint NOT NULL default 1,
      no_part smallint NOT NULL,
      id_abc bigint,
      PRIMARY KEY (id, no_part),
      CONSTRAINT child1 CHECK ((no_part = 1))
  );

  \C 'Before ATTACH'
  SELECT oid, conname, conparentid, conrelid, confrelid
  FROM pg_constraint
  WHERE conrelid in ('parent'::regclass, 'child1'::regclass)
  ORDER BY 1;

  ALTER TABLE parent ATTACH PARTITION child1 FOR VALUES IN ('1');

  \C 'After ATTACH'
  SELECT oid, conname, conparentid, conrelid, confrelid
  FROM pg_constraint
  WHERE conrelid in ('parent'::regclass, 'child1'::regclass)
  ORDER BY 1;

  ALTER TABLE parent DETACH PARTITION child1;

  \C 'After DETACH'
  SELECT oid, conname, conparentid, conrelid, confrelid
  FROM pg_constraint
  WHERE conrelid in ('parent'::regclass, 'child1'::regclass)
  ORDER BY 1;

                             Before ATTACH
    oid  |          conname           | conparentid | conrelid | confrelid
  -------+----------------------------+-------------+----------+-----------
   24711 | parent_pkey                |           0 |    24706 |         0
   24712 | parent_id_abc_no_part_fkey |           0 |    24706 |     24706
   24721 | child1                     |           0 |    24717 |         0
   24723 | child1_pkey                |           0 |    24717 |         0
  (4 rows)

                              After ATTACH
    oid  |           conname           | conparentid | conrelid | confrelid
  -------+-----------------------------+-------------+----------+-----------
   24711 | parent_pkey                 |           0 |    24706 |         0
   24712 | parent_id_abc_no_part_fkey  |           0 |    24706 |     24706
   24721 | child1                      |           0 |    24717 |         0
   24723 | child1_pkey                 |       24711 |    24717 |         0
   24724 | parent_id_abc_no_part_fkey1 |       24712 |    24706 |     24717
   24727 | parent_id_abc_no_part_fkey  |       24712 |    24717 |     24706
  (6 rows)

                              After DETACH
    oid  |          conname           | conparentid | conrelid | confrelid
  -------+----------------------------+-------------+----------+-----------
   24711 | parent_pkey                |           0 |    24706 |         0
   24712 | parent_id_abc_no_part_fkey |           0 |    24706 |     24706
   24721 | child1                     |           0 |    24717 |         0
   24723 | child1_pkey                |           0 |    24717 |         0
   24727 | parent_id_abc_no_part_fkey |           0 |    24717 |     24706
  (5 rows)

After looking for a few minutes at ATExecDetachPartitionFinalize(), it seems
it only supports removing the parental link on the FK, not cleaning up the
FKs added during the ATTACH DDL anyway. That explains the FK child1->parent
left behind.

But in fact, this makes me wonder whether this part of the code ever
considered the implications of self-FKs during the ATTACH and DETACH process.
Why are TWO FKs created during the ATTACH DDL in the first place?

Regards,
[BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)
Hi all,

I've been able to work on this issue and isolate where in the code the oddity
lies.

During ATExecAttachPartition(), AttachPartitionEnsureIndexes() looks for the
existing required indexes on the partition to attach. It creates any missing
index, or sets the parent's index when a matching one exists on the
partition. Good.

When a matching index is found, if the parent index enforces a constraint,
the function looks for the similar constraint in the partition-to-be, and
sets the constraint parent as well:

  constraintOid = get_relation_idx_constraint_oid(RelationGetRelid(rel), idx);

  [...]

  /*
   * If this index is being created in the parent because of a
   * constraint, then the child needs to have a constraint also,
   * so look for one.  If there is no such constraint, this
   * index is no good, so keep looking.
   */
  if (OidIsValid(constraintOid))
  {
      cldConstrOid = get_relation_idx_constraint_oid(
          RelationGetRelid(attachrel), cldIdxId);

      /* no dice */
      if (!OidIsValid(cldConstrOid))
          continue;
  }

  /* bingo. */
  IndexSetParentIndex(attachrelIdxRels[i], idx);
  if (OidIsValid(constraintOid))
      ConstraintSetParentConstraint(cldConstrOid, constraintOid,
                                    RelationGetRelid(attachrel));

However, it seems get_relation_idx_constraint_oid(), introduced in
eb7ed3f3063, assumes there could be only ONE constraint depending on an
index. But in fact, multiple constraints can rely on the same index, eg.: the
PK and a self-referencing FK. In consequence, when looking for a constraint
depending on an index for the given relation, either the FK or a PK can
appear first depending on various conditions. It is then possible to trick it
into making a FK constraint a parent of a PK...
In the following little scenario, when looking at the constraints linked to
the PK unique index using the same index that get_relation_idx_constraint_oid()
uses, it is the self-FK that is actually returned first by
get_relation_idx_constraint_oid(), NOT the PK:

  postgres=# DROP TABLE IF EXISTS parent, child1;

  CREATE TABLE parent (
      id bigint NOT NULL default 1,
      no_part smallint NOT NULL,
      id_abc bigint,
      FOREIGN KEY (id_abc, no_part) REFERENCES parent(id, no_part)
          ON UPDATE RESTRICT ON DELETE RESTRICT,
      PRIMARY KEY (id, no_part)
  )
  PARTITION BY LIST (no_part);

  CREATE TABLE child1 (
      id bigint NOT NULL default 1,
      no_part smallint NOT NULL,
      id_abc bigint,
      PRIMARY KEY (id, no_part),
      CONSTRAINT child1 CHECK ((no_part = 1))
  );

  -- force an indexscan as get_relation_idx_constraint_oid() uses the unique
  -- index on (conrelid, contypid, conname) to scan pg_constraint
  set enable_seqscan TO off;
  set enable_bitmapscan TO off;

  SELECT conname
  FROM pg_constraint
  WHERE conrelid = 'parent'::regclass          <=== parent
    AND conindid = 'parent_pkey'::regclass;    <=== PK index

  DROP TABLE
  CREATE TABLE
  CREATE TABLE
  SET
  SET

            conname
  ----------------------------
   parent_id_abc_no_part_fkey    < WOOPS!
   parent_pkey
  (2 rows)

In consequence, when attaching the partition, the PK of child1 is not marked
as a partition of the parent's PK, which is wrong.
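To spell out the ordering hazard, here is a Python sketch. It is illustrative only: the dictionaries are hypothetical stand-ins for pg_constraint rows, not the actual C API. It contrasts the current "first match wins" lookup with a lookup restricted to primary key and unique constraints, i.e. the kind of contype filter discussed upthread:

```python
def get_relation_idx_constraint_oid(constraints, relid, indexid):
    # Sketch of the current lookup: return the first constraint found on
    # (relid, indexid), whatever its type. A self-FK can win over the PK,
    # since both depend on the same unique index.
    for c in constraints:
        if c["conrelid"] == relid and c["conindid"] == indexid:
            return c["oid"]
    return None

def get_relation_idx_constraint_oid_filtered(constraints, relid, indexid):
    # Sketch of the proposed behavior: only consider primary key ('p') and
    # unique ('u') constraints, so a foreign key ('f') can never be picked.
    for c in constraints:
        if (c["conrelid"] == relid and c["conindid"] == indexid
                and c["contype"] in ("p", "u")):
            return c["oid"]
    return None

# Rows as in the scenario above: the self-FK comes back first from the
# (conrelid, contypid, conname) index scan, before the PK.
rows = [
    {"oid": 16701, "conrelid": 16695, "conindid": 16700, "contype": "f"},  # self-FK
    {"oid": 16700, "conrelid": 16695, "conindid": 16700, "contype": "p"},  # PK
]

print(get_relation_idx_constraint_oid(rows, 16695, 16700))           # -> 16701 (WOOPS)
print(get_relation_idx_constraint_oid_filtered(rows, 16695, 16700))  # -> 16700 (the PK)
```

The sketch shows the one-row assumption breaking as soon as two constraints share an index, and why filtering on contype is enough to make the lookup deterministic again.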
WORSE, the PK of child1 is actually unexpectedly marked as a partition of the
parent's **self-FK**:

  postgres=# ALTER TABLE ONLY parent ATTACH PARTITION child1 FOR VALUES IN ('1');

  SELECT oid, conname, conparentid, conrelid, confrelid
  FROM pg_constraint
  WHERE conrelid in ('parent'::regclass, 'child1'::regclass)
  ORDER BY 1;

  ALTER TABLE
    oid  |           conname           | conparentid | conrelid | confrelid
  -------+-----------------------------+-------------+----------+-----------
   16700 | parent_pkey                 |           0 |    16695 |         0
   16701 | parent_id_abc_no_part_fkey  |           0 |    16695 |     16695
   16706 | child1                      |           0 |    16702 |         0
   16708 | **child1_pkey**             |   **16701** |    16702 |         0
   16709 | parent_id_abc_no_part_fkey1 |       16701 |    16695 |     16702
   16712 | parent_id_abc_no_part_fkey  |       16701 |    16702 |     16695
  (6 rows)

The expected result should probably be something like:

    oid  |           conname           | conparentid | conrelid | confrelid
  -------+-----------------------------+-------------+----------+-----------
   16700 | parent_pkey                 |           0 |    16695 |         0
   ...
   16708 | child1_pkey                 |       16700 |    16702 |         0

I suppose this bug might
Re: Fix proposal for comparison bugs in PostgreSQL::Version
On Tue, 5 Jul 2022 09:59:42 -0400
Andrew Dunstan wrote:

> On 2022-07-03 Su 16:12, Jehan-Guillaume de Rorthais wrote:
> > On Sun, 3 Jul 2022 10:40:21 -0400
> > Andrew Dunstan wrote:
> >
> >> On 2022-06-29 We 05:09, Jehan-Guillaume de Rorthais wrote:
> >>> On Tue, 28 Jun 2022 18:17:40 -0400
> >>> Andrew Dunstan wrote:
> >>>
> >>>> On 2022-06-28 Tu 16:53, Jehan-Guillaume de Rorthais wrote:
> >>>>> ...
> >>>>> A better fix would be to store the version internally as version_num
> >>>>> that are trivial to compute and compare. Please, find in attachment an
> >>>>> implementation of this.
> >>>>>
> >>>>> The patch is a bit bigger because it improved the devel version to
> >>>>> support rc/beta/alpha comparison like 14rc2 > 14rc1.
> >>>>>
> >>>>> Moreover, it adds a bunch of TAP tests to check various use cases.
> >>>> Nice catch, but this looks like massive overkill. I think we can very
> >>>> simply fix the test in just a few lines of code, instead of a 190 line
> >>>> fix and a 130 line TAP test.
> >>> I explained why the patch was a little bit larger than required: it
> >>> fixes the bugs and does a little bit more. The _version_cmp sub is
> >>> shorter and easier to understand, I use multi-line code where I could
> >>> probably fold them into a one-liner, added some comments... Anyway, I
> >>> don't feel the number of lines changed is "massive". But I can probably
> >>> remove some code and shrink some other if it is really important...
> >>>
> >>> Moreover, to be honest, I don't mind the number of additional lines of
> >>> TAP tests. Especially since it runs really, really fast and doesn't
> >>> hurt day-to-day devs as it is independent from other TAP tests anyway.
> >>> It could be 1k, if it runs fast, is meaningful and helps avoiding
> >>> future regressions, I would welcome the addition.
> >>
> >> I don't see the point of having a TAP test at all. We have TAP tests for
> >> testing the substantive products we test, not for the test suite
> >> infrastructure. Otherwise, where will we stop? Shall we have tests for
> >> the things that test the test suite?
> > Tons of perl modules have regression tests. When questioning where
> > testing should stop, it seems the Test::More module itself is not the
> > last frontier:
> > https://github.com/Test-More/test-more/tree/master/t
> >
> > Moreover, PostgreSQL::Version is not a TAP test module, but a module to
> > deal with PostgreSQL versions and compare them.
> >
> > Testing makes development faster as well when it comes to testing code.
> > Instead of testing vaguely manually, you can test a whole bunch of
> > situations and accumulate some more when you think about a new one or
> > when a bug is reported. Having TAP tests helps to make sure the code
> > works as expected.
> >
> > It helped me when creating my patch. With all due respect, I just don't
> > understand your arguments against them. The number of lines or
> > questioning when testing should stop doesn't hold much.
>
> There is not a single TAP test in our source code that is aimed at
> testing our test infrastructure as opposed to testing what we are
> actually in the business of building, and I'm not about to add one.

Whatever, it helped me during the dev process of fixing this bug. Remove them
if you are uncomfortable with them.

> Every added test consumes buildfarm cycles and space on the buildfarm
> server for the report, be it ever so small.

They were not supposed to enter the buildfarm cycles. As I wrote earlier,
they do not interfere with day-to-day dev activity.

> Every added test needs maintenance, be it ever so small. There's no such
> thing as a free test (apologies to Heinlein and others).

This is the first argument I can understand.

> > ...
> > But anyway, this is not the point. Using an array to compare versions
> > where we can use version_num seems like useless and buggy convolutions
> > to me.
>
> I think we'll just have to agree to disagree about it.

Noted.

Cheers,
Re: Fix proposal for comparison bugs in PostgreSQL::Version
On Sun, 3 Jul 2022 10:40:21 -0400
Andrew Dunstan wrote:

> On 2022-06-29 We 05:09, Jehan-Guillaume de Rorthais wrote:
> > On Tue, 28 Jun 2022 18:17:40 -0400
> > Andrew Dunstan wrote:
> >
> >> On 2022-06-28 Tu 16:53, Jehan-Guillaume de Rorthais wrote:
> >>> ...
> >>> A better fix would be to store the version internally as version_num
> >>> that are trivial to compute and compare. Please, find in attachment an
> >>> implementation of this.
> >>>
> >>> The patch is a bit bigger because it improved the devel version to
> >>> support rc/beta/alpha comparison like 14rc2 > 14rc1.
> >>>
> >>> Moreover, it adds a bunch of TAP tests to check various use cases.
> >>
> >> Nice catch, but this looks like massive overkill. I think we can very
> >> simply fix the test in just a few lines of code, instead of a 190 line
> >> fix and a 130 line TAP test.
> > I explained why the patch was a little bit larger than required: it
> > fixes the bugs and does a little bit more. The _version_cmp sub is
> > shorter and easier to understand, I use multi-line code where I could
> > probably fold them into a one-liner, added some comments... Anyway, I
> > don't feel the number of lines changed is "massive". But I can probably
> > remove some code and shrink some other if it is really important...
> >
> > Moreover, to be honest, I don't mind the number of additional lines of
> > TAP tests. Especially since it runs really, really fast and doesn't hurt
> > day-to-day devs as it is independent from other TAP tests anyway. It
> > could be 1k, if it runs fast, is meaningful and helps avoiding future
> > regressions, I would welcome the addition.
>
> I don't see the point of having a TAP test at all. We have TAP tests for
> testing the substantive products we test, not for the test suite
> infrastructure. Otherwise, where will we stop? Shall we have tests for
> the things that test the test suite?

Tons of perl modules have regression tests. When questioning where testing
should stop, it seems the Test::More module itself is not the last frontier:
https://github.com/Test-More/test-more/tree/master/t

Moreover, PostgreSQL::Version is not a TAP test module, but a module to deal
with PostgreSQL versions and compare them.

Testing makes development faster as well when it comes to testing code.
Instead of testing vaguely manually, you can test a whole bunch of situations
and accumulate some more when you think about a new one or when a bug is
reported. Having TAP tests helps to make sure the code works as expected.

It helped me when creating my patch. With all due respect, I just don't
understand your arguments against them. The number of lines or questioning
when testing should stop doesn't hold much.

> > If we really want to save some bytes, I have a two-line code fix that
> > looks more readable to me than fixing _version_cmp:
> >
> > +++ b/src/test/perl/PostgreSQL/Version.pm
> > @@ -92,9 +92,13 @@ sub new
> >  	# Split into an array
> >  	my @numbers = split(/\./, $arg);
> >
> > +	# make sure all digits of the array-represented version are set so
> > +	# we can keep _version_cmp code as a "simple" digit-to-digit
> > +	# comparison loop
> > +	$numbers[$_] += 0 for 0..3;
> > +
> >  	# Treat development versions as having a minor/micro version one
> >  	# less than the first released version of that branch.
> > -	push @numbers, -1 if ($devel);
> > +	$numbers[3] = -1 if $devel;
> >
> >  	$devel ||= "";
>
> I don't see why this is any more readable.

The _version_cmp is much more readable. But anyway, this is not the point.
Using an array to compare versions where we can use version_num seems like
useless and buggy convolutions to me.

Regards,
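For readers not fluent in Perl, the padding idea behind that two-line fix can be sketched in Python (an illustration of the approach, not the PostgreSQL::Version code itself): once both arrays have the same fixed length, the digit-by-digit comparison loop becomes safe.

```python
def pad(nums, size=4):
    # Right-pad the version array with zeros, mirroring the proposed
    # "$numbers[$_] += 0 for 0..3;" fix, so element-wise comparison
    # always sees arrays of equal length.
    return (nums + [0] * size)[:size]

def version_cmp(a, b):
    # With equal-length arrays, the original digit-to-digit loop works:
    # return -1, 0 or 1, like Perl's <=> operator.
    for x, y in zip(pad(a), pad(b)):
        if x != y:
            return (x > y) - (x < y)
    return 0

print(version_cmp([9, 6], [9]))     # -> 1   (9.6 > 9.0; broken before the fix)
print(version_cmp([14, -1], [14]))  # -> -1  (14devel < 14)
print(version_cmp([9], [9, 0]))     # -> 0   (9 == 9.0)
```

The sketch also shows why the unfixed loop misbehaved: with arrays of different lengths, the original code hit an undefined element and returned 0 (equal) prematurely.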
Re: Fix proposal for comparison bugs in PostgreSQL::Version
On Tue, 28 Jun 2022 18:17:40 -0400
Andrew Dunstan wrote:

> On 2022-06-28 Tu 16:53, Jehan-Guillaume de Rorthais wrote:
> > ...
> > A better fix would be to store the version internally as version_num
> > that are trivial to compute and compare. Please, find in attachment an
> > implementation of this.
> >
> > The patch is a bit bigger because it improved the devel version to
> > support rc/beta/alpha comparison like 14rc2 > 14rc1.
> >
> > Moreover, it adds a bunch of TAP tests to check various use cases.
>
> Nice catch, but this looks like massive overkill. I think we can very
> simply fix the test in just a few lines of code, instead of a 190 line
> fix and a 130 line TAP test.

I explained why the patch was a little bit larger than required: it fixes the
bugs and does a little bit more. The _version_cmp sub is shorter and easier
to understand, I use multi-line code where I could probably fold them into a
one-liner, added some comments... Anyway, I don't feel the number of lines
changed is "massive". But I can probably remove some code and shrink some
other if it is really important...

Moreover, to be honest, I don't mind the number of additional lines of TAP
tests. Especially since it runs really, really fast and doesn't hurt
day-to-day devs as it is independent from other TAP tests anyway. It could be
1k, if it runs fast, is meaningful and helps avoiding future regressions, I
would welcome the addition.

If we really want to save some bytes, I have a two-line code fix that looks
more readable to me than fixing _version_cmp:

  +++ b/src/test/perl/PostgreSQL/Version.pm
  @@ -92,9 +92,13 @@ sub new
   	# Split into an array
   	my @numbers = split(/\./, $arg);
  
  +	# make sure all digits of the array-represented version are set so
  +	# we can keep _version_cmp code as a "simple" digit-to-digit
  +	# comparison loop
  +	$numbers[$_] += 0 for 0..3;
  +
   	# Treat development versions as having a minor/micro version one
   	# less than the first released version of that branch.
  -	push @numbers, -1 if ($devel);
  +	$numbers[3] = -1 if $devel;
  
   	$devel ||= "";

But again, in my humble opinion, the internal version array representation is
more of a burden we should replace by the version_num...

> It was never intended to be able to compare markers like rc1 vs rc2, and
> I don't see any need for it. If you can show me a sane use case I'll
> have another look, but right now it seems quite unnecessary.

I don't have a practical use case right now, but I thought the module would
be more complete with these few more lines of code. Now, keep in mind these
TAP modules might help external projects, not just core.

In fact, I wonder what was your original use case to support
devel/alpha/beta/rc versions, especially since it was actually not working?
Should we just get rid of this altogether and wait for an actual use case?

Cheers,
Fix proposal for comparison bugs in PostgreSQL::Version
Hi,

I found a comparison bug when using the PostgreSQL::Version module. See:

  $ perl -I. -MPostgreSQL::Version -le '
    my $v = PostgreSQL::Version->new("9.6");
    print "not 9.6 > 9.0" unless $v > 9.0;
    print "not 9.6 < 9.0" unless $v < 9.0;
    print "9.6 <= 9.0" if $v <= 9.0;
    print "9.6 >= 9.0" if $v >= 9.0;'
  not 9.6 > 9.0
  not 9.6 < 9.0
  9.6 <= 9.0
  9.6 >= 9.0

When using < or >, 9.6 is neither greater nor lesser than 9.0. When using <=
or >=, 9.6 is equally greater and lesser than 9.0. The bug does not show up
if you compare with "9.0" instead of 9.0. This bug is also triggered with
devel versions, eg. 14beta1 <=> 14.

The bug appears when both objects have a different number of digits in the
internal array representation:

  $ perl -I. -MPostgreSQL::Version -MData::Dumper -le '
    print Dumper(PostgreSQL::Version->new("9.0")->{num});
    print Dumper(PostgreSQL::Version->new(9.0)->{num});
    print Dumper(PostgreSQL::Version->new(14)->{num});
    print Dumper(PostgreSQL::Version->new("14beta1")->{num});'
  $VAR1 = [ '9', '0' ];
  $VAR1 = [ '9' ];
  $VAR1 = [ '14' ];
  $VAR1 = [ '14', -1 ];

Because of this, the following loop in "_version_cmp" is wrong, because we
are comparing two versions with different sizes of the 'num' array:

  for (my $idx = 0;; $idx++)
  {
      return 0 unless (defined $an->[$idx] && defined $bn->[$idx]);
      return $an->[$idx] <=> $bn->[$idx]
        if ($an->[$idx] <=> $bn->[$idx]);
  }

If we want to keep this internal array representation, the only fix I can
think of would be to always use a 4 element array defaulted to 0. The
previous examples would become:

  $VAR1 = [ 9, 0, 0, 0 ];
  $VAR1 = [ 9, 0, 0, 0 ];
  $VAR1 = [ 14, 0, 0, 0 ];
  $VAR1 = [ 14, 0, 0, -1 ];

A better fix would be to store the version internally as version_num that are
trivial to compute and compare. Please, find in attachment an implementation
of this.

The patch is a bit bigger because it improved the devel version to support
rc/beta/alpha comparison like 14rc2 > 14rc1.

Moreover, it adds a bunch of TAP tests to check various use cases.

Regards,

>From d6b28440ad8eb61fd83be6a0df8f40108296bc4e Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais
Date: Tue, 28 Jun 2022 22:17:48 +0200
Subject: [PATCH] Fix and improve PostgreSQL::Version

The internal array representation of a version was failing some comparisons
because the loop comparing each digit expected both arrays to have the same
number of digits. Comparisons like 9.6 <=> 9.0 or 14beta1 <=> 14 were
failing.

This patch computes and uses the numeric version internally to compare
objects. Moreover, it improves the comparison of development versions,
paying attention to the iteration number for rc, beta and alpha.
---
 src/test/perl/PostgreSQL/Version.pm      | 112 ++++++++++++--------
 src/test/perl/t/01-PostgreSQL::Version.t | 141 +++++++++++++++++++++
 2 files changed, 216 insertions(+), 37 deletions(-)
 create mode 100644 src/test/perl/t/01-PostgreSQL::Version.t

diff --git a/src/test/perl/PostgreSQL/Version.pm b/src/test/perl/PostgreSQL/Version.pm
index 8f70491189..e42003d0d1 100644
--- a/src/test/perl/PostgreSQL/Version.pm
+++ b/src/test/perl/PostgreSQL/Version.pm
@@ -46,6 +46,7 @@ package PostgreSQL::Version;
 
 use strict;
 use warnings;
+use Carp;
 
 use Scalar::Util qw(blessed);
 
@@ -73,35 +74,73 @@ of a Postgres command like `psql --version` or `pg_config --version`;
 
 sub new
 {
-	my $class = shift;
-	my $arg = shift;
-
-	chomp $arg;
-
-	# Accept standard formats, in case caller has handed us the output of a
-	# postgres command line tool
-	my $devel;
-	($arg, $devel) = ($1, $2)
-	  if (
-		$arg =~ m!^                             # beginning of line
-		  (?:\(?PostgreSQL\)?\s)?               # ignore PostgreSQL marker
-		  (\d+(?:\.\d+)*)                       # version number, dotted notation
-		  (devel|(?:alpha|beta|rc)\d+)?         # dev marker - see version_stamp.pl
-		!x);
-
-	# Split into an array
-	my @numbers = split(/\./, $arg);
-
-	# Treat development versions as having a minor/micro version one less than
-	# the first released version of that branch.
-	push @numbers, -1 if ($devel);
+	my $class  = shift;
+	my $arg    = shift;
+	my $ver    = '';
+	my $devel  = '';
+	my $vernum = 0;
+	my $devnum = 0;
+
+	$arg =~ m!^(?:[^\s]+\s)?                # beginning of line + optional command
+	  (?:\(?PostgreSQL\)?\s)?               # ignore PostgreSQL marker
+	  (\d+)(?:\.(\d+))?(?:\.(\d+))?         # version number, dotted notation
+	  ([^\s]+)?                             # dev marker - see version_stamp.pl
+	!x;
+
+	croak "could not parse version: $arg" unless defined $1;
+
+	$ver = "$1";
+	$vernum += 1 * $1;
+	$ver
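The version_num idea itself is language-independent. Here is a Python sketch of it (my own illustration, not the Perl patch above; it assumes minor/micro numbers and dev-marker iterations below 100): map the whole version string, dev marker included, to a single integer so that ordinary integer comparison orders devel < alphaN < betaN < rcN < release and honours the iteration number.

```python
import re

def version_num(ver):
    # Parse "major[.minor[.micro]][devel|alphaN|betaN|rcN]".
    m = re.match(r"^(\d+)(?:\.(\d+))?(?:\.(\d+))?"
                 r"(devel|(?:alpha|beta|rc)\d+)?$", ver)
    if not m:
        raise ValueError("could not parse version: " + ver)

    major, minor, micro = (int(g or 0) for g in m.groups()[:3])
    # Scale the released-version number, leaving room below it for the
    # development markers of the same branch.
    num = (major * 10000 + minor * 100 + micro) * 1000

    dev = m.group(4)
    if dev is None:
        return num
    if dev == "devel":
        return num - 1000          # lowest of the branch
    marker = re.match(r"(alpha|beta|rc)(\d+)$", dev)
    rank = {"alpha": 1, "beta": 2, "rc": 3}[marker.group(1)]
    # Stay below the release while ordering marker kind and iteration.
    return num - 1000 + rank * 100 + int(marker.group(2))

print(version_num("9.6") > version_num("9.0"))      # -> True
print(version_num("14rc2") > version_num("14rc1"))  # -> True
print(version_num("14") > version_num("14rc2"))     # -> True
```

With every version reduced to one integer, the whole digit-by-digit loop disappears: a generic numeric comparison is all that is left, which is the point of the proposal.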
Self FK oddity when attaching a partition
Hi all,

While studying the issue discussed in the thread "Detaching a partition with
a FK on itself is not possible"[1], I stumbled across an oddity while
attaching a partition having the same multiple self-FKs as the parent table.
Only one of the self-FKs is found as a duplicate.

Find in attachment some SQL to reproduce the scenario. Below is the result of
this scenario (constant from v12 to commit 7e367924e3).

Why is "child1_id_abc_no_part_fkey" found duplicated, but not the other
three? From pg_constraint, only "child1_id_abc_no_part_fkey" has a
"conparentid" set.

            conname           | conparentid | conrelid | confrelid
  ----------------------------+-------------+----------+-----------
   child1_id_abc_no_part_fkey  |       16901 |    16921 |     16921
   child1_id_def_no_part_fkey  |           0 |    16921 |     16921
   child1_id_ghi_no_part_fkey  |           0 |    16921 |     16921
   child1_id_jkl_no_part_fkey  |           0 |    16921 |     16921
   parent_id_abc_no_part_fkey  |       16901 |    16921 |     16894
   parent_id_abc_no_part_fkey  |           0 |    16894 |     16894
   parent_id_abc_no_part_fkey1 |       16901 |    16894 |     16921
   parent_id_def_no_part_fkey  |       16906 |    16921 |     16894
   parent_id_def_no_part_fkey  |           0 |    16894 |     16894
   parent_id_def_no_part_fkey1 |       16906 |    16894 |     16921
   parent_id_ghi_no_part_fkey  |           0 |    16894 |     16894
   parent_id_ghi_no_part_fkey  |       16911 |    16921 |     16894
   parent_id_ghi_no_part_fkey1 |       16911 |    16894 |     16921
   parent_id_jkl_no_part_fkey  |           0 |    16894 |     16894
   parent_id_jkl_no_part_fkey  |       16916 |    16921 |     16894
   parent_id_jkl_no_part_fkey1 |       16916 |    16894 |     16921
  (16 rows)

  Table "public.child1"
  [...]
  Partition of: parent FOR VALUES IN ('1')
  Partition constraint: ((no_part IS NOT NULL) AND (no_part = '1'::smallint))
  Indexes:
      "child1_pkey" PRIMARY KEY, btree (id, no_part)
  Check constraints:
      "child1" CHECK (no_part = 1)
  Foreign-key constraints:
      "child1_id_def_no_part_fkey" FOREIGN KEY (id_def, no_part) REFERENCES child1(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
      "child1_id_ghi_no_part_fkey" FOREIGN KEY (id_ghi, no_part) REFERENCES child1(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
      "child1_id_jkl_no_part_fkey" FOREIGN KEY (id_jkl, no_part) REFERENCES child1(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
      TABLE "parent" CONSTRAINT "parent_id_abc_no_part_fkey" FOREIGN KEY (id_abc, no_part) REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
      TABLE "parent" CONSTRAINT "parent_id_def_no_part_fkey" FOREIGN KEY (id_def, no_part) REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
      TABLE "parent" CONSTRAINT "parent_id_ghi_no_part_fkey" FOREIGN KEY (id_ghi, no_part) REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
      TABLE "parent" CONSTRAINT "parent_id_jkl_no_part_fkey" FOREIGN KEY (id_jkl, no_part) REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
  Referenced by:
      TABLE "child1" CONSTRAINT "child1_id_def_no_part_fkey" FOREIGN KEY (id_def, no_part) REFERENCES child1(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
      TABLE "child1" CONSTRAINT "child1_id_ghi_no_part_fkey" FOREIGN KEY (id_ghi, no_part) REFERENCES child1(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
      TABLE "child1" CONSTRAINT "child1_id_jkl_no_part_fkey" FOREIGN KEY (id_jkl, no_part) REFERENCES child1(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
      TABLE "parent" CONSTRAINT "parent_id_abc_no_part_fkey" FOREIGN KEY (id_abc, no_part) REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
      TABLE "parent" CONSTRAINT "parent_id_def_no_part_fkey" FOREIGN KEY (id_def, no_part) REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
      TABLE "parent" CONSTRAINT "parent_id_ghi_no_part_fkey" FOREIGN KEY (id_ghi, no_part) REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
      TABLE "parent" CONSTRAINT "parent_id_jkl_no_part_fkey" FOREIGN KEY (id_jkl, no_part) REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT

Regards,

[1] https://www.postgresql.org/message-id/flat/20220321113634.68c09d4b%40karst#83c0880a1b4921fcd00d836d4e6bceb3

self-fk-after-part-attach.sql
Description: application/sql
Re: Detaching a partition with a FK on itself is not possible
Hi, On Thu, 17 Mar 2022 17:58:04 + Arne Roland wrote: > I don't think this a bug, but a feature request. I therefore think hackers > would be more appropriate. +1. I changed the list destination. > I don't see how an additional syntax to modify the constraint should help. Me neither. > If I'd want to fix this, I'd try to teach the detach partition code about > self referencing foreign keys. It seems to me like that would be the cleanest > solution, because the user doesn't need to care about this at all. Teaching the detach partition code about self-referencing FKs means either: * it is safe to remove the FK, or * we can rewrite the FK as self-referencing on the detached partition. Neither solution is ideal from the original schema and user perspective. Another solution could be to teach the partition creation code to detect a self-referencing FK and actually create the partition's FK as self-referencing, not pointing to the partitioned table, of course issuing a NOTICE to the client.
Re: [Proposal] Add accumulated statistics for wait event
Hi Andres, On Mon, 14 Jun 2021 15:01:14 -0700 Andres Freund wrote: > On 2021-06-14 23:20:47 +0200, Jehan-Guillaume de Rorthais wrote: > > > On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote: > > > > In the patch in attachment, I tried to fix this by using kind of an > > > > internal hook for pgstat_report_wait_start and pgstat_report_wait_end. > > > > This allows to "instrument" wait events only when required, on the fly, > > > > dynamically. > > > > > > That's *far worse*. You're adding an indirect function call. Which > > > requires loading a global variable and then a far call to a different > > > function. You're changing a path that's ~2 instructions with minimal > > > dependencies (and no branches (i.e. fully out of order executable) to > > > something on the order of ~15 instructions with plenty dependencies and > > > at least two branches (call, ret). > > > > Oh, I didn't realized it would affect all queries, even when > > log_statement_stats was off. Thank you for your explanation. > > Maybe I just am misunderstanding what you were doing? As far as I can > tell your patch changed pgstat_report_wait_start() to be an indirect > function call - right? Exactly. I didn't realize this indirection would be so costly on every single call, on top of the variable assignment itself. > Then yes, this adds overhead to everything. I understand now, thank you for the explanation. For my own curiosity and study, I'll remove this indirection and bench my patch anyway. > You *could* add a pgstat_report_wait_(start|end)_with_time() or such and > only use that in places that won't have a high frequency. But I just > don't quite see the use-case for that. Well, it could be useful if we decided to only track a subset of wait events. In my scenario, I originally wanted to track only ClientWrite, but then I realized this might be too specific and tried to generalize. There are probably some other ways to deal with this issue.
Eg.:

* do NOT include the time lost waiting for the frontend side in the query execution time
* expose the frontend part of the query time in log_min_duration_statement, auto_explain, pg_stat_statements, in the same fashion we currently do with planning and execution time
* have some wait-event sampling mechanism in core, or one as easy to load on the fly as auto_explain

Thoughts? Thanks again! Regards,
Re: [Proposal] Add accumulated statistics for wait event
Hi, On Mon, 14 Jun 2021 11:27:21 -0700 Andres Freund wrote: > On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote: > > In the patch in attachment, I tried to fix this by using kind of an internal > > hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to > > "instrument" wait events only when required, on the fly, dynamically. > > That's *far worse*. You're adding an indirect function call. Which requires > loading a global variable and then a far call to a different function. You're > changing a path that's ~2 instructions with minimal dependencies (and no > branches (i.e. fully out of order executable) to something on the order of ~15 > instructions with plenty dependencies and at least two branches (call, ret). Oh, I didn't realize it would affect all queries, even when log_statement_stats was off. Thank you for your explanation. > I doubt there's a path towards this feature without adding the necessary > infrastructure to hot-patch the code - which is obviously quite a > substantial project. Right. Sadly, this kind of project is far above what I can do, so I suppose it's a dead end for me. I'll study if/how the sampling approach can be done dynamically. Thank you,
Re: [Proposal] Add accumulated statistics for wait event
Hi Andres, Hi all, First, thank you for your feedback! Please find in attachment a patch implementing accumulated wait event stats from the backend point of view only. As I wrote when I reviewed and rebased the existing patch, I was uncomfortable with the global approach. I still volunteer to work on/review the original approach if required. See below for comments and some more explanations about what I think might be improvements over the previous patch. On Fri, 11 Jun 2021 12:18:07 -0700 Andres Freund wrote: > On 2021-06-05 00:53:44 +0200, Jehan-Guillaume de Rorthais wrote: > > From 88c2779679c5c9625ca5348eec0543daab5ccab4 Mon Sep 17 00:00:00 2001 > > From: Jehan-Guillaume de Rorthais > > Date: Tue, 1 Jun 2021 13:25:57 +0200 > > Subject: [PATCH 1/2] Add pg_stat_waitaccum view. > > > > pg_stat_waitaccum shows counts and duration of each wait events. > > Each backend/backgrounds counts and measures the time of wait event > > in every pgstat_report_wait_start and pgstat_report_wait_end. They > > store those info into their local variables and send to Statistics > > Collector. We can get those info via Statistics Collector. > > > > For reducing overhead, I implemented statistic hash instead of > > dynamic hash. I also implemented track_wait_timing which > > determines wait event duration is collected or not. > > I object to adding this overhead. The whole selling point for wait > events was that they are low overhead. I since spent time reducing the > overhead further, because even just the branches for checking if > track_activity is enabled are measurable (225a22b19ed). Agreed. The previous patch was only rebased to review it and reopen this discussion. I even added a small FIXME in pgstat_report_wait_end and pgstat_report_wait_start about your work: //FIXME: recent patch to speed up this call. In the patch in attachment, I tried to fix this by using kind of an internal hook for pgstat_report_wait_start and pgstat_report_wait_end.
This allows instrumenting wait events only when required, on the fly, dynamically. Moreover, I replaced the hash structure with a simple static array for faster access. > > From ddb1adc5cd9acc9bc9de16d0cf057124b09fe1e3 Mon Sep 17 00:00:00 2001 > > From: Jehan-Guillaume de Rorthais > > Date: Fri, 4 Jun 2021 18:14:51 +0200 > > Subject: [PATCH 2/2] [POC] Change measuring method of wait event time from > > INSTR_TIME to rdtsc. > > > > This patch changes measuring method of wait event time from INSTR_TIME > > (which uses gettimeofday or clock_gettime) to rdtsc. This might reduce the > > overhead of measuring overhead. > > > > Any supports like changing clock cycle to actual time or error handling are > > not currently implemented. > > rdtsc is a serializing (*) instruction - that's the expensive part. On linux > clock_gettime() doesn't actually need a syscall. While the vdso call > implies a bit of overhead over a raw rdtsc, it's a relatively small part > of the overhead. See > https://www.postgresql.org/message-id/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de I chose to remove all the rdtsc part from my patch, as it wasn't clear how much faster it was compared to the simpler vdso functions, nor how to accurately convert its output to human-readable time.

About my take on $subject, for the sake of simplicity of this PoC, I added instrumentation to log_statement_stats. Despite the query context of the reported log, these are really accumulated stats. The patch updates pg_stat_get_waitaccum() as well, to be able to report the accumulated wait events from your interactive or batch session. So, using my previous fe-time demo client, you can test it using:

PGOPTIONS="--log_statement_stats=on" ./fe-time 100

From the logs, I now have (notice the last line):

LOG:  duration: 3837.194 ms  statement: SELECT * FROM pgbench_accounts
LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
!  0.087444 s user, 0.002106 s system, 3.837202 s elapsed
!  [0.087444 s user, 0.003974 s system total]
!  25860 kB max resident size
!  0/0 [0/0] filesystem blocks in/out
!  0/303 [0/697] page faults/reclaims, 0 [0] swaps
!  0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!  4/18 [5/18] voluntary/involuntary context switches
!  Client/ClientWrite 4 calls, 3747102 us elapsed

Using pgbench scale factor 10, the copy query for pgbench_accounts looks like:

LOG:  duration: 2388.081 ms  statement: copy pgbench_accounts from stdin
LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
!  1.373756 s user, 0.252860 s system, 2.388100 s elapsed
!  [1.397015 s user, 0.264951 s system total]
!  37788 kB max reside
Re: [Proposal] Add accumulated statistics for wait event
Hi All, A few times, I faced a situation where a long running query actually includes the time the backend waits for the frontend to fetch all the rows (see [1] for details). See the sample code fe-time.c and its comments in attachment to reproduce this behavior. There's no simple way today to pinpoint the problem in production without advanced interactive auditing, and/or using system tools. After studying the problem, I believe it boils down to tracking the ClientWrite wait event, so I ended up on this thread.

You might catch some mismatching times thanks to auto_explain as well. Using the fe-time.c demo code with the following command:

PGDATABASE=postgres PGHOST=::1 time ./fe-time 100

The frontend time is 10s, the reported query time is 3228.631ms, but the last row was produced after 20.672ms:

LOG:  duration: 3228.631 ms  plan:
    Query Text: SELECT * FROM pgbench_accounts
    Seq Scan on pgbench_accounts (time=0.005..20.672 rows=10 loops=1)

(Note that in contrast with localhost, through the unix socket, the backend reported query time is always really close to 10s.)

I re-based the existing patch (see attachment) to look at ClientWrite for this exact query:

# SELECT wait_event, calls, times FROM pg_stat_get_waitaccum(NULL) WHERE wait_event = 'ClientWrite';
 wait_event  | calls |  times
-------------+-------+---------
 ClientWrite |     4 | 3132266

The "times" column is expressed in µs in the patch, so 3132.266ms of the total 3228.631ms is spent sending the result to the frontend. I'm not sure where the missing 75ms are. The pg_wait_sampling extension might help, but it requires a restart of the production instance to install, then to enable it. Whether the solution is sampling or cumulative, an in-core and hot-switchable solution would be much more convenient.
But anyway, looking at pg_wait_sampling, we have a clear suspect as well for the later query run:

# SELECT event, count FROM pg_wait_sampling_profile WHERE queryid=4045741516911800313;
    event    | count
-------------+-------
 ClientWrite |   309

The default profile period of pg_wait_sampling being 10ms, we can roughly estimate the ClientWrite time around 3090ms. This is close enough, because we know 3132266µs were accumulated among only 4 large wait events. Finishing below. On Mon, 3 Aug 2020 00:00:40 +0200 Daniel Gustafsson wrote: > > On 31 Jul 2020, at 07:23, imai.yoshik...@fujitsu.com wrote: > > > >> This patch fails to apply to HEAD, please submit a rebased version. I've > >> marked this as Waiting on Author. Please find in attachment a rebase of both patches. I did some small editing on the way. I didn't bench them. I'm not sure this patch is the best approach, though. Take it as a motivation to keep up with this discussion. As I wrote, whether the solution is sampling or cumulative, an in-core and hot-switchable solution would be much more convenient. The fact is that this patch was already available and ready to support the discussion. Collecting and summing all wait events from all backends in the same place forbids tracking wait events precisely for a specific backend, especially on a busy system where numbers can quickly be buried by all the other activity around. I wonder if wait events should only be accumulated on the backend side, making it possible to enable/disable them on the fly and to collect some reports, e.g. in the logs or to some output. Most of the code from these patches could be recycled in a simpler patch implementing this. Thoughts? > > Sorry for my absence. Unfortunately I couldn't have time to work on this > > patch in this cf. I believe I will be back in next cf, work on this patch > > and also review other patches. > > No worries, it happens. Since the thread has stalled and there is no updated > patch I've marked this entry Returned with Feedback.
Please feel free to > re-open a new CF entry if you return to this patch. I volunteer to be a reviewer on this patch. Imai-san, do you agree to add it as a new CF entry? Regards,

[1]: The last time I hit such a situation was a few weeks ago. A customer was reporting a query being randomly slow, running below 100ms most of the time and sometimes hitting 28s. Long story short, the number of rows was the same (10-15k), but the result set size was 9x bigger (1MB vs 9MB). As the same query was running fine from psql, I suspected the frontend was somehow saturated. Tcpdump helped me to compute that the throughput fell to 256kB/s after the first 2MB of data transferred, with a very narrow TCP window. I explained to the customer that their app probably doesn't pull the rows fast enough and that some buffers were probably saturated on the frontend side, waiting for the app and slowing down the whole transfer. The devs fixed the problem by moving two field transformations (unaccent) out of their row-fetching loop.
Re: when the startup process doesn't
On Wed, 21 Apr 2021 12:36:05 -0700 Andres Freund wrote: > [...] > > I don't think that concern equally applies for what I am proposing > here. For one, we already have minRecoveryPoint in ControlData, and we > already use it for the purpose of determining where we need to recover > to, albeit only during crash recovery. Imo that's substantially > different from adding actual recovery progress status information to the > control file. Just for the record, when I was talking about updating the status of the startup process in the controldata, I was thinking about setting the last known replayed LSN, not some kind of variable string. > > I also think that it'd actually be a significant reliability improvement > if we maintained an approximate minRecoveryPoint during normal running: > I've seen way too many cases where WAL files were lost / removed and > crash recovery just started up happily. Only hitting problems months > down the line. Yes, it'd obviously not bullet proof, since we'd not want > to add a significant stream of new fsyncs, but IME such WAL files > lost/removed issues tend not to be about a few hundred bytes of WAL but > many segments missing. Maybe setting this minRecoveryPoint once per segment, near the beginning of each new WAL segment, would be enough. At least, the recovery process would be forced to actually replay up to the very last known segment. Regards,
Re: when the startup process doesn't
On Tue, 20 Apr 2021 15:04:28 +0200 Magnus Hagander wrote: [...] > Yeah, I think we should definitely limit this to local access, one way > or another. Realistically using pg_hba is going to require catalog > access, isn't it? And we can't just go ignore those rows in pg_hba > that for example references role membership (as well as all those auth > methods you can't use). Two other options:

1. If this is limited to local access only, outside of the log entries, the status of the startup process could be updated in the controldata file as well. This would allow watching it without tail-grep'ing the logs, using e.g. pg_controldata.

2. Maybe the startup process could ignore update_process_title? As far as I understand the doc, this GUC is mostly useful for backends on Windows.

Regards,
Re: multi-install PostgresNode fails with older postgres versions
On Mon, 19 Apr 2021 10:35:39 -0700 Mark Dilger wrote: > > On Apr 19, 2021, at 10:25 AM, Jehan-Guillaume de Rorthais > > wrote: > > > > On Mon, 19 Apr 2021 12:37:08 -0400 > > Andrew Dunstan wrote: > > > >> > >> On 4/19/21 10:43 AM, Mark Dilger wrote: > >>> > >>>> On Apr 19, 2021, at 5:11 AM, Andrew Dunstan wrote: > >>>> > >>>> I think therefore I'm inclined for now to do nothing for old version > >>>> compatibility. > >>> I agree with waiting until the v15 development cycle. > >>> > >>>> I would commit the fix for the IPC::Run caching glitch, > >>>> and version detection > >>> Thank you. > >>> > >>>> I would add a warning if the module is used with > >>>> a version <= 11. > >>> Sounds fine for now. > >>> > >>>> The original goal of these changes was to allow testing of combinations > >>>> of different builds with openssl and nss, which doesn't involve old > >>>> version compatibility. > >>> Hmm. I think different folks had different goals. My personal interest > >>> is to write automated tests which spin up older servers, create data that > >>> cannot be created on newer servers (such as heap tuples with HEAP_MOVED_IN > >>> or HEAP_MOVED_OFF bits set), upgrade, and test that new code handles the > >>> old data correctly. I think this is not only useful for our test suites > >>> as a community, but is also useful for companies providing support > >>> services who need to reproduce problems that customers are having on > >>> clusters that have been pg_upgraded across large numbers of postgres > >>> versions. > >>>> As far as I know, without any compatibility changes the module is fully > >>>> compatible with releases 13 and 12, and with releases 11 and 10 so long > >>>> as you don't want a standby, and with releases 9.6 and 9.5 if you also > >>>> don't want a backup. That makes it suitable for a lot of testing without > >>>> any attempt at version compatibility. > >>>> > >>>> We can revisit compatibility further in the next release. > >>> Sounds good. 
> >> > >> I'll work on this. Meanwhile FTR here's my latest revision - it's a lot > >> less invasive of the main module, so it seems much more palatable to me, > >> and still passes my test down to 7.2. > > > > I spend a fair bit of time to wonder how useful it could be to either > > maintain such a module in core, including for external needs, or creating a > > separate external project with a different release/distribution/packaging > > policy. > > > > Wherever the module is maintained, the goal would be to address broader > > needs, eg. adding a switch_wal() method or wait_for_archive(), supporting > > replication, backups, etc for multi-old-deprecated-PostgreSQL versions. > > > > To be honest I have mixed feelings. I feel this burden shouldn't be carried > > by the core, which has restricted needs compared to external projects. In > > the opposite, maintaining an external project which shares 90% of the code > > seems to be a useless duplicate and backport effort. Moreover Craig Ringer > > already opened the door for external use of PostgresNode with his effort to > > install/package it, see: > > https://www.postgresql.org/message-id/CAGRY4nxxKSFJEgVAv5YAk%3DbqULtFmNw7gEJef0CCgzpNy6O%3D-w%40mail.gmail.com > > > > Thoughts? > > The community needs a single shared PostgresNode implementation that can be > used by scripts which reproduce bugs. Which means it could be OK to have a PostgresNode implementation, living in the core source tree, which supports broader needs than core's own (older versions and some more methods)? Did I understand correctly? If this is correct, I suppose this effort could be committed early in the v15 cycle? Does it deserve some effort to build dedicated TAP tests for these modules? I already have a small patch for this, waiting on my disk for some more tests and review... Regards
Re: multi-install PostgresNode fails with older postgres versions
On Mon, 19 Apr 2021 12:37:08 -0400 Andrew Dunstan wrote: > > On 4/19/21 10:43 AM, Mark Dilger wrote: > > > >> On Apr 19, 2021, at 5:11 AM, Andrew Dunstan wrote: > >> > >> I think therefore I'm inclined for now to do nothing for old version > >> compatibility. > > I agree with waiting until the v15 development cycle. > > > >> I would commit the fix for the IPC::Run caching glitch, > >> and version detection > > Thank you. > > > >> I would add a warning if the module is used with > >> a version <= 11. > > Sounds fine for now. > > > >> The original goal of these changes was to allow testing of combinations > >> of different builds with openssl and nss, which doesn't involve old > >> version compatibility. > > Hmm. I think different folks had different goals. My personal interest is > > to write automated tests which spin up older servers, create data that > > cannot be created on newer servers (such as heap tuples with HEAP_MOVED_IN > > or HEAP_MOVED_OFF bits set), upgrade, and test that new code handles the > > old data correctly. I think this is not only useful for our test suites as > > a community, but is also useful for companies providing support services > > who need to reproduce problems that customers are having on clusters that > > have been pg_upgraded across large numbers of postgres versions. > > > >> As far as I know, without any compatibility changes the module is fully > >> compatible with releases 13 and 12, and with releases 11 and 10 so long > >> as you don't want a standby, and with releases 9.6 and 9.5 if you also > >> don't want a backup. That makes it suitable for a lot of testing without > >> any attempt at version compatibility. > >> > >> We can revisit compatibility further in the next release. > > Sounds good. > > > I'll work on this. Meanwhile FTR here's my latest revision - it's a lot > less invasive of the main module, so it seems much more palatable to me, > and still passes my test down to 7.2. 
I spent a fair bit of time wondering how useful it could be to either maintain such a module in core, including for external needs, or to create a separate external project with a different release/distribution/packaging policy. Wherever the module is maintained, the goal would be to address broader needs, eg. adding a switch_wal() method or wait_for_archive(), supporting replication, backups, etc. for multiple old, deprecated PostgreSQL versions. To be honest, I have mixed feelings. I feel this burden shouldn't be carried by core, which has restricted needs compared to external projects. On the other hand, maintaining an external project which shares 90% of the code seems to be a useless duplication and backport effort. Moreover, Craig Ringer already opened the door to external use of PostgresNode with his effort to install/package it, see: https://www.postgresql.org/message-id/CAGRY4nxxKSFJEgVAv5YAk%3DbqULtFmNw7gEJef0CCgzpNy6O%3D-w%40mail.gmail.com Thoughts?
Re: multi-install PostgresNode fails with older postgres versions
On Mon, 19 Apr 2021 07:43:58 -0700 Mark Dilger wrote: > > On Apr 19, 2021, at 5:11 AM, Andrew Dunstan wrote: > > > > I think therefore I'm inclined for now to do nothing for old version > > compatibility. > > I agree with waiting until the v15 development cycle. Agree.
Re: Retry in pgbench
On Fri, 16 Apr 2021 10:28:48 +0900 (JST) Tatsuo Ishii wrote: > > By the way, I've been playing with the idea of failing gracefully and retry > > indefinitely (or until given -T) on SQL error AND connection issue. > > > > It would be useful to test replicating clusters with a (switch|fail)over > > procedure. > > Interesting idea but in general a failover takes sometime (like a few > minutes), and it will strongly affect TPS. I think in the end it just > compares the failover time. This use case is not about benchmarking. It's about generating constant traffic to be able to practice/train some [auto]switchover procedures while staying close to production activity. In this context, the saturated max TPS of one node is not relevant. But some stats about downtime might be a good addition. Regards,
Re: Retry in pgbench
Hi, On Tue, 13 Apr 2021 16:12:59 +0900 (JST) Tatsuo Ishii wrote: > [...] > [...] > [...] > > Thanks for the pointer. It seems we need to resume the discussion. By the way, I've been playing with the idea of failing gracefully and retrying indefinitely (or until the given -T) on SQL errors AND connection issues. It would be useful to test replicating clusters with a (switch|fail)over procedure. Regards,
Re: multi-install PostgresNode fails with older postgres versions
On Mon, 12 Apr 2021 09:52:24 -0400 Andrew Dunstan wrote: > On 4/12/21 8:59 AM, Jehan-Guillaume de Rorthais wrote: > > Hi, > > > > On Wed, 7 Apr 2021 20:07:41 +0200 > > Jehan-Guillaume de Rorthais wrote: > > [...] > >>>> Let me know if it worth that I work on an official patch. > >>> Let's give it a try ... > >> OK > > So, as promised, here is my take to port my previous work on PostgreSQL > > source tree. > > > > Make check pass with no errors. The new class probably deserve some own TAP > > tests. > > > > Note that I added a PostgresVersion class for easier and cleaner version > > comparisons. This could be an interesting take away no matter what. > > > > I still have some more ideas to cleanup, revamp and extend the base class, > > but I prefer to go incremental to keep things review-ability. > > > > Thanks for this. We have been working somewhat on parallel lines. With > your permission I'm going to take some of what you've done and > incorporate it in the patch I'm working on. The current context makes my weeks difficult to plan and quite chaotic; that's why it took me some days to produce the patch I promised. I'm fine with working on a common base code, thanks. Feel free to merge both code bases; we'll trade patches during review. However, I'm not sure what the status of your patch is, so I cannot judge the easiest way to incorporate it. Mine is tested down to 9.1 (9.0 was meaningless because of the lack of pg_stat_replication) with:

* get_new_node
* init(allows_streaming => 1)
* start
* stop
* backup
* init_from_backup
* wait_for_catchup
* command_checks_all

Note the various changes in my init(), the new allows_streaming() method, etc. FYI (to avoid duplicate work), the next step on my todo list was to produce some meaningful TAP tests to prove the class. > A PostgresVersion class is a good idea - I was already contemplating > something of the kind. Thanks! Regards,
Re: multi-install PostgresNode fails with older postgres versions
Hi, On Wed, 7 Apr 2021 20:07:41 +0200 Jehan-Guillaume de Rorthais wrote: [...] > > > Let me know if it worth that I work on an official patch. > > > > Let's give it a try ... > > OK So, as promised, here is my take at porting my previous work to the PostgreSQL source tree. make check passes with no errors. The new class probably deserves its own TAP tests. Note that I added a PostgresVersion class for easier and cleaner version comparisons. This could be an interesting takeaway no matter what. I still have some more ideas to clean up, revamp and extend the base class, but I prefer to go incremental to keep things reviewable. Thanks,

>From 077e447995b3b8bb4c0594a6616e1350acd9d48b Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais
Date: Mon, 12 Apr 2021 14:45:17 +0200
Subject: [PATCH] Draft: Makes PostgresNode multi-version aware

---
 src/test/perl/PostgresNode.pm    | 708 ---
 src/test/perl/PostgresVersion.pm |  65 +++
 2 files changed, 704 insertions(+), 69 deletions(-)
 create mode 100644 src/test/perl/PostgresVersion.pm

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index e26b2b3f30..d7a9e54efd 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -102,6 +102,7 @@
 use Test::More;
 use TestLib ();
 use Time::HiRes qw(usleep);
 use Scalar::Util qw(blessed);
+use PostgresVersion;

 our @EXPORT = qw(
 	get_new_node
@@ -376,9 +377,42 @@ sub dump_info
 	return;
 }

+# Internal routine to allows streaming replication on a primary node.
+sub allows_streaming
+{
+	my ($self, $allows_streaming) = @_;
+	my $pgdata = $self->data_dir;
+	my $wal_level;
+
+	if ($allows_streaming eq "logical")
+	{
+		$wal_level = "logical";
+	}
+	else
+	{
+		$wal_level = "replica";
+	}
+
+	$self->append_conf( 'postgresql.conf', qq{
+		wal_level = $wal_level
+		max_wal_senders = 10
+		max_replication_slots = 10
+		wal_log_hints = on
+		hot_standby = on
+		# conservative settings to ensure we can run multiple postmasters:
+		shared_buffers = 1MB
+		max_connections = 10
+		# limit disk space consumption, too:
+		max_wal_size = 128MB
+	});

-# Internal method to set up trusted pg_hba.conf for replication. Not
-# documented because you shouldn't use it, it's called automatically if needed.
+	$self->set_replication_conf;
+
+	return;
+}
+
+# Internal to set up trusted pg_hba.conf for replication. Not documented
+# because you shouldn't use it, it's called automatically if needed.
 sub set_replication_conf
 {
 	my ($self) = @_;
@@ -398,6 +432,15 @@ sub set_replication_conf
 	return;
 }

+# Internal methods to track capabilities depending on the PostgreSQL major
+# version used
+
+sub can_slots { return 1 }
+sub can_log_hints { return 1 }
+sub can_restart_after_crash { return 1 }
+sub can_skip_init_fsync { return 1 }
+
+
 =pod

 =item $node->init(...)
@@ -429,6 +472,7 @@ sub init
 	my $port = $self->port;
 	my $pgdata = $self->data_dir;
 	my $host = $self->host;
+	my @cmd;

 	local %ENV = $self->_get_env();

@@ -438,19 +482,25 @@
 	mkdir $self->backup_dir;
 	mkdir $self->archive_dir;

-	TestLib::system_or_bail('initdb', '-D', $pgdata, '-A', 'trust', '-N',
-		@{ $params{extra} });
+	@cmd = ( 'initdb', '-D', $pgdata, '-A', 'trust' );
+	push @cmd, '-N' if $self->can_skip_init_fsync;
+	push @cmd, @{ $params{extra} } if defined $params{extra};
+
+	TestLib::system_or_bail(@cmd);
 	TestLib::system_or_bail($ENV{PG_REGRESS}, '--config-auth', $pgdata,
 		@{ $params{auth_extra} });

 	open my $conf, '>>', "$pgdata/postgresql.conf";
 	print $conf "\n# Added by PostgresNode.pm\n";
 	print $conf "fsync = off\n";
-	print $conf "restart_after_crash = off\n";
+	print $conf "restart_after_crash = off\n" if $self->can_restart_after_crash;
 	print $conf "log_line_prefix = '%m [%p] %q%a '\n";
 	print $conf "log_statement = all\n";
-	print $conf "log_replication_commands = on\n";
-	print $conf "wal_retrieve_retry_interval = '500ms'\n";
+	if ($self->version >= '9.5.0')
+	{
+		print $conf "log_replication_commands = on\n";
+		print $conf "wal_retrieve_retry_interval = '500ms'\n";
+	}

 	# If a setting tends to affect whether tests pass or fail, print it after
 	# TEMP_CONFIG. Otherwise, print it before TEMP_CONFIG, thereby permitting
@@ -463,31 +513,8 @@
 	# concurrently must not share a stats_temp_directory.
 	print $conf "stats_temp_directory = 'pg_stat_tmp'\n";

-	if ($params{allows_streaming})
-	{
-		if ($params{allows_streaming} eq "logical")
-		{
-			print $conf "wal_level = logical\n";
-		}
-		else
-		{
-			print $conf "wal_level = replica\n";
-		}
-		print $conf "max_wal_senders = 10\n";
-		print $conf "max_replication_slo
Re: multi-install PostgresNode fails with older postgres versions
On Wed, 7 Apr 2021 13:38:39 -0400 Andrew Dunstan wrote: > On 4/7/21 1:19 PM, Jehan-Guillaume de Rorthais wrote: > > On Wed, 7 Apr 2021 12:51:55 -0400 > > Alvaro Herrera wrote: > > > >> On 2021-Apr-07, Jehan-Guillaume de Rorthais wrote: > >> > >>> When I'm creating a new node, I'm using the "pgaTester" factory class. It > >>> relies on PATH to check the major version using pg_config, then loads the > >>> appropriate class. > >> From a code cleanliness point of view, I agree that having separate > >> classes for each version is neater than what you call a forest of > >> conditionals. I'm not sure I like the way you instantiate the classes > >> in pgaTester though -- wouldn't it be saner to have PostgresNode::new > >> itself be in charge of deciding which class to bless the object as? > >> Since we're talking about modifying PostgresNode itself in order to > >> support this, it would make sense to do that. > > Yes, it would be much saner to make PostgresNode the factory class. Plus, > > some more logic could be injected there to either auto-detect the version > > (current behavior) or eg. use a given path to the binaries as Mark did in > > its patch. > > > Aren't you likely to end up duplicating substantial amounts of code, > though? I'm certainly not at the stage where I think the version-aware > code is creating too much clutter. The "forest of conditionals" seems > more like a small thicket. I started with a patched PostgresNode. Then I had to support backups and replication for my tests. Then it became hard to follow the logic in the conditional blocks; moreover, some conditionals needed to appear in multiple places in the same methods, depending on the enabled features, the conditions, which GUCs were enabled by default or not, etc. So I ended up with this design. I really don't want to waste community brain cycles in discussions and useless reviews.
But as long as someone agrees to review it, I already have the material and can create a patch with a limited amount of work to compare and review. The one-class approach would need to support replication down to 9.0 to be a fair comparison though :/

Thanks,
Re: multi-install PostgresNode fails with older postgres versions
On Wed, 7 Apr 2021 10:50:26 -0700 Mark Dilger wrote:
> > On Apr 7, 2021, at 10:36 AM, Alvaro Herrera wrote:
> >
> >> Yes, it would be much saner to make PostgresNode the factory class. Plus,
> >> some more logic could be injected there to either auto-detect the version
> >> (current behavior) or eg. use a given path to the binaries as Mark did in
> >> its patch.
> >
> > I'm not sure what you mean about auto-detecting the version -- I assume
> > we would auto-detect the version by calling pg_config from the
> > configured path and parsing the binary, which is what Mark's patch is
> > supposed to do already. So I don't see what the distinction between
> > those two things is.
> >
> > In order to avoid having an ever-growing plethora of 100-byte .pm files,
> > we can put the version-specific classes in the same PostgresNode.pm
> > file, at the bottom, "class PostgresNode96; use parent PostgresNode10;"
> > followed by the routines that are overridden for each version.
>
> It's not sufficient to think about postgres versions as "10", "11", etc. You
> have to be able to spin up nodes of any build, like "9.0.7". There are
> specific versions of postgres with specific bugs that cause specific
> problems, and later versions of postgres on that same development branch have
> been patched. If you only ever spin up the latest version, you can't
> reproduce the problems and test how they impact cross version compatibility.

Agree.

> I don't think it works to have a class per micro release. So you'd have a
> PostgresNode of type "10" or such, but how does that help? If you have ten
> different versions of "10" in your test, they all look the same?

As the PostgresNode factory already checked the pg_config version, it can pass it as an argument to the specific class constructor. We could then have eg. major_version() and version() to return the major and full versions.

Regards,
Re: multi-install PostgresNode fails with older postgres versions
On Wed, 7 Apr 2021 13:36:31 -0400 Alvaro Herrera wrote: > On 2021-Apr-07, Jehan-Guillaume de Rorthais wrote: > > > Yes, it would be much saner to make PostgresNode the factory class. Plus, > > some more logic could be injected there to either auto-detect the version > > (current behavior) or eg. use a given path to the binaries as Mark did in > > its patch. > > I'm not sure what you mean about auto-detecting the version -- I assume > we would auto-detect the version by calling pg_config from the > configured path and parsing the binary, which is what Mark's patch is > supposed to do already. So I don't see what the distinction between > those two things is. My version is currently calling pg_config without any knowledge about its absolute path. Mark's patch is able to take an explicit binary path: my $a = PostgresNode->get_new_node('a', install_path => '/my/install/8.4'); > In order to avoid having an ever-growing plethora of 100-byte .pm files, > we can put the version-specific classes in the same PostgresNode.pm > file, at the bottom, "class PostgresNode96; use parent PostgresNode10;" > followed by the routines that are overridden for each version. Sure. > > Let me know if it worth that I work on an official patch. > > Let's give it a try ... OK Regards,
Re: multi-install PostgresNode fails with older postgres versions
On Wed, 7 Apr 2021 12:51:55 -0400 Alvaro Herrera wrote:
> On 2021-Apr-07, Jehan-Guillaume de Rorthais wrote:
>
> > When I'm creating a new node, I'm using the "pgaTester" factory class. It
> > relies on PATH to check the major version using pg_config, then loads the
> > appropriate class.
>
> From a code cleanliness point of view, I agree that having separate
> classes for each version is neater than what you call a forest of
> conditionals. I'm not sure I like the way you instantiate the classes
> in pgaTester though -- wouldn't it be saner to have PostgresNode::new
> itself be in charge of deciding which class to bless the object as?
> Since we're talking about modifying PostgresNode itself in order to
> support this, it would make sense to do that.

Yes, it would be much saner to make PostgresNode the factory class. Plus, some more logic could be injected there to either auto-detect the version (current behavior) or eg. use a given path to the binaries as Mark did in his patch.

> (I understand that one of your decisions was to avoid modifying
> PostgresNode, so that you could ingest whatever came out of PGDG without
> having to patch it each time.)

Indeed, that's why I created this class with a not-very-fortunate name :)

Let me know whether it's worth me working on an official patch.

Regards,
Re: why pg_walfile_name() cannot be executed during recovery?
On Fri, 2 Apr 2021 08:22:09 -0400 Robert Haas wrote:
> On Fri, Apr 2, 2021 at 4:23 AM SATYANARAYANA NARLAPURAM
> wrote:
> > Why pg_walfile_name() can't be executed under recovery?
>
> I believe the issue is that the backend executing the function might
> not have an accurate idea about which TLI to use. But I don't
> understand why we can't find some solution to that problem.
>
> > What is the best way for me to get the current timeline and/or the file
> > being recovering on the standby using a postgres query? I know I can get it
> > via process title but don't want to go that route.
>
> pg_stat_wal_receiver has LSN and TLI information, but probably won't
> help except when WAL receiver is actually active.
> pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn() will give the
> LSN at any point during recovery, but not the TLI. We might have some
> gaps in this area...

Yep, see the previous discussion: https://www.postgresql.org/message-id/flat/20190723180518.635ac554%40firost

The status at the time was to consider a new view, eg. pg_stat_recovery, to report various recovery stats. But maybe the best place now would be to include it in the new pg_stat_wal view?

As I'm interested in this feature as well, I volunteer to work on it as author or reviewer.

Regards,
Re: multi-install PostgresNode fails with older postgres versions
On Wed, 7 Apr 2021 09:08:31 -0700 Mark Dilger wrote:
> > On Apr 7, 2021, at 7:37 AM, Jehan-Guillaume de Rorthais
> > And here is a demo test file:
> > https://github.com/ioguix/check_pgactivity/blob/tests/t/01-streaming_delta.t
> >
> > My limited set of tests are working with versions back to 9.0 so far.
> >
> > My 2¢
>
> Hmm, I took a look. I'm not sure that we're working on the same problem, but
> I might have missed something.

I understood you were working on making PostgresNode compatible with older versions of PostgreSQL, so you can create and interact with older versions, eg. 9.0. Did I misunderstand?

My set of classes had the exact same goal: creating and managing PostgreSQL nodes from various major versions. It goes a bit further than just init() though, as it supports some more existing methods (eg. enable_streaming) and adds some others (version, switch_wal, wait_for_archive).

Regards,
Re: multi-install PostgresNode fails with older postgres versions
On Wed, 7 Apr 2021 11:54:36 -0400 Andrew Dunstan wrote: > On 4/7/21 10:37 AM, Jehan-Guillaume de Rorthais wrote: > > Hi all, > > > > First, sorry to step in this discussion this late. I didn't noticed it > > before :( > > > > I did some work about these compatibility issues in late 2020 to use > > PostgresNode in the check_pgactivity TAP tests. > > > > See https://github.com/ioguix/check_pgactivity/tree/tests/t/lib > > > > PostgresNode.pm, TestLib.pm, SimpleTee.pm and RecursiveCopy.pm comes > > unchanged from PostgreSQL source file (see headers and COPYRIGHT.pgsql). > > > > Then, I'm using the facet class PostgresNodeFacet to extend it with some > > more methods. Finaly, I created one class per majpr version, each > > inheriting from the next version. That means 13 inherits from > > PostgresNodeFacet.pm, 12 inherits from 13, 11 inherits from 12, etc. > > > > When I'm creating a new node, I'm using the "pgaTester" factory class. It > > relies on PATH to check the major version using pg_config, then loads the > > appropriate class. > > > > That means some class overrides almost no methods but version(), which > > returns the major version. Eg.: > > https://github.com/ioguix/check_pgactivity/blob/tests/t/lib/PostgresNode12.pm > > > > From tests, I can check the node version using this method, eg.: > > > > skip "skip non-compatible test on PostgreSQL 8.0 and before", 3 > > unless $node->version <= 8.0; > > > > Of course, there's a lot of duplicate code between classes, but my main goal > > was to keep PostgresNode.pm unchanged from upstream so I can easily update > > it. > > > > And here is a demo test file: > > https://github.com/ioguix/check_pgactivity/blob/tests/t/01-streaming_delta.t > > > > My limited set of tests are working with versions back to 9.0 so far. > > > > My 2¢ > > > > I don't really want to create a multitude of classes. I think Mark is > basically on the right track. 
> I started off using a subclass of
> PostgresNode but was persuaded not to go down that route, and instead we
> have made some fairly minimal changes to PostgresNode itself. I think
> that was a good decision. If you take out the versioning subroutines,
> the actual version-aware changes Mark is proposing to PostgresNode are
> quite small - less that 200 lines supporting versions all the way back
> to 8.1. That's pretty awesome.

I took this road because, as soon as you want to use some other methods like enable_streaming, enable_archiving, etc., you find many more incompatibilities on your way. My current stack of classes is backward compatible with many more methods than just init(). But I admit it creates a multitude of classes and some duplicate code...

It's still possible to patch each method in PostgresNode, but you'll end up with a forest of conditional blocks depending on how far back you want to go with old PostgreSQL versions.

Regards,
Re: multi-install PostgresNode fails with older postgres versions
Hi all,

First, sorry to step into this discussion this late. I didn't notice it before :(

I did some work on these compatibility issues in late 2020 to use PostgresNode in the check_pgactivity TAP tests. See https://github.com/ioguix/check_pgactivity/tree/tests/t/lib

PostgresNode.pm, TestLib.pm, SimpleTee.pm and RecursiveCopy.pm come unchanged from the PostgreSQL source files (see headers and COPYRIGHT.pgsql). Then, I'm using the facet class PostgresNodeFacet to extend it with some more methods. Finally, I created one class per major version, each inheriting from the next version. That means 13 inherits from PostgresNodeFacet.pm, 12 inherits from 13, 11 inherits from 12, etc.

When I'm creating a new node, I'm using the "pgaTester" factory class. It relies on PATH to check the major version using pg_config, then loads the appropriate class. That means some classes override almost no methods but version(), which returns the major version. Eg.: https://github.com/ioguix/check_pgactivity/blob/tests/t/lib/PostgresNode12.pm

From tests, I can check the node version using this method, eg.:

  skip "skip non-compatible test on PostgreSQL 8.0 and before", 3
    unless $node->version <= 8.0;

Of course, there's a lot of duplicate code between classes, but my main goal was to keep PostgresNode.pm unchanged from upstream so I can easily update it.

And here is a demo test file: https://github.com/ioguix/check_pgactivity/blob/tests/t/01-streaming_delta.t

My limited set of tests is working with versions back to 9.0 so far.

My 2¢
Re: [PATCH] Disable bgworkers during servers start in pg_upgrade
Oh, I forgot another point before sending my previous email. Maybe it would be worth adding some final safety checks in pg_upgrade itself? Eg. checking controldata and mxid file consistency between the old and new clusters would have caught the inconsistency here.
Re: [PATCH] Disable bgworkers during servers start in pg_upgrade
Hi,

On Wed, 27 Jan 2021 11:25:11 +0100 Denis Laxalde wrote:
> Andres Freund a écrit :
> > On 2021-01-21 16:23:58 +0100, Denis Laxalde wrote:
> > > We found an issue in pg_upgrade on a cluster with a third-party
> > > background worker. The upgrade goes fine, but the new cluster is then in
> > > an inconsistent state. The background worker comes from the PoWA
> > > extension but the issue does not appear to related to this particular
> > > code.
> >
> > Well, it does imply that that backgrounder did something, as the pure
> > existence of a bgworker shouldn't affect anything. Presumably the issue is
> > that the bgworker actually does transactional writes, which causes problems
> > because the xids / multixacts from early during pg_upgrade won't actually
> > be valid after we do pg_resetxlog etc.

Indeed, it does some writes. As soon as the powa bgworker starts, it takes "snapshots" of the pg_stat_statements state and records them in a local table. To avoid concurrent runs, it takes a lock on some of its local rows using SELECT FOR UPDATE, hence the mxid consumption.

The inconsistency occurs in at least two places:

* the datminmxid and relminmxid fields pg_dump(all)'ed and restored in the new cluster;
* the multixact fields in the controlfile read during the check phase and restored later using pg_resetxlog.

> > > As a solution, it seems that, for similar reasons that we restrict
> > > socket access to prevent accidental connections (from commit
> > > f763b77193), we should also prevent background workers to start at this
> > > step.
> >
> > I think that'd have quite the potential for negative impact - [...]
>
> Thank you for those insights!

+1

> > I wonder if we could
> >
> > a) set default_transaction_read_only to true, and explicitly change it
> >    in the sessions that need that.

According to Denis' tests discussed off-list, it works fine with regard to the powa bgworker, albeit with some complaints in the logs.
However, I wonder how fragile it could be, as a bgworker could easily override either the GUC or issue "BEGIN READ WRITE". I realize these are really uncommon practices, but bgworker code from third parties might be surprising.

> > b) when in binary upgrade mode / -b, error out on all wal writes in
> >    sessions that don't explicitly set a session-level GUC to allow
> >    writes.

It feels safer because it is more specific to the subject. But I wonder if the benefit is worth adding some (limited?) complexity and a small/quick conditional block in a very hot code path for a very limited use case.

What about c), where bgworkers are not loaded by default during binary upgrade mode / -b unless they explicitly set a bgw_flags flag (BGWORKER_BINARY_UPGRADE?) when they are required during pg_upgrade?

Regards,
Re: vacuum -vs reltuples on insert only index
On Wed, 4 Nov 2020 18:44:03 -0800 Peter Geoghegan wrote:
> On Mon, Nov 2, 2020 at 10:03 AM Peter Geoghegan wrote:
> > Actually, it seems better to always count num_index_tuples the old way
> > during cleanup-only index VACUUMs, despite the inaccuracy that that
> > creates with posting list tuples.
>
> Pushed a fix for this just now.
>
> Thanks for the report!

Sorry I couldn't give some more feedback on your thoughts in time... Thank you for your investigation and fix!

Regards,
Re: The ultimate extension hook.
On Thu, 24 Sep 2020 17:08:44 +1200 David Rowley wrote:
[...]
> I wondered if there was much in the way of use-cases like a traffic
> filter, or statement replication. I wasn't sure if it was a solution
> looking for a problem or not, but it seems like it could be productive
> to talk about possibilities here and make a judgement call based on if
> any alternatives exist today that will allow that problem to be solved
> sufficiently in another way.

If I understand the proposal correctly, this might enable traffic capture using a loadable extension. This kind of usage would allow replaying and validating any kind of traffic from one major version to another, eg. to look for regressions from the application point of view before a major upgrade.

I did such regression tests in the past. We were capturing production traffic using libpcap and replaying it with pgshark on an upgraded test environment. Very handy. However:

* libpcap can drop network packets under high load. This makes the capture painful to recover past the hole.
* it is useless with encrypted traffic.

So, +1 for such hooks.

Regards,
vacuum -vs reltuples on insert only index
Hello,

I've found a behavior change with pg_class.reltuples on btree indexes. With only insert activity on a table, when an index is processed, its related reltuples is set to 0. Here is a demo script:

  -- force index cleanup
  set vacuum_cleanup_index_scale_factor to 0;

  drop table if exists t;
  create table t as select i from generate_series(1, 100) i;
  create index t_i on t(i);

  -- after index creation its reltuples is correct
  select reltuples from pg_class where relname = 't_i';
  -- result: reltuples | 100

  -- vacuum sets index reltuples to 0
  vacuum t;
  select reltuples from pg_class where relname = 't_i';
  -- result: reltuples | 0

  -- analyze sets it back to the correct value
  analyze t;
  select reltuples from pg_class where relname = 't_i';
  -- result: reltuples | 100

  -- insert + vacuum reset it again to 0
  insert into t values(101);
  vacuum (verbose off, analyze on, index_cleanup on) t;
  select reltuples from pg_class where relname = 't_i';
  -- result: reltuples | 0

  -- delete + vacuum set it back to the correct value
  delete from t where i=10;
  vacuum (verbose off, analyze on, index_cleanup on) t;
  select reltuples from pg_class where relname = 't_i';
  -- result: reltuples | 100

  -- and back to 0 again with insert + vacuum
  insert into t values(102);
  vacuum (verbose off, analyze on, index_cleanup on) t;
  select reltuples from pg_class where relname = 't_i';
  -- result: reltuples | 0

Before 0d861bbb70, btvacuumpage was adding to the relation stats the number of live lines in the block using:

  stats->num_index_tuples += maxoff - minoff + 1;

After 0d861bbb70, it is set using the new variable nhtidslive:

  stats->num_index_tuples += nhtidslive;

However, nhtidslive is only incremented if callback (IndexBulkDeleteCallback) is set, which seems not to be the case on select-only workloads. A naive fix might be to use "maxoff - minoff + 1" when callback==NULL. Thoughts?

Regards,
Re: [patch] demote
On Tue, 18 Aug 2020 17:41:31 +0200 Jehan-Guillaume de Rorthais wrote:
> Hi,
>
> Please find in attachment v5 of the patch set rebased on master after various
> conflicts.
>
> Regards,
>
> On Wed, 5 Aug 2020 00:04:53 +0200
> Jehan-Guillaume de Rorthais wrote:
>
> > Demote now keeps backends with no active xid alive. Smart mode keeps all
> > backends: it waits for them to finish their xact and enter read-only. Fast
> > mode terminate backends wit an active xid and keeps all other ones.
> > Backends enters "read-only" using LocalXLogInsertAllowed=0 and flip it to -1
> > (check recovery state) once demoted.
> > During demote, no new session is allowed.
> >
> > As backends with no active xid survive, a new SQL admin function
> > "pg_demote(fast bool, wait bool, wait_seconds int)" had been added.

Just to keep the list informed, I found a race condition leading to backends trying to write to XLog after they processed the demote signal. Eg.:

  [postmaster] LOG: all backends in read only
  [checkpointer] LOG: demoting
  [backend] PANIC: cannot make new WAL entries during recovery
  STATEMENT: UPDATE pgbench_accounts [...]

Because of this, postmaster enters crash recovery while the demotion is still in progress.

I have a couple of other subjects to deal with right now, but I plan to get back to this soon.

Regards,
Re: [patch] demote
Hi, Please find in attachment v5 of the patch set rebased on master after various conflicts. Regards, On Wed, 5 Aug 2020 00:04:53 +0200 Jehan-Guillaume de Rorthais wrote: > Demote now keeps backends with no active xid alive. Smart mode keeps all > backends: it waits for them to finish their xact and enter read-only. Fast > mode terminate backends wit an active xid and keeps all other ones. > Backends enters "read-only" using LocalXLogInsertAllowed=0 and flip it to -1 > (check recovery state) once demoted. > During demote, no new session is allowed. > > As backends with no active xid survive, a new SQL admin function > "pg_demote(fast bool, wait bool, wait_seconds int)" had been added. > > Demote now relies on sigusr1 instead of hijacking sigterm/sigint and pmdie(). > The resulting refactoring makes the code much simpler, cleaner, with better > isolation of actions from the code point of view. > > Thanks to the refactoring, the patch now only adds one state to the state > machine: PM_DEMOTING. A second one could be use to replace: > > /* Demoting: start the Startup Process */ > if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0) > > with eg.: > > if (pmState == PM_DEMOTED) > > I believe it might be a bit simpler to understand, but the existing comment > might be good enough as well. The full state machine path for demote is: > > PM_DEMOTING /* wait for active xid backend to finish */ > PM_SHUTDOWN /* wait for checkpoint shutdown and its > various shutdown tasks */ > PM_SHUTDOWN && !CheckpointerPID /* aka PM_DEMOTED: start Startup process */ > PM_STARTUP > > Tests in "recovery/t/021_promote-demote.pl" grows from 13 to 24 tests, > adding tests on backend behaviors during demote and new function pg_demote(). > > On my todo: > > * cancel running checkpoint for fast demote ? 
> * forbid demote when PITR backup is in progress > * user documentation > * Robert's concern about snapshot during hot standby > * anything else reported to me > > Plus, I might be able to split the backend part and their signals of the patch > 0002 in its own patch if it helps the review. It would apply after 0001 and > before actual 0002. > > As there was no consensus and the discussions seemed to conclude this patch > set should keep growing to see were it goes, I wonder if/when I should add it > to the commitfest. Advice? Opinion? >From 90e26a7e2d53f7a8436de0b73eb57498f884de9d Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais Date: Fri, 31 Jul 2020 10:58:40 +0200 Subject: [PATCH 1/4] demote: setter functions for LocalXLogInsert local variable Adds functions extern LocalSetXLogInsertNotAllowed() and LocalSetXLogInsertCheckRecovery() to set the local variable LocalXLogInsert respectively to 0 and -1. These functions are declared as extern for future need in the demote patch. Function LocalSetXLogInsertAllowed() already exists and declared as static as it is not needed outside of xlog.h. --- src/backend/access/transam/xlog.c | 27 +++ src/include/access/xlog.h | 2 ++ 2 files changed, 25 insertions(+), 4 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 09c01ed4ae..c0d79f192c 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -7711,7 +7711,7 @@ StartupXLOG(void) Insert->fullPageWrites = lastFullPageWrites; LocalSetXLogInsertAllowed(); UpdateFullPageWrites(); - LocalXLogInsertAllowed = -1; + LocalSetXLogInsertCheckRecovery(); if (InRecovery) { @@ -8219,6 +8219,25 @@ LocalSetXLogInsertAllowed(void) InitXLOGAccess(); } +/* + * Make XLogInsertAllowed() return false in the current process only. + */ +void +LocalSetXLogInsertNotAllowed(void) +{ + LocalXLogInsertAllowed = 0; +} + +/* + * Make XLogInsertCheckRecovery() return false in the current process only. 
+ */ +void +LocalSetXLogInsertCheckRecovery(void) +{ + LocalXLogInsertAllowed = -1; +} + + /* * Subroutine to try to fetch and validate a prior checkpoint record. * @@ -9004,9 +9023,9 @@ CreateCheckPoint(int flags) if (shutdown) { if (flags & CHECKPOINT_END_OF_RECOVERY) - LocalXLogInsertAllowed = -1; /* return to "check" state */ + LocalSetXLogInsertCheckRecovery(); /* return to "check" state */ else - LocalXLogInsertAllowed = 0; /* never again write WAL */ + LocalSetXLogInsertNotAllowed(); /* never again write WAL */ } /* @@ -9159,7 +9178,7 @@ CreateEndOfRecoveryRecord(void) END_CRIT_SECTION(); - LocalXLogInsertAllowed = -1; /* return to "check" state */ + LocalSetXLogInsertCheckRecovery(); /* return to "check" state */ } /* diff --git a/src/include/access/xlog.h b/src/incl
Re: [patch] demote
Hi,

Yet another summary + patch + tests.

Demote now keeps backends with no active xid alive. Smart mode keeps all backends: it waits for them to finish their xact and enter read-only. Fast mode terminates backends with an active xid and keeps all the other ones. Backends enter "read-only" using LocalXLogInsertAllowed=0 and flip it back to -1 (check recovery state) once demoted. During demote, no new session is allowed.

As backends with no active xid survive, a new SQL admin function "pg_demote(fast bool, wait bool, wait_seconds int)" has been added.

Demote now relies on sigusr1 instead of hijacking sigterm/sigint and pmdie(). The resulting refactoring makes the code much simpler and cleaner, with better isolation of actions from the code point of view.

Thanks to the refactoring, the patch now only adds one state to the state machine: PM_DEMOTING. A second one could be used to replace:

  /* Demoting: start the Startup Process */
  if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0)

with eg.:

  if (pmState == PM_DEMOTED)

I believe it might be a bit simpler to understand, but the existing comment might be good enough as well. The full state machine path for demote is:

  PM_DEMOTING                      /* wait for backends with an active xid to finish */
  PM_SHUTDOWN                      /* wait for checkpoint shutdown and its various shutdown tasks */
  PM_SHUTDOWN && !CheckpointerPID  /* aka PM_DEMOTED: start Startup process */
  PM_STARTUP

Tests in "recovery/t/021_promote-demote.pl" grow from 13 to 24 tests, adding tests on backend behavior during demote and on the new function pg_demote().

On my todo:

* cancel running checkpoint for fast demote?
* forbid demote when a PITR backup is in progress
* user documentation
* Robert's concern about snapshots during hot standby
* anything else reported to me

Plus, I might be able to split the backend part and their signals out of patch 0002 into its own patch if it helps the review. It would apply after 0001 and before the actual 0002.
As there was no consensus and the discussions seemed to conclude this patch set should keep growing to see were it goes, I wonder if/when I should add it to the commitfest. Advice? Opinion? Regards, >From da3c4575f8ea40c089483b9cfa209db4993148ff Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais Date: Fri, 31 Jul 2020 10:58:40 +0200 Subject: [PATCH 1/4] demote: setter functions for LocalXLogInsert local variable Adds functions extern LocalSetXLogInsertNotAllowed() and LocalSetXLogInsertCheckRecovery() to set the local variable LocalXLogInsert respectively to 0 and -1. These functions are declared as extern for future need in the demote patch. Function LocalSetXLogInsertAllowed() already exists and declared as static as it is not needed outside of xlog.h. --- src/backend/access/transam/xlog.c | 27 +++ src/include/access/xlog.h | 2 ++ 2 files changed, 25 insertions(+), 4 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 756b838e6a..25a9f78690 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -7711,7 +7711,7 @@ StartupXLOG(void) Insert->fullPageWrites = lastFullPageWrites; LocalSetXLogInsertAllowed(); UpdateFullPageWrites(); - LocalXLogInsertAllowed = -1; + LocalSetXLogInsertCheckRecovery(); if (InRecovery) { @@ -8219,6 +8219,25 @@ LocalSetXLogInsertAllowed(void) InitXLOGAccess(); } +/* + * Make XLogInsertAllowed() return false in the current process only. + */ +void +LocalSetXLogInsertNotAllowed(void) +{ + LocalXLogInsertAllowed = 0; +} + +/* + * Make XLogInsertCheckRecovery() return false in the current process only. + */ +void +LocalSetXLogInsertCheckRecovery(void) +{ + LocalXLogInsertAllowed = -1; +} + + /* * Subroutine to try to fetch and validate a prior checkpoint record. 
* @@ -9004,9 +9023,9 @@ CreateCheckPoint(int flags) if (shutdown) { if (flags & CHECKPOINT_END_OF_RECOVERY) - LocalXLogInsertAllowed = -1; /* return to "check" state */ + LocalSetXLogInsertCheckRecovery(); /* return to "check" state */ else - LocalXLogInsertAllowed = 0; /* never again write WAL */ + LocalSetXLogInsertNotAllowed(); /* never again write WAL */ } /* @@ -9159,7 +9178,7 @@ CreateEndOfRecoveryRecord(void) END_CRIT_SECTION(); - LocalXLogInsertAllowed = -1; /* return to "check" state */ + LocalSetXLogInsertCheckRecovery(); /* return to "check" state */ } /* diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 221af87e71..8c9cadc6da 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -306,6 +306,8 @@ extern RecoveryState GetRecoveryState(void); extern bool HotStandbyActive(void); extern bool HotStandbyActiveInReplay(void); extern bool XLogInsertAllowed(void); +extern void LocalSetXLogInsertNotAllowed(void); +extern void LocalSetXLogI
BUG #15285: Query used index over field with ICU collation in some cases wrongly return 0 rows
Hi,

I'm bumping this thread on pgsql-hackers, hopefully it will attract some more opinions/discussion.

Should we try to fix this issue or not? This is clearly an upstream bug. It has been reported, including regression tests, but it hasn't moved for two years now. If we choose not to fix it on our side using eg. a workaround (see patch), I suppose this small bug should be documented somewhere so people are not lost alone in the wild.

Opinions?

Regards,

Begin forwarded message:

Date: Sat, 13 Jun 2020 00:43:22 +0200
From: Jehan-Guillaume de Rorthais
To: Thomas Munro , Peter Geoghegan
Cc: Роман Литовченко , PostgreSQL mailing lists
Subject: Re: BUG #15285: Query used index over field with ICU collation in some cases wrongly return 0 rows

On Fri, 12 Jun 2020 18:40:55 +0200 Jehan-Guillaume de Rorthais wrote:
> On Wed, 10 Jun 2020 00:29:33 +0200
> Jehan-Guillaume de Rorthais wrote:
> [...]
> > After playing with ICU regression tests, I found functions ucol_strcollIter
> > and ucol_nextSortKeyPart are safe. I'll do some performance tests and report
> > here.
>
> I did some benchmarks. See attachment for the script and its header to
> reproduce.
>
> It sorts 935895 french phrases from 0 to 122 chars with an average of 49.
> Performance tests were done on current master HEAD (buggy) and using the patch
> in attachment, relying on ucol_strcollIter.
>
> My preliminary test with ucol_getSortKey was catastrophic, as we might
> expect: 15-17x slower than the current HEAD. So I removed it from actual
> tests. I didn't try with ucol_nextSortKeyPart though.
>
> Using ucol_strcollIter performs ~20% slower than HEAD on UTF8 databases, but
> this might be acceptable. Here are the numbers:
>
>    DB Encoding   HEAD   strcollIter   ratio
>    UTF8          2.74   3.27          1.19x
>    LATIN1        5.34   5.40          1.01x
>
> I plan to add a regression test soon.

Please find in attachment the second version of the patch, with a regression test.
Regards,

--
Jehan-Guillaume de Rorthais
Dalibo

>From bb428135490caafe23562e3dcd9925d7d7c5a7be Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais
Date: Thu, 11 Jun 2020 18:08:31 +0200
Subject: [PATCH] Replace buggy ucol_strcoll* funcs with ucol_strcollIter

Functions ucol_strcoll and ucol_strcollUTF8 break some collation rules.
This leads to wrong sort order or wrong results from indexes using such
collations. See bug report #15285 for details:
http://postgr.es/m/153201618542.1404.3611626898935613264%40wrigleys.postgresql.org

This fix replaces the ucol_strcoll* functions with ucol_strcollIter(),
which is not affected by this bug.

Reported-by: Roman Lytovchenko
Analysed-by: Peter Geoghegan
Author: Jehan-Guillaume de Rorthais
---
 src/backend/utils/adt/varlena.c               | 54 ++++++++++---------
 src/include/utils/pg_locale.h                 | 14 -----
 .../regress/expected/collate.icu.utf8.out     | 12 ++++
 src/test/regress/sql/collate.icu.utf8.sql     |  8 +++
 4 files changed, 56 insertions(+), 32 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 2eaabd6231..f156c00a1a 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1638,34 +1638,43 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
 #ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
 			if (GetDatabaseEncoding() == PG_UTF8)
 			{
 				UErrorCode	status;
+				UCharIterator iter1,
+							iter2;
 
 				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-										  arg1, len1,
-										  arg2, len2,
-										  &status);
+
+				uiter_setUTF8(&iter1, arg1, len1);
+				uiter_setUTF8(&iter2, arg2, len2);
+
+				result = ucol_strcollIter(mylocale->info.icu.ucol,
+										  &iter1, &iter2, &status);
 				if (U_FAILURE(status))
 					ereport(ERROR,
							(errmsg("collation failed: %s", u_errorName(status))));
 			}
 			else
-#endif
 			{
+				UErrorCode	status;
+				UCharIterator iter1,
+							iter2;
 				int32_t		ulen1,
 							ulen2;
 				UChar	   *uchar1,
 						   *uchar2;
 
+				status = U_ZERO_ERROR;
+
 				ulen1 = icu_to_uchar(&uchar1, arg1, len1);
 				ulen2 = icu_to_uchar(&uchar2, arg2, len2);
 
-				result = ucol_strcoll(mylocale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
+				uiter_setString(&iter1, uchar1, ulen1);
+				uiter_setString(&iter2, uchar2, ulen2);
+
+				result = ucol_strcollIter(mylocale->info.icu.ucol,
+										  &iter1, &iter2, &status);
 
 				pfree(uchar1);
 				pfree(uchar2);
@@ -2352,34 +2361,43 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 	if (sss->locale->provider == COLLPROVIDER_ICU)
 	{
 #ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
 		if (GetDatabaseEncoding() == PG_UTF8)
 		{
 			UErrorCode	status;
+			UCh
Re: [patch] demote
On Tue, 14 Jul 2020 12:49:51 -0700 Andres Freund wrote: > Hi, > > On 2020-07-14 17:26:37 +0530, Amul Sul wrote: > > On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais > > wrote: > > > > > > Hi, > > > > > > Another summary + patch + tests. > > > > > > This patch supports 2PC. The goal is to keep them safe during > > > demote/promote actions so they can be committed/rollbacked later on a > > > primary. See tests. > > > > Wondering is it necessary to clear prepared transactions from shared memory? > > Can't simply skip clearing and restoring prepared transactions while > > demoting? > > Recovery doesn't use the normal PGXACT/PGPROC mechanisms to store > transaction state. So I don't think it'd be correct to leave them around > in their previous state. We'd likely end up with incorrect snapshots > if a demoted node later gets promoted... Indeed. I experienced it while debugging. PGXACT/PGPROC/locks need to be cleared.
Re: [patch] demote
Hi,

Another summary + patch + tests.

This patch supports 2PC. The goal is to keep prepared transactions safe during demote/promote actions so they can be committed or rolled back later on a primary. See tests.

The checkpointer is now shut down after the demote shutdown checkpoint. This removes some useless code complexity, e.g. signaling the postmaster from the checkpointer to keep the demotion going.

Cascaded replication is now supported. WAL senders stay active during demotion but set their local "am_cascading_walsender = true". It has been a rough debug session (thank you rr and tests!) on my side, but I believe it was worth it: WAL senders should stay connected during the demote actions for future features, e.g. triggering a switchover over the replication protocol using an admin function.

The first tests have been added in "recovery/t/021_promote-demote.pl". I'll add some more tests in future versions.

I believe the patch is ready for some preliminary tests, advice or directions. On my todo list:

* study how to only disconnect or cancel active RW backends
* ...then add a pg_demote() admin function
* cancel a running checkpoint for fast demote?
* user documentation
* Robert's concern about snapshots during hot standby
* some more coding style cleanup/refactoring
* anything else reported to me :)

Thanks,

On Fri, 3 Jul 2020 00:12:10 +0200 Jehan-Guillaume de Rorthais wrote:
> Hi,
>
> Here is a small activity summary since last report.
>
> On Thu, 25 Jun 2020 19:27:54 +0200
> Jehan-Guillaume de Rorthais wrote:
> [...]
> > I hadn't time to investigate Robert's concern about shared memory for
> > snapshot during recovery.
>
> I hadn't time to dig very far, but I suppose this might be related to the
> comment in ProcArrayShmemSize(). If I'm right, then it seems the space is
> already allocated as long as hot_standby is enabled. I realize it doesn't
> means we are on the safe side of the fence though. I still have to have a
> better understanding on this.
> > The patch doesn't deal with prepared xact yet. Testing
> > "start->demote->promote" raise an assert if some prepared xact exist. I
> > suppose I will rollback them during demote in next patch version.
>
> Rollback all prepared transaction on demote seems easy. However, I realized
> there's no point to cancel them. After the demote action, they might still be
> committed later on a promoted instance.
>
> I am currently trying to clean shared memory for existing prepared transaction
> so they are handled by the startup process during recovery.
> I've been able to clean TwoPhaseState and the ProcArray. I'm now in the
> process to clean remaining prepared xact locks.
>
> Regards,

--
Jehan-Guillaume de Rorthais
Dalibo

>From 4470772702273c720cdea942ed229d59f3a70318 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais
Date: Fri, 10 Apr 2020 18:01:45 +0200
Subject: [PATCH 1/2] Support demoting instance from production to standby

Architecture:

* creates a shutdown checkpoint on demote
* use DB_DEMOTING state in controlfile
* try to handle subsystems init correctly during demote
* keep some sub-processes alive: stat collector, bgwriter and
  optionally archiver or wal senders
* the code currently use USR1 to signal the wal senders to check if
  they must enable the cascading mode
* ShutdownXLOG takes a boolean arg to handle demote differently
* the checkpointer is restarted for code simplicity

Trivial manual tests:

* make check: OK
* make check-world: OK
* start in production->demote->demote: OK
* start in production->demote->stop: OK
* start in production->demote->promote: OK
* switch roles between primary and standby (switchover): OK
* commit and check 2PC after demote/promote
* commit and check 2PC after switchover

Discuss/Todo:

* cancel or kill active and idle in xact RW backends
* keep RO backends
* pg_demote() function?
* code reviewing, arch, analysis, checks, etc
* investigate snapshots shmem needs/init during recovery compare to production
* add doc
* cancel running checkpoint during demote
* replace with a END_OF_PRODUCTION xlog record?
---
 src/backend/access/transam/twophase.c   |  95 +++
 src/backend/access/transam/xlog.c       | 315 ++++
 src/backend/postmaster/checkpointer.c   |  28 +++
 src/backend/postmaster/postmaster.c     | 261 +++-
 src/backend/replication/walsender.c     |   1 +
 src/backend/storage/ipc/procarray.c     |   2 +
 src/backend/storage/ipc/procsignal.c    |   4 +
 src/backend/storage/lmgr/lock.c         |  12 +
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/bin/pg_ctl/pg_ctl.c                 | 111 +
 src/include/access/twophase.h           |   1 +
 src/include/access/xlog.h               |  19 +-
 src/include/catalog/pg_control.h        |   1 +
 src/include/libpq/libpq-be.h
Re: [patch] demote
Hi,

Here is a small activity summary since last report.

On Thu, 25 Jun 2020 19:27:54 +0200 Jehan-Guillaume de Rorthais wrote:
[...]
> I hadn't time to investigate Robert's concern about shared memory for snapshot
> during recovery.

I hadn't time to dig very far, but I suppose this might be related to the comment in ProcArrayShmemSize(). If I'm right, then it seems the space is already allocated as long as hot_standby is enabled. I realize it doesn't mean we are on the safe side of the fence though. I still need to get a better understanding of this.

> The patch doesn't deal with prepared xact yet. Testing
> "start->demote->promote" raise an assert if some prepared xact exist. I
> suppose I will rollback them during demote in next patch version.

Rolling back all prepared transactions on demote seems easy. However, I realized there's no point in canceling them: after the demote action, they might still be committed later on a promoted instance.

I am currently trying to clean shared memory for existing prepared transactions so they are handled by the startup process during recovery. I've been able to clean TwoPhaseState and the ProcArray. I'm now in the process of cleaning the remaining prepared xact locks.

Regards,
Re: Remove Deprecated Exclusive Backup Mode
On Thu, 2 Jul 2020 12:32:14 +0200 Magnus Hagander wrote:
[...]
> > non-exclusive backup...this is not that easy anymore. And
> > pg_is_in_backup() is quite misleading if the admin found it without reading
> > doc. Maybe an admin
>
> Yeah, as it is now it should really be called pg_is_in_exclusive_backup().

+1

We came to the same conclusion while discussing with Gilles Darold this morning. But considering the discussion below, this might not be needed anyway.

> > function to list in progress non-exclusive backup and related backend pid
> > might be a good start?
>
> I think it would make perfect sense to show manual (exclusive or
> non-exclusive) base backups in pg_stat_progress_basebackup. There's no
> fundamental need that one should only be for base backups taken with
> pg_basebackup. In fact, I'd argue that's a mistake in the view in v13 that
> it does not include this information.
>
> It could have "phase" set to "manual non-exclusive" for example, and leave
> the other fields at NULL.

Even in manual non-exclusive mode we could define different phases:

* waiting for checkpoint to finish
* external backup
* waiting for wal archiving to finish

So maybe exposing the backup label and/or the method might be useful as well?

> I guess in theory even for exclusive ones, with the pid column set to NULL.

We are discussing removing the mandatory connection during non-exclusive backups. If it becomes optional at some point, this PID might be NULL as well...

> (As Stephen mentioned at some point in the future we might also want to
> extend it to allow these tools to report their progress as well into it,
> probably by just calling an admin function on the connection that they
> already have).
>
> > Tools like pg_basebackup (and probably pgbackrest/barman to some extends)
> > still need the backup to abort on disconnection.
> > Maybe it could flag its session using the replication protocol or call a new
> > admin function or load a hook on session-shutdown to keep previous API
> > behavior?
>
> Suddenly requiring a replication protocol connection for one thing, when
> all their other things are done over a normal one, seems really terrible.

Sorry, I explained that poorly. Adding such a flag to the replication protocol was only meant for pg_basebackup. Other tools would use an admin function to achieve the same goal.

> But having an admin function would work.
>
> But anything requiring loading of a hook is going to raise the bar
> massively as suddenly somebody needs to install a compiled binary on the
> server in order to use it.

Contrib already provides many useful tools. How would it be different from e.g. pg_archivecleanup? IIRC, an end-of-session hook was discussed in the past, but I did not follow the thread at the time. Anyway, an admin function would probably do the job as well.

> But calling a separate function is pretty much what I suggested upthread
> (except I suggested a new version of pg_stop_backup, but implementation wise)
>
> But I'm not sure what previous API behavior you're looking to preserve here?

Current non-exclusive backups are automatically aborted when the pg_start_backup() session disappears without calling pg_stop_backup(). I suppose this was the API behavior Robert is concerned about losing when he writes about people forgetting to end their backups.

Now that you mention Stephen's idea of reporting the progress of a backup, this kind of API could be used as a watchdog as well: if the frontend did not report before some optional (user-defined) timeout, abort the backup. We would have a behavior "similar" to the current one.

To be clear, I suppose these admin functions could be called from any backend session, taking some kind of backup id (e.g. the token you were talking about upthread).

Regards,
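The watchdog idea sketched above (abort a backup whose client stopped reporting progress before a user-defined timeout) can be illustrated outside of PostgreSQL. A minimal Python sketch, not PostgreSQL code; the `BackupWatchdog` class, its method names, and the token-based registry are all hypothetical illustrations of the on-list idea:

```python
import time

class BackupWatchdog:
    """Hypothetical sketch of the watchdog discussed on-list: keep a
    last-report timestamp per backup token, and periodically abort
    backups whose clients went silent past a user-defined timeout."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.backups = {}  # token -> timestamp of last progress report

    def start_backup(self, token, now=None):
        self.backups[token] = now if now is not None else time.monotonic()

    def report_progress(self, token, now=None):
        # Called by the client tool; refreshes the deadline for this backup.
        self.backups[token] = now if now is not None else time.monotonic()

    def reap(self, now=None):
        # "Abort" (here: just forget) every backup whose client went silent.
        now = now if now is not None else time.monotonic()
        expired = [t for t, last in self.backups.items()
                   if now - last > self.timeout_s]
        for token in expired:
            del self.backups[token]
        return expired

wd = BackupWatchdog(timeout_s=60)
wd.start_backup("backup-1", now=0.0)
wd.start_backup("backup-2", now=0.0)
wd.report_progress("backup-2", now=50.0)
print(wd.reap(now=70.0))  # -> ['backup-1']: it went silent past 60s
```

A server-side reaper calling something like `reap()` on a timer would give roughly the "similar" abort-on-silence behavior described above, without requiring a persistent connection.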
Re: Remove Deprecated Exclusive Backup Mode
On Wed, 1 Jul 2020 15:58:57 -0400 Robert Haas wrote:
> On Wed, Jul 1, 2020 at 3:50 PM Magnus Hagander wrote:
> > As far as I've seen, the one thing that people have problems with in the
> > exclusive mode backups are precisely the fact that they have to keep a
> > persistent conneciton open, and thus it cannot work together with backup
> > software that is limited to only supporting running a pre- and a post
> > script.
> >
> > Something like I have suggested here is to solve *that* problem. I don't
> > think anybody actually explicitly wants "exclusive backups" -- they want a
> > backup solution that plugs into their world of pre/post scripts. And if we
> > can make that one work in a safer way than the current exclusive backups,
> > ohw is that not an improvement?
>
> Yeah, I guess that's a pretty fair point. I have to confess to having
> somewhat limited enthusiasm for adding a third mode here, but it might
> be worth it.
>
> It seems pretty well inevitable to me that people are going to forget
> to end them.

There are so many ways an admin could break their backup process and not notice. Even with the current non-exclusive backups, a badly written script might create unrecoverable backups.

With regard to your concern, monitoring in-progress backups might be *one* answer, from the server side. This was "easy" with exclusive backups: we just monitor the backup_label age and warn if it is older than expected [1]. Using non-exclusive backups... this is not that easy anymore. And pg_is_in_backup() is quite misleading if the admin finds it without reading the doc. Maybe an admin function to list in-progress non-exclusive backups and the related backend PIDs might be a good start?

With such admin function facilities, client-side processes could monitor backups and decide to cancel them based on some internal backup policies/processes.

Tools like pg_basebackup (and probably pgbackrest/barman to some extent) still need the backup to abort on disconnection.
Maybe it could flag its session using the replication protocol or call a new admin function or load a hook on session-shutdown to keep previous API behavior? Regards, [1] https://github.com/OPMDG/check_pgactivity/#user-content-backup_label_age-8.1
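The backup_label-age monitoring described above ([1] is check_pgactivity's backup_label_age service) can be sketched in a few lines. A hedged Python sketch, not the actual check_pgactivity code; the function name and threshold are assumptions for illustration:

```python
import os
import time

def backup_label_age_warning(pgdata, max_age_s):
    """Return the age in seconds of PGDATA/backup_label if it is older
    than max_age_s, else None.

    With exclusive backups, a backup_label file sitting in the data
    directory for longer than a backup plausibly takes usually means
    someone forgot to call pg_stop_backup().
    """
    label = os.path.join(pgdata, "backup_label")
    if not os.path.exists(label):
        return None  # no exclusive backup in progress
    age = time.time() - os.path.getmtime(label)
    return age if age > max_age_s else None
```

This is exactly the kind of check that stops working with non-exclusive backups, since no backup_label file is left in the data directory while the backup runs, which is why the email argues for a server-side listing of in-progress backups instead.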
Re: [patch] demote
On Fri, 26 Jun 2020 16:14:38 +0900 (JST) Kyotaro Horiguchi wrote:
> Hello.
>
> At Thu, 25 Jun 2020 19:27:54 +0200, Jehan-Guillaume de Rorthais wrote in
> > Here is a summary of my work during the last few days on this demote
> > approach.
> >
> > Please, find in attachment v2-0001-Demote-PoC.patch and the comments in the
> > commit message and as FIXME in code.
> >
> > The patch is not finished or bug-free yet, I'm still not very happy with the
> > coding style, it probably lack some more code documentation, but a lot has
> > changed since v1. It's still a PoC to push the discussion a bit further
> > after being myself silent for some days.
> >
> > The patch is currently relying on a demote checkpoint. I understand a forced
> > checkpoint overhead can be massive and cause major wait/downtime. But I keep
> > this for a later step. Maybe we should be able to cancel a running
> > checkpoint? Or leave it to its synching work but discard the result without
> > wirting it to XLog?
>
> If we are going to dive so close to server shutdown, we can just
> utilize the restart-after-crash path, which we can assume to work
> reliably. The attached is a quite rough sketch, hijacking smart
> shutdown path for a convenience, of that but seems working. "pg_ctl
> -m s -W stop" lets server demote.

This was actually my very first toy PoC. However, resetting everything is far from the graceful demote I was seeking. Moreover, such a patch would not be able to evolve to, e.g., keeping read-only backends around.

> > I hadn't time to investigate Robert's concern about shared memory for
> > snapshot during recovery.
>
> The patch does all required clenaup of resources including shared
> memory, I believe. It's enough if we don't need to keep any resources
> alive?

Resetting everything might not be enough. If I understand Robert's concern correctly, it might actually need more shmem for hot standby transaction snapshots, or maybe some shmem initialized differently.

Regards,
Re: [patch] demote
Hi,

Here is a summary of my work during the last few days on this demote approach.

Please, find in attachment v2-0001-Demote-PoC.patch and the comments in the commit message and as FIXME in code.

The patch is not finished or bug-free yet, I'm still not very happy with the coding style, and it probably lacks some more code documentation, but a lot has changed since v1. It's still a PoC to push the discussion a bit further after being myself silent for some days.

The patch currently relies on a demote checkpoint. I understand a forced checkpoint overhead can be massive and cause major wait/downtime, but I keep this for a later step. Maybe we should be able to cancel a running checkpoint? Or leave it to its syncing work but discard the result without writing it to the WAL?

I hadn't time to investigate Robert's concern about shared memory for snapshots during recovery.

The patch doesn't deal with prepared xact yet. Testing "start->demote->promote" raises an assert if some prepared xacts exist. I suppose I will rollback them during demote in the next patch version.

I'm not sure how to divide this patch into multiple small independent steps. I suppose I could split it like:

1. add demote checkpoint
2. support demote: mostly postmaster, startup/xlog and checkpointer related code
3. cli using pg_ctl demote

...but I'm not sure it's worth it.
Regards,

>From 03c41dd706648cd20df90a128db64eee6b6dad97 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais
Date: Fri, 10 Apr 2020 18:01:45 +0200
Subject: [PATCH] Demote PoC

Changes:

* creates a demote checkpoint
* use DB_DEMOTING state in controlfile
* try to handle subsystems init correctly during demote
* keep some sub-processes alive: stat collector, checkpointer, bgwriter
  and optionally archiver or wal senders
* add signal PMSIGNAL_DEMOTING to start the startup process after the
  demote checkpoint
* ShutdownXLOG takes a boolean arg to handle demote differently

Trivial manual tests:

* make check: OK
* make check-world: OK
* start in production -> demote -> demote: OK
* start in production -> demote -> stop: OK
* start in production -> demote -> promote: NOK (2PC, see TODO),
  but OK with no prepared xact.

Discuss/Todo:

* rollback prepared xact
* cancel/kill active/idle in xact R/W backends
* pg_demote() function?
* some more code reviewing around StartupXlog
* investigate snapshots shmem needs/init during recovery compare to production
* add tap tests
* add doc
* how to handle checkpoint?
---
 src/backend/access/rmgrdesc/xlogdesc.c  |   9 +-
 src/backend/access/transam/xlog.c       | 287 +++-
 src/backend/postmaster/checkpointer.c   |  22 ++
 src/backend/postmaster/postmaster.c     | 250 -
 src/backend/storage/ipc/procsignal.c    |   4 +
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/bin/pg_ctl/pg_ctl.c                 | 111 +
 src/include/access/xlog.h               |  18 +-
 src/include/catalog/pg_control.h        |   2 +
 src/include/libpq/libpq-be.h            |   7 +-
 src/include/postmaster/bgwriter.h       |   1 +
 src/include/storage/pmsignal.h          |   1 +
 src/include/storage/procsignal.h        |   1 +
 src/include/utils/pidfile.h             |   1 +
 14 files changed, 537 insertions(+), 179 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 1cd97852e8..5aeaff18f8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -40,7 +40,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
 	if (info == XLOG_CHECKPOINT_SHUTDOWN ||
-		info == XLOG_CHECKPOINT_ONLINE)
+		info == XLOG_CHECKPOINT_ONLINE ||
+		info == XLOG_CHECKPOINT_DEMOTE)
 	{
 		CheckPoint *checkpoint = (CheckPoint *) rec;
@@ -65,7 +66,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 				 checkpoint->oldestCommitTsXid,
 				 checkpoint->newestCommitTsXid,
 				 checkpoint->oldestActiveXid,
-				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
+				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" :
+				 (info == XLOG_CHECKPOINT_DEMOTE) ? "demote" : "online");
 	}
 	else if (info == XLOG_NEXTOID)
 	{
@@ -185,6 +187,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_DEMOTE:
+			id = "CHECKPOINT_DEMOTE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e455384b5b..0e18e546ba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6301,6 +6301,13 @@ CheckRequiredParameterValues(void)
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
+/*
+ * FIXME demote: part of the code here assume there's no other active
+
Re: [patch] demote
On Thu, 18 Jun 2020 11:22:47 -0400 Robert Haas wrote:
> On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao wrote:
> > ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
> > What about the following procedure?
> >
> > 1. Demote the primary to a standby. Then this demoted standby is read-only.
> > 2. The orignal standby automatically establishes the cascading replication
> > connection with the demoted standby.
> > 3. Wait for all the WAL records available in the demoted standby to be
> > streamed to the orignal standby.
> > 4. Promote the original standby to new primary.
> > 5. Change primary_conninfo in the demoted standby so that it establishes
> > the replication connection with new primary.
> >
> > So it seems enough to implement "demote" feature for a clean switchover.
>
> There's something to that idea. I think it somewhat depends on how
> invasive the various operations are. For example, I'm not really sure
> how feasible it is to demote without a full server restart that kicks
> out all sessions. If that is required, it's a significant disadvantage
> compared to ASRO. On the other hand, if a machine can be demoted just
> by kicking out R/W sessions, as ASRO currently does, then maybe
> there's not that much difference. Or maybe both designs are subject to
> improvement and we can do something even less invasive...

Considering the current demote patch improvements, I was planning to dig in the following direction:

* add a new state in the state machine where all backends are idle
* this new state forbids any new writes, in the same fashion as we do on
  standby nodes
* this state could either wait for the end of transactions, or cancel/kill
  RW backends, in the same fashion the current smart/fast stops do
* from this state, we might then roll back pending prepared transactions,
  stop other sub-processes etc. (as the current patch does), and demote
  safely to PM_RECOVERY or PM_HOT_STANDBY (depending on the setup).

Is this something worth considering?
Maybe the code will end up so close to ASRO that it would just be a kind of fusion of both patches? I don't know, I haven't looked at the ASRO patch yet.

> One thing I think people are going to want to do is have the master go
> read-only if it loses communication to the rest of the network, to
> avoid or at least mitigate split-brain. However, such network
> interruptions are often transient, so it might not be uncommon to
> briefly go read-only due to a network blip, but then recover quickly
> and return to a read-write state. It doesn't seem to matter much
> whether that read-only state is a new kind of normal operation (like
> what ASRO would do) or whether we've actually returned to a recovery
> state (as demote would do) but the collateral effects of the state
> change do matter.

Well, triggering such actions (demote or read-only) usually comes from an external decision, hopefully relying on at least some quorum, and being able to escalate to watchdog or fencing is required. Most tools around will need to demote or fence.

It seems dangerous to flip between read-only/read-write on a bunch of cluster nodes. It could quickly get messy, especially since a former primary with non-replicated data could automatically replicate from a new primary without screaming...

Regards,
Re: [patch] demote
On Thu, 18 Jun 2020 11:15:02 -0400 Robert Haas wrote:
> On Thu, Jun 18, 2020 at 6:02 AM Jehan-Guillaume de Rorthais wrote:
> > At some expense, Admin can already set the system as readonly from the
> > application point of view, using:
> >
> > alter system set default_transaction_read_only TO on;
> > select pg_reload_conf();
> >
> > Current RW xact will finish, but no other will be allowed.
>
> That doesn't block all WAL generation, though:
>
> rhaas=# alter system set default_transaction_read_only TO on;
> ALTER SYSTEM
> rhaas=# select pg_reload_conf();
>  pg_reload_conf
> ----------------
>  t
> (1 row)
> rhaas=# cluster pgbench_accounts_pkey on pgbench_accounts;
> rhaas=#

Yes, this, and the fact that any user can easily switch transaction_read_only back to off. This was a terrible example. My point was that ALTER SYSTEM READ ONLY as described here doesn't feel like a required user feature outside of the demote scope. It might be useful for the demote process, but only from the core point of view, without user interaction. There seems to be no other purpose from the admin standpoint.

> There's a bunch of other things it also doesn't block, too. If you're
> trying to switch to a new primary, you really want to stop WAL
> generation completely on the old one. Otherwise, you can't guarantee
> that the machine you're going to promote is completely caught up,
> which means you might lose some changes, and you might have to
> pg_rewind the old master.

Yes, of course. I wasn't claiming transaction_read_only was useful in a switchover procedure, sorry for the confusion and the misleading comment.

Regards,
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, 18 Jun 2020 10:52:49 -0400 Robert Haas wrote:
[...]
> But what you want is that if it does happen to go down for some reason before
> all the WAL is streamed, you can bring it back up and finish streaming the
> WAL without generating any new WAL.

Thanks to cascading replication, this should already be possible without this READ ONLY mode, simply by staying in recovery mode, shouldn't it?

Regards,
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, 17 Jun 2020 12:07:22 -0400 Robert Haas wrote:
[...]
> > Commands that involve a whole
> > bunch of subtle interlocking --- and, therefore, aren't going to work if
> > anything has gone wrong already anywhere in the server --- seem like a
> > particularly poor thing to be hanging your HA strategy on.
>
> It's important not to conflate controlled switchover with failover.
> When there's a failover, you have to accept some risk of data loss or
> service interruption; but a controlled switchover does not need to
> carry the same risks and there are plenty of systems out there where
> it doesn't.

Yes. Maybe we should make sure we all use the same wording. I have already heard/read "failover", "controlled failover", "switchover" and "controlled switchover"; this is confusing.

My definition of switchover is: swapping the primary and secondary roles between two replicating instances, with no data loss. This is a controlled procedure where all steps must succeed to complete. If a step fails, the procedure falls back to the original primary with no data loss.

However, Wikipedia has a broader definition, including situations where the switchover is executed upon a failure: https://en.wikipedia.org/wiki/Switchover

Regards,
Re: [patch] demote
On Wed, 17 Jun 2020 11:14:47 -0700 Andres Freund wrote:
> Hi,
>
> On 2020-06-17 17:44:51 +0200, Jehan-Guillaume de Rorthais wrote:
> > As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar futur
> > objectives than mine, I decided to share the humble patch I am playing with
> > to step down an instance from primary to standby status.
>
> To make sure we are on the same page: What your patch intends to do is
> to leave the server running, but switch from being a primary to
> replicating from another system. Correct?

Yes. The instance status is demoted from "in production" to "in archive recovery". Of course, it will start replicating depending on the archive_mode/command and primary_conninfo setup.

> > Main difference with Amul's patch is that all backends must be disconnected
> > to process with the demote. Either we wait for them to disconnect (smart)
> > or we kill them (fast). This makes life much easier from the code point of
> > view, but much more naive as well. Eg. calling "SELECT pg_demote('fast')"
> > as an admin would kill the session, with no options to wait for the action
> > to finish, as we do with pg_promote(). Keeping read only session around
> > could probably be achieved using global barrier as Amul did, but without
> > all the complexity related to WAL writes prohibition.
>
> FWIW just doing that for normal backends isn't sufficient, you also have
> to deal with bgwriter, checkpointer, ... triggering WAL writes (FPWs due
> to hint bits, the checkpoint record, and some more).

In fact, the patch relies on existing code paths in the state machine. The startup process is started when the code enters the PM_NO_CHILDREN state.
This state is set when «These other guys should be dead already», as stated in the code:

    /* These other guys should be dead already */
    Assert(StartupPID == 0);
    Assert(WalReceiverPID == 0);
    Assert(BgWriterPID == 0);
    Assert(CheckpointerPID == 0);
    Assert(WalWriterPID == 0);
    Assert(AutoVacPID == 0);
    /* syslogger is not considered here */
    pmState = PM_NO_CHILDREN;

> > There's still some questions in the current patch. As I wrote, it's an
> > humble patch, a proof of concept, a bit naive.
> >
> > Does it worth discussing it and improving it further or do I miss something
> > obvious in this design that leads to a dead end?
>
> I don't think there's a fundamental issue, but I think it needs to deal
> with a lot more things than it does right now. StartupXLOG doesn't
> currently deal correctly with subsystems that are already
> initialized. And your patch doesn't make it so as far as I can tell.

If you are talking about the bgwriter, checkpointer, etc., then as far as I understand the current state machine, my patch actually deals with them.

Thank you for your feedback! I'll study how hard it would be to keep read-only backends around during the demote step.

Regards,
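The PM_NO_CHILDREN transition quoted above boils down to "refuse to enter the state while any auxiliary child is still alive". A toy Python model of that guard, purely illustrative and not actual postmaster code (the class, method names, and state strings are all hypothetical; only the list of child processes mirrors the asserts quoted above):

```python
class TinyPostmaster:
    """Toy model of the postmaster guard before PM_NO_CHILDREN: every
    auxiliary process must be gone before the state machine may enter
    PM_NO_CHILDREN, after which the startup process can be launched
    (e.g. for a demote)."""

    CHILDREN = ("startup", "walreceiver", "bgwriter",
                "checkpointer", "walwriter", "autovacuum")

    def __init__(self):
        # pid 0 means "not running", as with the postmaster's *PID variables
        self.pids = {name: 0 for name in self.CHILDREN}
        self.pm_state = "PM_WAIT_DEAD_END"

    def child_exited(self, name):
        self.pids[name] = 0
        self.maybe_enter_no_children()

    def maybe_enter_no_children(self):
        # "These other guys should be dead already"
        if all(pid == 0 for pid in self.pids.values()):
            self.pm_state = "PM_NO_CHILDREN"

pm = TinyPostmaster()
pm.pids["checkpointer"] = 4242      # one child still alive
pm.maybe_enter_no_children()
print(pm.pm_state)                  # -> PM_WAIT_DEAD_END
pm.child_exited("checkpointer")
print(pm.pm_state)                  # -> PM_NO_CHILDREN
```

The point of the sketch is the invariant, not the mechanism: the demote patch reuses the existing "all children dead" funnel instead of inventing a new synchronization path.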