Re: Sorting regression of text function result since commit 586b98fdf1aae

2023-12-12 Thread Jehan-Guillaume de Rorthais
On Mon, 11 Dec 2023 15:43:12 -0500
Tom Lane  wrote:

> Jehan-Guillaume de Rorthais  writes:
> > It looks like since 586b98fdf1aae, the result type collation of
> > "convert_from" is forced to "C", like the patch does for type "name",
> > instead of the "default" collation for type "text".  
> 
> Well, convert_from() inherits its result collation from the input,
> per the normal rules for collation assignment [1].
> 
> > Looking at hints in the header comment of function "exprCollation", I poked
> > around and found that the result collation wrongly follow the input
> > collation in this case.  
> 
> It's not "wrong", it's what the SQL standard requires.

Mh, OK. This is at least a surprising behavior. Having a non-data-related
argument impact the result collation seems counter-intuitive. But I
understand this is per the standard, so no need to discuss it.
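
For the record, pg_collation_for() is a quick way to check which collation an
expression ends up with. On v12 and later, for the convert_from() call from my
report, I would expect something like:

  SELECT pg_collation_for(convert_from(reg, 'UTF8')) FROM reg LIMIT 1;
  -- "C", inherited from the 'UTF8'::name argument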

> > I couldn't find anything explaining this behavior in the changelog. It looks
> > like a regression to me, but if this is actually expected, maybe this
> > deserve some documentation patch?  
> 
> The v12 release notes do say
> 
> Type name now behaves much like a domain over type text that has
> default collation “C”.

Sure, and I saw it, but reading this entry, I couldn't guess it could have such
an implication on the text result of a function call. That's why I hunted for
the precise commit and was surprised to find this was the actual change.

> You'd have similar results from an expression involving such a domain,
> I believe.
> 
> I'm less than excited about patching the v12 release notes four
> years later.  Maybe, if this point had come up in a more timely
> fashion, we'd have mentioned it --- but it's hardly possible to
> cover every potential implication of such a change in the
> release notes.

This could have been documented on the collation concept page, as a trap to be
aware of. A link from the release notes to such a small paragraph would have
been enough to warn devs this might have implications when mixed with other
collatable types. But I understand we cannot document all the traps paving the
way to the standard anyway.

Thank you for your explanation!

Regards,




Sorting regression of text function result since commit 586b98fdf1aae

2023-12-11 Thread Jehan-Guillaume de Rorthais
Hi,

A customer found what looks like a sort regression while testing his code from
v11 on a higher version. We hunted this regression down to commit 586b98fdf1aae,
introduced in v12.

Consider the following test case:

  createdb -l fr_FR.utf8 -T template0 reg
  psql reg <<<"
  BEGIN;
  CREATE TABLE IF NOT EXISTS reg
  (
id bigint NOT NULL,
reg bytea NOT NULL
  );
  
  INSERT INTO reg VALUES
(1, convert_to( 'aaa', 'UTF8')),
(2, convert_to( 'aa}', 'UTF8'));
  
  SELECT id FROM reg ORDER BY convert_from(reg, 'UTF8');"

With the parent commit 68f6f2b7395fe, the result is:

   id 
  ────
    2
    1

And in 586b98fdf1aae:

   id 
  ────
    1
    2

Looking at the plans, the sort nodes are different:

* 68f6f2b7395fe: Sort Key: (convert_from(reg, 'UTF8'::name))
* 586b98fdf1aae: Sort Key: (convert_from(reg, 'UTF8'::name)) COLLATE "C"

It looks like since 586b98fdf1aae, the result type collation of "convert_from"
is forced to "C", like the patch does for type "name", instead of the "default"
collation for type "text".

Looking at hints in the header comment of function "exprCollation", I poked
around and found that the result collation wrongly follows the input collation
in this case. With 586b98fdf1aae:

  -- 2nd parameter type resolved as "name" so collation forced to "C"
  SELECT id FROM reg ORDER BY convert_from(reg, 'UTF8');
  --   1
  --   2

  -- Collation of 2nd parameter is forced to something else
  SELECT id FROM reg ORDER BY convert_from(reg, 'UTF8' COLLATE "default");
  --  2
  --  1
  -- Sort
  --   Sort Key: (convert_from(reg, 'UTF8'::name COLLATE "default"))
  --   ->  Seq Scan on reg

It seems that because the second parameter type is "name", the result collation
becomes "C" instead of being the collation associated with the "text" type:
"default".

I couldn't find anything explaining this behavior in the changelog. It looks
like a regression to me, but if this is actually expected, maybe this deserves
some documentation patch?

Regards,




Re: Query execution in Perl TAP tests needs work

2023-11-06 Thread Jehan-Guillaume de Rorthais
On Wed, 18 Oct 2023 18:25:01 +0200
Alvaro Herrera  wrote:

> On 2023-Oct-18, Robert Haas wrote:
> 
> > Without FFI::Platypus, we have to write Perl code that can speak the
> > wire protocol directly. Basically, we're writing our own PostgreSQL
> > driver for Perl, though we might need only a subset of the things a
> > real driver would need to handle, and we might add some extra things,
> > like code that can send intentionally botched protocol messages.  
> 
> We could revive the old src/interfaces/perl5 code, which was a libpq
> wrapper -- at least the subset of it that the tests need.  It was moved
> to gborg by commit 9a0b4d7f8474 and a couple more versions were made
> there, which can be seen at
> https://www.postgresql.org/ftp/projects/gborg/pgperl/stable/,
> version 2.1.1 being apparently the latest.  The complete driver was
> about 3000 lines, judging by the commit that removed it.  Presumably we
> don't need the whole of that.

+1 to test this. I can give it some time to revive it and post results here if
you agree and no one thinks of some showstopper.




Re: Detoasting optionally to make Explain-Analyze less misleading

2023-09-13 Thread Jehan-Guillaume de Rorthais
Hi Stepan & all,

On Tue, 12 Sep 2023 17:16:00 +0200
stepan rutz  wrote:

...
> Attached a new patch. Hoping for feedback,

Nice addition to EXPLAIN!

On the feature front, what about adding the actual detoasting/serializing time
to the EXPLAIN output?

That could be:

  =>  explain (analyze,serialize,costs off,timing off) 
  select * from test_detoast;
                          QUERY PLAN
  ──────────────────────────────────────────────────────────
   Seq Scan on public.test_detoast (actual rows=N loops=N)
   Planning Time: N ms
   Execution Time: N ms
   Serialize Time: N ms





Re: EBCDIC sorting as a use case for ICU rules

2023-08-24 Thread Jehan-Guillaume de Rorthais
Hi,

Sorry to chime in so late, I was waiting for some customer feedback.

On Wed, 21 Jun 2023 15:28:38 +0200
"Daniel Verite"  wrote:

> At a conference this week I was asked if ICU could be able to
> sort like EBCDIC [2].
> It turns out it has been already  asked on
> -general a few years ago [3] with no satisfactory answer at the time ,
> and that it can be implemented with rules in v16.

We worked with a customer a few months ago on this question and ended up with a
procedure to build a new locale/collation for glibc and load it into PostgreSQL
[1].

Our customer built the fr_ebcdic locale file themselves, based on the EBCDIC
IBM500 code page (covering about the same characters as ISO 8859-1), and shares
it under the CC-BY license. See the attachment.

The procedure is quite simple:

1. copy this file under "/usr/share/i18n/locales/fr_ebcdic"
2. build it using "localedef -c -i fr_ebcdic -f UTF-8 fr_ebcdic.UTF-8"
3. restart your PostgreSQL instance (because of the weird localeset behavior)
4. run "pg_import_system_collations('schema')" or create the collation, e.g.:
   CREATE COLLATION fr_ebcdic (
 PROVIDER = libc,
 LC_COLLATE = fr_ebcdic.utf8,
 LC_CTYPE = fr_ebcdic.utf8
   );
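
Once created, the collation can be used like any other libc collation, either
per query or per column. A hypothetical example (table and column names are
made up):

  SELECT lastname FROM people ORDER BY lastname COLLATE fr_ebcdic;

  CREATE TABLE people_ebcdic (lastname text COLLATE fr_ebcdic);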

Now, the same question as for ICU: do we want to provide documentation about
this? Online documentation about such features is quite arid. In fact, this
could be useful in various other ways than just EBCDIC.

Regards,

[1] https://www.postgresql.org/message-id/20230209144947.1dfad6c0%40karst


fr_ebcdic
Description: Binary data


Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2023-08-10 Thread Jehan-Guillaume de Rorthais
On Thu, 3 Aug 2023 11:02:43 +0200
Alvaro Herrera  wrote:

> On 2023-Aug-03, tender wang wrote:
> 
> > I think  old "sub-FK" should not be dropped, that will be violates foreign
> > key constraint.  
> 
> Yeah, I've been playing more with the patch and it is definitely not
> doing the right things.  Just eyeballing the contents of pg_trigger and
> pg_constraint for partitions added by ALTER...ATTACH shows that the
> catalog contents are inconsistent with those added by CREATE TABLE
> PARTITION OF.

Well, as stated in my original message, the patch helps understanding the
problem and sketches a possible solution. It definitely is not complete.

After DETACHing the table, we surely need to check everything again and
recreate what is needed to keep the FK consistent.

But should we keep the FK after DETACH? Did you check the two other discussions
related to FKs, self-FKs & partitions? Unfortunately, as Tender experienced, the
more we dig the more bugs we find. Moreover, the second one might seem
unsolvable and deserves a closer look. See:

* FK broken after DETACHing referencing part
  https://www.postgresql.org/message-id/20230420144344.40744130%40karst
* Issue attaching a table to a partitioned table with an  auto-referenced
  foreign key
  https://www.postgresql.org/message-id/20230707175859.17c91538%40karst





Re: Issue attaching a table to a partitioned table with an auto-referenced foreign key

2023-07-07 Thread Jehan-Guillaume de Rorthais
So I had a look at this one... And it's a tricky one.

The current policy about DETACHing a partition is to keep/adjust all FK
referencing it or referenced by it.

However, in this exact self-referencing use case, we can have rows referencing
rows from the same partition OR another one. It seems like an
impossible issue to solve.

Here is an example based on Guillaume's scenario ([c1_old, c2_old] -> [c1, c2]):

  t1:
    t1_a:
      c1 | c1_old | c2 | c2_old
     ----+--------+----+--------
       1 |   NULL |  2 |   NULL
       1 |      1 |  3 |      2
       1 |      2 |  4 |      2
    t1_b:
      c1 | c1_old | c2 | c2_old
     ----+--------+----+--------
       2 |      1 |  2 |      3

Now, what happens with the FK when we DETACH t1_a?
  * it's not enough for t1_a to only keep a self-FK, as it references some
    rows from t1_b:
    (1, 2, 4, 2) -> (2, 1, 2, 3)
  * and t1_a cannot only keep a FK referencing t1 either, as it references some
    rows from itself:
    (1, 1, 3, 2) -> (1, NULL, 2, NULL)
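
For reference, this data set can be rebuilt on Guillaume's schema quoted below,
assuming a second partition "t1_b" FOR VALUES IN (2) in place of t1_def. A
single INSERT works whatever the row order, since the FK is checked at the end
of the statement:

  INSERT INTO t1 VALUES
    (1, NULL, 2, NULL),  -- t1_a, references nothing
    (1,    1, 3,    2),  -- t1_a, references (1, 2), i.e. the first row
    (1,    2, 4,    2),  -- t1_a, references (2, 2), i.e. the t1_b row
    (2,    1, 2,    3);  -- t1_b, references (1, 3), i.e. the second row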

I'm currently not able to think of a constraint we could build to address
this situation after the DETACH.

The only clean way out would be to drop the FK between the old partition and
the partitioned table. But then, it breaks the current policy of keeping the
constraint after DETACH. Not to mention the nightmare of detecting this
situation among the other ones.

Thoughts?

On Wed, 22 Mar 2023 11:14:19 +0100
Guillaume Lelarge  wrote:

> One last ping, hoping someone will have more time now than in january.
> 
> Perhaps my test is wrong, but I'd like to know why.
> 
> Thanks.
> 
> Le mar. 17 janv. 2023 à 16:53, Guillaume Lelarge  a
> écrit :
> 
> > Quick ping, just to make sure someone can get a look at this issue :)
> > Thanks.
> >
> >
> > Le ven. 6 janv. 2023 à 11:07, Guillaume Lelarge 
> > a écrit :
> >  
> >> Hello,
> >>
> >> One of our customers has an issue with partitions and foreign keys. He
> >> works on a v13, but the issue is also present on v15.
> >>
> >> I attach a SQL script showing the issue, and the results on 13.7, 13.9,
> >> and 15.1. But I'll explain the script here, and its behaviour on 13.9.
> >>
> >> There is one partitioned table, two partitions and a foreign key. The
> >> foreign key references the same table:
> >>
> >> create table t1 (
> >>   c1 bigint not null,
> >>   c1_old bigint null,
> >>   c2 bigint not null,
> >>   c2_old bigint null,
> >>   primary key (c1, c2)
> >>   )
> >>   partition by list (c1);
> >> create table t1_a   partition of t1 for values in (1);
> >> create table t1_def partition of t1 default;
> >> alter table t1 add foreign key (c1_old, c2_old) references t1 (c1, c2) on
> >> delete restrict on update restrict;
> >>
> >> I've a SQL function that shows me some information from pg_constraints
> >> (code of the function in the SQL script attached). Here is the result of
> >> this function after creating the table, its partitions, and its foreign
> >> key:
> >>
> >> select * from show_constraints();
> >>         conname         |   t    |  tref  |       coparent
> >> ------------------------+--------+--------+------------------------
> >>  t1_c1_old_c2_old_fkey  | t1     | t1     |
> >>  t1_c1_old_c2_old_fkey  | t1_a   | t1     | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey  | t1_def | t1     | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey1 | t1     | t1_a   | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey2 | t1     | t1_def | t1_c1_old_c2_old_fkey
> >> (5 rows)
> >>
> >> The constraint works great :
> >>
> >> insert into t1 values(1, NULL, 2, NULL);
> >> insert into t1 values(2, 1,2, 2);
> >> delete from t1 where c1 = 1;
> >> psql:ticket15010_v3.sql:34: ERROR:  update or delete on table "t1_a"
> >> violates foreign key constraint "t1_c1_old_c2_old_fkey1" on table "t1"
> >> DETAIL:  Key (c1, c2)=(1, 2) is still referenced from table "t1".
> >>
> >> This error is normal since the line I want to delete is referenced on the
> >> other line.
> >>
> >> If I try to detach the partition, it also gives me an error.
> >>
> >> alter table t1 detach partition t1_a;
> >> psql:ticket15010_v3.sql:36: ERROR:  removing partition "t1_a" violates
> >> foreign key constraint "t1_c1_old_c2_old_fkey1"
> >> DETAIL:  Key (c1_old, c2_old)=(1, 2) is still referenced from table "t1".
> >>
> >> Sounds good to me too (well, I'd like it to be smarter and find that the
> >> constraint is still good after the detach, but I can understand why it
> >> won't allow it).
> >>
> >> The pg_constraint didn't change of course:
> >>
> >> select * from show_constraints();
> >>         conname         |   t    |  tref  |       coparent
> >> ------------------------+--------+--------+------------------------
> >>  t1_c1_old_c2_old_fkey  | t1     | t1     |
> >>  t1_c1_old_c2_old_fkey  | t1_a   | t1     | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey  | t1_def | t1     | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey1 | t1     | t1_a   | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey2 | 

[BUG] Fix DETACH with FK pointing to a partitioned table fails

2023-07-05 Thread Jehan-Guillaume de Rorthais
Hi,

(patch proposal below).

Consider a table with a FK pointing to a partitioned table.

  CREATE TABLE p ( id bigint PRIMARY KEY )
PARTITION BY list (id);
  CREATE TABLE p_1 PARTITION OF p FOR VALUES IN (1);

  CREATE TABLE r_1 (
id   bigint PRIMARY KEY,
p_id bigint NOT NULL,
FOREIGN KEY (p_id) REFERENCES p (id)
  );

Now, attach this table "r_1" as a partition of another one having the same FK:

  CREATE TABLE r (
id   bigint PRIMARY KEY,
p_id bigint NOT NULL,
FOREIGN KEY (p_id) REFERENCES p (id)
  ) PARTITION BY list (id);

  ALTER TABLE r ATTACH PARTITION r_1 FOR VALUES IN (1); 

The old sub-FKs (below, 18289) created in this table to enforce the action
triggers on the referenced partitions are not deleted when the table becomes a
partition. Because of this, we have additional and useless triggers on the
referenced partitions, and we cannot DETACH this partition on the referencing
side anymore:

  => ALTER TABLE r DETACH PARTITION r_1;
  ERROR:  could not find ON INSERT check triggers of foreign key
  constraint 18289

  => SELECT c.oid, conparentid, 
   conrelid::regclass, 
   confrelid::regclass, 
   t.tgfoid::regproc
 FROM pg_constraint c 
 JOIN pg_trigger t ON t.tgconstraint = c.oid
 WHERE confrelid::regclass = 'p_1'::regclass;
    oid  │ conparentid │ conrelid │ confrelid │         tgfoid
  ───────┼─────────────┼──────────┼───────────┼────────────────────────
   18289 │       18286 │ r_1      │ p_1       │ "RI_FKey_noaction_del"
   18289 │       18286 │ r_1      │ p_1       │ "RI_FKey_noaction_upd"
   18302 │       18299 │ r        │ p_1       │ "RI_FKey_noaction_del"
   18302 │       18299 │ r        │ p_1       │ "RI_FKey_noaction_upd"
  (4 rows)

The legitimate constraint and triggers here are 18302. The old sub-FK
18289, having 18286 as parent, should have gone away during the ATTACH PARTITION.

Please find attached a patch dropping the old "sub-FKs" during the ATTACH
PARTITION command and adding a regression test about it. At the very least, it
helps understanding the problem and sketches a possible solution.

Regards,
>From 5fc7997b9f9a17ee5a31f059c18e6c01fd716c04 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Wed, 5 Jul 2023 19:19:40 +0200
Subject: [PATCH v1] Remove useless parted-FK constraints when attaching a
 partition

When a FK references a partitioned table, this FK parents as many
other "parted-FK" constraints as the referencing side has
partitions. Each of these sub-constraints enforces only the
UPDATE/DELETE triggers on each referenced partition, but not the
check triggers on the referencing partition, as this is already
handled by the parent constraint. These parted-FKs are half-baked on
purpose.

However, when attaching such a standalone table to a partitioned table
having the same FK definition, all these sub-constraints were not
removed. This leaves a useless action trigger on each referenced
partition, the legitimate one already checking against the root of
the referencing partition.

Moreover, when the partition is later detached,
DetachPartitionFinalize() looks for all existing FKs on the relation
and tries to detach them from the parent triggers and constraints.
But because these parted-FKs are half-baked, calling
GetForeignKeyCheckTriggers() to later detach the check triggers
raises an ERROR:

  ERROR:  could not find ON INSERT check triggers of foreign key
  constraint N.
---
 src/backend/commands/tablecmds.c  | 57 +--
 src/test/regress/expected/foreign_key.out | 22 +
 src/test/regress/sql/foreign_key.sql  | 27 +++
 3 files changed, 103 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fce5e6f220..7c1aa5d395 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10566,9 +10566,11 @@ tryAttachPartitionForeignKey(ForeignKeyCacheInfo *fk,
 	Form_pg_constraint parentConstr;
 	HeapTuple	partcontup;
 	Form_pg_constraint partConstr;
-	ScanKeyData key;
+	ScanKeyData key[3];
 	SysScanDesc scan;
 	HeapTuple	trigtup;
+	HeapTuple	contup;
+	Relation	pg_constraint;
 	Oid			insertTriggerOid,
 updateTriggerOid;
 
@@ -10631,12 +10633,12 @@ tryAttachPartitionForeignKey(ForeignKeyCacheInfo *fk,
 	 * in the partition.  We identify them because they have our constraint
 	 * OID, as well as being on the referenced rel.
 	 */
-	ScanKeyInit(,
+	ScanKeyInit(key,
 Anum_pg_trigger_tgconstraint,
 BTEqualStrategyNumber, F_OIDEQ,
 ObjectIdGetDatum(fk->conoid));
 	scan = systable_beginscan(trigrel, TriggerConstraintIndexId, true,
-			  NULL, 1, );
+			  NULL, 1, key);
 	while ((trigtup = systable_getnext(scan)) != NULL)
 	{
 		Form_pg_trigger trgform = (Form_pg_trigger) GETSTRUCT(trigtup);
@@ -10684,6 +10686,55 @@ tryAttachPartitionForeignKey(ForeignKeyCa

Re: Memory leak from ExecutorState context?

2023-05-22 Thread Jehan-Guillaume de Rorthais
On Fri, 19 May 2023 17:23:56 +0200
Tomas Vondra  wrote:
> On 5/19/23 00:27, Melanie Plageman wrote:
> > v10 LGTM.  
> 
> Thanks!
> 
> I've pushed 0002 and 0003, after some general bikeshedding and minor
> rewording (a bit audacious, admittedly).

Thank you Melanie and Tomas for your help, review and commit!





Re: Memory leak from ExecutorState context?

2023-05-17 Thread Jehan-Guillaume de Rorthais
On Wed, 17 May 2023 13:46:35 -0400
Melanie Plageman  wrote:

> On Wed, May 17, 2023 at 07:10:08PM +0200, Jehan-Guillaume de Rorthais wrote:
> > On Tue, 16 May 2023 16:00:52 -0400
> > Melanie Plageman  wrote:  
> > > ...
> > > There are some existing indentation issues in these files, but you can
> > > leave those or put them in a separate commit.  
> > 
> > Reindented in v9.
> > 
> > I put existing indentation issues in a separate commit to keep the actual
> > patches clean from distractions.  
> 
> It is a matter of opinion, but I tend to prefer the commit with the fix
> for the existing indentation issues to be first in the patch set.

OK, moved in the v10 patch set.

> ...
> > > >  void
> > > >  ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
> > > > - BufFile **fileptr)
> > > > + BufFile **fileptr,
> > > > MemoryContext cxt) {
> > 
> > Note that I changed this to:
> > 
> >  ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
> >BufFile **fileptr, HashJoinTable hashtable) {
> > 
> > As this function must allocate BufFile buffer in spillCxt, I suppose
> > we should force it explicitly in the function code.
> > 
> > Plus, having hashtable locally could be useful for later patch to eg. fine
> > track allocated memory in spaceUsed.  
> 
> I think if you want to pass the hashtable instead of the memory context,
> I think you'll need to explain why you still pass the buffile pointer
> (because ExecHashJoinSaveTuple() is called for inner and outer batch
> files) instead of getting it from the hashtable's arrays of buffile
> pointers.

Comment added in v10

> > @@ -1310,22 +1311,38 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
> ...
> 
> Suggested small edit to the second paragraph:
> 
>   During the build phase, buffered files are created for inner batches.
>   Each batch's buffered file is closed (and its buffer freed) after the
>   batch is loaded into memory during the outer side scan. Therefore, it
>   is necessary to allocate the batch file buffer in a memory context
>   which outlives the batch itself.

Changed.

> I'd also mention the reason for passing the buffile pointer above the
> function.

Added.

Regards,
>From 25ef004b31e07956a018dd08170ee1588b7719b8 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Wed, 17 May 2023 19:01:06 +0200
Subject: [PATCH v10 1/3] Run pgindent on nodeHash.c and nodeHashjoin.c

---
 src/backend/executor/nodeHash.c | 6 +++---
 src/backend/executor/nodeHashjoin.c | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 5fd1c5553b..ac3eb32d97 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1327,7 +1327,7 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 			else
 			{
 size_t		tuple_size =
-MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+	MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 
 /* It belongs in a later batch. */
 hashtable->batches[batchno].estimated_size += tuple_size;
@@ -1369,7 +1369,7 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 	for (i = 1; i < old_nbatch; ++i)
 	{
 		ParallelHashJoinBatch *shared =
-		NthParallelHashJoinBatch(old_batches, i);
+			NthParallelHashJoinBatch(old_batches, i);
 
 		old_inner_tuples[i] = sts_attach(ParallelHashJoinBatchInner(shared),
 		 ParallelWorkerNumber + 1,
@@ -3317,7 +3317,7 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
 			while (DsaPointerIsValid(batch->chunks))
 			{
 HashMemoryChunk chunk =
-dsa_get_address(hashtable->area, batch->chunks);
+	dsa_get_address(hashtable->area, batch->chunks);
 dsa_pointer next = chunk->next.shared;
 
 dsa_free(hashtable->area, batch->chunks);
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 0a3f32f731..b29a8ff48b 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1170,7 +1170,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 		{
 			SharedTuplestoreAccessor *inner_tuples;
 			Barrier*batch_barrier =
-			>batches[batchno].shared->batch_barrier;
+>batches[batchno].shared->batch_barrier;
 
 			switch (BarrierAttach(batch_barrier))
 			{
@@ -1558,7 +1558,7 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 {
 	int			plan_node_id = state->js.ps.plan->plan_node_id;
 	ParallelHashJoinState *pstate =
-	shm_toc_lookup(pcxt->toc, plan_node_id, false);
+		shm_toc_lookup(

Re: Memory leak from ExecutorState context?

2023-05-17 Thread Jehan-Guillaume de Rorthais
On Tue, 16 May 2023 16:00:52 -0400
Melanie Plageman  wrote:

> On Tue, May 16, 2023 at 04:00:51PM +0200, Jehan-Guillaume de Rorthais wrote:
> 
> > From e5ecd466172b7bae2f1be294c1a5e70ce2b43ed8 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Thu, 30 Apr 2020 07:16:28 -0700
> > Subject: [PATCH v8 1/3] Describe hash join implementation
> > 
> > This is just a draft to spark conversation on what a good comment might
> > be like in this file on how the hybrid hash join algorithm is
> > implemented in Postgres. I'm pretty sure this is the accepted term for
> > this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join  
> 
> I recommend changing the commit message to something like this:
> 
>   Describe hash join implementation
> 
>   Add details to the block comment in nodeHashjoin.c describing the
>   hybrid hash join algorithm at a high level.
> 
>   Author: Melanie Plageman 
>   Author: Jehan-Guillaume de Rorthais 
>   Reviewed-by: Tomas Vondra 
>   Discussion: https://postgr.es/m/20230516160051.4267a800%40karst

Done, but I'm assigning myself as a reviewer, as I don't remember having authored
anything in this but the reference to the hybrid hash join page, which is
questionable according to Tomas :)

> > diff --git a/src/backend/executor/nodeHashjoin.c
> > b/src/backend/executor/nodeHashjoin.c index 0a3f32f731..93a78f6f74 100644
> > --- a/src/backend/executor/nodeHashjoin.c
> > ...
> > + * TODO: should we discuss that tuples can only spill forward?  
> 
> I would just cut this for now since we haven't started on an agreed-upon
> wording.

Removed in v9.

> > From 309ad354b7a9e4dfa01b2985bd883829f5e0eba0 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais 
> > Date: Tue, 16 May 2023 15:42:14 +0200
> > Subject: [PATCH v8 2/3] Allocate hash batches related BufFile in a dedicated
> >  context  
> 
> Here is a draft commit message for the second patch:
> 
> ...

Thanks. Adopted with some minor rewording... hopefully it's OK.

> I recommend editing and adding links to the various discussions on this
> topic from your research.

Done in v9.

> As for the patch itself, I noticed that there are few things needing
> pgindenting.
>
> I usually do the following to run pgindent (in case you haven't done
> this recently).
> 
> ...

Thank you for your recipe.

> There are some existing indentation issues in these files, but you can
> leave those or put them in a separate commit.

Reindented in v9.

I put existing indentation issues in a separate commit to keep the actual
patches clean from distractions.

> ...
> Add a period at the end of this comment.
> 
> > +   /*
> > +* Use hash join spill memory context to allocate accessors and
> > their
> > +* buffers
> > +*/

Fixed in v9.

> Add a period at the end of this comment.
> 
> > +   /* Use hash join spill memory context to allocate accessors */

Fixed in v9.

> > diff --git a/src/backend/executor/nodeHashjoin.c
> > b/src/backend/executor/nodeHashjoin.c @@ -1313,21 +1314,30 @@
> > ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
> >   * The data recorded in the file for each tuple is its hash value,
> >   * then the tuple in MinimalTuple format.
> >   *
> > - * Note: it is important always to call this in the regular executor
> > - * context, not in a shorter-lived context; else the temp file buffers
> > - * will get messed up.
> > + * If this is the first write to the batch file, this function first
> > + * create it.  The associated BufFile buffer is allocated in the given
> > + * context.  It is important to always give the HashSpillContext
> > + * context.  First to avoid a shorter-lived context, else the temp file
> > + * buffers will get messed up.  Second, for a better accounting of the
> > + * spilling memory consumption.
> > + *
> >   */  
> 
> Here is my suggested wording fot this block comment:
> 
> The batch file is lazily created. If this is the first tuple written to
> this batch, the batch file is created and its buffer is allocated in the
> given context. It is important to pass in a memory context which will
> live for the entirety of the lifespan of the batch.

Reworded. The context must actually survive the batch itself, not just live
during the lifespan of the batch.

> >  void
> >  ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
> > - BufFile **fileptr)
> > + BufFile **fileptr, MemoryContext
> > cxt) {  

Note that I changed this to:

 ExecHashJoinSaveTuple(MinimalTuple tuple, u

Re: Memory leak from ExecutorState context?

2023-05-16 Thread Jehan-Guillaume de Rorthais
Hi,

On Tue, 16 May 2023 12:01:51 +0200 Tomas Vondra 
wrote:

> On 5/16/23 00:15, Jehan-Guillaume de Rorthais wrote:
> > On Sat, 13 May 2023 23:47:53 +0200
> > Tomas Vondra  wrote:
> ...
> >> I'm not really sure about calling this "hybrid hash-join". What does it
> >> even mean? The new comments simply describe the existing batching, and
> >> how we increment number of batches, etc.  
> > 
> > Unless I'm wrong, I believed it comes from the "Hybrid hash join algorithm"
> > as described here (+ see pdf in this page ref):
> > 
> >   https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
> > 
> > I added the ref in the v7 documentation to avoid futur confusion, is it ok?
> 
> Ah, I see. I'd just leave out the "hybrid" entirely. We've lived without
> it until now, we know what the implementation does ...

I changed the title, but kept the reference. There are still two other uses
of "hybrid hash join algorithm" in function and code comments. Keeping the ref
in this header doesn't cost much and helps newcomers.

> >> 0002
> >> ...
> >> The modified comment in hashjoin.h has a some alignment issues.  
> > 
> > I see no alignment issue. I suppose what bother you might be the spaces
> > before spillCxt and batchCxt to show they are childs of hashCxt? Should I
> > remove them?
> 
> It didn't occur to me this is intentional to show the contexts are
> children of hashCxt, so maybe it's not a good way to document that. I'd
> remove it, and if you think it's something worth mentioning, I'd add an
> explicit comment.

Changed.

Thanks,
>From e5ecd466172b7bae2f1be294c1a5e70ce2b43ed8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 30 Apr 2020 07:16:28 -0700
Subject: [PATCH v8 1/3] Describe hash join implementation

This is just a draft to spark conversation on what a good comment might
be like in this file on how the hybrid hash join algorithm is
implemented in Postgres. I'm pretty sure this is the accepted term for
this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
---
 src/backend/executor/nodeHashjoin.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 0a3f32f731..93a78f6f74 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -10,6 +10,47 @@
  * IDENTIFICATION
  *	  src/backend/executor/nodeHashjoin.c
  *
+ * HASH JOIN
+ *
+ * This is based on the "hybrid hash join" algorithm described shortly in the
+ * following page, and in length in the pdf in page's notes:
+ *
+ *   https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
+ *
+ * If the inner side tuples of a hash join do not fit in memory, the hash join
+ * can be executed in multiple batches.
+ *
+ * If the statistics on the inner side relation are accurate, planner chooses a
+ * multi-batch strategy and estimates the number of batches.
+ *
+ * The query executor measures the real size of the hashtable and increases the
+ * number of batches if the hashtable grows too large.
+ *
+ * The number of batches is always a power of two, so an increase in the number
+ * of batches doubles it.
+ *
+ * Serial hash join measures batch size lazily -- waiting until it is loading a
+ * batch to determine if it will fit in memory. While inserting tuples into the
+ * hashtable, serial hash join will, if that tuple were to exceed work_mem,
+ * dump out the hashtable and reassign them either to other batch files or the
+ * current batch resident in the hashtable.
+ *
+ * Parallel hash join, on the other hand, completes all changes to the number
+ * of batches during the build phase. If it increases the number of batches, it
+ * dumps out all the tuples from all batches and reassigns them to entirely new
+ * batch files. Then it checks every batch to ensure it will fit in the space
+ * budget for the query.
+ *
+ * In both parallel and serial hash join, the executor currently makes a best
+ * effort. If a particular batch will not fit in memory, it tries doubling the
+ * number of batches. If after a batch increase, there is a batch which
+ * retained all or none of its tuples, the executor disables growth in the
+ * number of batches globally. After growth is disabled, all batches that would
+ * have previously triggered an increase in the number of batches instead
+ * exceed the space allowed.
+ *
+ * TODO: should we discuss that tuples can only spill forward?
+ *
  * PARALLELISM
  *
  * Hash joins can participate in parallel query execution in several ways.  A
-- 
2.40.1

>From 309ad354b7a9e4dfa01b2985bd883829f5e0eba0 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Tue, 16 May 2023 15:42:14 +0200
Subject: [PATCH v8 2/3] Allocate hash batches related BufFile in

Re: Memory leak from ExecutorState context?

2023-05-15 Thread Jehan-Guillaume de Rorthais
Hi,

Thanks for your review!

On Sat, 13 May 2023 23:47:53 +0200
Tomas Vondra  wrote:

> Thanks for the patches. A couple mostly minor comments, to complement
> Melanie's review:
> 
> 0001
> 
> I'm not really sure about calling this "hybrid hash-join". What does it
> even mean? The new comments simply describe the existing batching, and
> how we increment number of batches, etc.

Unless I'm wrong, I believe it comes from the "hybrid hash join algorithm" as
described here (+ see the pdf in this page's refs):

  https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join

I added the ref in the v7 documentation to avoid future confusion, is it OK?

> 0002
> 
> I wouldn't call the new ExecHashJoinSaveTuple parameter spillcxt but
> just something generic (e.g. cxt). Yes, we're passing spillCxt, but from
> the function's POV it's just a pointer.

Changed in v7.

> I also wouldn't move the ExecHashJoinSaveTuple comment inside - it just
> needs to be reworded that we're expecting the context to be with the
> right lifespan. The function comment is the right place to document such
> expectations, people are less likely to read the function body.

Moved and reworded in v7.

> The modified comment in hashjoin.h has a some alignment issues.

I see no alignment issue. I suppose what bothers you might be the spaces
before spillCxt and batchCxt, used to show they are children of hashCxt? Should I
remove them?

Regards,
>From 997a82a0ee9eda1774b5210401b2d9bf281318da Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 30 Apr 2020 07:16:28 -0700
Subject: [PATCH 1/3] Describe hybrid hash join implementation

This is just a draft to spark conversation on what a good comment might
be like in this file on how the hybrid hash join algorithm is
implemented in Postgres. I'm pretty sure this is the accepted term for
this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
---
 src/backend/executor/nodeHashjoin.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 0a3f32f731..a39164c15d 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -10,6 +10,47 @@
  * IDENTIFICATION
  *	  src/backend/executor/nodeHashjoin.c
  *
+ *   HYBRID HASH JOIN
+ *
+ *  This is based on the hybrid hash join algorythm described shortly in the
+ *  following page, and in length in the pdf in page's notes:
+ *
+ *https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
+ *
+ *  If the inner side tuples of a hash join do not fit in memory, the hash join
+ *  can be executed in multiple batches.
+ *
+ *  If the statistics on the inner side relation are accurate, planner chooses a
+ *  multi-batch strategy and estimates the number of batches.
+ *
+ *  The query executor measures the real size of the hashtable and increases the
+ *  number of batches if the hashtable grows too large.
+ *
+ *  The number of batches is always a power of two, so an increase in the number
+ *  of batches doubles it.
+ *
+ *  Serial hash join measures batch size lazily -- waiting until it is loading a
+ *  batch to determine if it will fit in memory. While inserting tuples into the
+ *  hashtable, serial hash join will, if that tuple were to exceed work_mem,
+ *  dump out the hashtable and reassign them either to other batch files or the
+ *  current batch resident in the hashtable.
+ *
+ *  Parallel hash join, on the other hand, completes all changes to the number
+ *  of batches during the build phase. If it increases the number of batches, it
+ *  dumps out all the tuples from all batches and reassigns them to entirely new
+ *  batch files. Then it checks every batch to ensure it will fit in the space
+ *  budget for the query.
+ *
+ *  In both parallel and serial hash join, the executor currently makes a best
+ *  effort. If a particular batch will not fit in memory, it tries doubling the
+ *  number of batches. If after a batch increase, there is a batch which
+ *  retained all or none of its tuples, the executor disables growth in the
+ *  number of batches globally. After growth is disabled, all batches that would
+ *  have previously triggered an increase in the number of batches instead
+ *  exceed the space allowed.
+ *
+ *  TODO: should we discuss that tuples can only spill forward?
+ *
  * PARALLELISM
  *
  * Hash joins can participate in parallel query execution in several ways.  A
-- 
2.40.1

>From 52e989d5eb6405e86e0d460dffbfaca292bc4274 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Mon, 27 Mar 2023 15:54:39 +0200
Subject: [PATCH 2/3] Allocate hash batches related BufFile in a dedicated
 context

---
 src/backend/executor/nodeHash.c   | 43 ---
 src/backend/executor/nodeHashjoin.c   | 22 
 src/backend/utils/sort/sharedtuplestore.c |  8 +
 src/include/executor

Re: Memory leak from ExecutorState context?

2023-05-15 Thread Jehan-Guillaume de Rorthais
On Sun, 14 May 2023 00:10:00 +0200
Tomas Vondra  wrote:

> On 5/12/23 23:36, Melanie Plageman wrote:
> > Thanks for continuing to work on this.
> > 
> > Are you planning to modify what is displayed for memory usage in
> > EXPLAIN ANALYZE?

Yes, I already started to work on this. Tracking spilling memory in
spaceUsed/spacePeak changes the current behavior of the serialized HJ because it
will increase the number of batches much faster, so this is a no-go for v16.

I'll try to accumulate the total allocated (used + not used) spill context memory
in the instrumentation. This is gross, but it avoids tracking the spilling memory
in its own structure entry.

> We could do that, but we can do that separately - it's a separate and
> independent improvement, I think.

+1

> Also, do you have a proposal how to change the explain output? In
> principle we already have the number of batches, so people can calculate
> the "peak" amount of memory (assuming they realize what it means).

We could add the batch memory consumption next to the number of batches, e.g.:

  Buckets: 4096 (originally 4096)  
  Batches: 32768 (originally 8192) using 256MB
  Memory Usage: 192kB

> I think the main problem with adding this info to EXPLAIN is that I'm
> not sure it's very useful in practice. I've only really heard about this
> memory explosion issue when the query dies with OOM or takes forever,
> but EXPLAIN ANALYZE requires the query to complete.

It could be useful to help admins tuning their queries realize that the current
number of batches is consuming much more memory than the join itself.

This could help them fix the issue before the OOM happens.

Regards,




Re: Unlinking Parallel Hash Join inner batch files sooner

2023-05-10 Thread Jehan-Guillaume de Rorthais
Hi,

Thanks for working on this!

On Wed, 10 May 2023 15:11:20 +1200
Thomas Munro  wrote:

> One complaint about PHJ is that it can, in rare cases, use a
> surprising amount of temporary disk space where non-parallel HJ would
> not.  When it decides that it needs to double the number of batches to
> try to fit each inner batch into memory, and then again and again
> depending on your level of bad luck, it leaves behind all the earlier
> generations of inner batch files to be cleaned up at the end of the
> query.  That's stupid.  Here's a patch to unlink them sooner, as a
> small improvement.

This patch can indeed save a decent amount of temporary disk space.

Considering its complexity is (currently?) quite low, it is worth it.

> The reason I didn't do this earlier is that sharedtuplestore.c
> continues the pre-existing tradition where each parallel process
> counts what it writes against its own temp_file_limit.  At the time I
> thought I'd need to have one process unlink all the files, but if a
> process were to unlink files that it didn't create, that accounting
> system would break.  Without some new kind of shared temp_file_limit
> mechanism that doesn't currently exist, per-process counters could go
> negative, creating free money.  In the attached patch, I realised
> something that I'd missed before: there is a safe point for each
> backend to unlink just the files that it created, and there is no way
> for a process that created files not to reach that point.

Indeed.

For what it's worth, from my new and inexperienced understanding of the
parallel mechanism, waiting for all workers to reach
WAIT_EVENT_HASH_GROW_BATCHES_REPARTITION, after re-dispatching old batches into
new ones, seems like a safe place to instruct each worker to clean up its old
temp files.

> Here's an example query that tries 8, 16 and then 32 batches on my
> machine, because reltuples is clobbered with a bogus value.

Nice!

Regards,




Re: Memory leak from ExecutorState context?

2023-05-10 Thread Jehan-Guillaume de Rorthais
Thank you for your review!

On Mon, 8 May 2023 11:56:48 -0400
Melanie Plageman  wrote:

> ...
> > 4. accessor->read_buffer is currently allocated in accessor->context as
> > well.
> > 
> >This buffer holds tuple read from the fileset. This is still a buffer,
> > but not related to any file anymore...
> > 
> > Because of 3 and 4, and the gross context granularity of SharedTuplestore
> > related-code, I'm now wondering if "fileCxt" shouldn't be renamed "bufCxt".
> 
> I think bufCxt is a potentially confusing name. The context contains
> pointers to the batch files and saying those are related to buffers is
> confusing. It also sounds like it could include any kind of buffer
> related to the hashtable or hash join.
> 
> Perhaps we could call this memory context the "spillCxt"--indicating it
> is for the memory required for spilling to permanent storage while
> executing hash joins.

"Spilling" seems fair and a large enough net to grab everything around temp
files and accessing them.

> I discuss this more in my code review below.
> 
> > From c5ed2ae2c2749af4f5058b012dc5e8a9e1529127 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais 
> > Date: Mon, 27 Mar 2023 15:54:39 +0200
> > Subject: [PATCH 2/3] Allocate hash batches related BufFile in a dedicated
> >  context  
> 
> > diff --git a/src/backend/executor/nodeHash.c
> > b/src/backend/executor/nodeHash.c index 5fd1c5553b..a4fbf29301 100644
> > --- a/src/backend/executor/nodeHash.c
> > +++ b/src/backend/executor/nodeHash.c
> > @@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List
> > *hashOperators, List *hashCollations, 
> > if (nbatch > 1 && hashtable->parallel_state == NULL)
> > {
> > +   MemoryContext oldctx;
> > +
> > /*
> >  * allocate and initialize the file arrays in hashCxt (not
> > needed for
> >  * parallel case which uses shared tuplestores instead of
> > raw files) */
> > +   oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
> > +
> > hashtable->innerBatchFile = palloc0_array(BufFile *,
> > nbatch); hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> > /* The files will not be opened until needed... */
> > /* ... but make sure we have temp tablespaces established
> > for them */  
> 
> I haven't studied it closely, but I wonder if we shouldn't switch back
> into the oldctx before calling PrepareTempTablespaces().
> PrepareTempTablespaces() is explicit about what memory context it is
> allocating in, however, I'm not sure it makes sense to call it in the
> fileCxt. If you have a reason, you should add a comment and do the same
> in ExecHashIncreaseNumBatches().

I had no reason. I caught it in ExecHashIncreaseNumBatches() while reviewing
myself, but not here.

Line moved in v6.

> > @@ -934,13 +943,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
> >hashtable, nbatch, hashtable->spaceUsed);
> >  #endif
> >  
> > -   oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
> > -
> > if (hashtable->innerBatchFile == NULL)
> > {
> > +   MemoryContext oldcxt =
> > MemoryContextSwitchTo(hashtable->fileCxt); +
> > /* we had no file arrays before */
> > hashtable->innerBatchFile = palloc0_array(BufFile *,
> > nbatch); hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> > +  
> > +   MemoryContextSwitchTo(oldcxt);
> > +
> > /* time to establish the temp tablespaces, too */
> > PrepareTempTablespaces();
> > }
> > @@ -951,8 +963,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)  
> 
> I don't see a reason to call repalloc0_array() in a different context
> than the initial palloc0_array().

Unless I'm wrong, wrapping the whole if/else block in the memory context
hashtable->fileCxt seemed useless, as repalloc() actually reallocates in the
context the original allocation occurred in. So I only wrapped the palloc() calls.

> > diff --git a/src/backend/executor/nodeHashjoin.c
> > ...
> > @@ -1308,21 +1310,27 @@ ExecParallelHashJoinNewBatch(HashJoinState
> > *hjstate)
> >   * The data recorded in the file for each tuple is its hash value,  
> 
> It doesn't sound accurate to me to say that it should be called *in* the
> HashBatchFiles context. We now switch into that context, so you can
> probably remove this comment and add a comment above the switch into the
> filecxt which explains that the temp f

Re: Memory leak from ExecutorState context?

2023-05-04 Thread Jehan-Guillaume de Rorthais
Hi,

On Fri, 21 Apr 2023 16:44:48 -0400
Melanie Plageman  wrote:

> On Fri, Apr 7, 2023 at 8:01 PM Jehan-Guillaume de Rorthais
>  wrote:
> >
> > On Fri, 31 Mar 2023 14:06:11 +0200
> > Jehan-Guillaume de Rorthais  wrote:
> >  
> > > > [...]  
> > > > >> Hmmm, not sure is WARNING is a good approach, but I don't have a
> > > > >> better idea at the moment.  
> > > > >
> > > > > I stepped it down to NOTICE and added some more infos.
> > > > >
> > > > > [...]
> > > > >   NOTICE:  Growing number of hash batch to 32768 is exhausting allowed
> > > > > memory (137494656 > 2097152)  
> > > > [...]
> > > >
> > > > OK, although NOTICE that may actually make it less useful - the default
> > > > level is WARNING, and regular users are unable to change the level. So
> > > > very few people will actually see these messages.  
> > [...]  
> > > Anyway, maybe this should be added in the light of next patch, balancing
> > > between increasing batches and allowed memory. The WARNING/LOG/NOTICE
> > > message could appears when we actually break memory rules because of some
> > > bad HJ situation.  
> >
> > So I did some more minor editions to the memory context patch and start
> > working on the balancing memory patch. Please, find in attachment the v4
> > patch set:
> >
> > * 0001-v4-Describe-hybrid-hash-join-implementation.patch:
> >   Adds documentation written by Melanie few years ago
> > * 0002-v4-Allocate-hash-batches-related-BufFile-in-a-dedicated.patch:
> >   The batches' BufFile dedicated memory context patch  
> 
> This is only a review of the code in patch 0002 (the patch to use a more
> granular memory context for multi-batch hash join batch files). I have
> not reviewed the changes to comments in detail either.

Ok.

> I think the biggest change that is needed is to implement this memory
> context usage for parallel hash join.

Indeed. 

> To implement a file buffer memory context for multi-patch parallel hash
> join [...]

Thank you for your review and pointers!

After (some days off and then) studying the parallel code, I ended up with this:

1. As explained by Melanie, the v5 patch sets accessor->context to fileCxt.

2. BufFile buffers were wrongly allocated in the ExecutorState context for
   accessor->read|write_file, from sts_puttuple and sts_parallel_scan_next, when
   they first need to work with the accessor FileSet.

   The v5 patch now allocates them in accessor->context, directly in
   sts_puttuple and sts_parallel_scan_next. This avoids wrapping each single
   call of these functions inside MemoryContextSwitchTo calls. I suppose this
   is correct as the comment about accessor->context says 
   "/* Memory context for **buffers**. */" in struct SharedTuplestoreAccessor.

3. accessor->write_chunk is currently already allocated in accessor->context.

   In consequence, this buffer is now allocated in the fileCxt instead
   of the hashCxt context. 

   This is a bit far-fetched, but I suppose this is OK as it acts as a
   second-level buffer, on top of the BufFile.

4. accessor->read_buffer is currently allocated in accessor->context as well.

   This buffer holds tuples read from the fileset. This is still a buffer, but
   not related to any file anymore...

Because of 3 and 4, and the gross context granularity of SharedTuplestore
related-code, I'm now wondering if "fileCxt" shouldn't be renamed "bufCxt".

Regards,
>From acaa94fab35ab966366a29a092780e54a68de0f7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 30 Apr 2020 07:16:28 -0700
Subject: [PATCH 1/3] Describe hybrid hash join implementation

This is just a draft to spark conversation on what a good comment might
be like in this file on how the hybrid hash join algorithm is
implemented in Postgres. I'm pretty sure this is the accepted term for
this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
---
 src/backend/executor/nodeHashjoin.c | 36 +
 1 file changed, 36 insertions(+)

diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 0a3f32f731..920d1831c2 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -10,6 +10,42 @@
  * IDENTIFICATION
  *	  src/backend/executor/nodeHashjoin.c
  *
+ *   HYBRID HASH JOIN
+ *
+ *  If the inner side tuples of a hash join do not fit in memory, the hash join
+ *  can be executed in multiple batches.
+ *
+ *  If the statistics on the inner side relation are accurate, planner chooses a
+ *  multi-batch strategy and estimates the number of batches.
+ *
+ *  The query executor measures the real siz

Re: Commitfest 2023-03 starting tomorrow!

2023-04-21 Thread Jehan-Guillaume de Rorthais
Hi,
 
After catching up with this thread, where pending bugs are listed and discussed,
I wonder if the current patches trying to lower the HashJoin memory explosion[1]
could be added to the "Older bugs affecting stable branches" list of
https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items as I think they
deserve some discussion/triage for v16?

The patch as it stands is not invasive. As I wrote previously[2], if we
can't handle such a situation better in v16, and if this patch is not
backpatchable in a minor release, then we will keep living with this bad
memory behavior for another year, maybe more. 

Thank you,

[1] https://www.postgresql.org/message-id/20230408020119.32a0841b%40karst  
[2] https://www.postgresql.org/message-id/20230320151234.38b2235e%40karst





[BUG] FK broken after DETACHing referencing part

2023-04-20 Thread Jehan-Guillaume de Rorthais
Hi,

Consider two partitioned tables with a FK between them:

  DROP TABLE IF EXISTS p, c, c_1 CASCADE;

  --
  -- Parent table + partition + data
  CREATE TABLE p (
id bigint PRIMARY KEY
  )
  PARTITION BY list (id);

  CREATE TABLE p_1 PARTITION OF p FOR VALUES IN (1);

  INSERT INTO p VALUES (1);

  
  -- Child table + partition + data
  CREATE TABLE c (
idbigint PRIMARY KEY,
p_id bigint NOT NULL,
FOREIGN KEY (p_id) REFERENCES p (id)
  )
  PARTITION BY list (id);

  CREATE TABLE c_1 PARTITION OF c FOR VALUES IN (1);

  INSERT INTO c VALUES (1,1);

After DETACHing the "c_1" partition, the current implementation makes sure it
keeps the FK inherited from its previous top table "c":

  ALTER TABLE c DETACH PARTITION c_1;
  \d c_1
  -- outputs:
  -- [...]
  -- Foreign-key constraints:
  --   "c_p_id_fkey" FOREIGN KEY (p_id) REFERENCES p(id)

However, because the referenced side is partitioned, this FK is half-baked, with
only the referencing (insert/update on c_1) side enforced, but not the
referenced side (update/delete on p):

  INSERT INTO c_1 VALUES (2,2); -- fails as EXPECTED
  -- ERROR:  insert or update on table "child_1" violates foreign key [...]

  DELETE FROM p; -- should actually fail
  -- DELETE 1

  SELECT * FROM c_1;
  --  id | parent_id 
  -- ----+-----------
  --   1 | 1
  -- (1 row)

  SELECT * FROM p;
  --  id 
  -- ----
  -- (0 rows)

When detaching "c_1", the current implementation adds two triggers to enforce
that UPDATE/DELETE on "p" are restricted if "c_1" keeps referencing the
related rows... But it forgets to add them on the partitions of "p" (here "p_1"),
where the triggers should actually fire.

To make it clear, the FK c_1 -> p constraint and triggers after DETACHING c_1
are:

  SELECT c.oid AS conid, c.conname, c.conparentid AS conparent,
 r2.relname AS pkrel,
 t.tgrelid::regclass AS tgrel,
 p.proname
  FROM pg_constraint c 
  JOIN pg_class r ON c.conrelid = r.oid
  JOIN pg_class r2 ON c.confrelid = r2.oid
  JOIN pg_trigger t ON t.tgconstraint = c.oid
  JOIN pg_proc p ON p.oid = t.tgfoid
  WHERE r.relname = 'c_1' AND r2.relname LIKE 'p%'
  ORDER BY r.relname, c.conname, t.tgrelid::regclass::text, p.proname;

  --  conid |   conname   | conparent | pkrel | tgrel |       proname
  -- -------+-------------+-----------+-------+-------+----------------------
  --  18454 | c_p_id_fkey |         0 | p     | c_1   | RI_FKey_check_ins
  --  18454 | c_p_id_fkey |         0 | p     | c_1   | RI_FKey_check_upd
  --  18454 | c_p_id_fkey |         0 | p     | p     | RI_FKey_noaction_del
  --  18454 | c_p_id_fkey |         0 | p     | p     | RI_FKey_noaction_upd

Where they should be:

  --  conid |   conname    | conparent | pkrel | tgrel |       proname
  -- -------+--------------+-----------+-------+-------+----------------------
  --  18454 | c_p_id_fkey  |         0 | p     | c_1   | RI_FKey_check_ins
  --  18454 | c_p_id_fkey  |         0 | p     | c_1   | RI_FKey_check_upd
  --  18454 | c_p_id_fkey  |         0 | p     | p     | RI_FKey_noaction_del
  --  18454 | c_p_id_fkey  |         0 | p     | p     | RI_FKey_noaction_upd
  --  NEW!! | c_p_id_fkey1 |     18454 | p_1   | p_1   | RI_FKey_noaction_del
  --  NEW!! | c_p_id_fkey1 |     18454 | p_1   | p_1   | RI_FKey_noaction_upd

I poked around DetachPartitionFinalize() to try to find a way to fix this, but
it looks like it would duplicate a bunch of code from other code paths (eg.
from CloneFkReferenced).

Instead of tweaking the existing FK, keeping the old constraint name (wouldn't
"c_1_p_id_fkey" be better after detach?) and duplicating some code around, what
about cleaning up the FK constraints from the detached table and
recreating a cleaner one using the known code path ATAddForeignKeyConstraint()?
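
In the meantime, a manual workaround on the user side seems to be to simply
drop and recreate the FK on the detached table, which goes through that code
path and should recreate the missing action triggers on the partitions of "p"
(sketch, using the names from the example above):

  ALTER TABLE c_1 DROP CONSTRAINT c_p_id_fkey;
  ALTER TABLE c_1 ADD FOREIGN KEY (p_id) REFERENCES p (id);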

Thanks for reading me down to here!

++




Re: OOM in hash join

2023-04-14 Thread Jehan-Guillaume de Rorthais
On Fri, 14 Apr 2023 13:21:05 +0200
Matthias van de Meent  wrote:

> On Fri, 14 Apr 2023 at 12:59, Konstantin Knizhnik  wrote:
> >
> > Hi hackers,
> >
> > Too small value of work_mem cause memory overflow in parallel hash join
> > because of too much number batches.
> > There is the plan:  
> 
> [...]
> 
> > There is still some gap between size reported by memory context sump and
> > actual size of backend.
> > But is seems to be obvious, that trying to fit in work_mem
> > sharedtuplestore creates so much batches, that  them consume much more
> > memory than work_mem.

Indeed. The memory consumed by batches is not accounted for, and the consumption
reported in EXPLAIN ANALYZE is wrong.

Would you be able to test the latest patchset posted [1]? This does not fix
the work_mem overflow, but it helps to keep the number of batches
balanced and acceptable. Any feedback, comment or review would be useful.
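
As a side note, on v14 and later, one way to see the real consumption while the
query is running is to dump the backend's memory contexts to the server log
from another session (the pid below is a placeholder):

  SELECT pg_log_backend_memory_contexts(12345);  -- pid of the backend running the join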

[1] 
https://www.postgresql.org/message-id/flat/20230408020119.32a0841b%40karst#616c1f41fcc10e8f89d41e8e5693618c

Regards,




Re: Memory leak from ExecutorState context?

2023-04-11 Thread Jehan-Guillaume de Rorthais
On Sat, 8 Apr 2023 02:01:19 +0200
Jehan-Guillaume de Rorthais  wrote:

> On Fri, 31 Mar 2023 14:06:11 +0200
> Jehan-Guillaume de Rorthais  wrote:
> 
>  [...]  
> 
> After rebasing Tomas' memory balancing patch, I did some memory measures
> to answer some of my questions. Please, find in attachment the resulting
> charts "HJ-HEAD.png" and "balancing-v3.png" to compare memory consumption
> between HEAD and Tomas' patch. They shows an alternance of numbers
> before/after calling ExecHashIncreaseNumBatches (see the debug patch). I
> didn't try to find the exact last total peak of memory consumption during the
> join phase and before all the BufFiles are destroyed. So the last number
> might be underestimated.

I did some more analysis of the total memory consumption in filecxt for HEAD
and the v3 and v4 patches. My previous debug numbers only print memory metrics
during batch increments or hash table destruction. That means:

* for HEAD: we miss the batches consumed during the outer scan
* for v3: it adds twice nbatch to spaceUsed, which is a rough estimation
* for v4: batches are tracked in spaceUsed, so they are reflected in spacePeak

Using a breakpoint in ExecHashJoinSaveTuple to print "filecxt->mem_allocated"
from there, here is the maximum memory allocated in the BufFile context for
each branch:

           batches   max bufFiles   total    spaceAllowed rise
  HEAD     16384     199966960      ~194MB
  v3       4096      65419456       ~78MB
  v4(*3)   2048      34273280       48MB     nbatch*sizeof(PGAlignedBlock)*3
  v4(*4)   1024      17170160       60.6MB   nbatch*sizeof(PGAlignedBlock)*4
  v4(*5)   2048      34273280       42.5MB   nbatch*sizeof(PGAlignedBlock)*5

It seems accounting for bufFile in spaceUsed allows better memory balancing and
management. The precise factor to raise spaceAllowed by is yet to be defined.
*3 or *4 looks good, but this is based on a single artificial test case.

Also, note that HEAD is currently reporting ~4MB of memory usage, which is far
from reality. So even if we don't commit the memory balancing patch in v16,
maybe we could account for filecxt in spaceUsed as a bugfix?
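
To be concrete, the kind of accounting I have in mind would look roughly like
this (a minimal sketch only, not the actual patch; it just reuses the
spaceUsed/spacePeak fields and the nbatch*sizeof(PGAlignedBlock) factor
discussed above):

  /*
   * Sketch only: charge the 8kB buffer kept by each inner and outer batch
   * file against the hash join memory accounting, e.g. when the number of
   * batches is increased in ExecHashIncreaseNumBatches().
   */
  hashtable->spaceUsed += (Size) hashtable->nbatch * sizeof(PGAlignedBlock) * 2;
  if (hashtable->spaceUsed > hashtable->spacePeak)
      hashtable->spacePeak = hashtable->spaceUsed;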

Regards,




Re: Memory leak from ExecutorState context?

2023-03-31 Thread Jehan-Guillaume de Rorthais
On Tue, 28 Mar 2023 17:25:49 +0200
Tomas Vondra  wrote:
...
>  * Note that BufFile structs are allocated with palloc(), and therefore
>  * will go away automatically at query/transaction end.  Since the
> underlying
>  * virtual Files are made with OpenTemporaryFile, all resources for
>  * the file are certain to be cleaned up even if processing is aborted
>  * by ereport(ERROR).  The data structures required are made in the
>  * palloc context that was current when the BufFile was created, and
>  * any external resources such as temp files are owned by the ResourceOwner
>  * that was current at that time.
> 
> which I take as confirmation that it's legal to allocate BufFile in any
> memory context, and that cleanup is handled by the cache in fd.c.

OK. I just over-interpreted the comments and was overly cautious.

> [...]
> >> Hmmm, not sure is WARNING is a good approach, but I don't have a better
> >> idea at the moment.  
> > 
> > I stepped it down to NOTICE and added some more infos.
> > 
> > [...]
> >   NOTICE:  Growing number of hash batch to 32768 is exhausting allowed
> > memory (137494656 > 2097152)
> [...]
> 
> OK, although NOTICE that may actually make it less useful - the default
> level is WARNING, and regular users are unable to change the level. So
> very few people will actually see these messages.

The main purpose of NOTICE was to notify the user/dev, as
client_min_messages=notice by default.

But while playing with it, I wonder if the message is in the right place anyway.
It is probably pessimistic, as it shows memory consumption when increasing the
number of batches, but before the actual BufFiles are (lazily) allocated. The
message should probably pop up a bit sooner, with better numbers.

Anyway, maybe this should be added in the light of the next patch, balancing
between increasing batches and allowed memory. The WARNING/LOG/NOTICE message
could appear when we actually break the memory rules because of some bad HJ
situation.

Another way to expose the bad memory consumption would be to add memory info to
the HJ node in the EXPLAIN output, or maybe collect some min/max/mean for
pg_stat_statements, but I suspect tracking memory spikes per query is another
challenge altogether...
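
For the EXPLAIN idea, something along these lines in show_hash_info() could be
enough (a rough, untested sketch only: fileCxt is the dedicated context from
the attached patch, and the plumbing through HashInstrumentation is omitted):

  /* rough sketch: report how much memory the batch files keep around */
  if (hashtable->fileCxt != NULL)
      ExplainPropertyInteger("Batch Files Memory", "kB",
                             hashtable->fileCxt->mem_allocated / 1024, es);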

In the meantime, find in attachment v3 of the patch with debug and NOTICE
messages removed. Given the same plan from my previous email, here is the
memory contexts close to the query end:

 ExecutorState: 32768 total in 3 blocks; 15512 free (6 chunks); 17256 used
   HashTableContext: 8192 total in 1 blocks; 7720 free (0 chunks); 472 used
     HashBatchFiles: 28605680 total in 3256 blocks; 970128 free (38180 chunks); 27635552 used
     HashBatchContext: 960544 total in 23 blocks; 7928 free (0 chunks); 952616 used


Regards,
>From 6814994fa0576a8ba6458412ac5f944135fc3813 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Mon, 27 Mar 2023 15:54:39 +0200
Subject: [PATCH] Allocate hash batches related BufFile in a dedicated context

---
 src/backend/executor/nodeHash.c | 27 ++-
 src/backend/executor/nodeHashjoin.c | 18 +-
 src/include/executor/hashjoin.h | 15 ---
 src/include/executor/nodeHashjoin.h |  2 +-
 4 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 748c9b0024..0c0d5b4a3c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -484,7 +484,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	 *
 	 * The hashtable control block is just palloc'd from the executor's
 	 * per-query memory context.  Everything else should be kept inside the
-	 * subsidiary hashCxt or batchCxt.
+	 * subsidiary hashCxt, batchCxt or fileCxt.
 	 */
 	hashtable = palloc_object(HashJoinTableData);
 	hashtable->nbuckets = nbuckets;
@@ -538,6 +538,10 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 "HashBatchContext",
 ALLOCSET_DEFAULT_SIZES);
 
+	hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
+ "HashBatchFiles",
+ ALLOCSET_DEFAULT_SIZES);
+
 	/* Allocate data that will live for the life of the hashjoin */
 
 	oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
@@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 
 	if (nbatch > 1 && hashtable->parallel_state == NULL)
 	{
+		MemoryContext oldctx;
+
 		/*
 		 * allocate and initialize the file arrays in hashCxt (not needed for
 		 * parallel case which uses shared tuplestores instead of raw files)
 		 */
+		oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
+
 		hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
 		hashtable->outer

Re: Memory leak from ExecutorState context?

2023-03-28 Thread Jehan-Guillaume de Rorthais
Hi,

Sorry for the late answer, I was reviewing the first patch and it took me some
time to study and dig around.

On Thu, 23 Mar 2023 08:07:04 -0400
Melanie Plageman  wrote:

> On Fri, Mar 10, 2023 at 1:51 PM Jehan-Guillaume de Rorthais
>  wrote:
> > > So I guess the best thing would be to go through these threads, see what
> > > the status is, restart the discussion and propose what to do. If you do
> > > that, I'm happy to rebase the patches, and maybe see if I could improve
> > > them in some way.  
> >
> > OK! It took me some time, but I did it. I'll try to sum up the situation as
> > simply as possible.  
> 
> Wow, so many memories!
> 
> I'm excited that someone looked at this old work (though it is sad that
> a customer faced this issue). And, Jehan, I really appreciate your great
> summarization of all these threads. This will be a useful reference.

Thank you!

> > 1. "move BufFile stuff into separate context"
> > [...]
> >I suppose this simple one has been forgotten in the fog of all other
> >discussions. Also, this probably worth to be backpatched.  
> 
> I agree with Jehan-Guillaume and Tomas that this seems fine to commit
> alone.

This is a WIP.

> > 2. "account for size of BatchFile structure in hashJoin"
> > [...] 
> 
> I think I would have to see a modern version of a patch which does this
> to assess if it makes sense. But, I probably still agree with 2019
> Melanie :)

I volunteer to work on this after the memory context patch, unless someone
grabs it in the meantime.

> [...]
> On Mon, Mar 20, 2023 at 10:12 AM Jehan-Guillaume de Rorthais
>  wrote:
> > BNJL and/or other considerations are for 17 or even after. In the meantime,
> > Melanie, who authored BNLJ, +1 the balancing patch as it can coexists with
> > other discussed solutions. No one down vote since then. Melanie, what is
> > your opinion today on this patch? Did you change your mind as you worked
> > for many months on BNLJ since then?  
> 
> So, in order to avoid deadlock, my design of adaptive hash join/block
> nested loop hash join required a new parallelism concept not yet in
> Postgres at the time -- the idea of a lone worker remaining around to do
> work when others have left.
> 
> See: BarrierArriveAndDetachExceptLast()
> introduced in 7888b09994
> 
> Thomas Munro had suggested we needed to battle test this concept in a
> more straightforward feature first, so I implemented parallel full outer
> hash join and parallel right outer hash join with it.
> 
> https://commitfest.postgresql.org/42/2903/
> 
> This has been stalled ready-for-committer for two years. It happened to
> change timing such that it made an existing rarely hit parallel hash
> join bug more likely to be hit. Thomas recently committed our fix for
> this in 8d578b9b2e37a4d (last week). It is my great hope that parallel
> full outer hash join goes in before the 16 feature freeze.

This is really interesting to follow. I kinda feel/remember how this could
be useful for your BNLJ patch. It's good to see things are moving, step by
step.

Thanks for the pointers.

> If it does, I think it could make sense to try and find committable
> smaller pieces of the adaptive hash join work. As it is today, parallel
> hash join does not respect work_mem, and, in some sense, is a bit broken.
> 
> I would be happy to work on this feature again, or, if you were
> interested in picking it up, to provide review and any help I can if for
> you to work on it.

I don't think I would be able to pick up such a large and complex patch. But
I'm interested in helping, testing and reviewing, as far as I can!

Regards,




Re: Memory leak from ExecutorState context?

2023-03-28 Thread Jehan-Guillaume de Rorthais
On Tue, 28 Mar 2023 00:43:34 +0200
Tomas Vondra  wrote:

> On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote:
> > Please, find in attachment a patch to allocate bufFiles in a dedicated
> > context. I picked up your patch, backpatch'd it, went through it and did
> > some minor changes to it. I have some comment/questions thought.
> > 
> >   1. I'm not sure why we must allocate the "HashBatchFiles" new context
> >   under ExecutorState and not under hashtable->hashCxt?
> > 
> >   The only references I could find was in hashjoin.h:30:
> > 
> >/* [...]
> > * [...] (Exception: data associated with the temp files lives in the
> > * per-query context too, since we always call buffile.c in that
> > context.)
> > 
> >   And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this
> >   original comment in the patch):
> > 
> >/* [...]
> > * Note: it is important always to call this in the regular executor
> > * context, not in a shorter-lived context; else the temp file buffers
> > * will get messed up.
> > 
> > 
> >   But these are not explanation of why BufFile related allocations must be
> > under a per-query context. 
> >   
> 
> Doesn't that simply describe the current (unpatched) behavior where
> BufFile is allocated in the per-query context? 

I wasn't sure. The first quote from hashjoin.h seems to describe a stronger
rule about «**always** call buffile.c in per-query context». But maybe it ought
to be «always call buffile.c from one of the sub-query contexts»? I assume the
aim is to enforce temp file removal on query end/error?

> I mean, the current code calls BufFileCreateTemp() without switching the
> context, so it's in the ExecutorState. But with the patch it very clearly is
> not.
> 
> And I'm pretty sure the patch should do
> 
> hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
>  "HashBatchFiles",
>  ALLOCSET_DEFAULT_SIZES);
> 
> and it'd still work. Or why do you think we *must* allocate it under
> ExecutorState?

That was actually my very first patch and it indeed worked. But I was confused
about the previously quoted code comments. That's why I kept your original code
and decided to raise the discussion here.

Fixed in new patch in attachment.

> FWIW The comment in hashjoin.h needs updating to reflect the change.

Done in the last patch. Is my rewording accurate?

> >   2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context
> > switch seems fragile as it could be forgotten in futur code path/changes.
> > So I added an Assert() in the function to make sure the current memory
> > context is "HashBatchFiles" as expected.
> >   Another way to tie this up might be to pass the memory context as
> > argument to the function.
> >   ... Or maybe I'm over precautionary.
> >   
> 
> I'm not sure I'd call that fragile, we have plenty other code that
> expects the memory context to be set correctly. Not sure about the
> assert, but we don't have similar asserts anywhere else.

I mostly stuck it there to stimulate the discussion around this, as I needed
to scratch that itch.

> But I think it's just ugly and overly verbose

+1

Your patch was just a demo/debug patch at the time. It needs some cleanup now
:)

> it'd be much nicer to e.g. pass the memory context as a parameter, and do
> the switch inside.

That was a proposition in my previous mail, so I did it in the new patch. Let's
see what other reviewers think.
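
For the record, here is roughly the shape I have in mind (a sketch only: error
handling is omitted, the real function takes more arguments, and the final
signature might differ):

  /* sketch: let the callee switch context so callers cannot forget it */
  void
  ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
                        BufFile **fileptr, MemoryContext filecxt)
  {
      MemoryContext oldcxt = MemoryContextSwitchTo(filecxt);

      if (*fileptr == NULL)
          *fileptr = BufFileCreateTemp(false);

      BufFileWrite(*fileptr, &hashvalue, sizeof(uint32));
      BufFileWrite(*fileptr, tuple, tuple->t_len);

      MemoryContextSwitchTo(oldcxt);
  }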

> >   3. You wrote:
> >   
> >>> A separate BufFile memory context helps, although people won't see it
> >>> unless they attach a debugger, I think. Better than nothing, but I was
> >>> wondering if we could maybe print some warnings when the number of batch
> >>> files gets too high ...  
> > 
> >   So I added a WARNING when batches memory are exhausting the memory size
> >   allowed.
> > 
> >+   if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
> >+   elog(WARNING, "Growing number of hash batch is exhausting
> > memory");
> > 
> >   This is repeated on each call of ExecHashIncreaseNumBatches when BufFile
> >   overflows the memory budget. I realize now I should probably add the
> > memory limit, the number of current batch and their memory consumption.
> >   The message is probably too cryptic for a user. It could probably be
> >   reworded, but some doc or additionnal hint around this message might help.
> >   
> 
> Hmmm, not sure is WARN

Re: Memory leak from ExecutorState context?

2023-03-27 Thread Jehan-Guillaume de Rorthais
Hi,

On Mon, 20 Mar 2023 15:12:34 +0100
Jehan-Guillaume de Rorthais  wrote:

> On Mon, 20 Mar 2023 09:32:17 +0100
> Tomas Vondra  wrote:
> 
> > >> * Patch 1 could be rebased/applied/backpatched
> > > 
> > > Would it help if I rebase Patch 1 ("move BufFile stuff into separate
> > > context")?   
> > 
> > Yeah, I think this is something we'd want to do. It doesn't change the
> > behavior, but it makes it easier to track the memory consumption etc.  
> 
> Will do this week.

Please find attached a patch to allocate bufFiles in a dedicated context.
I picked up your patch, backpatched it, went through it and did some minor
changes to it. I have some comments/questions though.

  1. I'm not sure why we must allocate the new "HashBatchFiles" context
  under ExecutorState and not under hashtable->hashCxt?

  The only references I could find were in hashjoin.h:30:

   /* [...]
* [...] (Exception: data associated with the temp files lives in the
* per-query context too, since we always call buffile.c in that context.)

  And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this
  original comment in the patch):

   /* [...]
* Note: it is important always to call this in the regular executor
* context, not in a shorter-lived context; else the temp file buffers
* will get messed up.


  But these are not explanations of why BufFile-related allocations must be
  under a per-query context.


  2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context switch
  seems fragile, as it could be forgotten in future code paths/changes. So I
  added an Assert() in the function to make sure the current memory context is
  "HashBatchFiles" as expected.
  Another way to tie this up might be to pass the memory context as an argument
  to the function.
  ... Or maybe I'm being overly cautious.


  3. You wrote:

>> A separate BufFile memory context helps, although people won't see it
>> unless they attach a debugger, I think. Better than nothing, but I was
>> wondering if we could maybe print some warnings when the number of batch
>> files gets too high ...

  So I added a WARNING when the batches' memory is exhausting the allowed
  memory size.

   +   if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
   +   elog(WARNING, "Growing number of hash batch is exhausting memory");

  This is repeated on each call of ExecHashIncreaseNumBatches when the BufFiles
  overflow the memory budget. I realize now I should probably add the memory
  limit, the current number of batches and their memory consumption (see the
  sketch after this list). The message is probably too cryptic for a user. It
  could probably be reworded, but some doc or an additional hint around this
  message might help.

  4. I left the debug messages for some more review rounds
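
  As mentioned in 3., a more informative version of the message could look
  something like this (a sketch only; wording, fields and hint are
  illustrative):

   ereport(WARNING,
           (errmsg("hash join batch files use %zu bytes, more than the %zu bytes allowed",
                   (size_t) hashtable->fileCxt->mem_allocated,
                   (size_t) hashtable->spaceAllowed),
            errdetail("The hash join currently uses %d batches.",
                      hashtable->nbatch),
            errhint("Consider increasing work_mem.")));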

Regards,
>From a0dd541ad4e47735fc388d0c553e935f0956f254 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Mon, 27 Mar 2023 15:54:39 +0200
Subject: [PATCH] Allocate hash batches related BufFile in a dedicated context

---
 src/backend/executor/nodeHash.c | 39 +++--
 src/backend/executor/nodeHashjoin.c | 14 ---
 src/include/executor/hashjoin.h |  1 +
 3 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 748c9b0024..3f1ae9755e 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -484,7 +484,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	 *
 	 * The hashtable control block is just palloc'd from the executor's
 	 * per-query memory context.  Everything else should be kept inside the
-	 * subsidiary hashCxt or batchCxt.
+	 * subsidiary hashCxt, batchCxt or fileCxt.
 	 */
 	hashtable = palloc_object(HashJoinTableData);
 	hashtable->nbuckets = nbuckets;
@@ -538,6 +538,10 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 "HashBatchContext",
 ALLOCSET_DEFAULT_SIZES);
 
+	hashtable->fileCxt = AllocSetContextCreate(CurrentMemoryContext,
+ "HashBatchFiles",
+ ALLOCSET_DEFAULT_SIZES);
+
 	/* Allocate data that will live for the life of the hashjoin */
 
 	oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
@@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 
 	if (nbatch > 1 && hashtable->parallel_state == NULL)
 	{
+		MemoryContext oldctx;
+
 		/*
 		 * allocate and initialize the file arrays in hashCxt (not needed for
 		 * parallel case which uses shared tuplestores instead of raw files)
 		 */
+		oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
+
 		hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
 		hashtable->outerBatchFile = palloc0_ar

Re: Memory leak from ExecutorState context?

2023-03-20 Thread Jehan-Guillaume de Rorthais
On Mon, 20 Mar 2023 09:32:17 +0100
Tomas Vondra  wrote:

> >> * Patch 1 could be rebased/applied/backpatched  
> > 
> > Would it help if I rebase Patch 1 ("move BufFile stuff into separate
> > context")? 
> 
> Yeah, I think this is something we'd want to do. It doesn't change the
> behavior, but it makes it easier to track the memory consumption etc.

Will do this week.

> I think focusing on the backpatchability is not particularly good
> approach. It's probably be better to fist check if there's any hope of
> restarting the work on BNLJ, which seems like a "proper" solution for
> the future.

Backpatching is worth some consideration. Balancing is almost there and could
help in 16. As time flies, it might well miss release 16, but maybe it could be
backpatched in 16.1 or a later minor release? But what if it is not eligible
for backpatching? We would stall another year without having at least an
imperfect stop-gap solution (sorry for picking up your words ;)).

BNLJ and/or other considerations are for 17 or even after. In the meantime,
Melanie, who authored BNLJ, +1'd the balancing patch as it can coexist with
other discussed solutions. No one has down-voted it since then. Melanie, what
is your opinion today on this patch? Did you change your mind as you worked for
many months on BNLJ since then?

> On 3/19/23 20:31, Justin Pryzby wrote:
> > On Fri, Mar 17, 2023 at 05:41:11PM +0100, Tomas Vondra wrote:  
>  * Patch 2 is worth considering to backpatch  
> >>
> >> I'm not quite sure what exactly are the numbered patches, as some of the
> >> threads had a number of different patch ideas, and I'm not sure which
> >> one was/is the most promising one.  
> > 
> > patch 2 is referring to the list of patches that was compiled
> > https://www.postgresql.org/message-id/20230310195114.6d0c5406%40karst  
> 
> Ah, I see - it's just the "rebalancing" patch which minimizes the total
> amount of memory used (i.e. grow work_mem a bit, so that we don't
> allocate too many files).
> 
> Yeah, I think that's the best we can do without reworking how we spill
> data (slicing or whatever).

Indeed, that's why I was speaking about backpatching for this one:

* it surely helps 16 (and maybe previous releases) in such skewed situations
* it's constrained
* it's not too invasive, it doesn't shake the whole algorithm all over the place
* Melanie was +1 for it, no one down-voted it.

What needs to be discussed/worked on:

* any regressions for existing queries currently running fine or without OOM?
* adding the bufFile memory consumption to the planner considerations?

> I think the spilling is "nicer" in that it actually enforces work_mem
> more strictly than (a), 

Sure it enforces work_mem more strictly, but this is a discussion for 17 or
later in my humble opinion.

> but we should probably spend a bit more time on
> the exact spilling strategy. I only really tried two trivial ideas, but
> maybe we can be smarter about allocating / sizing the files? Maybe
> instead of slices of constant size we could/should make them larger,
> similarly to what log-structured storage does.
> 
> For example we might double the number of batches a file represents, so
> the first file would be for batch 1, the next one for batches 2+3, then
> 4+5+6+7, then 8-15, then 16-31, ...
> 
> We should have some model calculating (i) amount of disk space we would
> need for the spilled files, and (ii) the amount of I/O we do when
> writing the data between temp files.

And:

* what about Robert's discussion on uneven batch distribution? Why is it
  ignored? Maybe there were some IRL or off-list discussions? Or did I miss
  some mails?
* what about dealing with all the normal batches first, then reprocessing the
  skewed ones into the freshly emptied batches, splitting them if needed, then
  rinse & repeat? At some point, we would probably still need something like
  slicing and/or BNLJ though...

> let's revive the rebalance and/or spilling patches. Imperfect but better than
> nothing.

+1 for rebalance. I'm not even sure it could make it to 16 as we are running
out of time, but it's worth a try as it's the closest one that could be
reviewed and marked ready-for-committer.

I might lack ambition, but the spilling patches seem too much to make it for
16. They seem to belong with other larger patches/ideas (patches 4 and 5 in my
sum-up). But this is just my humble feeling.

> And then in the end we can talk about if/what can be backpatched.
> 
> FWIW I don't think there's a lot of rush, considering this is clearly a
> matter for PG17. So the summer CF at the earliest, people are going to
> be busy until then.

100% agree, there's no rush for patches 3, 4 ... and 5.

Thanks!




Re: Memory leak from ExecutorState context?

2023-03-17 Thread Jehan-Guillaume de Rorthais
Hi there,

On Fri, 10 Mar 2023 19:51:14 +0100
Jehan-Guillaume de Rorthais  wrote:

> > So I guess the best thing would be to go through these threads, see what
> > the status is, restart the discussion and propose what to do. If you do
> > that, I'm happy to rebase the patches, and maybe see if I could improve
> > them in some way.  
> 
> [...]
> 
> > I was hoping we'd solve this by the BNL, but if we didn't get that in 4
> > years, maybe we shouldn't stall and get at least an imperfect stop-gap
> > solution ...  
> 
> Indeed. So, to sum-up:
> 
> * Patch 1 could be rebased/applied/backpatched

Would it help if I rebase Patch 1 ("move BufFile stuff into separate context")?

> * Patch 2 is worth considering to backpatch

Same question.

> * Patch 3 seemed withdrawn in favor of BNLJ
> * Patch 4 is waiting for some more review and has some TODO
> * discussion 5 worth few minutes to discuss before jumping on previous topics

These other patches need more discussion and hacking. They have a low
priority compared to other discussions and the running commitfest. However,
how can we avoid losing them in limbo again?

Regards,




Re: Memory leak from ExecutorState context?

2023-03-10 Thread Jehan-Guillaume de Rorthais
Hi,

> So I guess the best thing would be to go through these threads, see what
> the status is, restart the discussion and propose what to do. If you do
> that, I'm happy to rebase the patches, and maybe see if I could improve
> them in some way.

OK! It took me some time, but I did it. I'll try to sum up the situation as
simply as possible.

I reviewed the following threads:

* Out of Memory errors are frustrating as heck!
  2019-04-14 -> 2019-04-28
  
https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net

  This discussion stalled, waiting for the OP, but the ideas there ignited all
  the other discussions.

* accounting for memory used for BufFile during hash joins
  2019-05-04 -> 2019-09-10
  
https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh%40development

  This was supposed to push forward a patch discussed on the previous thread,
  but it actually took over and more ideas popped up from there.

* Replace hashtable growEnable flag
  2019-05-15 -> 2019-05-16
  
https://www.postgresql.org/message-id/flat/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com

  This one quickly merged to the next one.

* Avoiding hash join batch explosions with extreme skew and weird stats
  2019-05-16 -> 2020-09-24
  
https://www.postgresql.org/message-id/flat/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

  Another thread discussing another facet of the problem, but it eventually
  ended up discussing / reviewing the BNLJ implementation.
  
Five possible fixes/ideas were discussed all over these threads:


1. "move BufFile stuff into separate context"
   last found patch: 2019-04-21
   
https://www.postgresql.org/message-id/20190421114618.z3mpgmimc3rmubi4%40development
   
https://www.postgresql.org/message-id/attachment/100822/0001-move-BufFile-stuff-into-separate-context.patch

   This patch helps with observability/debug by allocating the bufFiles in the
   appropriate context instead of the "ExecutorState" one.

   I suppose this simple one has been forgotten in the fog of all the other
   discussions. Also, this is probably worth backpatching.

2. "account for size of BatchFile structure in hashJoin"
   last found patch: 2019-04-22
   
https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
   
https://www.postgresql.org/message-id/attachment/100951/v2-simple-rebalance.patch

   This patch seems like a good first step:

   * it definitely helps older versions where other patches discussed are way
 too invasive to be backpatched
   * it doesn't get in the way of the other discussed patches

   While looking at the discussions around this patch, I was wondering if the
   planner considers the memory allocation of bufFiles. But of course, Melanie
   already noticed that long before I was aware of this problem and discussion:

   2019-07-10: «I do think that accounting for Buffile overhead when estimating
   the size of the hashtable during ExecChooseHashTableSize() so it can be
 used during planning is a worthwhile patch by itself (though I know it
 is not even part of this patch).»
   
https://www.postgresql.org/message-id/CAAKRu_Yiam-%3D06L%2BR8FR%2BVaceb-ozQzzMqRiY2pDYku1VdZ%3DEw%40mail.gmail.com
 
   Tomas Vondra agreed with this in his answer, but no new version of the
   patch was produced.

   Finally, Melanie was pushing the idea of committing this patch regardless
   of the other pending patches/ideas:

   2019-09-05: «If Tomas or someone else has time to pick up and modify BufFile
 accounting patch, committing that still seems like the nest logical
 step.»
   
https://www.postgresql.org/message-id/CAAKRu_b6%2BjC93WP%2BpWxqK5KAZJC5Rmxm8uquKtEf-KQ%2B%2B1Li6Q%40mail.gmail.com

   Unless I'm wrong, no one down-voted this.

3. "per slice overflow file"
   last found patch: 2019-05-08
   
https://www.postgresql.org/message-id/20190508150844.rij36rtuk4lhvztw%40development
   
https://www.postgresql.org/message-id/attachment/101080/v4-per-slice-overflow-file-20190508.patch

   This patch was withdrawn after an off-list discussion with Thomas Munro
   because of a missing parallel hash join implementation. Plus, before any
   effort started on the parallel implementation, the BNLJ idea appeared and
   seemed more appealing.

   See:
   
https://www.postgresql.org/message-id/20190529145517.sj2poqmb3cr4cg6w%40development

   At the time, it still seemed to have some interest despite the BNLJ patch:

   2019-07-10: «If slicing is made to work for parallel-aware hashjoin and the
   code is in a committable state (and probably has the threshold I mentioned
 above), then I think that this patch should go in.»
   
https://www.postgresql.org/message-id/CAAKRu_Yiam-%3D06L%2BR8FR%2BVaceb-ozQzzMqRiY2pDYku1VdZ%3DEw%40mail.gmail.com

   But this might have been contradicted later by Tomas:

   2019-09-10: «I have to admit I kinda lost track [...] My feeling is that we
 should get the BNLJ committed 

Re: Memory leak from ExecutorState context?

2023-03-02 Thread Jehan-Guillaume de Rorthais
On Thu, 2 Mar 2023 19:53:14 +0100
Tomas Vondra  wrote:
> On 3/2/23 19:15, Jehan-Guillaume de Rorthais wrote:
...

> > There was some thoughts about how to make a better usage of the memory. As
> > memory is exploding way beyond work_mem, at least, avoid to waste it with
> > too many buffers of BufFile. So you expand either the work_mem or the
> > number of batch, depending on what move is smarter. TJis is explained and
> > tested here:
> > 
> > https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
> > https://www.postgresql.org/message-id/20190422030927.3huxq7gghms4kmf4%40development
> > 
> > And then, another patch to overflow each batch to a dedicated temp file and
> > stay inside work_mem (v4-per-slice-overflow-file.patch):
> > 
> > https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
> > 
> > Then, nothing more on the discussion about this last patch. So I guess it
> > just went cold.
> 
> I think a contributing factor was that the OP did not respond for a
> couple months, so the thread went cold.
> 
> > For what it worth, these two patches seems really interesting to me. Do you
> > need any help to revive it?
> 
> I think another reason why that thread went nowhere were some that we've
> been exploring a different (and likely better) approach to fix this by
> falling back to a nested loop for the "problematic" batches.
> 
> As proposed in this thread:
> 
>  
> https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development

Unless I'm wrong, you are linking to the same «frustrated as heck!» discussion,
for your patch v2-0001-account-for-size-of-BatchFile-structure-in-hashJo.patch
(balancing between increasing batches *and* work_mem).

No sign of turning "problematic" batches into a nested loop. Did I miss something?

Do you have a link at hand about such an algo/patch test, by any chance?

> I was hoping we'd solve this by the BNL, but if we didn't get that in 4
> years, maybe we shouldn't stall and get at least an imperfect stop-gap
> solution ...

I'll keep searching tomorrow for existing BNL discussions (is it block-level
nested loops?).

Regards,




Re: Memory leak from ExecutorState context?

2023-03-02 Thread Jehan-Guillaume de Rorthais
On Thu, 2 Mar 2023 19:15:30 +0100
Jehan-Guillaume de Rorthais  wrote:
[...]
> For what it worth, these two patches seems really interesting to me. Do you
> need any help to revive it?

To avoid confusion, the two patches I meant were:

* 0001-move-BufFile-stuff-into-separate-context.patch   
* v4-per-slice-overflow-file.patch

Regards,




Re: Memory leak from ExecutorState context?

2023-03-02 Thread Jehan-Guillaume de Rorthais
Hi!

On Thu, 2 Mar 2023 13:44:52 +0100
Tomas Vondra  wrote:
> Well, yeah and no.
> 
> In principle we could/should have allocated the BufFiles in a different
> context (possibly hashCxt). But in practice it probably won't make any
> difference, because the query will probably run all the hashjoins at the
> same time. Imagine a sequence of joins - we build all the hashes, and
> then tuples from the outer side bubble up through the plans. And then
> you process the last tuple and release all the hashes.
> 
> This would not fix the issue. It'd be helpful for accounting purposes
> (we'd know it's the buffiles and perhaps for which hashjoin node). But
> we'd still have to allocate the memory etc. (so still OOM).

Well, accounting for things in the correct context would already be worth a
patch, I suppose. At least, it helps to investigate faster. Plus, you already
wrote a patch about that [1]:

https://www.postgresql.org/message-id/20190421114618.z3mpgmimc3rmubi4%40development

Note that I did reference the "Out of Memory errors are frustrating as heck!"
thread in my first email, pointing at a Tom Lane email explaining that
ExecutorState was not supposed to be so large [2].

[2] 
https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b

> There's only one thing I think could help - increase the work_mem enough
> not to trigger the explosive growth in number of batches. Imagine
> there's one very common value, accounting for ~65MB of tuples. With
> work_mem=64MB this leads to exactly the explosive growth you're
> observing here. With 128MB it'd probably run just fine.
> 
> The problem is we don't know how large the work_mem would need to be :-(
> So you'll have to try and experiment a bit.
> 
> I remembered there was a thread [1] about *exactly* this issue in 2019.
> 
> [1]
> https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net
>
> I even posted a couple patches that try to address this by accounting
> for the BufFile memory, and increasing work_mem a bit instead of just
> blindly increasing the number of batches (ignoring the fact that means
> more memory will be used for the BufFile stuff).
> 
> I don't recall why it went nowhere, TBH. But I recall there were
> discussions about maybe doing something like "block nestloop" at the
> time, or something. Or maybe the thread just went cold.

So I read the full thread now. I'm still not sure why we try so hard to avoid
hash collisions, and why a similar data subset barely larger than work_mem
makes the number of batches explode, but I think I have a better understanding
of the discussion and the proposed solutions.

There were some thoughts about how to make better use of the memory. As
memory is exploding way beyond work_mem, at least avoid wasting it with too
many BufFile buffers. So you expand either work_mem or the number of batches,
depending on which move is smarter. This is explained and tested here:

https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
https://www.postgresql.org/message-id/20190422030927.3huxq7gghms4kmf4%40development
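
The gist of the trade-off, as I understand it (a pseudo-C sketch of the idea
only, not the actual patch):

  /*
   * Doubling nbatch adds 2 * nbatch * 8kB of BufFile buffers (inner plus
   * outer side), so once that overhead gets close to the in-memory budget
   * itself, it is cheaper to let the hash table grow instead of doubling
   * the number of batches again.
   */
  Size batch_buffers = (Size) hashtable->nbatch * 2 * BLCKSZ;

  if (batch_buffers >= hashtable->spaceAllowed)
      hashtable->spaceAllowed *= 2;          /* grow the in-memory budget */
  else
      ExecHashIncreaseNumBatches(hashtable); /* double the number of batches */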

And then, another patch to overflow each batch to a dedicated temp file and
stay inside work_mem (v4-per-slice-overflow-file.patch):

https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development

Then, nothing more on the discussion about this last patch. So I guess it just
went cold.

For what it's worth, these two patches seem really interesting to me. Do you
need any help reviving them?

Regards,




Re: Memory leak from ExecutorState context?

2023-03-02 Thread Jehan-Guillaume de Rorthais
On Thu, 2 Mar 2023 01:30:27 +0100
Tomas Vondra  wrote:
> On 3/2/23 00:18, Jehan-Guillaume de Rorthais wrote:
> >>> ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands
> >>> of calls, always short-cut'ed to 1048576, I guess because of the
> >>> conditional block «/* safety check to avoid overflow */» appearing early
> >>> in this function.   
> >[...] But what about this other short-cut then?
> > 
> > /* do nothing if we've decided to shut off growth */
> > if (!hashtable->growEnabled)
> > return;
> > 
> > [...]
> > 
> > /*
> >  * If we dumped out either all or none of the tuples in the table,
> >  * disable
> >  * further expansion of nbatch.  This situation implies that we have
> >  * enough tuples of identical hashvalues to overflow spaceAllowed.
> >  * Increasing nbatch will not fix it since there's no way to subdivide
> >  * the
> >  * group any more finely. We have to just gut it out and hope the server
> >  * has enough RAM.
> >  */
> > if (nfreed == 0 || nfreed == ninmemory)
> > {
> > hashtable->growEnabled = false;
> >   #ifdef HJDEBUG
> > printf("Hashjoin %p: disabling further increase of nbatch\n",
> >hashtable);
> >   #endif
> > }
> > 
> > If I guess correctly, the function is not able to split the current batch,
> > so it sits and hopes. This is a much better suspect and I can surely track
> > this from gdb.
> 
> Yes, this would make much more sense - it'd be consistent with the
> hypothesis that this is due to number of batches exploding (it's a
> protection exactly against that).
> 
> You specifically mentioned the other check earlier, but now I realize
> you've been just speculating it might be that.

Yes, sorry about that, I jumped on this speculation without actually digging
into it much...

[...]
> But I have another idea - put a breakpoint on makeBufFile() which is the
> bit that allocates the temp files including the 8kB buffer, and print in
> what context we allocate that. I have a hunch we may be allocating it in
> the ExecutorState. That'd explain all the symptoms.

That's what I was wondering as well last night.

So, on your advice, I set a breakpoint on makeBufFile:

  (gdb) info br
  Num Type   Disp Enb AddressWhat
  1   breakpoint keep y   0x007229df in makeBufFile
  bt 10
  p CurrentMemoryContext.name


Then, I disabled it and ran the query up to this mem usage:

   VIRTRESSHR S  %CPU %MEM
  20.1g   7.0g  88504 t   0.0 22.5

Then, I enabled the breakpoint and looked at around 600 backtraces and context
names before getting bored. They **all** looked like this:

  Breakpoint 1, BufFileCreateTemp (...)at buffile.c:201
  201 in buffile.c
  #0  BufFileCreateTemp (...) buffile.c:201
  #1  ExecHashJoinSaveTuple (tuple=0x1952c180, ...)   nodeHashjoin.c:1238
  #2  ExecHashJoinImpl (parallel=false, pstate=0x31a6418) nodeHashjoin.c:398
  #3  ExecHashJoin (pstate=0x31a6418) nodeHashjoin.c:584
  #4  ExecProcNodeInstr (node=)execProcnode.c:462
  #5  ExecProcNode (node=0x31a6418)
  #6  ExecSort (pstate=0x31a6308)
  #7  ExecProcNodeInstr (node=)
  #8  ExecProcNode (node=0x31a6308)
  #9  fetch_input_tuple (aggstate=aggstate@entry=0x31a5ea0)
  
  $421643 = 0x99d7f7 "ExecutorState"

These 600-ish 8kB buffers were all allocated in "ExecutorState". I could
probably log many more of them if more checks/stats need to be collected, but
it really slows down the query a lot, granting it only 1-5% of CPU time
instead of the usual 100%.

So it's not exactly a leak, as the memory would be released at the end of the
query, but I suppose they should be allocated in a shorter-lived context, to
avoid this memory bloat, am I right?

> BTW with how many batches does the hash join start?

* batches went from 32 to 1048576 before growEnabled was set to false, as suspected
* original and current nbuckets were set to 1048576 immediately
* allowed space is set to work_mem, but current space usage is 1.3GB, as
  measured previously just before the system refused further memory allocation.

Here are the full details about the hash associated with the previous backtrace:

  (gdb) up
  (gdb) up
  (gdb) p *((HashJoinState*)pstate)->hj_HashTable
  $421652 = {
nbuckets = 1048576,
log2_nbuckets = 20,
nbuckets_original = 1048576,
nbuckets_optimal = 1048576,
log2_nbuckets_optimal = 20,
buckets = {unshared = 0x68f12e8, shared = 0x68f12e8},
keepNulls = true,
skewEnabled = false,
skewBucket = 0x0,
skewBucketLen = 0,

Re: Memory leak from ExecutorState context?

2023-03-01 Thread Jehan-Guillaume de Rorthais
Hi,

On Wed, 1 Mar 2023 20:29:11 +0100
Tomas Vondra  wrote:
> On 3/1/23 18:48, Jehan-Guillaume de Rorthais wrote:
> > On Tue, 28 Feb 2023 20:51:02 +0100
> > Tomas Vondra  wrote:  
> >> On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:  
> >>> * HashBatchContext goes up to 1441MB after 240s then stay flat until the
> >>> end (400s as the last record)
> >>
> >> That's interesting. We're using HashBatchContext for very few things, so
> >> what could it consume so much memory? But e.g. the number of buckets
> >> should be limited by work_mem, so how could it get to 1.4GB?
> >>
> >> Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
> >> and print how many batches/butches are there?  
> > 
> > I did this test this morning.
> > 
> > Batches and buckets increased really quickly to 1048576/1048576.
> 
> OK. I think 1M buckets is mostly expected for work_mem=64MB. It means
> buckets will use 8MB, which leaves ~56B per tuple (we're aiming for
> fillfactor 1.0).
> 
> But 1M batches? I guess that could be problematic. It doesn't seem like
> much, but we need 1M files on each side - 1M for the hash table, 1M for
> the outer relation. That's 16MB of pointers, but the files are BufFile
> and we keep 8kB buffer for each of them. That's ~16GB right there :-(
>
> In practice it probably won't be that bad, because not all files will be
> allocated/opened concurrently (especially if this is due to many tuples
> having the same value). Assuming that's what's happening here, ofc.

And I suppose they are closed/freed concurrently as well?

> > ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands
> > of calls, always short-cut'ed to 1048576, I guess because of the
> > conditional block «/* safety check to avoid overflow */» appearing early in
> > this function. 
> 
> Hmmm, that's a bit weird, no? I mean, the check is
> 
>   /* safety check to avoid overflow */
>   if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
>  return;
> 
> Why would it stop at 1048576? It certainly is not higher than  INT_MAX/2
> and with MaxAllocSize = ~1GB the second value should be ~33M. So what's
> happening here?

Indeed, not the right suspect. But what about this other short-cut then?

/* do nothing if we've decided to shut off growth */
if (!hashtable->growEnabled)
return;

[...]

/*
 * If we dumped out either all or none of the tuples in the table, disable
 * further expansion of nbatch.  This situation implies that we have
 * enough tuples of identical hashvalues to overflow spaceAllowed.
 * Increasing nbatch will not fix it since there's no way to subdivide the
 * group any more finely. We have to just gut it out and hope the server
 * has enough RAM.
 */
if (nfreed == 0 || nfreed == ninmemory)
{
hashtable->growEnabled = false;
  #ifdef HJDEBUG
printf("Hashjoin %p: disabling further increase of nbatch\n",
   hashtable);
  #endif
}

If I guess correctly, the function is not able to split the current batch, so
it sits and hopes. This is a much better suspect and I can surely track this
from gdb.

Being able to find which fields are involved in the join could help as well,
to check or gather some stats about them, but I haven't had time to dig into
this yet...

[...]
> >> Investigating memory leaks is tough, especially for generic memory
> >> contexts like ExecutorState :-( Even more so when you can't reproduce it
> >> on a machine with custom builds.
> >>
> >> What I'd try is this:

[...]
> > I couldn't print "size" as it is optimzed away, that's why I tracked
> > chunk->size... Is there anything wrong with my current run and gdb log? 
> 
> There's definitely something wrong. The size should not be 0, and
> neither it should be > 1GB. I suspect it's because some of the variables
> get optimized out, and gdb just uses some nonsense :-(
> 
> I guess you'll need to debug the individual breakpoints, and see what's
> available. It probably depends on the compiler version, etc. For example
> I don't see the "chunk" for breakpoint 3, but "chunk_size" works and I
> can print the chunk pointer with a bit of arithmetics:
> 
>   p (block->freeptr - chunk_size)
> 
> I suppose similar gympastics could work for the other breakpoints.

OK, I'll give it a try tomorrow.

Thank you!

NB: the query has been killed by the replication.




Re: Memory leak from ExecutorState context?

2023-03-01 Thread Jehan-Guillaume de Rorthais
On Wed, 1 Mar 2023 20:34:08 +0100
Tomas Vondra  wrote:

> On 3/1/23 19:09, Jehan-Guillaume de Rorthais wrote:
> > On Wed, 1 Mar 2023 18:48:40 +0100
> > Jehan-Guillaume de Rorthais  wrote:
> > ...  
> >> You'll find some intermediate stats I already collected in attachment:
> >>
> >> * break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
> >> * most of the non-free'd chunk are allocated since the very beginning,
> >> before the 5000's allocation call for almost 1M call so far.
> >> * 3754 of them have a chunk->size of 0
> >> * it seems there's some buggy stats or data:
> >># this one actually really comes from the gdb log
> >>0x38a77b8: break=3 num=191   sz=4711441762604810240 (weird sz)
> >># this one might be a bug in my script
> >>  0x2: break=2 num=945346sz=2   (weird
> >> address)
> >> * ignoring the weird size requested during the 191st call, the total amount
> >>   of non free'd memory is currently 5488MB  
> > 
> > I forgot one stat. I don't know if this is expected, normal or not, but 53
> > chunks has been allocated on an existing address that was not free'd before.
> >   
> 
> It's likely chunk was freed by repalloc() and not by pfree() directly.
> Or maybe the whole context got destroyed/reset, in which case we don't
> free individual chunks. But that's unlikely for the ExecutorState.

Well, as all breakpoints were conditional on ExecutorState, I suppose this
might be repalloc then.

Regards,




Re: Memory leak from ExecutorState context?

2023-03-01 Thread Jehan-Guillaume de Rorthais
On Wed, 1 Mar 2023 18:48:40 +0100
Jehan-Guillaume de Rorthais  wrote:
...
> You'll find some intermediate stats I already collected in attachment:
> 
> * break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
> * most of the non-free'd chunk are allocated since the very beginning, before
>   the 5000's allocation call for almost 1M call so far.
> * 3754 of them have a chunk->size of 0
> * it seems there's some buggy stats or data:
># this one actually really comes from the gdb log
>0x38a77b8: break=3 num=191   sz=4711441762604810240 (weird sz)
># this one might be a bug in my script
>  0x2: break=2 num=945346sz=2   (weird address)
> * ignoring the weird size requested during the 191st call, the total amount
>   of non free'd memory is currently 5488MB

I forgot one stat. I don't know if this is expected/normal or not, but 53
chunks have been allocated at an existing address that was not free'd before.

Regards,




Re: Memory leak from ExecutorState context?

2023-03-01 Thread Jehan-Guillaume de Rorthais
Hi Tomas,

On Tue, 28 Feb 2023 20:51:02 +0100
Tomas Vondra  wrote:
> On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
> > * HashBatchContext goes up to 1441MB after 240s then stay flat until the end
> >   (400s as the last record)  
> 
> That's interesting. We're using HashBatchContext for very few things, so
> what could it consume so much memory? But e.g. the number of buckets
> should be limited by work_mem, so how could it get to 1.4GB?
> 
> Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
> and print how many batches/butches are there?

I did this test this morning.

Batches and buckets increased really quickly to 1048576/1048576.

ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands of
calls, always short-cut'ed to 1048576, I guess because of the conditional block
«/* safety check to avoid overflow */» appearing early in this function.

I disabled the breakpoint on ExecHashIncreaseNumBatches a few times to make
the query run faster. Enabling it at 19.1GB of memory consumption, it stayed
silent till memory exhaustion (around 21 or 22GB, I don't remember exactly).

The breakpoint on ExecHashIncreaseNumBuckets triggered a few times at the
beginning, and one last time close to the end of the query execution.

> > Any idea? How could I help to have a better idea if a leak is actually
> > occurring and where exactly?
> 
> Investigating memory leaks is tough, especially for generic memory
> contexts like ExecutorState :-( Even more so when you can't reproduce it
> on a machine with custom builds.
> 
> What I'd try is this:
> 
> 1) attach breakpoints to all returns in AllocSetAlloc(), printing the
> pointer and size for ExecutorState context, so something like
> 
>   break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
>   commands
>   print MemoryChunkGetPointer(chunk) size
>   cont
>   end
> 
> 2) do the same for AllocSetFree()
> 
> 3) Match the palloc/pfree calls (using the pointer address), to
> determine which ones are not freed and do some stats on the size.
> Usually there's only a couple distinct sizes that account for most of
> the leaked memory.

So here is what I ended up with this afternoon, using files, lines and macros
from REL_11_18:

  set logging  on
  set pagination off
  
  break aset.c:781 if strcmp("ExecutorState",context.name) == 0
  commands 1
  print (((char *)(chunk)) + sizeof(struct AllocChunkData))
  print chunk->size
  cont
  end
  
  break aset.c:820 if strcmp("ExecutorState",context.name) == 0
  commands 2
  print (((char *)(chunk)) + sizeof(struct AllocChunkData))
  print chunk->size
  cont 
  end
  
  break aset.c:979 if strcmp("ExecutorState",context.name) == 0
  commands 3
  print (((char *)(chunk)) + sizeof(struct AllocChunkData))
  print chunk->size
  cont 
  end
  
  break AllocSetFree if strcmp("ExecutorState",context.name) == 0
  commands 4 
  print pointer
  cont
  end

So far, gdb has used more than 3h of CPU time and is eating 2.4GB of memory.
The backend has used only 3'32" of CPU time:

   VIRTRESSHR S  %CPU %MEM TIME+ COMMAND
  2727284   2.4g  17840 R  99.0  7.7 181:25.07 gdb
  9054688 220648 103056 t   1.3  0.7   3:32.05 postmaster

Interestingly, the RES memory of the backend has not exploded yet, but VIRT
is already high.

I suppose the query will run for some more hours, hopefully, gdb will not
exhaust the memory in the meantime...

You'll find some intermediate stats I already collected in attachment:

* break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
* most of the non-free'd chunks were allocated at the very beginning, before
  the 5000th allocation call, out of almost 1M calls so far.
* 3754 of them have a chunk->size of 0
* it seems there are some buggy stats or data:
   # this one actually really comes from the gdb log
   0x38a77b8: break=3 num=191   sz=4711441762604810240 (weird sz)
   # this one might be a bug in my script
 0x2: break=2 num=945346sz=2   (weird address)
* ignoring the weird size requested during the 191st call, the total amount
  of non-free'd memory is currently 5488MB

I couldn't print "size" as it is optimized away, that's why I tracked
chunk->size... Is there anything wrong with my current run and gdb log?

The gdb log is 5MB compressed. I'll keep it off-list, but I can provide it if
needed.

Stay tuned...

Thank you!


allocs-tmp-stats.txt.gz
Description: application/gzip


Re: Memory leak from ExecutorState context?

2023-03-01 Thread Jehan-Guillaume de Rorthais
Hi Justin,

On Tue, 28 Feb 2023 12:25:08 -0600
Justin Pryzby  wrote:

> On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote:
> > Hello all,
> > 
> > A customer is facing out of memory query which looks similar to this
> > situation:
> > 
> >   
> > https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
> > 
> > This PostgreSQL version is 11.18. Some settings:  
> 
> hash joins could exceed work_mem until v13:

Yes, I am aware of this. But as far as I understand Tom Lane's explanations in
the discussion mentioned upthread, it should not be ExecutorState.
ExecutorState (13GB) is at least ten times bigger than any other context,
including HashBatchContext (1.4GB) or HashTableContext (16MB). So maybe some
aggregate is walking toward the wall because of a bad estimation, but something
else is racing toward it way faster. And presently it might be something
related to some JOIN node.

About your other points, you are right, there are numerous things we could do
to improve this query, and our customer is considering them as well. It's just
a matter of time now.

But in the meantime, we are facing a query with memory behavior that seems
suspect. Following the 4-year-old thread I mentioned, my goal is to inspect
and provide all possible information to determine whether it's "normal"
behavior or something that might/should be fixed.

Thank you for your help!




Re: Transparent column encryption

2022-11-24 Thread Jehan-Guillaume de Rorthais
On Wed, 23 Nov 2022 19:45:06 +0100
Peter Eisentraut  wrote:
> On 28.10.22 12:16, Jehan-Guillaume de Rorthais wrote:
[...]

> >* I wonder if encryption related fields in ParameterDescription and
> >  RowDescription could be optional somehow? The former might be quite
> >  large when using a lot of parameters (like, imagine a large and ugly
> >  "IN($1...$N)"). On the other hand, these messages are not sent in high
> >  frequency anyway...  
> 
> They are only used if you turn on the column_encryption protocol option. 
>   Or did you mean make them optional even then?

I meant even when column_encryption is turned on.

Regards,




Re: Fix proposal for comparaison bugs in PostgreSQL::Version

2022-11-04 Thread Jehan-Guillaume de Rorthais
On Thu, 3 Nov 2022 13:11:18 -0500
Justin Pryzby  wrote:

> On Tue, Jun 28, 2022 at 06:17:40PM -0400, Andrew Dunstan wrote:
> > Nice catch, but this looks like massive overkill. I think we can very
> > simply fix the test in just a few lines of code, instead of a 190 line
> > fix and a 130 line TAP test.
> > 
> > It was never intended to be able to compare markers like rc1 vs rc2, and
> > I don't see any need for it. If you can show me a sane use case I'll
> > have another look, but right now it seems quite unnecessary.
> > 
> > Here's my proposed fix.
> > 
> > diff --git a/src/test/perl/PostgreSQL/Version.pm
> > b/src/test/perl/PostgreSQL/Version.pm index 8f70491189..8d4dbbf694 100644
> > --- a/src/test/perl/PostgreSQL/Version.pm  
> 
> Is this still an outstanding issue ?

The issue still exists on current HEAD:

  $ perl -Isrc/test/perl/ -MPostgreSQL::Version -le \
  'print "bug" if PostgreSQL::Version->new("9.6") <= 9.0'
  bug

Regards,





Re: [BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)

2022-11-04 Thread Jehan-Guillaume de Rorthais
On Thu, 3 Nov 2022 20:44:16 +0100
Alvaro Herrera  wrote:

> On 2022-Oct-05, Alvaro Herrera wrote:
> 
> > I've been giving the patches a look and it caused me to notice two
> > additional bugs in the same area:
> > 
> > - FKs in partitions are sometimes marked NOT VALID.  This is because of
> >   missing initialization when faking up a Constraint node in
> >   CloneFkReferencing.  Easy to fix, have patch, running tests now.  
> 
> I have pushed the fix for this now.

Thank you Alvaro!




Re: psql: Add command to use extended query protocol

2022-11-02 Thread Jehan-Guillaume de Rorthais
On Wed, 02 Nov 2022 16:04:02 +0100
"Daniel Verite"  wrote:

>   Jehan-Guillaume de Rorthais wrote:
> 
> > As I wrote in my TCE review, would it be possible to use psql vars to set
> > some named parameters for the prepared query? This would looks like:
> > 
> >  \set p1 foo
> >  \set p2 bar
> >  SELECT :'p1', :'p2' \gp  
> 
> As I understand the feature, variables would be passed like this:
> 
> \set var1 'foo bar'
> \set var2 'baz''qux'
> 
> select $1, $2 \gp :var1 :var2
> 
>  ?column? | ?column? 
> --+--
>  foo bar  | baz'qux
> 
> It appears to work fine with the current patch.

Indeed, nice.

> This is consistent with the fact that PQexecParams passes $N
> parameters ouf of the SQL query (versus injecting them in the text of
> the query)

I was not thinking about injecting them into the text of the query; that would
not be using the extended protocol anymore (or only with no parameters), and
there would be no point in that.

What I was thinking about is psql replacing the variables in the query text
with $N placeholders before sending it using PQprepare.
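
To illustrate the idea (purely hypothetical, this is not what the current patch
does), the client-side rewrite would look something like:

  \set p1 'foo bar'
  \set p2 'baz'
  SELECT :'p1', :'p2' \gp
  -- hypothetically rewritten by psql into:  SELECT $1, $2
  -- sent through PQprepare()/PQexecPrepared(), with $1 = 'foo bar'
  -- and $2 = 'baz' passed as parameters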

> which is also why no quoting is needed.

Indeed, the quotes were not needed in my example.

Thanks,




Re: psql: Add command to use extended query protocol

2022-11-02 Thread Jehan-Guillaume de Rorthais
Hi,

On Fri, 28 Oct 2022 08:52:51 +0200
Peter Eisentraut  wrote:

> This adds a new psql command \gp that works like \g (or semicolon) but
> uses the extended query protocol.  Parameters can also be passed, like
> 
>  SELECT $1, $2 \gp 'foo' 'bar'

As I wrote in my TCE review, would it be possible to use psql vars to set some
named parameters for the prepared query? This would look like:

  \set p1 foo
  \set p2 bar
  SELECT :'p1', :'p2' \gp

This seems useful when running a psql script and passing it some variables with
the -v argument. It helps with parameter positions, changing some of them
between executions, repeating them in the query, etc.
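
For example (a sketch of the intended usage; "report.sql" is a hypothetical file
name, assuming the proposed \gp accepted psql variables this way):

  $ psql -v p1=foo -v p2=bar -f report.sql

with report.sql containing something like:

  SELECT :'p1', :'p2' \gp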

Thoughts?




Re: Commitfest documentation

2022-10-31 Thread Jehan-Guillaume de Rorthais
Hi Aleksander,

Thank you for your help!
 
On Mon, 31 Oct 2022 16:51:23 +0300
Aleksander Alekseev  wrote:
[...]
> > In the commitfest application, I was wondering today what was the exact
> > meaning and difference between open/closed status (is it only for the
> > current commitfest?)  
> 
> Closed means that the CF was in the past. It is archived now. Open
> means that new patches are accepted to the given CF. If memory serves,
> when the CF starts the status changes to "In Progress".

Sorry, I was asking from a patch point of view, not the whole commitfest. If
you look at the "Change Status" list on a patch page, there are two sublists:
"Open statuses" and "Closed statuses". But your answer below answered
the question anyway.

> There are five CFs a year: in January, March, July, September, and
> November. November one is about to start.

This detail might have a place in the following page:
https://wiki.postgresql.org/wiki/CommitFest

But I'm not sure it's really worth it?

> > and between «waiting for author» and «Returned with feedback».  
> 
> RwF is almost the same as "Rejected". It means that some feedback was
> provided for the patch and the community wouldn't mind accepting a new
> patch when and if this feedback will be accounted for.
> 
> WfA means that the patch awaits some (relatively small) actions from
> the author. Typically it happens after another round of code review.

Thank you for the disambiguation. Here is a proposal for all statuses:

 * Needs review: Waiting for a new review.
 * WfA : The patch awaits some (relatively small) actions from
 the author, typically after another round of code review.
 * Ready fC: No more comments from the reviewer. The code is ready for a
 committer review.
 * Rejected: The code is rejected. The community is not willing to accept a
 new patch about $subject.
 * Withdrawn: The author decided to remove their patch from the commitfest.
 * Returned wF : Some feedback was provided for the patch and the community
 wouldn't mind accepting a new patch when and if this feedback
 is accounted for.
 * Moved to next CF: The patch is still waiting for the author, the reviewers or a
 committer at the end of the current CF.
 * Committed   : The patch has been committed.

> Attached is a (!) simplified diagram of a typical patch livecycle.
> Hopefully it will help a bit.

It is missing a Withdrawn box :)
I suppose it would be linked from the Waiting on Author one.

> > I couldn't find a clear definition searching the wiki, the mailing list (too
> > much unrelated results) or in the app itself.  
> 
> Yes, this could be a tribe knowledge to a certain degree at the
> moment. On the flip side this is also an opportunity to contribute an
> article to the Wiki.

I suppose these definitions might go in:
https://wiki.postgresql.org/wiki/Reviewing_a_Patch

However, I'm not strictly sure who is responsible for setting these statuses. The
reviewer? The author? The committer? The CF manager? I bet on the reviewer, but
it seems weird that a random reviewer can reject a patch on their own.

Regards,




Commitfest documentation

2022-10-31 Thread Jehan-Guillaume de Rorthais
Hi,

In the commitfest application, I was wondering today about the exact meaning of,
and difference between, the open/closed statuses (is it only for the current
commitfest?) and between «waiting for author» and «Returned with feedback».

I couldn't find a clear definition searching the wiki, the mailing list (too
many unrelated results) or the app itself.

Maybe the commitfest home page/menu could link to some documentation hosted in
the wiki? Eg. in https://wiki.postgresql.org/wiki/Reviewing_a_Patch

Thoughts? Pointers I missed?

Regards,




Re: Transparent column encryption

2022-10-28 Thread Jehan-Guillaume de Rorthais
Hi,

I did a review of the documentation and usability.

# Applying patch

  The patch applied on top of f13b2088fa2 without trouble. I noticed a small
  warning during compilation:
  
colenccmds.c:134:27: warning: ‘encval’ may be used uninitialized
  
  A simple fix could be:
  
+++ b/src/backend/commands/colenccmds.c
@@ -119,2 +119,3
encval = defGetString(encvalEl);
+   *encval_p = encval;
}
@@ -132,4 +133,2
*alg_p = alg;
-   if (encval_p)
-   *encval_p = encval;
 }

# Documentation

  * In the page "create_column_encryption_key.sgml", both encryption algorithms for
a CMK are declared as the default:

+ 
+  The encryption algorithm that was used to encrypt the key material of
+  this column encryption key.  Supported algorithms are:
+  
+   
+RSAES_OAEP_SHA_1 (default)
+   
+   
+RSAES_OAEP_SHA_256 (default)
+   
+  
+ 
  
As far as I understand the code, I suppose RSAES_OAEP_SHA_1 should be the
default. 

  I believe two pieces of information should be clearly shown to users somewhere in
  chapter 5.5 instead of being buried deep in the documentation:
 
  * «COPY does not support column decryption», currently buried in pg_dump page
  * «When transparent column encryption is enabled, the client encoding must
match the server encoding», currently buried in the protocol description
page.

  * In the libpq doc of PQexecPrepared2, "paramForceColumnEncryptions" might
deserve a little more detail about the array format, like «0 means "don't
enforce" and anything else enforces encryption on this column». By the way,
maybe this array could be an array of booleans?

  * In chapter 55.2.5 (protocol-flow) it is stated: «when column encryption is
used, the plaintext is always in text format (not binary format).». Does it
mean the "resultFormat" parameter in "PQexecPrepared2" should always be 0? If
yes, is it worth keeping this argument? Moreover, this format constraint
should probably be explained in the libpq page as well.

# Protocol

  * In the ColumnEncryptionKey message, it seems the field holding the key
material length is redundant with the message length itself, as all other
fields have a known size. The key material length is the message length -
(4+4+4+2). For comparison, the AuthenticationSASLContinue message has a
variable data size but relies only on the message length without an
additional field.

  * I wonder if the encryption-related fields in ParameterDescription and
RowDescription could be optional somehow? The former might be quite large
when using a lot of parameters (like, imagine a large and ugly
"IN($1...$N)"). On the other hand, these messages are not sent at high
frequency anyway...

# libpq

  Would it be possible to have an encryption-ready PQexecParams() equivalent
  of what PQprepare/describe/exec do?

# psql

  * You already marked \d in the TODO. But having some psql commands to list the
known CEKs/CMKs might be useful as well.

  * About write queries using psql, would it be possible to use psql
variables? E.g.:

  => \set c1 myval
  => INSERT INTO mytable VALUES (:'c1') \gencr

# Manual tests

  * The lookup error message is shown twice for some reason:

  => select * from test_tce;
  no CMK lookup found for realm ""
  
  no CMK lookup found for realm ""

It might be worth adding the column name and the CMK/CEK names related to each
error message? Also, notice the useless empty line between messages.

  * When running "DROP ... IF EXISTS" on a missing CEK or CMK, the command raises
an "unrecognized object type" error:

  => DROP COLUMN MASTER KEY IF EXISTS noexists;
  ERROR:  unrecognized object type: 10
  => DROP COLUMN ENCRYPTION KEY IF EXISTS noexists;
  ERROR:  unrecognized object type: 8

  * I was wondering what "pg_typeof" should return. It currently returns
"pg_encrypted_*". If this is supposed to be transparent from the client
perspective, shouldn't it return "attrealtypid" when the field is encrypted?

  * Any reason not to support altering the CMK realm?

This patch is really interesting and would be a nice addition to the core.

Thanks!

Regards,




Re: [BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)

2022-10-03 Thread Jehan-Guillaume de Rorthais
On Fri, 30 Sep 2022 16:11:09 -0700
Zhihong Yu  wrote:

> On Fri, Sep 30, 2022 at 3:30 PM Jehan-Guillaume de Rorthais 
> wrote:
...
> 
> +* Self-Foreign keys are ignored as the index was preliminary
> created
> 
> preliminary created -> primarily created

Thank you! This is fixed and rebased on the current master branch in the
attached patches.

Regards,
>From 34ee3a3737e36d440824db3e8eb7c8f802a7276a Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Fri, 30 Sep 2022 23:42:21 +0200
Subject: [PATCH 1/3] Fix bug between constraint parenting and self-FK

Function get_relation_idx_constraint_oid() assumed an index can
only be associated with zero or one constraint for a given relation.

However, if the relation has a self-foreign key, the index could
be referenced as enforcing two different constraints on the same
relation: the PK and the FK.

This led to a bug with partitioned tables where the self-FK could
become the parent of a partition's PK.
---
 src/backend/catalog/pg_constraint.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index bb65fb1e0a..ba7a8ff965 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -985,8 +985,8 @@ get_relation_constraint_attnos(Oid relid, const char *conname,
 }
 
 /*
- * Return the OID of the constraint associated with the given index in the
- * given relation; or InvalidOid if no such index is catalogued.
+ * Return the OID of the original constraint enforced by the given index
+ * in the given relation; or InvalidOid if no such index is catalogued.
  */
 Oid
 get_relation_idx_constraint_oid(Oid relationId, Oid indexId)
@@ -1011,6 +1011,18 @@ get_relation_idx_constraint_oid(Oid relationId, Oid indexId)
 		Form_pg_constraint constrForm;
 
 		constrForm = (Form_pg_constraint) GETSTRUCT(tuple);
+
+		/*
+		 * Self-Foreign keys are ignored as the index was primarily created
+		 * to enforce a PK or unique constraint in the first place.  This means
+		 * only a primary key, unique constraint or an exclusion one can be
+		 * returned.
+		 */
+		if (constrForm->contype != CONSTRAINT_PRIMARY
+			&& constrForm->contype != CONSTRAINT_UNIQUE
+			&& constrForm->contype != CONSTRAINT_EXCLUSION)
+			continue;
+
 		if (constrForm->conindid == indexId)
 		{
 			constraintId = constrForm->oid;
-- 
2.37.3

>From 6c8d9595dc1b2b62e3cc5ce3bc1c970b6eedcbc2 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Fri, 30 Sep 2022 23:54:04 +0200
Subject: [PATCH 2/3] Fix bug where a self-FK was cloned twice on partitions

A table with a self-foreign key appears on both the referenced
and referencing sides of the constraint.

Because of this, when creating or attaching a new partition to
a table, the self-FK was cloned twice, by the CloneFkReferenced()
and CloneFkReferencing() functions.
---
 src/backend/commands/tablecmds.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7d8a75d23c..d19d917571 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -9968,6 +9968,9 @@ CloneForeignKeyConstraints(List **wqueue, Relation parentRel,
  * clone those constraints to the given partition.  This is to be called
  * when the partition is being created or attached.
  *
+ * Note that this ignores self-referencing FKs. Those are handled in
+ * CloneFkReferencing().
+ *
  * This recurses to partitions, if the relation being attached is partitioned.
  * Recursion is done by calling addFkRecurseReferenced.
  */
@@ -10047,10 +10050,14 @@ CloneFkReferenced(Relation parentRel, Relation partitionRel)
 		constrForm = (Form_pg_constraint) GETSTRUCT(tuple);
 
 		/*
-		 * As explained above: don't try to clone a constraint for which we're
-		 * going to clone the parent.
+		 * As explained above and in the function header:
+		 * - don't try to clone a constraint for which we're going to clone
+		 *   the parent.
+		 * - don't clone a self-referenced constraint here as it is handled
+		 *   on the CloneFkReferencing() side
 		 */
-		if (list_member_oid(clone, constrForm->conparentid))
+		if (list_member_oid(clone, constrForm->conparentid) ||
+			constrForm->conrelid == RelationGetRelid(parentRel))
 		{
 			ReleaseSysCache(tuple);
 			continue;
-- 
2.37.3

>From a08ed953e3cd559ea1fec78301cb72e611f9387e Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Sat, 1 Oct 2022 00:00:28 +0200
Subject: [PATCH 3/3] Add regression tests about self-FK with partitions

---
 src/test/regress/expected/constraints.out | 37 +++
 src/test/regress/sql/constraints.sql  | 31 +++
 2 files changed, 68 insertions(+)

diff --git a/src/test/regress/expected/constraints.out b/src/test/regress/expected/constraints.out
index e6f6

Re: [BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)

2022-09-30 Thread Jehan-Guillaume de Rorthais
Hi,

Please find attached a small series of patches:

  0001 fixes the constraint parenting bug. Not much to say: it's basically your
  patch we discussed, with some more comments and a check that contype equals
  either primary, unique or exclusion.

  0002 fixes the self-FK being cloned twice on partitions.

  0003 adds a regression test validating both fixes.

I should confess that even with these fixes, I'm still wondering about the
sanity of this code, as we could still end up with a PK on a partition being
parented to a simple unique constraint of the table, on a column that is not
even NOT NULL:

  DROP TABLE IF EXISTS parted_self_fk, part_with_pk;

  CREATE TABLE parted_self_fk (
id bigint,
id_abc bigint,
FOREIGN KEY (id_abc) REFERENCES parted_self_fk(id),
UNIQUE (id)
  )
  PARTITION BY RANGE (id);

  CREATE TABLE part_with_pk (
id bigint PRIMARY KEY,
id_abc bigint,
CHECK ((id >= 0 AND id < 10))
  );

  ALTER TABLE parted_self_fk ATTACH
PARTITION part_with_pk FOR VALUES FROM (0) TO (10);

  SELECT cr.relname, co.conname, co.contype, p.conname AS conparentrelname
  FROM pg_catalog.pg_constraint co
  JOIN pg_catalog.pg_class cr ON cr.oid = co.conrelid
  LEFT JOIN pg_catalog.pg_constraint p ON p.oid = co.conparentid
  WHERE cr.relname IN ('parted_self_fk', 'part_with_pk')
AND co.contype IN ('u', 'p');
  
  DROP TABLE parted_self_fk;

  DROP TABLE
  CREATE TABLE
  CREATE TABLE
  ALTER TABLE
  relname |conname| contype |   conparentrelname
  +---+-+---
   parted_self_fk | parted_self_fk_id_key | u   | 
   part_with_pk   | part_with_pk_pkey | p   | parted_self_fk_id_key
  (2 rows)

Nothing forbids the partition from having stricter constraints than the parent
table, but it feels weird, so it might be worth noting it here.

I wonder if an AttachPartitionEnsureConstraints() should exist and take care of
comparing/cloning constraints before calling AttachPartitionEnsureIndexes(),
which would then handle missing indexes without paying attention to related
constraints?

Regards,

On Wed, 24 Aug 2022 12:49:13 +0200
Alvaro Herrera  wrote:

> On 2022-Aug-24, Jehan-Guillaume de Rorthais wrote:
> 
> > I was naively wondering about such a patch, but was worrying about potential
> > side effects on ATExecAttachPartitionIdx(), DetachPartitionFinalize() and
> > DefineIndex() where I didn't had a single glance. Did you had a look?  
> 
> No.  But AFAIR all the code there is supposed to worry about unique
> constraints and PK only, not FKs.  So if something changes, then most 
> likely it was wrong to begin with.
> 
> > I did a quick ATTACH + DETACH test, and it seems DETACH partly fails with
> > its housecleaning:  
> 
> Ugh.  More fixes required, then.
> 
> > Looking for few minutes in ATExecDetachPartitionFinalize(), it seems it only
> > support removing the parental link on FK, not to clean the FKs added during
> > the ATTACH DDL anyway. That explains the FK child1->parent left behind. But
> > in fact, this let me wonder if this part of the code ever considered
> > implication of self-FK during the ATTACH and DETACH process?  
> 
> No, or at least I don't remember thinking about self-referencing FKs.
> If there are no tests for it, then that's likely what happened.
> 
> > Why in the first place TWO FK are created during the ATTACH DDL?  
> 
> That's probably a bug too.
> 

>From 146f40285ee5f773f68205f1cbafe1cbec46c5c4 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Fri, 30 Sep 2022 23:42:21 +0200
Subject: [PATCH 1/3] Fix bug between constraint parenting and self-FK

Function get_relation_idx_constraint_oid() assumed an index can
only be associated with zero or one constraint for a given relation.

However, if the relation has a self-foreign key, the index could
be referenced as enforcing two different constraints on the same
relation: the PK and the FK.

This led to a bug with partitioned tables where the self-FK could
become the parent of a partition's PK.
---
 src/backend/catalog/pg_constraint.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index bb65fb1e0a..be26dbe81d 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -985,8 +985,8 @@ get_relation_constraint_attnos(Oid relid, const char *conname,
 }
 
 /*
- * Return the OID of the constraint associated with the given index in the
- * given relation; or InvalidOid if no such index is catalogued.
+ * Return the OID of the original constraint enforced by the given index
+ * in the given relation; or InvalidOid if no such index is catalogued.
  */
 Oid
 get_relation_idx_constraint_oid(Oid relationId, Oid indexId)
@@ -1011,6 +1011,18 @@ get_relation_idx_con

Re: [BUG] wrong FK constraint name when colliding name on ATTACH

2022-09-08 Thread Jehan-Guillaume de Rorthais
On Thu, 8 Sep 2022 13:25:15 +0200
Alvaro Herrera  wrote:

> On 2022-Sep-08, Jehan-Guillaume de Rorthais wrote:
> 
> > Hi there,
> > 
> > I believe this very small bug and its fix are really trivial and could be
> > push out of the way quite quickly. It's just about a bad constraint name
> > fixed by moving one assignation after the next one. This could easily be
> > fixed for next round of releases.
> > 
> > Well, I hope I'm not wrong :)  
> 
> I think you're right, so pushed, and backpatched to 12.  I added the
> test case to regression also.

Great, thank you for the additional work on the regression test and the commit!

> For 11, I adjusted the test case so that it didn't depend on an FK
> pointing to a partitioned table (which is not supported there); it turns
> out that the old code is not smart enough to get into the problem in the
> first place.  [...]
> It seems fair to say that this case, with pg11, is unsupported and
> people should upgrade if they want better behavior.

That works for me.

Thanks!




Re: [BUG] wrong FK constraint name when colliding name on ATTACH

2022-09-08 Thread Jehan-Guillaume de Rorthais
Hi there,

I believe this very small bug and its fix are really trivial and could be pushed
out of the way quite quickly. It's just about a bad constraint name, fixed by
moving one assignment after the other. This could easily be fixed for the next
round of releases.

Well, I hope I'm not wrong :)

Regards,

On Thu, 1 Sep 2022 18:41:56 +0200
Jehan-Guillaume de Rorthais  wrote:

> While studying and hacking on the parenting constraint issue, I found an
> incoherent piece of code leading to badly chosen fk name. If a constraint
> name collision is detected, while choosing a new name for the constraint,
> the code uses fkconstraint->fk_attrs which is not yet populated:
> 
>   /* No dice.  Set up to create our own constraint */
>   fkconstraint = makeNode(Constraint);
>   if (ConstraintNameIsUsed(CONSTRAINT_RELATION,
>RelationGetRelid(partRel),
>NameStr(constrForm->conname)))
>   fkconstraint->conname =
>   ChooseConstraintName(RelationGetRelationName(partRel),
>ChooseForeignKeyConstraintNameAddition(
>   fkconstraint->fk_attrs),  // <= WOO000OPS
>"fkey",
>RelationGetNamespace(partRel), NIL);
>   else
>   fkconstraint->conname = pstrdup(NameStr(constrForm->conname));
>   fkconstraint->fk_upd_action = constrForm->confupdtype;
>   fkconstraint->fk_del_action = constrForm->confdeltype;
>   fkconstraint->deferrable = constrForm->condeferrable;
>   fkconstraint->initdeferred = constrForm->condeferred;
>   fkconstraint->fk_matchtype = constrForm->confmatchtype;
>   for (int i = 0; i < numfks; i++)
>   {
>   Form_pg_attribute att;
>   
>   att = TupleDescAttr(RelationGetDescr(partRel),
>   mapped_conkey[i] - 1);
>   fkconstraint->fk_attrs = lappend(fkconstraint->fk_attrs, // <=
> POPULATING makeString(NameStr(att->attname)));
>   }
> 
> The following SQL script showcase the bad constraint name:
> 
>   DROP TABLE IF EXISTS parent, child1;
>   
>   CREATE TABLE parent (
>   id bigint NOT NULL default 1,
>   no_part smallint NOT NULL,
>   id_abc bigint,
>   CONSTRAINT dummy_constr FOREIGN KEY (id_abc, no_part)
>   REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE
> RESTRICT, PRIMARY KEY (id, no_part)
>   )
>   PARTITION BY LIST (no_part);
>   
>   CREATE TABLE child1 (
>   id bigint NOT NULL default 1,
>   no_part smallint NOT NULL,
>   id_abc bigint,
>   PRIMARY KEY (id, no_part),
>   CONSTRAINT dummy_constr CHECK ((no_part = 1))
>   );
> 
>   ALTER TABLE parent ATTACH PARTITION child1 FOR VALUES IN ('1');
> 
>   SELECT conname
>   FROM pg_constraint
>   WHERE conrelid = 'child1'::regclass
> AND contype = 'f';
> 
>   DROP TABLE
>   CREATE TABLE
>   CREATE TABLE
>   ALTER TABLE
>   
>  conname
>   --
>child1__fkey
>   (1 row)
> 
> The resulting constraint name "child1__fkey" is missing the attributes name
> the original code wanted to add. The expected name is
> "child1_id_abc_no_part_fkey".
> 
> Find in attachment a simple fix, moving the name assignation after the
> FK attributes are populated.
> 
> Regards,





[BUG] wrong FK constraint name when colliding name on ATTACH

2022-09-01 Thread Jehan-Guillaume de Rorthais
Hi,

While studying and hacking on the parenting constraint issue, I found an
incoherent piece of code leading to a badly chosen FK name. If a constraint
name collision is detected while choosing a new name for the constraint,
the code uses fkconstraint->fk_attrs, which is not yet populated:

  /* No dice.  Set up to create our own constraint */
  fkconstraint = makeNode(Constraint);
  if (ConstraintNameIsUsed(CONSTRAINT_RELATION,
   RelationGetRelid(partRel),
   NameStr(constrForm->conname)))
  fkconstraint->conname =
  ChooseConstraintName(RelationGetRelationName(partRel),
   ChooseForeignKeyConstraintNameAddition(
  fkconstraint->fk_attrs),  // <= WOO000OPS
   "fkey",
   RelationGetNamespace(partRel), NIL);
  else
  fkconstraint->conname = pstrdup(NameStr(constrForm->conname));
  fkconstraint->fk_upd_action = constrForm->confupdtype;
  fkconstraint->fk_del_action = constrForm->confdeltype;
  fkconstraint->deferrable = constrForm->condeferrable;
  fkconstraint->initdeferred = constrForm->condeferred;
  fkconstraint->fk_matchtype = constrForm->confmatchtype;
  for (int i = 0; i < numfks; i++)
  {
  Form_pg_attribute att;
  
  att = TupleDescAttr(RelationGetDescr(partRel),
  mapped_conkey[i] - 1);
  fkconstraint->fk_attrs = lappend(fkconstraint->fk_attrs, // <= POPULATING
   makeString(NameStr(att->attname)));
  }

The following SQL script showcases the bad constraint name:

  DROP TABLE IF EXISTS parent, child1;
  
  CREATE TABLE parent (
  id bigint NOT NULL default 1,
  no_part smallint NOT NULL,
  id_abc bigint,
  CONSTRAINT dummy_constr FOREIGN KEY (id_abc, no_part)
  REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT,
  PRIMARY KEY (id, no_part)
  )
  PARTITION BY LIST (no_part);
  
  CREATE TABLE child1 (
  id bigint NOT NULL default 1,
  no_part smallint NOT NULL,
  id_abc bigint,
  PRIMARY KEY (id, no_part),
  CONSTRAINT dummy_constr CHECK ((no_part = 1))
  );

  ALTER TABLE parent ATTACH PARTITION child1 FOR VALUES IN ('1');

  SELECT conname
  FROM pg_constraint
  WHERE conrelid = 'child1'::regclass
AND contype = 'f';

  DROP TABLE
  CREATE TABLE
  CREATE TABLE
  ALTER TABLE
  
 conname
  --
   child1__fkey
  (1 row)

The resulting constraint name "child1__fkey" is missing the attribute names the
original code wanted to add. The expected name is "child1_id_abc_no_part_fkey".

Please find attached a simple fix, moving the name assignment after the
FK attributes are populated.

Regards,
>From f1eeacb1eb3face38d76325666aff57019ef84c9 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Thu, 1 Sep 2022 17:41:55 +0200
Subject: [PATCH] Fix FK name when colliding during partition attachment

During ALTER TABLE ATTACH PARTITION, if the name of a parent's
foreign key constraint is already used on the partition, the code
tries to choose another one before the FK attributes list has been
populated. The resulting constraint name was a strange
"<relname>__fkey" instead of "<relname>_<attrs>_fkey".
---
 src/backend/commands/tablecmds.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index dacc989d85..53b0f3a9c1 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10304,16 +10304,6 @@ CloneFkReferencing(List **wqueue, Relation parentRel, Relation partRel)
 
 		/* No dice.  Set up to create our own constraint */
 		fkconstraint = makeNode(Constraint);
-		if (ConstraintNameIsUsed(CONSTRAINT_RELATION,
- RelationGetRelid(partRel),
- NameStr(constrForm->conname)))
-			fkconstraint->conname =
-ChooseConstraintName(RelationGetRelationName(partRel),
-	 ChooseForeignKeyConstraintNameAddition(fkconstraint->fk_attrs),
-	 "fkey",
-	 RelationGetNamespace(partRel), NIL);
-		else
-			fkconstraint->conname = pstrdup(NameStr(constrForm->conname));
 		fkconstraint->fk_upd_action = constrForm->confupdtype;
 		fkconstraint->fk_del_action = constrForm->confdeltype;
 		fkconstraint->deferrable = constrForm->condeferrable;
@@ -10328,6 +10318,16 @@ CloneFkReferencing(List **wqueue, Relation parentRel, Relation partRel)
 			fkconstraint->fk_attrs = lappend(fkconstraint->fk_attrs,
 			 makeString(NameStr(att->attname)));
 		}
+		if (ConstraintNameIsUsed(CONSTRAINT_RELATION,
+ RelationGetRelid(partRel),
+ NameStr(constrForm->conname)))
+			fkconstraint->conname =
+ChooseConstraintName(RelationGetRelationName(partRel),
+			

Re: [BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)

2022-08-24 Thread Jehan-Guillaume de Rorthais
On Tue, 23 Aug 2022 18:30:06 +0200
Alvaro Herrera  wrote:

> On 2022-Aug-23, Jehan-Guillaume de Rorthais wrote:
> 
> Hi,
> 
> [...]
> 
> > However, it seems get_relation_idx_constraint_oid(), introduced in
> > eb7ed3f3063, assume there could be only ONE constraint depending to an
> > index. But in fact, multiple constraints can rely on the same index, eg.:
> > the PK and a self referencing FK. In consequence, when looking for a
> > constraint depending on an index for the given relation, either the FK or a
> > PK can appears first depending on various conditions. It is then possible
> > to trick it make a FK constraint a parent of a PK...  
> 
> Hmm, wow, that sounds extremely stupid.  I think a sufficient fix might
> be to have get_relation_idx_constraint_oid ignore any constraints that
> are not unique or primary keys.  I tried your scenario with the attached
> and it seems to work correctly.  Can you confirm?  (I only ran the
> pg_regress tests, not anything else for now.)
>
> If this is OK, we should make this API quirkiness very explicit in the
> comments, so the patch needs to be a few lines larger in order to be
> committable.  Also, perhaps the check should be that contype equals
> either primary or unique, rather than it doesn't equal foreign.

I was naively wondering about such a patch, but was worrying about potential
side effects on ATExecAttachPartitionIdx(), DetachPartitionFinalize() and
DefineIndex(), which I haven't even glanced at. Did you have a look?

I did a quick ATTACH + DETACH test, and it seems DETACH partly fails with its
housecleaning:

  DROP TABLE IF EXISTS parent, child1;
  
  CREATE TABLE parent (
  id bigint NOT NULL default 1,
  no_part smallint NOT NULL,
  id_abc bigint,
  FOREIGN KEY (id_abc, no_part) REFERENCES parent(id, no_part)
  ON UPDATE RESTRICT ON DELETE RESTRICT,
  PRIMARY KEY (id, no_part)
  )
  PARTITION BY LIST (no_part);
  
  CREATE TABLE child1 (
  id bigint NOT NULL default 1,
  no_part smallint NOT NULL,
  id_abc bigint,
  PRIMARY KEY (id, no_part),
  CONSTRAINT child1 CHECK ((no_part = 1))
  );

  \C 'Before ATTACH'
  SELECT oid, conname, conparentid, conrelid, confrelid
  FROM pg_constraint
  WHERE conrelid in ('parent'::regclass, 'child1'::regclass)
  ORDER BY 1;

  ALTER TABLE parent ATTACH PARTITION child1 FOR VALUES IN ('1');

  \C 'After ATTACH'
  SELECT oid, conname, conparentid, conrelid, confrelid
  FROM pg_constraint
  WHERE conrelid in ('parent'::regclass, 'child1'::regclass)
  ORDER BY 1;

  ALTER TABLE parent DETACH PARTITION child1;

  \C 'After DETACH'
  SELECT oid, conname, conparentid, conrelid, confrelid
  FROM pg_constraint
  WHERE conrelid in ('parent'::regclass, 'child1'::regclass)
  ORDER BY 1;


Before ATTACH
oid  |  conname   | conparentid | conrelid | confrelid 
  ---++-+--+---
   24711 | parent_pkey|   0 |24706 | 0
   24712 | parent_id_abc_no_part_fkey |   0 |24706 | 24706
   24721 | child1 |   0 |24717 | 0
   24723 | child1_pkey|   0 |24717 | 0
  (4 rows)
  
 After ATTACH
oid  |   conname   | conparentid | conrelid | confrelid 
  ---+-+-+--+---
   24711 | parent_pkey |   0 |24706 | 0
   24712 | parent_id_abc_no_part_fkey  |   0 |24706 | 24706
   24721 | child1  |   0 |24717 | 0
   24723 | child1_pkey |   24711 |24717 | 0
   24724 | parent_id_abc_no_part_fkey1 |   24712 |24706 | 24717
   24727 | parent_id_abc_no_part_fkey  |   24712 |24717 | 24706
  (6 rows)
  
After DETACH
oid  |  conname   | conparentid | conrelid | confrelid 
  ---++-+--+---
   24711 | parent_pkey|   0 |24706 | 0
   24712 | parent_id_abc_no_part_fkey |   0 |24706 | 24706
   24721 | child1 |   0 |24717 | 0
   24723 | child1_pkey|   0 |24717 | 0
   24727 | parent_id_abc_no_part_fkey |   0 |24717 | 24706
  (5 rows)

Looking for a few minutes at ATExecDetachPartitionFinalize(), it seems it only
supports removing the parental link on the FK, not cleaning up the FKs added
during the ATTACH DDL anyway. That explains the FK child1->parent left behind.
But in fact, this makes me wonder whether this part of the code ever considered
the implications of self-FKs during the ATTACH and DETACH process. Why are TWO
FKs created during the ATTACH DDL in the first place?

Regards,




[BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)

2022-08-23 Thread Jehan-Guillaume de Rorthais
Hi all,

I've been able to work on this issue and isolate where in the code the oddity
lies.

During ATExecAttachPartition(), AttachPartitionEnsureIndexes() looks for existing
required indexes on the partition to attach. It creates missing indexes, or sets
the parent's index when a matching one exists on the partition. Good.

When a matching index is found, if the parent index enforces a constraint, the
function looks for a similar constraint in the partition-to-be, and sets the
constraint's parent as well:

constraintOid = get_relation_idx_constraint_oid(RelationGetRelid(rel),
idx);

[...]

/*
 * If this index is being created in the parent because of a
 * constraint, then the child needs to have a constraint also,
 * so look for one.  If there is no such constraint, this
 * index is no good, so keep looking.
 */
if (OidIsValid(constraintOid))
{
cldConstrOid = get_relation_idx_constraint_oid(
RelationGetRelid(attachrel),
cldIdxId);
/* no dice */
if (!OidIsValid(cldConstrOid))
continue;
 }
 /* bingo. */
 IndexSetParentIndex(attachrelIdxRels[i], idx);
 if (OidIsValid(constraintOid))
ConstraintSetParentConstraint(cldConstrOid, constraintOid,
  RelationGetRelid(attachrel));

However, it seems get_relation_idx_constraint_oid(), introduced in eb7ed3f3063,
assumes there can be only ONE constraint depending on an index. But in fact,
multiple constraints can rely on the same index, e.g. the PK and a
self-referencing FK. In consequence, when looking for a constraint depending on
an index of the given relation, either the FK or the PK can appear first
depending on various conditions. It is then possible to trick it into making an
FK constraint the parent of a PK...

In the following little scenario, when looking at the constraints linked to
the PK's unique index, using the same catalog index that
get_relation_idx_constraint_oid() uses, it is the self-FK that is actually
returned first by get_relation_idx_constraint_oid(), NOT the PK:

  postgres=# DROP TABLE IF EXISTS parent, child1;

CREATE TABLE parent (
id bigint NOT NULL default 1,
no_part smallint NOT NULL,
id_abc bigint,
FOREIGN KEY (id_abc, no_part) REFERENCES parent(id, no_part)
ON UPDATE RESTRICT ON DELETE RESTRICT,
PRIMARY KEY (id, no_part)
)
PARTITION BY LIST (no_part);

CREATE TABLE child1 (
id bigint NOT NULL default 1,
no_part smallint NOT NULL,
id_abc bigint,
PRIMARY KEY (id, no_part),
CONSTRAINT child1 CHECK ((no_part = 1))
);
  
-- force an indexscan as get_relation_idx_constraint_oid() uses the unique
-- index on (conrelid, contypid, conname) to scan pg_constraint
set enable_seqscan TO off;
set enable_bitmapscan TO off;
  
SELECT conname
FROM pg_constraint
WHERE conrelid = 'parent'::regclass   <=== parent
  AND conindid = 'parent_pkey'::regclass; <=== PK index
  
  DROP TABLE
  CREATE TABLE
  CREATE TABLE
  SET
  SET
conname   
  
   parent_id_abc_no_part_fkey < WOOPS!
   parent_pkey
  (2 rows)

In consequence, when attaching the partition, the PK of child1 is not marked as a
partition of the parent's PK, which is wrong. WORSE, the PK of child1 is
actually unexpectedly marked as a partition of the parent's **self-FK**:

  postgres=# ALTER TABLE ONLY parent ATTACH PARTITION child1 
 FOR VALUES IN ('1');

SELECT oid, conname, conparentid, conrelid, confrelid
FROM pg_constraint
WHERE conrelid in ('parent'::regclass, 'child1'::regclass)   
ORDER BY 1;
  
  ALTER TABLE
oid  |   conname   | conparentid | conrelid | confrelid 
  ---+-+-+--+---
   16700 | parent_pkey |   0 |16695 | 0
   16701 | parent_id_abc_no_part_fkey  |   0 |16695 | 16695
   16706 | child1  |   0 |16702 | 0
   16708 | **child1_pkey** |   **16701** |16702 | 0
   16709 | parent_id_abc_no_part_fkey1 |   16701 |16695 | 16702
   16712 | parent_id_abc_no_part_fkey  |   16701 |16702 | 16695
  (6 rows)

The expected result should probably be something like:

oid  |   conname   | conparentid | conrelid | confrelid 
  ---+-+-+--+---
   16700 | parent_pkey |   0 |16695 | 0
   ...
   16708 | child1_pkey |   16700 |16702 | 0


I suppose this bug might 

Re: Fix proposal for comparaison bugs in PostgreSQL::Version

2022-07-06 Thread Jehan-Guillaume de Rorthais
On Tue, 5 Jul 2022 09:59:42 -0400
Andrew Dunstan  wrote:

> On 2022-07-03 Su 16:12, Jehan-Guillaume de Rorthais wrote:
> > On Sun, 3 Jul 2022 10:40:21 -0400
> > Andrew Dunstan  wrote:
> >  
> >> On 2022-06-29 We 05:09, Jehan-Guillaume de Rorthais wrote:  
> >>> On Tue, 28 Jun 2022 18:17:40 -0400
> >>> Andrew Dunstan  wrote:
> >>>
> >>>> On 2022-06-28 Tu 16:53, Jehan-Guillaume de Rorthais wrote:
> >>>>> ...
> >>>>> A better fix would be to store the version internally as version_num
> >>>>> that are trivial to compute and compare. Please, find in attachment an
> >>>>> implementation of this.
> >>>>>
> >>>>> The patch is a bit bigger because it improved the devel version to
> >>>>> support rc/beta/alpha comparison like 14rc2 > 14rc1.
> >>>>>
> >>>>> Moreover, it adds a bunch of TAP tests to check various use cases.  
> >>>> Nice catch, but this looks like massive overkill. I think we can very
> >>>> simply fix the test in just a few lines of code, instead of a 190 line
> >>>> fix and a 130 line TAP test.
> >>> I explained why the patch was a little bit larger than required: it fixes
> >>> the bugs and do a little bit more. The _version_cmp sub is shorter and
> >>> easier to understand, I use multi-line code where I could probably fold
> >>> them in a one-liner, added some comments... Anyway, I don't feel the
> >>> number of line changed is "massive". But I can probably remove some code
> >>> and shrink some other if it is really important...
> >>>
> >>> Moreover, to be honest, I don't mind the number of additional lines of TAP
> >>> tests. Especially since it runs really, really fast and doesn't hurt
> >>> day-to-day devs as it is independent from other TAP tests anyway. It could
> >>> be 1k, if it runs fast, is meaningful and helps avoiding futur
> >>> regressions, I would welcome the addition.
> >>
> >> I don't see the point of having a TAP test at all. We have TAP tests for
> >> testing the substantive products we test, not for the test suite
> >> infrastructure. Otherwise, where will we stop? Shall we have tests for
> >> the things that test the test suite?  
> > Tons of perl module have regression tests. When questioning where testing
> > should stop, it seems the Test::More module itself is not the last frontier:
> > https://github.com/Test-More/test-more/tree/master/t
> >
> > Moreover, the PostgreSQL::Version is not a TAP test module, but a module to
> > deal with PostgreSQL versions and compare them.
> >
> > Testing makes development faster as well when it comes to test the code.
> > Instead of testing vaguely manually, you can test a whole bunch of
> > situations and add accumulate some more when you think about a new one or
> > when a bug is reported. Having TAP test helps to make sure the code work as
> > expected.
> >
> > It helped me when creating my patch. With all due respect, I just don't
> > understand your arguments against them. The number of lines or questioning
> > when testing should stop doesn't hold much.  
> 
> 
> There is not a single TAP test in our source code that is aimed at
> testing our test infrastructure as opposed to testing what we are
> actually in the business of building, and I'm not about to add one.

Whatever, it helped me during the dev process of fixing this bug. Remove
them if you are uncomfortable with them.

> Every added test consumes buildfarm cycles and space on the buildfarm
> server for the report, be it ever so small.

They were not supposed to enter the buildfarm cycles. As I wrote earlier, they
do not interfere with day-to-day dev activity.

> Every added test needs maintenance, be it ever so small. There's no such
> thing as a free test (apologies to Heinlein and others).

This is the first argument I can understand.

> > ...
> > But anyway, this is not the point. Using an array to compare versions where
> > we can use version_num seems like useless and buggy convolutions to me.
> 
> I think we'll just have to agree to disagree about it.

Noted.

Cheers,




Re: Fix proposal for comparaison bugs in PostgreSQL::Version

2022-07-03 Thread Jehan-Guillaume de Rorthais
On Sun, 3 Jul 2022 10:40:21 -0400
Andrew Dunstan  wrote:

> On 2022-06-29 We 05:09, Jehan-Guillaume de Rorthais wrote:
> > On Tue, 28 Jun 2022 18:17:40 -0400
> > Andrew Dunstan  wrote:
> >  
> >> On 2022-06-28 Tu 16:53, Jehan-Guillaume de Rorthais wrote:  
> >>> ...
> >>> A better fix would be to store the version internally as version_num that
> >>> are trivial to compute and compare. Please, find in attachment an
> >>> implementation of this.
> >>>
> >>> The patch is a bit bigger because it improved the devel version to support
> >>> rc/beta/alpha comparison like 14rc2 > 14rc1.
> >>>
> >>> Moreover, it adds a bunch of TAP tests to check various use cases.
> >>
> >> Nice catch, but this looks like massive overkill. I think we can very
> >> simply fix the test in just a few lines of code, instead of a 190 line
> >> fix and a 130 line TAP test.  
> > I explained why the patch was a little bit larger than required: it fixes
> > the bugs and do a little bit more. The _version_cmp sub is shorter and
> > easier to understand, I use multi-line code where I could probably fold
> > them in a one-liner, added some comments... Anyway, I don't feel the number
> > of line changed is "massive". But I can probably remove some code and
> > shrink some other if it is really important...
> >
> > Moreover, to be honest, I don't mind the number of additional lines of TAP
> > tests. Especially since it runs really, really fast and doesn't hurt
> > day-to-day devs as it is independent from other TAP tests anyway. It could
> > be 1k, if it runs fast, is meaningful and helps avoiding futur regressions,
> > I would welcome the addition.  
> 
> 
> I don't see the point of having a TAP test at all. We have TAP tests for
> testing the substantive products we test, not for the test suite
> infrastructure. Otherwise, where will we stop? Shall we have tests for
> the things that test the test suite?

Tons of Perl modules have regression tests. As for questioning where testing
should stop, it seems the Test::More module itself is not the last frontier:
https://github.com/Test-More/test-more/tree/master/t

Moreover, PostgreSQL::Version is not a TAP test module, but a module to
deal with PostgreSQL versions and compare them.

Testing also makes development faster when it comes to testing the code.
Instead of testing vaguely by hand, you can test a whole bunch of situations
and accumulate some more when you think about a new one or when a bug is
reported. Having TAP tests helps to make sure the code works as expected.

It helped me when creating my patch. With all due respect, I just don't
understand your arguments against them. The number-of-lines argument, or
questioning where testing should stop, doesn't hold much weight.

> > If we really want to save some bytes, I have a two lines worth of code fix
> > that looks more readable to me than fixing _version_cmp:
> >
> > +++ b/src/test/perl/PostgreSQL/Version.pm
> > @@ -92,9 +92,13 @@ sub new
> > # Split into an array
> > my @numbers = split(/\./, $arg);
> >  
> > +   # make sure all digit of the array-represented version are set so
> > we can
> > +   # keep _version_cmp code as a "simple" digit-to-digit comparison
> > loop
> > +   $numbers[$_] += 0 for 0..3;
> > +
> > # Treat development versions as having a minor/micro version one
> > less than # the first released version of that branch.
> > -   push @numbers, -1 if ($devel);
> > +   $numbers[3] = -1 if $devel;
> >  
> > $devel ||= "";  
> 
> I don't see why this is any more readable.

The _version_cmp sub is much more readable.

But anyway, this is not the point. Using an array to compare versions where we
can use a version_num seems like a useless and buggy convolution to me.

Regards,




Re: Fix proposal for comparaison bugs in PostgreSQL::Version

2022-06-29 Thread Jehan-Guillaume de Rorthais
On Tue, 28 Jun 2022 18:17:40 -0400
Andrew Dunstan  wrote:

> On 2022-06-28 Tu 16:53, Jehan-Guillaume de Rorthais wrote:
> > ...
> > A better fix would be to store the version internally as version_num that
> > are trivial to compute and compare. Please, find in attachment an
> > implementation of this.
> >
> > The patch is a bit bigger because it improved the devel version to support
> > rc/beta/alpha comparison like 14rc2 > 14rc1.
> >
> > Moreover, it adds a bunch of TAP tests to check various use cases.  
> 
> 
> Nice catch, but this looks like massive overkill. I think we can very
> simply fix the test in just a few lines of code, instead of a 190 line
> fix and a 130 line TAP test.

I explained why the patch was a little bit larger than required: it fixes the
bugs and does a little bit more. The _version_cmp sub is shorter and easier to
understand, I used multi-line code where I could probably fold it into a
one-liner, added some comments... Anyway, I don't feel the number of lines
changed is "massive". But I can probably remove some code and shrink some more
if it is really important...

Moreover, to be honest, I don't mind the number of additional lines of TAP
tests. Especially since they run really, really fast and don't hurt day-to-day
devs, as they are independent from other TAP tests anyway. Even at 1k lines, if
it runs fast, is meaningful and helps avoid future regressions, I would welcome
the addition.

If we really want to save some bytes, I have a two-line fix that looks more
readable to me than fixing _version_cmp:

+++ b/src/test/perl/PostgreSQL/Version.pm
@@ -92,9 +92,13 @@ sub new
# Split into an array
my @numbers = split(/\./, $arg);
 
+   # make sure all digits of the array-represented version are set so we can
+   # keep _version_cmp code as a "simple" digit-to-digit comparison loop
+   $numbers[$_] += 0 for 0..3;
+
# Treat development versions as having a minor/micro version one less 
than
# the first released version of that branch.
-   push @numbers, -1 if ($devel);
+   $numbers[3] = -1 if $devel;
 
$devel ||= "";
 
But again, in my humble opinion, the internal version array representation is
more of a burden that we should replace with the version_num...

> It was never intended to be able to compare markers like rc1 vs rc2, and
> I don't see any need for it. If you can show me a sane use case I'll
> have another look, but right now it seems quite unnecessary.

I don't have a practical use case right now, but I thought the module
would be more complete with these few more lines of code. Also, keep in
mind these TAP modules might help external projects, not just core.

In fact, I wonder what your original use case was for supporting
devel/alpha/beta/rc versions, especially since it was actually not working?
Should we just get rid of this altogether and wait for an actual use case?

Cheers,




Fix proposal for comparaison bugs in PostgreSQL::Version

2022-06-28 Thread Jehan-Guillaume de Rorthais
Hi,

I found a comparison bug when using the PostgreSQL::Version module. See:

  $ perl -I. -MPostgreSQL::Version -le '
my $v = PostgreSQL::Version->new("9.6");
  
print "not 9.6 > 9.0" unless $v >  9.0;
print "not 9.6 < 9.0" unless $v <  9.0;
print "9.6 <= 9.0"if $v <= 9.0;
print "9.6 >= 9.0"if $v >= 9.0;'
  not 9.6 > 9.0
  not 9.6 < 9.0
  9.6 <= 9.0
  9.6 >= 9.0

When using < or >, 9.6 is neither greater nor less than 9.0.
When using <= or >=, 9.6 is both greater than and less than 9.0.
The bug does not show up if you compare with "9.0" instead of 9.0.
The bug is also triggered with devel versions, e.g. 14beta1 <=> 14.

The bug appears when the two objects have a different number of digits in the
internal array representation:

  $ perl -I. -MPostgreSQL::Version -MData::Dumper -le '
print Dumper(PostgreSQL::Version->new("9.0")->{num});
print Dumper(PostgreSQL::Version->new(9.0)->{num});
print Dumper(PostgreSQL::Version->new(14)->{num});
print Dumper(PostgreSQL::Version->new("14beta1")->{num});'
  $VAR1 = [ '9', '0' ];
  $VAR1 = [ '9' ];
  $VAR1 = [ '14' ];
  $VAR1 = [ '14', -1 ];

Because of this, the following loop in "_version_cmp" is wrong, because it
compares two versions whose 'num' arrays have different sizes:
for (my $idx = 0;; $idx++)
{
return 0 unless (defined $an->[$idx] && defined $bn->[$idx]);
return $an->[$idx] <=> $bn->[$idx]
  if ($an->[$idx] <=> $bn->[$idx]);
}


If we want to keep this internal array representation, the only fix I can think
of would be to always use a 4-element array defaulted to 0. The previous
examples would become:

  $VAR1 = [ 9, 0, 0, 0 ];
  $VAR1 = [ 9, 0, 0, 0 ];
  $VAR1 = [ 14, 0, 0, 0 ];
  $VAR1 = [ 14, 0, 0, -1 ];

A better fix would be to store the version internally as a version_num, which is
trivial to compute and compare. Please find attached an implementation of
this.
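
For reference, this is roughly the same numeric scheme the server itself exposes
as server_version_num, e.g.:

  SHOW server_version_num;   -- 90624 for 9.6.24, 140002 for 14.2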

The patch is a bit bigger because it also improves devel version handling to
support rc/beta/alpha comparisons like 14rc2 > 14rc1.

Moreover, it adds a bunch of TAP tests to check various use cases.

Regards,
>From d6b28440ad8eb61fd83be6a0df8f40108296bc4e Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Tue, 28 Jun 2022 22:17:48 +0200
Subject: [PATCH] Fix and improve PostgreSQL::Version

The internal array representation of a version was failing some
comparisons because the loop comparing each digit expected both arrays
to have the same number of digits.

Comparisons like 9.6 <=> 9.0 or 14beta1 <=> 14 were failing.

This patch computes and uses the numeric version internally to compare
objects.

Moreover, it improves the comparison of development versions by paying
attention to the iteration number for rc, beta and alpha.
---
 src/test/perl/PostgreSQL/Version.pm  | 112 --
 src/test/perl/t/01-PostgreSQL::Version.t | 141 +++
 2 files changed, 216 insertions(+), 37 deletions(-)
 create mode 100644 src/test/perl/t/01-PostgreSQL::Version.t

diff --git a/src/test/perl/PostgreSQL/Version.pm b/src/test/perl/PostgreSQL/Version.pm
index 8f70491189..e42003d0d1 100644
--- a/src/test/perl/PostgreSQL/Version.pm
+++ b/src/test/perl/PostgreSQL/Version.pm
@@ -46,6 +46,7 @@ package PostgreSQL::Version;
 
 use strict;
 use warnings;
+use Carp;
 
 use Scalar::Util qw(blessed);
 
@@ -73,35 +74,73 @@ of a Postgres command like `psql --version` or `pg_config --version`;
 
 sub new
 {
-	my $class = shift;
-	my $arg   = shift;
-
-	chomp $arg;
-
-	# Accept standard formats, in case caller has handed us the output of a
-	# postgres command line tool
-	my $devel;
-	($arg, $devel) = ($1, $2)
-	  if (
-		$arg =~ m!^ # beginning of line
-  (?:\(?PostgreSQL\)?\s)? # ignore PostgreSQL marker
-  (\d+(?:\.\d+)*) # version number, dotted notation
-  (devel|(?:alpha|beta|rc)\d+)?   # dev marker - see version_stamp.pl
-		 !x);
-
-	# Split into an array
-	my @numbers = split(/\./, $arg);
-
-	# Treat development versions as having a minor/micro version one less than
-	# the first released version of that branch.
-	push @numbers, -1 if ($devel);
+	my $class  = shift;
+	my $arg= shift;
+	my $ver= '';
+	my $devel  = '';
+	my $vernum = 0;
+	my $devnum = 0;
+
+	$arg =~ m!^(?:[^\s]+\s)?   # beginning of line + optionnal command
+	(?:\(?PostgreSQL\)?\s)?# ignore PostgreSQL marker
+	(\d+)(?:\.(\d+))?(?:\.(\d+))?  # version number, dotted notation
+	([^\s]+)?  # dev marker - see version_stamp.pl
+	!x;
+
+	croak "could not parse version: $arg" unless defined $1;
+
+	$ver = "$1";
+	$vernum += 1 * $1;
+	$ver

Self FK oddity when attaching a partition

2022-06-03 Thread Jehan-Guillaume de Rorthais
Hi all,

While studying the issue discussed in thread "Detaching a partition with a FK
on itself is not possible"[1], I stumbled across an oddity while attaching a
partition having the same multiple self-FKs as the parent table.

Only one of the self-FKs is found as a duplicate. Find in attachment some SQL to
reproduce the scenario. Below is the result of this scenario (constant from v12
up to commit 7e367924e3). Why is "child1_id_abc_no_part_fkey" found to be a
duplicate but not the three others? In pg_constraint, only
"child1_id_abc_no_part_fkey" has "conparentid" set.

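The schema involved looks roughly like this (a sketch reconstructed from the
output below; the attached SQL may differ in details):

  CREATE TABLE parent (
    id bigint NOT NULL DEFAULT 1,
    no_part smallint NOT NULL,
    id_abc bigint, id_def bigint, id_ghi bigint, id_jkl bigint,
    PRIMARY KEY (id, no_part),
    FOREIGN KEY (id_abc, no_part) REFERENCES parent (id, no_part)
      ON UPDATE RESTRICT ON DELETE RESTRICT,
    FOREIGN KEY (id_def, no_part) REFERENCES parent (id, no_part)
      ON UPDATE RESTRICT ON DELETE RESTRICT,
    FOREIGN KEY (id_ghi, no_part) REFERENCES parent (id, no_part)
      ON UPDATE RESTRICT ON DELETE RESTRICT,
    FOREIGN KEY (id_jkl, no_part) REFERENCES parent (id, no_part)
      ON UPDATE RESTRICT ON DELETE RESTRICT
  ) PARTITION BY LIST (no_part);

  -- child1 carries the same four self-FKs, but pointing to itself
  CREATE TABLE child1 (
    id bigint NOT NULL DEFAULT 1,
    no_part smallint NOT NULL,
    id_abc bigint, id_def bigint, id_ghi bigint, id_jkl bigint,
    PRIMARY KEY (id, no_part),
    CONSTRAINT child1 CHECK (no_part = 1),
    FOREIGN KEY (id_abc, no_part) REFERENCES child1 (id, no_part)
      ON UPDATE RESTRICT ON DELETE RESTRICT,
    FOREIGN KEY (id_def, no_part) REFERENCES child1 (id, no_part)
      ON UPDATE RESTRICT ON DELETE RESTRICT,
    FOREIGN KEY (id_ghi, no_part) REFERENCES child1 (id, no_part)
      ON UPDATE RESTRICT ON DELETE RESTRICT,
    FOREIGN KEY (id_jkl, no_part) REFERENCES child1 (id, no_part)
      ON UPDATE RESTRICT ON DELETE RESTRICT
  );

  ALTER TABLE parent ATTACH PARTITION child1 FOR VALUES IN ('1');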

 conname   | conparentid | conrelid | confrelid 
  -+-+--+---
   child1_id_abc_no_part_fkey  |   16901 |16921 | 16921
   child1_id_def_no_part_fkey  |   0 |16921 | 16921
   child1_id_ghi_no_part_fkey  |   0 |16921 | 16921
   child1_id_jkl_no_part_fkey  |   0 |16921 | 16921
   parent_id_abc_no_part_fkey  |   16901 |16921 | 16894
   parent_id_abc_no_part_fkey  |   0 |16894 | 16894
   parent_id_abc_no_part_fkey1 |   16901 |16894 | 16921
   parent_id_def_no_part_fkey  |   16906 |16921 | 16894
   parent_id_def_no_part_fkey  |   0 |16894 | 16894
   parent_id_def_no_part_fkey1 |   16906 |16894 | 16921
   parent_id_ghi_no_part_fkey  |   0 |16894 | 16894
   parent_id_ghi_no_part_fkey  |   16911 |16921 | 16894
   parent_id_ghi_no_part_fkey1 |   16911 |16894 | 16921
   parent_id_jkl_no_part_fkey  |   0 |16894 | 16894
   parent_id_jkl_no_part_fkey  |   16916 |16921 | 16894
   parent_id_jkl_no_part_fkey1 |   16916 |16894 | 16921
  (16 rows)


 Table "public.child1"
  [...]
  Partition of: parent FOR VALUES IN ('1')
  Partition constraint: ((no_part IS NOT NULL) AND (no_part = '1'::smallint))
  Indexes:
"child1_pkey" PRIMARY KEY, btree (id, no_part)
  Check constraints:
"child1" CHECK (no_part = 1)
  Foreign-key constraints:
"child1_id_def_no_part_fkey"
FOREIGN KEY (id_def, no_part)
REFERENCES child1(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
"child1_id_ghi_no_part_fkey"
FOREIGN KEY (id_ghi, no_part)
REFERENCES child1(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
"child1_id_jkl_no_part_fkey"
FOREIGN KEY (id_jkl, no_part)
REFERENCES child1(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "parent" CONSTRAINT "parent_id_abc_no_part_fkey"
FOREIGN KEY (id_abc, no_part)
REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "parent" CONSTRAINT "parent_id_def_no_part_fkey"
FOREIGN KEY (id_def, no_part)
REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "parent" CONSTRAINT "parent_id_ghi_no_part_fkey"
FOREIGN KEY (id_ghi, no_part)
REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "parent" CONSTRAINT "parent_id_jkl_no_part_fkey"
FOREIGN KEY (id_jkl, no_part)
REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
  Referenced by:
TABLE "child1" CONSTRAINT "child1_id_def_no_part_fkey"
FOREIGN KEY (id_def, no_part)
REFERENCES child1(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "child1" CONSTRAINT "child1_id_ghi_no_part_fkey"
FOREIGN KEY (id_ghi, no_part)
REFERENCES child1(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "child1" CONSTRAINT "child1_id_jkl_no_part_fkey"
FOREIGN KEY (id_jkl, no_part)
REFERENCES child1(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "parent" CONSTRAINT "parent_id_abc_no_part_fkey"
FOREIGN KEY (id_abc, no_part)
REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "parent" CONSTRAINT "parent_id_def_no_part_fkey"
FOREIGN KEY (id_def, no_part)
REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "parent" CONSTRAINT "parent_id_ghi_no_part_fkey"
FOREIGN KEY (id_ghi, no_part)
REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "parent" CONSTRAINT "parent_id_jkl_no_part_fkey"
FOREIGN KEY (id_jkl, no_part)
REFERENCES parent(id, no_part) ON UPDATE RESTRICT ON DELETE RESTRICT

Regards,

[1]
https://www.postgresql.org/message-id/flat/20220321113634.68c09d4b%40karst#83c0880a1b4921fcd00d836d4e6bceb3


self-fk-after-part-attach.sql
Description: application/sql


Re: Detaching a partition with a FK on itself is not possible

2022-03-21 Thread Jehan-Guillaume de Rorthais
Hi,

On Thu, 17 Mar 2022 17:58:04 +
Arne Roland  wrote:

> I don't think this a bug, but a feature request. I therefore think hackers
> would be more appropriate.

+1

I changed the list destination.

> I don't see how an additional syntax to modify the constraint should help.

Me neither.

> If I'd want to fix this, I'd try to teach the detach partition code about
> self referencing foreign keys. It seems to me like that would be the cleanest
> solution, because the user doesn't need to care about this at all.

Teaching the detach partition code about self-referencing FKs means either:

* it is safe to remove the FK, or
* we can rewrite the FK to be self-referencing.

Neither solution is ideal from the original schema and user perspective.

Another solution could be to teach the partition creation code to detect a
self-referencing FK and actually create a self-referencing FK on the partition
itself, not pointing to the partitioned table, and of course issue a NOTICE to
the client.
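
For illustration only, here is a minimal sketch of what such a converted
constraint could look like on the partition shown earlier (column and
constraint names are reused from that example; the exact DDL the feature would
emit is hypothetical):

  -- hypothetical result: the FK references the partition itself rather than
  -- the partitioned parent table
  ALTER TABLE child1
    ADD CONSTRAINT child1_id_abc_no_part_fkey
    FOREIGN KEY (id_abc, no_part)
    REFERENCES child1 (id, no_part)
    ON UPDATE RESTRICT ON DELETE RESTRICT;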





Re: [Proposal] Add accumulated statistics for wait event

2021-06-16 Thread Jehan-Guillaume de Rorthais
Hi Andres,

On Mon, 14 Jun 2021 15:01:14 -0700
Andres Freund  wrote:

> On 2021-06-14 23:20:47 +0200, Jehan-Guillaume de Rorthais wrote:
> > > On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:  
> > > > In the patch in attachment, I tried to fix this by using kind of an
> > > > internal hook for pgstat_report_wait_start and pgstat_report_wait_end.
> > > > This allows to "instrument" wait events only when required, on the fly,
> > > > dynamically.
> > > 
> > > That's *far worse*. You're adding an indirect function call. Which
> > > requires loading a global variable and then a far call to a different
> > > function. You're changing a path that's ~2 instructions with minimal
> > > dependencies (and no branches (i.e. fully out of order executable) to
> > > something on the order of ~15 instructions with plenty dependencies and
> > > at least two branches (call, ret).  
> > 
> > Oh, I didn't realized it would affect all queries, even when
> > log_statement_stats was off. Thank you for your explanation.  
> 
> Maybe I just am misunderstanding what you were doing? As far as I can
> tell your patch changed pgstat_report_wait_start() to be an indirect
> function call - right?

Exactly.

I didn't realize this indirection would be so costly on every single call, on
top of the variable assignment itself.

> Then yes, this adds overhead to everything.

I understand now, thank you for the explanation.
For my own curiosity and study, I'll remove this indirection and bench my patch
anyway.

> You *could* add a pgstat_report_wait_(start|end)_with_time() or such and
> only use that in places that won't have a high frequency. But I just
> don't quite see the use-case for that.

Well, it could be useful if we decide to only track a subset of wait events.
In my scenario, I originally wanted to only track ClientWrite, but then I
realized this might be too specific and tried to generalize.

There are probably some other ways to deal with this issue, e.g.:

* do NOT include the time lost waiting for the frontend side in the query
  execution time
* expose the frontend part of the query time in log_min_duration_statement,
  auto_explain, pg_stat_statements, in the same fashion we currently do with
  planning and execution time
* have some wait-event sampling mechanism in core, or one as easy to enable
  and hot-loadable as auto_explain

Thoughts?

Thanks again!

Regards,




Re: [Proposal] Add accumulated statistics for wait event

2021-06-14 Thread Jehan-Guillaume de Rorthais
Hi,

On Mon, 14 Jun 2021 11:27:21 -0700
Andres Freund  wrote:

> On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:
> > In the patch in attachment, I tried to fix this by using kind of an internal
> > hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to
> > "instrument" wait events only when required, on the fly, dynamically.  
> 
> That's *far worse*. You're adding an indirect function call. Which requires
> loading a global variable and then a far call to a different function. You're
> changing a path that's ~2 instructions with minimal dependencies (and no
> branches (i.e. fully out of order executable) to something on the order of ~15
> instructions with plenty dependencies and at least two branches (call, ret).

Oh, I didn't realize it would affect all queries, even when log_statement_stats
was off. Thank you for your explanation.

> I doubt there's a path towards this feature without adding the necessary
> infrastructure to hot-patch the code - which is obviously quite a
> substantial project.

Right. Sadly, this kind of project is far above what I can do. So I suppose
it's a dead end for me.

I'll study if/how the sampling approach can be done dynamically.

Thank you,




Re: [Proposal] Add accumulated statistics for wait event

2021-06-14 Thread Jehan-Guillaume de Rorthais
Hi Andres, Hi all,

First, thank you for your feedback!

Please find attached a patch implementing accumulated wait event stats only
from the backend point of view. As I wrote when I reviewed and rebased the
existing patch, I was uncomfortable with the global approach. I still
volunteer to work on or review the original approach if required.

See below for comments and some more explanations about what I think might be
improvements over the previous patch.

On Fri, 11 Jun 2021 12:18:07 -0700
Andres Freund  wrote:

> On 2021-06-05 00:53:44 +0200, Jehan-Guillaume de Rorthais wrote:
> > From 88c2779679c5c9625ca5348eec0543daab5ccab4 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais 
> > Date: Tue, 1 Jun 2021 13:25:57 +0200
> > Subject: [PATCH 1/2] Add pg_stat_waitaccum view.
> > 
> > pg_stat_waitaccum shows counts and duration of each wait events.
> > Each backend/backgrounds counts and measures the time of wait event
> > in every pgstat_report_wait_start and pgstat_report_wait_end. They
> > store those info into their local variables and send to Statistics
> > Collector. We can get those info via Statistics Collector.
> > 
> > For reducing overhead, I implemented statistic hash instead of
> > dynamic hash. I also implemented track_wait_timing which
> > determines wait event duration is collected or not.  
> 
> I object to adding this overhead. The whole selling point for wait
> events was that they are low overhead. I since spent time reducing the
> overhead further, because even just the branches for checking if
> track_activity is enabled are measurable (225a22b19ed).

Agreed. The previous patch I rebased was only meant to review it and reopen
this discussion; I even added a small FIXME in pgstat_report_wait_end and
pgstat_report_wait_start about your work:

  //FIXME: recent patch to speed up this call.

In the attached patch, I tried to fix this by using a kind of internal hook
for pgstat_report_wait_start and pgstat_report_wait_end. This allows wait
events to be "instrumented" only when required, on the fly, dynamically.

Moreover, I replaced the hash structure with a simple static array for faster
access.

> > From ddb1adc5cd9acc9bc9de16d0cf057124b09fe1e3 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais 
> > Date: Fri, 4 Jun 2021 18:14:51 +0200
> > Subject: [PATCH 2/2] [POC] Change measuring method of wait event time from
> >  INSTR_TIME to rdtsc.
> > 
> > This patch changes measuring method of wait event time from INSTR_TIME
> > (which uses gettimeofday or clock_gettime) to rdtsc. This might reduce the
> > overhead of measuring overhead.
> > 
> > Any supports like changing clock cycle to actual time or error handling are
> > not currently implemented.  
> 
> rdtsc is a serializing (*) instruction - that's the expensive part. On linux
> clock_gettime() doesn't actually need a syscall. While the vdso call
> implies a bit of overhead over a raw rdtsc, it's a relatively small part
> of the overhead.  See
> https://www.postgresql.org/message-id/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de

I chose to remove the whole rdtsc part from my patch, as it wasn't clear how
much faster it was compared to the simpler vDSO functions, nor how to
accurately convert the result to a human-readable time.


About my take on $subject: for the sake of simplicity of this PoC, I added
instrumentation to log_statement_stats. Despite the per-query context of the
reported log, these are really accumulated stats.

The patch updates pg_stat_get_waitaccum() as well, so the accumulated wait
events can be reported from an interactive or batch session.

So using my previous fe-time demo client, you can test it using:

  PGOPTIONS="--log_statement_stats=on" ./fe-time 100

From the logs, I now have (notice the last line):

  LOG:  duration: 3837.194 ms  statement: SELECT * FROM pgbench_accounts
  LOG:  QUERY STATISTICS
  DETAIL:  ! system usage stats:
   !   0.087444 s user, 0.002106 s system, 3.837202 s elapsed
   !   [0.087444 s user, 0.003974 s system total]
   !   25860 kB max resident size
   !   0/0 [0/0] filesystem blocks in/out
   !   0/303 [0/697] page faults/reclaims, 0 [0] swaps
   !   0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
   !   4/18 [5/18] voluntary/involuntary context switches
   !   Client/ClientWrite 4 calls, 3747102 us elapsed


Using pgbench scale factor 10, the copy query for pgbench_accounts looks like:

  LOG:  duration: 2388.081 ms  statement: copy pgbench_accounts from stdin
  LOG:  QUERY STATISTICS
  DETAIL:  ! system usage stats:
   !   1.373756 s user, 0.252860 s system, 2.388100 s elapsed
   !   [1.397015 s user, 0.264951 s system total]
   !   37788 kB max reside

Re: [Proposal] Add accumulated statistics for wait event

2021-06-04 Thread Jehan-Guillaume de Rorthais
Hi All,

I have faced a few times a situation where a long-running query actually
includes the time the backend waits for the frontend to fetch all the rows
(see [1] for details). See the attached sample code fe-time.c and its comments
to reproduce this behavior.

There's no simple way today to pinpoint the problem in production without
advanced interactive auditing and/or system tools. After studying the problem,
I believe it boils down to tracking the ClientWrite wait event, so I ended up
on this thread.

You might catch some mismatching times thanks to auto_explain as well. Using
the fe-time.c demo code with the following command:

  PGDATABASE=postgres PGHOST=::1 time ./fe-time 100

The frontend time is 10s and the reported query time is 3228.631ms, but the
last row was produced after 20.672ms:

  LOG:  duration: 3228.631 ms  plan:
Query Text: SELECT * FROM pgbench_accounts
Seq Scan on pgbench_accounts (time=0.005..20.672 rows=10 loops=1)

(Note that in contrast with localhost, when going through the Unix socket, the
backend-reported query time is always really close to 10s.)

I rebased the existing patch (attached) to look at the ClientWrite wait event
for this exact query:

  # SELECT wait_event, calls, times 
FROM pg_stat_get_waitaccum(NULL)
WHERE wait_event = 'ClientWrite';

   wait_event  | calls |  times  
  -+---+-
   ClientWrite | 4 | 3132266

The "times" value is expressed in µs in the patch, so 3132.266ms of the total
3228.631ms is spent sending the result to the frontend. I'm not sure where
the missing 75ms are.

The pg_wait_sampling extension might help, but it requires a production
restart to install it, then enabling it. Whether the solution is sampling or
cumulative, an in-core and hot-switchable solution would be much more
convenient. Anyway, looking at pg_wait_sampling, we have a clear suspect as
well for a later run of the query:

  # SELECT event, count
FROM pg_wait_sampling_profile
WHERE queryid=4045741516911800313;

  event| count
  -+---
   ClientWrite |   309

The default profile period of pg_wait_sampling being 10ms, we can roughly
estimate the ClientWrite time at around 3090ms. Note that this is close
enough, because we know 3132266µs were accumulated across only 4 large wait
events.

Finishing below.

On Mon, 3 Aug 2020 00:00:40 +0200
Daniel Gustafsson  wrote:

> > On 31 Jul 2020, at 07:23, imai.yoshik...@fujitsu.com wrote:
> >   
> >> This patch fails to apply to HEAD, please submit a rebased version.  I've
> >> marked this as as Waiting on Author.

Please find attached a rebase of both patches. I did some small editing along
the way. I didn't benchmark them.

I'm not sure this patch is the best approach, though. Consider it as a
motivation to keep up with this discussion. As I wrote, whether the solution
is sampling or cumulative, an in-core and hot-switchable solution would be
much more convenient. The fact is that this patch was already available and
ready to support the discussion.

Collecting and summing all wait events from all backends in the same place
makes it impossible to precisely track the wait events of a specific backend,
especially on a busy system where the numbers can quickly be buried by all the
other activity around.

I wonder if wait events should only be accumulated on the backend side, making
it possible to enable/disable them on the fly and to collect some reports,
e.g. in the logs or in the output. Most of the code from these patches could
be recycled into a simpler patch implementing this.

Thoughts?

> > Sorry for my absence. Unfortunately I couldn't have time to work on this
> > patch in this cf. I believe I will be back in next cf, work on this patch
> > and also review other patches.  
> 
> No worries, it happens.  Since the thread has stalled and there is no updated
> patch I've marked this entry Returned with Feedback.  Please feel free to
> re-open a new CF entry if you return to this patch.

I volunteer to be a reviewer on this patch.

Imai-san, do you agree to add it as a new CF entry?

Regards,

[1]: The last time I hit such a situation was a few weeks ago. A customer was
reporting a query being randomly slow, running below 100ms most of the time
and sometimes hitting 28s. Long story short, the number of rows was the same
(10-15k), but the result set size was 9x bigger (1MB vs 9MB). As the same
query was running fine from psql, I suspected the frontend was somehow
saturated. Tcpdump helped me compute that the throughput fell to 256kB/s after
the first 2MB of data transfer, with a very narrow TCP window. I explained to
the customer that their app probably wasn't pulling the rows fast enough and
that some buffers were probably saturated on the frontend side, waiting for
the app and slowing down the whole transfer.
The developers fixed the problem by moving two field transformations
(unaccent) out of their row-fetching loop.

From 88c2779679c5c9625ca5348eec

Re: when the startup process doesn't

2021-04-21 Thread Jehan-Guillaume de Rorthais
On Wed, 21 Apr 2021 12:36:05 -0700
Andres Freund  wrote:

>  [...]  
> 
> I don't think that concern equally applies for what I am proposing
> here. For one, we already have minRecoveryPoint in ControlData, and we
> already use it for the purpose of determining where we need to recover
> to, albeit only during crash recovery. Imo that's substantially
> different from adding actual recovery progress status information to the
> control file.

Just for the record, when I was talking about updating the status of the
startup process in the control data, I was thinking about setting the last
known replayed LSN, not some kind of variable string.

> 
> I also think that it'd actually be a significant reliability improvement
> if we maintained an approximate minRecoveryPoint during normal running:
> I've seen way too many cases where WAL files were lost / removed and
> crash recovery just started up happily. Only hitting problems months
> down the line. Yes, it'd obviously not bullet proof, since we'd not want
> to add a significant stream of new fsyncs, but IME such WAL files
> lost/removed issues tend not to be about a few hundred bytes of WAL but
> many segments missing.

Maybe setting this minRecoveryPoint once per segment, near the beginning of
the segment, would be enough. At least the recovery process would then be
forced to actually replay up to the very last known segment.

Regards,




Re: when the startup process doesn't

2021-04-20 Thread Jehan-Guillaume de Rorthais
On Tue, 20 Apr 2021 15:04:28 +0200
Magnus Hagander  wrote:
[...]
> Yeah, I think we should definitely limit this to local access, one way
> or another. Realistically using pg_hba is going to require catalog
> access, isn't it? And we can't just go ignore those rows in pg_hba
> that for example references role membership (as well as all those auth
> methods you can't use).

Two other options:

1. If this is limited to local access only, then besides the log entries, the
status of the startup process could be updated in the control data file as
well. This would allow watching it without tail-grep'ing the logs, e.g. using
pg_controldata.

2. Maybe the startup process could ignore update_process_title? As far as I
understand the doc, this GUC mostly matters for backends on Windows.

Regards,




Re: multi-install PostgresNode fails with older postgres versions

2021-04-19 Thread Jehan-Guillaume de Rorthais
On Mon, 19 Apr 2021 10:35:39 -0700
Mark Dilger  wrote:

> > On Apr 19, 2021, at 10:25 AM, Jehan-Guillaume de Rorthais 
> > wrote:
> > 
> > On Mon, 19 Apr 2021 12:37:08 -0400
> > Andrew Dunstan  wrote:
> >   
> >> 
> >> On 4/19/21 10:43 AM, Mark Dilger wrote:  
> >>>   
> >>>> On Apr 19, 2021, at 5:11 AM, Andrew Dunstan  wrote:
> >>>> 
> >>>> I think therefore I'm inclined for now to do nothing for old version
> >>>> compatibility.  
> >>> I agree with waiting until the v15 development cycle.
> >>>   
> >>>> I would commit the fix for the IPC::Run caching glitch,
> >>>> and version detection  
> >>> Thank you.
> >>>   
> >>>> I would add a warning if the module is used with
> >>>> a version <= 11.  
> >>> Sounds fine for now.
> >>>   
> >>>> The original goal of these changes was to allow testing of combinations
> >>>> of different builds with openssl and nss, which doesn't involve old
> >>>> version compatibility.  
> >>> Hmm.  I think different folks had different goals.  My personal interest
> >>> is to write automated tests which spin up older servers, create data that
> >>> cannot be created on newer servers (such as heap tuples with HEAP_MOVED_IN
> >>> or HEAP_MOVED_OFF bits set), upgrade, and test that new code handles the
> >>> old data correctly.  I think this is not only useful for our test suites
> >>> as a community, but is also useful for companies providing support
> >>> services who need to reproduce problems that customers are having on
> >>> clusters that have been pg_upgraded across large numbers of postgres
> >>> versions. 
> >>>> As far as I know, without any compatibility changes the module is fully
> >>>> compatible with releases 13 and 12, and with releases 11 and 10 so long
> >>>> as you don't want a standby, and with releases 9.6 and 9.5 if you also
> >>>> don't want a backup. That makes it suitable for a lot of testing without
> >>>> any attempt at version compatibility.
> >>>> 
> >>>> We can revisit compatibility further in the next release.  
> >>> Sounds good.  
> >> 
> >> 
> >> I'll work on this. Meanwhile FTR here's my latest revision - it's a lot
> >> less invasive of the main module, so it seems much more palatable to me,
> >> and still passes my test down to 7.2.  
> > 
> > I spend a fair bit of time to wonder how useful it could be to either
> > maintain such a module in core, including for external needs, or creating a
> > separate external project with a different release/distribution/packaging
> > policy.
> > 
> > Wherever the module is maintained, the goal would be to address broader
> > needs, eg. adding a switch_wal() method or wait_for_archive(), supporting
> > replication, backups, etc for multi-old-deprecated-PostgreSQL versions.
> >
> > To be honest I have mixed feelings. I feel this burden shouldn't be carried
> > by the core, which has restricted needs compared to external projects. In
> > the opposite, maintaining an external project which shares 90% of the code
> > seems to be a useless duplicate and backport effort. Moreover Craig Ringer
> > already opened the door for external use of PostgresNode with his effort to
> > install/package it, see:
> > https://www.postgresql.org/message-id/CAGRY4nxxKSFJEgVAv5YAk%3DbqULtFmNw7gEJef0CCgzpNy6O%3D-w%40mail.gmail.com
> > 
> > Thoughts?  
> 
> The community needs a single shared PostgresNode implementation that can be
> used by scripts which reproduce bugs.

Which means it could be OK to have a PostgresNode implementation, living in
the core source tree, which supports broader needs than the core ones (older
versions and some more methods)? Did I understand correctly?

If this is correct, I suppose this effort could be committed early in the v15 cycle?

Is it worth the effort to build some dedicated TAP tests for these modules? I
already have a small patch for this sitting on my disk, waiting for some more
tests and review...

Regards




Re: multi-install PostgresNode fails with older postgres versions

2021-04-19 Thread Jehan-Guillaume de Rorthais
On Mon, 19 Apr 2021 12:37:08 -0400
Andrew Dunstan  wrote:

> 
> On 4/19/21 10:43 AM, Mark Dilger wrote:
> >
> >> On Apr 19, 2021, at 5:11 AM, Andrew Dunstan  wrote:
> >>
> >> I think therefore I'm inclined for now to do nothing for old version
> >> compatibility.
> > I agree with waiting until the v15 development cycle.
> >
> >> I would commit the fix for the IPC::Run caching glitch,
> >> and version detection
> > Thank you.
> >
> >> I would add a warning if the module is used with
> >> a version <= 11.
> > Sounds fine for now.
> >
> >> The original goal of these changes was to allow testing of combinations
> >> of different builds with openssl and nss, which doesn't involve old
> >> version compatibility.
> > Hmm.  I think different folks had different goals.  My personal interest is
> > to write automated tests which spin up older servers, create data that
> > cannot be created on newer servers (such as heap tuples with HEAP_MOVED_IN
> > or HEAP_MOVED_OFF bits set), upgrade, and test that new code handles the
> > old data correctly.  I think this is not only useful for our test suites as
> > a community, but is also useful for companies providing support services
> > who need to reproduce problems that customers are having on clusters that
> > have been pg_upgraded across large numbers of postgres versions.
> >
> >> As far as I know, without any compatibility changes the module is fully
> >> compatible with releases 13 and 12, and with releases 11 and 10 so long
> >> as you don't want a standby, and with releases 9.6 and 9.5 if you also
> >> don't want a backup. That makes it suitable for a lot of testing without
> >> any attempt at version compatibility.
> >>
> >> We can revisit compatibility further in the next release.
> > Sounds good.
> 
> 
> I'll work on this. Meanwhile FTR here's my latest revision - it's a lot
> less invasive of the main module, so it seems much more palatable to me,
> and still passes my test down to 7.2.

I spent a fair bit of time wondering how useful it could be to either maintain
such a module in core, including for external needs, or to create a separate
external project with a different release/distribution/packaging policy.

Wherever the module is maintained, the goal would be to address broader needs,
e.g. adding a switch_wal() or wait_for_archive() method, supporting
replication, backups, etc. for multiple old, deprecated PostgreSQL versions.

To be honest, I have mixed feelings. I feel this burden shouldn't be carried
by core, which has restricted needs compared to external projects. On the
other hand, maintaining an external project which shares 90% of the code seems
to be a useless duplication and backport effort. Moreover, Craig Ringer
already opened the door to external use of PostgresNode with his effort to
install/package it, see:
https://www.postgresql.org/message-id/CAGRY4nxxKSFJEgVAv5YAk%3DbqULtFmNw7gEJef0CCgzpNy6O%3D-w%40mail.gmail.com

Thoughts?




Re: multi-install PostgresNode fails with older postgres versions

2021-04-19 Thread Jehan-Guillaume de Rorthais
On Mon, 19 Apr 2021 07:43:58 -0700
Mark Dilger  wrote:

> > On Apr 19, 2021, at 5:11 AM, Andrew Dunstan  wrote:
> > 
> > I think therefore I'm inclined for now to do nothing for old version
> > compatibility.  
> 
> I agree with waiting until the v15 development cycle.

Agree.




Re: Retry in pgbench

2021-04-16 Thread Jehan-Guillaume de Rorthais
On Fri, 16 Apr 2021 10:28:48 +0900 (JST)
Tatsuo Ishii  wrote:

> > By the way, I've been playing with the idea of failing gracefully and retry
> > indefinitely (or until given -T) on SQL error AND connection issue.
> > 
> > It would be useful to test replicating clusters with a (switch|fail)over
> > procedure.  
> 
> Interesting idea but in general a failover takes sometime (like a few
> minutes), and it will strongly affect TPS. I think in the end it just
> compares the failover time.

This use case is not about benchmarking. It's about generating constant
traffic in order to practice/train some [auto]switchover procedures while
staying close to production activity.

In this context, the maximum saturated TPS of one node is not relevant. But
being able to add some stats about downtime might be a good addition.

Regards,




Re: Retry in pgbench

2021-04-13 Thread Jehan-Guillaume de Rorthais
Hi,

On Tue, 13 Apr 2021 16:12:59 +0900 (JST)
Tatsuo Ishii  wrote:

>  [...]  
>  [...]  
>  [...]  
> 
> Thanks for the pointer. It seems we need to resume the discussion.

By the way, I've been playing with the idea of failing gracefully and retrying
indefinitely (or until the given -T) on SQL errors AND connection issues.

It would be useful to test replicating clusters with a (switch|fail)over
procedure.

Regards,




Re: multi-install PostgresNode fails with older postgres versions

2021-04-12 Thread Jehan-Guillaume de Rorthais
On Mon, 12 Apr 2021 09:52:24 -0400
Andrew Dunstan  wrote:

> On 4/12/21 8:59 AM, Jehan-Guillaume de Rorthais wrote:
> > Hi,
> >
> > On Wed, 7 Apr 2021 20:07:41 +0200
> > Jehan-Guillaume de Rorthais  wrote:
> > [...]  
> >>>> Let me know if it worth that I work on an official patch.  
> >>> Let's give it a try ...
> >> OK  
> > So, as promised, here is my take to port my previous work on PostgreSQL
> > source tree.
> >
> > Make check pass with no errors. The new class probably deserve some own TAP
> > tests.
> >
> > Note that I added a PostgresVersion class for easier and cleaner version
> > comparisons. This could be an interesting take away no matter what.
> >
> > I still have some more ideas to cleanup, revamp and extend the base class,
> > but I prefer to go incremental to keep things review-ability.
> >  
> 
> Thanks for this. We have been working somewhat on parallel lines. With
> your permission I'm going to take some of what you've done and
> incorporate it in the patch I'm working on.

The current context makes my weeks difficult to plan and quite chaotic, which
is why it took me some days to produce the patch I promised.

I'm fine with working from a common code base, thanks. Feel free to merge both
code bases; we'll trade patches during review. However, I'm not sure what the
status of your patch is, so I cannot judge what would be the easiest way to
incorporate it. Mine is tested down to 9.1 (9.0 was meaningless because of the
lack of pg_stat_replication) with:

* get_new_node
* init(allows_streaming => 1)
* start
* stop
* backup
* init_from_backup
* wait_for_catchup
* command_checks_all

Note the various changes in my init() and the new method allows_streaming(), etc.

FYI (to avoid duplicate work), the next step on my todo list was to produce
some meaningful TAP tests to prove the class.

> A PostgresVersion class is a good idea - I was already contemplating
> something of the kind.

Thanks!

Regards,




Re: multi-install PostgresNode fails with older postgres versions

2021-04-12 Thread Jehan-Guillaume de Rorthais
Hi,

On Wed, 7 Apr 2021 20:07:41 +0200
Jehan-Guillaume de Rorthais  wrote:
[...]
> > > Let me know if it worth that I work on an official patch.
> > 
> > Let's give it a try ...  
> 
> OK

So, as promised, here is my take at porting my previous work onto the
PostgreSQL source tree.

"make check" passes with no errors. The new class probably deserves its own
TAP tests.

Note that I added a PostgresVersion class for easier and cleaner version
comparisons. This could be an interesting takeaway no matter what.

I still have some more ideas to clean up, revamp and extend the base class,
but I prefer to go incrementally to keep things reviewable.

Thanks,
From 077e447995b3b8bb4c0594a6616e1350acd9d48b Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Mon, 12 Apr 2021 14:45:17 +0200
Subject: [PATCH] Draft: Makes PostgresNode multi-version aware

---
 src/test/perl/PostgresNode.pm| 708 ---
 src/test/perl/PostgresVersion.pm |  65 +++
 2 files changed, 704 insertions(+), 69 deletions(-)
 create mode 100644 src/test/perl/PostgresVersion.pm

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index e26b2b3f30..d7a9e54efd 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -102,6 +102,7 @@ use Test::More;
 use TestLib ();
 use Time::HiRes qw(usleep);
 use Scalar::Util qw(blessed);
+use PostgresVersion;
 
 our @EXPORT = qw(
   get_new_node
@@ -376,9 +377,42 @@ sub dump_info
 	return;
 }
 
+# Internal routine to allows streaming replication on a primary node.
+sub allows_streaming
+{
+	my ($self, $allows_streaming) = @_;
+	my $pgdata = $self->data_dir;
+	my $wal_level;
+
+	if ($allows_streaming eq "logical")
+	{
+		$wal_level = "logical";
+	}
+	else
+	{
+		$wal_level = "replica";
+	}
+
+	$self->append_conf( 'postgresql.conf', qq{
+		wal_level = $wal_level
+		max_wal_senders = 10
+		max_replication_slots = 10
+		wal_log_hints = on
+		hot_standby = on
+		# conservative settings to ensure we can run multiple postmasters:
+		shared_buffers = 1MB
+		max_connections = 10
+		# limit disk space consumption, too:
+		max_wal_size = 128MB
+	});
 
-# Internal method to set up trusted pg_hba.conf for replication.  Not
-# documented because you shouldn't use it, it's called automatically if needed.
+	$self->set_replication_conf;
+
+	return;
+}
+
+# Internal to set up trusted pg_hba.conf for replication.  Not documented
+# because you shouldn't use it, it's called automatically if needed.
 sub set_replication_conf
 {
 	my ($self) = @_;
@@ -398,6 +432,15 @@ sub set_replication_conf
 	return;
 }
 
+# Internal methods to track capabilities depending on the PostgreSQL major
+# version used
+
+sub can_slots   { return 1 }
+sub can_log_hints   { return 1 }
+sub can_restart_after_crash { return 1 }
+sub can_skip_init_fsync { return 1 }
+
+
 =pod
 
 =item $node->init(...)
@@ -429,6 +472,7 @@ sub init
 	my $port   = $self->port;
 	my $pgdata = $self->data_dir;
 	my $host   = $self->host;
+	my @cmd;
 
 	local %ENV = $self->_get_env();
 
@@ -438,19 +482,25 @@ sub init
 	mkdir $self->backup_dir;
 	mkdir $self->archive_dir;
 
-	TestLib::system_or_bail('initdb', '-D', $pgdata, '-A', 'trust', '-N',
-		@{ $params{extra} });
+	@cmd = ( 'initdb', '-D', $pgdata, '-A', 'trust' );
+	push @cmd, '-N' if $self->can_skip_init_fsync;
+	push @cmd, @{ $params{extra} } if defined $params{extra};
+
+	TestLib::system_or_bail(@cmd);
 	TestLib::system_or_bail($ENV{PG_REGRESS}, '--config-auth', $pgdata,
 		@{ $params{auth_extra} });
 
 	open my $conf, '>>', "$pgdata/postgresql.conf";
 	print $conf "\n# Added by PostgresNode.pm\n";
 	print $conf "fsync = off\n";
-	print $conf "restart_after_crash = off\n";
+	print $conf "restart_after_crash = off\n" if $self->can_restart_after_crash;
 	print $conf "log_line_prefix = '%m [%p] %q%a '\n";
 	print $conf "log_statement = all\n";
-	print $conf "log_replication_commands = on\n";
-	print $conf "wal_retrieve_retry_interval = '500ms'\n";
+	if ($self->version >= '9.5.0')
+	{
+		print $conf "log_replication_commands = on\n";
+		print $conf "wal_retrieve_retry_interval = '500ms'\n";
+	}
 
 	# If a setting tends to affect whether tests pass or fail, print it after
 	# TEMP_CONFIG.  Otherwise, print it before TEMP_CONFIG, thereby permitting
@@ -463,31 +513,8 @@ sub init
 	# concurrently must not share a stats_temp_directory.
 	print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
 
-	if ($params{allows_streaming})
-	{
-		if ($params{allows_streaming} eq "logical")
-		{
-			print $conf "wal_level = logical\n";
-		}
-		else
-		{
-			print $conf "wal_level = replica\n";
-		}
-		print $conf "max_wal_senders = 10\n";
-		print $conf "max_replication_slo

Re: multi-install PostgresNode fails with older postgres versions

2021-04-07 Thread Jehan-Guillaume de Rorthais
On Wed, 7 Apr 2021 13:38:39 -0400
Andrew Dunstan  wrote:

> On 4/7/21 1:19 PM, Jehan-Guillaume de Rorthais wrote:
> > On Wed, 7 Apr 2021 12:51:55 -0400
> > Alvaro Herrera  wrote:
> >  
> >> On 2021-Apr-07, Jehan-Guillaume de Rorthais wrote:
> >>  
> >>> When I'm creating a new node, I'm using the "pgaTester" factory class. It
> >>> relies on PATH to check the major version using pg_config, then loads the
> >>> appropriate class.
> >> From a code cleanliness point of view, I agree that having separate
> >> classes for each version is neater than what you call a forest of
> >> conditionals.  I'm not sure I like the way you instantiate the classes
> >> in pgaTester though -- wouldn't it be saner to have PostgresNode::new
> >> itself be in charge of deciding which class to bless the object as?
> >> Since we're talking about modifying PostgresNode itself in order to
> >> support this, it would make sense to do that.  
> > Yes, it would be much saner to make PostgresNode the factory class. Plus,
> > some more logic could be injected there to either auto-detect the version
> > (current behavior) or eg. use a given path to the binaries as Mark did in
> > its patch.  
> 
> 
> Aren't you likely to end up duplicating substantial amounts of code,
> though? I'm certainly not at the stage where I think the version-aware
> code is creating too much clutter. The "forest of conditionals" seems
> more like a small thicket.

I started with a patched PostgresNode. Then I had to support backups and
replication for my tests. Then it became hard to follow the logic in the
conditional blocks; moreover, some conditionals needed to appear in multiple
places in the same methods, depending on the enabled features, the conditions,
which GUCs were enabled by default, etc. So I ended up with this design.

I really don't want to waste community brain cycles on discussions and useless
reviews. But as long as someone agrees to review it, I already have the
material and I can create a patch with a limited amount of work to compare and
review. The one-class approach would need to support replication down to 9.0
to be a fair comparison, though :/

Thanks,




Re: multi-install PostgresNode fails with older postgres versions

2021-04-07 Thread Jehan-Guillaume de Rorthais
On Wed, 7 Apr 2021 10:50:26 -0700
Mark Dilger  wrote:

> > On Apr 7, 2021, at 10:36 AM, Alvaro Herrera  wrote:
> >   
> >> Yes, it would be much saner to make PostgresNode the factory class. Plus,
> >> some more logic could be injected there to either auto-detect the version
> >> (current behavior) or eg. use a given path to the binaries as Mark did in
> >> its patch.  
> > 
> > I'm not sure what you mean about auto-detecting the version -- I assume
> > we would auto-detect the version by calling pg_config from the
> > configured path and parsing the binary, which is what Mark's patch is
> > supposed to do already.  So I don't see what the distinction between
> > those two things is.
> > 
> > In order to avoid having an ever-growing plethora of 100-byte .pm files,
> > we can put the version-specific classes in the same PostgresNode.pm
> > file, at the bottom, "class PostgresNode96; use parent PostgresNode10;"
> > followed by the routines that are overridden for each version.  
> 
> It's not sufficient to think about postgres versions as "10", "11", etc.  You
> have to be able to spin up nodes of any build, like "9.0.7". There are
> specific versions of postgres with specific bugs that cause specific
> problems, and later versions of postgres on that same development branch have
> been patched.  If you only ever spin up the latest version, you can't
> reproduce the problems and test how they impact cross version compatibility.


Agree.

> I don't think it works to have a class per micro release.  So you'd have a
> PostgresNode of type "10" or such, but how does that help?  If you have ten
> different versions of "10" in your test, they all look the same?

As the PostgresNode factory already checks the pg_config version, it can pass
it as an argument to the specific class constructor. We can then have e.g.
major_version() and version() to return the major and the full version.
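
For illustration, a minimal sketch of how such a factory could look (the names
and internals here are assumptions for discussion, not a committed API):

  # Sketch only (untested): detect the major version through pg_config found
  # in PATH, bless the matching version-specific subclass and hand it the
  # full version string.
  sub get_new_node
  {
      my ($class, $name, %params) = @_;

      my ($version) = `pg_config --version` =~ /(\d+(?:\.\d+)*)/;
      my ($first)   = $version =~ /^(\d+)/;

      # "12" for 12.x and later, "9.6" for 9.6.x and older
      my $major = $first >= 10 ? $first : ($version =~ /^(\d+\.\d+)/)[0];

      (my $subclass = "PostgresNode$major") =~ s/\.//g;   # e.g. PostgresNode96

      my $node = $subclass->new($name, %params);
      $node->{_pg_version} = $version;

      return $node;
  }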

Regards,




Re: multi-install PostgresNode fails with older postgres versions

2021-04-07 Thread Jehan-Guillaume de Rorthais
On Wed, 7 Apr 2021 13:36:31 -0400
Alvaro Herrera  wrote:

> On 2021-Apr-07, Jehan-Guillaume de Rorthais wrote:
> 
> > Yes, it would be much saner to make PostgresNode the factory class. Plus,
> > some more logic could be injected there to either auto-detect the version
> > (current behavior) or eg. use a given path to the binaries as Mark did in
> > its patch.  
> 
> I'm not sure what you mean about auto-detecting the version -- I assume
> we would auto-detect the version by calling pg_config from the
> configured path and parsing the binary, which is what Mark's patch is
> supposed to do already.  So I don't see what the distinction between
> those two things is.

My version is currently calling pg_config without any knowledge about its
absolute path.

Mark's patch is able to take an explicit binary path:

  my $a = PostgresNode->get_new_node('a', install_path => '/my/install/8.4');

> In order to avoid having an ever-growing plethora of 100-byte .pm files,
> we can put the version-specific classes in the same PostgresNode.pm
> file, at the bottom, "class PostgresNode96; use parent PostgresNode10;"
> followed by the routines that are overridden for each version.

Sure.

> > Let me know if it worth that I work on an official patch.  
> 
> Let's give it a try ...

OK

Regards,




Re: multi-install PostgresNode fails with older postgres versions

2021-04-07 Thread Jehan-Guillaume de Rorthais
On Wed, 7 Apr 2021 12:51:55 -0400
Alvaro Herrera  wrote:

> On 2021-Apr-07, Jehan-Guillaume de Rorthais wrote:
> 
> > When I'm creating a new node, I'm using the "pgaTester" factory class. It
> > relies on PATH to check the major version using pg_config, then loads the
> > appropriate class.  
> 
> From a code cleanliness point of view, I agree that having separate
> classes for each version is neater than what you call a forest of
> conditionals.  I'm not sure I like the way you instantiate the classes
> in pgaTester though -- wouldn't it be saner to have PostgresNode::new
> itself be in charge of deciding which class to bless the object as?
> Since we're talking about modifying PostgresNode itself in order to
> support this, it would make sense to do that.

Yes, it would be much saner to make PostgresNode the factory class. Plus, some
more logic could be injected there to either auto-detect the version (current
behavior) or, e.g., use a given path to the binaries as Mark did in his patch.

> (I understand that one of your decisions was to avoid modifying
> PostgresNode, so that you could ingest whatever came out of PGDG without
> having to patch it each time.)

Indeed, that's why I created this class with a not-very-fortunate name :) 

Let me know if it is worth it for me to work on an official patch.

Regards,




Re: why pg_walfile_name() cannot be executed during recovery?

2021-04-07 Thread Jehan-Guillaume de Rorthais
On Fri, 2 Apr 2021 08:22:09 -0400
Robert Haas  wrote:

> On Fri, Apr 2, 2021 at 4:23 AM SATYANARAYANA NARLAPURAM
>  wrote:
> > Why pg_walfile_name() can't be executed under recovery?  
> 
> I believe the issue is that the backend executing the function might
> not have an accurate idea about which TLI to use. But I don't
> understand why we can't find some solution to that problem.
> 
> > What is the best way for me to get the current timeline and/or the file
> > being recovering on the standby using a postgres query? I know I can get it
> > via process title but don't want to go that route.  
> 
> pg_stat_wal_receiver has LSN and TLI information, but probably won't
> help except when WAL receiver is actually active.
> pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn() will give the
> LSN at any point during recovery, but not the TLI. We might have some
> gaps in this area...

Yep, see previous discussion:
https://www.postgresql.org/message-id/flat/20190723180518.635ac554%40firost

The status at the time was to consider a new view, e.g. pg_stat_recovery, to
report various recovery stats.

But maybe the best place now would be to include it in the new pg_stat_wal view?
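
In the meantime, for the record, here is a quick sketch of what can already be
queried on a standby today (the replay TLI itself is still not exposed at the
SQL level outside of an active walreceiver):

  -- the last replayed LSN is available at any point during recovery
  SELECT pg_last_wal_replay_lsn();

  -- a TLI is only visible while streaming replication is active
  SELECT received_tli FROM pg_stat_wal_receiver;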

As I'm interested in this feature as well, I volunteer to work on it as author
or reviewer.

Regards,




Re: multi-install PostgresNode fails with older postgres versions

2021-04-07 Thread Jehan-Guillaume de Rorthais
On Wed, 7 Apr 2021 09:08:31 -0700
Mark Dilger  wrote:

> > On Apr 7, 2021, at 7:37 AM, Jehan-Guillaume de Rorthais 

> > And here is a demo test file:
> > https://github.com/ioguix/check_pgactivity/blob/tests/t/01-streaming_delta.t
> > 
> > My limited set of tests are working with versions back to 9.0 so far.
> > 
> > My 2¢  
> 
> Hmm, I took a look.  I'm not sure that we're working on the same problem, but
> I might have missed something.

I understood you were working on making PostgresNode compatible with older
versions of PostgreSQL, so you can create and interact with older versions,
e.g. 9.0. Did I misunderstand?

My set of classes has the exact same goal: creating and managing PostgreSQL
nodes from various major versions. It goes a bit further than just init()
though, as it supports some more existing methods (e.g. enable_streaming) and
adds some others (version, switch_wal, wait_for_archive).

Regards,




Re: multi-install PostgresNode fails with older postgres versions

2021-04-07 Thread Jehan-Guillaume de Rorthais
On Wed, 7 Apr 2021 11:54:36 -0400
Andrew Dunstan  wrote:

> On 4/7/21 10:37 AM, Jehan-Guillaume de Rorthais wrote:
> > Hi all,
> >
> > First, sorry to step in this discussion this late. I didn't noticed it
> > before :(
> >
> > I did some work about these compatibility issues in late 2020 to use
> > PostgresNode in the check_pgactivity TAP tests.
> >
> > See https://github.com/ioguix/check_pgactivity/tree/tests/t/lib
> >
> > PostgresNode.pm, TestLib.pm, SimpleTee.pm and RecursiveCopy.pm comes
> > unchanged from PostgreSQL source file (see headers and COPYRIGHT.pgsql).
> >
> > Then, I'm using the facet class PostgresNodeFacet to extend it with some
> > more methods. Finaly, I created one class per majpr version, each
> > inheriting from the next version. That means 13 inherits from
> > PostgresNodeFacet.pm, 12 inherits from 13, 11 inherits from 12, etc.
> >
> > When I'm creating a new node, I'm using the "pgaTester" factory class. It
> > relies on PATH to check the major version using pg_config, then loads the
> > appropriate class.
> >
> > That means some class overrides almost no methods but version(), which
> > returns the major version. Eg.:
> > https://github.com/ioguix/check_pgactivity/blob/tests/t/lib/PostgresNode12.pm
> >
> > From tests, I can check the node version using this method, eg.:
> >
> >   skip "skip non-compatible test on PostgreSQL 8.0 and before", 3
> > unless $node->version <= 8.0;
> >
> > Of course, there's a lot of duplicate code between classes, but my main goal
> > was to keep PostgresNode.pm unchanged from upstream so I can easily update
> > it.
> >
> > And here is a demo test file:
> > https://github.com/ioguix/check_pgactivity/blob/tests/t/01-streaming_delta.t
> >
> > My limited set of tests are working with versions back to 9.0 so far.
> >
> > My 2¢  
> 
> 
> 
> I don't really want to create a multitude of classes. I think Mark is
> basically on the right track. I started off using a subclass of
> PostgresNode but was persuaded not to go down that route, and instead we
> have made some fairly minimal changes to PostgresNode itself. I think
> that was a good decision. If you take out the versioning subroutines,
> the actual version-aware changes Mark is proposing to PostgresNode are
> quite small - less that 200 lines supporting versions all the way back
> to 8.1. That's pretty awesome.


I took this road because as soon as you want to use some other methods like
enable_streaming, enable_archiving, etc., you find many more incompatibilities
on your way. My current stack of classes is backward compatible for many more
methods than just init(). But I admit it creates a multitude of classes and
some duplicate code...

It's still possible to patch each method in PostgresNode, but you'll end up
with a forest of conditional blocks depending on how far you want to go with
old PostgreSQL versions.

Regards,




Re: multi-install PostgresNode fails with older postgres versions

2021-04-07 Thread Jehan-Guillaume de Rorthais
Hi all,

First, sorry for stepping into this discussion this late. I didn't notice it before :(

I did some work on these compatibility issues in late 2020 to use PostgresNode
in the check_pgactivity TAP tests.

See https://github.com/ioguix/check_pgactivity/tree/tests/t/lib

PostgresNode.pm, TestLib.pm, SimpleTee.pm and RecursiveCopy.pm come unchanged
from the PostgreSQL source files (see headers and COPYRIGHT.pgsql).

Then, I'm using the facet class PostgresNodeFacet to extend it with some more
methods. Finally, I created one class per major version, each inheriting from
the next version. That means 13 inherits from PostgresNodeFacet.pm, 12
inherits from 13, 11 inherits from 12, etc.

When I'm creating a new node, I'm using the "pgaTester" factory class. It
relies on PATH to check the major version using pg_config, then loads the
appropriate class.

That means some classes override almost no methods except version(), which
returns the major version. E.g.:
https://github.com/ioguix/check_pgactivity/blob/tests/t/lib/PostgresNode12.pm

From tests, I can check the node version using this method, eg.:

  skip "skip non-compatible test on PostgreSQL 8.0 and before", 3
unless $node->version <= 8.0;

Of course, there's a lot of duplicate code between classes, but my main goal
was to keep PostgresNode.pm unchanged from upstream so I can easily update it.

And here is a demo test file:
https://github.com/ioguix/check_pgactivity/blob/tests/t/01-streaming_delta.t

My limited set of tests are working with versions back to 9.0 so far.

My 2¢




Re: [PATCH] Disable bgworkers during servers start in pg_upgrade

2021-01-27 Thread Jehan-Guillaume de Rorthais
Oh, I forgot another point before sending my previous email.

Maybe it might be worth adding some final safety checks in pg_upgrade itself?
E.g. checking the coherency of the control data and mxid files between the old
and new clusters would have caught the inconsistency here.




Re: [PATCH] Disable bgworkers during servers start in pg_upgrade

2021-01-27 Thread Jehan-Guillaume de Rorthais
Hi, 

On Wed, 27 Jan 2021 11:25:11 +0100
Denis Laxalde  wrote:

> Andres Freund a écrit :
> > On 2021-01-21 16:23:58 +0100, Denis Laxalde wrote:  
> > > We found an issue in pg_upgrade on a cluster with a third-party
> > > background worker. The upgrade goes fine, but the new cluster is then in
> > > an inconsistent state. The background worker comes from the PoWA
> > > extension but the issue does not appear to related to this particular
> > > code.  
> > 
> > Well, it does imply that that backgrounder did something, as the pure
> > existence of a bgworker shouldn't affect anything. Presumably the issue is
> > that the bgworker actually does transactional writes, which causes problems
> > because the xids / multixacts from early during pg_upgrade won't actually
> > be valid after we do pg_resetxlog etc.

Indeed, it does some writes. As soon as the powa bgworker starts, it takes
"snapshots" of the pg_stat_statements state and records them in a local table.
To avoid concurrent runs, it takes a lock on some of its local rows using
SELECT FOR UPDATE, hence the mxid consumption.

The inconsistency occurs in at least two places:

* the datminmxid and relminmxid fields pg_dump(all)'ed and restored in the
  new cluster;
* the multixact fields in the control file, read during the check phase and
  restored later using pg_resetxlog.

> > > As a solution, it seems that, for similar reasons that we restrict
> > > socket access to prevent accidental connections (from commit
> > > f763b77193), we should also prevent background workers to start at this
> > > step.  
> > 
> > I think that'd have quite the potential for negative impact - [...]
> 
> Thank you for those insights!

+1

> > I wonder if we could
> > 
> > a) set default_transaction_read_only to true, and explicitly change it
> >in the sessions that need that.

According to Denis' tests discussed off-list, it works fine with regard to the
powa bgworker, albeit with some complaints in the logs. However, I wonder how
fragile it could be, as a bgworker could easily manipulate either the GUC or
use "BEGIN READ WRITE". I realize these are really uncommon practices, but
bgworker code from third parties might be surprising.

> > b) when in binary upgrade mode / -b, error out on all wal writes in
> >sessions that don't explicitly set a session-level GUC to allow
> >writes.

It feels safer because it is more specific to the subject. But I wonder if the
benefit is worth adding some (limited?) complexity and a small/quick
conditional block in a very hot code path for a very limited use case.

What about c), where bgworkers are not loaded by default during binary upgrade
mode / -b unless they explicitly set a bgw_flags bit (BGWORKER_BINARY_UPGRADE?)
because they are required during pg_upgrade?
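
For illustration only, a minimal sketch of what the extension side could look
like. BGWORKER_BINARY_UPGRADE is the hypothetical flag proposed above (it does
not exist today); the other flags, fields and the registration call are the
existing bgworker API, and the library/function names are made up:

  #include "postgres.h"
  #include "fmgr.h"
  #include "postmaster/bgworker.h"

  PG_MODULE_MAGIC;

  void _PG_init(void);

  void
  _PG_init(void)
  {
      BackgroundWorker worker;

      memset(&worker, 0, sizeof(worker));
      worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                         BGWORKER_BACKEND_DATABASE_CONNECTION |
                         BGWORKER_BINARY_UPGRADE;  /* hypothetical: opt in to
                                                    * run during pg_upgrade */
      worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
      worker.bgw_restart_time = BGW_DEFAULT_RESTART_INTERVAL;
      snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_extension");
      snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_worker_main");
      snprintf(worker.bgw_name, BGW_MAXLEN, "my extension worker");
      RegisterBackgroundWorker(&worker);
  }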

Regards,




Re: vacuum -vs reltuples on insert only index

2020-11-09 Thread Jehan-Guillaume de Rorthais
On Wed, 4 Nov 2020 18:44:03 -0800
Peter Geoghegan  wrote:

> On Mon, Nov 2, 2020 at 10:03 AM Peter Geoghegan  wrote:
> > Actually, it seems better to always count num_index_tuples the old way
> > during cleanup-only index VACUUMs, despite the inaccuracy that that
> > creates with posting list tuples.  
> 
> Pushed a fix for this just now.
> 
> Thanks for the report!

Sorry I couldn't give more feedback on your thoughts in time...

Thank you for your investigation and fix!

Regards,




Re: The ultimate extension hook.

2020-10-23 Thread Jehan-Guillaume de Rorthais
On Thu, 24 Sep 2020 17:08:44 +1200
David Rowley  wrote:
[...]
> I wondered if there was much in the way of use-cases like a traffic
> filter, or statement replication. I wasn't sure if it was a solution
> looking for a problem or not, but it seems like it could be productive
> to talk about possibilities here and make a judgement call based on if
> any alternatives exist today that will allow that problem to be solved
> sufficiently in another way.

If I understand the proposal correctly, this might enable traffic capture
using a loadable extension.

This kind of usage would allow replaying and validating any kind of traffic
from one major version to another, e.g. to look for regressions from the
application point of view before a major upgrade.

I did such regression tests in the past. We were capturing production traffic
using libpcap and replaying it with pgshark on an upgraded test environment.
Very handy. However:

* libpcap can drop network packets during high load, which makes the capture
  painful to recover past the hole.
* it is useless with encrypted traffic.

So, +1 for such hooks.

Regards,




vacuum -vs reltuples on insert only index

2020-10-23 Thread Jehan-Guillaume de Rorthais
Hello,

I've found a behavior change with pg_class.reltuples on btree indexes. With
only insert activity on a table, when an index is processed, its reltuples is
set to 0. Here is a demo script:

  -- force index cleanup
  set vacuum_cleanup_index_scale_factor to 0;

  drop table if exists t;
  create table t as select i from generate_series(1, 100) i;
  create index t_i on t(i);

  -- after index creation its reltuples is correct
  select reltuples from pg_class where relname = 't_i';
  -- result: reltuples | 100

  -- vacuum set index reltuples to 0
  vacuum t;
  select reltuples from pg_class where relname = 't_i';
  -- result: reltuples | 0

  -- analyze set it back to correct value
  analyze t;
  select reltuples from pg_class where relname = 't_i';
  -- result: reltuples | 100

  -- insert + vacuum reset it again to 0
  insert into t values(101);
  vacuum (verbose off, analyze on, index_cleanup on) t;
  select reltuples from pg_class where relname = 't_i';
  -- result: reltuples | 0

  -- delete + vacuum set it back to correct value
  delete from t where i=10;
  vacuum (verbose off, analyze on, index_cleanup on) t;
  select reltuples from pg_class where relname = 't_i';
  -- result: reltuples | 100

  -- and back to 0 again with insert+vacuum
  insert into t values(102);
  vacuum (verbose off, analyze on, index_cleanup on) t;
  select reltuples from pg_class where relname = 't_i';
  -- result: reltuples | 0

Before 0d861bbb70, btvacuumpage() was adding the number of remaining items on
the page to the relation stats using:

  stats->num_index_tuples += maxoff - minoff + 1;

After 0d861bbb70, it is incremented using the new variable nhtidslive:

  stats->num_index_tuples += nhtidslive;

However, nhtidslive is only incremented if callback (IndexBulkDeleteCallback)
is set, which seems not to be the case on a select-only workload.

A naive fix might be to use "maxoff - minoff + 1" when callback==NULL.
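
For the record, a minimal sketch of that naive fix inside btvacuumpage(),
using the variable names quoted above (untested):

  if (callback == NULL)
      stats->num_index_tuples += maxoff - minoff + 1;
  else
      stats->num_index_tuples += nhtidslive;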

Thoughts?

Regards,




Re: [patch] demote

2020-09-01 Thread Jehan-Guillaume de Rorthais
On Tue, 18 Aug 2020 17:41:31 +0200
Jehan-Guillaume de Rorthais  wrote:

> Hi,
> 
> Please find in attachment v5 of the patch set rebased on master after various
> conflicts.
> 
> Regards,
> 
> On Wed, 5 Aug 2020 00:04:53 +0200
> Jehan-Guillaume de Rorthais  wrote:
> 
> > Demote now keeps backends with no active xid alive. Smart mode keeps all
> > backends: it waits for them to finish their xact and enter read-only. Fast
> > mode terminate backends wit an active xid and keeps all other ones.
> > Backends enters "read-only" using LocalXLogInsertAllowed=0 and flip it to -1
> > (check recovery state) once demoted.
> > During demote, no new session is allowed.
> > 
> > As backends with no active xid survive, a new SQL admin function
> > "pg_demote(fast bool, wait bool, wait_seconds int)" had been added.

Just to keep the list informed, I found a race condition leading to backends
trying to write to the XLog after they have processed the demote signal. E.g.:

  [posmaster]  LOG:  all backends in read only
  [checkpointer]   LOG:  demoting
  [backend]  PANIC:  cannot make new WAL entries during recovery
 STATEMENT:  UPDATE pgbench_accounts [...]

Because of this, the postmaster enters crash recovery while the demote
sequence is in progress.

I have a couple of other subjects right now, but I plan to get back to it soon.

Regards,




Re: [patch] demote

2020-08-18 Thread Jehan-Guillaume de Rorthais
Hi,

Please find attached v5 of the patch set, rebased on master after various
conflicts.

Regards,

On Wed, 5 Aug 2020 00:04:53 +0200
Jehan-Guillaume de Rorthais  wrote:

> Demote now keeps backends with no active xid alive. Smart mode keeps all
> backends: it waits for them to finish their xact and enter read-only. Fast
> mode terminate backends wit an active xid and keeps all other ones.
> Backends enters "read-only" using LocalXLogInsertAllowed=0 and flip it to -1
> (check recovery state) once demoted.
> During demote, no new session is allowed.
> 
> As backends with no active xid survive, a new SQL admin function
> "pg_demote(fast bool, wait bool, wait_seconds int)" had been added.
> 
> Demote now relies on sigusr1 instead of hijacking sigterm/sigint and pmdie().
> The resulting refactoring makes the code much simpler, cleaner, with better
> isolation of actions from the code point of view.
> 
> Thanks to the refactoring, the patch now only adds one state to the state
> machine: PM_DEMOTING. A second one could be use to replace:
> 
> /* Demoting: start the Startup Process */
> if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0)
> 
> with eg.:
> 
> if (pmState == PM_DEMOTED)
> 
> I believe it might be a bit simpler to understand, but the existing comment
> might be good enough as well. The full state machine path for demote is:
> 
>  PM_DEMOTING  /* wait for active xid backend to finish */
>  PM_SHUTDOWN  /* wait for checkpoint shutdown and its 
>  various shutdown tasks */
>  PM_SHUTDOWN && !CheckpointerPID /* aka PM_DEMOTED: start Startup process */
>  PM_STARTUP
> 
> Tests in "recovery/t/021_promote-demote.pl" grows from 13 to 24 tests,
> adding tests on backend behaviors during demote and new function pg_demote().
> 
> On my todo:
> 
> * cancel running checkpoint for fast demote ?
> * forbid demote when PITR backup is in progress
> * user documentation
> * Robert's concern about snapshot during hot standby
> * anything else reported to me
> 
> Plus, I might be able to split the backend part and their signals of the patch
> 0002 in its own patch if it helps the review. It would apply after 0001 and
> before actual 0002.
> 
> As there was no consensus and the discussions seemed to conclude this patch
> set should keep growing to see where it goes, I wonder if/when I should add it
> to the commitfest. Advice? Opinion?
>From 90e26a7e2d53f7a8436de0b73eb57498f884de9d Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Fri, 31 Jul 2020 10:58:40 +0200
Subject: [PATCH 1/4] demote: setter functions for LocalXLogInsertAllowed
 local variable

Adds extern functions LocalSetXLogInsertNotAllowed() and
LocalSetXLogInsertCheckRecovery() to set the local variable
LocalXLogInsertAllowed respectively to 0 and -1.

These functions are declared as extern for future need in
the demote patch.

Function LocalSetXLogInsertAllowed() already exists and is
declared as static as it is not needed outside of xlog.c.
---
 src/backend/access/transam/xlog.c | 27 +++
 src/include/access/xlog.h |  2 ++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae..c0d79f192c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7711,7 +7711,7 @@ StartupXLOG(void)
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
+	LocalSetXLogInsertCheckRecovery();
 
 	if (InRecovery)
 	{
@@ -8219,6 +8219,25 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+/*
+ * Make XLogInsertAllowed() return false in the current process only.
+ */
+void
+LocalSetXLogInsertNotAllowed(void)
+{
+	LocalXLogInsertAllowed = 0;
+}
+
+/*
+ * Make XLogInsertAllowed() fall back to checking the recovery state in the
+ * current process only.
+ */
+void
+LocalSetXLogInsertCheckRecovery(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -9004,9 +9023,9 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		if (flags & CHECKPOINT_END_OF_RECOVERY)
-			LocalXLogInsertAllowed = -1;	/* return to "check" state */
+			LocalSetXLogInsertCheckRecovery(); /* return to "check" state */
 		else
-			LocalXLogInsertAllowed = 0; /* never again write WAL */
+			LocalSetXLogInsertNotAllowed(); /* never again write WAL */
 	}
 
 	/*
@@ -9159,7 +9178,7 @@ CreateEndOfRecoveryRecord(void)
 
 	END_CRIT_SECTION();
 
-	LocalXLogInsertAllowed = -1;	/* return to "check" state */
+	LocalSetXLogInsertCheckRecovery();	/* return to "check" state */
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/incl

Re: [patch] demote

2020-08-04 Thread Jehan-Guillaume de Rorthais
Hi,

Yet another summary + patch + tests.

Demote now keeps backends with no active xid alive. Smart mode keeps all
backends: it waits for them to finish their xact and enter read-only. Fast
mode terminates backends with an active xid and keeps all other ones.
Backends enter "read-only" using LocalXLogInsertAllowed=0 and flip it to -1
(check recovery state) once demoted.
During demote, no new session is allowed.
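
For readers following from afar, here is a minimal standalone model of this
tri-state flag, as an illustration of the semantics only (not code from the
patch):

#include <stdbool.h>
#include <stdio.h>

/* -1 = check the recovery state, 0 = never allowed, 1 = allowed */
static int  LocalXLogInsertAllowed = -1;
static bool in_recovery = false;	/* stand-in for RecoveryInProgress() */

static bool
xlog_insert_allowed(void)
{
	if (LocalXLogInsertAllowed >= 0)
		return LocalXLogInsertAllowed != 0;
	return !in_recovery;			/* -1: fall back to the recovery state */
}

int
main(void)
{
	printf("in production:  %d\n", xlog_insert_allowed());	/* 1 */

	LocalXLogInsertAllowed = 0;		/* backend enters "read-only" */
	printf("while demoting: %d\n", xlog_insert_allowed());	/* 0 */

	LocalXLogInsertAllowed = -1;	/* back to "check recovery state" */
	in_recovery = true;				/* the instance is a standby now */
	printf("once demoted:   %d\n", xlog_insert_allowed());	/* 0 */
	return 0;
}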

As backends with no active xid survive, a new SQL admin function
"pg_demote(fast bool, wait bool, wait_seconds int)" had been added.

Demote now relies on sigusr1 instead of hijacking sigterm/sigint and pmdie().
The resulting refactoring makes the code much simpler, cleaner, with better
isolation of actions from the code point of view.

Thanks to the refactoring, the patch now only adds one state to the state
machine: PM_DEMOTING. A second one could be used to replace:

/* Demoting: start the Startup Process */
if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0)

with eg.:

if (pmState == PM_DEMOTED)

I believe it might be a bit simpler to understand, but the existing comment
might be good enough as well. The full state machine path for demote is:

 PM_DEMOTING  /* wait for active xid backend to finish */
 PM_SHUTDOWN  /* wait for checkpoint shutdown and its 
 various shutdown tasks */
 PM_SHUTDOWN && !CheckpointerPID /* aka PM_DEMOTED: start Startup process */
 PM_STARTUP
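
To make the "aka PM_DEMOTED" condition concrete, here is the same test
written as a standalone predicate (placeholder enum and names, not the actual
postmaster code):

#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/* only the states involved in the demote path are modeled here */
typedef enum { PM_DEMOTING, PM_SHUTDOWN, PM_STARTUP } PMState;

/* the existing test, which a PM_DEMOTED state would merely rename */
static bool
demote_checkpoint_done(bool demote_signal, PMState pm_state,
					   pid_t checkpointer_pid)
{
	return demote_signal && pm_state == PM_SHUTDOWN && checkpointer_pid == 0;
}

int
main(void)
{
	/* shutdown checkpoint written and checkpointer exited: start Startup */
	if (demote_checkpoint_done(true, PM_SHUTDOWN, 0))
		printf("-> PM_STARTUP: launch the Startup process, enter recovery\n");
	return 0;
}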

Tests in "recovery/t/021_promote-demote.pl" grows from 13 to 24 tests,
adding tests on backend behaviors during demote and new function pg_demote().

On my todo:

* cancel running checkpoint for fast demote ?
* forbid demote when PITR backup is in progress
* user documentation
* Robert's concern about snapshot during hot standby
* anything else reported to me

Plus, I might be able to split the backend part and its signals out of patch
0002 into its own patch if it helps the review. It would apply after 0001 and
before the actual 0002.

As there was no consensus and the discussions seemed to conclude this patch set
should keep growing to see where it goes, I wonder if/when I should add it to
the commitfest. Advice? Opinion?

Regards,
>From da3c4575f8ea40c089483b9cfa209db4993148ff Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Fri, 31 Jul 2020 10:58:40 +0200
Subject: [PATCH 1/4] demote: setter functions for LocalXLogInsertAllowed
 local variable

Adds extern functions LocalSetXLogInsertNotAllowed() and
LocalSetXLogInsertCheckRecovery() to set the local variable
LocalXLogInsertAllowed respectively to 0 and -1.

These functions are declared as extern for future need in
the demote patch.

Function LocalSetXLogInsertAllowed() already exists and is
declared as static as it is not needed outside of xlog.c.
---
 src/backend/access/transam/xlog.c | 27 +++
 src/include/access/xlog.h |  2 ++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 756b838e6a..25a9f78690 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7711,7 +7711,7 @@ StartupXLOG(void)
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
+	LocalSetXLogInsertCheckRecovery();
 
 	if (InRecovery)
 	{
@@ -8219,6 +8219,25 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+/*
+ * Make XLogInsertAllowed() return false in the current process only.
+ */
+void
+LocalSetXLogInsertNotAllowed(void)
+{
+	LocalXLogInsertAllowed = 0;
+}
+
+/*
+ * Make XLogInsertAllowed() fall back to checking the recovery state in the
+ * current process only.
+ */
+void
+LocalSetXLogInsertCheckRecovery(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -9004,9 +9023,9 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		if (flags & CHECKPOINT_END_OF_RECOVERY)
-			LocalXLogInsertAllowed = -1;	/* return to "check" state */
+			LocalSetXLogInsertCheckRecovery(); /* return to "check" state */
 		else
-			LocalXLogInsertAllowed = 0; /* never again write WAL */
+			LocalSetXLogInsertNotAllowed(); /* never again write WAL */
 	}
 
 	/*
@@ -9159,7 +9178,7 @@ CreateEndOfRecoveryRecord(void)
 
 	END_CRIT_SECTION();
 
-	LocalXLogInsertAllowed = -1;	/* return to "check" state */
+	LocalSetXLogInsertCheckRecovery();	/* return to "check" state */
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..8c9cadc6da 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,8 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void LocalSetXLogInsertNotAllowed(void);
+extern void LocalSetXLogInsertCheckRecovery(void);

BUG #15285: Query used index over field with ICU collation in some cases wrongly return 0 rows

2020-07-15 Thread Jehan-Guillaume de Rorthais
Hi,

I'm bumping this thread on pgsql-hackers, hoping it will draw some more
opinions and discussion.

Should we try to fix this issue or not? This is clearly an upstream bug. It has
been reported, including regression tests, but it hasn't moved for two years
now.

If we choose not to fix it on our side using e.g. a workaround (see patch), I
suppose this small bug should be documented somewhere so people are not left
lost in the wild.
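
For context, the workaround simply switches to ICU's iterator-based
comparison entry point. A minimal standalone use of it outside PostgreSQL
looks like the sketch below (plain ICU C API, nothing from the patch itself;
build with cc demo.c $(pkg-config --cflags --libs icu-uc icu-i18n)):

#include <stdio.h>
#include <string.h>
#include <unicode/ucol.h>
#include <unicode/uiter.h>

int
main(void)
{
	UErrorCode	status = U_ZERO_ERROR;
	UCollator  *coll = ucol_open("fr_FR", &status);
	UCharIterator iter1,
				iter2;
	UCollationResult cmp;
	const char *s1 = "cote";
	const char *s2 = "côté";

	if (U_FAILURE(status))
	{
		fprintf(stderr, "ucol_open failed: %s\n", u_errorName(status));
		return 1;
	}

	/* feed the UTF-8 buffers directly, as the patch does for UTF8 databases */
	uiter_setUTF8(&iter1, s1, (int32_t) strlen(s1));
	uiter_setUTF8(&iter2, s2, (int32_t) strlen(s2));

	cmp = ucol_strcollIter(coll, &iter1, &iter2, &status);
	if (U_FAILURE(status))
	{
		fprintf(stderr, "collation failed: %s\n", u_errorName(status));
		ucol_close(coll);
		return 1;
	}

	printf("comparison result: %d\n", (int) cmp);	/* <0, 0 or >0 */
	ucol_close(coll);
	return 0;
}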

Opinions?

Regards,

Begin forwarded message:

Date: Sat, 13 Jun 2020 00:43:22 +0200
From: Jehan-Guillaume de Rorthais 
To: Thomas Munro , Peter Geoghegan 
Cc: Роман Литовченко , PostgreSQL mailing lists
 Subject: Re: BUG #15285: Query used index
over field with ICU collation in some cases wrongly return 0 rows


On Fri, 12 Jun 2020 18:40:55 +0200
Jehan-Guillaume de Rorthais  wrote:

> On Wed, 10 Jun 2020 00:29:33 +0200
> Jehan-Guillaume de Rorthais  wrote:
> [...]  
> > After playing with ICU regression tests, I found functions ucol_strcollIter
> > and ucol_nextSortKeyPart are safe. I'll do some performance tests and report
> > here.  
> 
> I did some benchmarks. See attachment for the script and its header to
> reproduce.
> 
> It sorts 935895 French phrases from 0 to 122 chars, with an average of 49.
> Performance tests were done on current master HEAD (buggy) and using the patch
> in attachment, relying on ucol_strcollIter.
> 
> My preliminary test with ucol_getSortKey was catastrophic, as we might
> expect. 15-17x slower than the current HEAD. So I removed it from actual
> tests. I didn't try with ucol_nextSortKeyPart though.
> 
> Using ucol_strcollIter is ~20% slower than HEAD on UTF8 databases, but
> this might be acceptable. Here are the numbers:
> 
>   DB Encoding   HEAD   strcollIter   ratio
>   UTF8          2.74   3.27          1.19x
>   LATIN1        5.34   5.40          1.01x
> 
> I plan to add a regression test soon.  

Please, find in attachment the second version of the patch, with a
regression test.

Regards,


-- 
Jehan-Guillaume de Rorthais
Dalibo
>From bb428135490caafe23562e3dcd9925d7d7c5a7be Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Thu, 11 Jun 2020 18:08:31 +0200
Subject: [PATCH] Replace buggy ucol_strcoll* funcs with ucol_strcollIter

Functions ucol_strcoll and ucol_strcollUTF8 break some collation rules.
This leads to wrong sort order or wrong results from indexes using such
collations. See bug report #15285 for details:
http://postgr.es/m/153201618542.1404.3611626898935613264%40wrigleys.postgresql.org

This fix replaces the ucol_strcoll* functions with ucol_strcollIter(), which is
not affected by this bug.

Reported-by: Roman Lytovchenko
Analysed-by: Peter Geoghegan
Author: Jehan-Guillaume de Rorthais
---
 src/backend/utils/adt/varlena.c   | 54 ---
 src/include/utils/pg_locale.h | 14 -
 .../regress/expected/collate.icu.utf8.out | 12 +
 src/test/regress/sql/collate.icu.utf8.sql |  8 +++
 4 files changed, 56 insertions(+), 32 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 2eaabd6231..f156c00a1a 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1638,34 +1638,43 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 			if (mylocale->provider == COLLPROVIDER_ICU)
 			{
 #ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
 if (GetDatabaseEncoding() == PG_UTF8)
 {
 	UErrorCode	status;
+	UCharIterator	iter1,
+	iter2;
 
 	status = U_ZERO_ERROR;
-	result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-			  arg1, len1,
-			  arg2, len2,
-			  &status);
+
+	uiter_setUTF8(&iter1, arg1, len1);
+	uiter_setUTF8(&iter2, arg2, len2);
+
+	result = ucol_strcollIter(mylocale->info.icu.ucol,
+			  &iter1, &iter2, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 (errmsg("collation failed: %s", u_errorName(status;
 }
 else
-#endif
 {
+	UErrorCode	status;
+	UCharIterator	iter1,
+	iter2;
 	int32_t		ulen1,
 ulen2;
 	UChar	   *uchar1,
 			   *uchar2;
 
+	status = U_ZERO_ERROR;
+
 	ulen1 = icu_to_uchar(&uchar1, arg1, len1);
 	ulen2 = icu_to_uchar(&uchar2, arg2, len2);
 
-	result = ucol_strcoll(mylocale->info.icu.ucol,
-		  uchar1, ulen1,
-		  uchar2, ulen2);
+	uiter_setString(&iter1, uchar1, ulen1);
+	uiter_setString(&iter2, uchar2, ulen2);
+
+	result = ucol_strcollIter(mylocale->info.icu.ucol,
+			  &iter1, &iter2, &status);
 
 	pfree(uchar1);
 	pfree(uchar2);
@@ -2352,34 +2361,43 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 		if (sss->locale->provider == COLLPROVIDER_ICU)
 		{
 #ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
 			if (GetDatabaseEncoding() == PG_UTF8)
 			{
 UErrorCode	status;
+UCh

Re: [patch] demote

2020-07-14 Thread Jehan-Guillaume de Rorthais
On Tue, 14 Jul 2020 12:49:51 -0700
Andres Freund  wrote:

> Hi,
> 
> On 2020-07-14 17:26:37 +0530, Amul Sul wrote:
> > On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais
> >  wrote:  
> > >
> > > Hi,
> > >
> > > Another summary + patch + tests.
> > >
> > > This patch supports 2PC. The goal is to keep them safe during
> > > demote/promote actions so they can be committed/rollbacked later on a
> > > primary. See tests. 
> > 
> > Wondering is it necessary to clear prepared transactions from shared memory?
> > Can't simply skip clearing and restoring prepared transactions while
> > demoting?  
> 
> Recovery doesn't use the normal PGXACT/PGPROC mechanisms to store
> transaction state. So I don't think it'd be correct to leave them around
> in their previous state. We'd likely end up with incorrect snapshots
> if a demoted node later gets promoted...

Indeed. I experienced it while debugging. PGXACT/PGPROC/locks need to
be cleared.




Re: [patch] demote

2020-07-13 Thread Jehan-Guillaume de Rorthais
Hi,

Another summary + patch + tests.

This patch supports 2PC. The goal is to keep prepared transactions safe during
demote/promote actions so they can be committed or rolled back later on a
primary. See tests.

The checkpointer is now shut down after the demote shutdown checkpoint. This
removes some needless code complexity, e.g. having to signal the postmaster
from the checkpointer to keep the demotion going.

Cascading replication is now supported. Wal senders stay active during
demotion but set their local "am_cascading_walsender = true". It took a
rough debug session (thank you rr and tests!) on my side, but it was worth
it. I believe they should stay connected during the demote actions for future
features, e.g. triggering a switchover over the replication protocol using an
admin function.

The first tests have been added in "recovery/t/021_promote-demote.pl". I'll add
some more tests in future versions.

I believe the patch is ready for some preliminary tests and advice or
directions.

On my todo:

* study how to only disconnect or cancel active RW backends
  * ...then add pg_demote() admin function
* cancel running checkpoint for fast demote ?
* user documentation
* Robert's concern about snapshot during hot standby
* some more coding style cleanup/refactoring
* anything else reported to me :)

Thanks,

On Fri, 3 Jul 2020 00:12:10 +0200
Jehan-Guillaume de Rorthais  wrote:

> Hi,
> 
> Here is a small activity summary since last report.
> 
> On Thu, 25 Jun 2020 19:27:54 +0200
> Jehan-Guillaume de Rorthais  wrote:
> [...]
> > I hadn't time to investigate Robert's concern about shared memory for
> > snapshot during recovery.
> 
> I hadn't time to dig very far, but I suppose this might be related to the
> comment in ProcArrayShmemSize(). If I'm right, then it seems the space is
> already allocated as long as hot_standby is enabled. I realize it doesn't
> mean we are on the safe side of the fence though. I still need to get a
> better understanding of this.
> 
> > The patch doesn't deal with prepared xacts yet. Testing
> > "start->demote->promote" raises an assert if some prepared xacts exist. I
> > suppose I will roll them back during demote in the next patch version.
> 
> Rolling back all prepared transactions on demote seems easy. However, I realized
> there's no point in cancelling them. After the demote action, they might still be
> committed later on a promoted instance.
> 
> I am currently trying to clean shared memory for existing prepared transactions
> so they are handled by the startup process during recovery.
> I've been able to clean TwoPhaseState and the ProcArray. I'm now in the
> process of cleaning the remaining prepared xact locks.
> 
> Regards,
> 
> 



-- 
Jehan-Guillaume de Rorthais
Dalibo
>From 4470772702273c720cdea942ed229d59f3a70318 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Fri, 10 Apr 2020 18:01:45 +0200
Subject: [PATCH 1/2] Support demoting instance from production to standby

Architecture:

* creates a shutdown checkpoint on demote
* use DB_DEMOTING state in controlfile
* try to handle subsystems init correctly during demote
* keep some sub-processes alive:
  stat collector, bgwriter and optionally archiver or wal senders
* the code currently use USR1 to signal the wal senders to check
  if they must enable the cascading mode
* ShutdownXLOG takes a boolean arg to handle demote differently
* the checkpointer is restarted for code simplicity

Trivial manual tests:

* make check: OK
* make check-world: OK
* start in production->demote->demote: OK
* start in production->demote->stop: OK
* start in production->demote-> promote: OK
* switch roles between primary and standby (switchover): OK
* commit and check 2PC after demote/promote
* commit and check 2PC after switchover

Discuss/Todo:

* cancel or kill active and idle in xact RW backends
  * keep RO backends
  * pg_demote() function?
* code reviewing, arch, analysis, checks, etc
* investigate snapshots shmem needs/init during recovery compare to
  production
* add doc
* cancel running checkpoint during demote
  * replace with a END_OF_PRODUCTION xlog record?
---
 src/backend/access/transam/twophase.c   |  95 +++
 src/backend/access/transam/xlog.c   | 315 
 src/backend/postmaster/checkpointer.c   |  28 +++
 src/backend/postmaster/postmaster.c | 261 +++-
 src/backend/replication/walsender.c |   1 +
 src/backend/storage/ipc/procarray.c |   2 +
 src/backend/storage/ipc/procsignal.c|   4 +
 src/backend/storage/lmgr/lock.c |  12 +
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/bin/pg_ctl/pg_ctl.c | 111 +
 src/include/access/twophase.h   |   1 +
 src/include/access/xlog.h   |  19 +-
 src/include/catalog/pg_control.h|   1 +
 src/include/libpq/libpq-be.h

Re: [patch] demote

2020-07-02 Thread Jehan-Guillaume de Rorthais
Hi,

Here is a small activity summary since last report.

On Thu, 25 Jun 2020 19:27:54 +0200
Jehan-Guillaume de Rorthais  wrote:
[...]
> I hadn't time to investigate Robert's concern about shared memory for snapshot
> during recovery.

I hadn't time to dig very far, but I suppose this might be related to the
comment in ProcArrayShmemSize(). If I'm right, then it seems the space is
already allocated as long as hot_standby is enabled. I realize it doesn't mean
we are on the safe side of the fence though. I still need to get a better
understanding of this.

> The patch doesn't deal with prepared xacts yet. Testing
> "start->demote->promote" raises an assert if some prepared xacts exist. I
> suppose I will roll them back during demote in the next patch version.

Rolling back all prepared transactions on demote seems easy. However, I realized
there's no point in cancelling them. After the demote action, they might still be
committed later on a promoted instance.

I am currently trying to clean shared memory for existing prepared transactions
so they are handled by the startup process during recovery.
I've been able to clean TwoPhaseState and the ProcArray. I'm now in the
process of cleaning the remaining prepared xact locks.

Regards,




Re: Remove Deprecated Exclusive Backup Mode

2020-07-02 Thread Jehan-Guillaume de Rorthais
On Thu, 2 Jul 2020 12:32:14 +0200
Magnus Hagander  wrote:
[...]
> > non-exclusive backup...this is not that easy anymore. And
> > pg_is_in_backup() is quite misleading if the admin found it without reading
> > doc. Maybe an admin 
> 
> Yeah, as it is now it should really be called pg_is_in_exclusive_backup().

+1
We came to the same conclusion while discussing with Gilles Darold this
morning. But considering the discussion below, this might not be needed anyway.

> > function to list in progress non-exclusive backup and related backend pid
> > might
> > be a good start?
> >  
> 
> I think it would make perfect sense to show manual (exclusive or
> non-exclusive) base backups in pg_stat_progress_basebackup. There's no
> fundamental need that one should only be for base backups taken with
> pg_basebackup. In fact, I'd argue that's a mistake in the view in v13 that
> it does not include this information.
> 
> It could have "phase" set to "manual non-exclusive" for example, and leave
> the other fields at NULL.

Even in manual non-exclusive mode we could define some different phases:

 * waiting for checkpoint to finish
 * external backup
 * waiting for wal archiving to finish

So maybe exposing the backup label and/or the method might be useful as well?

> I guess in theory even for exclusive ones, with the pid column set to NULL.

We are discussing removing the mandatory connection during the non-exclusive
backup. If it becomes optional at some point, this PID might be NULL as well...

> (As Stephen mentioned at some point in the future we might also want to
> extend it to allow these tools to report their progress as well into it,
> probably by just calling an admin function on the connection that they
> already have).
> 
> 
> > Tools like pg_basebackup (and probably pgbackrest/barman to some extends)
> > still need the backup to abort on disconnection. Maybe it could flag its
> > session using the replication protocol or call a new admin function or load
> > a hook on session-shutdown to keep previous API behavior?
> 
> Suddenly requiring a replication protocol connection for one thing, when
> all their other things are done over a normal one, seems really terrible.

Sorry, I explained that poorly. Adding such a flag to the replication protocol was
only meant for pg_basebackup. Other tools would use an admin function to achieve
the same goal.

> But having an admin function would work.
> 
> But anything requiring loading of a hook is going to raise the bar
> massively as suddenly somebody needs to install a compiled binary on the
> server in order to use it.

Contrib already provides many useful tools. How would it be different from e.g.
pg_archivecleanup? IIRC, an end-of-session hook was discussed in the past, but I
did not follow the thread at the time. Anyway, an admin function would
probably do the job as well.

> But calling a separate function is pretty much > what I suggested upthread
> (except I suggested a new version of pg_stop_backup, but implementation wise)
> 
> But I'm not sure what previous API behavior you're looking to preserve here?

Current non-exclusive backups are automatically aborted when the
pg_start_backup() session disappears without calling pg_stop_backup(). I suppose
this is the API behavior Robert is concerned about losing when he writes about
people forgetting to end backups.

Now that you mention Stephen's idea of reporting the progress of a backup,
this kind of API could be used as a watchdog as well: if the frontend did not
report before some optional (user-defined) timeout, abort the backup.
We would have a behavior "similar" to the current one.

To be clear, I suppose these admin functions could be called from any backend
session, taking some kind of backup id (eg. the token you were talking about
upthread).

Regards,




Re: Remove Deprecated Exclusive Backup Mode

2020-07-01 Thread Jehan-Guillaume de Rorthais
On Wed, 1 Jul 2020 15:58:57 -0400
Robert Haas  wrote:

> On Wed, Jul 1, 2020 at 3:50 PM Magnus Hagander  wrote:
> > As far as I've seen, the one thing that people have problems with in the
> > exclusive mode backups are precisely the fact that they have to keep a
> > persistent conneciton open, and thus it cannot work together with backup
> > software that is limited to only supporting running a pre- and a post
> > script.
> >
> > Something like I have suggested here is to solve *that* problem. I don't
> > think anybody actually explicitly wants "exclusive backups" -- they want a
> > backup solution that plugs into their world of pre/post scripts. And if we
> > can make that one work in a safer way than the current exclusive backups,
> > ohw is that not an improvement?  
> 
> Yeah, I guess that's a pretty fair point. I have to confess to having
> somewhat limited enthusiasm for adding a third mode here, but it might
> be worth it.
> 
> It seems pretty well inevitable to me that people are going to forget
> to end them.

There are so many ways an admin could break their backup process and not notice
it. Even with the current non-exclusive backup, a badly written script might
create bad, unrecoverable backups.

In regard to your concern, monitoring in-progress backups might be *one*
answer, from the server side. This was "easy" with exclusive backups: we just
monitor the backup_label age and warn if it is older than expected [1]. With
non-exclusive backups... this is not that easy anymore. And pg_is_in_backup() is
quite misleading if the admin finds it without reading the docs. Maybe an admin
function listing in-progress non-exclusive backups and the related backend PIDs
might be a good start?
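
For the record, the exclusive-backup check was really as simple as this kind
of standalone sketch (assumed data directory path and arbitrary threshold,
not the actual check_pgactivity code):

#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

int
main(void)
{
	const char *label = "/var/lib/postgresql/data/backup_label";	/* assumed path */
	const double max_age_s = 6 * 3600;	/* warn after 6 hours, arbitrary */
	struct stat st;

	if (stat(label, &st) != 0)
	{
		printf("OK: no exclusive backup in progress\n");
		return 0;
	}

	if (difftime(time(NULL), st.st_mtime) > max_age_s)
		printf("WARNING: backup_label is older than expected\n");
	else
		printf("OK: exclusive backup in progress\n");
	return 0;
}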

With such admin function facilities, a client-side process could monitor backups
and decide to cancel them based on some internal backup policies/processes.

Tools like pg_basebackup (and probably pgbackrest/barman to some extent) still
need the backup to abort on disconnection. Maybe they could flag their session
using the replication protocol, call a new admin function, or load a
session-shutdown hook to keep the previous API behavior?

Regards,

[1] https://github.com/OPMDG/check_pgactivity/#user-content-backup_label_age-8.1




Re: [patch] demote

2020-07-01 Thread Jehan-Guillaume de Rorthais
On Fri, 26 Jun 2020 16:14:38 +0900 (JST)
Kyotaro Horiguchi  wrote:

> Hello.
> 
> At Thu, 25 Jun 2020 19:27:54 +0200, Jehan-Guillaume de Rorthais
>  wrote in 
> > Here is a summary of my work during the last few days on this demote
> > approach.
> > 
> > Please, find in attachment v2-0001-Demote-PoC.patch and the comments in the
> > commit message and as FIXME in code.
> > 
> > The patch is not finished or bug-free yet; I'm still not very happy with the
> > coding style, and it probably lacks some more code documentation, but a lot has
> > changed since v1. It's still a PoC to push the discussion a bit further
> > after being silent myself for some days.
> > 
> > The patch is currently relying on a demote checkpoint. I understand a forced
> > checkpoint overhead can be massive and cause major wait/downtime. But I keep
> > this for a later step. Maybe we should be able to cancel a running
> > checkpoint? Or leave it to its syncing work but discard the result without
> > writing it to XLog?
> 
> If we are going to dive so close to server shutdown, we can just
> utilize the restart-after-crash path, which we can assume to work
> reliably. The attached is a quite rough sketch, hijacking smart
> shutdown path for a convenience, of that but seems working.  "pg_ctl
> -m s -W stop" lets server demote.

This was actually my very first toy PoC.

However, resetting everything is far from the graceful demote I was seeking.
Moreover, such a patch would not be able to evolve to e.g. keep read-only
backends around.

> > I hadn't time to investigate Robert's concern about shared memory for
> > snapshot during recovery.
> 
> The patch does all required clenaup of resources including shared
> memory, I believe.  It's enough if we don't need to keep any resources
> alive?

Resetting everything might not be enough. If I understand Robert's concern
correctly, it might actually need more shmem for hot standby xact snapshots, or
maybe some shmem initialized differently.

Regards,




Re: [patch] demote

2020-06-25 Thread Jehan-Guillaume de Rorthais
Hi,

Here is a summary of my work during the last few days on this demote approach.

Please, find in attachment v2-0001-Demote-PoC.patch and the comments in the
commit message and as FIXME in code.

The patch is not finished or bug-free yet; I'm still not very happy with the
coding style, and it probably lacks some more code documentation, but a lot has
changed since v1. It's still a PoC to push the discussion a bit further after
being silent myself for some days.

The patch is currently relying on a demote checkpoint. I understand a forced
checkpoint overhead can be massive and cause major wait/downtime. But I keep
this for a later step. Maybe we should be able to cancel a running checkpoint?
Or leave it to its syncing work but discard the result without writing it to
XLog?

I hadn't time to investigate Robert's concern about shared memory for snapshot
during recovery.

The patch doesn't deal with prepared xacts yet. Testing "start->demote->promote"
raises an assert if some prepared xacts exist. I suppose I will roll them back
during demote in the next patch version.

I'm not sure how to divide this patch into multiple small independent steps. I
suppose I can split it like:

1. add demote checkpoint
2. support demote: mostly postmaster, startup/xlog and checkpointer related
   code
3. cli using pg_ctl demote

...But I'm not sure it's worth it.

Regards,
>From 03c41dd706648cd20df90a128db64eee6b6dad97 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Fri, 10 Apr 2020 18:01:45 +0200
Subject: [PATCH] Demote PoC

Changes:

* creates a demote checkpoint
* use DB_DEMOTING state in controlfile
* try to handle subsystems init correctly during demote
* keep some sub-processes alive:
  stat collector, checkpointer, bgwriter and optionally archiver or wal
  senders
* add signal PMSIGNAL_DEMOTING to start the startup process after the
  demote checkpoint
* ShutdownXLOG takes a boolean arg to handle demote differently

Trivial manual tests:

* make check : OK
* make check-world : OK
* start in production -> demote -> demote: OK
* start in production -> demote -> stop : OK
* start in production -> demote -> promote : NOK (2PC, see TODO)
  but OK with no prepared xact.

Discuss/Todo:

* rollback prepared xact
* cancel/kill active/idle in xact R/W backends
  * pg_demote() function?
* some more code reviewing around StartupXlog
* investigate snapshots shmem needs/init during recovery compare to
  production
* add tap tests
* add doc
* how to handle checkpoint?
---
 src/backend/access/rmgrdesc/xlogdesc.c  |   9 +-
 src/backend/access/transam/xlog.c   | 287 +++-
 src/backend/postmaster/checkpointer.c   |  22 ++
 src/backend/postmaster/postmaster.c | 250 -
 src/backend/storage/ipc/procsignal.c|   4 +
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/bin/pg_ctl/pg_ctl.c | 111 +
 src/include/access/xlog.h   |  18 +-
 src/include/catalog/pg_control.h|   2 +
 src/include/libpq/libpq-be.h|   7 +-
 src/include/postmaster/bgwriter.h   |   1 +
 src/include/storage/pmsignal.h  |   1 +
 src/include/storage/procsignal.h|   1 +
 src/include/utils/pidfile.h |   1 +
 14 files changed, 537 insertions(+), 179 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 1cd97852e8..5aeaff18f8 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -40,7 +40,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
 	if (info == XLOG_CHECKPOINT_SHUTDOWN ||
-		info == XLOG_CHECKPOINT_ONLINE)
+		info == XLOG_CHECKPOINT_ONLINE ||
+		info == XLOG_CHECKPOINT_DEMOTE)
 	{
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
@@ -65,7 +66,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		 checkpoint->oldestCommitTsXid,
 		 checkpoint->newestCommitTsXid,
 		 checkpoint->oldestActiveXid,
-		 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
+		 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" :
+			(info == XLOG_CHECKPOINT_DEMOTE)? "demote" : "online");
 	}
 	else if (info == XLOG_NEXTOID)
 	{
@@ -185,6 +187,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_DEMOTE:
+			id = "CHECKPOINT_DEMOTE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e455384b5b..0e18e546ba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6301,6 +6301,13 @@ CheckRequiredParameterValues(void)
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
+/*
+ * FIXME demote: part of the code here assume there's no other active
+

Re: [patch] demote

2020-06-18 Thread Jehan-Guillaume de Rorthais
On Thu, 18 Jun 2020 11:22:47 -0400
Robert Haas  wrote:

> On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao 
> wrote:
> > ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
> > What about the following procedure?
> >
> > 1. Demote the primary to a standby. Then this demoted standby is read-only.
> > 2. The orignal standby automatically establishes the cascading replication
> > connection with the demoted standby.
> > 3. Wait for all the WAL records available in the demoted standby to be
> > streamed to the orignal standby.
> > 4. Promote the original standby to new primary.
> > 5. Change primary_conninfo in the demoted standby so that it establishes
> > the replication connection with new primary.
> >
> > So it seems enough to implement "demote" feature for a clean switchover.  
> 
> There's something to that idea. I think it somewhat depends on how
> invasive the various operations are. For example, I'm not really sure
> how feasible it is to demote without a full server restart that kicks
> out all sessions. If that is required, it's a significant disadvantage
> compared to ASRO. On the other hand, if a machine can be demoted just
> by kicking out R/W sessions, as ASRO currently does, then maybe
> there's not that much difference. Or maybe both designs are subject to
> improvement and we can do something even less invasive...

Considering the current demote patch improvements, I was considering digging in
the following direction:

* add a new state in the state machine where all backends are idle
* this new state forbids any new writes, in the same fashion we do on standby nodes
* this state could either wait for the end of xacts, or cancel/kill
  RW backends, in the same fashion the current smart/fast stops do
* from this state, we might then roll back pending prepared xacts, stop other
  sub-processes etc. (as the current patch does), and demote safely to
  PM_RECOVERY or PM_HOT_STANDBY (depending on the setup).

Is this something worth considering?
Maybe the code will be so close to ASRO that it would just be a kind of fusion of
both patches? I don't know, I haven't looked at the ASRO patch yet.

> One thing I think people are going to want to do is have the master go
> read-only if it loses communication to the rest of the network, to
> avoid or at least mitigate split-brain. However, such network
> interruptions are often transient, so it might not be uncommon to
> briefly go read-only due to a network blip, but then recover quickly
> and return to a read-write state. It doesn't seem to matter much
> whether that read-only state is a new kind of normal operation (like
> what ASRO would do) or whether we've actually returned to a recovery
> state (as demote would do) but the collateral effects of the state
> change do matter.

Well, triggering such actions (demote or read-only) usually follows an external
decision, hopefully relying on at least some quorum, and being able to escalate
to a watchdog or fencing is required.

Most tools around will need to demote or fence. It seems dangerous to flip
between read-only/read-write on a bunch of cluster nodes. It might quickly get
messy, especially since a former primary with non-replicated data could
automatically replicate from a new primary without screaming...

Regards,




Re: [patch] demote

2020-06-18 Thread Jehan-Guillaume de Rorthais
On Thu, 18 Jun 2020 11:15:02 -0400
Robert Haas  wrote:

> On Thu, Jun 18, 2020 at 6:02 AM Jehan-Guillaume de Rorthais
>  wrote:
> > At some expense, Admin can already set the system as readonly from the
> > application point of view, using:
> >
> >   alter system set default_transaction_read_only TO on;
> >   select pg_reload_conf();
> >
> > Current RW xact will finish, but no other will be allowed.  
> 
> That doesn't block all WAL generation, though:
> 
> rhaas=# alter system set default_transaction_read_only TO on;
> ALTER SYSTEM
> rhaas=# select pg_reload_conf();
>  pg_reload_conf
> 
>  t
> (1 row)
> rhaas=# cluster pgbench_accounts_pkey on pgbench_accounts;
> rhaas=#

Yes, this, and the fact that any user can easily switch transaction_read_only
back off. This was a terrible example.

My point was that ALTER SYSTEM READ ONLY as described here doesn't feel like a
required user feature, outside of the demote scope. It might be useful for the
demote process, but only from the core point of view, without user interaction.
It seems there's no other purpose from the admin standpoint.

> There's a bunch of other things it also doesn't block, too. If you're
> trying to switch to a new primary, you really want to stop WAL
> generation completely on the old one. Otherwise, you can't guarantee
> that the machine you're going to promote is completely caught up,
> which means you might lose some changes, and you might have to
> pg_rewind the old master.

Yes, of course. I wasn't claiming transaction_read_only was useful in a
switchover procedure; sorry for the confusion and the misleading comment.

Regards,




Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-18 Thread Jehan-Guillaume de Rorthais
On Thu, 18 Jun 2020 10:52:49 -0400
Robert Haas  wrote:

[...]
> But what you want is that if it does happen to go down for some reason before
> all the WAL is streamed, you can bring it back up and finish streaming the
> WAL without generating any new WAL.

Thanks to cascading replication, this should be quite possible without this READ
ONLY mode, just in recovery mode, shouldn't it?

Regards,




Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-18 Thread Jehan-Guillaume de Rorthais
On Wed, 17 Jun 2020 12:07:22 -0400
Robert Haas  wrote:
[...]

> > Commands that involve a whole
> > bunch of subtle interlocking --- and, therefore, aren't going to work if
> > anything has gone wrong already anywhere in the server --- seem like a
> > particularly poor thing to be hanging your HA strategy on.  
> 
> It's important not to conflate controlled switchover with failover.
> When there's a failover, you have to accept some risk of data loss or
> service interruption; but a controlled switchover does not need to
> carry the same risks and there are plenty of systems out there where
> it doesn't.

Yes. Maybe we should make sure the wording we use is the same for
everyone. I have already heard/read "failover", "controlled failover", "switchover"
or "controlled switchover", which is confusing. My definition of switchover is:

  swapping primary and secondary status between two replicating instances, with
  no data loss. This is a controlled procedure where all steps must succeed to
  complete.
  If a step fails, the procedure falls back to the original primary with no data
  loss.

However, Wikipedia has a broader definition, including situations where the
switchover is executed upon a failure: https://en.wikipedia.org/wiki/Switchover

Regards,




Re: [patch] demote

2020-06-18 Thread Jehan-Guillaume de Rorthais
On Wed, 17 Jun 2020 11:14:47 -0700
Andres Freund  wrote:

> Hi,
> 
> On 2020-06-17 17:44:51 +0200, Jehan-Guillaume de Rorthais wrote:
> > As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar futur
> > objectives than mine, I decided to share the humble patch I am playing with
> > to step down an instance from primary to standby status.  
> 
> To make sure we are on the same page: What your patch intends to do is
> to leave the server running, but switch from being a primary to
> replicating from another system. Correct?

Yes. The instance status is downgraded from "in production" to "in archive
recovery".

Of course, it will start replicating depending on archive_mode/command and
primary_conninfo setup.

> > Main difference with Amul's patch is that all backends must be disconnected
> > to process with the demote. Either we wait for them to disconnect (smart)
> > or we kill them (fast). This makes life much easier from the code point of
> > view, but much more naive as well. Eg. calling "SELECT pg_demote('fast')"
> > as an admin would kill the session, with no options to wait for the action
> > to finish, as we do with pg_promote(). Keeping read only session around
> > could probably be achieved using global barrier as Amul did, but without
> > all the complexity related to WAL writes prohibition.  
> 
> FWIW just doing that for normal backends isn't sufficient, you also have
> to deal with bgwriter, checkpointer, ... triggering WAL writes (FPWs due
> to hint bits, the checkpoint record, and some more).

In fact, the patch relies on an existing code path in the state machine. The
startup process is started when the code enters the PM_NO_CHILDREN state. This
state is set when «These other guys should be dead already», as stated in the
code:

/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
pmState = PM_NO_CHILDREN;

> > There's still some questions in the current patch. As I wrote, it's an
> > humble patch, a proof of concept, a bit naive.
> > 
> > Does it worth discussing it and improving it further or do I miss something
> > obvious in this design that leads to a dead end?  
> 
> I don't think there's a fundamental issue, but I think it needs to deal
> with a lot more things than it does right now. StartupXLOG doesn't
> currently deal correctly with subsystems that are already
> initialized. And your patch doesn't make it so as far as I can tell.

If you are talking about the bgwriter, checkpointer, etc., then as far as I
understand the current state machine, my patch actually deals with them.

Thank you for your feedback!

I'll study how hard it would be to keep read only backends around during the
demote step.

Regards,



