Re: Adding a LogicalRepWorker type field

2023-08-11 Thread Amit Kapila
On Sat, Aug 12, 2023 at 5:01 AM Peter Smith  wrote:
>
> On Fri, Aug 11, 2023 at 9:13 PM Amit Kapila  wrote:
> >
> > On Fri, Aug 11, 2023 at 3:41 PM Peter Smith  wrote:
> > >
> > > On Fri, Aug 11, 2023 at 7:33 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Thu, Aug 10, 2023 at 7:50 AM Peter Smith  
> > > > wrote:
> > > > >
> > > > > >
> > > > > > * If you do the above then there won't be a need to change the
> > > > > > variable name is_parallel_apply_worker in logicalrep_worker_launch.
> > > > > >
> > > > >
> > > > > Done.
> > > > >
> > > >
> > > > I don't think the addition of two new macros isTablesyncWorker() and
> > > > isLeaderApplyWorker() adds much value, so removed those and ran
> > > > pgindent. I am planning to commit this patch early next week unless
> > > > you or others have any comments.
> > > >
> > >
> > > Thanks for considering this patch fit for pushing.
> > >
> > > Actually, I recently found 2 more overlooked places in the launcher.c
> > > code which can benefit from using the isTablesyncWorker(w) macro that
> > > was removed in patch v6-0001.
> > >
> >
> > @@ -1301,7 +1301,7 @@ pg_stat_get_subscription(PG_FUNCTION_ARGS)
> >   worker_pid = worker.proc->pid;
> >
> >   values[0] = ObjectIdGetDatum(worker.subid);
> > - if (OidIsValid(worker.relid))
> > + if (isTablesyncWorker(&worker))
> >   values[1] = ObjectIdGetDatum(worker.relid);
> >
> > I don't see this as a good fit for using isTablesyncWorker(). If we
> > were returning worker_type then using it would be okay.
>
> Yeah, I also wasn't very sure about that one, except it seems
> analogous to the existing code immediately below it, where you could
> say the same thing:
> if (isParallelApplyWorker(&worker))
> values[3] = Int32GetDatum(worker.leader_pid);
>

Fair point. I think it is better to keep the code consistent. So, I'll
merge your changes and push the patch early next week.
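
For reference, a minimal sketch of what the worker-type macros under
discussion might look like, inferred from the fields used in the quoted
snippet (the actual definitions in src/include/replication/worker_internal.h
may differ):

#define isTablesyncWorker(w)     (OidIsValid((w)->relid))
#define isParallelApplyWorker(w) ((w)->leader_pid != InvalidPid)

/* pg_stat_get_subscription() would then read consistently: */
if (isTablesyncWorker(&worker))
    values[1] = ObjectIdGetDatum(worker.relid);
...
if (isParallelApplyWorker(&worker))
    values[3] = Int32GetDatum(worker.leader_pid);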

-- 
With Regards,
Amit Kapila.




Re: A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-11 Thread Thomas Munro
Thanks.  I realised that it's easy enough to test that theory about
cleanup locks by hacking ConditionalLockBufferForCleanup() to return
false randomly.  Then the test occasionally fails as described.  Seems
like we'll need to fix that test, but it's not evidence of a server
bug, and my signal handler refactoring patch is in the clear.  Thanks
for testing it!
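
For anyone wanting to reproduce that experiment, a rough sketch of the kind
of hack described, for local testing only (the failure rate here is
arbitrary):

bool
ConditionalLockBufferForCleanup(Buffer buffer)
{
	/* pretend another backend holds a pin ~25% of the time */
	if (random() % 4 == 0)
		return false;

	/* ... original body unchanged ... */
}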




CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-08-11 Thread Jeff Davis
The attached patch implements a new SEARCH clause for CREATE FUNCTION.
The SEARCH clause controls the search_path used when executing
functions that were created without a SET clause.

Background:

Controlling search_path is critical for the correctness and security of
functions. Right now, the author of a function without a SET clause has
little ability to control the function's behavior, because even basic
operators like "+" involve search_path. This is a big problem for, e.g.,
functions used in expression indexes, which are called by any user with
write privileges on the table.

Motivation:

I'd like to (eventually) get to safe-by-default behavior. In other
words, the simplest function declaration should be safe for the most
common use cases.

To get there, we need some way to explicitly specify the less common
cases. Right now there's no way for the function author to indicate
that a function intends to use the session's search path. We also need
an easier way to specify that the user wants a safe search_path ("SET
search_path = pg_catalog, pg_temp" is arcane).

And when we know more about the user's actual intent, then it will be
easier to either form a transition plan to push users into the safer
behavior, or at least warn strongly when the user is doing something
dangerous (i.e. using a function that depends on the session's search
path as part of an expression index).

Today, the only information we have about the user's intent is the
presence or absence of a "SET search_path" clause, which is not a
strong signal.

Proposal:

Add SEARCH { DEFAULT | SYSTEM | SESSION } clause to CREATE/ALTER
function.

  * SEARCH DEFAULT is the same as no SEARCH clause at all, and ends up
stored in the catalog as prosearch='d'.
  * SEARCH SYSTEM means that we switch to the safe search path of
"pg_catalog, pg_temp"  when executing the function. Stored as
prosearch='y'.
  * SEARCH SESSION means that we don't switch the search_path when
executing the function, and it's inherited from the session. Stored as
prosearch='e'.

Regardless of the SEARCH clause, a "SET search_path" clause will
override it. The SEARCH clause only matters when "SET search_path" is
not there.

Additionally provide a GUC, defaulting to false for compatibility, that
can interpret prosearch='d' as if it were prosearch='y'. It could help
provide a transition path. I know there's a strong reluctance to add
these kinds of GUCs; I can remove it and I think the patch will still
be worthwhile. Perhaps there are alternatives that could help with
migration at pg_dump time instead?

Benefits:

1. The user can be more explicit about their actual intent. Do they
want safety and consistency? Or the flexibility of using the session's
search_path?

2. We can more accurately serve the user's intent. For instance, the
safe search_path of "pg_catalog, pg_temp" is arcane and seems to be
there just because we don't have a way to specify that pg_temp be
excluded entirely. But perhaps in the future we *do* want to exclude
pg_temp entirely. Knowing that the user just wants "SEARCH SYSTEM"
allows us some freedom to do that.

3. Users can be forward-compatible by specifying the functions that
really do need to use the session's search path as SEARCH SESSION, so
that they will never be broken in the future. That gives us a cleaner
path toward making the default behavior safe.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From 0218b472bd1ae8152b8bc8d92de4b7cc13dbbbad Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Fri, 11 Aug 2023 14:33:36 -0700
Subject: [PATCH] CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }.

Control the search_path while executing functions created without an
explicit SET clause.

Allows users to be forward compatible by explicitly declaring that a
function depends on the session's search_path, or that it only depends
on the safe search_path of "pg_catalog, pg_temp".

Also adds a GUC that allows users to choose how to treat functions
that have not specified a SEARCH clause or a SET clause. The GUC is
false by default to preserve existing semantics, but eventually it
will default to true for greater safety.
---
 doc/src/sgml/config.sgml  | 26 +++
 doc/src/sgml/ref/alter_function.sgml  | 15 
 doc/src/sgml/ref/create_function.sgml | 34 +
 src/backend/catalog/pg_aggregate.c|  1 +
 src/backend/catalog/pg_proc.c | 12 
 src/backend/commands/functioncmds.c   | 44 +++-
 src/backend/commands/typecmds.c   |  4 ++
 src/backend/optimizer/util/clauses.c  |  1 +
 src/backend/parser/gram.y | 15 
 src/backend/utils/fmgr/fmgr.c | 24 +--
 src/backend/utils/misc/guc_tables.c   | 10 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/catalog/namespace.h   |  6 ++
 src/include/catalog/pg_proc.h | 15 
 src/include/utils/guc.h

Re: Schema variables - new implementation for Postgres 15

2023-08-11 Thread Julien Rouhaud
On Fri, Aug 11, 2023 at 05:55:26PM +0200, Dmitry Dolgov wrote:
>
> Another confusing example was this one at the end of set_session_variable:
>
> + /*
> +  * XXX While unlikely, an error here is possible. It wouldn't leak memory
> +  * as the allocated chunk has already been correctly assigned to the
> +  * session variable, but would contradict this function contract, which is
> +  * that this function should either succeed or leave the current value
> +  * untouched.
> +  */
> + elog(DEBUG1, "session variable \"%s.%s\" (oid:%u) has new value",
> +  get_namespace_name(get_session_variable_namespace(svar->varid)),
> +  get_session_variable_name(svar->varid),
> +  svar->varid);
>
> It's not clear, which exactly error you're talking about, it's the last
> instruction in the function.

FTR I think I'm the one that changed that.  The error I was talking about is
elog() itself (in case of OOM for instance), or even one of the get_* call, if
running with log_level <= DEBUG1.  It's clearly really unlikely but still
possible, thus this comment which also tries to explain why this elog() is not
done earlier.




Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 20:11:34 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-08-11 18:30:02 -0400, Tom Lane wrote:
> >> +1 for including this in CI tests
>
> > I didn't even mean CI - I meant 'make check-world' / 'meson test'. Which of
> > course would include CI automatically.
>
> Hmm.  I'm allergic to anything that significantly increases the cost
> of check-world, and this seems like it'd do that.

Hm, compared to the cost of check-world it's not that large, but still,
annoying to make it larger.

We can make it a lot cheaper, but perhaps not in a general enough fashion that
it's suitable for a test.

pgindent already can query git (for --commit). We could teach pgindent to
ask git what remote branch is being tracked, and construct a list of the
files that differ between the remote branch and the local branch?

That option could do something like:
git diff --name-only $(git rev-parse --abbrev-ref --symbolic-full-name @{upstream})

That's pretty quick, even for a relatively large delta.


> Maybe we could automate it, but not as part of check-world per se?

We should definitely do that.  Another related thing that'd be useful to
script is updating typedefs.list with the additional typedefs found
locally. Right now the script for that still lives in the

Greetings,

Andres Freund




Rename ExtendBufferedWhat in 16?

2023-08-11 Thread Thomas Munro
Hi

Commit 31966b15 invented a way for functions dealing with relation
extension to accept a Relation in online code and an SMgrRelation in
recovery code (instead of using the earlier FakeRelcacheEntry
concept).  It seems highly likely that future new bufmgr.c interfaces
will face the same problem, and need to do something similar.  Let's
generalise the names so that each interface doesn't have to re-invent
the wheel?  ExtendBufferedWhat is also just not a beautiful name.  How
about BufferedObjectSelector?  That name leads to macros BOS_SMGR()
and BOS_REL().  Could also be BufMgrObject/BMO, ... etc etc.

This is from a patch-set that I'm about to propose for 17, which needs
one of these too, hence the desire to generalise.  But if we rename them
in 17, then AM authors, who are likely to discover and make use of
this interface, would have to use different names for 16 and 17.
From 2bc28dc5780b64f990500aebe717b0d7a170f67e Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 11 Aug 2023 15:07:50 +1200
Subject: [PATCH] ExtendBufferedWhat -> BufferedObjectSelector.

Commit 31966b15 invented a way for functions dealing with relation
extension to accept a Relation in online code and an SMgrRelation in
recovery code.  It seems highly likely that other bufmgr.c interfaces
will face the same problem, and need to do something similar.
Generalize the names so that each interface doesn't have to re-invent
the wheel.

Back-patch to 16, which introduced this interface.  Since extension AM
authors are likely to use it, we don't want the names to change in the
next release.

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index d935ed8fbd..8a92160e28 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -386,7 +386,7 @@ BloomNewBuffer(Relation index)
 	}
 
 	/* Must extend the file */
-	buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+	buffer = ExtendBufferedRel(BOS_REL(index), MAIN_FORKNUM, NULL,
 			   EB_LOCK_FIRST);
 
 	return buffer;
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3c6a956eaa..95600f6e2d 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -848,7 +848,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	 * whole relation will be rolled back.
 	 */
 
-	meta = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+	meta = ExtendBufferedRel(BOS_REL(index), MAIN_FORKNUM, NULL,
 			 EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
 	Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
 
@@ -915,7 +915,7 @@ brinbuildempty(Relation index)
 	Buffer		metabuf;
 
 	/* An empty BRIN index has a metapage only. */
-	metabuf = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+	metabuf = ExtendBufferedRel(BOS_REL(index), INIT_FORKNUM, NULL,
 EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
 
 	/* Initialize and xlog metabuffer. */
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index f4271ba31c..e67470e70b 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -569,7 +569,7 @@ revmap_physical_extend(BrinRevmap *revmap)
 	}
 	else
 	{
-		buf = ExtendBufferedRel(EB_REL(irel), MAIN_FORKNUM, NULL,
+		buf = ExtendBufferedRel(BOS_REL(irel), MAIN_FORKNUM, NULL,
 EB_LOCK_FIRST);
 		if (BufferGetBlockNumber(buf) != mapBlk)
 		{
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index be1841de7b..b6d03ac09b 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -440,9 +440,9 @@ ginbuildempty(Relation index)
 MetaBuffer;
 
 	/* An empty GIN index has two pages. */
-	MetaBuffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+	MetaBuffer = ExtendBufferedRel(BOS_REL(index), INIT_FORKNUM, NULL,
    EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
-	RootBuffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+	RootBuffer = ExtendBufferedRel(BOS_REL(index), INIT_FORKNUM, NULL,
    EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
 
 	/* Initialize and xlog metabuffer and root buffer. */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 437f24753c..1560d0c6e8 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -327,7 +327,7 @@ GinNewBuffer(Relation index)
 	}
 
 	/* Must extend the file */
-	buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+	buffer = ExtendBufferedRel(BOS_REL(index), MAIN_FORKNUM, NULL,
 			   EB_LOCK_FIRST);
 
 	return buffer;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 516465f8b7..174245615f 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -134,7 +134,7 @@ gistbuildempty(Relation index)
 	Buffer		buffer;
 
 	/* Initialize the root page */
-	buffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+	buffer = ExtendBufferedRel(BOS_REL(index), 

Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Michael Paquier
On Fri, Aug 11, 2023 at 08:11:34PM -0400, Tom Lane wrote:
> Hmm.  I'm allergic to anything that significantly increases the cost
> of check-world, and this seems like it'd do that.
> 
> Maybe we could automate it, but not as part of check-world per se?

It does not have to be part of check-world by default, as we could
make it optional with PG_TEST_EXTRA.  I bet that most committers set
this option for most of the test suites anyway, so the extra cost is
OK from here.  I don't find a single indent run to be that costly,
especially with parallelism:
$ time ./src/tools/pgindent/pgindent .
real 0m5.039s
user 0m3.403s
sys 0m1.540s
--
Michael




Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Tom Lane
Andres Freund  writes:
> On 2023-08-11 18:30:02 -0400, Tom Lane wrote:
>> +1 for including this in CI tests

> I didn't even mean CI - I meant 'make check-world' / 'meson test'. Which of
> course would include CI automatically.

Hmm.  I'm allergic to anything that significantly increases the cost
of check-world, and this seems like it'd do that.

Maybe we could automate it, but not as part of check-world per se?

regards, tom lane




Re: logicalrep_worker_launch -- counting/checking the worker limits

2023-08-11 Thread Peter Smith
The previous patch was accidentally not resetting the boolean limit
flags to false for retries.

Fixed in v2.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v2-0001-logicalrep_worker_launch-limit-checks.patch
Description: Binary data


Re: Adding a LogicalRepWorker type field

2023-08-11 Thread Peter Smith
On Fri, Aug 11, 2023 at 9:13 PM Amit Kapila  wrote:
>
> On Fri, Aug 11, 2023 at 3:41 PM Peter Smith  wrote:
> >
> > On Fri, Aug 11, 2023 at 7:33 PM Amit Kapila  wrote:
> > >
> > > On Thu, Aug 10, 2023 at 7:50 AM Peter Smith  wrote:
> > > >
> > > > >
> > > > > * If you do the above then there won't be a need to change the
> > > > > variable name is_parallel_apply_worker in logicalrep_worker_launch.
> > > > >
> > > >
> > > > Done.
> > > >
> > >
> > > I don't think the addition of two new macros isTablesyncWorker() and
> > > isLeaderApplyWorker() adds much value, so removed those and ran
> > > pgindent. I am planning to commit this patch early next week unless
> > > you or others have any comments.
> > >
> >
> > Thanks for considering this patch fit for pushing.
> >
> > Actually, I recently found 2 more overlooked places in the launcher.c
> > code which can benefit from using the isTablesyncWorker(w) macro that
> > was removed in patch v6-0001.
> >
>
> @@ -1301,7 +1301,7 @@ pg_stat_get_subscription(PG_FUNCTION_ARGS)
>   worker_pid = worker.proc->pid;
>
>   values[0] = ObjectIdGetDatum(worker.subid);
> - if (OidIsValid(worker.relid))
> + if (isTablesyncWorker(&worker))
>   values[1] = ObjectIdGetDatum(worker.relid);
>
> I don't see this as a good fit for using isTablesyncWorker(). If we
> were returning worker_type then using it would be okay.

Yeah, I also wasn't very sure about that one, except it seems
analogous to the existing code immediately below it, where you could
say the same thing:
if (isParallelApplyWorker(&worker))
values[3] = Int32GetDatum(worker.leader_pid);

Whatever you think is best for that one is fine by me.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 18:30:02 -0400, Tom Lane wrote:
> Peter Geoghegan  writes:
> > On Fri, Aug 11, 2023 at 2:25 PM Andres Freund  wrote:
> >> We could add a test that fails when there's some mis-indented code. That seems
> >> like it might catch things earlier?
> 
> +1 for including this in CI tests

I didn't even mean CI - I meant 'make check-world' / 'meson test'. Which of
course would include CI automatically.


> > It definitely would. That would go a long way towards addressing my
> > concerns. But I suspect that that would run into problems that stem
> > from the fact that the buildfarm is testing something that isn't all
> > that simple. Don't typedefs need to be downloaded from some other
> > blessed buildfarm animal?
> 
> No.  I presume koel is using src/tools/pgindent/typedefs.list,
> which has always been the "canonical" list but up to now we've
> been lazy about maintaining it.  Part of the new regime is that
> typedefs.list should now be updated on-the-fly by patches that
> add new typedefs.

Yea. Otherwise nobody else can indent reliably, without repeating the work of
adding typedefs.list entries of all the patches since the last time it was
updated in the repository.

Greetings,

Andres Freund




Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Jelte Fennema
On Fri, 11 Aug 2023 at 23:00, Peter Geoghegan  wrote:
> I'm starting to have doubts about this policy. There have now been
> quite a few follow-up "fixes" to indentation issues that koel
> complained about.

I think one thing that would help a lot in reducing these fixups is for
committers to set up the local git commit hook that's on the wiki:
https://wiki.postgresql.org/wiki/Working_with_Git

That one fails the commit if there are wrongly indented files in the
commit. And if you still want to opt out for whatever reason, you can
use git commit --no-verify.




Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Tom Lane
Peter Geoghegan  writes:
> I'm starting to have doubts about this policy. There have now been
> quite a few follow-up "fixes" to indentation issues that koel
> complained about. None of these fixups have been included in
> .git-blame-ignore-revs. If things continue like this then "git blame"
> is bound to become much less usable over time.

FWIW, I'm much more optimistic than that.  I think what we're seeing
is just the predictable result of not all committers having yet
incorporated "pgindent it before committing" into their workflow.
The need for followup fixes should diminish as people start doing
that.  If you want to hurry things along, peer pressure on committers
who clearly aren't bothering is the solution.

regards, tom lane




Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Tom Lane
Peter Geoghegan  writes:
> My workflow up until now has avoided making updates to typedefs.list
> in patches. I only update typedefs locally, for long enough to indent
> my code. The final patch doesn't retain any typedefs.list changes.

Yeah, I've done the same and will have to stop.

> I guess that I can't do that anymore. Hopefully maintaining the
> typedefs.list file isn't as inconvenient as it once seemed to me to
> be.

I don't think it'll be a problem.  If your rule is "add new typedef
names added by your patch to typedefs.list, keeping them in
alphabetical order" then it doesn't seem very complicated, and
hopefully conflicts between concurrently-developed patches won't
be common.

regards, tom lane




Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Peter Geoghegan
On Fri, Aug 11, 2023 at 3:30 PM Tom Lane  wrote:
> No.  I presume koel is using src/tools/pgindent/typedefs.list,
> which has always been the "canonical" list but up to now we've
> been lazy about maintaining it.  Part of the new regime is that
> typedefs.list should now be updated on-the-fly by patches that
> add new typedefs.

My workflow up until now has avoided making updates to typedefs.list
in patches. I only update typedefs locally, for long enough to indent
my code. The final patch doesn't retain any typedefs.list changes.

> We should still compare against the buildfarm's list periodically;
> but I imagine that the primary result of that will be to remove
> no-longer-used typedefs from typedefs.list.

I believe that I came up with my current workflow due to the
difficulty of maintaining the typedef file itself. Random
platform/binutils implementation details created a lot of noise,
presumably because my setup wasn't exactly the same as Bruce's setup,
in whatever way. For example, the order of certain lines would change,
in a way that had nothing whatsoever to do with structs that my patch
added.

I guess that I can't do that anymore. Hopefully maintaining the
typedefs.list file isn't as inconvenient as it once seemed to me to
be.

-- 
Peter Geoghegan




Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Tom Lane
Peter Geoghegan  writes:
> On Fri, Aug 11, 2023 at 2:25 PM Andres Freund  wrote:
>> We could add a test that fails when there's some mis-indented code. That seems
>> like it might catch things earlier?

+1 for including this in CI tests

> It definitely would. That would go a long way towards addressing my
> concerns. But I suspect that that would run into problems that stem
> from the fact that the buildfarm is testing something that isn't all
> that simple. Don't typedefs need to be downloaded from some other
> blessed buildfarm animal?

No.  I presume koel is using src/tools/pgindent/typedefs.list,
which has always been the "canonical" list but up to now we've
been lazy about maintaining it.  Part of the new regime is that
typedefs.list should now be updated on-the-fly by patches that
add new typedefs.

We should still compare against the buildfarm's list periodically;
but I imagine that the primary result of that will be to remove
no-longer-used typedefs from typedefs.list.

regards, tom lane




Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Peter Geoghegan
On Fri, Aug 11, 2023 at 2:25 PM Andres Freund  wrote:
> > I don't think that it makes sense to invent yet another rule for
> > .git-blame-ignore-revs, though. Will we need another buildfarm member
> > to enforce that rule, too?
>
> We could add a test that fails when there's some mis-indented code. That seems
> like it might catch things earlier?

It definitely would. That would go a long way towards addressing my
concerns. But I suspect that that would run into problems that stem
from the fact that the buildfarm is testing something that isn't all
that simple. Don't typedefs need to be downloaded from some other
blessed buildfarm animal?

-- 
Peter Geoghegan




Re: AssertLog instead of Assert in some places

2023-08-11 Thread Peter Geoghegan
On Fri, Aug 11, 2023 at 2:04 PM Andres Freund  wrote:
> Where supported, a crash (distinguishing from assertion failures) will now
> report something like:
>
> process with pid: 2900527 received signal: SIGSEGV, si_code: 1, si_addr: 
> 0xdeadbeef
> [0x5628ec45212f] pg_fatalsig_handler+0x20f: 
> ../../../../home/andres/src/postgresql/src/common/pg_backtrace.c:615
> [0x7fc4b743650f] [unknown]
> [0x5628ec14897c] check_root+0x19c (inlined): 
> ../../../../home/andres/src/postgresql/src/backend/main/main.c:393
> [0x5628ec14897c] main+0x19c: 
> ../../../../home/andres/src/postgresql/src/backend/main/main.c:183
>
> after I added
> *(volatile int*)0xdeadbeef = 1;
> to check_root().

It'll be like living in the future!

> For signals sent by users, it'd show the pid of the process sending the signal
> on most OSs. I really would like some generalized infrastructure for that, so
> that we can report for things like query cancellations.

That sounds great.

> > It would also be great if the tests spit out information about
> > assertion failures that was reasonably complete (backtrace without any
> > mangling, query text included, other basic context), reliably and
> > uniformly -- it shouldn't matter if it's from TAP or pg_regress test
> > SQL scripts.
>
> Hm. What other basic context are you thinking of? Pid is obvious. I guess
> backend type could be useful too, but normally be inferred from the stack
> trace pretty easily. Application name could be useful to know which test
> caused the crash.
>
> I'm a bit wary about trying to print things like query text - what if that
> string points to memory not terminated by a \0? I guess we could use an
> approach similar to pgstat_get_crashed_backend_activity().

I agree that being less verbose by default is good. On second thought
even query text isn't all that important.

> One issue with reporting crashes from signal handlers is that the obvious way
> to make that signal safe (lots of small writes) leads to the potential for
> interspersed lines.  It's probably worth having a statically sized buffer that
> will commonly be large enough to print a whole backtrace. When too small, it
> should include the pid at the start of every "chunk".

Good idea.

> > Which kind of test happened to be involved is usually not interesting to me
> > here (even the query text won't usually be interesting), so being forced to
> > think about it slows me down quite a lot.
>
> Interesting - I quite often end up spending time trying to dig out which query
> from what sql file triggered a crash, so I can try to trigger it in
> isolation. I often wished the server knew the source line associated with the
> query. Enough that I pondered ways to have psql transport that knowledge to
> the server.

I actually do plenty of that too. My overall point was this: there is
likely some kind of Pareto principle here. That should guide the sorts
of scenarios we optimize for.

If you actually benchmarked where I spent time while writing code,
minute to minute, I bet it would show that most of the individual
debug-compile cycles were triggered by issues that had a fairly simple
and immediate cause. Cases where I improve one small thing, and then
rerun the tests, which show an assertion failure in nearby code. As
soon as I see very basic details I immediately think "duh, of course"
in these cases, at which point I'll come up with a likely-good fix in
seconds. And then I'll rinse and repeat. My fix might just work (at
least to the extent that all tests now pass), but it also might lead
to another assertion failure of the same general nature.

There are also lots of cases where I really do have to think about
recreating the details from the test in order to truly understand
what's going on, of course. But there are still way way more
individual "duh, of course" assertion failures in practice. Those are
where productivity wins are still possible, because the bottleneck
isn't just that I have an incomplete mental model that I need to work
to expand.

Perhaps my experiences here aren't universal. But it seems like they
might be roughly the same as everybody else that works on Postgres?
Assuming that they are, then the information that is output should be
optimized for the "duh, of course" scenarios. Not to an absurd degree,
mind you. But the output shouldn't be too verbose. Ideally there'd be
a still fairly straightforward way of getting extra information, for
the cases where debugging is likely to take a few minutes, and require
real focus. The extra work in those other cases is relatively
insignificant, because the "startup costs" are relatively large -- a
little extra indirection (though only a little) can't hurt too much.

-- 
Peter Geoghegan




Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 13:59:40 -0700, Peter Geoghegan wrote:
> On Sat, Jun 17, 2023 at 7:08 AM Andrew Dunstan  wrote:
> > I have set up a new buildfarm animal called koel which will run the module.
> 
> I'm starting to have doubts about this policy. There have now been
> quite a few follow-up "fixes" to indentation issues that koel
> complained about. None of these fixups have been included in
> .git-blame-ignore-revs. If things continue like this then "git blame"
> is bound to become much less usable over time.

I'm not sure I buy that that's going to be a huge problem - most of the time
such fixups are pretty small compared to larger reindents.


> I don't think that it makes sense to invent yet another rule for
> .git-blame-ignore-revs, though. Will we need another buildfarm member
> to enforce that rule, too?

We could add a test that fails when there's some mis-indented code. That seems
like it might catch things earlier?

Greetings,

Andres Freund




Re: AssertLog instead of Assert in some places

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 13:19:34 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 12:26 PM Andres Freund  wrote:
> > > For example, dealing with core dumps left behind by the regression
> > > tests can be annoying.
> >
> > Hm. I don't have a significant problem with that. But I can see it being
> > problematic. Unfortunately, short of preventing core dumps from happening,
> > I don't think we really can do much about that - whatever is running the 
> > tests
> > shouldn't have privileges to change system wide settings about where core
> > dumps end up etc.
>
> I was unclear. I wasn't talking about managing core dumps. I was
> talking about using core dumps to get a simple backtrace, that just
> gives me some very basic information. I probably shouldn't have even
> mentioned core dumps, because what I'm really concerned about is the
> workflow around getting very basic information about assertion
> failures. Not around core dumps per se.
>
> The inconsistent approach needed to get a simple, useful backtrace for
> assertion failures (along with other basic information associated with
> the failure) is a real problem. Particularly when running the tests.
> Most individual assertion failures that I see are for code that I'm
> practically editing in real time. So shaving cycles here really
> matters.

Ah, yes. Agreed, obviously.


> For one thing the symbol mangling that we have around the built-in
> backtraces makes them significantly less useful. I really hope that
> your libbacktrace patch gets committed soon, since that looks like it
> would be a nice quality of life improvement, all on its own.

I've been hacking further on it:
https://github.com/anarazel/postgres/tree/backtrace

Haven't yet posted a new version. Doing it properly for frontend tools has some
dependencies on threading stuff. I'm hoping Thomas' patch for that will go in.


Now it also intercepts segfaults and prints
backtraces for them, if that's possible (it requires libbacktrace to be async
signal safe, which it isn't on all platforms).

Where supported, a crash (distinguishing from assertion failures) will now
report something like:

process with pid: 2900527 received signal: SIGSEGV, si_code: 1, si_addr: 
0xdeadbeef
[0x5628ec45212f] pg_fatalsig_handler+0x20f: 
../../../../home/andres/src/postgresql/src/common/pg_backtrace.c:615
[0x7fc4b743650f] [unknown]
[0x5628ec14897c] check_root+0x19c (inlined): 
../../../../home/andres/src/postgresql/src/backend/main/main.c:393
[0x5628ec14897c] main+0x19c: 
../../../../home/andres/src/postgresql/src/backend/main/main.c:183

after I added
*(volatile int*)0xdeadbeef = 1;
to check_root().


For signals sent by users, it'd show the pid of the process sending the signal
on most OSs. I really would like some generalized infrastructure for that, so
that we can report for things like query cancellations.


As the patch stands, the only OS that doesn't yet have useful "backtrace on
crash" support with that is windows, as libbacktrace unfortunately isn't
signal safe on windows. But it'd still provide useful backtraces on
Assert()s. I haven't yet figured out whether/when it's required to be signal
safe on windows though - crashes are intercepted by
SetUnhandledExceptionFilter() - I don't understand the precise constraints of
that. Plenty people seem to generate backtraces on crashes on windows using
that, without concerns for signal safety like things.


Currently Macos CI doesn't use libbacktrace, but as it turns out
backtrace_symbols() on macOS is a heck of a lot better than on glibc
platforms.  CI for windows with visual studio doesn't have libbacktrace
installed yet (and has the aforementioned signal safety issue), I think it's
installed for windows w/ mingw.


> It would also be great if the tests spit out information about
> assertion failures that was reasonably complete (backtrace without any
> mangling, query text included, other basic context), reliably and
> uniformly -- it shouldn't matter if it's from TAP or pg_regress test
> SQL scripts.

Hm. What other basic context are you thinking of? Pid is obvious. I guess
backend type could be useful too, but normally be inferred from the stack
trace pretty easily. Application name could be useful to know which test
caused the crash.

I'm a bit wary about trying to print things like query text - what if that
string points to memory not terminated by a \0? I guess we could use an
approach similar to pgstat_get_crashed_backend_activity().

One issue with reporting crashes from signal handlers is that the obvious way
to make that signal safe (lots of small writes) leads to the potential for
interspersed lines.  It's probably worth having a statically sized buffer that
will commonly be large enough to print a whole backtrace. When too small, it
should include the pid at the start of every "chunk".
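
A simplified sketch of that buffering idea (names invented; a real signal
handler would additionally have to stick to async-signal-safe formatting):

#include <string.h>
#include <unistd.h>

static char report_buf[8192];
static size_t report_len = 0;

static void
report_append(const char *s)
{
	size_t		n = strlen(s);

	/* truncate rather than overflow when the buffer is too small */
	if (n > sizeof(report_buf) - report_len)
		n = sizeof(report_buf) - report_len;
	memcpy(report_buf + report_len, s, n);
	report_len += n;
}

static void
report_flush(void)
{
	/* a single write() keeps the whole report from interleaving with others */
	(void) write(STDERR_FILENO, report_buf, report_len);
	report_len = 0;
}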


> Which kind of test happened to be involved is usually not interesting to me
> here (even the query text won't usually 

Re: [PATCH] Support static linking against LLVM

2023-08-11 Thread Marcelo Juchem
In my case, my product has a very controlled environment.
We build all our infrastructure from source and we avoid dynamic linking by
design, except where technically not viable (e.g.: pgsql extensions).

LLVM is one of the libraries we're specifically required to statically link.
Unfortunately I can't share the specifics of why that's the case.

Without static link support by PostgreSQL we can't enable LLVM JIT.


On a side note, I'd like to take this opportunity to ask you if there's any
work being done towards migrating away from deprecated LLVM APIs.
If there's no work being done on that front, I might take a stab at it if
there's any interest from the PostgreSQL community in that contribution.

More specifically, there are some deprecated C APIs that are used in
PostgreSQL (more details in
https://llvm.org/docs/OpaquePointers.html#frontends).
For that reason, PostgreSQL LLVM JIT support will fail to build starting
with the next version of LLVM (version 17).

Migrating to the new APIs will make PostgreSQL require at minimum
version 8 of LLVM (released in March 2019).

Regards,

-mj


On Fri, Aug 11, 2023 at 12:43 PM Andres Freund  wrote:

> Hi,
>
> On 2023-08-10 14:45:47 -0500, Marcelo Juchem wrote:
> > By default, PostgreSQL doesn't explicitly choose whether to link
> > statically or dynamically against LLVM when LLVM JIT is enabled (e.g.:
> > `./configure --with-llvm`).
> >
> > `llvm-config` will choose to dynamically link by default.
> >
> > In order to statically link, one must pass `--link-static` to
> > `llvm-config` when listing linker flags (`--ldflags`) and libraries
> > (`--libs`).
> >
> > This patch enables custom flags to be passed to `llvm-config` linker
> > related invocations through the environment variable
> > `LLVM_CONFIG_LINK_ARGS`.
> >
> > To statically link against LLVM it suffices, then, to call `configure`
> > with environment variable `LLVM_CONFIG_LINK_ARGS=--link-static`.
>
> I'm not opposed to it, but I'm not sure I see much need for it either. Do
> you
> have a specific use case for this?
>
> Greetings,
>
> Andres Freund
>


Re: [PATCH] Support static linking against LLVM

2023-08-11 Thread Marcelo Juchem
Andres, Tom, I found your names in the git history for JIT and LLVM.
Any chance one of you could take a look at the patch?

-mj


On Thu, Aug 10, 2023 at 2:45 PM Marcelo Juchem  wrote:

> By default, PostgreSQL doesn't explicitly choose whether to link
> statically or dynamically against LLVM when LLVM JIT is enabled (e.g.:
> `./configure --with-llvm`).
>
> `llvm-config` will choose to dynamically link by default.
>
> In order to statically link, one must pass `--link-static` to
> `llvm-config` when listing linker flags (`--ldflags`) and libraries
> (`--libs`).
>
> This patch enables custom flags to be passed to `llvm-config` linker
> related invocations through the environment variable
> `LLVM_CONFIG_LINK_ARGS`.
>
> To statically link against LLVM it suffices, then, to call `configure`
> with environment variable `LLVM_CONFIG_LINK_ARGS=--link-static`.
> ---
>  config/llvm.m4 | 5 +++--
>  configure  | 6 --
>  2 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/config/llvm.m4 b/config/llvm.m4
> index 3a75cd8b4d..712bd3de6c 100644
> --- a/config/llvm.m4
> +++ b/config/llvm.m4
> @@ -13,6 +13,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT],
>AC_REQUIRE([AC_PROG_AWK])
>
>AC_ARG_VAR(LLVM_CONFIG, [path to llvm-config command])
> +  AC_ARG_VAR(LLVM_CONFIG_LINK_ARGS, [extra arguments for llvm-config
> linker related flags])
>PGAC_PATH_PROGS(LLVM_CONFIG, llvm-config llvm-config-7 llvm-config-6.0
> llvm-config-5.0 llvm-config-4.0 llvm-config-3.9)
>
># no point continuing if llvm wasn't found
> @@ -52,7 +53,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT],
>  esac
>done
>
> -  for pgac_option in `$LLVM_CONFIG --ldflags`; do
> +  for pgac_option in `$LLVM_CONFIG --ldflags $LLVM_CONFIG_LINK_ARGS`; do
>  case $pgac_option in
>-L*) LDFLAGS="$LDFLAGS $pgac_option";;
>  esac
> @@ -84,7 +85,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT],
># And then get the libraries that need to be linked in for the
># selected components.  They're large libraries, we only want to
># link them into the LLVM using shared library.
> -  for pgac_option in `$LLVM_CONFIG --libs --system-libs
> $pgac_components`; do
> +  for pgac_option in `$LLVM_CONFIG --libs --system-libs
> $LLVM_CONFIG_LINK_ARGS $pgac_components`; do
>  case $pgac_option in
>-l*) LLVM_LIBS="$LLVM_LIBS $pgac_option";;
>  esac
> diff --git a/configure b/configure
> index 86ffccb1ee..974b7f2d4e 100755
> --- a/configure
> +++ b/configure
> @@ -1595,6 +1595,8 @@ Some influential environment variables:
>CXX C++ compiler command
>CXXFLAGSC++ compiler flags
>LLVM_CONFIG path to llvm-config command
> +  LLVM_CONFIG_LINK_ARGS
> +  extra arguments for llvm-config linker related flags
>CLANG   path to clang compiler to generate bitcode
>CPP C preprocessor
>PKG_CONFIG  path to pkg-config utility
> @@ -5200,7 +5202,7 @@ fi
>  esac
>done
>
> -  for pgac_option in `$LLVM_CONFIG --ldflags`; do
> +  for pgac_option in `$LLVM_CONFIG --ldflags $LLVM_CONFIG_LINK_ARGS`; do
>  case $pgac_option in
>-L*) LDFLAGS="$LDFLAGS $pgac_option";;
>  esac
> @@ -5232,7 +5234,7 @@ fi
># And then get the libraries that need to be linked in for the
># selected components.  They're large libraries, we only want to
># link them into the LLVM using shared library.
> -  for pgac_option in `$LLVM_CONFIG --libs --system-libs
> $pgac_components`; do
> +  for pgac_option in `$LLVM_CONFIG --libs --system-libs
> $LLVM_CONFIG_LINK_ARGS $pgac_components`; do
>  case $pgac_option in
>-l*) LLVM_LIBS="$LLVM_LIBS $pgac_option";;
>  esac
> --
> 2.40.1
>
>


Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Peter Geoghegan
On Sat, Jun 17, 2023 at 7:08 AM Andrew Dunstan  wrote:
> I have set up a new buildfarm animal called koel which will run the module.

I'm starting to have doubts about this policy. There have now been
quite a few follow-up "fixes" to indentation issues that koel
complained about. None of these fixups have been included in
.git-blame-ignore-revs. If things continue like this then "git blame"
is bound to become much less usable over time.

I don't think that it makes sense to invent yet another rule for
.git-blame-ignore-revs, though. Will we need another buildfarm member
to enforce that rule, too?

-- 
Peter Geoghegan




Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-12 07:51:09 +1200, Thomas Munro wrote:
> Oh, I see what's happening.  Maybe commit b91dd9de wasn't the best
> idea, but bc971f4025c broke an assumption, since it doesn't use
> ConditionVariableSleep().  That is confusing the signal forwarding
> logic which expects to find our entry in the wait list in the common
> case.

Hm, I guess I got confused by the cv code once more. I thought that
ConditionVariableCancelSleep() would wake us up anyway, because
once we return from ConditionVariableSleep(), we'd be off the list. But I now
realize (and I think not for the first time), that ConditionVariableSleep()
always puts us *back* on the list.


Leaving aside the issue in this thread, isn't always adding us back into the
list bad from a contention POV alone - it doubles the write traffic on the CV
and is guaranteed to cause contention for ConditionVariableBroadcast().  I
wonder if this explains some performance issues I've seen in the past.

What if we instead reset cv_sleep_target once we've been taken off the list? I
think it'd not be too hard to make it safe to do the proclist_contains()
without the spinlock.  Lwlocks have something similar, there we solve it by
this sequence:

proclist_delete(&wakeup, iter.cur, lwWaitLink);

/*
 * Guarantee that lwWaiting being unset only becomes visible once the
 * unlink from the link has completed. Otherwise the target backend
 * could be woken up for other reason and enqueue for a new lock - if
 * that happens before the list unlink happens, the list would end up
 * being corrupted.
 *
 * The barrier pairs with the LWLockWaitListLock() when enqueuing for
 * another lock.
 */
pg_write_barrier();
waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);

I guess this means we'd need something like lwWaiting for CVs as well.
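
A very rough sketch of how that could look (the cvWaiting field is invented
here; the point is only the ordering, mirroring the lwlock sequence above):

	/* signaller, with cv->mutex held: */
	proclist_delete(&cv->wakeup, pgprocno, cvWaitLink);
	pg_write_barrier();			/* unlink must be visible before the flag */
	proc->cvWaiting = false;	/* hypothetical flag, analogous to lwWaiting */

	/* waiter, lock-free check instead of proclist_contains() + spinlock: */
	if (!MyProc->cvWaiting)
		cv_sleep_target = NULL;	/* already removed, nothing to forward */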


> From a85b2827f4500bc2e7c533feace474bc47086dfa Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Sat, 12 Aug 2023 07:06:08 +1200
> Subject: [PATCH] De-pessimize ConditionVariableCancelSleep().
>
> Commit b91dd9de was concerned with a theoretical problem with our
> non-atomic condition variable operations.  If you stop sleeping, and
> then cancel the sleep in a separate step, you might be signaled in
> between, and that could be lost.  That doesn't matter for callers of
> ConditionVariableBroadcast(), but callers of ConditionVariableSignal()
> might be upset if a signal went missing like this.

FWIW I suspect at least some of the places that'd have a problem without that
forwarding might also be racy with it.


> New idea: ConditionVariableCancelSleep() can just return true if you've
> been signaled.  Hypothetical users of ConditionVariableSignal() would
> then still have a way to deal with rare lost signals if they are
> concerned about that problem.

Sounds like a plan to me.

Greetings,

Andres Freund




Re: AssertLog instead of Assert in some places

2023-08-11 Thread Peter Geoghegan
On Fri, Aug 11, 2023 at 12:26 PM Andres Freund  wrote:
> > For example, dealing with core dumps left behind by the regression
> > tests can be annoying.
>
> Hm. I don't have a significant problem with that. But I can see it being
> problematic. Unfortunately, short of preventing core dumps from happening,
> I don't think we really can do much about that - whatever is running the tests
> shouldn't have privileges to change system wide settings about where core
> dumps end up etc.

I was unclear. I wasn't talking about managing core dumps. I was
talking about using core dumps to get a simple backtrace, that just
gives me some very basic information. I probably shouldn't have even
mentioned core dumps, because what I'm really concerned about is the
workflow around getting very basic information about assertion
failures. Not around core dumps per se.

The inconsistent approach needed to get a simple, useful backtrace for
assertion failures (along with other basic information associated with
the failure) is a real problem. Particularly when running the tests.
Most individual assertion failures that I see are for code that I'm
practically editing in real time. So shaving cycles here really
matters.

For one thing the symbol mangling that we have around the built-in
backtraces makes them significantly less useful. I really hope that
your libbacktrace patch gets committed soon, since that looks like it
would be a nice quality of life improvement, all on its own.

It would also be great if the tests spit out information about
assertion failures that was reasonably complete (backtrace without any
mangling, query text included, other basic context), reliably and
uniformly -- it shouldn't matter if it's from TAP or pg_regress test
SQL scripts. Which kind of test happened to be involved is usually not
interesting to me here (even the query text won't usually be
interesting), so being forced to think about it slows me down quite a
lot.

> > Don't you also hate it when there's a regression.diffs that just shows 20k
> > lines of subtractions? Perhaps you don't -- perhaps your custom setup makes
> > it quick and easy to get relevant information about what actually went
> > wrong.
>
> I do really hate that. At the very least we should switch to using
> restart-after-crash by default, and not start new tests once the server has
> crashed and do a waitpid(postmaster, WNOHANG) after each failing test, to see
> if the reason the test failed is that the backend died.

+1
-- 
Peter Geoghegan




Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)

2023-08-11 Thread Thomas Munro
On Sat, Aug 12, 2023 at 5:51 AM Andres Freund  wrote:
> On 2023-08-11 15:31:43 +0200, Tomas Vondra wrote:
> > It seems to me the issue is in WalSndWait, which was reworked to use
> > ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up
> > waking each other in a busy loop, until the timing changes just enough
> > to break the cycle.
>
> IMO ConditionVariableCancelSleep()'s behaviour of waking up additional
> processes can nearly be considered a bug, at least when combined with
> ConditionVariableBroadcast(). In that case the wakeup is never needed, and it
> can cause situations like this, where condition variables basically
> deteriorate to a busy loop.
>
> I hit this with AIO as well. I've been "solving" it by adding a
> ConditionVariableCancelSleepEx(), which has a only_broadcasts argument.
>
> I'm inclined to think that any code that needs that needs the forwarding
> behaviour is pretty much buggy.

Oh, I see what's happening.  Maybe commit b91dd9de wasn't the best
idea, but bc971f4025c broke an assumption, since it doesn't use
ConditionVariableSleep().  That is confusing the signal forwarding
logic which expects to find our entry in the wait list in the common
case.

What do you think about this patch?
From a85b2827f4500bc2e7c533feace474bc47086dfa Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 12 Aug 2023 07:06:08 +1200
Subject: [PATCH] De-pessimize ConditionVariableCancelSleep().

Commit b91dd9de was concerned with a theoretical problem with our
non-atomic condition variable operations.  If you stop sleeping, and
then cancel the sleep in a separate step, you might be signaled in
between, and that could be lost.  That doesn't matter for callers of
ConditionVariableBroadcast(), but callers of ConditionVariableSignal()
might be upset if a signal went missing like this.

In the past I imagine that the main case would be that this sort of race
would be rare and would come up if you used
ConditionVariableTimedSleep(), but then commit bc971f4025c invented a
new way to multiplex a CV with other events using latch waits directly.
Since it doesn't even use ConditionVariableSleep() (which normally puts
us back in the wait list), ConditionVariableCancelSleep() is confused
and forwards a signal when we *have* been woken up by the condition.
That produces a nasty storm of wakeups.

New idea: ConditionVariableCancelSleep() can just return true if you've
been signaled.  Hypothetical users of ConditionVariableSignal() would
then still have a way to deal with rare lost signals if they are
concerned about that problem.

Reported-by: Tomas Vondra 
Reported-by: Andres Freund 
Discussion: https://postgr.es/m/2840876b-4cfe-240f-0a7e-29ffd66711e7%40enterprisedb.com

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 7e2bbf46d9..910a768206 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -223,15 +223,17 @@ ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
  *
  * Do nothing if nothing is pending; this allows this function to be called
  * during transaction abort to clean up any unfinished CV sleep.
+ *
+ * Return true if we've been signaled.
  */
-void
+bool
 ConditionVariableCancelSleep(void)
 {
 	ConditionVariable *cv = cv_sleep_target;
 	bool		signaled = false;
 
 	if (cv == NULL)
-		return;
+		return false;
 
 	SpinLockAcquire(&cv->mutex);
 	if (proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
@@ -240,15 +242,9 @@ ConditionVariableCancelSleep(void)
 		signaled = true;
 	SpinLockRelease(&cv->mutex);
 
-	/*
-	 * If we've received a signal, pass it on to another waiting process, if
-	 * there is one.  Otherwise a call to ConditionVariableSignal() might get
-	 * lost, despite there being another process ready to handle it.
-	 */
-	if (signaled)
-		ConditionVariableSignal(cv);
-
 	cv_sleep_target = NULL;
+
+	return signaled;
 }
 
 /*
diff --git a/src/include/storage/condition_variable.h b/src/include/storage/condition_variable.h
index 589bdd323c..e218cb2c49 100644
--- a/src/include/storage/condition_variable.h
+++ b/src/include/storage/condition_variable.h
@@ -56,7 +56,7 @@ extern void ConditionVariableInit(ConditionVariable *cv);
 extern void ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info);
 extern bool ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
 		uint32 wait_event_info);
-extern void ConditionVariableCancelSleep(void);
+extern bool ConditionVariableCancelSleep(void);
 
 /*
  * Optionally, ConditionVariablePrepareToSleep can be called before entering
-- 
2.39.2



Re: [PATCH] psql: Add tab-complete for optional view parameters

2023-08-11 Thread David Zhang

Applied the v3 patch to master and verified it with the commands below:

#Alter view

postgres=# alter view v 
ALTER COLUMN  OWNER TO  RENAME    RESET (   SET

postgres=# alter view v set 
(   SCHEMA

postgres=# alter view v set ( 
CHECK_OPTION  SECURITY_BARRIER  SECURITY_INVOKER

postgres=# alter view v reset ( 
CHECK_OPTION  SECURITY_BARRIER  SECURITY_INVOKER

postgres=# alter view v set ( check_option = 
CASCADED  LOCAL

postgres=# alter view v set ( security_barrier = 
FALSE  TRUE

postgres=# alter view v set ( security_invoker = 
FALSE  TRUE


#Create view

postgres=# create view v
AS  WITH (
postgres=# create or replace view v
AS  WITH (

postgres=# create view v with (
CHECK_OPTION  SECURITY_BARRIER  SECURITY_INVOKER

postgres=# create or replace view v with (
CHECK_OPTION  SECURITY_BARRIER  SECURITY_INVOKER

postgres=# create view v with (*)AS
postgres=# create or replace view v with (*)AS

postgres=# create view v as SELECT
postgres=# create or replace view v as SELECT


For below changes,

 	else if (TailMatches("CREATE", "VIEW", MatchAny, "AS") ||
-			 TailMatches("CREATE", "OR", "REPLACE", "VIEW", MatchAny, "AS"))
+			 TailMatches("CREATE", "VIEW", MatchAny, "WITH", "(*)", "AS") ||
+			 TailMatches("CREATE", "OR", "REPLACE", "VIEW", MatchAny, "AS") ||
+			 TailMatches("CREATE", "OR", "REPLACE", "VIEW", MatchAny, "WITH", "(*)", "AS"))


it would be great to switch the order of the 3rd and the 4th line to 
make a better match for "CREATE" and "CREATE OR REPLACE" .
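
One reading of that suggestion, as a sketch only (pairing each CREATE form
with its CREATE OR REPLACE counterpart; surrounding tab-complete.c code
omitted):

	else if (TailMatches("CREATE", "VIEW", MatchAny, "AS") ||
			 TailMatches("CREATE", "OR", "REPLACE", "VIEW", MatchAny, "AS") ||
			 TailMatches("CREATE", "VIEW", MatchAny, "WITH", "(*)", "AS") ||
			 TailMatches("CREATE", "OR", "REPLACE", "VIEW", MatchAny, "WITH", "(*)", "AS"))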



Since it supports completion in the middle of a word, as in the case below,
postgres=# alter view v set ( security_
security_barrier  security_invoker

and during view reset it can also provide all the options list,
postgres=# alter view v reset (
CHECK_OPTION  SECURITY_BARRIER  SECURITY_INVOKER

but I'm not sure if it is a good idea, or even possible, to autocomplete the
remaining reset options after one of them has been entered followed by ",", for example:

postgres=# alter view v reset ( CHECK_OPTION, 
SECURITY_BARRIER  SECURITY_INVOKER


Thank you,

David





Re: AssertLog instead of Assert in some places

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 11:56:27 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 11:23 AM Andres Freund  wrote:
> > > Couldn't you say the same thing about defensive "can't happen" ERRORs?
> > > They are essentially a form of assertion that isn't limited to
> > > assert-enabled builds.
> >
> > Yes. A lot of them I hate them with the passion of a thousand suns ;). "Oh,
> > our transaction state machinery is confused. Yes, let's just continue going
> > through the same machinery again, that'll resolve it.".
> 
> I am not unsympathetic to Ashutosh's point about conventional ERRORs
> being easier to deal with when debugging your own code, during initial
> development work.

Oh, I am as well - I just don't think it's a good idea to introduce
"log + error" assertions to core postgres, because it seems very likely
that they'll end up getting used a lot.


> But that seems like a problem with the tooling in other areas.

Agreed.


> For example, dealing with core dumps left behind by the regression
> tests can be annoying.

Hm. I don't have a significant problem with that. But I can see it being
problematic. Unfortunately, short of preventing core dumps from happening,
I don't think we really can do much about that - whatever is running the tests
shouldn't have privileges to change system wide settings about where core
dumps end up etc.


> Don't you also hate it when there's a regression.diffs that just shows 20k
> lines of subtractions? Perhaps you don't -- perhaps your custom setup makes
> it quick and easy to get relevant information about what actually went
> wrong.

I do really hate that. At the very least we should switch to using
restart-after-crash by default, and not start new tests once the server has
crashed and do a waitpid(postmaster, WNOHANG) after each failing test, to see
if the reason the test failed is that the backend died.


> But it seems like that sort of thing could be easier to deal with by
> default, without using custom shell scripts or anything -- particularly for
> those of us that haven't been Postgres hackers for eons.

Yes, wholeheartedly agreed.

Greetings,

Andres Freund




Re: AssertLog instead of Assert in some places

2023-08-11 Thread Peter Geoghegan
On Fri, Aug 11, 2023 at 11:23 AM Andres Freund  wrote:
> > Couldn't you say the same thing about defensive "can't happen" ERRORs?
> > They are essentially a form of assertion that isn't limited to
> > assert-enabled builds.
>
> Yes. A lot of them I hate them with the passion of a thousand suns ;). "Oh,
> our transaction state machinery is confused. Yes, let's just continue going
> through the same machinery again, that'll resolve it.".

I am not unsympathetic to Ashutosh's point about conventional ERRORs
being easier to deal with when debugging your own code, during initial
development work. But that seems like a problem with the tooling in
other areas.

For example, dealing with core dumps left behind by the regression
tests can be annoying. Don't you also hate it when there's a
regression.diffs that just shows 20k lines of subtractions? Perhaps
you don't -- perhaps your custom setup makes it quick and easy to get
relevant information about what actually went wrong. But it seems like
that sort of thing could be easier to deal with by default, without
using custom shell scripts or anything -- particularly for those of us
that haven't been Postgres hackers for eons.

> There've been people arguing in the past that it's good for robustness to do
> stuff like that. I think it's exactly the opposite.
>
> Now, don't get me wrong, there are plenty cases where a "this shouldn't
> happen" elog(ERROR) is appropriate...

Right. They're not bad -- they just need to be backed up by some kind
of reasoning, which will be particular to each case. The default
approach should be to crash whenever any invariants are violated,
because all bets are off at that point.

> What are you imagining? Basically something that generates an elog(ERROR) with
> the stringified expression as part of the error message?

I'd probably start with a new elevel, that implied an assertion
failure in assert-enabled builds but otherwise acted just like ERROR.
I remember multiple cases where I added an "Assert(false)" right after
a "can't happen" error, which isn't a great approach.

It might also be useful to offer something along the lines you've
described, which I was sort of thinking of myself. But now that I've
thought about it a little more, I think that such an approach might
end up being overused. If you're going to add what amounts to a "can't
happen" ERROR then you should really be obligated to write a halfway
reasonable error message. As I said, you should have to own the fact
that you think it's better to not just crash for this one particular
callsite, based on some fairly specific chain of reasoning. Ideally
it'll be backed up by practical experience/user reports.

-- 
Peter Geoghegan




Re: AssertLog instead of Assert in some places

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 11:14:34 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 10:57 AM Andres Freund  wrote:
> > I am quite strongly against this. This will lead to assertions being hit in
> > tests without that being noticed, e.g. because they happen in a background
> > process that just restarts.
> 
> Couldn't you say the same thing about defensive "can't happen" ERRORs?
> They are essentially a form of assertion that isn't limited to
> assert-enabled builds.

Yes. A lot of them I hate them with the passion of a thousand suns ;). "Oh,
our transaction state machinery is confused. Yes, let's just continue going
through the same machinery again, that'll resolve it.". Even worse are the
ones that are just WARNINGS. Like "Oh, something wrote beyond the length of
allocated memory, that's something to issue a WARNING about and happily
continue".

There've been people arguing in the past that it's good for robustness to do
stuff like that. I think it's exactly the opposite.

Now, don't get me wrong, there are plenty cases where a "this shouldn't
happen" elog(ERROR) is appropriate...


> I have sometimes thought that it would be handy if there was a variant
> of "can't happen" ERRORs that took on the structure of an assertion.

What are you imagining? Basically something that generates an elog(ERROR) with
the stringified expression as part of the error message?

Greetings,

Andres Freund




Re: AssertLog instead of Assert in some places

2023-08-11 Thread Peter Geoghegan
On Fri, Aug 11, 2023 at 10:57 AM Andres Freund  wrote:
> I am quite strongly against this. This will lead to assertions being hit in
> tests without that being noticed, e.g. because they happen in a background
> process that just restarts.

Couldn't you say the same thing about defensive "can't happen" ERRORs?
They are essentially a form of assertion that isn't limited to
assert-enabled builds.

I have sometimes thought that it would be handy if there was a variant
of "can't happen" ERRORs that took on the structure of an assertion.
(This is quite different to what Ashutosh has proposed, though, since
it would still look like a conventional assertion failure on
assert-enabled builds.)

-- 
Peter Geoghegan




Re: [PATCH] Support static linking against LLVM

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 12:59:27 -0500, Marcelo Juchem wrote:
> In my case, my product has a very controlled environment.
> We build all our infrastructure from source and we avoid dynamic linking by
> design, except where technically not viable (e.g.: pgsql extensions).
> 
> LLVM is one of the libraries we're specifically required to statically link.
> Unfortunately I can't share the specifics of why that's the case.

If you build llvm with just static lib support it should work as-is?


> On a side note, I'd like to take this opportunity to ask you if there's any
> work being done towards migrating away from deprecated LLVM APIs.
> If there's no work being done on that front, I might take a stab at it if
> there's any interest from the PostgreSQL community in that contribution.

Yes: 
https://postgr.es/m/CA%2BhUKGKNX_%3Df%2B1C4r06WETKTq0G4Z_7q4L4Fxn5WWpMycDj9Fw%40mail.gmail.com


> Migrating to the new APIs will have PostgreSQL require at the minimum
> version 8 of LLVM (released in March 2019).

I think Thomas' patch abstracts it enough so that older versions continue to
work...

Greetings,

Andres Freund




Re: LLVM 16 (opaque pointers)

2023-08-11 Thread Andres Freund
Hi,

On 2023-05-21 15:01:41 +1200, Thomas Munro wrote:
> *For anyone working with this type of IR generation code and
> questioning their sanity, I can pass on some excellent advice I got
> from Andres: build LLVM yourself with assertions enabled, as they
> catch some classes of silly mistake that otherwise just segfault
> inscrutably on execution.

Hm. I think we need a buildfarm animal with an assertion enabled llvm 16 once
we merge this. I think after an upgrade my buildfarm machine has the necessary
resources.


> @@ -150,7 +150,7 @@ llvm_compile_expr(ExprState *state)
>  
>   /* create function */
>   eval_fn = LLVMAddFunction(mod, funcname,
> -   
> llvm_pg_var_func_type("TypeExprStateEvalFunc"));
> +   
> llvm_pg_var_func_type("ExecInterpExprStillValid"));

Hm, that's a bit ugly. But ...

> @@ -77,9 +80,44 @@ extern Datum AttributeTemplate(PG_FUNCTION_ARGS);
>  Datum
>  AttributeTemplate(PG_FUNCTION_ARGS)
>  {
> + PGFunction  fp PG_USED_FOR_ASSERTS_ONLY;
> +
> + fp = 
>   PG_RETURN_NULL();
>  }

Other parts of the file do this by putting the functions into
referenced_functions[], i'd copy that here and below.

> +void
> +ExecEvalSubroutineTemplate(ExprState *state,
> +struct ExprEvalStep *op,
> +ExprContext *econtext)
> +{
> + ExecEvalSubroutine fp PG_USED_FOR_ASSERTS_ONLY;
> +
> + fp = 
> +}
> +
> +extern bool ExecEvalBoolSubroutineTemplate(ExprState *state,
> + 
>struct ExprEvalStep *op,
> + 
>ExprContext *econtext);
> +bool
> +ExecEvalBoolSubroutineTemplate(ExprState *state,
> +struct ExprEvalStep 
> *op,
> +ExprContext 
> *econtext)
> +{
> + ExecEvalBoolSubroutine fp PG_USED_FOR_ASSERTS_ONLY;
> +
> + fp = 
> + return false;
> +}
> +


Thanks for working on this!

Greetings,

Andres Freund




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-11 Thread Bruce Momjian
On Fri, Aug 11, 2023 at 10:46:31AM +0530, Amit Kapila wrote:
> On Thu, Aug 10, 2023 at 7:07 PM Masahiko Sawada  wrote:
> > What I imagined is that we do this check before
> > check_and_dump_old_cluster() while the server is 'off'. Reading the
> > slot state file would be simple and I guess we would not need a tool
> > or cli program for that. We need to expose RepliactionSlotOnDisk,
> > though.
> 
> Won't that require a lot of version-specific checks as across versions
> the file format could be different? For the case of the control file,
> we use version-specific pg_controldata (for the old cluster, the
> corresponding version's pg_controldata) utility to read the old
> version control file. I thought we need something similar here if we
> want to do what you are suggesting.

You mean the slot file format?  We will need that complexity somewhere,
so why not in pg_upgrade?

> > After reading the control file and the slots' state files we
> > check if slot's confirmed_flush_lsn matches the latest checkpoint LSN
> > in the control file (BTW maybe we can get slot name and plugin name
> > here instead of using pg_dump?).
> 
> But isn't the advantage of doing via pg_dump (in binary_mode) that we
> allow some outside core in-place upgrade tool to also use it if
> required? If we don't think that would be required then we can
> probably use the info we retrieve it in pg_upgrade.

You mean the code reading the slot file?  I don't see the point of
adding user complexity to enable some hypothetical external usage.
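
(For what it's worth, while the old cluster is still up, the check being
discussed amounts to something like the query below; this is just an
illustration of the semantics, not what pg_upgrade would run, since the idea
above is to do it with the server off by reading the slot state files.)

    -- logical slots whose confirmed_flush_lsn has not caught up with the
    -- latest checkpoint would make the upgrade unsafe
    SELECT s.slot_name,
           s.confirmed_flush_lsn,
           c.checkpoint_lsn,
           s.confirmed_flush_lsn >= c.checkpoint_lsn AS caught_up
    FROM pg_replication_slots s, pg_control_checkpoint() c
    WHERE s.slot_type = 'logical';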

> > the first commit. Or another idea would be to allow users to mark
> > replication slots "upgradable" so that pg_upgrade skips the
> > confirmed_flush_lsn check.
> 
> I guess for that we need to ask users to ensure that confirm_flush_lsn
> is up-to-date and then provide some slot-level API to mark the slots
> with the required status. If so, that sounds a bit complicated for
> users.

Agreed, not worth it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-11 Thread Bruce Momjian
On Fri, Aug 11, 2023 at 11:18:09AM +0530, Amit Kapila wrote:
> On Fri, Aug 11, 2023 at 10:43 AM Julien Rouhaud  wrote:
> > I disagree.  As I mentioned before any module registered in
> > shared_preload_libraries can spawn background workers which can perform any
> > activity.  There were previous reports of corruption because of multi-xact
> > being generated by such bgworkers during pg_upgrade, I'm pretty sure that 
> > there
> > are some modules that create objects (automatic partitioning tools for
> > instance).  It's also unclear to me what would happen if some writes are
> > performed by such module at various points of the pg_upgrade process.  
> > Couldn't
> > that lead to either data loss or broken slot (as it couldn't stream changes
> > from older major version)?
> 
> It won't be any worse than what can happen to tables. If we know that
> such bgworkers can cause corruption if they do writes during the
> upgrade, I don't think it is the job of this patch to prevent the
> related scenarios. We can probably disallow the creation of new slots
> during the binary upgrade, but even that I am not sure about. I guess it
> would be better to document such hazards as a first step and then
> probably write a patch to prevent WAL writes or something along those
> lines.

Yes, if users are connecting to the clusters during pg_upgrade, we have
many more problems than slots.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: LLVM 16 (opaque pointers)

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-10 16:56:54 +0200, Ronan Dunklau wrote:
> I tried my hand at backporting it to previous versions, and not knowing
> anything about it made me indeed question my sanity.  It's quite easy for PG
> 15, 14, 13. PG 12 is nothing insurmountable either, but PG 11 is a bit hairier
> most notably due to the change in fcinfo args representation. But I guess
> that's also a topic for another day :-)

Given that 11 is about to be EOL, I don't think it's worth spending the time
to support a new LLVM version for it.

Greetings,

Andres Freund




Re: AssertLog instead of Assert in some places

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 17:59:37 +0530, Ashutosh Bapat wrote:
> Most of the Asserts are recoverable by rolling back the transaction
> without crashing the backend. So an elog(ERROR, ) is enough. But just
> by themselves elogs are compiled into the non-debug binary, and the
> condition check can waste CPU cycles, esp. when the conditions are mostly
> true like the ones we use in Assert.
> 
> Attached patch combines Assert and elog(ERROR, ) so that when an
> Assert is triggered in assert-enabled binary, it will throw an error
> while keeping the backend intact. Thus it does not affect gdb session
> or psql session. These elogs do not make their way to non-assert
> binary so do not make it to production like Assert.

I am quite strongly against this. This will lead to assertions being hit in
tests without that being noticed, e.g. because they happen in a background
process that just restarts.

Greetings,

Andres Freund




Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 15:31:43 +0200, Tomas Vondra wrote:
> That's an awful lot of CPU for a cluster doing essentially "nothing"
> (there's no WAL to decode/replicate, etc.). It usually disappears after
> a couple seconds, but sometimes it's a rather persistent state.

Ugh, that's not great.

> The profile from the walsender processes looks like this:
> 
> --99.94%--XLogSendLogical
>   |
>   |--99.23%--XLogReadRecord
>   |  XLogReadAhead
>   |  XLogDecodeNextRecord
>   |  ReadPageInternal
>   |  logical_read_xlog_page
>   |  |
>   |  |--97.80%--WalSndWaitForWal
>   |  |  |
>   |  |  |--68.48%--WalSndWait
> 
> It seems to me the issue is in WalSndWait, which was reworked to use
> ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up
> waking each other in a busy loop, until the timing changes just enough
> to break the cycle.

IMO ConditionVariableCancelSleep()'s behaviour of waking up additional
processes can nearly be considered a bug, at least when combined with
ConditionVariableBroadcast(). In that case the wakeup is never needed, and it
can cause situations like this, where condition variables basically
deteriorate to a busy loop.

I hit this with AIO as well. I've been "solving" it by adding a
ConditionVariableCancelSleepEx(), which has an only_broadcasts argument.

I'm inclined to think that any code that needs the forwarding behaviour is
pretty much buggy.
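
For reference, a simplified sketch of that idea, modeled on the existing
ConditionVariableCancelSleep(); this is illustrative only, not the actual
patch:

    bool
    ConditionVariableCancelSleepEx(bool only_broadcasts)
    {
        ConditionVariable *cv = cv_sleep_target;
        bool        signaled = false;

        if (cv == NULL)
            return false;

        SpinLockAcquire(&cv->mutex);
        if (proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
            proclist_delete(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
        else
            signaled = true;
        SpinLockRelease(&cv->mutex);

        /*
         * If someone signaled us but we no longer care, pass the wakeup on,
         * unless the caller knows the CV is only ever woken via
         * ConditionVariableBroadcast(), in which case every other waiter was
         * woken anyway and forwarding would just cause spurious wakeups.
         */
        if (signaled && !only_broadcasts)
            ConditionVariableSignal(cv);

        cv_sleep_target = NULL;

        return signaled;
    }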

Greetings,

Andres Freund




Re: [PATCH] Support static linking against LLVM

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-10 14:45:47 -0500, Marcelo Juchem wrote:
> By default, PostgreSQL doesn't explicitly choose whether to link
> statically or dynamically against LLVM when LLVM JIT is enabled (e.g.:
> `./configure --with-llvm`).
> 
> `llvm-config` will choose to dynamically link by default.
> 
> In order to statically link, one must pass `--link-static` to
> `llvm-config` when listing linker flags (`--ldflags`) and libraries
> (`--libs`).
> 
> This patch enables custom flags to be passed to `llvm-config` linker
> related invocations through the environment variable
> `LLVM_CONFIG_LINK_ARGS`.
> 
> To statically link against LLVM it suffices, then, to call `configure`
> with environment variable `LLVM_CONFIG_LINK_ARGS=--link-static`.

I'm not opposed to it, but I'm not sure I see much need for it either. Do you
have a specific use case for this?

Greetings,

Andres Freund




Re: A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-11 Thread Christoph Berg
Re: Thomas Munro
> On Thu, Aug 10, 2023 at 9:15 PM Christoph Berg  wrote:
> > No XXX lines this time either, but I've seen then im logfiles that
> > went through successfully.
> 
> Do you still have the data directories around from that run, so we can
> see if the expected Heap2/PRUNE was actually logged?  For example
> (using meson layout here, in the build directory) that'd be something
> like:

Sorry for the late reply, getting the PG release out of the door was
more important.

> $ ./tmp_install/home/tmunro/install/bin/pg_waldump
> testrun/recovery/031_recovery_conflict/data/t_031_recovery_conflict_standby_data/pgdata/pg_wal/00010003
> 
> In there I see this:

It's been a long day and I can't wrap my mind around understanding
this now, so I'll just dump the output here.

[0] 16:03 myon@sid-s390x.pgs390x:~/.../build/src/test/recovery $ LC_ALL=C 
../../../tmp_install/usr/lib/postgresql/17/bin/pg_waldump 
tmp_check/t_031_recovery_conflict_standby_data/pgdata/pg_wal/00010003
 | grep -3 PRUNE > PRUNE.log
pg_waldump: error: error in WAL record at 0/347E6A8: invalid record length at 
0/347E6E0: expected at least 24, got 0

Christoph
rmgr: Heaplen (rec/tot):154/   154, tx:737, lsn: 
0/03441C58, prev 0/03441C20, desc: UPDATE old_xmax: 737, old_off: 14, 
old_infobits: [], flags: 0x00, new_xmax: 0, new_off: 19, blkref #0: rel 
1663/1/2619 blk 19, blkref #1: rel 1663/1/2619 blk 1
rmgr: Btree   len (rec/tot): 64/64, tx:737, lsn: 
0/03441CF8, prev 0/03441C58, desc: INSERT_LEAF off: 70, blkref #0: rel 
1663/1/2696 blk 1
rmgr: Transaction len (rec/tot):565/   565, tx:737, lsn: 
0/03441D38, prev 0/03441CF8, desc: COMMIT 2023-08-10 01:02:28.824209 UTC; inval 
msgs: catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 
catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 
catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 
catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 
catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 
catcache 63 catcache 63
rmgr: Heap2   len (rec/tot): 83/83, tx:  0, lsn: 
0/03441F70, prev 0/03441D38, desc: PRUNE snapshotConflictHorizon: 737, 
nredirected: 0, ndead: 14, nunused: 0, redirected: [], dead: [1, 2, 3, 4, 5, 6, 
7, 8, 9, 10, 11, 12, 13, 14], unused: [], blkref #0: rel 1663/1/2619 blk 1
rmgr: Heaplen (rec/tot): 76/76, tx:738, lsn: 
0/03441FC8, prev 0/03441F70, desc: HOT_UPDATE old_xmax: 738, old_off: 27, 
old_infobits: [], flags: 0x20, new_xmax: 0, new_off: 32, blkref #0: rel 
1663/1/2619 blk 1
rmgr: Heaplen (rec/tot): 59/  8211, tx:738, lsn: 
0/03442030, prev 0/03441FC8, desc: LOCK xmax: 738, off: 1, infobits: 
[LOCK_ONLY, EXCL_LOCK], flags: 0x01, blkref #0: rel 1663/1/2619 blk 2 FPW
rmgr: Heaplen (rec/tot): 54/  5686, tx:738, lsn: 
0/03444060, prev 0/03442030, desc: INSERT off: 4, flags: 0x01, blkref #0: rel 
1663/1/2840 blk 2 FPW
--
rmgr: Transaction len (rec/tot):469/   469, tx:738, lsn: 
0/03449E60, prev 0/03449E20, desc: COMMIT 2023-08-10 01:02:28.840240 UTC; inval 
msgs: catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 
catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 
catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 
catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 catcache 63 
catcache 63 catcache 63
rmgr: Heaplen (rec/tot): 53/  3097, tx:  0, lsn: 
0/0344A050, prev 0/03449E60, desc: INPLACE off: 65, blkref #0: rel 1663/1/1259 
blk 7 FPW
rmgr: Standby len (rec/tot): 90/90, tx:  0, lsn: 
0/0344AC70, prev 0/0344A050, desc: INVALIDATIONS ; relcache init file inval 
dbid 1 tsid 1663; inval msgs: catcache 55 catcache 54 relcache 1259
rmgr: Heap2   len (rec/tot):105/   105, tx:  0, lsn: 
0/0344ACD0, prev 0/0344AC70, desc: PRUNE snapshotConflictHorizon: 738, 
nredirected: 0, ndead: 25, nunused: 0, redirected: [], dead: [1, 2, 3, 4, 5, 6, 
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], 
unused: [], blkref #0: rel 1663/1/2619 blk 2
rmgr: Heaplen (rec/tot): 76/76, tx:739, lsn: 
0/0344AD40, prev 0/0344ACD0, desc: HOT_UPDATE old_xmax: 739, old_off: 27, 
old_infobits: [], flags: 0x20, new_xmax: 0, new_off: 31, blkref #0: rel 
1663/1/2619 blk 2
rmgr: Heaplen (rec/tot): 59/  8179, tx:739, lsn: 
0/0344AD90, prev 0/0344AD40, desc: LOCK xmax: 739, off: 1, infobits: 
[LOCK_ONLY, EXCL_LOCK], flags: 0x01, blkref #0: rel 1663/1/2619 blk 5 FPW
rmgr: Heaplen (rec/tot):   1551/  1551, tx:739, lsn: 
0/0344CDA0, prev 0/0344AD90, desc: UPDATE old_xmax: 739, old_off: 1, 
old_infobits: [], flags: 0x01, new_xmax: 0, new_off: 23, blkref #0: rel 
1663/1/2619 blk 20, blkref #1: rel 1663/1/2619 blk 5
--
rmgr: Heap

Re: Schema variables - new implementation for Postgres 15

2023-08-11 Thread Dmitry Dolgov
> On Thu, Aug 03, 2023 at 08:15:13AM +0200, Pavel Stehule wrote:
> Hi
>
> fresh rebase

Thanks for the continued efforts. The new patch structure looks better to
me (although the boundary between patches 0001 and 0002 is somewhat
fuzzy, e.g. the function NameListToString is used already in the first
one, but defined in the second). A couple of comments along the way:

* Looks like it's common to use BKI_DEFAULT when defining catalog
entities, something like BKI_DEFAULT(-1) for typmod, BKI_DEFAULT(0) for
collation, etc. Does it make sense to put a few default values into
pg_variable as well?
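
For illustration, that would look roughly like the lines below in the catalog
header (the pg_variable column names here are assumptions made up for the
example, not copied from the patch):

    Oid         vartype BKI_LOOKUP(pg_type);    /* variable's value type */
    int32       vartypmod BKI_DEFAULT(-1);      /* typmod, -1 if not set */
    Oid         varcollation BKI_DEFAULT(0);    /* collation, 0 if none */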

* The first patch contains:

diff --git a/src/backend/access/transam/xact.c 
b/src/backend/access/transam/xact.c
@@ -2800,6 +2800,8 @@ AbortTransaction(void)
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
AtEOXact_LargeObject(false);
+
+   /* 'false' means it's abort */
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
AtAbort_Twophase();

What does the commentary refer to, is it needed?

* I see ExplainOneQuery got a new argument:

 static void ExplainOneQuery(Query *query, int cursorOptions,
-   IntoClause *into, 
ExplainState *es,
+   IntoClause *into, Oid 
targetvar, ExplainState *es,
const char *queryString, ParamListInfo params,
QueryEnvironment *queryEnv);

From what I understand it represents a potential session variable to be
explained. Isn't it too specific for this interface? Could it be put
somewhere else? To be honest, I don't have any suggestions myself, but
it feels a bit out of place here.

* Session variable validity logic is not always clear, at least to me,
producing the following awkward pieces of code:

+   if (!svar->is_valid)
+   {
+   if (is_session_variable_valid(svar))
+   svar->is_valid = true;

I read this as there being two ways in which a variable could be invalid?

* It's not always easy to follow which failure modes are taken care of. E.g.

+* Don't try to use possibly invalid data from svar. And we don't want 
to
+* overwrite invalid svar immediately. The datumCopy can fail, and in 
this
+* case, the stored value will be invalid still.

I couldn't find any similar precautions elsewhere; how exactly can datumCopy
fail, are you referring to palloc/memcpy failures?

Another confusing example was this one at the end of set_session_variable:

+   /*
+* XXX While unlikely, an error here is possible. It wouldn't leak 
memory
+* as the allocated chunk has already been correctly assigned to the
+* session variable, but would contradict this function contract, which 
is
+* that this function should either succeed or leave the current value
+* untouched.
+*/
+   elog(DEBUG1, "session variable \"%s.%s\" (oid:%u) has new value",
+
get_namespace_name(get_session_variable_namespace(svar->varid)),
+get_session_variable_name(svar->varid),
+svar->varid);

It's not clear which error exactly you're talking about; it's the last
instruction in the function.

Maybe it would be beneficial to have some overarching description, all
in one place, of how the session variables implementation handles various
failures?




Re: generic plans and "initial" pruning

2023-08-11 Thread Amit Langote
On Fri, Aug 11, 2023 at 14:31 Amit Langote  wrote:

> On Wed, Aug 9, 2023 at 1:05 AM Robert Haas  wrote:
> > On Tue, Aug 8, 2023 at 10:32 AM Amit Langote 
> wrote:
> > > But should ExecInitNode() subroutines return the partially initialized
> > > PlanState node or NULL on detecting invalidation?  If I'm
> > > understanding how you think this should be working correctly, I think
> > > you mean the former, because if it were the latter, ExecInitNode()
> > > would end up returning NULL at the top for the root and then there's
> > > nothing to pass to ExecEndNode(), so no way to clean up to begin with.
> > > In that case, I think we will need to adjust ExecEndNode() subroutines
> > > to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for
> > > example.  That's something Tom had said he doesn't like very much [1].
> >
> > Yeah, I understood Tom's goal as being "don't return partially
> > initialized nodes."
> >
> > Personally, I'm not sure that's an important goal. In fact, I don't
> > even think it's a desirable one. It doesn't look difficult to audit
> > the end-node functions for cases where they'd fail if a particular
> > pointer were NULL instead of pointing to some real data, and just
> > fixing all such cases to have NULL-tests looks like purely mechanical
> > work that we are unlikely to get wrong. And at least some cases
> > wouldn't require any changes at all.
> >
> > If we don't do that, the complexity doesn't go away. It just moves
> > someplace else. Presumably what we do in that case is have
> > ExecInitNode functions undo any initialization that they've already
> > done before returning NULL. There are basically two ways to do that.
> > Option one is to add code at the point where they return early to
> > clean up anything they've already initialized, but that code is likely
> > to substantially duplicate whatever the ExecEndNode function already
> > knows how to do, and it's very easy for logic like this to get broken
> > if somebody rearranges an ExecInitNode function down the road.
>
> Yeah, I too am not a fan of making ExecInitNode() clean up partially
> initialized nodes.
>
> > Option
> > two is to rearrange the ExecInitNode functions now, to open relations
> > or recurse at the beginning, so that we discover the need to fail
> > before we initialize anything. That restricts our ability to further
> > rearrange the functions in future somewhat, but more importantly,
> > IMHO, it introduces more risk right now. Checking that the ExecEndNode
> > function will not fail if some pointers are randomly null is a lot
> > easier than checking that changing the order of operations in an
> > ExecInitNode function breaks nothing.
> >
> > I'm not here to say that we can't do one of those things. But I think
> > adding null-tests to ExecEndNode functions looks like *far* less work
> > and *way* less risk.
>
> +1
>
> > There's a second issue here, too, which is when we abort ExecInitNode
> > partway through, how do we signal that? You're rightly pointing out
> > here that if we do that by returning NULL, then we don't do it by
> > returning a pointer to the partially initialized node that we just
> > created, which means that we either need to store those partially
> > initialized nodes in a separate data structure as you propose to do in
> > 0001,
> >
> > or else we need to pick a different signalling convention. We
> > could change (a) ExecInitNode to have an additional argument, bool
> > *kaboom, or (b) we could make it return bool and return the node
> > pointer via a new additional argument, or (c) we could put a Boolean
> > flag into the estate and let the function signal failure by flipping
> > the value of the flag.
>
> The failure can already be detected by seeing that
> ExecPlanIsValid(estate) is false.  The question is what ExecInitNode()
> or any of its subroutines should return once it is.  I think the
> following convention works:
>
> Return partially initialized state from ExecInit* function where we
> detect the invalidation after calling ExecInitNode() on a child plan,
> so that ExecEndNode() can recurse to clean it up.
>
> Return NULL from ExecInit* functions where we detect the invalidation
> after opening and locking a relation but before calling ExecInitNode()
> to initialize a child plan if there's one at all.  Even if we may set
> things like ExprContext, TupleTableSlot fields, they are cleaned up
> independently of the plan tree anyway via the cleanup called with
> es_exprcontexts, es_tupleTable, respectively.  I even noticed bits
> like this in ExecEnd* functions:
>
> -   /*
> -* Free the exprcontext(s) ... now dead code, see ExecFreeExprContext
> -*/
> -#ifdef NOT_USED
> -   ExecFreeExprContext(>ss.ps);
> -   if (node->ioss_RuntimeContext)
> -   FreeExprContext(node->ioss_RuntimeContext, true);
> -#endif
>
> So, AFAICS, ExprContext, TupleTableSlot cleanup in ExecNode* functions
> is unnecessary but remain around because nobody cared about and got
> around to 

Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-08-11 Thread Melih Mutlu
Hi Peter,

On Fri, 11 Aug 2023 at 01:26, Peter Smith  wrote:

> No, I meant what I wrote there. When I ran the tests the HEAD included
> the v25-0001 refactoring patch, but v26 did not yet exist.
>
> For now, we are only performance testing the first
> "Reuse-Tablesyc-Workers" patch, but not yet including the second patch
> ("Reuse connection when...").
>
> Note that those "Reuse-Tablesyc-Workers" patches v24-0002 and v26-0001
> are equivalent because there are only cosmetic log message differences
> between them.
>

Ok, that's fair.



> So, my testing was with HEAD+v24-0002 (but not including v24-0003).
> Your same testing should be with HEAD+v26-0001 (but not including
> v26-0002).
>

That's actually what I did. I should have been more clear about what I
included in my previous email. With v26-0002, results are noticeably better
anyway.
I just reran the test against HEAD, HEAD+v26-0001 and additionally
HEAD+v26-0001+v26-0002 this time, for better comparison.

Here are my results with the same scripts you shared earlier (I obviously
only changed the number of inserts before each commit).
Note that this is when synchronous_commit = off.

100 inserts/tx
+-+---+--+--+--+
| | 2w| 4w   | 8w   | 16w  |
+-+---+--+--+--+
| v26-0002| 10421 | 6472 | 6656 | 6566 |
+-+---+--+--+--+
| improvement | 31%   | 12%  | 0%   | 5%   |
+-+---+--+--+--+
| v26-0001| 14585 | 7386 | 7129 | 7274 |
+-+---+--+--+--+
| improvement | 9%| 5%   | 12%  | 7%   |
+-+---+--+--+--+
| HEAD| 16130 | 7785 | 8147 | 7827 |
+-+---+--+--+--+

1000 inserts/tx
+-+---+--+--+--+
| | 2w| 4w   | 8w   | 16w  |
+-+---+--+--+--+
| v26-0002| 13796 | 6848 | 5942 | 6315 |
+-+---+--+--+--+
| improvement | 9%| 7%   | 10%  | 8%   |
+-+---+--+--+--+
| v26-0001| 14685 | 7325 | 6675 | 6719 |
+-+---+--+--+--+
| improvement | 3%| 0%   | 0%   | 2%   |
+-+---+--+--+--+
| HEAD| 15118 | 7354 | 6644 | 6890 |
+-+---+--+--+--+

2000 inserts/tx
+-+---+---+--+--+
| | 2w| 4w| 8w   | 16w  |
+-+---+---+--+--+
| v26-0002| 22442 | 9944  | 6034 | 5829 |
+-+---+---+--+--+
| improvement | 5%| 2%| 4%   | 10%  |
+-+---+---+--+--+
| v26-0001| 23632 | 10164 | 6311 | 6480 |
+-+---+---+--+--+
| improvement | 0%| 0%| 0%   | 0%   |
+-+---+---+--+--+
| HEAD| 23667 | 10157 | 6285 | 6470 |
+-+---+---+--+--+

5000 inserts/tx
+-+---+---+---+--+
| | 2w| 4w| 8w| 16w  |
+-+---+---+---+--+
| v26-0002| 41443 | 21385 | 10832 | 6146 |
+-+---+---+---+--+
| improvement | 0%| 0%| 1%| 16%  |
+-+---+---+---+--+
| v26-0001| 41293 | 21226 | 10814 | 6158 |
+-+---+---+---+--+
| improvement | 0%| 1%| 1%| 15%  |
+-+---+---+---+--+
| HEAD| 41503 | 21466 | 10943 | 7292 |
+-+---+---+---+--+


Again, I couldn't reproduce the cases where you saw significantly degraded
performance. I wonder if I'm missing something. Did you do anything not
included in the test scripts you shared? Do you think v26-0001 will
perform 84% worse than HEAD, if you try again? I just want to be sure that
it was not a random thing.
Interestingly, I also don't see as big an improvement in the above results as
in yours when the inserts/tx ratio is smaller, even though such cases are
certainly still improved.

Thanks,
-- 
Melih Mutlu
Microsoft


walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)

2023-08-11 Thread Tomas Vondra
Hi,

while working on some logical replication stuff, I noticed that on PG16
I often end up with a completely idle publisher (no user activity), that
however looks like this in top:

 ...  %CPU  COMMAND
 ...  17.9  postgres: walsender user test ::1(43064) START_REPLICATION
 ...  16.6  postgres: walsender user test ::1(43128) START_REPLICATION
 ...  15.0  postgres: walsender user test ::1(43202) START_REPLICATION
 ...   6.6  postgres: walsender user test ::1(43236) START_REPLICATION
 ...   5.3  postgres: walsender user test ::1(43086) START_REPLICATION
 ...   4.3  postgres: walsender user test ::1(43180) START_REPLICATION
 ...   3.7  postgres: walsender user test ::1(43052) START_REPLICATION
 ...   3.7  postgres: walsender user test ::1(43158) START_REPLICATION
 ...   3.0  postgres: walsender user test ::1(43108) START_REPLICATION
 ...   3.0  postgres: walsender user test ::1(43214) START_REPLICATION

That's an awful lot of CPU for a cluster doing essentially "nothing"
(there's no WAL to decode/replicate, etc.). It usually disappears after
a couple seconds, but sometimes it's a rather persistent state.

The profile from the walsender processes looks like this:

--99.94%--XLogSendLogical
  |
  |--99.23%--XLogReadRecord
  |  XLogReadAhead
  |  XLogDecodeNextRecord
  |  ReadPageInternal
  |  logical_read_xlog_page
  |  |
  |  |--97.80%--WalSndWaitForWal
  |  |  |
  |  |  |--68.48%--WalSndWait

It seems to me the issue is in WalSndWait, which was reworked to use
ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up
waking each other in a busy loop, until the timing changes just enough
to break the cycle.

I've been unable to reproduce this on PG15, and bc971f4025c seems like
the most significant change to WalSndWait, which is why I suspect it's
related to the issue.

Reproducing this is simple: create a publisher with multiple subscribers
(they could even go to the same subscriber instance) and empty publications.
If you trigger a "noop" it's likely to cause this high CPU usage:

-
# init two clusters
pg_ctl -D data-publisher init
pg_ctl -D data-subscriber init

# change the parameters to allow 10 subscriptions
echo 'wal_level = logical' >> data-publisher/postgresql.conf
echo 'port = 5433' >> data-subscriber/postgresql.conf
echo 'max_worker_processes = 20' >> data-subscriber/postgresql.conf
echo 'max_logical_replication_workers = 20' >>
data-subscriber/postgresql.conf

# setup empty publication
createdb test
psql test -c "create publication p";

# setup 10 subscriptions
for i in `seq 1 10`; do
  createdb -p 5433 test$i
  psql -p 5433 test$i -c "create subscription s$i CONNECTION
'host=localhost port=5432 dbname=test' publication p";
done

# emit logical messages, which are almost noop, 5s apart
for i in `seq 1 10`; do
psql test -c "select pg_logical_emit_message(false, 'x', 'x')";
sleep 5;
done;
-


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Report planning memory in EXPLAIN ANALYZE

2023-08-11 Thread Ashutosh Bapat
On Fri, Aug 11, 2023 at 10:41 AM Andrey Lepikhov
 wrote:
> >>
> >
> > This won't just affect planner but every subsystem of PostgreSQL, not
> > just planner, unless we device a new context type for planner.
> >
> > My point is what's relevant here is how much net memory planner asked
> > for. Internally how much memory PostgreSQL allocated seems relevant
> > for the memory context infra as such but not so much for planner
> > alone.
> >
> > If you think that memory allocated is important, I will add both used
> > and (net) allocated memory to EXPLAIN output.
> As a developer, I like having as much internal info in my EXPLAIN as
> possible. But this command is designed mainly for users, not core
> developers. The size of memory allocated usually doesn't make sense for
> users. On the other hand, developers can watch it easily in different
> ways, if needed. So, I vote for the 'memory used' metric only.
>
> The patch looks good, passes the tests.

Thanks Andrey.

David, are you against reporting "memory used"? If you are not against
reporting used memory and still think that memory allocated is worth
reporting, I am fine reporting allocated memory too.

-- 
Best Wishes,
Ashutosh Bapat




Re: Inconsistent results with libc sorting on Windows

2023-08-11 Thread Noah Misch
On Fri, Aug 11, 2023 at 11:48:18AM +0200, Juan José Santamaría Flecha wrote:
> On Fri, Aug 11, 2023 at 7:29 AM Noah Misch  wrote:
> 
> >
> > The LCMapStringEx() solution is elegant.  I do see
> >
> > https://learn.microsoft.com/en-us/windows/win32/intl/handling-sorting-in-your-applications
> > says, "If an application calls the function to create a sort key for a
> > string
> > containing an Arabic kashida, the function creates no sort key value."
> > That's
> > aggravating.
> >
> 
> I think the problem there is that it is just poorly explained.
> Take for example the attached program that compares "normal" and "kashida"
> 'Raħīm' taken from [1], you'll get:
> 
> c1 = c2
> c2 = c1
> 
> meaning that "normal" and "kashida" are the same string. So, probably that
> phrase should read something like: "If an application calls the function to
> create a sort key for a string containing an Arabic kashida character, the
> function will ignore that character and no sort key value will be generated
> for it."

Good.  That sounds fine.  Thanks for clarifying.




Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning

2023-08-11 Thread Ashutosh Bapat
I spent some time on 4th point below but also looked at other points.
Here's what I have found so far

On Thu, Jul 27, 2023 at 7:35 PM Ashutosh Bapat
 wrote:
>
> 1. The patch uses RestrictInfo::required_relids as the key for
> searching child RelOptInfos. I am not sure which of the two viz.
> required_relids and clause_relids is a better key. required_relids
> seems to be a subset of clause_relids and from the description it
> looks like that's the set that decides the applicability of a clause
> in a join. But clause_relids is obtained from all the Vars that appear
> in the clause, so may be that's the one that matters for the
> translations. Can somebody guide me?

I was wrong that required_relids is a subset of clause_relids. The first
can contain OJ relids which the other cannot. OJ relids do not have
any children, so they won't be translated. So clause_relids seems to
be a better key. I haven't made a decision yet.

>
> 2. The patch adds two extra pointers per RestrictInfo. They will
> remain unused when partitionwise join is not used. Right now, I do not
> see any increase in memory consumed by planner because of those
> pointers even in case of unpartitioned tables; maybe they are absorbed
> in memory alignment. They may show up as extra memory in the future. I
> am wondering whether we can instead save and track translations in
> PlannerInfo as a hash table using <keys (whatever is the answer to above
> question) of parent and child respectively> as key. That will just add one extra pointer in
> PlannerInfo when partitionwise join is not used. Please let me know
> your suggestions.

I will go ahead with a pointer in PlannerInfo for now.
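
To make that concrete, a minimal sketch of what I have in mind (type and
field names are made up for illustration, not taken from the patch):

    /* key: the parent clause plus the child relids it is translated to */
    typedef struct ChildRestrictInfoKey
    {
        RestrictInfo *parent_rinfo;
        Relids        child_relids;
    } ChildRestrictInfoKey;

    /* entry in a hash table hanging off PlannerInfo */
    typedef struct ChildRestrictInfoEntry
    {
        ChildRestrictInfoKey key;
        RestrictInfo *child_rinfo;    /* cached translation */
    } ChildRestrictInfoEntry;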

>
> 3. I have changed adjust_appendrel_attrs_mutator() to return a
> translated RestrictInfo if it already exists. IOW, it won't always
> return a deep copy of given RestrictInfo as it does today. This can be
> fixed by writing wrappers around adjust_appendrel_attrs() to translate
> RestrictInfo specifically. But maybe we don't always need deep copies.
> Are there any cases when we need translated deep copies of
> RestrictInfo? Those cases will require fixing callers of
> adjust_appendrel_attrs() instead of the mutator.

I think it's better to handle the tracking logic outside
adjust_appendrel_attrs. That will be some code churn but it will be
cleaner and won't affect anything other than partitionwise joins.

>
> 4. IIRC, when partitionwise join was implemented we had discussed
> creating child RestrictInfos using a login similar to
> build_joinrel_restrictlist(). That might be another way to build
> RestrictInfo only once and use it multiple times. But we felt that it
> was much harder problem to solve since it's not known which partitions
> from joining partitioned tables will match and will be joined till we
> enter try_partitionwise_join(), so the required RestrictInfos may not
> be available in RelOptInfo::joininfo. Let me know your thoughts on
> this.

Here's a lengthy description of why I feel translations are better
compared to computing the restrictlist and joininfo for a child join from
the joining relations' joininfo.
Consider a query "select * from p, q, r where p.c1 = q.c1 and q.c1 =
r.c1 and p.c2 + q.c2 < r.c2 and p.c3 != q.c3 and q.c3 != r.c3". The
query has the following clauses:
1. p.c1 = q.c1
2. q.c1 = r.c1
3. p.c2 + q.c2 < r.c2
4. p.c3 != q.c3
5. q.c3 != r.c3

The first two clauses are added to EC machinery and do not appear in
joininfo. They appear in restrictlist when we construct clauses in
restrictlist from ECs. Let's ignore them for now.

Assume that each table p, q, r has partitions (p1, p2, ...), (q1, q2,
...) and (r1, r2, ... ) respectively. Each triplet (pi, qi,ri) forms
the set of matching partitions from p, q, r respectively for all "i".
Consider join, p1q1r1. We will generate relations p1, q1, r1, p1q1,
p1r1, q1r1 and p1q1r1 while building the last join. Below is
description of how these clauses would look in each of these relations
and the list they appear in when computing that join. Please notice
the numeric suffixes carefully.

p1.
joininfo: p1.c2 + q.c2 < r.c2, p1.c3 != q.c3
restrictlist: <>

q1
joininfo: p.c2 + q1.c2 < r.c2, p.c3 != q1.c3, q1.c3 != r.c3
restrictlist: <>

r1
joininfo: p.c2 + q.c2 < r1.c2, q.c3 != r1.c3
restrictlist: <>

p1q1
joininfo: p1.c2 + q1.c2 < r.c2, q1.c3 != r.c3
restrictlist: p1.c3 != q1.c3

q1r1
joininfo: p.c2 + q1.c2 < r1.c2, p.c3 != q1.c3
restrictlist: q1.c3 != r1.c3

p1r1
joininfo: p1.c2 + q.c2 < r1.c2, p1.c3 != q.c3, q.c3 != r1.c3
restrictlist: <>

p1q1r1
joininfo: <>
restrictlist for (p1q1)r1: p1.c2 + q1.c2 < r1.c2, q1.c3 != r1.c3
restrictlist for (p1r1)q1: p1.c2 + q1.c2 < r1.c2, p1.c3 != q1.c3, q1.c3 != r1.c3
restrictlist for p1(q1r1): p1.c2 + q1.c2 < r1.c2, p1.c3 != q1.c3

If we translate the clauses when building join e.g. translate p1.c3 !=
q1.c3 when building p1q1 or p1q1r1, it would cause repeated
translations. So the translations need to be saved in lower relations
when we detect matching partitions and then 

AssertLog instead of Assert in some places

2023-08-11 Thread Ashutosh Bapat
Hi All,
PostgreSQL code uses Assert() as a way to
1. document the assumption or conditions which must be true at a given
place in code
2. make sure that some bug elsewhere does not break these assumptions or rules
3. check conditions which cannot be (easily) induced by user actions

E.g. the following conditions in adjust_appendrel_attrs():
/* If there's nothing to adjust, don't call this function. */
Assert(nappinfos >= 1 && appinfos != NULL);

/* Should never be translating a Query tree. */
Assert(node == NULL || !IsA(node, Query));

These conditions do not make it to non-assert builds and thus do not
make it to the production binary. That saves some CPU cycles.

When an Assert fails, and it fails quite a lot for developers, the
Postgres backend that caused the Assert is Aborted, restarting the
server. So a developer testing code that caused the Assert has to
start a psql again, run any session setup before running the faulting
query, gdb needs to be reattached to the new backend process. That's
some waste of time and frustration, esp. when the Assert does not
damage the backend itself e.g. by corrupting memory.

Most of the Asserts are recoverable by rolling back the transaction
without crashing the backend. So an elog(ERROR, ) is enough. But just
by themselves elogs are compiled into the non-debug binary, and the
condition check can waste CPU cycles, esp. when the conditions are mostly
true like the ones we use in Assert.

Attached patch combines Assert and elog(ERROR, ) so that when an
Assert is triggered in assert-enabled binary, it will throw an error
while keeping the backend intact. Thus it does not affect gdb session
or psql session. These elogs do not make their way to non-assert
binary so do not make it to production like Assert.

I have been using AssertLog for my work for some time. It is quite
convenient. With AssertLog I get
```
#explain (summary on) select * from a, b, c where a.c1 = b.c1 and b.c1
= c.c1 and a.c2 < b.c2 and a.c3 + b.c3 < c.c3;
ERROR:  failed Assert("equal(child_rinfo, adjust_appendrel_attrs(root,
(Node *) rinfo, nappinfos, appinfos))"), File: "relnode.c", Line: 997,
PID: 568088
```
instead of
```
#explain (summary on) select * from a, b, c where a.c1 = b.c1 and b.c1
= c.c1 and a.c2 < b.c2 and a.c3 + b.c3 < c.c3;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
@!#\q
```

If there is an interest in accepting the patch, I will add it to the
commitfest and work on it further.

--
Best Wishes,
Ashutosh Bapat
From 2464ac9de8a58846521d93db5c3c43a24d73f6b8 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Tue, 8 Aug 2023 14:37:59 +0530
Subject: [PATCH 1/6] Add AssertLog to report an error instead of Abort

Assert is used in the code to document assumptions, check conditions which
ought to be correct at given places in the code and so on. When an Assert
fails, the process Aborts, restarting the whole server. This is too disruptive
for development. Many of the Assert check conditions which are quite local in
nature, many a times very specific to the encapsulating transaction. In such
cases the backend can be brought out of the bad state just by aborting the
transaction. elog(ERROR) serves that purpose well. But when just elog is used
it get compiled in non-assert and non-debug binary. The condition around it
eats some CPU cycles in production.

This commit introduces AssertLog which brings together the benefits of both
Assert() and elog(). AssertLog is compiled only in assert enabled binaries. It
throws an ERROR when the condition fails instead of Aborting the process. Thus
the backend recovers simply by aborting the transaction and developers can
continue to debug, investigate without restarting their psql and reattaching
gdb.

Ashutosh Bapat
---
 src/include/c.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/include/c.h b/src/include/c.h
index f69d739be5..052b9441bc 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -66,6 +66,7 @@
 #endif
 #include 
 #include 
+#include 
 #include 
 #if defined(WIN32) || defined(__CYGWIN__)
 #include /* ensure O_BINARY is available */
@@ -840,12 +841,14 @@ typedef NameData *Name;
 #ifndef USE_ASSERT_CHECKING
 
 #define Assert(condition)	((void)true)
+#define AssertLog(condition)	((void)true)
 #define AssertMacro(condition)	((void)true)
 
 #elif defined(FRONTEND)
 
 #include 
 #define Assert(p) assert(p)
+#define AssertLog(p) assert(p)
 #define AssertMacro(p)	((void) assert(p))
 
 #else			/* USE_ASSERT_CHECKING && !FRONTEND */
@@ -860,6 +863,13 @@ typedef NameData *Name;
 			ExceptionalCondition(#condition, __FILE__, __LINE__); \
 	} while (0)
 
+#define AssertLog(condition) \
+	do { \
+		if (!(condition)) \
+			elog(ERROR, "failed Assert(\"%s\"), File: \"%s\", Line: %d, PID: %d", \
+		#condition , 

Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-08-11 Thread vignesh C
On Fri, 11 Aug 2023 at 16:26, vignesh C  wrote:
>
> On Wed, 9 Aug 2023 at 09:51, vignesh C  wrote:
> >
> > Hi Melih,
> >
> > Here is a patch to help in getting the execution details at various phases
> > like: a) replication slot creation time, b) WAL reading c) number of
> > WAL records read d) subscription relation state change etc.
> > Couple of observations while we tested with this patch:
> > 1) We noticed that the patch takes more time for finding the decoding
> > start point.
> > 2) Another observation was that the number of XLOG records read for
> > identify the consistent point was significantly high with the v26_0001
> > patch.
> >
> > HEAD
> > postgres=# select avg(counttime)/1000 "avgtime(ms)",
> > median(counttime)/1000 "median(ms)", min(counttime)/1000
> > "mintime(ms)", max(counttime)/1000 "maxtime(ms)", logtype from test
> > group by logtype;
> >             avgtime(ms) |     median(ms) | mintime(ms) | maxtime(ms) | logtype
> > ------------------------+----------------+-------------+-------------+--------------------------
> >  0.00579245283018867920 |         0.0020 |           0 |           1 | SNAPSHOT_BUILD
> >      1.2246811320754717 |         0.9855 |           0 |          37 | LOGICAL_SLOT_CREATION
> >    171.0863283018867920 |       183.9120 |           0 |         408 | FIND_DECODING_STARTPOINT
> >      2.0699433962264151 |         1.4380 |           1 |          49 | INIT_DECODING_CONTEXT
> > (4 rows)
> >
> > HEAD + v26-0001 patch
> > postgres=# select avg(counttime)/1000 "avgtime(ms)",
> > median(counttime)/1000 "median(ms)", min(counttime)/1000
> > "mintime(ms)", max(counttime)/1000 "maxtime(ms)", logtype from test
> > group by logtype;
> >             avgtime(ms) |     median(ms) | mintime(ms) | maxtime(ms) | logtype
> > ------------------------+----------------+-------------+-------------+--------------------------
> >  0.00588113207547169810 |         0.0050 |           0 |           0 | SNAPSHOT_BUILD
> >      1.1270962264150943 |         1.1000 |           0 |           2 | LOGICAL_SLOT_CREATION
> >    301.1745528301886790 |       410.4870 |           0 |         427 | FIND_DECODING_STARTPOINT
> >      1.4814660377358491 |         1.4530 |           1 |           9 | INIT_DECODING_CONTEXT
> > (4 rows)
> >
> > In the above FIND_DECODING_STARTPOINT is very much higher with V26-0001 
> > patch.
> >
> > HEAD
> > FIND_DECODING_XLOG_RECORD_COUNT
> > - average =  2762
> > - median = 3362
> >
> > HEAD + reuse worker patch(v26_0001 patch)
> > Where FIND_DECODING_XLOG_RECORD_COUNT
> > - average =  4105
> > - median = 5345
> >
> > Similarly Number of xlog records read is higher with v26_0001 patch.
> >
> > Steps to calculate the timing:
> > -- first collect the necessary LOG from subscriber's log.
> > cat *.log | grep -E
> > '(LOGICAL_SLOT_CREATION|INIT_DECODING_CONTEXT|FIND_DECODING_STARTPOINT|SNAPSHOT_BUILD|FIND_DECODING_XLOG_RECORD_COUNT|LOGICAL_XLOG_READ|LOGICAL_DECODE_PROCESS_RECORD|LOGICAL_WAIT_TRANSACTION)'
> > > grep.dat
> >
> > create table testv26(logtime varchar, pid varchar, level varchar,
> > space varchar, logtype varchar, counttime int);
> > -- then copy these datas into db table to count the avg number.
> > COPY testv26 FROM '/home/logs/grep.dat' DELIMITER ' ';
> >
> > -- Finally, use the SQL to analyze the data:
> > select avg(counttime)/1000 "avgtime(ms)", logtype from testv26 group by 
> > logtype;
> >
> > --- To get the number of xlog records read:
> > select avg(counttime) from testv26 where logtype
> > ='FIND_DECODING_XLOG_RECORD_COUNT' and counttime != 1;
> >
> > Thanks to Peter and Hou-san who helped in finding these out. We are
> > parallely analysing this, @Melih Mutlu  posting this information so
> > that it might help you too in analysing this issue.
>
> I analysed further on why it needs to read a larger number of XLOG
> records in some cases while creating the replication slot, here are my
> thoughts:
> Note: The tablesync worker needs to connect to the publisher and create a
> consistent point for the slots by reading the XLOG records. This
> requires that all the open transactions, and the transactions that are
> created while the consistent point is being created, be committed.
> I feel the creation of slots is better in a few cases on HEAD because:
> Publisher                 | Subscriber
> --------------------------+------------------------------------------
> Begin txn1 transaction    |
> Insert 1..1000 records    |
> Commit                    |
> Begin txn2 transaction    |
> Insert 1..1000 records    | Apply worker applies transaction txn1
>                           | Start tablesync table t2
>                           | create consistent point in publisher
>                           |   before transaction txn3 is started
> commit                    | We just need 

Re: Let's make PostgreSQL multi-threaded

2023-08-11 Thread Merlin Moncure
On Thu, Jul 27, 2023 at 8:28 AM David Geier  wrote:

> Hi,
>
> On 6/7/23 23:37, Andres Freund wrote:
> > I think we're starting to hit quite a few limits related to the process
> model,
> > particularly on bigger machines. The overhead of cross-process context
> > switches is inherently higher than switching between threads in the same
> > process - and my suspicion is that that overhead will continue to
> > increase. Once you have a significant number of connections we end up
> spending
> > a *lot* of time in TLB misses, and that's inherent to the process model,
> > because you can't share the TLB across processes.
>
> Another problem I haven't seen mentioned yet is the excessive kernel
> memory usage because every process has its own set of page table entries
> (PTEs). Without huge pages the amount of wasted memory can be huge if
> shared buffers are big.


Hm, noted this upthread, but asking again: does this
help interactions with the operating system and make OOM-kill
situations less likely?   These things are the bane of my existence, and
I'm having a hard time finding a solution that prevents them other than
running pgbouncer and lowering max_connections, which adds complexity.  I
suspect I'm not the only one dealing with this.   What's really scary about
these situations is they come without warning.  Here's a pretty typical
example per sar -r.

 kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
 14:20:02   461612  15803476 97.16 0  11120280  12346980   60.35  10017820   4806356   220
 14:30:01   378244  15886844 97.67 0  11239012  12296276   60.10  10003540   4909180   240
 14:40:01   308632  15956456 98.10 0  11329516  12295892   60.10  10015044   4981784   200
 14:50:01   458956  15806132 97.18 0  11383484  12101652   59.15   9853612   5019916   112
 15:00:01 10592736   5672352 34.87 0   4446852   8378324   40.95   1602532   3473020   264   <-- reboot!
 15:10:01  9151160   7113928 43.74 0   5298184   8968316   43.83   2714936   3725092   124
 15:20:01  8629464   7635624 46.94 0   6016936   8777028   42.90   2881044   4102888   148
 15:30:01  8467884   7797204 47.94 0   6285856   8653908   42.30   2830572   4323292   436
 15:40:02  8077480   8187608 50.34 0   6828240   8482972   41.46   2885416   4671620   320
 15:50:01  7683504   8581584 52.76 0   7226132   8511932   41.60   2998752   4958880   308
 16:00:01  7239068   9026020 55.49 0   7649948   8496764   41.53   3032140   5358388   232
 16:10:01  7030208   9234880 56.78 0   7899512   8461588   41.36   3108692   5492296   216

Triggering query was heavy (maybe even runaway), server load was minimal
otherwise:

 CPU          %user     %nice   %system   %iowait    %steal     %idle
 14:30:01 all  9.55      0.00      0.63      0.02      0.00     89.81
 14:40:01 all  9.95      0.00      0.69      0.02      0.00     89.33
 14:50:01 all 10.22      0.00      0.83      0.02      0.00     88.93
 15:00:01 all 10.62      0.00      1.63      0.76      0.00     86.99
 15:10:01 all  8.55      0.00      0.72      0.12      0.00     90.61

The conjecture here is that lots of idle connections leave the server with
less memory actually available than it appears to have, and sudden transient
demands can cause it to destabilize.

Just throwing it out there: if it can be shown to help, it may be supportive
of moving forward with something like this, either instead of, or along
with, O_DIRECT or other internalized database memory management
strategies.  Fewer context switches, faster page access etc. are of
course nice but would not be a game changer for the workloads we see, which
are pretty varied (OLTP, analytics), although we don't have extremely high
transaction rates.

merlin


Re: libpq compression (part 2)

2023-08-11 Thread Andrey M. Borodin



> On 10 Aug 2023, at 19:47, Jonah H. Harris  wrote:
> 
> Pinging to see if anyone has continued to work on this behind-the-scenes or 
> whether this is the latest patch set there is.

It's still on my TODO list, but I haven't done many review cycles yet. And the 
patch series already needs a heavy rebase.

Thanks for your interest in the topic.


Best regards, Andrey Borodin.



Re: Adding a LogicalRepWorker type field

2023-08-11 Thread Amit Kapila
On Fri, Aug 11, 2023 at 3:41 PM Peter Smith  wrote:
>
> On Fri, Aug 11, 2023 at 7:33 PM Amit Kapila  wrote:
> >
> > On Thu, Aug 10, 2023 at 7:50 AM Peter Smith  wrote:
> > >
> > > >
> > > > * If you do the above then there won't be a need to change the
> > > > variable name is_parallel_apply_worker in logicalrep_worker_launch.
> > > >
> > >
> > > Done.
> > >
> >
> > I don't think the addition of two new macros isTablesyncWorker() and
> > isLeaderApplyWorker() adds much value, so removed those and ran
> > pgindent. I am planning to commit this patch early next week unless
> > you or others have any comments.
> >
>
> Thanks for considering this patch fit for pushing.
>
> Actually, I recently found 2 more overlooked places in the launcher.c
> code which can benefit from using the isTablesyncWorker(w) macro that
> was removed in patch v6-0001.
>

@@ -1301,7 +1301,7 @@ pg_stat_get_subscription(PG_FUNCTION_ARGS)
  worker_pid = worker.proc->pid;

  values[0] = ObjectIdGetDatum(worker.subid);
- if (OidIsValid(worker.relid))
+ if (isTablesyncWorker())
  values[1] = ObjectIdGetDatum(worker.relid);

I don't see this as a good fit for using isTablesyncWorker(). If we
were returning worker_type then using it would be okay.


-- 
With Regards,
Amit Kapila.




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-08-11 Thread vignesh C
On Wed, 9 Aug 2023 at 09:51, vignesh C  wrote:
>
> Hi Melih,
>
> Here is a patch to help in getting the execution at various phases
> like: a) replication slot creation time, b) Wal reading c) Number of
> WAL records read d) subscription relation state change etc
> Couple of observation while we tested with this patch:
> 1) We noticed that the patch takes more time for finding the decoding
> start point.
> 2) Another observation was that the number of XLOG records read for
> identify the consistent point was significantly high with the v26_0001
> patch.
>
> HEAD
> postgres=# select avg(counttime)/1000 "avgtime(ms)",
> median(counttime)/1000 "median(ms)", min(counttime)/1000
> "mintime(ms)", max(counttime)/1000 "maxtime(ms)", logtype from test
> group by logtype;
>       avgtime(ms)       |   median(ms)   | mintime(ms) | maxtime(ms) |         logtype
> ------------------------+----------------+-------------+-------------+--------------------------
>  0.00579245283018867920 |         0.0020 |           0 |           1 | SNAPSHOT_BUILD
>      1.2246811320754717 |         0.9855 |           0 |          37 | LOGICAL_SLOT_CREATION
>    171.0863283018867920 |       183.9120 |           0 |         408 | FIND_DECODING_STARTPOINT
>      2.0699433962264151 |         1.4380 |           1 |          49 | INIT_DECODING_CONTEXT
> (4 rows)
>
> HEAD + v26-0001 patch
> postgres=# select avg(counttime)/1000 "avgtime(ms)",
> median(counttime)/1000 "median(ms)", min(counttime)/1000
> "mintime(ms)", max(counttime)/1000 "maxtime(ms)", logtype from test
> group by logtype;
>       avgtime(ms)       |   median(ms)   | mintime(ms) | maxtime(ms) |         logtype
> ------------------------+----------------+-------------+-------------+--------------------------
>  0.00588113207547169810 |         0.0050 |           0 |           0 | SNAPSHOT_BUILD
>      1.1270962264150943 |         1.1000 |           0 |           2 | LOGICAL_SLOT_CREATION
>    301.1745528301886790 |       410.4870 |           0 |         427 | FIND_DECODING_STARTPOINT
>      1.4814660377358491 |         1.4530 |           1 |           9 | INIT_DECODING_CONTEXT
> (4 rows)
>
> In the above FIND_DECODING_STARTPOINT is very much higher with V26-0001 patch.
>
> HEAD
> FIND_DECODING_XLOG_RECORD_COUNT
> - average =  2762
> - median = 3362
>
> HEAD + reuse worker patch(v26_0001 patch)
> Where FIND_DECODING_XLOG_RECORD_COUNT
> - average =  4105
> - median = 5345
>
> Similarly Number of xlog records read is higher with v26_0001 patch.
>
> Steps to calculate the timing:
> -- first collect the necessary LOG from subscriber's log.
> cat *.log | grep -E
> '(LOGICAL_SLOT_CREATION|INIT_DECODING_CONTEXT|FIND_DECODING_STARTPOINT|SNAPSHOT_BUILD|FIND_DECODING_XLOG_RECORD_COUNT|LOGICAL_XLOG_READ|LOGICAL_DECODE_PROCESS_RECORD|LOGICAL_WAIT_TRANSACTION)'
> > grep.dat
>
> create table testv26(logtime varchar, pid varchar, level varchar,
> space varchar, logtype varchar, counttime int);
> -- then copy these datas into db table to count the avg number.
> COPY testv26 FROM '/home/logs/grep.dat' DELIMITER ' ';
>
> -- Finally, use the SQL to analyze the data:
> select avg(counttime)/1000 "avgtime(ms)", logtype from testv26 group by 
> logtype;
>
> --- To get the number of xlog records read:
> select avg(counttime) from testv26 where logtype
> ='FIND_DECODING_XLOG_RECORD_COUNT' and counttime != 1;
>
> Thanks to Peter and Hou-san who helped in finding these out. We are
> parallely analysing this, @Melih Mutlu  posting this information so
> that it might help you too in analysing this issue.

I analysed further why it needs to read a larger number of XLOG
records in some cases while creating the replication slot; here are my
thoughts:
Note: The tablesync worker needs to connect to the publisher and create
a consistent point for the slots by reading the XLOG records. This
requires that all the open transactions, and the transactions started
while the consistent point is being created, be committed.
I feel the creation of slots is better in a few cases on HEAD because:
Publisher| Subscriber

Begin txn1 transaction|
Insert 1..1000 records|
Commit   |
Begin txn2 transaction|
Insert 1..1000 records |  Apply worker applies transaction txn1
|  Start tablesync table t2
|  create consistent point in
| publisher before transaction txn3 is
| started
commit|  We just need to wait till
| transaction txn2 is finished.
Begin txn3 transaction|
Insert 1..1000 records |
commit|

In V26, this is happening in some cases:
Publisher| Subscriber

Re: Adding a LogicalRepWorker type field

2023-08-11 Thread Peter Smith
On Fri, Aug 11, 2023 at 7:33 PM Amit Kapila  wrote:
>
> On Thu, Aug 10, 2023 at 7:50 AM Peter Smith  wrote:
> >
> > >
> > > * If you do the above then there won't be a need to change the
> > > variable name is_parallel_apply_worker in logicalrep_worker_launch.
> > >
> >
> > Done.
> >
>
> I don't think the addition of two new macros isTablesyncWorker() and
> isLeaderApplyWorker() adds much value, so removed those and ran
> pgindent. I am planning to commit this patch early next week unless
> you or others have any comments.
>

Thanks for considering this patch fit for pushing.

Actually, I recently found 2 more overlooked places in the launcher.c
code which can benefit from using the isTablesyncWorker(w) macro that
was removed in patch v6-0001.

I have posted another v7 (v7-0001 is identical to v6-0001). The
v7-0002 patch has the isTablesyncWorker changes. I think wherever
possible it is better to check the worker type via a macro instead of
deducing it from fields like 'relid', and patch v7-0002 makes the code
more consistent with the other nearby isParallelApplyWorker checks in
launcher.c.
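
For example (illustration only -- this is not the v7-0002 hunk, and the
macro body shown is just one plausible definition given the worker-type
field this thread adds), the kind of call-site change I mean is:

    /* one plausible definition; the actual patch may differ */
    #define isTablesyncWorker(w)    (OidIsValid((w)->relid))

    /* instead of deducing the worker type from a field ... */
    if (OidIsValid(worker.relid))
        values[1] = ObjectIdGetDatum(worker.relid);

    /* ... the call site states the worker type it cares about */
    if (isTablesyncWorker(&worker))
        values[1] = ObjectIdGetDatum(worker.relid);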

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v7-0002-Add-isTablesyncWorker.patch
Description: Binary data


v7-0001-Simplify-determining-logical-replication-worker-t.patch
Description: Binary data


Re: Inconsistent results with libc sorting on Windows

2023-08-11 Thread Juan José Santamaría Flecha
On Fri, Aug 11, 2023 at 7:29 AM Noah Misch  wrote:

>
> The LCMapStringEx() solution is elegant.  I do see
>
> https://learn.microsoft.com/en-us/windows/win32/intl/handling-sorting-in-your-applications
> says, "If an application calls the function to create a sort key for a
> string
> containing an Arabic kashida, the function creates no sort key value."
> That's
> aggravating.
>

I think the problem there is that it is just poorly explained.
Take for example the attached program, which compares the "normal" and
"kashida" forms of 'Raħīm' taken from [1]; you'll get:

c1 = c2
c2 = c1

meaning that "normal" and "kashida" are the same string. So, probably that
phrase should read something like: "If an application calls the function to
create a sort key for a string containing an Arabic kashida character, the
function will ignore that character and no sort key value will be generated
for it."

[1] https://en.wikipedia.org/wiki/Kashida

Regards,

Juan José Santamaría Flecha
#include "windows.h"
#include "stdio.h"

static char
comparison(int r)
{
switch (r) {
case 1:
return '>';
case -1:
return '<';
case 0:
return '=';
default:
return '!';
}
}

int
wmain(int argc, wchar_t* argv[])
{
const wchar_t *s1 = L"\u0631\u062d\u064a\u0645";
const wchar_t *s2 = 
L"\u0631\u062d\u0640\u0640\u0640\u0640\u0640\u0640\u064a\u0645";
int len;
wchar_t c1[13];
wchar_t c2[13];
int result;

len = LCMapStringEx(L"es-US", LCMAP_SORTKEY, s1, -1, c1, 13, NULL, NULL, 0);
len = LCMapStringEx(L"es-US", LCMAP_SORTKEY, s2, -1, c2, 13, NULL, NULL, 0);

result = memcmp(c1, c2, 13);
printf("c1 %c c2\n", comparison(result));
result = memcmp(c2, c1, 13);
printf("c2 %c c1\n", comparison(result));

return 0;
}

Re: Adding a LogicalRepWorker type field

2023-08-11 Thread Amit Kapila
On Thu, Aug 10, 2023 at 7:50 AM Peter Smith  wrote:
>
> >
> > * If you do the above then there won't be a need to change the
> > variable name is_parallel_apply_worker in logicalrep_worker_launch.
> >
>
> Done.
>

I don't think the addition of two new macros isTablesyncWorker() and
isLeaderApplyWorker() adds much value, so removed those and ran
pgindent. I am planning to commit this patch early next week unless
you or others have any comments.

-- 
With Regards,
Amit Kapila.


v6-0001-Simplify-determining-logical-replication-worker-t.patch
Description: Binary data


logicalrep_worker_launch -- counting/checking the worker limits

2023-08-11 Thread Peter Smith
While reviewing other threads I have been looking more closely at the
logicalrep_worker_launch() function. IMO the logic of that
function seems not quite right.

Here are a few things I felt are strange:

1. The function knows exactly what type of worker it is launching, but
still, it is calling the worker counting functions
logicalrep_sync_worker_count() and  logicalrep_pa_worker_count() even
when launching a worker of a *different* type.

1a. I think it should only count/check the tablesync worker limit when
trying to launch a tablesync worker.

1b. I think it should only count/check the parallel apply worker limit
when trying to launch a parallel apply worker.

~

2. There is some condition for attempting the garbage-collection of workers:

/*
* If we didn't find a free slot, try to do garbage collection.  The
* reason we do this is because if some worker failed to start up and its
* parent has crashed while waiting, the in_use state was never cleared.
*/
if (worker == NULL || nsyncworkers >= max_sync_workers_per_subscription)

The inclusion of that nsyncworkers check here has very subtle
importance. AFAICT this means that even if we *did* find a free
worker, we still need to do garbage collection just in case one of
those 'in_use' tablesync worker was in error (e.g. crashed after
marked in_use). By garbage-collecting (and then re-counting
nsyncworkers) we might be able to launch the tablesync successfully
instead of just returning that it has maxed out.

2a. IIUC that is all fine. The problem is that I think there should be
exactly the same logic for the parallel apply workers here also.

2b. The comment above should explain more about the reason for the
nsyncworkers condition -- the existing comment doesn't really cover
it.
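
To make 1 and 2 concrete, the shape I have in mind is roughly like the
below (illustration only -- this is not the attached patch, the retry
helper name is made up, and the real function has more surrounding
context for slot selection and error reporting):

    if (is_tablesync_worker)
    {
        /* 1a: only the tablesync limit matters when launching tablesync */
        nsyncworkers = logicalrep_sync_worker_count(subid);
        if (worker == NULL ||
            nsyncworkers >= max_sync_workers_per_subscription)
            retry_after_garbage_collection();   /* made-up helper */
    }
    else if (is_parallel_apply_worker)
    {
        /* 1b + 2a: the same pattern for parallel apply workers */
        nparallelapplyworkers = logicalrep_pa_worker_count(subid);
        if (worker == NULL ||
            nparallelapplyworkers >= max_parallel_apply_workers_per_subscription)
            retry_after_garbage_collection();   /* made-up helper */
    }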

~

3. There is an incorrect cut/paste comment in the body of
logicalrep_sync_worker_count().

That comment should be changed to read similarly to the equivalent
comment in logicalrep_pa_worker_count().

--

PSA a patch to address all these items.

This patch is about making the function logically consistent. Removing
some of the redundant counting should also be more efficient in
theory, but in practice, I think the unnecessary worker loops are too
short (max_logical_replication_workers) for any performance
improvements to be noticeable.

Thoughts?

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-logicalrep_worker_launch-limit-checks.patch
Description: Binary data


Re: proposal: psql: show current user in prompt

2023-08-11 Thread Pavel Stehule
On Thu, Aug 10, 2023 at 4:31 PM Jelte Fennema  wrote:

> On Thu, 10 Aug 2023 at 14:44, Pavel Stehule 
> wrote:
> > čt 10. 8. 2023 v 14:05 odesílatel Jelte Fennema 
> napsal:
> >> That it is not rolled-back
> >> in a case like this?
> >>
> >> BEGIN;
> >> \set PROMPT '%N'
> >> ROLLBACK;
> >
> >
> > surely not.
> >
> > \set is client side setting, and it is not transactional. Attention -
> "\set" and "set" commands are absolutely different creatures.
>
> To clarify: I agree it's the desired behavior that \set is not rolled back.
>
> > It Would be strange (can be very messy) if I had a message like "cannot
> set a prompt, because you should do ROLLBACK first"
>
> This was a very helpful sentence for my understanding. To double check
> that I'm understanding you correctly. This is the kind of case that
> you're talking about.
>
> postgres=# BEGIN;
> postgres=# SELECT some syntax error;
> ERROR:  42601: syntax error at or near "some"
> postgres=# \set PROMPT '%N'
> ERROR:  25P02: current transaction is aborted, commands ignored until
> end of transaction block
>

yes

>
> I agree that it should not throw an error like that. So indeed a
> dedicated message type is needed for psql too. Because any query will
> cause that error.
>


>
> But afaict there's no problem with using pqParseInput3() and
> PQexecFinish() even if the message isn't handled as part of the
> transaction. Some other messages that pqParseInput3 handles which are
> not part of the transaction are 'N' (Notice) and 'K' (secret key).
>

I have to recheck it

Regards

Pavel


Re: logical decoding and replication of sequences, take 2

2023-08-11 Thread Ashutosh Bapat
On Tue, Aug 1, 2023 at 8:46 PM Tomas Vondra
 wrote:
>
> Anyway, I think this is "just" a matter of efficiency, not correctness.
> IMHO there are bigger questions regarding the "going back" behavior
> after apply restart.


sequence_decode() has the following code
    /* Skip the change if already processed (per the snapshot). */
    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

This means that if the subscription restarts, the upstream will *not*
send any non-transactional sequence changes with an LSN prior to the LSN
specified by the START_REPLICATION command. That should avoid replicating
all the non-transactional sequence changes since
ReplicationSlot::restart_lsn if the subscription restarts.

But in apply_handle_sequence(), we do not update
replorigin_session_origin_lsn with the LSN of the non-transactional
sequence change when it's applied. This means that if a subscription
restarts while it is halfway through applying a transaction, those
changes will be replicated again. This will move the sequence
backward. If the subscription keeps restarting again and again while
applying that transaction, we will see the sequence "rubber banding"
[1] on the subscription. So until the transaction is completely applied,
other users of the sequence may see duplicate values. I think this is
undesirable.

But I am not able to find a case where this can lead to conflicting
values after failover. If there's only one transaction which is
repeatedly being applied, the rows which use sequence values were
never committed so there's no conflicting value present on the
subscription. The same reasoning can be extended to multiple in-flight
transactions. If another transaction (T2) uses the sequence values
changed by in-flight transaction T1 and if T2 commits before T1, the
sequence changes used by T2 must have LSNs before commit of T2 and
thus they will never be replicated. (See example below).

T1
insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q1
T2
insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q2
COMMIT;
T1
insert into t1 select nextval('seq'), ... from generate_series(1, 100); - Q13
COMMIT;

So I am not able to imagine a case when a sequence going backward can
cause conflicting values.

But whether or not that's the case, as a principle the downstream should
not request (and hence receive) any changes that have already been applied
(and committed) downstream. I think a way to achieve this is to update
replorigin_session_origin_lsn so that a sequence change applied once is
not requested (and hence sent) again.
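
Roughly, what I mean is something like the below in the non-transactional
path of apply_handle_sequence() (illustration only, not a patch; the local
variable name is made up, and I am hand-waving over when the advanced
value gets flushed back in the feedback message):

    /*
     * Remember the LSN of the non-transactional sequence change we just
     * applied, so that a restarted subscription does not request it again.
     */
    if (seq_lsn > replorigin_session_origin_lsn)
        replorigin_session_origin_lsn = seq_lsn;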

[1] https://en.wikipedia.org/wiki/Rubber_banding

--
Best Wishes,
Ashutosh Bapat