Re: EINTR in ftruncate()

2022-07-06 Thread Thomas Munro
On Thu, Jul 7, 2022 at 9:05 AM Thomas Munro  wrote:
> On Thu, Jul 7, 2022 at 9:03 AM Andres Freund  wrote:
> > On 2022-07-07 08:56:33 +1200, Thomas Munro wrote:
> > > On Thu, Jul 7, 2022 at 8:39 AM Andres Freund  wrote:
> > > > So I think we need: 1) block most signals, 2) a retry loop *without*
> > > > interrupt checks.

Here's a draft patch that tries to explain all this in the commit
message and comments.

Even if we go with this approach now, I think it's plausible that we
might want to reconsider this yet again one day, perhaps allocating
memory with some future asynchronous infrastructure while still
processing interrupts.  It's not very nice to hold up recovery or
ProcSignalBarrier for long operations.
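
To make that shape concrete, here is a minimal standalone sketch of the
approach (plain POSIX calls; the draft patch itself goes through the
PG_SETMASK() macro, per the patch below):

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>

/*
 * Sketch only: allocate the whole region with signals blocked so that
 * posix_fallocate() cannot be starved by a stream of EINTRs.  SIGQUIT
 * stays unblocked: its handler doesn't return, and it leaves a way to
 * kill the process without SIGKILL.
 */
static int
fallocate_blocked(int fd, off_t size)
{
    sigset_t    block_sigs,
                save_sigs;
    int         rc;

    sigfillset(&block_sigs);
    sigdelset(&block_sigs, SIGQUIT);
    sigprocmask(SIG_SETMASK, &block_sigs, &save_sigs);

    /* Retry without interrupt checks; only debugger noise lands here. */
    do
    {
        rc = posix_fallocate(fd, 0, size);
    } while (rc == EINTR);

    sigprocmask(SIG_SETMASK, &save_sigs, NULL);
    return rc;                  /* 0 on success, else an errno value */
}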

I'm a little unclear about ftruncate() here.  I don't expect it to
report EINTR in other places where we use it (ie to make a local file
on a non-"slow device" smaller), because I expect that to be like
read(), write() etc which we don't wrap in EINTR loops.  Here you've
observed EINTR when messing around with a debugger*.  It seems
inconsistent to put posix_fallocate() in an EINTR retry loop for the
benefit of debuggers, but not ftruncate().  But perhaps that's good
enough, on the theory that posix_fallocate(1GB) is a very large target
and you have a decent chance of hitting it.

Another observation while staring at that ftruncate():  It's entirely
redundant on Linux, because we only ever call dsm_impl_posix_resize()
to make the segment bigger.  Before commit 3c60d0fa (v12) it was
possible to resize a segment to be smaller with dsm_resize(), so you
needed one or t'other depending on the requested size and we just
called both, but dsm_resize() wasn't ever used AFAIK and didn't even
work on all DSM implementations, among other problems, so we ripped it
out.  So... on master at least, we could also change the #ifdef to be
either-or.  While refactoring like that, I think we might as well also
rearrange the code so that the wait event is reported also for other
OSes, just in case it takes a long time.  See 0002 patch.
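
For what it's worth, the either-or arrangement could look roughly like
this (a sketch of what 0002 describes, not the patch itself;
WAIT_EVENT_DSM_FILL_ZERO_WRITE is the wait event the existing
posix_fallocate() path already reports):

int         rc;

pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE);
#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
/* On Linux we only ever grow segments, so ftruncate() is redundant. */
do
{
    rc = posix_fallocate(fd, 0, request_size);
} while (rc == EINTR);
if (rc != 0)
    errno = rc;         /* posix_fallocate() returns the error directly */
#else
while ((rc = ftruncate(fd, request_size)) < 0 && errno == EINTR)
    ;
#endif
pgstat_report_wait_stop();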

*It's funny that ftruncate() apparently doesn't automatically restart
for ptrace SIGCONT on Linux according to your report, while poll()
does according to my experiments, even though the latter is one of the
never-restart functions (it doesn't on other OSes I hack on, and you
feel the difference when debugging missing wakeup type bugs...).
Random implementation details...
From d22c937914906135fddc8635792641089ba3e2cd Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 7 Jul 2022 15:15:11 +1200
Subject: [PATCH 1/2] Block signals while allocating DSM memory.

On Linux, we call posix_fallocate() on shm_open()'d memory to avoid
later potential SIGBUS (see commit 899bd785).

Based on field reports of systems stuck in an EINTR retry loop there,
we made it possible to break out of that loop via slightly odd
coding where the CHECK_FOR_INTERRUPTS() call was somewhat removed from
the loop (see commit 422952ee).

On further reflection, that was not a great choice for at least two
reasons:

1.  If interrupts were held, the CHECK_FOR_INTERRUPTS() would do nothing
and the EINTR error would be surfaced to the user.

2.  If EINTR was reported but neither QueryCancelPending nor
ProcDiePending was set, then we'd dutifully retry, but with a bit more
understanding of how posix_fallocate() works, it's now clear that you
can get into a loop that never terminates.  posix_fallocate() is not a
function that can do some of the job and tell you about progress if it's
interrupted; it has to undo what it's done so far and report EINTR, and
if signals keep arriving faster than it can complete (cf recovery
conflict signals), you're stuck.

Therefore, for now, we'll simply block most signals to guarantee
progress.  SIGQUIT is not blocked (see InitPostmasterChild()), because
its expected handler doesn't return, and unblockable signals like
SIGCONT are not expected to arrive at a high rate.  For good measure,
we'll include the ftruncate() call in the blocked region.

Back-patch to all supported releases.

Reported-by: Alvaro Herrera 
Reported-by: Nicola Contu 
Discussion: https://postgr.es/m/20220701154105.jjfutmngoedgiad3%40alvherre.pgsql
---
 src/backend/storage/ipc/dsm_impl.c | 27 +++
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 323862a3d2..aa47e5de70 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -62,6 +62,7 @@
 #endif
 
 #include "common/file_perm.h"
+#include "libpq/pqsignal.h"		/* for PG_SETMASK macro */
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "portability/mem.h"
@@ -306,14 +307,6 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 		shm_unlink(name);
 		errno = save_errno;
 
-		/*
-		 * If we received a query cancel or termination signal, we will have
-		 * EINTR set here.  If the caller said that errors are 

Re: Eliminating SPI from RI triggers - take 2

2022-07-06 Thread Amit Langote
On Wed, Jul 6, 2022 at 11:55 AM Amit Langote  wrote:
> On Wed, Jul 6, 2022 at 3:24 AM Jacob Champion  wrote:
> > On Thu, Jun 30, 2022 at 11:23 PM Amit Langote  
> > wrote:
> > >
> > > I will continue investigating what to do about points (1) and (2)
> > > mentioned above and see if we can do away with using SQL in the
> > > remaining cases.
> >
> > Hi Amit, looks like isolation tests are failing in cfbot:
> >
> > https://cirrus-ci.com/task/6642884727275520
> >
> > Note also the uninitialized variable warning that cfbot picked up;
> > that may or may not be related.
>
> Thanks for the heads up.
>
> Yeah, I noticed the warning when I compiled with a different set of
> gcc parameters, though not the isolation test failures, so not sure
> what the bot is running into.
>
> Attaching updated patches which fix the warning and a few other issues
> I noticed.

Hmm, cfbot is telling me that detach-partition-concurrently-2 is
failing on Cirrus-CI [1].  Will look into it.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

[1] https://cirrus-ci.com/task/5253369525698560?logs=test_world#L317




Re: pg_upgrade (12->14) fails on aggregate

2022-07-06 Thread Michael Paquier
On Tue, Jul 05, 2022 at 11:29:03PM -0400, Tom Lane wrote:
> Yeah.  I think that 08385ed26 fixes this, but we've had no new
> reports yet :-(

Indeed.  Things are all right now.  Thanks!
--
Michael




Re: Bump MIN_WINNT to 0x0600 (Vista) as minimal runtime in 16~

2022-07-06 Thread Michael Paquier
On Thu, Jul 07, 2022 at 09:11:57AM +0900, Michael Paquier wrote:
> Okay, thanks for confirming.  I think that I'll give it a try today
> then, my schedule would fit nicely with the buildfarm monitoring.

And I have applied that, after noticing that MinGW was complaining
because _WIN32_WINNT was no longer getting set as it was previously,
and removing _WIN32_WINNT as there is no need for it anymore.  The CI
has reported green for all my tests, so I am rather confident that I
have not messed something up.  Now, let's see what the buildfarm
members tell us.  This should take a couple of hours.
--
Michael




RE: Perform streaming logical transactions by background workers and parallel apply

2022-07-06 Thread wangw.f...@fujitsu.com
On Mon, Jul 4, 2022 at 14:47 AM Peter Smith  wrote:
> Below are some review comments for patch v14-0004:

Thanks for your comments.

> 4.0 General.
> 
> This comment is an after-thought but as I write this mail I am
> wondering if most of this 0004 patch is even necessary at all? Instead
> of introducing a new column and all the baggage that goes with it,
> can't the same functionality be achieved just by toggling the
> streaming mode 'substream' value from 'p' (parallel) to 't' (on)
> whenever an error occurs causing a retry? Anyway, if you do change it
> this way then most of the following comments can be disregarded.

In the approach that you mentioned, after retrying, the transaction will always
be applied in "on" mode. This will change the user's setting.
That is to say, in most cases, the user needs to manually reset the option
"streaming" to "parallel". I think that might not be very friendly.

> 4.6 src/backend/replication/logical/worker.c
> 
> 4.6.a - apply_handle_commit
> 
> + /* Set the flag that we will not retry later. */
> + set_subscription_retry(false);
> 
> But the comment is wrong, isn't it? Shouldn’t it just say that we are
> not *currently* retrying? And can’t this just anyway be redundant if
> only the catalog column has a DEFAULT value of false?
> 
> 4.6.b - apply_handle_prepare
> Ditto
> 
> 4.6.c - apply_handle_commit_prepared
> Ditto
> 
> 4.6.d - apply_handle_rollback_prepared
> Ditto
> 
> 4.6.e - apply_handle_stream_prepare
> Ditto
> 
> 4.6.f - apply_handle_stream_abort
> Ditto
> 
> 4.6.g - apply_handle_stream_commit
> Ditto

Set the default value of the field "subretry" to "false" as you suggested.
We need to reset this field to false after retrying to apply a streaming
transaction in the main apply worker ("on" mode).
I agree this comment is not clear, so I changed it to
```
Reset the retry flag.
```

> 4.7 src/backend/replication/logical/worker.c
> 
> 4.7.a - start_table_sync
> 
> @@ -3894,6 +3917,9 @@ start_table_sync(XLogRecPtr *origin_startpos,
> char **myslotname)
>   }
>   PG_CATCH();
>   {
> + /* Set the flag that we will retry later. */
> + set_subscription_retry(true);
> 
> Maybe this should say more like "Flag that the next apply will be the
> result of a retry"
> 
> 4.7.b - start_apply
> Ditto

Similar to the reply in #4.6, I changed it to `Set the retry flag.`.

> 4.8 src/backend/replication/logical/worker.c - set_subscription_retry
> 
> +
> +/*
> + * Set subretry of pg_subscription catalog.
> + *
> + * If retry is true, subscriber is about to exit with an error. Otherwise, it
> + * means that the changes was applied successfully.
> + */
> +static void
> +set_subscription_retry(bool retry)
> 
> "changes" -> "change" ?

I did not make it clear before.
I modified "changes" to "transaction".

> 4.8 src/backend/replication/logical/worker.c - set_subscription_retry
> 
> Isn't this flag only ever used when streaming=parallel? But it does
> not seem to be checking that anywhere before potentially executing all
> this code when maybe it will never be used.

Yes, currently this field is only checked by the apply background worker.

> 4.9 src/include/catalog/pg_subscription.h
> 
> @@ -76,6 +76,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId)
> BKI_SHARED_RELATION BKI_ROW
>   bool subdisableonerr; /* True if a worker error should cause the
>   * subscription to be disabled */
> 
> + bool subretry; /* True if the previous apply change failed. */
> 
> I was wondering if you can give this column a DEFAULT value of false,
> because then perhaps most of the patch code from worker.c may be able
> to be eliminated.

Please refer to the reply to #4.6.

The rest of the comments are improved as suggested.
The new patches were attached in [1].

[1] - 
https://www.postgresql.org/message-id/OS3PR01MB62755C6C9A75EB09F7218B589E839%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei


RE: Perform streaming logical transactions by background workers and parallel apply

2022-07-06 Thread wangw.f...@fujitsu.com
On Mon, Jul 4, 2022 at 12:12 AM Peter Smith  wrote:
> Below are some review comments for patch v14-0003:

Thanks for your comments.

> 3.1 Commit message
> 
> If any of the following checks are violated, an error will be reported.
> 1. The unique columns between publisher and subscriber are difference.
> 2. There is any non-immutable function present in expression in
> subscriber's relation. Check from the following 4 items:
>a. The function in triggers;
>b. Column default value expressions and domain constraints;
>c. Constraint expressions.
>d. The foreign keys.
> 
> SUGGESTION (rewording to match the docs and the code).
> 
> Add some checks before using apply background worker to apply changes.
> streaming=parallel mode has two requirements:
> 1) The unique columns must be the same between publisher and subscriber
> 2) There cannot be any non-immutable functions in the subscriber-side
> replicated table. Look for functions in the following places:
> * a. Trigger functions
> * b. Column default value expressions and domain constraints
> * c. Constraint expressions
> * d. Foreign keys
> 
> ==
> 
> 3.2 doc/src/sgml/ref/create_subscription.sgml
> 
> +  To run in this mode, there are following two requirements. The 
> first
> +  is that the unique column should be the same between publisher and
> +  subscriber; the second is that there should not be any 
> non-immutable
> +  function in subscriber-side replicated table.
> 
> SUGGESTION
> Parallel mode has two requirements: 1) the unique columns must be the
> same between publisher and subscriber; 2) there cannot be any
> non-immutable functions in the subscriber-side replicated table.

I did not write clearly enough before. So I made some slight modifications to
the first requirement you suggested. Like this:
```
1) The unique column in the relation on the subscriber-side should also be the
unique column on the publisher-side;
```

> 3.9 src/backend/replication/logical/proto.c - logicalrep_write_attrs
> 
> This big slab of new code to get the BMS looks very similar to
> RelationGetIdentityKeyBitmap. So perhaps this code should be
> encapsulated in another function like that one (in relcache.c?) and
> then just called from logicalrep_write_attrs

I think the file relcache.c should contain cache-build operations, and the code
I added doesn't perform any such operation.  So I didn't change this.

> 3.12 src/backend/replication/logical/relation.c -
> logicalrep_rel_mark_safe_in_apply_bgworker
> 
> I did not really understand why you used an enum for one flag
> (volatility) but not the other one (sameunique); shouldn’t both of
> these be tri-values: unknown/yes/no?
> 
> For E.g. there is a quick exit from this function if the
> FUNCTION_UNKNOWN, but there is no equivalent quick exit for the
> sameunique? It seems inconsistent.

After rethinking patch 0003, I think we only need one flag. So I merged flags
'volatility' and 'sameunique' into a new flag 'parallel'. It is a tri-state
flag. And I also made some related modifications.

> 3.14 src/backend/replication/logical/relation.c -
> logicalrep_rel_mark_safe_in_apply_bgworker
> 
> There are lots of places setting FUNCTION_NONIMMUTABLE, so I think
> this code might be tidier if you just have a single return at the end
> of this function and 'goto' it.
> 
> e.g.
> if (...)
> goto function_not_immutable;
> 
> ...
> 
> return;
> 
> function_not_immutable:
> entry->volatility = FUNCTION_NONIMMUTABLE;

Personally, I prefer not to use `goto` when it is not necessary, because
it forcibly changes the flow of code execution.

> 3.17 src/backend/utils/cache/typcache.c
> 
> +/*
> + * GetDomainConstraints --- get DomainConstraintState list of
> specified domain type
> + */
> +List *
> +GetDomainConstraints(Oid type_id)
> 
> This is an unusual-looking function header comment, with the function
> name and the "---".

Not sure about this. Please refer to function lookup_rowtype_tupdesc_internal.

> 3.19
> 
> @@ -31,6 +42,11 @@ typedef struct LogicalRepRelMapEntry
>   Relation localrel; /* relcache entry (NULL when closed) */
>   AttrMap*attrmap; /* map of local attributes to remote ones */
>   bool updatable; /* Can apply updates/deletes? */
> + bool sameunique; /* Are all unique columns of the local
> +relation contained by the unique columns in
> +remote? */
> 
> (This is similar to review comment 3.12)
> 
> I felt it was inconsistent for this to be a boolean but for the
> 'volatility' member to be an enum. AFAIK these 2 flags are similar
> kinds – e.g. essentially tri-state flags unknown/true/false so I
> thought they should be treated the same.  E.g. both enums?

Please refer to the reply to #3.12.

> 3.22 .../subscription/t/032_streaming_apply.pl
> 
> 3.22.a
> +# Setup structure on publisher
> 
> "structure"?
> 
> 3.22.b
> +# Setup structure on subscriber
> 
> "structure"?

This just follows the other subscription tests.

> 3.23
> 
> +# Check that a 

RE: Perform streaming logical transactions by background workers and parallel apply

2022-07-06 Thread wangw.f...@fujitsu.com
On Fri, Jul 1, 2022 at 17:44 PM Amit Kapila  wrote:
>
Thanks for your comments.

> On Fri, Jul 1, 2022 at 12:13 PM Peter Smith  wrote:
> >
> > ==
> >
> > 1.2 doc/src/sgml/protocol.sgml - Protocol constants
> >
> > Previously I wrote that since there are protocol changes here,
> > shouldn’t there also be some corresponding LOGICALREP_PROTO_XXX
> > constants and special checking added in the worker.c?
> >
> > But you said [1 comment #6] you think it is OK because...
> >
> > IMO, I still disagree with the reply. The fact is that the protocol
> > *has* been changed, so IIUC that is precisely the reason for having
> > those protocol constants.
> >
> > e.g I am guessing you might assign the new one somewhere here:
> > --
> > server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
> > options.proto.logical.proto_version =
> > server_version >= 15 ?
> LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
> > server_version >= 14 ?
> LOGICALREP_PROTO_STREAM_VERSION_NUM :
> > LOGICALREP_PROTO_VERSION_NUM;
> > --
> >
> > And then later you would refer to this new protocol version (instead
> > of the server version) when calling to the apply_handle_stream_abort
> > function.
> >
> > ==
> >
> 
> One point related to this that occurred to me is how it will behave if
> the publisher is of version >=16 whereas the subscriber is of version
> <=15? Won't the publisher in that case send the new fields that the
> subscriber won't be reading, which may cause some problems?

Makes sense. Fixed this point.
As Peter-san suggested, I added a new protocol macro
LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM.
This new macro marks the version that supports the apply background worker
(meaning we will read abort_lsn and abort_time), and the publisher sends the
abort_lsn and abort_time fields only if the subscriber will read them (see
function logicalrep_write_stream_abort).
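
For illustration, the conditional write presumably ends up looking
roughly like this (a sketch based on the description above, not the
actual patch; write_abort_info is an assumed parameter name, while the
pq_send* routines and LOGICAL_REP_MSG_STREAM_ABORT already exist):

void
logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
                              TransactionId subxid, XLogRecPtr abort_lsn,
                              TimestampTz abort_time, bool write_abort_info)
{
    pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_ABORT);

    pq_sendint32(out, xid);
    pq_sendint32(out, subxid);

    /* Send the abort details only to subscribers that will read them. */
    if (write_abort_info)
    {
        pq_sendint64(out, abort_lsn);
        pq_sendint64(out, abort_time);
    }
}

Here write_abort_info would be derived from whether the negotiated
proto_version is at least LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM.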

The new patches were attached in [1].

[1] - 
https://www.postgresql.org/message-id/OS3PR01MB62755C6C9A75EB09F7218B589E839%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei


RE: Perform streaming logical transactions by background workers and parallel apply

2022-07-06 Thread shiy.f...@fujitsu.com
On Tue, Jun 28, 2022 11:22 AM Wang, Wei/王 威  wrote:
> 
> I also improved patches as suggested by Peter-san in [1] and [2].
> Thanks for Shi Yu to improve the patches by addressing the comments in [2].
> 
> Attach the new patches.
> 

Thanks for updating the patch.

Here are some comments.

0001 patch
==
1.
+   /* Check If there are free worker slot(s) */
+   LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

I think "Check If" should be "Check if".

0003 patch
==
1.
Should we call apply_bgworker_relation_check() in apply_handle_truncate()?

0004 patch
==
1.
@@ -3932,6 +3958,9 @@ start_apply(XLogRecPtr origin_startpos)
}
PG_CATCH();
{
+   /* Set the flag that we will retry later. */
+   set_subscription_retry(true);
+
if (MySubscription->disableonerr)
DisableSubscriptionAndExit();
else

I think we need to emit the error and recover from the error state before
setting the retry flag, like what we do in DisableSubscriptionAndExit().
Otherwise if an error is detected when setting the retry flag, we won't get the
error message reported by the apply worker.
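
For concreteness, the reordering suggested here might look like this (a
sketch only, modeled on what DisableSubscriptionAndExit() does
internally and ignoring the disableonerr branch; set_subscription_retry()
is the function from the patch):

PG_CATCH();
{
    /*
     * Emit the error and get out of the aborted transaction first, so
     * that a failure while updating the catalog below cannot swallow
     * the apply worker's real error message.
     */
    EmitErrorReport();
    AbortOutOfAnyTransaction();
    FlushErrorState();

    /* Now it is safe to record that the next apply will be a retry. */
    set_subscription_retry(true);

    proc_exit(0);
}
PG_END_TRY();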

Regards,
Shi yu


Re: tuplesort Generation memory contexts don't play nicely with index builds

2022-07-06 Thread Tom Lane
David Rowley  writes:
> On Thu, 7 Jul 2022 at 13:41, John Naylor  wrote:
>> Hmm, the commit appeared on git.postgresql.org, but apparently not in
>> my email nor the list archives.

> Strange. I'd suspect a temporary hiccup in whatever code pushes the
> commits onto the mailing list, but I see that my fe3caa143 from
> yesterday was also missed.

Caught in list moderation maybe?

regards, tom lane




Re: Fast COPY FROM based on batch insert

2022-07-06 Thread Ian Barwick

On Thu, Mar 24, 2022 at 15:44 Andrey V. Lepikhov :
>
> On 3/22/22 06:54, Etsuro Fujita wrote:
> > On Fri, Jun 4, 2021 at 5:26 PM Andrey Lepikhov
> >  wrote:
> >> We still have a slow 'COPY FROM' operation for foreign tables on current
> >> master.
> >> Now that we have a foreign batch insert operation, I have tried to rewrite
> >> the patch [1] with this machinery.
> >
> > The patch has been rewritten to something essentially different, but
> > no one reviewed it.  (Tsunakawa-san gave some comments without looking
> > at it, though.)  So the right status of the patch is “Needs review”,
> > rather than “Ready for Committer”?  Anyway, here are a few review
> > comments from me:
> >
> > * I don’t think this assumption is correct:
> >
> > @@ -359,6 +386,12 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
> >   (resultRelInfo->ri_TrigDesc->trig_insert_after_row ||
> >resultRelInfo->ri_TrigDesc->trig_insert_new_table))
> >  {
> > +   /*
> > +* AFTER ROW triggers aren't allowed with the foreign bulk insert
> > +* method.
> > +*/
> > +   Assert(resultRelInfo->ri_RelationDesc->rd_rel->relkind != RELKIND_FOREIGN_TABLE);
> > +
> >
> > In postgres_fdw we disable foreign batch insert when the target table
> > has AFTER ROW triggers, but the core allows it even in that case.  No?
> Agree
>
> > * To allow foreign multi insert, the patch made an invasive change to
> > the existing logic to determine whether to use multi insert for the
> > target relation, adding a new member ri_usesMultiInsert to the
> > ResultRelInfo struct, as well as introducing a new function
> > ExecMultiInsertAllowed().  But I’m not sure we really need such a
> > change.  Isn’t it reasonable to *adjust* the existing logic to allow
> > foreign multi insert when possible?
> Of course, such an approach would look much better, if we implemented it.
> I'll ponder how to do it.
>
> > I didn’t finish my review, but I’ll mark this as “Waiting on Author”.
> I rebased the patch onto current master. Now it works correctly. I'll
> mark it as "Waiting for review".

I took a look at this patch as it would be a useful optimization to have.

It applies cleanly to current HEAD, but as-is, with a large data set, it
reproducibly fails like this (using postgres_fdw):

postgres=# COPY foo FROM '/tmp/fast-copy-from/test.csv' WITH (format csv);
ERROR:  bind message supplies 0 parameters, but prepared statement "pgsql_fdw_prep_19422" requires 6
CONTEXT:  remote SQL command: INSERT INTO public.foo_part_1(t, v1, v2, v3, v4, v5) VALUES ($1, $2, $3, $4, $5, $6)
COPY foo, line 17281589

This occurs because not all multi-insert buffers being flushed actually contain
tuples; the fix is simply not to call ExecForeignBatchInsert() if that's the
case, e.g.:


    /* Flush into foreign table or partition */
    do {
        int size = (resultRelInfo->ri_BatchSize < nused - sent) ?
                   resultRelInfo->ri_BatchSize : (nused - sent);

        if (size)
        {
            int inserted = size;

            resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert(estate,
                                                                 resultRelInfo,
                                                                 &slots[sent],
                                                                 NULL,
                                                                 &inserted);
            sent += size;
        }
    } while (sent < nused);


There might be a case for arguing that the respective FDW should check that it has
actually received some tuples to insert, but IMHO it's much preferable to catch
this as early as possible and avoid a superfluous call.

FWIW, with the above fix in place, with a simple local test the patch produces a
consistent speed-up of about 8 times compared to the existing functionality.


Regards

Ian Barwick

--

EnterpriseDB - https://www.enterprisedb.com




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-06 Thread Masahiko Sawada
On Wed, Jul 6, 2022 at 5:55 PM Amit Kapila  wrote:
>
> On Wed, Jul 6, 2022 at 12:19 PM Masahiko Sawada  wrote:
> >
> > On Tue, Jul 5, 2022 at 8:00 PM Amit Kapila  wrote:
> > >
> > > 2. Are we ever removing transaction ids from the catchanges->xip array?
> >
> > No.
> >
> > > If not, is there a reason for the same? I think we can remove it
> > > either at commit/abort or even immediately after adding the xid/subxid
> > > to committed->xip array.
> >
> > It might be a good idea but I'm concerned that removing XID from the
> > array at every commit/abort or after adding it to committed->xip array
> > might be costly as it requires adjustment of the array to keep its
> > order. Removing XIDs from the array would make bsearch faster but the
> > array is updated reasonably often (every 15 sec).
> >
>
> Fair point. However, I am slightly worried that we are unnecessarily
> searching in this new array even when ReorderBufferTxn has the
> required information. To avoid that, in function
> SnapBuildXidHasCatalogChange(), we can first check
> ReorderBufferXidHasCatalogChanges() and then check the array if the
> first check doesn't return true. Also, by the way, do we need to
> always keep builder->catchanges.xip updated via SnapBuildRestore()?
> Isn't it sufficient that we just read and throw away contents from a
> snapshot if builder->catchanges.xip is non-NULL?

IIUC catchanges.xip is restored only once when restoring a consistent
snapshot via SnapBuildRestore(). I think it's necessary to set
catchanges.xip for later use in SnapBuildXidHasCatalogChange(). Or did
you mean via SnapBuildSerialize()?

>
> I had additionally thought about whether we can further optimize this
> solution to just store this additional information when we need to
> serialize for the checkpoint record, but I think that won't work because
> the walsender can restart even without a restart of the server, in which
> case the same problem can occur.

Yes, probably we need to write catalog modifying transactions for
every serialized snapshot.

> I am not sure if there is a way to further optimize this
> solution; let me know if you have any ideas.

I suppose that writing additional information to serialized snapshots
would not be a noticeable overhead since we need 4 bytes per
transaction and we would not expect there is a huge number of
concurrent catalog modifying transactions. But both collecting catalog
modifying transactions (especially when there are many ongoing
transactions) and bsearch'ing on the XID list every time we decode the
COMMIT record could bring overhead.

A solution for the first point would be to keep track of catalog
modifying transactions by using a linked list so that we can avoid
checking all ongoing transactions.

Regarding the second point, on reflection, I think we need to look up
the XID list until all XIDs in the list are committed/aborted. We can
remove XIDs from the list after adding them to committed.xip as you
suggested. Or when decoding a RUNNING_XACTS record, we can remove XIDs
older than builder->xmin from the list like we do for committed.xip in
SnapBuildPurgeCommittedTxn().
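
To make the check ordering Amit suggested above concrete, it might be
sketched like this (the catchanges field names and both function names
follow the patch under discussion, so treat the details as assumptions;
xidComparator is the existing comparator for sorted XID arrays):

static bool
SnapBuildXidHasCatalogChange(SnapBuild *builder, ReorderBuffer *rb,
                             TransactionId xid)
{
    /* Prefer what the ReorderBuffer already knows about this xid. */
    if (ReorderBufferXidHasCatalogChanges(rb, xid))
        return true;

    /* Fall back to the XIDs restored from the serialized snapshot. */
    if (builder->catchanges.xip == NULL)
        return false;

    return bsearch(&xid, builder->catchanges.xip, builder->catchanges.xcnt,
                   sizeof(TransactionId), xidComparator) != NULL;
}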

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: tuplesort Generation memory contexts don't play nicely with index builds

2022-07-06 Thread David Rowley
On Thu, 7 Jul 2022 at 13:41, John Naylor  wrote:
>
> On Thu, Jul 7, 2022 at 3:16 AM David Rowley  wrote:
> >
> > Pushed.
>
> Hmm, the commit appeared on git.postgresql.org, but apparently not in
> my email nor the list archives.

Strange. I'd suspect a temporary hiccup in whatever code pushes the
commits onto the mailing list, but I see that my fe3caa143 from
yesterday was also missed.

The only difference in my workflow is that I'm sshing to the machine I
push from via another room rather than sitting right in front of it
like I normally am. I struggle to imagine why that would cause this to
happen.

David




Re: Issue with pg_stat_subscription_stats

2022-07-06 Thread Masahiko Sawada
On Thu, Jul 7, 2022 at 1:28 AM Andres Freund  wrote:
>
> On 2022-07-05 14:52:45 -0700, Andres Freund wrote:
> > On 2022-07-04 11:01:01 +0900, Masahiko Sawada wrote:
> > > I've attached the patch, fix_drop_subscriptions_stats.patch, to fix it.
> >
> > LGTM. Unless somebody sees a reason not to, I'm planning to commit that to 
> > 15
> > and HEAD.
>
> Pushed.

Thanks!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Issue with pg_stat_subscription_stats

2022-07-06 Thread Masahiko Sawada
On Thu, Jul 7, 2022 at 12:53 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-07-06 11:41:46 +0900, Masahiko Sawada wrote:
> > diff --git a/src/test/regress/sql/subscription.sql 
> > b/src/test/regress/sql/subscription.sql
> > index 74c38ead5d..6a46956f6e 100644
> > --- a/src/test/regress/sql/subscription.sql
> > +++ b/src/test/regress/sql/subscription.sql
> > @@ -30,6 +30,12 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 
> > 'dbname=regress_doesnotexist' PUB
> >  COMMENT ON SUBSCRIPTION regress_testsub IS 'test subscription';
> >  SELECT obj_description(s.oid, 'pg_subscription') FROM pg_subscription s;
> >
> > +-- Check if the subscription stats are created and stats_reset is updated
> > +-- by pg_stat_reset_subscription_stats().
> > +SELECT subname, stats_reset IS NULL stats_reset_is_null FROM 
> > pg_stat_subscription_stats ORDER BY 1;
>
> Why use ORDER BY 1 instead of just getting the stats for the subscription we
> want to test? Seems a bit more robust to show only that one, so we don't get
> unnecessary changes if the test needs to create another subscription or such.

Right, it's more robust. I've updated the patch accordingly.

>
>
> > +SELECT pg_stat_reset_subscription_stats(oid) FROM pg_subscription;
> > +SELECT subname, stats_reset IS NULL stats_reset_is_null FROM 
> > pg_stat_subscription_stats ORDER BY 1;
> > +
>
> Perhaps worth resetting again and checking that the timestamp is bigger than
> the previous timestamp? You can do that with \gset etc.

Agreed.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v3-0001-Create-subscription-stats-entry-at-CREATE-SUBSCRI.patch
Description: Binary data


Re: tuplesort Generation memory contexts don't play nicely with index builds

2022-07-06 Thread John Naylor
On Thu, Jul 7, 2022 at 3:16 AM David Rowley  wrote:
>
> Pushed.

Hmm, the commit appeared on git.postgresql.org, but apparently not in
my email nor the list archives.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Fix unnecessary includes and comments in 019_replslot_limit.pl, 007_wal.pl and 004_timeline_switch.pl

2022-07-06 Thread Michael Paquier
On Wed, Jul 06, 2022 at 04:26:38PM +0300, Maxim Orlov wrote:
> Yeah, it looks even better now.

Especially knowing that the test uses a segment size of 1MB via initdb
to be cheaper.  v2 looks fine by itself, so applied.
--
Michael




Re: pg_parameter_aclcheck() and trusted extensions

2022-07-06 Thread Michael Paquier
On Wed, Jul 06, 2022 at 03:47:27PM -0700, Nathan Bossart wrote:
> I think the call to superuser_arg() in pg_parameter_aclmask() is causing
> set_config_option() to bypass the normal privilege checks, as
> execute_extension_script() will have set the user ID to the bootstrap
> superuser for trusted extensions like plperl.  I don't have a patch or a
> proposal at the moment, but I thought it was worth starting the discussion.

Looks like a bug to me, so I have added an open item assigned to Tom.
--
Michael




Re: defGetBoolean - Fix comment

2022-07-06 Thread Michael Paquier
On Thu, Jul 07, 2022 at 09:53:01AM +1000, Peter Smith wrote:
> Really this code is for the case when there *was* a parameter given
> (e.g. "copy_data" in my example above) but when there is no parameter
> *value* given.
> 
> Suggested comment fix:
> BEFORE
> If no parameter given, assume "true" is meant.
> AFTER
> If no parameter value given, assume "true" is meant.

Still, I think that your adjustment is right, as the check is, indeed,
on the DefElem's value*.  Or you could just say "If no value given".
--
Michael




Re: Bump MIN_WINNT to 0x0600 (Vista) as minimal runtime in 16~

2022-07-06 Thread Michael Paquier
On Wed, Jul 06, 2022 at 05:13:27PM -0400, Andrew Dunstan wrote:
> On 2022-07-06 We 16:46, Thomas Munro wrote:
>> The build farm also has frogmouth and currawong, 32 bit systems
>> running Windows XP, but they are only testing REL_10_STABLE so I
>> assume Andrew will decommission them in November.
> 
> Yeah, it's not capable of supporting anything newer, so it will finally
> go to sleep this year.

Okay, thanks for confirming.  I think that I'll give it a try today
then, my schedule would fit nicely with the buildfarm monitoring.
--
Michael




Re: Backup command and functions can cause assertion failure and segmentation fault

2022-07-06 Thread Michael Paquier
On Wed, Jul 06, 2022 at 11:27:58PM +0900, Fujii Masao wrote:
> For the test, BASE_BACKUP needs to be canceled after it finishes
> do_pg_backup_start(), i.e., checkpointing, and before it calls
> do_pg_backup_stop(). So the timing window for the cancel seems tighter
> than in the test added in 0475a97f. I'm afraid that the test could
> easily cancel the BASE_BACKUP while it's still performing a checkpoint in
> do_pg_backup_start(). So for now I'm inclined to avoid such an
> unstable test.

Hmm.  In order to make sure that the checkpoint of the base backup is
completed, and assuming that the checkpoint is fast while the base
backup has a max rate, you could rely on a query that does a
poll_query_until() on pg_control_checkpoint(), no?  As long as you use
IPC::Run::start, pg_basebackup would be async so the polling query and
the cancellation can be done in parallel of it.  0475a97 did almost
that, except that it waits for the WAL sender to be started.
--
Michael




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-07-06 Thread Andres Freund
Hi,

On 2022-06-13 19:08:35 +0530, Nitin Jadhav wrote:
> > Have you measured the performance effects of this? On fast storage with 
> > large
> > shared_buffers I've seen these loops in profiles. It's probably fine, but 
> > it'd
> > be good to verify that.
> 
> To understand the performance effects of the above, I have taken the
> average of five checkpoints with the patch and without the patch in my
> environment. Here are the results.
> With patch: 269.65 s
> Without patch: 269.60 s

Those look like timed checkpoints - if the checkpoints are sleeping a
part of the time, you're not going to see any potential overhead.

To see whether this has an effect you'd have to make sure there's a
certain number of dirty buffers (e.g. by doing CREATE TABLE AS
some_query) and then do a manual checkpoint and time how long that
takes.

Greetings,

Andres Freund




Re: Make mesage at end-of-recovery less scary.

2022-07-06 Thread Michael Paquier
On Wed, Jul 06, 2022 at 11:05:51AM -0700, Jacob Champion wrote:
> [CFM hat] Looking through the history here, this has been bumped to
> Ready for Committer a few times and then bumped back to Needs Review
> after a required rebase. What's the best way for us to provide support
> for contributors who get stuck in this loop? Maybe we can be more
> aggressive about automated notifications when a RfC patch goes red in
> the cfbot?

Having a better integration between the CF bot and the CF app would be
great, IMO.  People tend to easily forget about what they send in my
experience, even if they manage a small pool of patches or a larger
one.
--
Michael




defGetBoolean - Fix comment

2022-07-06 Thread Peter Smith
Hi.

IMO the comment ("If no parameter given, ...") is a bit misleading:



(gdb) list
108 defGetBoolean(DefElem *def)
109 {
110 /*
111 * If no parameter given, assume "true" is meant.
112 */
113 if (def->arg == NULL)
114 return true;
115
116 /*
117 * Allow 0, 1, "true", "false", "on", "off"
(gdb) p *def
$9 = {type = T_DefElem, defnamespace = 0x0, defname = 0x1c177a8
"copy_data", arg = 0x0,
  defaction = DEFELEM_UNSPEC, location = 93}
(gdb)




Really this code is for the case when there *was* a parameter given
(e.g. "copy_data" in my example above) but when there is no parameter
*value* given.

Suggested comment fix:
BEFORE
If no parameter given, assume "true" is meant.
AFTER
If no parameter value given, assume "true" is meant.
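
i.e., after the suggested fix the check would read:

    /*
     * If no parameter value given, assume "true" is meant.
     */
    if (def->arg == NULL)
        return true;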

~~

Although it seems a trivial point, the motivation is that the above
code has been adapted (cut/paste) by multiple other WIP patches [1][2]
that I'm aware of, and so this original (misleading) comment is now
spreading to other places.

PSA patch to tweak both the original comment and another place it
already spread to.

--
[1] 
https://www.postgresql.org/message-id/flat/CALDaNm0gwjY_4HFxvvty01BOT01q_fJLKQ3pWP9%3D9orqubhjcQ%40mail.gmail.com
+static char
+defGetStreamingMode(DefElem *def)

[2] 
https://www.postgresql.org/message-id/flat/CAHut%2BPs%2B4iLzJGkPFEatv%3D%2Baa6NUB38-WT050RFKeJqhdcLaGA%40mail.gmail.com#6d43277cbb074071b8e9602ff8be7e41
+static CopyData
+DefGetCopyData(DefElem *def)

Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Fix-comment-in-defGetBoolean.patch
Description: Binary data


Re:

2022-07-06 Thread Peter Smith
On Thu, Jun 16, 2022 at 10:59 AM Peter Smith  wrote:
>
> On Thu, Jun 9, 2022 at 11:50 PM Tom Lane  wrote:
> >
> > Peter Eisentraut  writes:
> > > Initially, that chapter did not document any system views.
> >
> > Maybe we could make the system views a separate chapter?
>
> +1

There has not been any activity on this thread for a while, so I am
just wondering what I should do next about it:

Are there any other opinions about this?

If there is no interest whatsoever in splitting the existing "System
Catalogs" into 2 chapters ("System Catalogs" and "System Views") then
I will abandon the idea.

But if others also feel it might be better to split them, I can put
patching this on my TODO list and share it sometime later.

TIA.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Fwd: Add red-black tree missing comparison searches

2022-07-06 Thread Steve Chavez
---------- Forwarded message ---------
From: Steve Chavez 
Date: Wed, 6 Jul 2022 at 18:14
Subject: Re: Add red-black tree missing comparison searches
To: Alexander Korotkov 


Thanks Alexander!

Regarding the new patch: I think the following comment is misleading since
keyDeleted can be true or false:

+ /* switch equal_match to false so we only find greater matches now */
+ node = (IntRBTreeNode *) rbt_find_great(tree, (RBTNode *) ,
+ keyDeleted);

Maybe it should be the same as the one used for searching lesser keys:

+ /*
+ * Find the next key.  If the current key is deleted, we can pass
+ * equal_match == true and still find the next one.
+ */

On Wed, 6 Jul 2022 at 13:53, Alexander Korotkov 
wrote:

> Hi, Steve!
>
> On Sat, Jul 2, 2022 at 10:38 PM Steve Chavez  wrote:
> > > But I think we can support strict inequalities too (e.g.
> > less and greater without equals).  Could you please make it a bool
> > argument equal_matches?
> >
> > Sure, an argument is a good idea to keep the code shorter.
> >
> > > Could you please extract this change as a separate patch.
> >
> > Done!
>
> Thank you!
>
> I did some improvements to the test suite, run pgindent and wrote
> commit messages.
>
> I think this is quite a straightforward and low-risk patch.  I'm going
> to push it if no objections.
>
> --
> Regards,
> Alexander Korotkov
>


pg_parameter_aclcheck() and trusted extensions

2022-07-06 Thread Nathan Bossart
Hi hackers,

I found that as of a0ffa88, it's possible to set a PGC_SUSET GUC defined by
a trusted extension as a non-superuser.  I've confirmed that this only
affects v15 and later versions.

postgres=# CREATE ROLE testuser;
CREATE ROLE
postgres=# GRANT CREATE ON DATABASE postgres TO testuser;
GRANT
postgres=# SET ROLE testuser;
SET
postgres=> SET plperl.on_plperl_init = 'test';
SET
postgres=> CREATE EXTENSION plperl;
CREATE EXTENSION
postgres=> SELECT setting FROM pg_settings WHERE name = 
'plperl.on_plperl_init';
 setting 
-
 test
(1 row)

On previous versions, the CREATE EXTENSION command emits the following
WARNING, and the setting does not take effect:

WARNING:  permission denied to set parameter "plperl.on_plperl_init"

I think the call to superuser_arg() in pg_parameter_aclmask() is causing
set_config_option() to bypass the normal privilege checks, as
execute_extension_script() will have set the user ID to the bootstrap
superuser for trusted extensions like plperl.  I don't have a patch or a
proposal at the moment, but I thought it was worth starting the discussion.
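
For reference, the suspected interaction can be sketched like this (a
paraphrase of the shape of the check, not the actual a0ffa88 code):

AclMode
pg_parameter_aclmask(const char *name, Oid roleid, AclMode mask,
                     AclMaskHow how)
{
    /*
     * Superusers bypass all permission checking.  During CREATE EXTENSION
     * of a trusted extension, execute_extension_script() has switched the
     * current user ID to the bootstrap superuser, so this test succeeds
     * and the PGC_SUSET restriction in set_config_option() never applies.
     */
    if (superuser_arg(roleid))
        return mask;

    /* ... the normal lookup in pg_parameter_acl would follow ... */
    return ACL_NO_RIGHTS;
}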

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: tuplesort Generation memory contexts don't play nicely with index builds

2022-07-06 Thread Robert Haas
On Tue, Jul 5, 2022 at 9:34 PM David Rowley  wrote:
> I voiced my dislike for the patch I came up with to fix this issue to
> Andres.  He suggested that I just add a version of index_form_tuple
> that can be given a MemoryContext pointer to allocate the returned
> tuple into.
>
> I like that idea much better, so I've attached a patch to fix it that way.
>
> If there are no objections, I plan to push this in the next 24 hours.

Apologies for not having looked at this thread sooner, but for what
it's worth, I think this is a fine solution.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: automatically generating node support functions

2022-07-06 Thread Tom Lane
I wrote:
> I have gone through this and made some proposed changes (attached),
> and I think it is almost committable.

I see from the cfbot that it now needs to be taught about RelFileNumber...

regards, tom lane




Re: making relfilenodes 56 bits

2022-07-06 Thread Robert Haas
On Wed, Jul 6, 2022 at 11:57 AM Robert Haas  wrote:
> I think 0002 and 0003 need more work yet; I'll try to write a review
> of those soon.

Regarding 0002:

I don't particularly like the names BufTagCopyRelFileLocator and
BufTagRelFileLocatorEquals. My suggestion is to rename
BufTagRelFileLocatorEquals to BufTagMatchesRelFileLocator, because it
doesn't really make sense to me to talk about equality between values
of different data types. Instead of BufTagCopyRelFileLocator I would
prefer BufTagGetRelFileLocator. That would make it more similar to
BufTagGetFileNumber and BufTagSetFileNumber, which I think would be a
good thing.

Other than that I think 0002 seems fine.

Regarding 0003:

    /*
     * Don't try to prefetch anything in this database until
-    * it has been created, or we might confuse the blocks of
-    * different generations, if a database OID or
-    * relfilenumber is reused.  It's also more efficient than
+    * it has been created, because it's more efficient than
     * discovering that relations don't exist on disk yet with
     * ENOENT errors.
     */

I'm worried that this might not be correct. The comment changes here
(and I think also in some other places) imply that we've eliminated
relfilenode reuse, but I think that's not true. createdb() and movedb()
don't seem to be modified, so I think it's possible to just copy a
template database over without change, which means that relfilenumbers
and even relfilelocators could be reused. So I feel like maybe this
and similar places shouldn't be modified in this way. Am I
misunderstanding?

    /*
-    * Relfilenumbers are not unique in databases across tablespaces, so we
-    * need to allocate a new one in the new tablespace.
+    * Generate a new relfilenumber. Although relfilenumber are unique within
+    * a cluster, we are unable to use the old relfilenumber since unused
+    * relfilenumber are not unlinked until commit.  So if within a
+    * transaction, if we set the old tablespace again, we will get
+    * conflicting relfilenumber file.
     */
-   newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
-                                          rel->rd_rel->relpersistence);
+   newrelfilenumber = GetNewRelFileNumber();

I can't clearly understand this comment. Is it saying that the code
which follows is broken and needs to be fixed by a future patch before
things are OK again? If so, that's not good.

- * callers should be GetNewOidWithIndex() and GetNewRelFileNumber() in
- * catalog/catalog.c.
+ * callers should be GetNewOidWithIndex() in catalog/catalog.c.

If there is only one, it should say "caller", not "callers".

 Orphan files are harmless --- at worst they waste a bit of disk space ---
-because we check for on-disk collisions when allocating new relfilenumber
-OIDs.  So cleaning up isn't really necessary.
+because relfilenumber is 56 bit wide so logically there should not be any
+collisions.  So cleaning up isn't really necessary.

I don't agree that orphaned files are harmless, but changing that is
beyond the scope of this patch. I think that the way you've ended the
sentence isn't sufficiently clear and correct even if we accept the
principle that orphaned files are harmless. What I think we should
say instead is "because the relfilenode counter is monotonically
increasing. The maximum value is 2^56-1, and there is no provision for
wraparound."

+   /*
+    * Check if we set the new relfilenumber then do we run out of the
+    * logged relnumber, if so then we need to WAL log again.  Otherwise,
+    * just adjust the relnumbercount.
+    */
+   relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber;
+   if (ShmemVariableCache->relnumbercount <= relnumbercount)
+   {
+   LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH);
+   ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
+   }
+   else
+   ShmemVariableCache->relnumbercount -= relnumbercount;

Would it be clearer, here and elsewhere, if VariableCacheData tracked
nextRelFileNumber and nextUnloggedRelFileNumber instead of
nextRelFileNumber and relnumbercount? I'm not 100% sure, but the idea
seems worth considering.
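
(To sketch that alternative: presumably the second field would record
the relfilenumber horizon already covered by WAL, making the allocation
read roughly as follows.  The lock and the loggedRelFileNumber field are
guessed names; LogNextRelFileNumber() and VAR_RELNUMBER_PREFETCH are
from the patch.)

RelFileNumber
GetNewRelFileNumber(void)
{
    RelFileNumber result;

    LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);  /* name assumed */

    result = ShmemVariableCache->nextRelFileNumber;

    /* If the WAL-logged horizon is exhausted, log another batch first. */
    if (result >= ShmemVariableCache->loggedRelFileNumber)
    {
        LogNextRelFileNumber(result + VAR_RELNUMBER_PREFETCH);
        ShmemVariableCache->loggedRelFileNumber = result + VAR_RELNUMBER_PREFETCH;
    }

    ShmemVariableCache->nextRelFileNumber = result + 1;
    LWLockRelease(RelFileNumberGenLock);

    return result;
}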

+    * Flush xlog record to disk before returning.  To protect against file
+    * system changes reaching the disk before the XLOG_NEXT_RELFILENUMBER
+    * log.

The way this is worded, you would need it to be just one sentence,
like "Flush xlog record to disk before returning to protect
against...". Or else add "this is," like "This is to protect
against..."

But I'm thinking maybe we could reword it a 

Re: Bump MIN_WINNT to 0x0600 (Vista) as minimal runtime in 16~

2022-07-06 Thread Andrew Dunstan


On 2022-07-06 We 16:46, Thomas Munro wrote:
> On Wed, Jul 6, 2022 at 7:28 PM Michael Paquier  wrote:
>> On Thu, Jun 16, 2022 at 03:14:16PM +0900, Michael Paquier wrote:
>>> Actually, this can go with the bump of MIN_WINNT as it uses one of the
>>> IsWindows*OrGreater() macros as a runtime check.  And there are two
>>> more places in pg_ctl.c that can be similarly cleaned up.
>>>
>>> It is possible that I have missed some spots, of course.
>> It does not seem to be the case on a second look.  The buildfarm
>> animals running Windows are made of:
>> - hamerkop, Windows server 2016 (based on Win10 AFAIK)
>> - drongo, Windows server 2019
>> - bowerbird, Windows 10 pro
>> - jacana, Windows 10
>> - fairywren, Msys server 2019
>> - bichir, Ubuntu/Windows 10 mix
>>
>> Now that v16 is open for business, any objections to move on with this
>> patch and bump MIN_WINNT to 0x0A00 on HEAD?  There are follow-up items
>> for large pages and more.
> +1 for proceeding.  This will hopefully unblock a few things, and it's
> good to update our claims to match the reality of what we are actually
> testing and able to debug.
>
> The build farm also has frogmouth and currawong, 32 bit systems
> running Windows XP, but they are only testing REL_10_STABLE so I
> assume Andrew will decommission them in November.


Yeah, it's not capable of supporting anything newer, so  it will finally
go to sleep this year.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: EINTR in ftruncate()

2022-07-06 Thread Thomas Munro
On Thu, Jul 7, 2022 at 9:03 AM Andres Freund  wrote:
> On 2022-07-07 08:56:33 +1200, Thomas Munro wrote:
> > On Thu, Jul 7, 2022 at 8:39 AM Andres Freund  wrote:
> > > So I think we need: 1) block most signals, 2) a retry loop *without*
> > > interrupt checks.
> >
> > Yeah.  I was also wondering about wrapping the whole function in
> > PG_SETMASK(), PG_SETMASK(), but also leaving the
> > while (rc == EINTR) loop there (without the check for *Pending
> > variables), only because otherwise when you attach a debugger and
> > continue you'll get a spurious EINTR and it'll interfere with program
> > execution.  All blockable signals would be blocked *except* SIGQUIT,
> > which means that fast shutdown/crash will still work.  It seems nice
> > to leave that way to interrupt it without resorting to SIGKILL.
>
> Fast shutdown shouldn't use SIGQUIT - did you mean immediate? I think
> it's fine to allow immediate shutdowns, but I don't think we should
> allow fast shutdowns to interrupt it.

Err, yeah, that one.




Re: EINTR in ftruncate()

2022-07-06 Thread Andres Freund
Hi,

On 2022-07-07 08:56:33 +1200, Thomas Munro wrote:
> On Thu, Jul 7, 2022 at 8:39 AM Andres Freund  wrote:
> > So I think we need: 1) block most signals, 2) a retry loop *without*
> > interrupt checks.
> 
> Yeah.  I was also wondering about wrapping the whole function in
> PG_SETMASK(), PG_SETMASK(), but also leaving the
> while (rc == EINTR) loop there (without the check for *Pending
> variables), only because otherwise when you attach a debugger and
> continue you'll get a spurious EINTR and it'll interfere with program
> execution.  All blockable signals would be blocked *except* SIGQUIT,
> which means that fast shutdown/crash will still work.  It seems nice
> to leave that way to interrupt it without resorting to SIGKILL.

Fast shutdown shouldn't use SIGQUIT - did you mean immediate? I think
it's fine to allow immediate shutdowns, but I don't think we should
allow fast shutdowns to interrupt it.

Greetings,

Andres Freund




Re: EINTR in ftruncate()

2022-07-06 Thread Thomas Munro
On Thu, Jul 7, 2022 at 8:39 AM Andres Freund  wrote:
> On 2022-07-06 21:29:41 +0200, Alvaro Herrera wrote:
> > On 2022-Jul-05, Andres Freund wrote:
> >
> > > I think we'd be better off disabling at least some signals during
> > > dsm_impl_posix_resize(). I'm afraid we'll otherwise just find another
> > > variation of these problems. I haven't checked the source of ftruncate, 
> > > but
> > > what Thomas dug up for fallocate makes it pretty clear that our current
> > > approach of just retrying again and again isn't good enough. It's a bit 
> > > more
> > > obvious that it's a problem for fallocate, but I don't think it's worth 
> > > having
> > > different solutions for the two.
> >
> > So what if we move the retry loop one level up?  As in the attached.
> > Here, if we get EINTR then we retry both syscalls.
>
> Doesn't really seem to address the problem to me. posix_fallocate()
> takes some time (~1s for 3GB roughly), so if we signal at a higher rate,
> we'll just get stuck.
>
> I hacked a bit on a test program from Thomas, and it's pretty clear
> that with a 5ms timer interval you'll pretty much not make
> progress. It's much easier to get fallocate() to get interrupted than
> ftruncate(), but the latter gets interrupted e.g. when you do a strace
> in the "wrong" moment (afaics SIGSTOP/SIGCONT trigger EINTR in
> situations that are retried otherwise).
>
> So I think we need: 1) block most signals, 2) a retry loop *without*
> interrupt checks.

Yeah.  I was also wondering about wrapping the whole function in
PG_SETMASK(), PG_SETMASK(), but also leaving the
while (rc == EINTR) loop there (without the check for *Pending
variables), only because otherwise when you attach a debugger and
continue you'll get a spurious EINTR and it'll interfere with program
execution.  All blockable signals would be blocked *except* SIGQUIT,
which means that fast shutdown/crash will still work.  It seems nice
to leave that way to interrupt it without resorting to SIGKILL.




Re: Bump MIN_WINNT to 0x0600 (Vista) as minimal runtime in 16~

2022-07-06 Thread Thomas Munro
On Wed, Jul 6, 2022 at 7:28 PM Michael Paquier  wrote:
> On Thu, Jun 16, 2022 at 03:14:16PM +0900, Michael Paquier wrote:
> > Actually, this can go with the bump of MIN_WINNT as it uses one of the
> > IsWindows*OrGreater() macros as a runtime check.  And there are two
> > more places in pg_ctl.c that can be similarly cleaned up.
> >
> > It is possible that I have missed some spots, of course.
>
> It does not seem to be the case on a second look.  The buildfarm
> animals running Windows are made of:
> - hamerkop, Windows server 2016 (based on Win10 AFAIK)
> - drongo, Windows server 2019
> - bowerbird, Windows 10 pro
> - jacana, Windows 10
> - fairywren, Msys server 2019
> - bichir, Ubuntu/Windows 10 mix
>
> Now that v16 is open for business, any objections to move on with this
> patch and bump MIN_WINNT to 0x0A00 on HEAD?  There are follow-up items
> for large pages and more.

+1 for proceeding.  This will hopefully unblock a few things, and it's
good to update our claims to match the reality of what we are actually
testing and able to debug.

The build farm also has frogmouth and currawong, 32 bit systems
running Windows XP, but they are only testing REL_10_STABLE so I
assume Andrew will decommission them in November.




Re: automatically generating node support functions

2022-07-06 Thread Tom Lane
Peter Eisentraut  writes:
> [ v7-0001-Automatically-generate-node-support-functions.patch ]

I have gone through this and made some proposed changes (attached),
and I think it is almost committable.  There is one nasty problem
we need a solution to, which is that pgindent is not at all on board
with this idea of attaching node attrs to typedefs.  It pushes them
to the next line, like this:

@@ -691,7 +709,8 @@
 (rel)->reloptkind == RELOPT_OTHER_JOINREL || \
 (rel)->reloptkind == RELOPT_OTHER_UPPER_REL)
 
-typedef struct RelOptInfo pg_node_attr(no_copy_equal, no_read)
+typedef struct RelOptInfo
+pg_node_attr(no_copy_equal, no_read)
 {
NodeTag type;
 
which is already enough to break the simplistic parsing in
gen_node_support.pl.  Now, we could fix that parsing logic to deal
with this layout, but this also seems to change pgindent's opinion
of whether the subsequent braced material is part of a typedef or a
function.  That results in it injecting a lot of vertical space
that wasn't there before, which is annoying.

I experimented a bit and found that we could do it this way:

 typedef struct RelOptInfo
 {
+   pg_node_attr(no_copy_equal, no_read)
+
NodeTag type;

without (AFAICT) confusing pgindent, but I've not tried to adapt
the perl script or the code to that style.

Anyway, besides that, I have some comments that I've implemented
in the attached delta patch.

* After further thought I'm okay with your theory that attaching
special copy/equal rules to specific field types is appropriate.
We might at some point want the pg_node_attr(copy_as_scalar)
approach too, but we can always add that later.  However, I thought
some more comments about it were needed in the *nodes.h files,
so I added those.  (My general feeling about this is that if
anyone needs to look into gen_node_support.pl to understand how
the backend works, we've failed at documentation.)

* As written, the patch created equal() support for all Plan structs,
which is quite a bit of useless code bloat.  I solved this by
separating no_copy and no_equal properties, so that we could mark
Plan as no_equal while still having copy support.

* I did not like the semantics of copy_ignore one bit: it was
relying on the pre-zeroing behavior of makeNode() to be sane at
all, and I don't want that to be a requirement.  (I foresee
wanting to flat-copy node contents and turn COPY_SCALAR_FIELD
into a no-op.)  I replaced it with copy_as(VALUE) to provide
better-defined semantics.

* Likewise, read_write_ignore left the contents of the field after
reading too squishy for me.  I invented read_as(VALUE) parallel
to copy_as() to fix the semantics, and added a check that you
can only use read_write_ignore if the struct is no_read or
you provide read_as().  (This could be factored differently
of course.)

* I threw in a bunch more no_read markers to bring the readfuncs.c
contents into closer alignment with what we have today.  Maybe
there is an argument for accepting that code bloat, but it's a
discussion to have later.  In any case, most of the pathnodes.h
structs HAVE to be marked no_read because there's no sane way
to reconstruct them from outfuncs output.

* I got rid of the code that stripped underscores from outfuncs
struct labels.  That seemed like an entirely unnecessary
behavioral change.

* FWIW, I'm okay with the question about

# XXX Previously, for subtyping, only the leaf field name is
# used. Ponder whether we want to keep it that way.

I thought that it might make the output too cluttered, but after
some study of the results from printing plans and planner data
structures, it's not a big addition, and indeed I kind of like it.

* Fixed a bug in write_only_req_outer code.

* Made Plan and Join into abstract nodes.

Anyway, if we can fix the impedance mismatch with pgindent,
I think this is committable.  There is a lot of follow-on
work that could be considered, but I'd like to get the present
changes in place ASAP so that other patches can be rebased
onto something stable.

I've attached a delta patch, and also repeated v7 so as not
to confuse the cfbot.

regards, tom lane

From c82ee081a7a8cdc77b44f325d1df695b55a60b06 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 6 Jul 2022 12:13:32 +0200
Subject: [PATCH v7] Automatically generate node support functions

Add a script to automatically generate the node support functions
(copy, equal, out, and read, as well as the node tags enum) from the
struct definitions.

For each of the four node support files, it creates two include files,
e.g., copyfuncs.funcs.c and copyfuncs.switch.c, to include in the main
file.  All the scaffolding of the main file stays in place.

TODO: In this patch, I have only ifdef'ed out the code that could be
removed, mainly so that it won't constantly have merge conflicts.
Eventually, that should all be changed to delete the code.  All the
code comments that are worth 

Re: EINTR in ftruncate()

2022-07-06 Thread Andres Freund
Hi,

On 2022-07-06 21:29:41 +0200, Alvaro Herrera wrote:
> On 2022-Jul-05, Andres Freund wrote:
> 
> > I think we'd be better off disabling at least some signals during
> > dsm_impl_posix_resize(). I'm afraid we'll otherwise just find another
> > variation of these problems. I haven't checked the source of ftruncate, but
> > what Thomas dug up for fallocate makes it pretty clear that our current
> > approach of just retrying again and again isn't good enough. It's a bit more
> obvious that it's a problem for fallocate, but I don't think it's worth
> having different solutions for the two.
> 
> So what if we move the retry loop one level up?  As in the attached.
> Here, if we get EINTR then we retry both syscalls.

Doesn't really seem to address the problem to me. posix_fallocate()
takes some time (~1s for 3GB roughly), so if we signal at a higher rate,
we'll just get stuck.

I hacked a bit on a test program from Thomas, and it's pretty clear
that with a 5ms timer interval you'll pretty much not make progress.
It's much easier to get fallocate() interrupted than ftruncate(), but
the latter gets interrupted e.g. when you run strace at the "wrong"
moment (afaics SIGSTOP/SIGCONT trigger EINTR in situations that are
retried otherwise).

So I think we need: 1) block most signals, 2) a retry loop *without*
interrupt checks.

Greetings,

Andres Freund




Re: tuplesort Generation memory contexts don't play nicely with index builds

2022-07-06 Thread David Rowley
On Wed, 6 Jul 2022 at 13:34, David Rowley  wrote:
> If there are no objections, I plan to push this in the next 24 hours.

Pushed.

David




Re: EINTR in ftruncate()

2022-07-06 Thread Alvaro Herrera
On 2022-Jul-05, Andres Freund wrote:

> I think we'd be better off disabling at least some signals during
> dsm_impl_posix_resize(). I'm afraid we'll otherwise just find another
> variation of these problems. I haven't checked the source of ftruncate, but
> what Thomas dug up for fallocate makes it pretty clear that our current
> approach of just retrying again and again isn't good enough. It's a bit more
> obvious that it's a problem for fallocate, but I don't think it's worth having
> different solutions for the two.

So what if we move the retry loop one level up?  As in the attached.
Here, if we get EINTR then we retry both syscalls.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"
From 67d4447bc8075b9eb249e7543afb136b2cfaad9b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 1 Jul 2022 17:16:33 +0200
Subject: [PATCH v4] rework retry loop for dsm_impl_op

---
 src/backend/storage/ipc/dsm_impl.c | 101 +
 1 file changed, 58 insertions(+), 43 deletions(-)

diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 323862a3d2..4f3284b66b 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -295,30 +295,57 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 		}
 		request_size = st.st_size;
 	}
-	else if (dsm_impl_posix_resize(fd, request_size) != 0)
+	else
 	{
-		int			save_errno;
+		int			rc;
 
-		/* Back out what's already been done. */
-		save_errno = errno;
-		close(fd);
-		ReleaseExternalFD();
-		shm_unlink(name);
-		errno = save_errno;
+		for (;;)
+		{
+			rc = dsm_impl_posix_resize(fd, request_size);
+			if (rc == 0)
+				break;			/* all good */
 
-		/*
-		 * If we received a query cancel or termination signal, we will have
-		 * EINTR set here.  If the caller said that errors are OK here, check
-		 * for interrupts immediately.
-		 */
-		if (errno == EINTR && elevel >= ERROR)
-			CHECK_FOR_INTERRUPTS();
+			/*
+			 * If there's a signal that we need to handle, bail out here to
+			 * clean up and handle it.
+			 */
+			if (QueryCancelPending || ProcDiePending)
+				break;
 
-		ereport(elevel,
-				(errcode_for_dynamic_shared_memory(),
-				 errmsg("could not resize shared memory segment \"%s\" to %zu bytes: %m",
-						name, request_size)));
-		return false;
+			/*
+			 * If we were interrupted but it wasn't a Postgres-level signal,
+			 * try again.
+			 */
+			if (rc == EINTR)
+				continue;
+			break;
+		}
+
+		if (rc != 0)
+		{
+			int			save_errno;
+
+			/* Back out what's already been done. */
+			save_errno = errno;
+			close(fd);
+			ReleaseExternalFD();
+			shm_unlink(name);
+			errno = save_errno;
+
+			/*
+			 * If we received a query cancel or termination signal, we will have
+			 * EINTR set here.  If the caller said that errors are OK here, check
+			 * for interrupts immediately.
+			 */
+			if (errno == EINTR && elevel >= ERROR)
+				CHECK_FOR_INTERRUPTS();
+
+			ereport(elevel,
+					(errcode_for_dynamic_shared_memory(),
+					 errmsg("could not resize shared memory segment \"%s\" to %zu bytes: %m",
+							name, request_size)));
+			return false;
+		}
 	}
 
 	/* Map it. */
@@ -355,15 +382,20 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
  * If necessary, also ensure that virtual memory is actually allocated by the
  * operating system, to avoid nasty surprises later.
  *
- * Returns non-zero if either truncation or allocation fails, and sets errno.
+ * Returns non-zero if either truncation or allocation fails, and errno is set.
  */
 static int
 dsm_impl_posix_resize(int fd, off_t size)
 {
 	int			rc;
 
-	/* Truncate (or extend) the file to the requested size. */
+	/*
+	 * Truncate (or extend) the file to the requested size.
+	 */
+	errno = 0;
 	rc = ftruncate(fd, size);
+	if (rc != 0)
+		return errno;
 
 	/*
 	 * On Linux, a shm_open fd is backed by a tmpfs file.  After resizing with
@@ -374,27 +406,10 @@ dsm_impl_posix_resize(int fd, off_t size)
 	 * SIGBUS later.
 	 */
 #if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
-	if (rc == 0)
-	{
-		/*
-		 * We may get interrupted.  If so, just retry unless there is an
-		 * interrupt pending.  This avoids the possibility of looping forever
-		 * if another backend is repeatedly trying to interrupt us.
-		 */
-		pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE);
-		do
-		{
-			rc = posix_fallocate(fd, 0, size);
-		} while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
-		pgstat_report_wait_end();
-
-		/*
-		 * The caller expects errno to be set, but posix_fallocate() doesn't
-		 * set it.  Instead it returns error numbers directly.  So set errno,
-		 * even though we'll also return rc to indicate success or failure.
-		 */
-		errno = rc;
-	}
+	pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE);
+	rc = posix_fallocate(fd, 0, 

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-07-06 Thread Andres Freund
Hi,

On 2022-07-05 13:24:55 -0400, Melanie Plageman wrote:
> From 2d089e26236c55d1be5b93833baa0cf7667ba38d Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Tue, 28 Jun 2022 11:33:04 -0400
> Subject: [PATCH v22 1/3] Add BackendType for standalone backends
> 
> All backends should have a BackendType to enable statistics reporting
> per BackendType.
> 
> Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
> alphabetize the BackendTypes). Both the bootstrap backend and single
> user mode backends will have BackendType B_STANDALONE_BACKEND.
> 
> Author: Melanie Plageman 
> Discussion: 
> https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
> ---
>  src/backend/utils/init/miscinit.c | 17 +++--
>  src/include/miscadmin.h   |  5 +++--
>  2 files changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
> index eb43b2c5e5..07e6db1a1c 100644
> --- a/src/backend/utils/init/miscinit.c
> +++ b/src/backend/utils/init/miscinit.c
> @@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
>  {
>   Assert(!IsPostmasterEnvironment);
>  
> + MyBackendType = B_STANDALONE_BACKEND;

Hm. This is used for singleuser mode as well as bootstrap. Should we
split those? It's not like bootstrap mode really matters for stats, so
I'm inclined not to.


> @@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
>* out the initial relation mapping files.
>*/
>   RelationMapFinishBootstrap();
> + // TODO: should this be done for bootstrap?
> + pgstat_report_io_ops();

Hm. Not particularly useful, but also not harmful. But we don't need an
explicit call, because it'll be done at process exit too.  At least I
think so; it could be that it's different for bootstrap.



> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 2e146aac93..e6dbb1c4bb 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -1712,6 +1712,9 @@ AutoVacWorkerMain(int argc, char *argv[])
>   recentXid = ReadNextTransactionId();
>   recentMulti = ReadNextMultiXactId();
>   do_autovacuum();
> +
> + // TODO: should this be done more often somewhere in do_autovacuum()?
> + pgstat_report_io_ops();
>   }

Don't think you need all these calls before process exit - it'll happen
via pgstat_shutdown_hook().

IMO it'd be a good idea to add pgstat_report_io_ops() to
pgstat_report_vacuum()/analyze(), so that the stats for a longrunning
autovac worker get updated more regularly.


> diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
> index 91e6f6ea18..87e4b9e9bd 100644
> --- a/src/backend/postmaster/bgwriter.c
> +++ b/src/backend/postmaster/bgwriter.c
> @@ -242,6 +242,7 @@ BackgroundWriterMain(void)
>  
>   /* Report pending statistics to the cumulative stats system */
>   pgstat_report_bgwriter();
> + pgstat_report_io_ops();
>  
>   if (FirstCallSinceLastCheckpoint())
>   {

How about moving the pgstat_report_io_ops() into
pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
unnecessary to have multiple pgstat_* calls in these places.
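
That is, roughly (a sketch of the suggestion; pgstat_report_io_ops() is
the function added by the patch under review, and the existing body of
pgstat_report_bgwriter() is elided here):

void
pgstat_report_bgwriter(void)
{
	/* ... existing flush of the pending bgwriter counters ... */

	/* also flush the pending IO operation counters in the same call */
	pgstat_report_io_ops();
}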



> +/*
> + * Flush out locally pending IO Operation statistics entries
> + *
> + * If nowait is true, this function returns false on lock failure. Otherwise
> + * this function always returns true. Writer processes are mutually excluded
> + * using LWLock, but readers are expected to use change-count protocol to
> + * avoid interference with writers.
> + *
> + * If nowait is true, this function returns true if the lock could not be
> + * acquired. Otherwise return false.
> + *
> + */
> +bool
> +pgstat_flush_io_ops(bool nowait)
> +{
> + PgStat_IOPathOps *dest_io_path_ops;
> + PgStatShared_BackendIOPathOps *stats_shmem;
> +
> + PgBackendStatus *beentry = MyBEEntry;
> +
> + if (!have_ioopstats)
> + return false;
> +
> + if (!beentry || beentry->st_backendType == B_INVALID)
> + return false;
> +
> + stats_shmem = &pgStatLocal.shmem->io_ops;
> +
> + if (!nowait)
> + LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
> + else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE))
> + return true;

Wonder if it's worth making the lock specific to the backend type?


> + dest_io_path_ops =
> + &stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
> +

This could be done before acquiring the lock, right?


> +void
> +pgstat_io_ops_snapshot_cb(void)
> +{
> + PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops;
> + PgStat_IOPathOps *snapshot_ops = pgStatLocal.snapshot.io_path_ops;
> + PgStat_IOPathOps *reset_ops;
> +
> + PgStat_IOPathOps *reset_offset = 

Re: Add red-black tree missing comparison searches

2022-07-06 Thread Alexander Korotkov
Hi, Steve!

On Sat, Jul 2, 2022 at 10:38 PM Steve Chavez  wrote:
> > But I think we can support strict inequalities too (e.g.
> > less and greater without equals).  Could you please make it a bool
> > argument equal_matches?
>
> Sure, an argument is a good idea to keep the code shorter.
>
> > Could you please extract this change as a separate patch.
>
> Done!

Thank you!

I made some improvements to the test suite, ran pgindent, and wrote
commit messages.

I think this is a quite straightforward and low-risk patch.  I'm going
to push it if there are no objections.

--
Regards,
Alexander Korotkov


0001-Use-C99-designator-in-the-rbtree-sentinel-definit-v2.patch
Description: Binary data


0002-Add-missing-inequality-searches-to-rbtree-v2.patch
Description: Binary data


Re: Make name optional in CREATE STATISTICS

2022-07-06 Thread Matthias van de Meent
On Sun, 15 May 2022 at 14:20, Simon Riggs  wrote:
>
> Currently, CREATE STATS requires you to think of a name for each stats
> object, which is fairly painful, so users would prefer an
> automatically assigned name.
>
> Attached patch allows this, which turns out to be very simple, since a
> name assignment function already exists.
>
> The generated name is simple, but that's exactly what users do anyway,
> so it is not too bad.

Cool.

> Tests, docs included.

Something I noticed is that this grammar change is quite different
from how create index specifies its optional name. Because we already
have separate statement sections for with and without IF NOT EXISTS,
adding another branch will add even more duplication. Using a new
opt_name production (potentially renamed from opt_index_name?) would
probably reduce the amount of duplication in the grammar.
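
For illustration, such a production could look roughly like the
existing optional-name rules in gram.y (a sketch, not the actual
patch):

opt_name:	name							{ $$ = $1; }
			| /*EMPTY*/						{ $$ = NULL; }
		;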

We might be able to use opt_if_not_exists to fully remove the
duplicated grammars, but I don't think we would be able to keep the
"CREATE STATISTICS IF NOT EXISTS <> ON col1, col2 FROM table"
syntax illegal.

Please also update the comment in gram.y above the updated section
that details the expected grammar for CREATE STATISTICS, as you seem
to have overlooked that copy of grammar documentation.

Apart from these two small issues, this passes tests and seems complete.


Kind regards,

Matthias van de Meent




Re: Make message at end-of-recovery less scary.

2022-07-06 Thread Jacob Champion
On Mon, Mar 28, 2022 at 11:07 PM Kyotaro Horiguchi
 wrote:
>
> Rebased.

Unfortunately this will need another rebase over latest.

[CFM hat] Looking through the history here, this has been bumped to
Ready for Committer a few times and then bumped back to Needs Review
after a required rebase. What's the best way for us to provide support
for contributors who get stuck in this loop? Maybe we can be more
aggressive about automated notifications when a RfC patch goes red in
the cfbot?

Thanks,
--Jacob




Re: Custom tuplesorts for extensions

2022-07-06 Thread Alexander Korotkov
Hi!

On Wed, Jul 6, 2022 at 6:01 PM Andres Freund  wrote:
> I think this needs to be evaluated for performance...

Surely, it does.

> I agree with the nearby comment that the commits need a bit of justification
> at least to review them.

Will do this.
> > From 1d78e271b22d7c6a1557defbe15ea5039ff28510 Mon Sep 17 00:00:00 2001
> > From: Alexander Korotkov 
> > Date: Tue, 21 Jun 2022 14:03:13 +0300
> > Subject: [PATCH v2 2/6] Tuplesortstate.getdatum1 method
>
> Hm. This adds a bunch of indirect function calls were there previously
> weren't.

Yep.  I think it's worth changing this function to deal with many
SortTuples at once.

> > From 494d46dcf938e5f824a498e38ce77782751208e1 Mon Sep 17 00:00:00 2001
> > From: Alexander Korotkov 
> > Date: Tue, 21 Jun 2022 14:13:56 +0300
> > Subject: [PATCH v2 3/6] Put abbreviation logic into puttuple_common()
>
> There's definitely a lot of redundancy removed... But the list of branches in
> puttuple_common() grew.  Perhaps we instead can add a few flags to
> puttuple_common() that determine whether abbreviation should happen, so that
> we only do the work necessary for the "type" of sort?

Good point, will refactor that.

> > From ee2dd46b07d62e13ed66b5a38272fb5667c943f3 Mon Sep 17 00:00:00 2001
> > From: Alexander Korotkov 
> > Date: Wed, 22 Jun 2022 00:14:51 +0300
> > Subject: [PATCH v2 4/6] Move freeing memory away from writetup()
>
> Seems to do more than just moving freeing around?

Yes, it also moves memory accounting from tuplesort_put*() to
puttuple_common().  Will revise this.

> > @@ -1973,8 +1963,13 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
> >  static void
> >  puttuple_common(Tuplesortstate *state, SortTuple *tuple)
> >  {
> > + MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
> > +
> >   Assert(!LEADER(state));
> >
> > + if (tuple->tuple != NULL)
> > + USEMEM(state, GetMemoryChunkSpace(tuple->tuple));
> > +
>
> Adding even more branches into common code...

I will see how to reduce branching here.

> > From 3a0e1fa7c7e4da46a86f7d5b9dd0392549f3b460 Mon Sep 17 00:00:00 2001
> > From: Alexander Korotkov 
> > Date: Wed, 22 Jun 2022 18:11:26 +0300
> > Subject: [PATCH v2 5/6] Reorganize data structures
>
> Hard to know what this is trying to achieve.

Split the public interface part out of Tuplesortstate.

> > -struct Tuplesortstate
> > +struct TuplesortOps
> >  {
> > -	TupSortStatus status;	/* enumerated value as shown above */
> > -	int			nKeys;		/* number of columns in sort key */
> > -	int			sortopt;	/* Bitmask of flags used to setup sort */
> > -	bool		bounded;	/* did caller specify a maximum number of
> > -							 * tuples to return? */
> > -	bool		boundUsed;	/* true if we made use of a bounded heap */
> > -	int			bound;		/* if bounded, the maximum number of tuples */
> > -	bool		tuples;		/* Can SortTuple.tuple ever be set? */
> > -	int64		availMem;	/* remaining memory available, in bytes */
> > -	int64		allowedMem; /* total memory allowed, in bytes */
> > -	int			maxTapes;	/* max number of input tapes to merge in each
> > -							 * pass */
> > -	int64		maxSpace;	/* maximum amount of space occupied among sort
> > -							 * of groups, either in-memory or on-disk */
> > -	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
> > -							 * space, false when it's value for in-memory
> > -							 * space */
> > -	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
> >  	MemoryContext maincontext;	/* memory context for tuple sort metadata that
> >  								 * persists across multiple batches */
> >  	MemoryContext sortcontext;	/* memory context holding most sort data */
> >  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
> > -	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
> >
> >  	/*
> >  	 * These function pointers decouple the routines that must know what kind
>
> To me it seems odd to have memory contexts and similar things in a
data structure called *Ops.

Yep, it's worth renaming TuplesortOps to TuplesortPublic or something.

> > From b06bcb5f3666f0541dfcc27c9c8462af2b5ec9e0 Mon Sep 17 

Re: doc: Move enum storage commentary to top of section

2022-07-06 Thread David G. Johnston
On Wed, Jul 6, 2022 at 10:24 AM Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:

> On Thu, 9 Jun 2022 at 18:12, David G. Johnston
>  wrote:
> >
> > Per suggestion over on -docs:
> >
> >
> https://www.postgresql.org/message-id/bl0pr06mb4978f6c0b69f3f03aebed0fbb3...@bl0pr06mb4978.namprd06.prod.outlook.com
>
> I agree with moving the size of an enum into the top, but I don't
> think that the label length documentation also should be included in
> the top section. It seems excessively detailed for that section with
> its reference to NAMEDATALEN.
>
> -Matthias
>

Agreed.

Tangentially: It does seem a bit unusual to call the fact that the values
are both case-sensitive and limited to the length of a system identifier an
implementation detail.  But if anything the length is more of one than the
case-sensitivity.  Specifying NAMEDATALEN here seems like it breaks
encapsulation; the text could instead refer by comparison to an identifier,
and only those that care can learn how that length might be changed in a
custom build of PostgreSQL.

David J.


Re: doc: Move enum storage commentary to top of section

2022-07-06 Thread Matthias van de Meent
On Thu, 9 Jun 2022 at 18:12, David G. Johnston
 wrote:
>
> Per suggestion over on -docs:
>
> https://www.postgresql.org/message-id/bl0pr06mb4978f6c0b69f3f03aebed0fbb3...@bl0pr06mb4978.namprd06.prod.outlook.com

I agree with moving the size of an enum into the top, but I don't
think that the label length documentation also should be included in
the top section. It seems excessively detailed for that section with
its reference to NAMEDATALEN.

-Matthias




Re: AIX support - alignment issues

2022-07-06 Thread Robert Haas
On Wed, Jul 6, 2022 at 12:27 PM Andres Freund  wrote:
> I think my proposal of introducing a version of double that is marked to be 8
> byte aligned should do the trick as well, and doesn't have the problem of
> changing the meaning of 'double' references in external headers. In fact, we
> already have float8 as a type, so we could just add it there.

Yeah, but how easy will it be to know whether we've used that in
every relevant place?

Could we insist on 8-byte alignment even on 32-bit platforms? I think
we have a few of those in the buildfarm, so maybe that would help us
spot problems. Although I'm not sure how, exactly.

> The problem with having a lot more alignment values is that it adds a bunch of
> overhead to very performance critical paths. We don't want to add more
> branches to att_align_nominal() if we can avoid it.

Fair.

> I'm fairly certain that we're going to add a lot more 64bit ints to catalogs
> in the next few years, so this will become a bigger issue over time...

Absolutely.

> Outside of the catalogs I still think that we should work towards not aligning
> byval values (and instead memcpy-ing the values to deal with alignment
> sensitive platforms), so we don't waste so much space. And for catalogs we've
> been talking about giving up the struct mapping as well, in the thread about
> variable length names. In which case the cost of handling more
> alignment values wouldn't be incurred as frequently.

+1. Aligning stuff on disk appears to have few redeeming properties
for the amount of pain it causes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Possible fails in pg_stat_statements test

2022-07-06 Thread Robert Haas
On Thu, Mar 31, 2022 at 12:00 PM Julien Rouhaud  wrote:
> > Indeed. Then there is a very simple solution for this particular case as
> > the wal_records counter may sometimes become greater but never less.
> > The corresponding patch is attached.
>
> +1 for this approach, and the patch looks good to me.

Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: O(n) tasks cause lengthy startups and checkpoints

2022-07-06 Thread Nathan Bossart
Here's a new revision where I've attempted to address all the feedback I've
received thus far.  Notably, the custodian now uses a queue for registering
tasks and determining which tasks to execute.  Other changes include
splitting the temporary file functions apart to avoid consecutive boolean
flags, using a timestamp instead of an integer for the staging name for
temporary directories, moving temporary directories to a dedicated
directory so that the custodian doesn't need to scan relation files,
ERROR-ing when something goes wrong when cleaning up temporary files,
executing requested tasks immediately in single-user mode, and more.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From f2d5205b7fab3c1dccbe25829c9b46ba26b3cd9f Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 5 Jan 2022 19:24:22 +
Subject: [PATCH v7 1/6] Introduce custodian.

The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing.  This commit simply adds the new process; it does
not yet do anything useful.
---
 src/backend/postmaster/Makefile |   1 +
 src/backend/postmaster/auxprocess.c |   8 +
 src/backend/postmaster/custodian.c  | 383 
 src/backend/postmaster/postmaster.c |  44 ++-
 src/backend/storage/ipc/ipci.c  |   3 +
 src/backend/storage/lmgr/proc.c |   1 +
 src/backend/utils/activity/wait_event.c |   3 +
 src/backend/utils/init/miscinit.c   |   3 +
 src/include/miscadmin.h |   3 +
 src/include/postmaster/custodian.h  |  32 ++
 src/include/storage/proc.h  |  11 +-
 src/include/utils/wait_event.h  |   1 +
 12 files changed, 488 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/postmaster/custodian.c
 create mode 100644 src/include/postmaster/custodian.h

diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
 	bgworker.o \
 	bgwriter.o \
 	checkpointer.o \
+	custodian.o \
 	fork_process.o \
 	interrupt.o \
 	pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 39ac4490db..620a0b1bae 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
 #include "pgstat.h"
 #include "postmaster/auxprocess.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
 #include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
 		case CheckpointerProcess:
 			MyBackendType = B_CHECKPOINTER;
 			break;
+		case CustodianProcess:
+			MyBackendType = B_CUSTODIAN;
+			break;
 		case WalWriterProcess:
 			MyBackendType = B_WAL_WRITER;
 			break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
 			CheckpointerMain();
 			proc_exit(1);
 
+		case CustodianProcess:
+			CustodianMain();
+			proc_exit(1);
+
 		case WalWriterProcess:
 			WalWriterMain();
 			proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 00..e90f5d0d1f
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,383 @@
+/*-
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc.  Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding).  However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode).  The custodian is not
+ * an essential process and can shut down quickly when requested.  The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *   src/backend/postmaster/custodian.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(bool retry);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+	slock_t		cust_lck;
+
+	CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+	int			task_queue_head;
+} CustodianShmemStruct;
+
+static 

Re: archive modules

2022-07-06 Thread Nathan Bossart
On Wed, Jul 06, 2022 at 06:21:24PM +0200, talk to ben wrote:
> I am not sure why, but I can't find "basic_archive.archive_directory" in
> pg_settings the same way I would find for example :
> "pg_stat_statements.max".
> 
> [local]:5656 benoit@postgres=# SELECT count(*) FROM pg_settings WHERE name
> = 'basic_archive.archive_directory';
>  count
> ---
>  0
> (1 row)
> 
> show can find it if I use the complete name but tab completion can't find
> the guc:
> 
> [local]:5656 benoit@postgres=# show basic_archive.archive_directory;
>  basic_archive.archive_directory
> -
>  /home/benoit/tmp/tmp/archives
> (1 row)

I think the reason is that only the archiver process loads the library, so
the GUC isn't registered at startup like you'd normally see with
shared_preload_libraries.  IIUC the server will still create a placeholder
GUC during startup for custom parameters, which is why it shows up for SHOW
commands.
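
For example (a quick illustration, assuming the basic_archive library
is available), explicitly loading the library into the current session
registers the GUC, after which it should appear:

LOAD 'basic_archive';
SELECT count(*) FROM pg_settings
WHERE name = 'basic_archive.archive_directory';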

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Issue with pg_stat_subscription_stats

2022-07-06 Thread Andres Freund
On 2022-07-05 14:52:45 -0700, Andres Freund wrote:
> On 2022-07-04 11:01:01 +0900, Masahiko Sawada wrote:
> > I've attached the patch, fix_drop_subscriptions_stats.patch, to fix it.
> 
> LGTM. Unless somebody sees a reason not to, I'm planning to commit that to 15
> and HEAD.

Pushed.




Re: AIX support - alignment issues

2022-07-06 Thread Andres Freund
Hi,

On 2022-07-06 11:55:57 -0400, Robert Haas wrote:
> On Sat, Jul 2, 2022 at 2:34 PM Andres Freund  wrote:
> > It strikes me as a remarkably bad idea to manually try to maintain the
> > correct alignment. Even with the tests added it's still quite manual and
> > requires contorted struct layouts (see e.g. [1]).
> >
> > I think we should either teach our system the correct alignment rules or we
> > should drop AIX support.
>
> I raised this same issue at
> http://postgr.es/m/CA+TgmoaK377MXCWJqEXM3VvKDDC-frNUMKb=7u07tja59wt...@mail.gmail.com
> and discussion ensued from there. I agree that manually maintaining
> alignment, even with a regression test to help, is a really bad plan.
>
> The rule about columns of type "name" can be relaxed easily enough,
> just by insisting that NAMEDATALEN must be a multiple of 8. As Tom
> also said on this thread, adding such a constraint seems to have no
> real downside. But the problem has a second aspect not related to
> NameData, which is that int64 and double have different alignment
> requirements on that platform. To get out from under that part of it,
> it seems we either need to de-support AIX and any other platforms that
> have such a discrepancy, or else have separate typalign values for
> int64-align vs. double-align.

I think my proposal of introducing a version of double that is marked to be 8
byte aligned should do the trick as well, and doesn't have the problem of
changing the meaning of 'double' references in external headers. In fact, we
already have float8 as a type, so we could just add it there.

We don't currently have a float8 in the catalogs afaics, but I think it'd be
better to not rely on that.

It's not pretty, but still seems a lot better than doing this stuff manually.
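
As a sketch of what that could look like (assuming GCC-style alignment
attributes; the fallback and exact spelling are not worked out here):

#if defined(__GNUC__) || defined(__clang__)
/* force 8-byte alignment even where plain double is 4-byte aligned */
typedef double float8 __attribute__((aligned(8)));
#else
typedef double float8;
#endif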


> From a theoretical point of view, I think what we're doing now is
> pretty unprincipled. I've always found it a bit surprising that we get
> away with just assuming that a bunch of various different primitive
> data types are all going to have the same alignment requirement. The
> purist in me feels that it would be better to have separate typalign
> values for things that aren't guaranteed to behave the same. However,
> there's a practical difficulty with that approach: if the only
> operating system where this issue occurs in practice is AIX, I feel
> it's going to be pretty hard for us to keep the code that caters to
> this unusual situation working properly. And I'd rather have no code
> for it at all than have code which doesn't really work.

The problem with having a lot more alignment values is that it adds a bunch of
overhead to very performance critical paths. We don't want to add more
branches to att_align_nominal() if we can avoid it.

I guess we can try to introduce TYPALIGN_INT64 and then hide the relevant
branch with an ifdef for the common case of TYPALIGN_INT64 == TYPALIGN_DOUBLE.
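
Roughly like this (TYPALIGN_INT64 and the helper macro below are
hypothetical; ALIGNOF_LONG_LONG_INT, ALIGNOF_DOUBLE, TYPEALIGN and
DOUBLEALIGN are the existing macros):

#if ALIGNOF_LONG_LONG_INT == ALIGNOF_DOUBLE
/* common case: int64 alignment degenerates to the double branch */
#define att_align_int64(cur_offset)	DOUBLEALIGN(cur_offset)
#else
#define att_align_int64(cur_offset)	TYPEALIGN(ALIGNOF_LONG_LONG_INT, cur_offset)
#endif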


I'm fairly certain that we're going to add a lot more 64bit ints to catalogs
in the next few years, so this will become a bigger issue over time...


Outside of the catalogs I still think that we should work towards not aligning
byval values (and instead memcpy-ing the values to deal with alignment
sensitive platforms), so we don't waste so much space. And for catalogs we've
been talking about giving up the struct mapping as well, in the thread about
variable length names. In which case the cost of handling more
alignment values wouldn't be incurred as frequently.

Greetings,

Andres Freund




Re: archive modules

2022-07-06 Thread talk to ben
Hi,

I am not sure why, but I can't find "basic_archive.archive_directory" in
pg_settings the same way I would find for example :
"pg_stat_statements.max".

[local]:5656 benoit@postgres=# SELECT count(*) FROM pg_settings WHERE name
= 'basic_archive.archive_directory';
 count
---
 0
(1 row)

show can find it if I use the complete name but tab completion can't find
the guc:

[local]:5656 benoit@postgres=# show basic_archive.archive_directory;
 basic_archive.archive_directory
-
 /home/benoit/tmp/tmp/archives
(1 row)

The archiver is configured with "basic_archive" and is working fine. I use
this version of pg:

PostgreSQL 15beta2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.3.1
20220421 (Red Hat 11.3.1-2), 64-bit


Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-07-06 Thread Fujii Masao




On 2022/03/16 10:29, Kyotaro Horiguchi wrote:

At Wed, 16 Mar 2022 09:19:13 +0900 (JST), Kyotaro Horiguchi 
 wrote in

In short, I split out the two topics other than checkpoint log to
other threads.


So, this is about the main topic of this thread, adding LSNs to the
checkpoint log.  Other topics have moved to other threads [1], [2],
[3].

I think this is no longer controversial on its own.  So this patch is now
really Ready-for-Committer and is waiting to be picked up.


+1

+	 * ControlFileLock is not required as we are the only
+	 * updator of these variables.

Isn't it better to add "at this time" or something at the end of the comment,
because we're not always the only updater of them?  No?

Barring any objection, I'm thinking to apply the above small change and commit 
the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: making relfilenodes 56 bits

2022-07-06 Thread Robert Haas
On Wed, Jul 6, 2022 at 7:55 AM Dilip Kumar  wrote:
> Okay, changed that and changed a few more occurrences in 0001 which
> were on similar lines.  I also tested the performance of pgbench
> while concurrently running a script that creates/drops relations,
> but I do not see any regression with fairly small values of
> VAR_RELNUMBER_PREFETCH; the smallest value I tried was 8.  That
> doesn't mean I am suggesting this small value, but I think we can keep
> the value at something like 512 or 1024 without worrying much about
> performance, so I changed it to 512 in the latest patch.

OK, I have committed 0001 now with a few changes. pgindent did not
agree with some of your whitespace changes, and I also cleaned up a
few long lines. I replaced one instance of InvalidOid with
InvalidRelFileNumber also, and changed a word in a comment.

I think 0002 and 0003 need more work yet; I'll try to write a review
of those soon.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: AIX support - alignment issues

2022-07-06 Thread Robert Haas
On Sat, Jul 2, 2022 at 2:34 PM Andres Freund  wrote:
> It strikes me as a remarkably bad idea to manually try to maintain the correct
> alignment. Even with the tests added it's still quite manual and requires
> contorted struct layouts (see e.g. [1]).
>
> I think we should either teach our system the correct alignment rules or we
> should drop AIX support.

I raised this same issue at
http://postgr.es/m/CA+TgmoaK377MXCWJqEXM3VvKDDC-frNUMKb=7u07tja59wt...@mail.gmail.com
and discussion ensued from there. I agree that manually maintaining
alignment, even with a regression test to help, is a really bad plan.

The rule about columns of type "name" can be relaxed easily enough,
just by insisting that NAMEDATALEN must be a multiple of 8. As Tom
also said on this thread, adding such a constraint seems to have no
real downside. But the problem has a second aspect not related to
NameData, which is that int64 and double have different alignment
requirements on that platform. To get out from under that part of it,
it seems we either need to de-support AIX and any other platforms that
have such a discrepancy, or else have separate typalign values for
int64-align vs. double-align.

From a theoretical point of view, I think what we're doing now is
pretty unprincipled. I've always found it a bit surprising that we get
away with just assuming that a bunch of various different primitive
data types are all going to have the same alignment requirement. The
purist in me feels that it would be better to have separate typalign
values for things that aren't guaranteed to behave the same. However,
there's a practical difficulty with that approach: if the only
operating system where this issue occurs in practice is AIX, I feel
it's going to be pretty hard for us to keep the code that caters to
this unusual situation working properly. And I'd rather have no code
for it at all than have code which doesn't really work.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Issue with pg_stat_subscription_stats

2022-07-06 Thread Andres Freund
Hi,

On 2022-07-06 11:41:46 +0900, Masahiko Sawada wrote:
> diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
> index 74c38ead5d..6a46956f6e 100644
> --- a/src/test/regress/sql/subscription.sql
> +++ b/src/test/regress/sql/subscription.sql
> @@ -30,6 +30,12 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
>  COMMENT ON SUBSCRIPTION regress_testsub IS 'test subscription';
>  SELECT obj_description(s.oid, 'pg_subscription') FROM pg_subscription s;
>  
> +-- Check if the subscription stats are created and stats_reset is updated
> +-- by pg_stat_reset_subscription_stats().
> +SELECT subname, stats_reset IS NULL stats_reset_is_null FROM pg_stat_subscription_stats ORDER BY 1;

Why use ORDER BY 1 instead of just getting the stats for the subscription we
want to test? Seems a bit more robust to show only that one, so we don't get
unnecessary changes if the test needs to create another subscription or such.


> +SELECT pg_stat_reset_subscription_stats(oid) FROM pg_subscription;
> +SELECT subname, stats_reset IS NULL stats_reset_is_null FROM pg_stat_subscription_stats ORDER BY 1;
> +

Perhaps worth resetting again and checking that the timestamp is bigger than
the previous timestamp? You can do that with \gset etc.
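
Something like this (a sketch, using the subscription name from the
test, and assuming a reset has already happened so stats_reset is
non-null):

SELECT stats_reset AS prev_reset FROM pg_stat_subscription_stats
  WHERE subname = 'regress_testsub' \gset
SELECT pg_stat_reset_subscription_stats(oid) FROM pg_subscription
  WHERE subname = 'regress_testsub';
SELECT stats_reset > :'prev_reset' AS stats_reset_advanced
  FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub';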


Greetings,

Andres Freund




Re: In-place persistence change of a relation

2022-07-06 Thread Jacob Champion
On Thu, Mar 31, 2022 at 2:36 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 31 Mar 2022 18:33:18 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > I don't think this can be commited to 15. So I post the fixed version
> > then move this to the next CF.
>
> Then done. Thanks!

Hello! This patchset will need to be rebased over latest -- looks like
b74e94dc27f (Rethink PROCSIGNAL_BARRIER_SMGRRELEASE) and 5c279a6d350
(Custom WAL Resource Managers) are interfering.

Thanks,
--Jacob




Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog

2022-07-06 Thread Matthias van de Meent
Sorry for waking a dead thread. I found this while cleaning out my
drafts folder, and wanted to share it so anyone interested can reuse
these thoughts.

On Thu, 26 May 2022 at 03:19, Bruce Momjian  wrote:
> I read this email today and participated in an unconference session on
> this topic today too.  Let me explain what I learned.

My takeaways from this thread and that unconference session (other
notes from the session: [0]):

- Lossy count-distinct sketches require at least 100 "buckets" to get
an RSE of <10%, and 1000 buckets for an RSE of <2%.
The general formula for the RSE of the most popular of these sketches
is within a constant factor of 1/sqrt(m) for m "buckets" (e.g.,
1/sqrt(100) = 10%, 1/sqrt(1000) ~= 3.2%), which is theorized to be the
theoretical limit for count-distinct sketches that utilize m "buckets".
RSE is the relative standard error, i.e. the accuracy of the model, so
with the popular sketches, to double the accuracy you need 4x as many
buckets.
A "bucket" is a distinct counting value, e.g. the log-counters in
(H)LL, and bits in HyperBitBit.
Most sketches use anywhere from several bits to several bytes per
bucket: HLL uses 5 and 6 bits per bucket for 32- and 64-bit hashes,
respectively.

- If we implement sketches, it would be preferable for them to support
the common set operations of [join, intersect, imply] while retaining
their properties, so that we don't lose the increased accuracy.
HLL does not support intersect- or imply-operations directly, which
makes it a bad choice as an estimator of rows returned.

- Bitmaps would be a good first implementation for an initial
sketch-based statistics implementation
Assuming the implementation would compress these bitmaps, the average
size would definitely not be larger than 900 bytes (plus administrative
overhead) per MCV entry: 3 bytes per sampled row for the index in the
bitmap, times 300 rows on average per MCV entry. Rows would be identified
by their index in the sampled list of rows.

- It is not clear we need to be able to combine statistics from
multiple runs of ANALYZE.
We considered that it is rare for people to analyze only a subset of
columns, or that people otherwise would need to combine statistics from
distinct table samples of the same table.

- Accurately combining statistics from two different runs of ANALYZE
requires stable sampling, and lossless tuple identification
The above-mentioned bitmap-based statistics would not allow this,
because the index of a sampled row will likely shift between runs,
even assuming that a row is shared in the sample.

Summary:
We shouldn't use HLL, (compressed) bitmaps will work fine for an
initial implementation of combined sketch-based MCV estimates.
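
As a toy illustration of the bitmap idea (not code from any patch of
this thread): combining two MCV entries amounts to intersecting their
sets of sampled-row indexes, here stored as sorted arrays:

/* count rows sampled in both MCV entries; a and b are sorted arrays
 * of sampled-row indexes with lengths na and nb */
static int
sample_intersect_count(const int *a, int na, const int *b, int nb)
{
	int			i = 0,
				j = 0,
				hits = 0;

	while (i < na && j < nb)
	{
		if (a[i] < b[j])
			i++;
		else if (a[i] > b[j])
			j++;
		else
		{
			hits++;			/* row appears in both entries */
			i++;
			j++;
		}
	}
	return hits;			/* hits / sample size estimates the joint fraction */
}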


Kind regards,

Matthias van de Meent

[0] 
https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Improving_Statistics_Accuracy




Re: Custom tuplesorts for extensions

2022-07-06 Thread Andres Freund
Hi,

I think this needs to be evaluated for performance...

I agree with the nearby comment that the commits need a bit of justification
at least to review them.


On 2022-06-23 15:12:27 +0300, Maxim Orlov wrote:
> From 03b78cdade3b86a0e97723721fa1d0bd64d0c7df Mon Sep 17 00:00:00 2001
> From: Alexander Korotkov 
> Date: Tue, 21 Jun 2022 13:28:27 +0300
> Subject: [PATCH v2 1/6] Remove Tuplesortstate.copytup

Yea. I was recently complaining about the pointlessness of copytup.


> From 1d78e271b22d7c6a1557defbe15ea5039ff28510 Mon Sep 17 00:00:00 2001
> From: Alexander Korotkov 
> Date: Tue, 21 Jun 2022 14:03:13 +0300
> Subject: [PATCH v2 2/6] Tuplesortstate.getdatum1 method

Hm. This adds a bunch of indirect function calls were there previously
weren't.


> From 494d46dcf938e5f824a498e38ce77782751208e1 Mon Sep 17 00:00:00 2001
> From: Alexander Korotkov 
> Date: Tue, 21 Jun 2022 14:13:56 +0300
> Subject: [PATCH v2 3/6] Put abbreviation logic into puttuple_common()

There's definitely a lot of redundancy removed... But the list of branches in
puttuple_common() grew.  Perhaps we instead can add a few flags to
puttuple_common() that determine whether abbreviation should happen, so that
we only do the work necessary for the "type" of sort?



> From ee2dd46b07d62e13ed66b5a38272fb5667c943f3 Mon Sep 17 00:00:00 2001
> From: Alexander Korotkov 
> Date: Wed, 22 Jun 2022 00:14:51 +0300
> Subject: [PATCH v2 4/6] Move freeing memory away from writetup()

Seems to do more than just moving freeing around?

> @@ -1973,8 +1963,13 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
>  static void
>  puttuple_common(Tuplesortstate *state, SortTuple *tuple)
>  {
> + MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
> +
>   Assert(!LEADER(state));
>
> + if (tuple->tuple != NULL)
> + USEMEM(state, GetMemoryChunkSpace(tuple->tuple));
> +

Adding even more branches into common code...



> From 3a0e1fa7c7e4da46a86f7d5b9dd0392549f3b460 Mon Sep 17 00:00:00 2001
> From: Alexander Korotkov 
> Date: Wed, 22 Jun 2022 18:11:26 +0300
> Subject: [PATCH v2 5/6] Reorganize data structures

Hard to know what this is trying to achieve.

> -struct Tuplesortstate
> +struct TuplesortOps
>  {
> -	TupSortStatus status;	/* enumerated value as shown above */
> -	int			nKeys;		/* number of columns in sort key */
> -	int			sortopt;	/* Bitmask of flags used to setup sort */
> -	bool		bounded;	/* did caller specify a maximum number of
> -							 * tuples to return? */
> -	bool		boundUsed;	/* true if we made use of a bounded heap */
> -	int			bound;		/* if bounded, the maximum number of tuples */
> -	bool		tuples;		/* Can SortTuple.tuple ever be set? */
> -	int64		availMem;	/* remaining memory available, in bytes */
> -	int64		allowedMem; /* total memory allowed, in bytes */
> -	int			maxTapes;	/* max number of input tapes to merge in each
> -							 * pass */
> -	int64		maxSpace;	/* maximum amount of space occupied among sort
> -							 * of groups, either in-memory or on-disk */
> -	bool		isMaxSpaceDisk; /* true when maxSpace is value for on-disk
> -							 * space, false when it's value for in-memory
> -							 * space */
> -	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
>  	MemoryContext maincontext;	/* memory context for tuple sort metadata that
>  								 * persists across multiple batches */
>  	MemoryContext sortcontext;	/* memory context holding most sort data */
>  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
> -	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
>
>  	/*
>  	 * These function pointers decouple the routines that must know what kind

To me it seems odd to have memory contexts and similar things in a
data structure called *Ops.


> From b06bcb5f3666f0541dfcc27c9c8462af2b5ec9e0 Mon Sep 17 00:00:00 2001
> From: Alexander Korotkov 
> Date: Wed, 22 Jun 2022 21:48:05 +0300
> Subject: [PATCH v2 6/6] Split tuplesortops.c

I strongly suspect this will cause a slowdown. There was potential for
cross-function optimization that's now removed.


Greetings,

Andres Freund




Re: Use outerPlanState macro instead of referring to lefttree

2022-07-06 Thread Tom Lane
Richard Guo  writes:
> On Sat, Jul 2, 2022 at 5:32 AM Tom Lane  wrote:
>> Backing up a little bit, one thing not to like about the outerPlanState
>> and innerPlanState macros is that they lose all semblance of type
>> safety:

> Your concern makes sense. I think outerPlan and innerPlan macros share
> the same issue. Not sure if there is a way to do the type check.

Yeah, I don't know of one either.  It needn't hold up this patch.

>> Would it make sense to try to use the local-variable style everywhere?

> Do you mean the pattern like below?
>   outerPlanState(hashstate) = ExecInitNode(outerPlan(node), estate, eflags);
> It seems that this pattern is mostly used when initializing child nodes
> with ExecInitNode(), and most calls to ExecInitNode() are using this
> pattern as a convention. Not sure if it's better to change them to
> local-variable style.

That's probably fine, especially if it's a commonly used pattern.

Typically, if one applies outerPlan() or outerPlanState() to the
wrong pointer, the mistake will become obvious upon even minimal
testing.  My concern here is more about usages in edge cases that
perhaps escape testing, for instance in the arguments of an
elog() for some nearly-can't-happen case.

regards, tom lane




Re: Custom tuplesorts for extensions

2022-07-06 Thread Matthias van de Meent
On Thu, 23 Jun 2022 at 14:12, Maxim Orlov  wrote:
>
> Hi!
>
> I've reviewed the patchset and noticed some minor issues:
> - extra semicolon in macro (lead to warnings)
> - comparison of var isWorker should be done in different way
>
> Here is an upgraded version of the patchset.
>
> Overall, I consider this patchset useful. Any opinions?

All of the patches are currently missing descriptive commit messages,
which I think is critical for getting this committed. As for per-patch
comments:

0001: This patch removes copytup, but it is not quite clear why -
please describe the reasoning in the commit message.

0002: getdatum1 needs comments on what it does. From the name, it
would return the datum1 from a sorttuple, but that's not what it does;
a better name would be putdatum1 or populatedatum1.

0003: in the various tuplesort_put*tuple[values] functions, the datum1
field is manually extracted. Shouldn't we use the getdatum1 functions
from 0002 instead? We could use either them directly to allow
inlining, or use the indirection through tuplesortstate.

0004: Needs a commit message, but otherwise seems fine.

0005:
> +struct TuplesortOps

This struct has no comment on what it is. Something like "Public
interface of tuplesort operators, containing data directly accessible
to users of tuplesort" should suffice, but feel free to update the
wording.

> +void *arg;
> +};

This field could use a comment on what it is used for, and how to use it.

> +struct Tuplesortstate
> +{
> +TuplesortOps ops;

This field needs a comment too.

0006: Needs a commit message, but otherwise seems fine.

Kind regards,

Matthias van de Meent




Re: transformLockingClause() bug

2022-07-06 Thread Tom Lane
Dean Rasheed  writes:
> The problem is that the parser has generated a join rte with
> eref->aliasname = "unnamed_join", and then transformLockingClause()
> finds that before finding the relation rte for t3 whose user-supplied
> alias is also "unnamed_join".

> I think the answer is that transformLockingClause() should ignore join
> rtes that don't have a user-supplied alias, since they are not visible
> as relation names in the query (and then [1] will want to do the same
> for subquery and values rtes without aliases).

Agreed.

> Except, if the rte has a join_using_alias (and no regular alias), I
> think transformLockingClause() should actually be matching on that and
> then throwing the above error. So for the following:

Yeah, that's clearly an oversight in the join_using_alias patch.

regards, tom lane




Re: Backup command and functions can cause assertion failure and segmentation fault

2022-07-06 Thread Fujii Masao




On 2022/07/01 15:41, Michael Paquier wrote:

On Fri, Jul 01, 2022 at 03:32:50PM +0900, Fujii Masao wrote:

Sounds good idea to me. I updated the patch in that way. Attached.


Skimming quickly through the thread, this failure requires a
termination of a backend running BASE_BACKUP.  This is basically
something done by the TAP test added in 0475a97f with a WAL sender
killed, and MAX_RATE being used to make sure that we have enough time
to kill the WAL sender even on fast machines.  So you could add a
regression test, no?


For the test, BASE_BACKUP needs to be canceled after it finishes
do_pg_backup_start(), i.e., checkpointing, and before it calls
do_pg_backup_stop(). So the cancellation timing is tighter than in the
test added in 0475a97f. I'm afraid such a test could easily cancel the
BASE_BACKUP while it's still performing a checkpoint in
do_pg_backup_start(). So for now I'd like to avoid such an unstable test.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: AIX support - alignment issues

2022-07-06 Thread Tom Lane
I wrote:
> Our HEAD does work on that NetBSD installation.  I can try this
> patch, but it'll take an hour or two to get results ... stay tuned.

Indeed, I still get a clean build and "make check" passes with
this patch.

regards, tom lane




transformLockingClause() bug

2022-07-06 Thread Dean Rasheed
While doing more testing of [1], I realised that it has a bug, which
reveals a pre-existing problem in transformLockingClause():

CREATE TABLE t1(a int);
CREATE TABLE t2(a int);
CREATE TABLE t3(a int);

SELECT 1
FROM t1 JOIN t2 ON t1.a = t2.a,
 t3 AS unnamed_join
FOR UPDATE OF unnamed_join;

ERROR:  FOR UPDATE cannot be applied to a join

which is wrong, because it should lock t3.

Similarly:

SELECT foo.*
FROM t1 JOIN t2 USING (a) AS foo,
 t3 AS unnamed_join
FOR UPDATE OF unnamed_join;

ERROR:  FOR UPDATE cannot be applied to a join


The problem is that the parser has generated a join rte with
eref->aliasname = "unnamed_join", and then transformLockingClause()
finds that before finding the relation rte for t3 whose user-supplied
alias is also "unnamed_join".

I think the answer is that transformLockingClause() should ignore join
rtes that don't have a user-supplied alias, since they are not visible
as relation names in the query (and then [1] will want to do the same
for subquery and values rtes without aliases).

Except, if the rte has a join_using_alias (and no regular alias), I
think transformLockingClause() should actually be matching on that and
then throwing the above error. So for the following:

SELECT foo.*
FROM t1 JOIN t2 USING (a) AS foo,
 t3 AS unnamed_join
FOR UPDATE OF foo;

ERROR:  relation "foo" in FOR UPDATE clause not found in FROM clause

the error should actually be

ERROR:  FOR UPDATE cannot be applied to a join


So something like the attached.

Thoughts?

Regards,
Dean

[1] 
https://www.postgresql.org/message-id/flat/CAEZATCUCGCf82=hxd9n5n6xghpyypqnxw8hneeh+up7ynal...@mail.gmail.com
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
new file mode 100644
index 1bcb875..8ed2c4b
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -3291,11 +3291,28 @@ transformLockingClause(ParseState *pstat
 			foreach(rt, qry->rtable)
 			{
 				RangeTblEntry *rte = (RangeTblEntry *) lfirst(rt);
+				char	   *rtename;
 
 				++i;
 				if (!rte->inFromCl)
 					continue;
-				if (strcmp(rte->eref->aliasname, thisrel->relname) == 0)
+
+				/*
+				 * A join RTE without an alias is not visible as a relation
+				 * name and needs to be skipped (otherwise it might hide a
+				 * base relation with the same name), except if it has a USING
+				 * alias, which *is* visible.
+				 */
+				if (rte->rtekind == RTE_JOIN && rte->alias == NULL)
+				{
+					if (rte->join_using_alias == NULL)
+						continue;
+					rtename = rte->join_using_alias->aliasname;
+				}
+				else
+					rtename = rte->eref->aliasname;
+
+				if (strcmp(rtename, thisrel->relname) == 0)
 				{
 					switch (rte->rtekind)
 					{
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 2538bd6..1f0df6b
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2501,6 +2501,39 @@ ERROR:  column t1.x does not exist
 LINE 1: select t1.x from t1 join t3 on (t1.a = t3.x);
^
 HINT:  Perhaps you meant to reference the column "t3.x".
+-- Test matching of locking clause with wrong alias
+select t1.*, t2.*, unnamed_join.* from
+  t1 join t2 on (t1.a = t2.a), t3 as unnamed_join
+  for update of unnamed_join;
+ a | b | a | b | x | y 
+---+---+---+---+---+---
+(0 rows)
+
+select foo.*, unnamed_join.* from
+  t1 join t2 using (a) as foo, t3 as unnamed_join
+  for update of unnamed_join;
+ a | x | y 
+---+---+---
+(0 rows)
+
+select foo.*, unnamed_join.* from
+  t1 join t2 using (a) as foo, t3 as unnamed_join
+  for update of foo;
+ERROR:  FOR UPDATE cannot be applied to a join
+LINE 3:   for update of foo;
+^
+select bar.*, unnamed_join.* from
+  (t1 join t2 using (a) as foo) as bar, t3 as unnamed_join
+  for update of foo;
+ERROR:  relation "foo" in FOR UPDATE clause not found in FROM clause
+LINE 3:   for update of foo;
+^
+select bar.*, unnamed_join.* from
+  (t1 join t2 using (a) as foo) as bar, t3 as unnamed_join
+  for update of bar;
+ERROR:  FOR UPDATE cannot be applied to a join
+LINE 3:   for update of bar;
+^
 --
 -- regression test for 8.1 merge right join bug
 --
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
new file mode 100644
index a27a720..b5f41c4
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -520,6 +520,28 @@ select * from t1 left join t2 on (t1.a =
 
 select t1.x from t1 join t3 on (t1.a = t3.x);
 
+-- Test matching of locking clause with wrong alias
+
+select t1.*, t2.*, unnamed_join.* from
+  t1 join t2 on (t1.a = t2.a), t3 as unnamed_join
+  for update of unnamed_join;
+
+select foo.*, unnamed_join.* from
+  t1 join t2 using (a) as foo, t3 as unnamed_join
+  for update of unnamed_join;
+
+select foo.*, unnamed_join.* from
+  t1 join t2 using (a) as foo, t3 as unnamed_join
+  for update of foo;
+
+select bar.*, unnamed_join.* from
+  (t1 join t2 using 

Re: Making the subquery alias optional in the FROM clause

2022-07-06 Thread Dean Rasheed
On Tue, 5 Jul 2022 at 19:00, Tom Lane  wrote:
>
> Dean Rasheed  writes:
> > This was discussed previously in [1], and there seemed to be general
> > consensus in favour of it, but no new patch emerged.
>
> As I said in that thread, I'm not super enthused about this, but I was
> clearly in the minority so I think it should go forward.
>

Cool. Thanks for looking.


> > Attached is a patch that takes the approach of not generating an alias
> > at all, which seems to be neater and simpler, and less code than
> > trying to generate a unique alias.
>
> Hm.  Looking at the code surrounding what you touched, I'm reminded
> that we allow JOIN nodes to not have an alias, and represent that
> situation by rte->alias == NULL.  I wonder if it'd be better in the
> long run to make alias-less subqueries work similarly,

That is what the patch does: transformRangeSubselect() passes a NULL
alias to addRangeTableEntryForSubquery(), which has been modified to
cope with that in a similar way to addRangeTableEntryForJoin() and
other addRangeTableEntryFor...() functions.

So for example, with the following query, this is what the output from
the parser looks like:

SELECT * FROM (SELECT 1);

query->rtable:
  rte:
rtekind = RTE_SUBQUERY
alias = NULL
eref = { aliasname = "subquery", colnames = ... }


> rather than
> generating something that after-the-fact will be indistinguishable
> from a user-written alias.  If that turns out to not work well,
> I'd agree with "unnamed_subquery" as the inserted name.
>

The result is distinguishable from a user-written alias, because
rte->alias is NULL. I think the confusion is that when I suggested
using "unnamed_subquery", I was referring to rte->eref->aliasname, and
I still think it's a good idea to change that, for consistency with
unnamed joins.
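
For illustration, a sketch of how code can tell the two cases apart (using
the fields shown in the parser output above):

/* parser-generated name only: no user-written alias present */
if (rte->rtekind == RTE_SUBQUERY && rte->alias == NULL)
{
	/* rte->eref->aliasname is just the fallback name, e.g. "subquery" */
}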


> Also, what about VALUES clauses?  It seems inconsistent to remove
> this restriction for sub-SELECT but not VALUES.  Actually it looks
> like your patch already does remove that restriction in gram.y,
> but you didn't follow through elsewhere.
>

It does support unnamed VALUES clauses in the FROM list (there's a
regression test exercising that). It wasn't necessary to make any
additional code changes because addRangeTableEntryForValues() already
supported having a NULL alias, and it all just flowed through.

In fact, the grammar forces you to enclose a VALUES clause in the FROM
list in parentheses, so this ends up being an unnamed subquery in the
FROM list as well. For example:

SELECT * FROM (VALUES(1),(2),(3));

produces

query->rtable:
  rte:
rtekind = RTE_SUBQUERY
alias = NULL
eref = { aliasname = "subquery", colnames = ... }
subquery->rtable:
  rte:
rtekind = RTE_VALUES
alias = NULL
eref = { aliasname = "*VALUES*", colnames = ... }

So it's not really any different from a normal subquery.


> As far as the docs go, I think it's sufficient to mention the
> inconsistency with SQL down at the bottom; we don't need a
> redundant in-line explanation.

OK, fair enough.

I'll post an update in a little while, but first, I found a bug, which
revealed a pre-existing bug in transformLockingClause(). I'll start a
new thread for that, since it'd be good to get that resolved
independently of this patch.

Regards,
Dean




Re: AIX support - alignment issues

2022-07-06 Thread Tom Lane
Peter Eisentraut  writes:
> On 06.07.22 04:21, Thomas Munro wrote:
>> /*
>> * Do not try to collapse these into one "w+" mode file. Doesn't work on
>> - * some platforms (eg, HPUX 10.20).
>> + * some platforms.
>> */
>> termin = fopen("/dev/tty", "r");
>> termout = fopen("/dev/tty", "w");

> I don't know how /dev/tty behaves in detail under stdio.  I think 
> removing this part of the comment might leave the impression that 
> attempting to use "w+" will never work, whereas the existing comment 
> appears to indicate that it was only very old platforms that had the 
> issue.  If we don't have an immediate answer to that, I'd leave the 
> comment as is.

Yeah, I was kind of wondering whether we should give w+ a try now.
IIRC, the code was like that at one point, but we had to change it
(ie the comment comes from bitter experience).  On the other hand,
it's probably not worth the trouble and risk to change it again.
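
For reference, the collapsed variant under discussion would be something like
this (a sketch only; per the existing comment, this failed on some platforms):

/* one stream opened for both reading and writing */
termin = fopen("/dev/tty", "w+");
termout = termin;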

regards, tom lane




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-06 Thread Masahiko Sawada
On Tue, Jul 5, 2022 at 5:09 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-07-05 16:33:17 +0900, Masahiko Sawada wrote:
> > On Tue, Jul 5, 2022 at 6:18 AM Andres Freund  wrote:
> > A datum value is convenient to represent both a pointer and a value so
> > I used it to avoid defining node types for inner and leaf nodes
> > separately.
>
> I'm not convinced that's a good goal. I think we're going to want to have
> different key and value types, and trying to unify leaf and inner nodes is
> going to make that impossible.
>
> Consider e.g. using it for something like a buffer mapping table - your key
> might be way too wide to fit it sensibly into 64bit.

Right. It seems better to have an interface so that the user of the
radix tree can specify an arbitrary key size (and perhaps value size
too?) on creation. And we can have separate leaf node types that hold
values instead of pointers. If the value size is no larger than the
pointer size, we can store values inline within leaf nodes; if it's
bigger, the leaf node can instead hold pointers to memory where the
values are stored.
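
For illustration, a rough sketch of that layout (all names invented here,
not actual code):

typedef struct rt_node
{
	uint8	shift;		/* level of this node in the tree */
	uint16	count;		/* number of used slots */
} rt_node;

typedef struct rt_inner_node
{
	rt_node	header;
	rt_node *children[256];		/* pointers to lower-level nodes */
} rt_inner_node;

typedef struct rt_leaf_node
{
	rt_node	header;
	/* if value_size <= sizeof(void *), store values inline ... */
	uint64	values[256];
} rt_leaf_node;

typedef struct rt_leaf_node_ext
{
	rt_node	header;
	/* ... otherwise, point to separately allocated values */
	void   *values[256];
} rt_leaf_node_ext;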

>
>
> > Since a datum could be 4 bytes or 8 bytes depending it might not be good for
> > some platforms.
>
> Right - that's another good reason why it's problematic. A lot of key types
> aren't going to be 4/8 bytes dependent on 32/64bit, but either / or.
>
>
> > > > +void
> > > > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool 
> > > > *found_p)
> > > > +{
> > > > + int shift;
> > > > + boolreplaced;
> > > > + radix_tree_node *node;
> > > > + radix_tree_node *parent = tree->root;
> > > > +
> > > > + /* Empty tree, create the root */
> > > > + if (!tree->root)
> > > > + radix_tree_new_root(tree, key, val);
> > > > +
> > > > + /* Extend the tree if necessary */
> > > > + if (key > tree->max_val)
> > > > + radix_tree_extend(tree, key);
> > >
> > > FWIW, the reason I used separate functions for these in the prototype is 
> > > that
> > > it turns out to generate a lot better code, because it allows non-inlined
> > > function calls to be sibling calls - thereby avoiding the need for a 
> > > dedicated
> > > stack frame. That's not possible once you need a palloc or such, so 
> > > splitting
> > > off those call paths into dedicated functions is useful.
> >
> > Thank you for the info. How much does using sibling call optimization
> > help the performance in this case? I think that these two cases are
> > used only a limited number of times: inserting the first key and
> > extending the tree.
>
> It's not that it helps in the cases moved into separate functions - it's that
> not having that code in the "normal" paths keeps the normal path faster.

Thanks, understood.
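
For illustration, a rough sketch of that shape (names invented; based on the
description above, not on the actual prototype):

static pg_noinline void rt_insert_into_empty(radix_tree *tree, uint64 key, Datum val);
static pg_noinline void rt_extend_and_insert(radix_tree *tree, uint64 key, Datum val);

void
rt_insert(radix_tree *tree, uint64 key, Datum val)
{
	if (unlikely(tree->root == NULL))
	{
		/* tail position: compilable as a sibling call, no extra stack frame */
		rt_insert_into_empty(tree, key, val);
		return;
	}

	if (unlikely(key > tree->max_val))
	{
		rt_extend_and_insert(tree, key, val);
		return;
	}

	/* ... common path: descend and store the value ... */
}

Keeping the allocating paths out of line is what keeps the normal path small.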

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Fix unnecessary includes and comments in 019_replslot_limit.pl, 007_wal.pl and 004_timeline_switch.pl

2022-07-06 Thread Maxim Orlov
>
> I'm sorry for some nitpicking about changes in the comments:
> - The number of WAL segments advanced hasn't changed from 5 to 1; it still
> advances by 1+4 as previously. So the original comment is right. I reverted
> this in v2.
>

Yeah, it looks even better now.

-- 
Best regards,
Maxim Orlov.


Re: Fix unnecessary includes and comments in 019_replslot_limit.pl, 007_wal.pl and 004_timeline_switch.pl

2022-07-06 Thread Pavel Borisov
On Wed, Jul 6, 2022 at 5:05 PM Maxim Orlov  wrote:

> Hi!
>
> This is an obvious change; I'm totally for it. Hope it will be committed soon.
>

I'm sorry for some nitpicking about changes in the comments:
- The number of WAL segments advanced hasn't changed from 5 to 1; it still
advances by 1+4 as previously. So the original comment is right. I reverted
this in v2.
- wal_segment_size is in bytes, so the comment "(wal_segment_size * n) MB" is
incorrect. I corrected this to bytes.

PFA v2 of the patch (only comments changed/reverted to original).
Overall I completely agree with Maxim: the patch is good and simple enough
to be RfC.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


v2-0001-Fix-unnecessary-includes-and-comments-in-019_repl.patch
Description: Binary data


Re: [RFC] building postgres with meson -v9

2022-07-06 Thread Andres Freund
Hi

On 2022-07-06 11:03:31 +0200, Peter Eisentraut wrote:
> On 01.07.22 11:33, Andres Freund wrote:
> > Attached is an updated version of the meson patchset. There has been a 
> > steady
> > stream of incremental work over the last month, with patches from Peter
> > Eisentraut and Nazir Yavuz.
> > 
> > I tried to address the review comments Peter had downthread about the prep
> > patches. The one that I know is still outstanding is that there's still
> > different ways of passing output directories as parameters to a bunch of
> > scripts added, will resolve that next (some have been fixed).
> 
> Here is my rough assessment of where we are with this patch set:
> 
> 08b4330ded prereq: deal with \ paths in basebackup_to_shell tests.
> 
> This still needs clarification, per my previous review.

Hm. I thought I had explained that bit, but apparently not. Well, it's pretty
simple - without this, the tests fail on Windows for me as soon as one of the
binaries is in a directory with spaces (which is common on Windows). Imagine
what happens with e.g.
  qq{$gzip --fast > "$escaped_backup_path%f.gz"}
if $gzip contains spaces - the unquoted path gets split into multiple shell words.


This doesn't happen currently on CI because nothing runs these tests on
windows yet.


> bda6a45bae meson: prereq: Refactor PG_TEST_EXTRA logic in autoconf build
> 
> I understand the intention behind this, but I think it changes the
> behavior in an undesirable way.  Before this patch, you can go into
> src/test/ssl/ and run make check manually.  This was indeed the only
> way to do it before PG_TEST_EXTRA.  With this patch, this would now
> skip all the tests unless you set PG_TEST_EXTRA, even if you run the
> specific test directly.

It's not a free lunch, I agree. But I think the upsides outweigh the downsides
by far. Not seeing that tests were skipped in the test output is quite
problematic imo. And with meson's testrunner we're going to need something
that deals with skipping these tests - and it's important to have them
reported as skipped, so that they show up in the test summary.

It's not like it's hard to set PG_TEST_EXTRA for a single command invocation
(e.g., PG_TEST_EXTRA=ssl make check in src/test/ssl)?



> 243f99da38 wip: split TESTDIR into two.
> 
> This one has already caused a bit of confusion, but the explanation at
> 
> https://www.postgresql.org/message-id/flat/2022060122.td2ato4wjqf7afnv%40alap3.anarazel.de#1f250dee73cf0da29a6d2c020c3bde08
> 
> seems reasonable.  But it clearly needs further work.

Yea. I kind of want to get some of the preparatory stuff out of the way first.


> 88dd280835 meson: Add meson based buildsystem.
> 1ee3073a3c meson: ci: Build both with meson and as before.
> 
> These are for later. ;-)
> 
> In the meantime, also of interest to this effort:
> 
> - If we're planning to remove the postmaster symlink in PG16, maybe we
>   should start a discussion on that.

Yea.


> - This patch is for unifying the list of languages in NLS, as
>   previously discussed: https://commitfest.postgresql.org/38/3737/

There seems little downside to doing so, so ...

Greetings,

Andres Freund




Re: First draft of the PG 15 release notes

2022-07-06 Thread Bruce Momjian
On Tue, Jul  5, 2022 at 07:45:57PM -0700, Noah Misch wrote:
> On Tue, Jul 05, 2022 at 07:47:52PM -0400, Bruce Momjian wrote:
> > Yes, I think it is a question of practicality vs. desirability.  We are
> > basically telling people they have to do research to get the old
> > behavior in their new databases and clusters.
> 
> True.  I want to maximize the experience for different classes of database:
> 
> 1. Databases needing user isolation and unknowingly not getting it.
> 2. Databases not needing user isolation, e.g. automated test environments.
> 
> Expecting all of these DBAs to read a 500-word doc section is failure-prone.
> For the benefit of (2), I'm now thinking about adding a release note sentence,
> "For a new database having zero need to defend against insider threats,
> granting back the privilege yields the PostgreSQL 14 behavior."

I think you would need to say "previous behavior" since people might be
upgrading from releases before PG 14.  I also would change "In existing
databases" to "For existing databases".  I think your big risk here is
trying to explain how to have new clusters get the old or new behavior
in the same text block, e.g.:

 The new default is one of the secure schema usage patterns that has been
 recommended since the security release for CVE-2018-1058.  Upgrading a
 cluster or restoring a
 database dump will preserve existing permissions.  This is a change in
 the default for newly-created databases in existing clusters and for new
 clusters.  For existing databases, especially those having multiple
 users, consider issuing a REVOKE to adopt this new
 default.  (USAGE permission on this schema has not
 changed.)  For a new database having zero need to defend against insider
 threats, granting back the privilege yields the previous behavior.

Is this something we want to get into in the release notes, or perhaps
do we need to link to a wiki page for these details?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Fix unnecessary includes and comments in 019_replslot_limit.pl, 007_wal.pl and 004_timeline_switch.pl

2022-07-06 Thread Maxim Orlov
Hi!

This is an obvious change; I'm totally for it. Hope it will be committed soon.

-- 
Best regards,
Maxim Orlov.


Re: AIX support - alignment issues

2022-07-06 Thread Peter Eisentraut

On 06.07.22 04:21, Thomas Munro wrote:

/*
 * Do not try to collapse these into one "w+" mode file. Doesn't work on
-* some platforms (eg, HPUX 10.20).
+* some platforms.
 */
termin = fopen("/dev/tty", "r");
termout = fopen("/dev/tty", "w");


I don't know how /dev/tty behaves in detail under stdio.  I think 
removing this part of the comment might leave the impression that 
attempting to use "w+" will never work, whereas the existing comment 
appears to indicate that it was only very old platforms that had the 
issue.  If we don't have an immediate answer to that, I'd leave the 
comment as is.






Re: Custom tuplesorts for extensions

2022-07-06 Thread Maxim Orlov
Hi!

Overall the patch looks good; let's mark it as ready for committer, shall we?

-- 
Best regards,
Maxim Orlov.


Re: doc: Fix description of how the default user name is chosen

2022-07-06 Thread Peter Eisentraut

On 06.07.22 14:30, Matthias van de Meent wrote:

If we're going to change this anyway, could we replace 'user name'
with 'username' in the connection documentation? It irks me to see so
much 'user name' while our connection parameter is 'username', and we
use the username of the OS user, not the OS user's (display) name - or
at least, that's how it behaved under Linux last time I checked.


This might make sense if you are referring specifically to the value of 
that connection option, and you mark it up accordingly.  Otherwise, the 
subtlety might get lost.






Re: doc: Fix description of how the default user name is chosen

2022-07-06 Thread Matthias van de Meent
On Wed, 6 Jul 2022 at 02:43, David G. Johnston
 wrote:
>
> On Tue, Jul 5, 2022 at 5:20 PM Tom Lane  wrote:
>>
>> "David G. Johnston"  writes:
>> > In passing, the authentication error examples use the phrase
>> > "database user name" in a couple of locations.  The word
>> > database in both cases is both unusual and unnecessary for
>> > understanding.  The reference to user name means the one in/for the
>> > database unless otherwise specified.
>>
>> I'm not convinced that just saying "user name" is an improvement.
>> The thing that we are trying to clarify in much of this section
>> is the relationship between your operating-system-assigned user
>> name and your database-cluster-assigned user name.  So just saying
>> "user name" adds an undesirable element of ambiguity.
>>
>>
>> Maybe we could change "database user name" to "Postgres user name"?
>
>
> I'm fine with just leaving "database user name" as no one seems to have the 
> same qualm with it that I do.  Besides, I just finished reading:
>
> https://www.postgresql.org/docs/current/client-authentication.html
>
> and it seems pointless to leave that written as-is and gripe about the 
> specific change I was recommending.

If we're going to change this anyway, could we replace 'user name'
with 'username' in the connection documentation? It irks me to see so
much 'user name' while our connection parameter is 'username', and we
use the username of the OS user, not the OS user's (display) name - or
at least, that's how it behaved under Linux last time I checked.

>>
>>
>> -if you do not specify a database name, it defaults to the database
>> -user name, which might or might not be the right thing.
>> +if the database name shown matches the user name you are connecting
>> +as it is not by accident: the default database name is the
>> +user name.
>>
>> This does absolutely not seem like an improvement.
>
>
> In that case I don't see the need for any form of commentary beyond:
>
> "If you do not specify a database name it defaults to the database user name."

Agreed to both.
The rightness of the default can be judged either systematically ("we
should or should not default to the connection username instead of some
other default") or in the context of connection establishment ("this
connection should or should not connect to the database named after the
user's username").
The rightness of the systematic default doesn't matter in this context
(that's something to put in the comments of that code or to discuss on
-hackers), and the contextual default was already proven wrong in this
configuration of server and client, by the very failure to connect to
that defaulted database.

Kind regards,

Matthias van de Meent




Re: pg15b2: large objects lost on upgrade

2022-07-06 Thread Robert Haas
On Wed, Jul 6, 2022 at 7:56 AM Justin Pryzby  wrote:
> I'm looking into it, but it'd help to hear suggestions about where to put it.
> My current ideas aren't very good.

In main() there is a comment that begins "Most failures happen in
create_new_objects(), which has just completed at this point." I am
thinking you might want to insert a new function call just before that
comment, like remove_orphaned_files() or tidy_up_new_cluster().

Another option could be to do something at the beginning of
transfer_all_new_tablespaces().

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re[2]: [PATCH] Optional OR REPLACE in CREATE OPERATOR statement

2022-07-06 Thread Svetlana Derevyanko

  
>Tuesday, July 5, 2022, 18:29 +03:00 from Tom Lane :
>
>Svetlana Derevyanko < s.derevya...@postgrespro.ru >
>writes:
>> It seems useful to have [OR REPLACE] option in CREATE OPERATOR statement, as 
>> in CREATE FUNCTION. This option may be good for writing extension update 
>> scripts, to avoid errors with re-creating the same operator.
>No, that's not acceptable. CREATE OR REPLACE should always produce
>exactly the same final state of the object, but in this case we cannot
>change the underlying function if the operator already exists.
>
>(At least, not without writing a bunch of infrastructure to update
>existing views/rules that might use the operator; which among other
>things would create a lot of deadlock risks.)
>
>regards, tom lane
Hello,
 
> CREATE OR REPLACE should always produce exactly the same final state of the 
> object,
> but in this case we cannot change the underlying function if the operator 
> already exists.
   
Do you mean that for an existing operator, CREATE OR REPLACE should be the same
as DROP OPERATOR plus CREATE OPERATOR, with relevant re-creation of the
existing views/rules/... that use this operator? If yes, what exactly is wrong
with changing only the RESTRICT and JOIN parameters (or is the problem possible
user confusion from attempts to change something more)? If no, could you please
clarify what "final state" means here?
 
Also, if OR REPLACE is unacceptable, then what do you think about an
IF NOT EXISTS option?
 
Thanks,
 
--
Svetlana Derevyanko
Postgres Professional:  http://www.postgrespro.com
The Russian Postgres Company

Re: pg15b2: large objects lost on upgrade

2022-07-06 Thread Justin Pryzby
On Tue, Jul 05, 2022 at 02:40:21PM -0400, Robert Haas wrote:
> On Tue, Jul 5, 2022 at 12:56 PM Justin Pryzby  wrote:
> > My patch also leaves a 0 byte file around from initdb, which is harmless, 
> > but
> > dirty.
> >
> > I've seen before where a bunch of 0 byte files are abandoned in an
> > otherwise-empty tablespace, with no associated relation, and I have to "rm"
> > them to be able to drop the tablespace.  Maybe that's a known issue, maybe 
> > it's
> > due to crashes or other edge case, maybe it's of no consequence, and maybe 
> > it's
> > already been fixed or being fixed already.  But it'd be nice to avoid 
> > another
> > way to have a 0 byte files - especially ones named with system OIDs.
> 
> Do you want to add something to the patch to have pg_upgrade remove
> the stray file?

I'm looking into it, but it'd help to hear suggestions about where to put it.
My current ideas aren't very good.

-- 
Justin




Re: Handle infinite recursion in logical replication setup

2022-07-06 Thread Amit Kapila
On Tue, Jul 5, 2022 at 9:33 PM vignesh C  wrote:
>
> Since the existing test is already handling the verification of this
> scenario, I felt no need to add the test. Updated v29 patch removes
> the 0001 patch which had the test case.
>

I have again looked at the first patch and it looks good to me. I would like
to push it after some more review but before that, I would like to
know if someone else has any suggestions/objections to this patch, so
let me summarize the idea of the first patch. It will allow users to
skip the replication of remote data. Here remote data is the data that
the publisher node has received from some other node. The primary use
case is to avoid loops (infinite replication of the same data) among
nodes as shown in the initial email of this thread.

To achieve that this patch adds a new SUBSCRIPTION parameter "origin".
It specifies whether the subscription will request the publisher to
only send changes that originated locally or to send changes
regardless of origin. Setting it to "local" means that the
subscription will request the publisher to only send changes that
originated locally. Setting it to "any" means that the publisher sends
changes regardless of their origin. The default is "any".

Usage:
CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres port='
PUBLICATION pub1 WITH (origin = local);

For now, even though the "origin" parameter allows only "local" and
"any" values, it is implemented as a string type so that the parameter
can be extended in future versions to support filtering using origin
names specified by the user.

This feature allows filtering only the replication data that originates
from WAL, but for the initial sync (initial copy of table data) we don't
have such a facility, since we can only distinguish data based on its
origin in WAL. As a separate patch (v29-0002*), we are planning to
forbid the initial sync if we notice that the publication tables were
also replicated from other publishers, to avoid duplicate data or
loops. Copying will still be allowed in such cases with the 'force'
option.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-07-06 Thread Dilip Kumar
On Wed, Jul 6, 2022 at 2:48 PM Amit Kapila  wrote:
>
> On Wed, Jul 6, 2022 at 1:47 PM Dilip Kumar  wrote:
> >
> > On Wed, Jul 6, 2022 at 9:06 AM Amit Kapila  wrote:
> > >
> > > How would you choose the slot name for the table sync, right now it
> > > contains the relid of the table for which it needs to perform sync?
> > > Say, if we ignore to include the appropriate identifier in the slot
> > > name, we won't be able to resue/drop the slot after restart of table
> > > sync worker due to an error.
> >
> > I had a quick look into the patch and it seems it is using the worker
> > array index instead of relid while forming the slot name, and I think
> > that make sense, because now whichever worker is using that worker
> > index can reuse the slot created w.r.t that index.
> >
>
> I think that won't work because the slot used by a given relation won't
> be fixed across restarts. Now, it is possible that we may drop the wrong
> slot if the state of the rel being copied is SUBREL_STATE_DATASYNC.

So it would drop the previous slot that the worker at that index was using,
and it is possible that some relation on that slot was already at
SUBREL_STATE_FINISHEDCOPY or so, and we would drop that slot. Because the
relid-to-replication-slot association is no longer 1:1, it would be wrong
to drop a slot based on the relstate picked up by this worker. In short,
what you have pointed out makes sense.

Also, it is possible
> that while creating a slot, we fail because the same name slot already
> exists due to some other worker which has created that slot has been
> restarted. Also, what about origin_name, won't that have similar
> problems? Also, if the state is already SUBREL_STATE_FINISHEDCOPY, if
> the slot is not the same as we have used in the previous run of a
> particular worker, it may start WAL streaming from a different point
> based on the slot's confirmed_flush_location.

Yeah, this is also true: when a tablesync worker has to catch up after
completing the copy, it might stream from the wrong LSN.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: automatically generating node support functions

2022-07-06 Thread Peter Eisentraut

On 06.07.22 02:54, Tom Lane wrote:

It might be enough to invent a struct-level attribute allowing
manual assignment of node tags, ie

typedef struct MyNewNode pg_node_attr(nodetag=466)

where it'd be the programmer's responsibility to pick a nonconflicting
tag number.  We'd only ever use that in ABI-frozen branches, so
manual assignment of the tag value should be workable.


Yes, I'm aware of this issue, and that was also more or less my idea.

(Well, before the introduction of per-struct attributes, I was thinking 
about parsing nodes.h to see if the tag is listed explicitly.  But this 
is probably better.)





Re: automatically generating node support functions

2022-07-06 Thread Peter Eisentraut

The new patch addresses almost all of these issues.

> Also, I share David's upthread allergy to the option names
> "path_hackN" and to documenting those only inside the conversion
> script.

I have given these real names now and documented them with the other 
attributes.


> BTW, I think this: "Unknown attributes are ignored" is a seriously
> bad idea; it will allow typos to escape detection.

fixed

(I have also changed the inside of pg_node_attr to be comma-separated, 
rather than space-separated.  This matches better how attribute-type 
things look in C.)



I think we ought to put it into the *nodes.h headers as much as
possible, perhaps like this:

typedef struct A_Const pg_node_attr(custom_copy)
{ ...


done


So I propose that we handle these things via struct-level pg_node_attr
markers, rather than node-type lists embedded in the script:

abstract_types
no_copy
no_read_write
no_read
custom_copy
custom_readwrite


done (no_copy is actually no_copy_equal, hence renamed)


The hacks for scalar-copying EquivalenceClass*, EquivalenceMember*,
struct CustomPathMethods*, and CustomScan.methods should be replaced
with "pg_node_attr(copy_as_scalar)" labels on affected fields.


Hmm, at least for Equivalence..., this is repeated a bunch of times for 
each field.  I don't know if this is really a property of the type or 
something you can choose for each field? [not changed in v7 patch]



I wonder whether this:

 # We do not support copying Path trees, mainly
 # because the circular linkages between RelOptInfo
 # and Path nodes can't be handled easily in a
 # simple depth-first traversal.

couldn't be done better by inventing an inheritable no_copy attr
to attach to the Path supertype.  Or maybe it'd be okay to just
automatically inherit the no_xxx properties from the supertype?


This is an existing comment in copyfuncs.c.  I haven't looked into it 
any further.



I don't terribly like the ad-hoc mechanism for not comparing
CoercionForm fields.  OTOH, I am not sure whether replacing it
with per-field equal_ignore attrs would be better; there's at least
an argument that that invites bugs of omission.  But implementing
this with an uncommented test deep inside a script that most hackers
should not need to read is not good.  On the whole I'd lean towards
the equal_ignore route.


The definition of CoercionForm in primnodes.h says that the comparison 
behavior is a property of the type, so it needs to be handled somewhere 
centrally, not on each field. [not changed in v7 patch]



I'm confused by the "various field types to ignore" at the end
of the outfuncs/readfuncs code.  Do we really ignore those now?
How could that be safe?  If it is safe, wouldn't it be better
to handle that with per-field pg_node_attrs?  Silently doing
what might be the wrong thing doesn't seem good.


I have replaced these with explicit ignore markings in pathnodes.h 
(PlannerGlobal, PlannerInfo, RelOptInfo).  (This could then use a bit 
more rearranging some of the per-field comments.)



* copyfuncs.switch.c and equalfuncs.switch.c are missing trailing
newlines.


fixed


* pgindent is not very happy with a lot of your comments in *nodes.h.


fixed


* I think we should add explicit dependencies in backend/nodes/Makefile,
along the lines of

copyfuncs.o: copyfuncs.c copyfuncs.funcs.c copyfuncs.switch.c

Otherwise the whole thing is a big gotcha for anyone not using
--enable-depend.


fixed -- I think, could use more testing

From c82ee081a7a8cdc77b44f325d1df695b55a60b06 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 6 Jul 2022 12:13:32 +0200
Subject: [PATCH v7] Automatically generate node support functions

Add a script to automatically generate the node support functions
(copy, equal, out, and read, as well as the node tags enum) from the
struct definitions.

For each of the four node support files, it creates two include files,
e.g., copyfuncs.funcs.c and copyfuncs.switch.c, to include in the main
file.  All the scaffolding of the main file stays in place.

TODO: In this patch, I have only ifdef'ed out the code that could be
removed, mainly so that it won't constantly have merge conflicts.
Eventually, that should all be changed to delete the code.  All the
code comments that are worth keeping from those sections have already
been moved to the header files where the structs are defined.

I have tried to mostly make the coverage of the output match what is
currently there.  For example, one could now do out/read coverage of
utility statement nodes, but I have manually excluded those for now.
The reason is mainly that it's easier to diff the before and after,
and adding a bunch of stuff like this might require a separate
analysis and review.

Subtyping (TidScan -> Scan) is supported.

For the hard cases, you can just write a manual function and exclude
generating one.  For the not so hard cases, there is a way of
annotating struct fields to get 

Re: Logging query parmeters in auto_explain

2022-07-06 Thread Dagfinn Ilmari Mannsåker
On Wed, 6 Jul 2022, at 02:02, Michael Paquier wrote:
> On Fri, Jul 01, 2022 at 09:58:52AM +0900, Michael Paquier wrote:
>> I have kept things as I originally intended, and applied 0001 for the
>> refactoring pieces.
>
> And done as well with 0002.  So we are good for this thread.

Thanks!

- ilmari




Re: Fix proposal for comparaison bugs in PostgreSQL::Version

2022-07-06 Thread Jehan-Guillaume de Rorthais
On Tue, 5 Jul 2022 09:59:42 -0400
Andrew Dunstan  wrote:

> On 2022-07-03 Su 16:12, Jehan-Guillaume de Rorthais wrote:
> > On Sun, 3 Jul 2022 10:40:21 -0400
> > Andrew Dunstan  wrote:
> >  
> >> On 2022-06-29 We 05:09, Jehan-Guillaume de Rorthais wrote:  
> >>> On Tue, 28 Jun 2022 18:17:40 -0400
> >>> Andrew Dunstan  wrote:
> >>>
>  On 2022-06-28 Tu 16:53, Jehan-Guillaume de Rorthais wrote:
> > ...
> > A better fix would be to store the version internally as version_num
> > that are trivial to compute and compare. Please, find in attachment an
> > implementation of this.
> >
> > The patch is a bit bigger because it improved the devel version to
> > support rc/beta/alpha comparison like 14rc2 > 14rc1.
> >
> > Moreover, it adds a bunch of TAP tests to check various use cases.  
>  Nice catch, but this looks like massive overkill. I think we can very
>  simply fix the test in just a few lines of code, instead of a 190 line
>  fix and a 130 line TAP test.
> >>> I explained why the patch was a little bit larger than required: it fixes
> >>> the bugs and does a little bit more. The _version_cmp sub is shorter and
> >>> easier to understand, I use multi-line code where I could probably fold
> >>> them in a one-liner, added some comments... Anyway, I don't feel the
> >>> number of line changed is "massive". But I can probably remove some code
> >>> and shrink some other if it is really important...
> >>>
> >>> Moreover, to be honest, I don't mind the number of additional lines of TAP
> >>> tests. Especially since it runs really, really fast and doesn't hurt
> >>> day-to-day devs as it is independent from other TAP tests anyway. It could
> >>> be 1k, if it runs fast, is meaningful and helps avoiding futur
> >>> regressions, I would welcome the addition.
> >>
> >> I don't see the point of having a TAP test at all. We have TAP tests for
> >> testing the substantive products we test, not for the test suite
> >> infrastructure. Otherwise, where will we stop? Shall we have tests for
> >> the things that test the test suite?  
> > Tons of Perl modules have regression tests. When questioning where testing
> > should stop, it seems the Test::More module itself is not the last frontier:
> > https://github.com/Test-More/test-more/tree/master/t
> >
> > Moreover, the PostgreSQL::Version is not a TAP test module, but a module to
> > deal with PostgreSQL versions and compare them.
> >
> > Testing makes development faster as well when it comes to testing the code.
> > Instead of testing vaguely and manually, you can test a whole bunch of
> > situations and accumulate some more when you think of a new one or
> > when a bug is reported. Having TAP tests helps to make sure the code works
> > as expected.
> >
> > It helped me when creating my patch. With all due respect, I just don't
> > understand your arguments against them. The number-of-lines argument, or
> > questioning where testing should stop, doesn't hold much weight.
> 
> 
> There is not a single TAP test in our source code that is aimed at
> testing our test infrastructure as opposed to testing what we are
> actually in the business of building, and I'm not about to add one.

Whatever, it helped me during the dev process of fixing this bug. Remove
them if you are uncomfortable with them.

> Every added test consumes buildfarm cycles and space on the buildfarm
> server for the report, be it ever so small.

They were not supposed to enter the buildfarm cycles. As I wrote earlier, they
do not interfere with day-to-day dev activity.

> Every added test needs maintenance, be it ever so small. There's no such
> thing as a free test (apologies to Heinlein and others).

This is the first argument I can understand.

> > ...
> > But anyway, this is not the point. Using an array to compare versions where
> > we can use version_num seems like useless and buggy convolutions to me.
> 
> I think we'll just have to agree to disagree about it.

Noted.

Cheers,




Re: Use outerPlanState macro instead of referring to leffttree

2022-07-06 Thread Richard Guo
Thanks for reviewing this patch.

On Sat, Jul 2, 2022 at 5:32 AM Tom Lane  wrote:

> Richard Guo  writes:
> > In the executor code, we mix use outerPlanState macro and referring to
> > leffttree. Commit 40f42d2a tried to keep the code consistent by
> > replacing referring to lefftree with outerPlanState macro, but there are
> > still some outliers. This patch tries to clean them up.
>
> Seems generally reasonable, but what about righttree?  I find a few
> of those too with "grep".
>

Yes. We may do the same trick for righttree.


>
> Backing up a little bit, one thing not to like about the outerPlanState
> and innerPlanState macros is that they lose all semblance of type
> safety:
>
> #define innerPlanState(node)(((PlanState *)(node))->righttree)
> #define outerPlanState(node)(((PlanState *)(node))->lefttree)
>
> You can pass any pointer you want, and the compiler will not complain.
> I wonder if there's any trick (even a gcc-only one) that could improve
> on that.  In the absence of such a check, people might feel that
> increasing our reliance on these macros isn't such a hot idea.
>

Your concern makes sense. I think the outerPlan and innerPlan macros share
the same issue. I'm not sure if there is a way to do the type check.
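
One possibility, just as a sketch (nothing like this exists in the tree): an
inline accessor instead of the cast-based macro, so the compiler at least
verifies that a PlanState pointer is passed:

static inline PlanState *
outerPlanStateChecked(PlanState *node)
{
	return node->lefttree;
}

This only covers reads, though: the existing macros are also used as lvalues
(e.g. outerPlanState(hashstate) = ExecInitNode(...)), so those call sites
would need a separate setter or a different approach.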


>
> Now, the typical coding pattern you've used:
>
>  ExecReScanHash(HashState *node)
>  {
> +   PlanState  *outerPlan = outerPlanState(node);
>
> is probably reasonably secure against wrong-pointer slip-ups.  But
> I'm less convinced about that for in-line usages in the midst of
> a function, particularly in the common case that the function has
> a variable pointing to its Plan node as well as PlanState node.
> Would it make sense to try to use the local-variable style everywhere?
>

Do you mean a pattern like the one below?

  outerPlanState(hashstate) = ExecInitNode(outerPlan(node), estate, eflags);

It seems that this pattern is mostly used when initializing child nodes
with ExecInitNode(), and most calls to ExecInitNode() are using this
pattern as a convention. Not sure if it's better to change them to
local-variable style.

Thanks
Richard


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-07-06 Thread Amit Kapila
On Wed, Jul 6, 2022 at 1:47 PM Dilip Kumar  wrote:
>
> On Wed, Jul 6, 2022 at 9:06 AM Amit Kapila  wrote:
> >
> > How would you choose the slot name for the table sync, right now it
> > contains the relid of the table for which it needs to perform sync?
> > Say, if we ignore to include the appropriate identifier in the slot
> > name, we won't be able to resue/drop the slot after restart of table
> > sync worker due to an error.
>
> I had a quick look into the patch and it seems it is using the worker
> array index instead of relid while forming the slot name, and I think
> that makes sense, because now whichever worker is using that worker
> index can reuse the slot created w.r.t that index.
>

I think that won't work because the slot used by a given relation won't be
fixed across restarts. Now, it is possible that we may drop the wrong slot
if the state of the rel being copied is SUBREL_STATE_DATASYNC. Also, it is
possible that while creating a slot, we fail because a slot with the same
name already exists, created by some other worker that has since been
restarted. Also, what about origin_name, won't that have similar
the slot is not the same as we have used in the previous run of a
particular worker, it may start WAL streaming from a different point
based on the slot's confirmed_flush_location.

-- 
With Regards,
Amit Kapila.




Re: [RFC] building postgres with meson -v9

2022-07-06 Thread Peter Eisentraut

On 01.07.22 11:33, Andres Freund wrote:

Attached is an updated version of the meson patchset. There has been a steady
stream of incremental work over the last month, with patches from Peter
Eisentraut and Nazir Yavuz.

I tried to address the review comments Peter had downthread about the prep
patches. The one that I know is still outstanding is that there's still
different ways of passing output directories as parameters to a bunch of
scripts added, will resolve that next (some have been fixed).


Here is my rough assessment of where we are with this patch set:

08b4330ded prereq: deal with \ paths in basebackup_to_shell tests.

This still needs clarification, per my previous review.

3bf5b317d5 meson: prereq: Specify output directory for psql sql help script.
2e5ed807f8 meson: prereq: ecpg: add and use output directory argument 
for parse.pl.

4e7fab01c5 meson: prereq: msvc: explicit output file for pgflex.pl
cdcd3da4c4 meson: prereq: add output path arg in generate-lwlocknames.pl
1f655486e4 meson: prereq: generate-errcodes.pl: accept output file
e834c48758 meson: prereq: unicode: allow to specify output directory.

You said you are still finalizing these.  I think we can move ahead with 
these once that is done.


9f4a9b1749 meson: prereq: move snowball_create.sql creation into perl file.

This looks ready, except I think it needs to be hooked into the
distprep target, since it's now a Perl script running at build time.

8951a6721e meson: prereq: Refactor dtrace postprocessing make rules

This looks ready.

bda6a45bae meson: prereq: Refactor PG_TEST_EXTRA logic in autoconf build

I understand the intention behind this, but I think it changes the
behavior in an undesirable way.  Before this patch, you can go into
src/test/ssl/ and run make check manually.  This was indeed the only
way to do it before PG_TEST_EXTRA.  With this patch, this would now
skip all the tests unless you set PG_TEST_EXTRA, even if you run the
specific test directly.

I think this needs a different idea.

eb852cc023 meson: prereq: Can we get away with not export-all'ing libraries?

This is also at , which
hasn't seen any activity in a while.  I think this needs a resolution
one way or the other before we can proceed to the main act.

2cc276ced6 meson: prereq: add src/tools/gen_versioning_script.pl.

Note that in the make build system we can only use perl before
distprep.  So it's not clear whether a script like this would help
unify the code.  Of course, we could still use it with the
understanding that it will be separate.

351ac51a89 meson: prereq: remove LLVM_CONFIG from Makefile.global.in

This can be committed.  AFAICT, LLVM_CONFIG is only used within
configure.

dff7b5a960 meson: prereq: regress: allow to specify directory containing
expected files.


This could use a bit more explanation, but it doesn't look
controversial so far.

243f99da38 wip: split TESTDIR into two.

This one has already caused a bit of confusion, but the explanation at

https://www.postgresql.org/message-id/flat/2022060122.td2ato4wjqf7afnv%40alap3.anarazel.de#1f250dee73cf0da29a6d2c020c3bde08

seems reasonable.  But it clearly needs further work.

88dd280835 meson: Add meson based buildsystem.
1ee3073a3c meson: ci: Build both with meson and as before.

These are for later. ;-)

In the meantime, also of interest to this effort:

- If we're planning to remove the postmaster symlink in PG16, maybe we
  should start a discussion on that.

- This patch is for unifying the list of languages in NLS, as
  previously discussed: https://commitfest.postgresql.org/38/3737/




Re: [PATCH] Optimize json_lex_string by batching character copying

2022-07-06 Thread John Naylor
On Wed, Jul 6, 2022 at 12:18 PM Andres Freund  wrote:

> I think before committing something along those lines we should make the
> relevant bits also be applicable when ->strval is NULL, as several functions
> use that (notably json_in IIRC). Afaics we'd just need to move the strval
> check to be around the appendBinaryStringInfo().

That makes sense and is easy.
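
A sketch of the reshaped copy under that change (is_special() is a
placeholder invented here; the actual patch differs):

	/* scan a run of ordinary characters in one go */
	for (p = s; p < end && !is_special(*p); p++)
		;

	/* copy the chunk only when the caller wants the string value */
	if (lex->strval != NULL)
		appendBinaryStringInfo(lex->strval, s, p - s);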

> And it should simplify the
> function, because some of the relevant code is duplicated outside as well...

Not sure how far to take this, but I put the returnable paths inside
the "other" path, so only backslash will go back to the top.

Both the above changes are split into a new 0003 patch for easier
review, but in the end will likely be squashed with 0002.

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 3d8b39ff1c1a4abf9effc45323b293b62551770a Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 6 Jul 2022 08:35:24 +0700
Subject: [PATCH v4 1/4] Simplify json lexing state

Instead of updating the length as we go, use a const pointer to end of
the input, which we know already at the start
---
 src/common/jsonapi.c | 23 ---
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 98e4ef0942..eeedc0645a 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -519,26 +519,23 @@ JsonParseErrorType
 json_lex(JsonLexContext *lex)
 {
 	char	   *s;
-	int			len;
+	char	   *const end = lex->input + lex->input_length;
 	JsonParseErrorType result;
 
 	/* Skip leading whitespace. */
 	s = lex->token_terminator;
-	len = s - lex->input;
-	while (len < lex->input_length &&
-		   (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r'))
+	while (s < end && (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r'))
 	{
 		if (*s++ == '\n')
 		{
 			++lex->line_number;
 			lex->line_start = s;
 		}
-		len++;
 	}
 	lex->token_start = s;
 
 	/* Determine token type. */
-	if (len >= lex->input_length)
+	if (s >= end)
 	{
 		lex->token_start = NULL;
 		lex->prev_token_terminator = lex->token_terminator;
@@ -623,7 +620,7 @@ json_lex(JsonLexContext *lex)
 	 * the whole word as an unexpected token, rather than just
 	 * some unintuitive prefix thereof.
 	 */
-	for (p = s; p - s < lex->input_length - len && JSON_ALPHANUMERIC_CHAR(*p); p++)
+	for (p = s; p < end && JSON_ALPHANUMERIC_CHAR(*p); p++)
 		 /* skip */ ;
 
 	/*
@@ -672,7 +669,7 @@ static inline JsonParseErrorType
 json_lex_string(JsonLexContext *lex)
 {
 	char	   *s;
-	int			len;
+	char	   *const end = lex->input + lex->input_length;
 	int			hi_surrogate = -1;
 
 	if (lex->strval != NULL)
@@ -680,13 +677,11 @@ json_lex_string(JsonLexContext *lex)
 
 	Assert(lex->input_length > 0);
 	s = lex->token_start;
-	len = lex->token_start - lex->input;
 	for (;;)
 	{
 		s++;
-		len++;
 		/* Premature end of the string. */
-		if (len >= lex->input_length)
+		if (s >= end)
 		{
 			lex->token_terminator = s;
 			return JSON_INVALID_TOKEN;
@@ -704,8 +699,7 @@ json_lex_string(JsonLexContext *lex)
 		{
 			/* OK, we have an escape character. */
 			s++;
-			len++;
-			if (len >= lex->input_length)
+			if (s >= end)
 			{
 lex->token_terminator = s;
 return JSON_INVALID_TOKEN;
@@ -718,8 +712,7 @@ json_lex_string(JsonLexContext *lex)
 for (i = 1; i <= 4; i++)
 {
 	s++;
-	len++;
-	if (len >= lex->input_length)
+	if (s >= end)
 	{
 		lex->token_terminator = s;
 		return JSON_INVALID_TOKEN;
-- 
2.36.1

From 82e13b6bebd85a152ededcfd75495c0c0f642354 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 6 Jul 2022 15:50:09 +0700
Subject: [PATCH v4 4/4] Use vectorized lookahead in json_lex_string on x86

---
 src/common/jsonapi.c | 48 
 1 file changed, 48 insertions(+)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 81e176ad8d..44e8ed2b2f 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -24,6 +24,12 @@
 #include "miscadmin.h"
 #endif
 
+/*  WIP: put somewhere sensible and consider removing CRC from the names */
+#if (defined (__x86_64__) || defined(_M_AMD64)) && (defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK))
+#include 
+#define USE_SSE2
+#endif
+
 /*
  * The context of the parser is maintained by the recursive descent
  * mechanism, but is passed explicitly to the error reporting routine
@@ -842,12 +848,54 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else
 		{
+#ifdef USE_SSE2
+			__m128i		block,
+		has_backslash,
+		has_doublequote,
+		control,
+		has_control,
+		error_cum = _mm_setzero_si128();
+			const		__m128i backslash = _mm_set1_epi8('\\');
+			const		__m128i doublequote = _mm_set1_epi8('"');
+			const		__m128i max_control = _mm_set1_epi8(0x1F);
+#endif
 			/* start lookahead at current byte */
 			char	   *p = s;
 
 			if (hi_surrogate != -1)
 return JSON_UNICODE_LOW_SURROGATE;
 
+#ifdef USE_SSE2
+			while (p < end - sizeof(__m128i))
+			{

Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-06 Thread Amit Kapila
On Wed, Jul 6, 2022 at 12:19 PM Masahiko Sawada  wrote:
>
> On Tue, Jul 5, 2022 at 8:00 PM Amit Kapila  wrote:
> >
> > 2. Are we anytime removing transaction ids from catchanges->xip array?
>
> No.
>
> > If not, is there a reason for the same? I think we can remove it
> > either at commit/abort or even immediately after adding the xid/subxid
> > to committed->xip array.
>
> It might be a good idea but I'm concerned that removing XID from the
> array at every commit/abort or after adding it to committed->xip array
> might be costly as it requires adjustment of the array to keep its
> order. Removing XIDs from the array would make bsearch faster but the
> array is updated reasonably often (every 15 sec).
>

Fair point. However, I am slightly worried that we are unnecessarily
searching in this new array even when ReorderBufferTxn has the
required information. To avoid that, in function
SnapBuildXidHasCatalogChange(), we can first check
ReorderBufferXidHasCatalogChanges() and then check the array if the
first check doesn't return true. Also, by the way, do we need to
always keep builder->catchanges.xip updated via SnapBuildRestore()?
Isn't it sufficient that we just read and throw away contents from a
snapshot if builder->catchanges.xip is non-NULL?
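
For illustration, a sketch of that check order (names partly assumed from
this thread - e.g. the xcnt field - not actual committed code):

static bool
SnapBuildXidHasCatalogChange(SnapBuild *builder, TransactionId xid)
{
	/* cheap check against the in-memory transaction state first */
	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
		return true;

	/* only then consult the xid array restored from serialized snapshots */
	if (builder->catchanges.xip == NULL)
		return false;

	return bsearch(&xid, builder->catchanges.xip, builder->catchanges.xcnt,
				   sizeof(TransactionId), xidComparator) != NULL;
}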

I had additionally thought about whether we can further optimize this
solution to store this additional information only when we need to
serialize for the checkpoint record, but I think that won't work because
the walsender can restart even without a restart of the server, in which
case the same problem can occur. I am not sure if there is a way to
further optimize this solution; let me know if you have any ideas.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-07-06 Thread Dilip Kumar
On Wed, Jul 6, 2022 at 9:06 AM Amit Kapila  wrote:
>
> How would you choose the slot name for the table sync, right now it
> contains the relid of the table for which it needs to perform sync?
> Say, if we ignore to include the appropriate identifier in the slot
> name, we won't be able to resue/drop the slot after restart of table
> sync worker due to an error.

I had a quick look into the patch and it seems it is using the worker
array index instead of relid while forming the slot name, and I think
that makes sense, because now whichever worker is using that worker
index can reuse the slot created w.r.t that index.

> >
> > With those changes, I did some benchmarking to see if it improves anything.
> > This results compares this patch with the latest version of master branch. 
> > "max_sync_workers_per_subscription" is set to 2 as default.
> > Got some results simply averaging timings from 5 consecutive runs for each 
> > branch.
> >
> > First, tested logical replication with empty tables.
> > 10 tables
> > 
> > - master:286.964 ms
> > - the patch:116.852 ms
> >
> > 100 tables
> > 
> > - master:2785.328 ms
> > - the patch:706.817 ms
> >
> > 10K tables
> > 
> > - master:39612.349 ms
> > - the patch:12526.981 ms
> >
> >
> > Also tried replication tables with some data
> > 10 tables loaded with 10MB data
> > 
> > - master:1517.714 ms
> > - the patch:1399.965 ms
> >
> > 100 tables loaded with 10MB data
> > 
> > - master:16327.229 ms
> > - the patch:11963.696 ms
> >
> >
> > Then loaded more data
> > 10 tables loaded with 100MB data
> > 
> > - master:13910.189 ms
> > - the patch:14770.982 ms
> >
> > 100 tables loaded with 100MB data
> > 
> > - master:146281.457 ms
> > - the patch:156957.512
> >
> >
> > If tables are mostly empty, the improvement can be significant - up to 3x 
> > faster logical replication.
> > With some data loaded, it can still be faster to some extent.
> >
>
> These results indicate that it is a good idea, especially for very small 
> tables.
>
> > When the table size increases more, the advantage of reusing workers 
> > becomes insignificant.
> >
>
> It seems from your results that performance degrades for large
> relations. Did you try to investigate the reasons for the same?

Yeah, it would be interesting to know why there is a drop in some cases.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Patch proposal: New hooks in the connection path

2022-07-06 Thread Drouvot, Bertrand

Hi,

On 7/6/22 12:11 AM, Joe Conway wrote:


On 7/5/22 03:37, Bharath Rupireddy wrote:
On Mon, Jul 4, 2022 at 6:23 PM Drouvot, Bertrand 
 wrote:

On 7/2/22 1:00 AM, Nathan Bossart wrote:
> Could we model this after fmgr_hook?  The first argument in that hook
> indicates where it is being called from.  This doesn't alleviate 
the need
> for several calls to the hook in the authentication logic, but 
extension

> authors would only need to define one hook.

I like the idea and indeed fmgr.h looks a good place to model it.

Attached a new patch version doing so.


I was thinking along the same lines, so +1 for the general approach


Thanks for the review!




Thanks for the patch. Can we think of enhancing
ClientAuthentication_hook_type itself, i.e., make it a generic hook for
all sorts of authentication metrics, info, etc., with the type parameter
embedded in it instead of a new FailedConnection_hook? We can either
add a new parameter for the "event" (the existing
ClientAuthentication_hook_type implementers will have problems), or
embed/multiplex the "event" info into the existing Port structure or a
status variable (macro or enum) (existing implementers will not have
compatibility problems).  IMO, this looks cleaner going forward.


Not sure I like this though -- I'll have to think about that


Not sure about this one neither.

The "enhanced" ClientAuthentication_hook will have to be fired at the 
same places as the new FailedConnection_hook is, but i think those 
places are not necessary linked to real authentication per say (making 
the name confusing).





On the v2 patch:

1. Why do we need to place the hook and structures in fmgr.h? Why not 
in auth.h?


agreed -- it does not belong in fmgr.h


Moved to auth.h.




2. The timeout handler is a signal handler, called as part of SIGALRM
handling. Most of the time, signal handlers ought to be doing only
small things; now that we are handing off control to a hook, which
can do arbitrary long-running work (writing to remote storage, a file,
aggregating, etc.), I don't think it's the right thing to do here.
  static void
  StartupPacketTimeoutHandler(void)
  {
+ if (FailedConnection_hook)
+ (*FailedConnection_hook) (FCET_STARTUP_PACKET_TIMEOUT, MyProcPort);
+ ereport(COMMERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("timeout while processing startup packet")));


Why add the ereport()?


removed it.



But more to Bharath's point, perhaps this is a case that is better
served by incrementing a stat counter and not exposed as a hook?


I think that the advantage of the hook is that it gives the extension 
author the ability/flexibility to aggregate the counter based on 
information available in the Port struct (say, the client address, for
example) at this stage.


What about aggregating the stat counter based on the client addr? (Not 
sure there is any more useful information than the client addr at this 
stage, though.)
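
For example, a hypothetical extension-side implementation of the hook
could aggregate per client address like this (bump_failure_counter() is
imagined, and the enum type name is assumed from the patch's FCET_*
values):

/* Hypothetical extension code; bump_failure_counter() is imagined and
 * the enum type name is assumed from the patch's FCET_* values. */
static void
my_failed_connection_hook(FailedConnectionEventType event, Port *port)
{
    const char *addr = (port != NULL && port->remote_host != NULL)
        ? port->remote_host : "unknown";

    bump_failure_counter(event, addr);  /* aggregate per client addr */
}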


That said, I agree that having a hook in a timeout handler might not be 
the right thing to do (even if, in the end, it would be the extension 
author's responsibility to do only "small things" in it), so it has been 
removed in the new attached version.
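
(For reference, the conventional signal-safe alternative would have been
to only set a flag in the handler and invoke the hook later, outside
signal context -- a minimal sketch, not part of the patch:)

/* Minimal sketch of the flag-setting pattern; not part of the patch. */
static volatile sig_atomic_t startup_packet_timed_out = 0;

static void
StartupPacketTimeoutHandler(void)
{
    startup_packet_timed_out = 1;   /* async-signal-safe */
}

/* ...later, from normal execution context: */
if (startup_packet_timed_out && FailedConnection_hook)
    (*FailedConnection_hook) (FCET_STARTUP_PACKET_TIMEOUT, MyProcPort);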




Also, a teeny nit:
8<--
+   if (status != STATUS_OK) {
+   if (FailedConnection_hook)
8<--

does not follow usual practice and probably should be:

8<--
+   if (status != STATUS_OK)
+   {
+   if (FailedConnection_hook)
8<--



Thanks, fixed!

--

Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d7257e4056..d9e1e3b4c1 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -567,6 +567,8 @@ int postmaster_alive_fds[2] = {-1, -1};
 HANDLE PostmasterHandle;
 #endif
 
+FailedConnection_hook_type FailedConnection_hook = NULL;
+
 /*
  * Postmaster main entry point
  */
@@ -4462,7 +4464,11 @@ BackendInitialize(Port *port)
 * already did any appropriate error reporting.
 */
if (status != STATUS_OK)
+   {
+   if (FailedConnection_hook)
+   (*FailedConnection_hook) (FCET_BAD_STARTUP_PACKET, port);
proc_exit(0);
+   }
 
/*
 * Now that we have the user and database name, we can set the process
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 6b9082604f..e9bbd185f4 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -360,10 +360,14 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
if (!am_superuser &&
pg_database_aclcheck(MyDatabaseId, GetUserId(),
 ACL_CONNECT) 
!= ACLCHECK_OK)
+   {
+   if (FailedConnection_hook)
+  

Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-07-06 Thread David Rowley
Thanks for looking at this.

On Wed, 6 Jul 2022 at 12:32, Andres Freund  wrote:
>
> On 2022-06-29 11:40:45 +1200, David Rowley wrote:
> > Another small thing which I considered doing was to put the
> > hash_fcinfo_data field as the final field in
> > ScalarArrayOpExprHashTable so that we could allocate the memory for
> > the hash_fcinfo_data in the same allocation as the
> > ScalarArrayOpExprHashTable.  This would reduce the pointer
> > dereferencing done in saop_element_hash() a bit further.  I just
> > didn't notice anywhere else where we do that for FunctionCallInfo, so
> > I resisted doing this.
>
> I think that'd make sense - it does add a bit of size calculation magic, but
> it shouldn't be a problem. I'm fairly sure we do this in other parts of the
> code.

I've now adjusted that.  I also changed the hash_finfo field so that
the FmgrInfo is stored inline rather than behind a pointer. This saves
an additional dereference in saop_element_hash() and also saves a
palloc().

I had to adjust the palloc for the ScalarArrayOpExprHashTable struct
into a palloc0 due to the FmgrInfo being inlined.  I considered just
zeroing out the hash_finfo portion but thought it wasn't worth the
extra code.
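
The result is roughly the following shape (an abbreviated sketch of the
single-allocation pattern, not the committed code verbatim):

/* Abbreviated sketch of the single-allocation layout described above. */
typedef struct ScalarArrayOpExprHashTable
{
    saophash_hash *hashtab;     /* the underlying hash table */
    FmgrInfo    hash_finfo;     /* now inlined rather than a pointer */
    FunctionCallInfoBaseData hash_fcinfo_data;  /* variable size, keep last */
} ScalarArrayOpExprHashTable;

/* One zeroed allocation covers the struct plus a one-argument fcinfo. */
elements_tab = (ScalarArrayOpExprHashTable *)
    palloc0(offsetof(ScalarArrayOpExprHashTable, hash_fcinfo_data) +
            SizeForFunctionCallInfo(1));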

> Are you good pushing this? I'm fine with you doing so wether you adapt it
> further or not.

Pushed.

David




Re: doc: BRIN indexes and autosummarize

2022-07-06 Thread Alvaro Herrera
On 2022-Jul-05, Justin Pryzby wrote:

> One issue:
> 
> +   summarized range, the range that the new page belongs into
> +   does not automatically acquire a summary tuple;
> 
> "belongs into" sounds wrong - "belongs to" is better.

Hah, and I was wondering if "belongs in" was any better.

> I'll put that change into my "typos" branch to fix later if it's not addressed
> in this thread.

Roberto has some more substantive comments on the new text, so let's try
and fix everything together.  This time, I'll let you guys come up with
a new patch.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Bump MIN_WINNT to 0x0600 (Vista) as minimal runtime in 16~

2022-07-06 Thread Michael Paquier
On Thu, Jun 16, 2022 at 03:14:16PM +0900, Michael Paquier wrote:
> Actually, this can go with the bump of MIN_WINNT as it uses one of the
> IsWindows*OrGreater() macros as a runtime check.  And there are two
> more places in pg_ctl.c that can be similarly cleaned up.
> 
> It is possible that I have missed some spots, of course.

It does not seem to be the case on a second look.  The buildfarm
animals running Windows consist of:
- hamerkop, Windows server 2016 (based on Win10 AFAIK)
- drongo, Windows server 2019
- bowerbird, Windows 10 pro
- jacana, Windows 10
- fairywren, Msys server 2019
- bichir, Ubuntu/Windows 10 mix

Now that v16 is open for business, any objections to moving on with this
patch and bumping MIN_WINNT to 0x0A00 on HEAD?  There are follow-up items
for large pages and more.
--
Michael


signature.asc
Description: PGP signature


Re: refactor some protocol message sending in walsender and basebackup

2022-07-06 Thread Peter Eisentraut

On 01.07.22 23:36, Nathan Bossart wrote:

On Thu, Jun 23, 2022 at 04:36:36PM +0200, Peter Eisentraut wrote:

Some places in walsender.c and basebackup_copy.c open-code the sending of
RowDescription and DataRow protocol messages.  But there are already more
compact and robust solutions available for this, using DestRemoteSimple and
associated machinery, already in use in walsender.c.

The attached patches 0001 and 0002 are tiny bug fixes I found during this.

Patches 0003 and 0004 are the main refactorings.  They should probably be
combined into one patch eventually, but this way the treatment of
RowDescription and DataRow is presented separately.


All 4 patches look reasonable to me.


All committed now, thanks.

(I cleaned up the 0004 patch a bit more; there was some junk left in the 
posted patch.)
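
(For anyone reading along, the DestRemoteSimple machinery in question is
used roughly like this -- a condensed sketch modeled on existing
walsender.c usage, not on the committed patches verbatim:)

/* Condensed sketch modeled on existing walsender.c usage. */
DestReceiver *dest = CreateDestReceiver(DestRemoteSimple);
TupleDesc   tupdesc = CreateTemplateTupleDesc(1);
TupOutputState *tstate;
Datum       values[1];
bool        nulls[1] = {false};

TupleDescInitBuiltinEntry(tupdesc, (AttrNumber) 1, "recptr", TEXTOID, -1, 0);
tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual); /* RowDescription */
values[0] = CStringGetTextDatum("0/1000000");
do_tup_output(tstate, values, nulls);   /* DataRow */
end_tup_output(tstate);                 /* CommandComplete and cleanup */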





Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-06 Thread Masahiko Sawada
On Tue, Jul 5, 2022 at 8:00 PM Amit Kapila  wrote:
>
> On Mon, Jul 4, 2022 at 6:12 PM Amit Kapila  wrote:
> >
> > On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada wrote:
> > >
> > > I've attached three POC patches:
> > >
> >
> > I think it will be a good idea if you can add a short commit message
> > at least to say which patch is proposed for HEAD and which one is for
> > back branches. Also, it would be good if you can add some description
> > of the fix in the commit message. Let's remove poc* from the patch
> > name.
> >
> > Review poc_add_running_catchanges_xacts_to_serialized_snapshot
> > =
>
> Few more comments:

Thank you for the comments.

> 1.
> +
> + /* This array must be sorted in xidComparator order */
> + TransactionId *xip;
> + } catchanges;
>  };
>
> This array contains the transaction ids for subtransactions as well. I
> think it is better to mention that in the comments.

Updated.

>
> 2. Are we anytime removing transaction ids from catchanges->xip array?

No.

> If not, is there a reason for the same? I think we can remove it
> either at commit/abort or even immediately after adding the xid/subxid
> to committed->xip array.

It might be a good idea, but I'm concerned that removing XIDs from the
array at every commit/abort, or after adding them to the committed->xip
array, might be costly as it requires adjusting the array to keep its
order. Removing XIDs from the array would make bsearch faster, but the
array is updated reasonably often (every 15 sec).
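
To illustrate the cost: since the array must stay in xidComparator
order, each removal implies shifting the tail of the array, e.g.
(hypothetical sketch, not proposed code):

/* Hypothetical sketch: removing one xid from the sorted array requires
 * an O(n) memmove to preserve xidComparator order. */
static void
remove_catchange_xid(TransactionId *xip, size_t *xcnt, TransactionId xid)
{
    TransactionId *found = bsearch(&xid, xip, *xcnt,
                                   sizeof(TransactionId), xidComparator);

    if (found == NULL)
        return;
    memmove(found, found + 1,
            (xip + *xcnt - (found + 1)) * sizeof(TransactionId));
    (*xcnt)--;
}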

>
> 3.
> + if (readBytes != sz)
> + {
> + int save_errno = errno;
> +
> + CloseTransientFile(fd);
> +
> + if (readBytes < 0)
> + {
> + errno = save_errno;
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not read file \"%s\": %m", path)));
> + }
> + else
> + ereport(ERROR,
> + (errcode(ERRCODE_DATA_CORRUPTED),
> + errmsg("could not read file \"%s\": read %d of %zu",
> + path, readBytes, sz)));
> + }
>
> This is the fourth instance of similar error handling code in
> SnapBuildRestore(). Isn't it better to extract this into a separate
> function?

Good idea, updated.
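
Something along these lines, presumably (the helper name below is
illustrative only):

/* Illustrative shape of the extracted helper; the name is made up. */
static void
snapbuild_read_file(int fd, char *dest, Size size, const char *path)
{
    int         readBytes = read(fd, dest, size);

    if (readBytes != size)
    {
        int         save_errno = errno;

        CloseTransientFile(fd);

        if (readBytes < 0)
        {
            errno = save_errno;
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not read file \"%s\": %m", path)));
        }
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("could not read file \"%s\": read %d of %zu",
                        path, readBytes, size)));
    }
}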

>
> 4.
> +TransactionId *
> +ReorderBufferGetCatalogChangesXacts(ReorderBuffer *rb, size_t *xcnt_p)
> +{
> + HASH_SEQ_STATUS hash_seq;
> + ReorderBufferTXNByIdEnt *ent;
> + TransactionId *xids;
> + size_t xcnt = 0;
> + size_t xcnt_space = 64; /* arbitrary number */
> +
> + xids = (TransactionId *) palloc(sizeof(TransactionId) * xcnt_space);
> +
> + hash_seq_init(&hash_seq, rb->by_txn);
> + while ((ent = hash_seq_search(&hash_seq)) != NULL)
> + {
> + ReorderBufferTXN *txn = ent->txn;
> +
> + if (!rbtxn_has_catalog_changes(txn))
> + continue;
>
> It would be better to allocate memory the first time we have to store
> xids. There is a good chance that this function will often do just the
> palloc without having to store any xid.

Agreed.

>
> 5. Do you think we should do some performance testing for a mix of
> ddl/dml workload to see if it adds any overhead in decoding due to
> serialize/restore doing additional work? I don't think it should add
> some meaningful overhead but OTOH there is no harm in doing some
> testing of the same.

Yes, it would be worth trying. I also believe this change doesn't
introduce noticeable overhead but let's check just in case.

I've attached an updated patch.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


0001-Add-catalog-modifying-transactions-to-logical-decodi.patch
Description: Binary data


Re: Improve TAP tests of pg_upgrade for cross-version tests

2022-07-06 Thread Michael Paquier
On Thu, Jun 09, 2022 at 04:49:01PM +0900, Michael Paquier wrote:
> Rebased to cope with the recent changes in this area.

Please find attached an updated version of this patch, where I have
extended the support of the upgrade script down to 9.5 as the origin
version; ~9.4 now fails because Cluster.pm relies on cluster_name,
which those versions do not support.  FWIW, I have created an extra
set of dumps down to 9.4.

This adds proper handling for initdb --wal-segsize and
--allow-group-access when it comes to v11~, as these options are
missing in ~10.
--
Michael
From d6a585e7b0180ca9e4cda56474ceffe146593ce2 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 6 Jul 2022 15:23:24 +0900
Subject: [PATCH v3] Add more filtering capabilities in the dumps.

This adds support for some filtering capabilities, for tests done with
origin versions down to v9.5.  This allows the tests to pass with v14,
while v9.5~13 still generate a few diffs, but these are minimal
compared to what happened before this commit.
---
 src/bin/pg_upgrade/t/002_pg_upgrade.pl | 85 --
 1 file changed, 66 insertions(+), 19 deletions(-)

diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 2f9b13bf0a..159417ceec 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -30,6 +30,39 @@ sub generate_db
 		"created database with ASCII characters from $from_char to $to_char");
 }
 
+# Filter the contents of a dump before its use in a content comparison.
+# This returns the path to the filtered dump.
+sub filter_dump
+{
+	my ($node, $tempdir, $dump_file_name) = @_;
+	my $dump_file = "$tempdir/$dump_file_name";
+	my $dump_contents = slurp_file($dump_file);
+
+	# Remove the comments.
+	$dump_contents =~ s/^\-\-.*//mgx;
+	# Remove empty lines.
+	$dump_contents =~ s/^\n//mgx;
+
+	# Locale setup has changed for collations in 15~.
+	$dump_contents =~
+	  s/(^CREATE\sCOLLATION\s.*?)\slocale\s=\s'und'/$1 locale = ''/mgx
+	  if ($node->pg_version < 15);
+
+	# Dumps taken from <= 11 use EXECUTE PROCEDURE.  Replace it
+	# with EXECUTE FUNCTION.
+	$dump_contents =~
+	  s/(^CREATE\sTRIGGER\s.*?)\sEXECUTE\sPROCEDURE/$1 EXECUTE FUNCTION/mgx
+	  if ($node->pg_version < 12);
+
+	my $dump_file_filtered = "$tempdir/${dump_file_name}_filter";
+	open(my $dh, '>', $dump_file_filtered)
+	  || die "opening $dump_file_filtered";
+	print $dh $dump_contents;
+	close($dh);
+
+	return $dump_file_filtered;
+}
+
 # The test of pg_upgrade requires two clusters, an old one and a new one
 # that gets upgraded.  Before running the upgrade, a logical dump of the
 # old cluster is taken, and a second logical dump of the new one is taken
@@ -60,7 +93,10 @@ my $oldnode =
 # To increase coverage of non-standard segment size and group access without
 # increasing test runtime, run these tests with a custom setting.
 # --allow-group-access and --wal-segsize have been added in v11.
-$oldnode->init(extra => [ '--wal-segsize', '1', '--allow-group-access' ]);
+my %node_params = ();
+$node_params{extra} = [ '--wal-segsize', '1', '--allow-group-access' ]
+if $oldnode->pg_version >= 11;
+$oldnode->init(%node_params);
 $oldnode->start;
 
 # The default location of the source code is the root of this directory.
@@ -147,19 +183,19 @@ if (defined($ENV{oldinstall}))
 
 # Initialize a new node for the upgrade.
 my $newnode = PostgreSQL::Test::Cluster->new('new_node');
-$newnode->init(extra => [ '--wal-segsize', '1', '--allow-group-access' ]);
+$newnode->init(%node_params);
 my $newbindir = $newnode->config_data('--bindir');
 my $oldbindir = $oldnode->config_data('--bindir');
 
 # Take a dump before performing the upgrade as a base comparison. Note
 # that we need to use pg_dumpall from the new node here.
-$newnode->command_ok(
-	[
-		'pg_dumpall', '--no-sync',
-		'-d', $oldnode->connstr('postgres'),
-		'-f', "$tempdir/dump1.sql"
-	],
-	'dump before running pg_upgrade');
+my @dump_command = (
+	'pg_dumpall', '--no-sync', '-d', $oldnode->connstr('postgres'),
+	'-f', "$tempdir/dump1.sql");
+# --extra-float-digits is needed when upgrading from a version older than 11.
+push(@dump_command, '--extra-float-digits', '0')
+  if ($oldnode->pg_version < 12);
+$newnode->command_ok(\@dump_command, 'dump before running pg_upgrade');
 
 # After dumping, update references to the old source tree's regress.so
 # to point to the new tree.
@@ -284,24 +320,35 @@ if (-d $log_path)
 }
 
 # Second dump from the upgraded instance.
-$newnode->command_ok(
-	[
-		'pg_dumpall', '--no-sync',
-		'-d', $newnode->connstr('postgres'),
-		'-f', "$tempdir/dump2.sql"
-	],
-	'dump after running pg_upgrade');
+@dump_command = (
+	'pg_dumpall', '--no-sync', '-d', $newnode->connstr('postgres'),
+	'-f', "$tempdir/dump2.sql");
+# --extra-float-digits is needed when upgrading from a version older than 11.
+push(@dump_command, '--extra-float-digits', '0')
+  if ($oldnode->pg_version < 12);

Re: Unify DLSUFFIX on Darwin

2022-07-06 Thread Peter Eisentraut

On 24.06.22 16:13, Tom Lane wrote:

[ thinks for a bit... ]  Might be worth double-checking that pg_upgrade
doesn't get confused in a cross-version upgrade.  A quick grep doesn't
find that it refers to DLSUFFIX anywhere, but it definitely does pay
attention to extensions' shared library names.


pg_upgrade just checks that it can "LOAD" whatever it finds in probin. 
So this will work if extensions use the recommended extension-free file 
names.  If they don't, they should get a clean failure.


If this becomes a problem in practice, we could make pg_dump 
automatically adjust the probin on upgrade from an old version.


I have committed this now.  We can see how it goes.



