Re: meson: Specify -Wformat as a common warning flag for extensions

2024-02-21 Thread Sutou Kouhei
Hi,

Could someone take a look at this?

Patch is attached in the original e-mail:
https://www.postgresql.org/message-id/20240122.141139.931086145628347157.kou%40clear-code.com


Thanks,
-- 
kou

In <20240122.141139.931086145628347157@clear-code.com>
  "meson: Specify -Wformat as a common warning flag for extensions" on Mon, 22 
Jan 2024 14:11:39 +0900 (JST),
  Sutou Kouhei  wrote:

> Hi,
> 
> I'm an extension developer. If I use PostgreSQL built with
> Meson, I get the following warning:
> 
> cc1: warning: '-Wformat-security' ignored without '-Wformat' 
> [-Wformat-security]
> 
> Because "pg_config --cflags" includes -Wformat-security but
> doesn't include -Wformat.
> 
> Can we specify -Wformat as a common warning flag too? If we
> do it, "pg_config --cflags" includes both of
> -Wformat-security and -Wformat. So I don't get the warning.
> 
> 
> Thanks,
> -- 
> kou
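
For anyone reproducing this: a minimal illustration, assuming gcc (clang
behaves similarly), since the warning fires for any translation unit
compiled with -Wformat-security but without -Wformat:

/* t.c -- any trivial translation unit triggers the warning */
int
main(void)
{
	return 0;
}

/*
 * $ cc -Wformat-security -c t.c
 * cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]
 *
 * $ cc -Wformat -Wformat-security -c t.c
 * (no warning; adding -Wformat to the common flags has the same effect)
 */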




Re: locked reads for atomics

2024-02-21 Thread Bharath Rupireddy
On Tue, Nov 28, 2023 at 2:30 AM Nathan Bossart  wrote:
>
> Here's a v2 of the patch set in which I've attempted to address all
> feedback.  I've also added a pg_write_membarrier_u* pair of functions that

There's some immediate use for reads/writes with barrier semantics -
https://www.postgresql.org/message-id/CALj2ACXrePj4E6ocKr-%2Bb%3DrjT-8yeMmHnEeWQP1bc-WXETfTVw%40mail.gmail.com.
Any plan for taking this forward?
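
For context, a rough sketch of the kind of usage such functions enable.
The names follow the pg_*_membarrier_u* naming mentioned above; the
exact signatures are whatever the v2 patch set defines:

#include "postgres.h"
#include "port/atomics.h"

typedef struct SharedFlag
{
	pg_atomic_uint64 value;
} SharedFlag;

/* Writer: barrier semantics make earlier stores visible before this one. */
static void
publish(SharedFlag *sf, uint64 v)
{
	pg_atomic_write_membarrier_u64(&sf->value, v);
}

/* Reader: barrier semantics keep later loads from moving before this one. */
static uint64
consume(SharedFlag *sf)
{
	return pg_atomic_read_membarrier_u64(&sf->value);
}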

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Speeding up COPY TO for uuids and arrays

2024-02-21 Thread Michael Paquier
On Thu, Feb 22, 2024 at 08:16:09AM +0100, Laurenz Albe wrote:
> I'm attaching the remaining patch for the Juli commitfest, if you
> don't get inspired before that.

There are good chances that I get inspired some time next week.  We'll
see ;)
--
Michael


signature.asc
Description: PGP signature


Re: Add lookup table for replication slot invalidation causes

2024-02-21 Thread Bharath Rupireddy
On Thu, Feb 22, 2024 at 12:26 PM Michael Paquier  wrote:
>
> On Thu, Feb 22, 2024 at 05:30:08PM +1100, Peter Smith wrote:
> > Oops. Perhaps I meant more like below -- in any case, the point was
> > the same -- to ensure RS_INVAL_NONE is what is returned if something
> > unexpected happens.
>
> You are right that this could be a bit confusing, even if we should
> never reach this state.  How about avoiding returning the index of the
> loop as the result, as in the attached?  Would you find that cleaner?

Looks neat!

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Speeding up COPY TO for uuids and arrays

2024-02-21 Thread Laurenz Albe
On Thu, 2024-02-22 at 10:34 +0900, Michael Paquier wrote:
> This part is done as of 011d60c4352c.  I did not evaluate the rest
> yet.

Thanks!

I'm attaching the remaining patch for the Juli commitfest, if you
don't get inspired before that.

Yours,
Laurenz Albe
From e8346ea88785a763d2bd3f99800ae928b7469f64 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Sat, 17 Feb 2024 17:24:19 +0100
Subject: [PATCH v1 3/3] Small speedup for array_out

Avoid writing zero bytes where it is not necessary.
This offers only a small, but measurable speed gain
for larger arrays.
---
 src/backend/utils/adt/arrayfuncs.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index f3fee54e37..306c5062f7 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -1200,7 +1200,7 @@ array_out(PG_FUNCTION_ARGS)
 	p = retval;
 
 #define APPENDSTR(str)	(strcpy(p, (str)), p += strlen(p))
-#define APPENDCHAR(ch)	(*p++ = (ch), *p = '\0')
+#define APPENDCHAR(ch)	(*p++ = (ch))
 
 	if (needdims)
 		APPENDSTR(dims_str);
@@ -1222,10 +1222,9 @@ array_out(PG_FUNCTION_ARGS)
 char		ch = *tmp;
 
 if (ch == '"' || ch == '\\')
-	*p++ = '\\';
-*p++ = ch;
+	APPENDCHAR('\\');
+APPENDCHAR(ch);
 			}
-			*p = '\0';
 			APPENDCHAR('"');
 		}
 		else
@@ -1248,6 +1247,8 @@ array_out(PG_FUNCTION_ARGS)
 		j = i;
 	} while (j != -1);
 
+	*p = '\0';
+
 #undef APPENDSTR
 #undef APPENDCHAR
 
-- 
2.43.2



Re: POC: GROUP BY optimization

2024-02-21 Thread Andrei Lepikhov

On 22/2/2024 13:35, Richard Guo wrote:
> The avg() function on integer argument is commonly used in
> aggregates.sql.  I don't think this is an issue.  See the first test
> query in aggregates.sql.

Makes sense.

>>> it should be parallel to the test cases for utilize the ordering of
>>> index scan and subquery scan.
>>
>> Also, I'm unsure about removing the disabling of the
>> max_parallel_workers_per_gather parameter. Have you discovered the
>> domination of the current plan over the partial one? Do the cost
>> fluctuations across platforms not trigger a parallel plan?
>
> The table used for testing contains only 100 tuples, which is the size
> of only one page.  I don't believe it would trigger any parallel plans,
> unless we manually change min_parallel_table_scan_size.

I don't intend to argue it, but just for the information, I frequently
reduce it to zero, allowing PostgreSQL to make a decision based on
costs. It sometimes works much better, because one small table in a
multi-join can disallow an effective parallel plan.

>> What's more, I suggest to address here the complaint from [1]. As I see,
>> the cost difference between the Sort and IncrementalSort strategies in
>> that case is around 0.5. To make the test more stable I propose to
>> change it a bit and add a limit:
>> SELECT count(*) FROM btg GROUP BY z, y, w, x LIMIT 10;
>> It makes the efficacy of IncrementalSort more obvious: a difference of
>> around 10 cost points.
>
> I don't think that's necessary.  With Incremental Sort the final cost
> is:
>
>     GroupAggregate  (cost=1.66..19.00 rows=100 width=25)
>
> while with full Sort it is:
>
>     GroupAggregate  (cost=16.96..19.46 rows=100 width=25)
>
> With the STD_FUZZ_FACTOR (1.01), there is no doubt that the first path
> is cheaper on total cost.  Not to say that even if somehow we decide the
> two paths are fuzzily the same on total cost, the first path still
> dominates because its startup cost is much cheaper.

As before, I won't protest here - it needs some computations about how
much cost can be added by bulk extension of the relation blocks. If
Maxim answers that it's enough to resolve his issue, why not?


--
regards,
Andrei Lepikhov
Postgres Professional





Re: Add lookup table for replication slot invalidation causes

2024-02-21 Thread Michael Paquier
On Thu, Feb 22, 2024 at 05:30:08PM +1100, Peter Smith wrote:
> Oops. Perhaps I meant more like below -- in any case, the point was
> the same -- to ensure RS_INVAL_NONE is what is returned if something
> unexpected happens.

You are right that this could be a bit confusing, even if we should
never reach this state.  How about avoiding returning the index of the
loop as the result, as in the attached?  Would you find that cleaner?
--
Michael
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 233652b479..eb1ada2f9f 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -2313,6 +2313,7 @@ ReplicationSlotInvalidationCause
 GetSlotInvalidationCause(const char *conflict_reason)
 {
 	ReplicationSlotInvalidationCause cause;
+	ReplicationSlotInvalidationCause result = RS_INVAL_NONE;
 	bool		found PG_USED_FOR_ASSERTS_ONLY = false;
 
 	Assert(conflict_reason);
@@ -2322,10 +2323,11 @@ GetSlotInvalidationCause(const char *conflict_reason)
 		if (strcmp(SlotInvalidationCauses[cause], conflict_reason) == 0)
 		{
 			found = true;
+			result = cause;
 			break;
 		}
 	}
 
 	Assert(found);
-	return cause;
+	return result;
 }


signature.asc
Description: PGP signature


Re: Synchronizing slots from primary to standby

2024-02-21 Thread shveta malik
On Thu, Feb 22, 2024 at 10:31 AM shveta malik  wrote:
>
> Please find patch001 attached. There is a CFBot failure in patch002.
> The test added there needs some adjustment. We will rebase and post
> rest of the patches once we fix that issue.
>

There was a recent commit 801792e to improve error messaging in
slotsync.c which resulted in a conflict, so I have rebased the patch.
There is no new change in the patch attached.

thanks
Shveta


v94_2-0001-Add-a-new-slotsync-worker.patch
Description: Binary data


Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-21 Thread Michael Paquier
On Thu, Feb 15, 2024 at 03:34:21PM +0900, Sutou Kouhei wrote:
> It seems that it improves performance a bit but my
> environment isn't suitable for benchmark. So they may not
> be valid numbers.

I was comparing what you have here, and what's been attached by Andres
at [1] and the top of the changes on my development branch at [2]
(v3-0008, mostly).  And, it strikes me that there is no need to do any
major changes in any of the callbacks proposed up to v13 and v14 in
this thread, as all the changes proposed want to plug in more data
into each StateData for COPY FROM and COPY TO, the best part being
that v3-0008 can just reuse the proposed callbacks as-is.  v1-0001
from Sutou-san would need one slight tweak in the per-row callback,
still that's minor.

I have been spending more time on the patch to introduce the COPY
APIs, leading me to the v15 attached, where I have replaced the
previous attribute callbacks for the output representation and the
reads with hardcoded routines that should be optimized by compilers,
and I have done more profiling with -O2.  I'm aware of the disparities
in the per-row and start callbacks for the text/csv cases as well as
the default expressions, but these are really format-dependent with
their own assumptions so splitting them is something that makes
limited sense to me.  I've also looked at externalizing some of the
error handling, though the result was not that beautiful, so what I
got here is what makes the callbacks leaner and easier to work with.

First, some results for COPY FROM using the previous tests (30 int
attributes, running on scissors, data sent to blackhole_am, etc.) in
NextCopyFrom() which becomes the hot-spot:
* Using v15:
  Children  Self  Command   Shared Object   Symbol
-   66.42% 0.71%  postgres  postgres[.] NextCopyFrom
- 65.70% NextCopyFrom
   - 65.49% CopyFromTextLikeOneRow
  + 19.29% InputFunctionCallSafe
  + 15.81% CopyReadLine
13.89% CopyReadAttributesText
+ 0.71% _start
* Using HEAD (today's 011d60c4352c):
  Children  Self  Command   Shared Object   Symbol
-   67.09%16.64%  postgres  postgres[.] NextCopyFrom
- 50.45% NextCopyFrom
   - 30.89% NextCopyFromRawFields
  + 16.26% CopyReadLine
13.59% CopyReadAttributesText
   + 19.24% InputFunctionCallSafe
+ 16.64% _start

In this case, I have been able to limit the effects of the per-row
callback by making NextCopyFromRawFields() local to copyfromparse.c
while applying some inlining to it.  This brings me to a different
point: why don't we make this change independently on HEAD?  It's not
really complicated to make NextCopyFromRawFields() show up high in the
profiles.  I was looking at external projects, and noticed that
there's nothing calling NextCopyFromRawFields() directly.
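
A sketch of what that independent change could look like (not an actual
hunk from the patch; it assumes, as noted, that nothing outside
copyfromparse.c needs the symbol):

/*
 * In copyfromparse.c: with the extern declaration removed from the
 * header, a private forward declaration like this lets the compiler
 * inline the existing definition into NextCopyFrom()'s per-row path.
 */
static pg_attribute_always_inline bool NextCopyFromRawFields(CopyFromState cstate,
															 char ***fields,
															 int *nfields);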

Second, some profiles with COPY TO (30 int attributes, running on
scissors) where data is sent to /dev/null:
* Using v15:
  Children  Self  Command   Shared Object   Symbol
-   85.61% 0.34%  postgres  postgres[.] CopyOneRowTo
- 85.26% CopyOneRowTo
   - 75.86% CopyToTextOneRow
  + 36.49% OutputFunctionCall
  + 10.53% appendBinaryStringInfo
9.66% CopyAttributeOutText
1.34% int4out
0.92% 0xa9803be8
0.79% enlargeStringInfo
0.77% memcpy@plt
0.69% 0xa9803be4
   + 3.12% CopySendEndOfRow
 2.81% CopySendChar
 0.95% pgstat_progress_update_param
 0.95% appendBinaryStringInfo
 0.55% MemoryContextReset
* Using HEAD (today's 011d60c4352c):
  Children  Self  Command   Shared Object   Symbol
-   80.35%14.23%  postgres  postgres[.] CopyOneRowTo
- 66.12% CopyOneRowTo
   + 35.40% OutputFunctionCall
   + 11.00% appendBinaryStringInfo
 8.38% CopyAttributeOutText
   + 2.98% CopySendEndOfRow
 1.52% int4out
 0.88% pgstat_progress_update_param
 0.87% 0x8ab32be8
 0.74% memcpy@plt
 0.68% enlargeStringInfo
 0.61% 0x8ab32be4
 0.51% MemoryContextReset
+ 14.23% _start

The increase in CopyOneRowTo from 80% to 85% worries me but I am not
quite sure how to optimize that with the current structure of the
code, so the dispatch caused by per-row callback is noticeable in
what's my worst test case.  I am not quite sure how to avoid that,
TBH.  A result that has been puzzling me is that I am getting faster
runtimes with v15 (6232ms on average) vs HEAD (6550ms) at 5M rows with
COPY TO for what led to these profiles (for tests without perf
attached to the backends).

Any thoughts overall?

[1]: 
https://www.postgresql.org/message-id/20240218015955.rmw5mcmobt5hbene%40awork3.anarazel.de
[2]: https://www.postgresql.org/message-id/zcwotr1n0gelf...@paquier.xyz
--
Michael
From 9787247320f9d11756a20e3f94e84ec9bb10092e Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 22 Feb 2024 

Re: Why is pq_begintypsend so slow?

2024-02-21 Thread Sutou Kouhei
Hi,

In <20240219195351.5vy7cdl3wxia6...@awork3.anarazel.de>
  "Re: Why is pq_begintypsend so slow?" on Mon, 19 Feb 2024 11:53:51 -0800,
  Andres Freund  wrote:

>> I don't have a strong opinion how to implement this
>> optimization as I said above. It seems that you like your
>> approach. So I withdraw [1]. Could you complete this
>> optimization? Can we proceed making COPY format extensible
>> without this optimization? It seems that it'll take a little
>> time to complete this optimization because your patch is
>> still WIP. And it seems that you can work on it after making
>> COPY format extensible.
> 
> I don't think optimizing this aspect needs to block making copy extensible.

OK. I'll work on making copy extensible without this
optimization.

> I don't know how much time/energy I'll have to focus on this in the near
> term. I really just reposted this because the earlier patches were relevant
> for the discussion in another thread.  If you want to pick the COPY part up,
> feel free.

OK. I may work on this after I complete making copy
extensible if you haven't completed this yet.


Thanks,
-- 
kou




Re: POC: GROUP BY optimization

2024-02-21 Thread Richard Guo
On Thu, Feb 22, 2024 at 12:18 PM Andrei Lepikhov 
wrote:

> On 22/2/2024 09:09, Richard Guo wrote:
> > I looked through the v2 patch and have two comments.
> >
> > * The test case under "Check we don't pick aggregate path key instead of
> > grouping path key" does not have EXPLAIN to show the plan.  So how can
> > we ensure we do not mistakenly select the aggregate pathkeys instead of
> > the grouping pathkeys?
>


> I confirm it works correctly. I am not sure about the stability of the
> number of zeros in the output on different platforms here:
>
>    avg
> --------------------
>  4.0000000000000000
>  5.0000000000000000
> It was why I'd used the format() function before. So, may we elaborate
> on the test and restrict the output?


The avg() function on integer argument is commonly used in
aggregates.sql.  I don't think this is an issue.  See the first test
query in aggregates.sql.

SELECT avg(four) AS avg_1 FROM onek;
       avg_1
--------------------
 1.5000000000000000
(1 row)


> > * I don't think the test case introduced by e1b7fde418 is still needed,
> > because we already have one under "Utilize the ordering of merge join to
> > avoid a full Sort operation".  This kind of test case is just to ensure
> > that we are able to utilize the ordering of the subplans underneath.  So
> > it should be parallel to the test cases for utilize the ordering of
> > index scan and subquery scan.
>
> Also, I'm unsure about removing the disabling of the
> max_parallel_workers_per_gather parameter. Have you discovered the
> domination of the current plan over the partial one? Do the cost
> fluctuations across platforms not trigger a parallel plan?


The table used for testing contains only 100 tuples, which is the size
of only one page.  I don't believe it would trigger any parallel plans,
unless we manually change min_parallel_table_scan_size.


> What's more, I suggest to address here the complaint from [1]. As I see,
> the cost difference between the Sort and IncrementalSort strategies in
> that case is around 0.5. To make the test more stable I propose to
> change it a bit and add a limit:
> SELECT count(*) FROM btg GROUP BY z, y, w, x LIMIT 10;
> It makes the efficacy of IncrementalSort more obvious: a difference of
> around 10 cost points.


I don't think that's necessary.  With Incremental Sort the final cost
is:

GroupAggregate  (cost=1.66..19.00 rows=100 width=25)

while with full Sort it is:

GroupAggregate  (cost=16.96..19.46 rows=100 width=25)

With the STD_FUZZ_FACTOR (1.01), there is no doubt that the first path
is cheaper on total cost.  Not to say that even if somehow we decide the
two paths are fuzzily the same on total cost, the first path still
dominates because its startup cost is much cheaper.

Thanks
Richard


Re: Add lookup table for replication slot invalidation causes

2024-02-21 Thread Peter Smith
On Thu, Feb 22, 2024 at 5:19 PM Peter Smith  wrote:
>
> Hi, Sorry for the late comment but isn't the pushed logic now
> different to what it was there before?
>
> IIUC previously (in a non-debug build) if the specified
> conflict_reason was not found, it returned RS_INVAL_NONE -- now it
> seems to return whatever enum happens to be last.
>
> How about something more like below:
>
> --
> ReplicationSlotInvalidationCause
> GetSlotInvalidationCause(const char *conflict_reason)
> {
>   ReplicationSlotInvalidationCause cause;
>   bool    found = false;
>
>   for (cause = 0; !found && cause <= RS_INVAL_MAX_CAUSES; cause++)
> found = strcmp(SlotInvalidationCauses[cause], conflict_reason) == 0;
>
>   Assert(found);
>   return found ? cause : RS_INVAL_NONE;
> }
> --
>

Oops. Perhaps I meant more like below -- in any case, the point was
the same -- to ensure RS_INVAL_NONE is what is returned if something
unexpected happens.

ReplicationSlotInvalidationCause
GetSlotInvalidationCause(const char *conflict_reason)
{
ReplicationSlotInvalidationCause cause;

for (cause = 0; cause <= RS_INVAL_MAX_CAUSES; cause++)
{
if (strcmp(SlotInvalidationCauses[cause], conflict_reason) == 0)
return cause;
}

Assert(0);
return RS_INVAL_NONE;
}

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Add lookup table for replication slot invalidation causes

2024-02-21 Thread Peter Smith
Hi, sorry for the late comment, but isn't the pushed logic now
different from what was there before?

IIUC previously (in a non-debug build) if the specified
conflict_reason was not found, it returned RS_INVAL_NONE -- now it
seems to return whatever enum happens to be last.

How about something more like below:

--
ReplicationSlotInvalidationCause
GetSlotInvalidationCause(const char *conflict_reason)
{
  ReplicationSlotInvalidationCause cause;
  bool    found = false;

  for (cause = 0; !found && cause <= RS_INVAL_MAX_CAUSES; cause++)
found = strcmp(SlotInvalidationCauses[cause], conflict_reason) == 0;

  Assert(found);
  return found ? cause : RS_INVAL_NONE;
}
--

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: About a recently-added message

2024-02-21 Thread Kyotaro Horiguchi
At Thu, 22 Feb 2024 10:51:07 +0530, Amit Kapila  wrote 
in 
> > Do you think some additional tests for the rest of the messages are
> > worth the trouble?
> >
> 
> We have discussed this during development and didn't find it worth
> adding tests for all misconfigured parameters. However, in the next
> patch where we are planning to add a slot sync worker that will
> automatically sync slots, we are adding a test for one more parameter.
> I am not against adding tests for all the parameters but it didn't
> appeal to add more test cycles for this.

Thanks for the explanation. I'm fine with that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: About a recently-added message

2024-02-21 Thread Amit Kapila
On Thu, Feb 22, 2024 at 6:16 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 22 Feb 2024 09:36:43 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > Yes, I'm happy with all of the changes.  The proposed patch appears to
> > cover all instances related to slotsync.c, and it looks fine to
> > me. Thanks!
>
> I'd like to raise another potential issue outside the patch. The patch
> needed to change only one test item even though it changed nine
> messages. This means eight out of nine messages that the patch changed
> are not covered by our test. I doubt all of them are worth additional
> test items; however, I think we want to increase coverage.
>
> Do you think some additional tests for the rest of the messages are
> worth the trouble?
>

We have discussed this during development and didn't find it worth
adding tests for all misconfigured parameters. However, in the next
patch where we are planning to add a slot sync worker that will
automatically sync slots, we are adding a test for one more parameter.
I am not against adding tests for all the parameters but it didn't
appeal to add more test cycles for this.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-02-21 Thread shveta malik
On Wed, Feb 21, 2024 at 5:19 PM Amit Kapila  wrote:
>
> A few minor comments:

Thanks for the feedback.

> =
> 1.
> +/*
> + * Is stopSignaled set in SlotSyncCtx?
> + */
> +bool
> +IsStopSignaledSet(void)
> +{
> + bool signaled;
> +
> + SpinLockAcquire(&SlotSyncCtx->mutex);
> + signaled = SlotSyncCtx->stopSignaled;
> + SpinLockRelease(&SlotSyncCtx->mutex);
> +
> + return signaled;
> +}
> +
> +/*
> + * Reset stopSignaled in SlotSyncCtx.
> + */
> +void
> +ResetStopSignaled(void)
> +{
> + SpinLockAcquire(&SlotSyncCtx->mutex);
> + SlotSyncCtx->stopSignaled = false;
> + SpinLockRelease(&SlotSyncCtx->mutex);
> +}
>
> I think these newly introduced functions don't need spinlock to be
> acquired as these are just one-byte read-and-write. Additionally, when
> IsStopSignaledSet() is invoked, there shouldn't be any concurrent
> process to update that value. What do you think?

Yes, we can avoid taking spinlock here. These functions are invoked
after checking that pmState is PM_RUN. And in that state we do not
expect any other process writing this flag.
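
Concretely, the simplification under discussion amounts to something
like this sketch (the committed form may differ):

/*
 * stopSignaled is a single byte, and IsStopSignaledSet() is only called
 * after checking that pmState is PM_RUN, when no concurrent writer can
 * exist, so no spinlock is needed.
 */
bool
IsStopSignaledSet(void)
{
	return SlotSyncCtx->stopSignaled;
}

void
ResetStopSignaled(void)
{
	SlotSyncCtx->stopSignaled = false;
}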

> 2.
> +REPL_SLOTSYNC_MAIN "Waiting in main loop of slot sync worker."
> +REPL_SLOTSYNC_SHUTDOWN "Waiting for slot sync worker to shut down."
>
> Let's use REPLICATION instead of REPL. I see other wait events using
> REPLICATION in their names.

Modified.

> 3.
> - * In standalone mode and in autovacuum worker processes, we use a fixed
> - * ID, otherwise we figure it out from the authenticated user name.
> + * In standalone mode, autovacuum worker processes and slot sync worker
> + * process, we use a fixed ID, otherwise we figure it out from the
> + * authenticated user name.
> */
> - if (bootstrap || IsAutoVacuumWorkerProcess())
> + if (bootstrap || IsAutoVacuumWorkerProcess() || IsLogicalSlotSyncWorker())
> {
> InitializeSessionUserIdStandalone();
> am_superuser = true;
>
> IIRC, we discussed this previously and it is safe to make the local
> connection as superuser as we don't consult any user tables, so we can
> probably add a comment where we invoke InitPostgres in slotsync.c

Added comment. Thanks Hou-San for the analysis here and providing comment.

> 4.
> $publisher->safe_psql('postgres',
> - "CREATE PUBLICATION regress_mypub FOR ALL TABLES;");
> + "CREATE PUBLICATION regress_mypub FOR ALL TABLES;"
> +);
>
> Why this change is required in the patch?

Not needed, removed it.

> 5.
> +# Confirm that restart_lsn and of confirmed_flush_lsn lsub1_slot slot
> are synced
> +# to the standby
>
> /and of/; looks like a typo

Modified.

> 6.
> +# Confirm that restart_lsn and of confirmed_flush_lsn lsub1_slot slot
> are synced
> +# to the standby
> +ok( $standby1->poll_query_until(
> + 'postgres',
> + "SELECT '$primary_restart_lsn' = restart_lsn AND
> '$primary_flush_lsn' = confirmed_flush_lsn from pg_replication_slots
> WHERE slot_name = 'lsub1_slot';"),
> + 'restart_lsn and confirmed_flush_lsn of slot lsub1_slot synced to standby');
> +
> ...
> ...
> +# Confirm the synced slot 'lsub1_slot' is retained on the new primary
> +is($standby1->safe_psql('postgres',
> + q{SELECT slot_name FROM pg_replication_slots WHERE slot_name =
> 'lsub1_slot';}),
> + 'lsub1_slot',
> + 'synced slot retained on the new primary');
>
> In both these checks, we should additionally check the 'synced' and
> 'temporary' flags to ensure that they are marked appropriately.

Modified.

Please find patch001 attached. There is a CFBot failure in patch002.
The test added there needs some adjustment. We will rebase and post
rest of the patches once we fix that issue.

thanks
Shveta


v94-0001-Add-a-new-slotsync-worker.patch
Description: Binary data


Re: LogwrtResult contended spinlock

2024-02-21 Thread Robert Haas
On Thu, Feb 22, 2024 at 1:54 AM Jeff Davis  wrote:
> On Wed, 2024-02-21 at 20:00 +0530, Bharath Rupireddy wrote:
> > The local copy of LogwrtResult is so frequently used in the backends,
> > if we were to replace it with atomic accesses, won't the atomic reads
> > be costly and start showing up in perf profiles?
>
> I don't see exactly where the extra cost would come from. What is your
> concern?
>
> In functions that access it several times, it may make sense to copy it
> to a function-local variable, of course. But having a global non-shared
> LogwrtResult doesn't seem particularly useful to me.

I would love to get rid of the global variable; all of our code relies
too much on global variables, but the xlog code is some of the worst.
The logic is complex and difficult to reason about.

But I do think that there's room for concern about performance, too. I
don't know if there is an actual problem, but there could be a
problem. My impression is that accessing shared memory isn't
intrinsically more costly than accessing backend-private memory, but
accessing heavily contended memory can definitely be more expensive
than accessing uncontended memory -- and not just slightly more
expensive, but WAY more expensive.

I think the problems tend to be worst when you have some bit of data
that's being frequently modified by multiple backends. Every backend
that wants to modify the value needs to steal the cache line, and
eventually you spend most of your CPU time stealing cache lines from
other sockets and not much of it doing any actual work. If you have a
value that's just being read by a lot of backends without
modification, I think the cache line can be shared in read only mode
by all the CPUs and it's not too bad.
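
To make that concrete, here is a hedged sketch (hypothetical names, not
taken from the patch) of the pattern at stake: the backend-local copy
keeps the fast path in private memory, and the shared, possibly
contended cache line is only touched when the local copy is not enough:

#include "postgres.h"
#include "port/atomics.h"

typedef struct XLogSharedPtrs
{
	pg_atomic_uint64 Write;		/* updated frequently by many backends */
} XLogSharedPtrs;

static uint64 LocalWrite = 0;	/* backend-local cached copy */

static bool
AlreadyWrittenUpTo(XLogSharedPtrs *shared, uint64 upto)
{
	/* Fast path: private memory only, no shared cache-line traffic. */
	if (LocalWrite >= upto)
		return true;

	/* Slow path: refresh the local copy from the shared value. */
	LocalWrite = pg_atomic_read_u64(&shared->Write);
	return LocalWrite >= upto;
}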

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Unlinking Parallel Hash Join inner batch files sooner

2024-02-21 Thread Andrei Lepikhov

On 22/2/2024 06:42, Thomas Munro wrote:
> On Wed, Feb 21, 2024 at 7:34 PM Andrei Lepikhov
>  wrote:
>> I see in [1] that the reporter mentioned a delay between the error
>> message in parallel HashJoin and the return control back from PSQL. Your
>> patch might reduce this delay.
>> Also, I have the same complaint from users who processed gigabytes of
>> data in parallel HashJoin. Presumably, they also stuck into the unlink
>> of tons of temporary files. So, are you going to do something with this
>> code?
>
> Yeah, right.  I will aim to get this into the tree next week.  First,
> there are a couple of minor issues to resolve around freeing that
> Heikki mentioned.  Then there is the question of whether we think this
> might be a candidate for back-patching, given the complaints you
> mention.  Opinions?

The code is related to performance, not a bug. Also, it adds one
external function to 'sharedtuplestore.h'. IMO, it isn't worth
back-patching.

> I would add that the problems you reach when you get to very large
> number of partitions are hard (see several very long threads about
> extreme skew for one version of the problem, but even with zero/normal
> skewness and perfect estimation of the number of partitions, if you
> ask a computer to partition 42TB of data into partitions that fit in a
> work_mem suitable for a Commodore 64, it's gonna hurt on several
> levels) and this would only slightly improve one symptom.  One idea
> that might improve just the directory entry and file descriptor
> aspect, would be to scatter the partitions into (say) 1MB chunks
> within the file, and hope that the file system supports holes (a bit
> like logtape.c's multiplexing but I wouldn't do it quite like that).

Thanks, I found in [1] a good entry point to dive into this issue.

[1] 
https://www.postgresql.org/message-id/CA+hUKGKDbv+5uiJZDdB1wttkMPFs9CDb6=02qxitq4am-kb...@mail.gmail.com


--
regards,
Andrei Lepikhov
Postgres Professional





Re: DSA_ALLOC_NO_OOM doesn't work

2024-02-21 Thread Robert Haas
On Thu, Feb 22, 2024 at 12:49 AM Heikki Linnakangas  wrote:
> That's fair, I can see those reasons. Nevertheless, I do think it was a
> bad tradeoff. A little bit of repetition would be better here, or we can
> extract the common parts to smaller functions.
>
> I came up with the attached:
>
>25 files changed, 1710 insertions(+), 1113 deletions(-)
>
> So yeah, it's more code, and there's some repetition, but I think this
> is more readable. Some of that is extra boilerplate because I split the
> implementations to separate files, and I also added tests.

Adding tests is great. I'm unenthusiastic about the rest, but I don't
really care enough to argue.

What's the goal here, anyway? I mean, why bother?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC: GROUP BY optimization

2024-02-21 Thread Andrei Lepikhov

On 22/2/2024 09:09, Richard Guo wrote:
> On Wed, Feb 21, 2024 at 6:20 PM Alexander Korotkov wrote:
>> Hi, Richard!
>>
>>> What do you think about the revisions for the test cases?
>>
>> I've rebased your patch upthread.  Did some minor beautifications.
>>
>>> * The table 'btg' is inserted with 10000 tuples, which seems a bit
>>> expensive for a test.  I don't think we need such a big table to test
>>> what we want.
>>
>> Your patch reduces the number of rows to 1000 tuples.  I found it
>> possible to further reduce it to 100 tuples.  That also allowed me to
>> save the plan in the test case introduced by e1b7fde418.
>>
>> Please check if you're OK with the patch attached.
>
> I looked through the v2 patch and have two comments.
>
> * The test case under "Check we don't pick aggregate path key instead of
> grouping path key" does not have EXPLAIN to show the plan.  So how can
> we ensure we do not mistakenly select the aggregate pathkeys instead of
> the grouping pathkeys?

I confirm it works correctly. I am not sure about the stability of the
number of zeros in the output on different platforms here:

   avg
--------------------
 4.0000000000000000
 5.0000000000000000

It was why I'd used the format() function before. So, may we elaborate
on the test and restrict the output?

> * I don't think the test case introduced by e1b7fde418 is still needed,
> because we already have one under "Utilize the ordering of merge join to
> avoid a full Sort operation".  This kind of test case is just to ensure
> that we are able to utilize the ordering of the subplans underneath.  So
> it should be parallel to the test cases for utilize the ordering of
> index scan and subquery scan.

I confirm this version also checks ec_sortref initialization in the
case when ECs are constructed from the WHERE clause. Generally, I like
one test for one issue more than one test for all at once. But it works,
and I don't have any reason to dispute it.

Also, I'm unsure about removing the disabling of the
max_parallel_workers_per_gather parameter. Have you discovered the
domination of the current plan over the partial one? Do the cost
fluctuations across platforms not trigger a parallel plan?

What's more, I suggest to address here the complaint from [1]. As I see,
the cost difference between the Sort and IncrementalSort strategies in
that case is around 0.5. To make the test more stable I propose to
change it a bit and add a limit:

SELECT count(*) FROM btg GROUP BY z, y, w, x LIMIT 10;

It makes the efficacy of IncrementalSort more obvious: a difference of
around 10 cost points.


[1] 
https://www.postgresql.org/message-id/CACG=ezaYM1tr6Lmp8PRH1aeZq=rbkxeotwgzmclad5mphfw...@mail.gmail.com


--
regards,
Andrei Lepikhov
Postgres Professional





Re: RFC: Logging plan of the running query

2024-02-21 Thread Ashutosh Bapat
On Thu, Feb 22, 2024 at 6:25 AM James Coleman  wrote:
> >
> > ...I think the current approach is just plain dead, because of this
> > issue. We can't take an approach that creates an unbounded number of
> > unclear reentrancy issues and then insert hacks one by one to cure
> > them (or hack around them, more to the point) as they're discovered.
> >
> > The premise has to be that we only allow logging the query plan at
> > points where we know it's safe, rather than, as at present, allowing
> > it in places that are unsafe and then trying to compensate with code
> > elsewhere. That's not likely to ever be as stable as we want
> > PostgreSQL to be.
>
> This is potentially a bit of a wild idea, but I wonder if having some
> kind of argument to CHECK_FOR_INTERRUPTS() signifying we're in
> "normal" as opposed to "critical" (using that word differently than
> the existing critical sections) would be worth it.

My hunch is this will end up being a maintenance burden, since every
caller has to decide (carefully) whether the call happens under normal
conditions or not. Developers will tend to take the safe approach and
flag calls as critical. More importantly, what's normal for one
interrupt action may be critical for another and vice versa. The
approach would only be useful if the definition of "normal" is easy to
comprehend.

If a query executes for longer than a user-defined threshold (session
level GUC? or same value as auto_explain parameter), the executor
proactively prepares an EXPLAIN output and keeps it handy in case
asked for. It can do so at a "known" safe place and time rather than
at any random time and location. Extra time spent in creating EXPLAIN
output may not be noticeable in a long running query. The EXPLAIN
output could be saved in pg_stat_activity or similar place. This will
avoid signaling the backend.

-- 
Best Wishes,
Ashutosh Bapat




Re: Documentation to upgrade logical replication cluster

2024-02-21 Thread Peter Smith
Here are some minor comments for patch v8-0001.

==
doc/src/sgml/glossary.sgml

1.
+   <glossdef>
+    <para>
+     A set of publisher and subscriber instances with publisher instance
+     replicating changes to the subscriber instance.
+    </para>
+   </glossdef>

/with publisher instance/with the publisher instance/

~~~

2.
There are 2 SQL fragments but they are wrapped differently (see
below). e.g. it is not clear why the 2nd fragment is wrapped since it
is shorter than the 1st. OTOH, maybe you want the 1st fragment to
wrap. Either way, consistent wrapping would be better.


postgres=# SELECT slot_name FROM pg_replication_slots WHERE slot_type
= 'logical' AND conflict_reason IS NOT NULL;
 slot_name
---
(0 rows)

versus

SELECT count(*) FROM pg_replication_slots WHERE slot_type = 'logical'
AND temporary IS false;

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Injection points: some tools to wait and wake

2024-02-21 Thread Michael Paquier
On Wed, Feb 21, 2024 at 11:50:21AM +0000, Bertrand Drouvot wrote:
> I think the approach is fine and the hardcoded value is "large" enough
> (it would be surprising, at least to me, to write a test that would
> need more than 32 waiters).

This could use less.  I've never used more than 3 of them in a single
test, and that was already very complex to follow.


> A few comments:
> 
> 1 ===
> I think "up" is missing at several places in the patch where "wake" is used.
> I could be wrong as non native english speaker though.

Patched up three places to be more consistent.

> 2 ===
> 
> +   /* Counters advancing when injection_points_wakeup() is called */
> +   int wait_counts[INJ_MAX_WAIT];
> 
> uint32? (here and other places where counter is manipulated).

Okay, why not.

> "Remove this injection wait name from the waiting list" instead?
> s/a condition variable/an injection wait point/ ?

Okay.

> 5 ===
> 
> +PG_FUNCTION_INFO_V1(injection_points_wakeup);
> +Datum
> +injection_points_wakeup(PG_FUNCTION_ARGS)
> 
> Empty line missing before "Datum"?

I've used the same style as anywhere else.

> Also maybe some comments are missing above injection_point_init_state(), 
> injection_init_shmem() but it's more a Nit.

Sure.

> While at it, should we add a second injection wait point in
> t/041_invalid_checkpoint_after_promote.pl to check that they are wake up
> individually?

I'm not sure that's a good idea for the sake of this test, TBH, as
that would provide coverage outside the original scope of the
restartpoint/promote check.

I have also looked at if it would be possible to get an isolation test
out of that, but the arbitrary wait does not make that possible with
the existing isolation APIs, see GetSafeSnapshotBlockingPids() used in
pg_isolation_test_session_is_blocked().  isolation/README seems to be
a bit off here, actually, mentioning pg_locks..  We could use some
tricks with transactions or locks internally, but I'm not really
tempted to make the wait callback more complicated for the sake of
more coverage as I'd rather keep it generic for anybody who wants to
control the order of events across processes.

Attaching a v3 for reference with everything that has been mentioned
up to now.
--
Michael
From 0ccbacd01b326078aa750f1a72f38823f8f9c8b3 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 22 Feb 2024 11:53:29 +0900
Subject: [PATCH v3 1/2] injection_points: Add routines to wait and wake
 processes

This commit is made of two parts:
- A new callback that can be attached to a process to make it wait on a
condition variable.  The condition checked is registered in shared
memory by the module injection_points.
- A new SQL function to update the shared state and broadcast the update
using a condition variable.

The shared state used by the module is registered using the DSM
registry, and is optional.
---
 .../injection_points--1.0.sql |  10 ++
 .../injection_points/injection_points.c   | 155 ++
 src/tools/pgindent/typedefs.list  |   1 +
 3 files changed, 166 insertions(+)

diff --git a/src/test/modules/injection_points/injection_points--1.0.sql b/src/test/modules/injection_points/injection_points--1.0.sql
index 5944c41716..0a2e59aba7 100644
--- a/src/test/modules/injection_points/injection_points--1.0.sql
+++ b/src/test/modules/injection_points/injection_points--1.0.sql
@@ -24,6 +24,16 @@ RETURNS void
 AS 'MODULE_PATHNAME', 'injection_points_run'
 LANGUAGE C STRICT PARALLEL UNSAFE;
 
+--
+-- injection_points_wakeup()
+--
+-- Wakes up a waiting injection point.
+--
+CREATE FUNCTION injection_points_wakeup(IN point_name TEXT)
+RETURNS void
+AS 'MODULE_PATHNAME', 'injection_points_wakeup'
+LANGUAGE C STRICT PARALLEL UNSAFE;
+
 --
 -- injection_points_detach()
 --
diff --git a/src/test/modules/injection_points/injection_points.c b/src/test/modules/injection_points/injection_points.c
index e843e6594f..142211d0d7 100644
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_points.c
@@ -18,18 +18,76 @@
 #include "postgres.h"
 
 #include "fmgr.h"
+#include "storage/condition_variable.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
+#include "storage/dsm_registry.h"
 #include "utils/builtins.h"
 #include "utils/injection_point.h"
 #include "utils/wait_event.h"
 
 PG_MODULE_MAGIC;
 
+/* Maximum number of wait usable in injection points at once */
+#define INJ_MAX_WAIT	32
+#define INJ_NAME_MAXLEN	64
+
+/* Shared state information for injection points. */
+typedef struct InjectionPointSharedState
+{
+	/* protects accesses to wait_counts */
+	slock_t		lock;
+
+	/* Counters advancing when injection_points_wakeup() is called */
+	uint32		wait_counts[INJ_MAX_WAIT];
+
+	/* Names of injection points attached to wait counters */
+	char		name[INJ_MAX_WAIT][INJ_NAME_MAXLEN];
+
+	/*
+	 * Condition variable used for waits and wakeups, checking upon the set of
+	 * wait_counts when 

Re: Add LSN <-> time conversion functionality

2024-02-21 Thread Melanie Plageman
Thanks so much for reviewing!

On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
 wrote:
>
> When I first read this, I immediately started wondering if this might
> use the commit timestamp stuff we already have. Because for each commit
> we already store the LSN and commit timestamp, right? But I'm not sure
> that would be a good match - the commit_ts serves a very special purpose
> of mapping XID => (LSN, timestamp), I don't see how to make it work for
> (LSN=>timestmap) and (timestamp=>LSN) very easily.

I took a look at the code in commit_ts.c, and I really don't see a way
of reusing any of this commit<->timestamp infrastructure for
timestamp<->LSN mappings.

> As for the inner workings of the patch, my understanding is this:
>
> - "LSNTimeline" consists of "LSNTime" entries representing (LSN,ts)
> points, but those points are really "buckets" that grow larger and
> larger for older periods of time.

Yes, they are buckets in the sense that they represent multiple values
but each contains a single LSNTime value which is the minimum of all
the LSNTimes we "merged" into that single array element. In order to
represent a range of time, you need to use two array elements. The
linear interpolation from time <-> LSN is all done with two elements.

> - AFAIK each entry represent an interval of time, and the next (older)
> interval is twice as long, right? So the first interval is 1 second,
> then 2 seconds, 4 seconds, 8 seconds, ...
>
> - But I don't understand how the LSNTimeline entries are "aging" and get
> less accurate, while the "current" bucket is short. lsntime_insert()
> seems to simply move to the next entry, but doesn't that mean we insert
> the entries into larger and larger buckets?

Because the earlier array elements can represent fewer logical members
than later ones and because elements are merged into the next element
when space runs out, later array elements will contain older data and
more of it, so those "ranges" will be larger. But, after thinking
about it and also reading your feedback, I realized my algorithm was
silly because it starts merging logical members before it has even
used the whole array.

The attached v3 has a new algorithm. Now, LSNTimes are added from the
end of the array forward until all array elements have at least one
logical member (array length == volume). Once array length == volume,
new LSNTimes will result in merging logical members in existing
elements. We want to merge older members because those can be less
precise. So, the number of logical members per array element will
always monotonically increase starting from the beginning of the array
(which contains the most recent data) and going to the end. We want to
use all the available space in the array. That means that each LSNTime
insertion will always result in a single merge. We want the timeline
to be inclusive of the oldest data, so merging means taking the
smaller value of two LSNTime values. I had to pick a rule for choosing
which elements to merge. So, I choose the merge target as the oldest
element whose logical membership is < 2x its predecessor. I merge the
merge target's predecessor into the merge target. Then I move all of
the intervening elements down 1. Then I insert the new LSNTime at
index 0.
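
If it helps review, here is a compact sketch of that insertion rule,
with hypothetical type and field names (the real patch may lay this out
differently):

#include "postgres.h"
#include "access/xlogdefs.h"	/* XLogRecPtr */
#include "datatype/timestamp.h" /* TimestampTz */
#include <string.h>

#define LSN_TIMELINE_VOLUME 64

typedef struct LSNTime
{
	TimestampTz time;
	XLogRecPtr	lsn;
	uint64		members;		/* logical members merged into this entry */
} LSNTime;

typedef struct LSNTimeline
{
	int			length;			/* array elements currently in use */
	LSNTime		data[LSN_TIMELINE_VOLUME];	/* index 0 is newest */
} LSNTimeline;

static void
lsntime_insert(LSNTimeline *tl, TimestampTz time, XLogRecPtr lsn)
{
	LSNTime		entry = {.time = time, .lsn = lsn, .members = 1};
	int			target = -1;

	/* Filling phase: place new entries from the end of the array forward. */
	if (tl->length < LSN_TIMELINE_VOLUME)
	{
		tl->data[LSN_TIMELINE_VOLUME - ++tl->length] = entry;
		return;
	}

	/*
	 * Merge phase: the target is the oldest element whose membership is
	 * < 2x that of its newer neighbour.  Fold the neighbour into it (the
	 * older element keeps its own, smaller LSN/time values), shift the
	 * newer elements down one slot, and insert the new entry at index 0.
	 */
	for (int i = LSN_TIMELINE_VOLUME - 1; i > 0; i--)
	{
		if (tl->data[i].members < 2 * tl->data[i - 1].members)
		{
			target = i;
			break;
		}
	}
	Assert(target > 0);			/* the invariants guarantee a target */

	tl->data[target].members += tl->data[target - 1].members;
	memmove(&tl->data[1], &tl->data[0], (target - 1) * sizeof(LSNTime));
	tl->data[0] = entry;
}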

> - The comments never really spell what amount of time the entries cover
> / how granular it is. My understanding is it's simply measured in number
> of entries added, which is assumed to be constant and drive by
> bgwriter_delay, right? Which is 200ms by default. Which seems fine, but
> isn't the hibernation (HIBERNATE_FACTOR) going to mess with it?
>
> Is there some case where bgwriter would just loop without sleeping,
> filling the timeline much faster? (I can't think of any, but ...)

bgwriter will wake up when there are buffers to flush, which is likely
correlated with there being new LSNs. So, actually it seems like it
might work well to rely on only filling the timeline when there are
things for bgwriter to do.

> - The LSNTimeline comment claims an array of size 64 is large enough to
> not need to care about filling it, but maybe it should briefly explain
> why we can never fill it (I guess 2^64 is just too many).

The new structure fits a different number of members. I have yet to
calculate that number, but it should be explained in the comments once
I do.

For example, if we made an LSNTimeline with volume 4, once every
element had one LSNTime and we needed to start merging, the following
is how many logical members each element would have after each of four
merges:

1 1 1 2
1 1 2 2
1 1 1 4
1 1 2 4

So, if we store the number of members as an unsigned 64-bit int and we
have an LSNTimeline with volume 64, what is the maximum number of
members we can store if we hold all of the invariants described in my
algorithm above (we only merge when required, every element holds < 2x
the number of logical members as its predecessor, we do exactly one
merge every insertion [when required], membership must monotonically

Re: POC: GROUP BY optimization

2024-02-21 Thread Richard Guo
On Wed, Feb 21, 2024 at 6:20 PM Alexander Korotkov 
wrote:

> Hi, Richard!
>
> > What do you think about the revisions for the test cases?
>
> I've rebased your patch upthread.  Did some minor beautifications.
>
> > * The table 'btg' is inserted with 10000 tuples, which seems a bit
> > expensive for a test.  I don't think we need such a big table to test
> > what we want.
>
> Your patch reduces the number of rows to 1000 tuples.  I found it
> possible to further reduce it to 100 tuples.  That also allowed me to
> save the plan in the test case introduced by e1b7fde418.
>
> Please check if you're OK with the patch attached.


I looked through the v2 patch and have two comments.

* The test case under "Check we don't pick aggregate path key instead of
grouping path key" does not have EXPLAIN to show the plan.  So how can
we ensure we do not mistakenly select the aggregate pathkeys instead of
the grouping pathkeys?

* I don't think the test case introduced by e1b7fde418 is still needed,
because we already have one under "Utilize the ordering of merge join to
avoid a full Sort operation".  This kind of test case is just to ensure
that we are able to utilize the ordering of the subplans underneath.  So
it should be parallel to the test cases for utilize the ordering of
index scan and subquery scan.

See the attached v3 patch.  I also made cosmetic tweaks to the comments,
and simplified a test case query.

Thanks
Richard


v3-0001-Multiple-revisions-to-the-GROUP-BY-reordering-tests.patch
Description: Binary data


Re: Speeding up COPY TO for uuids and arrays

2024-02-21 Thread Michael Paquier
On Wed, Feb 21, 2024 at 02:29:05PM +0900, Michael Paquier wrote:
> That's very good, so I'd like to apply this part.  Let me know if
> there are any objections.

This part is done as of 011d60c4352c.  I did not evaluate the rest
yet.
--
Michael


signature.asc
Description: PGP signature


Re: Test to dump and restore objects left behind by regression

2024-02-21 Thread Michael Paquier
On Wed, Feb 21, 2024 at 12:18:45PM +0530, Ashutosh Bapat wrote:
> Even with 1 and 2 the test is useful to detect dump/restore anomalies.
> I think we should improve 3, but I don't have a good and simpler
> solution. I didn't find any way to compare two given clusters in our
> TAP test framework. Building it will be a lot of work. Not sure if
> it's worth it.

+   my $rc =
+ system($ENV{PG_REGRESS}
+ . " $extra_opts "
+ . "--dlpath=\"$dlpath\" "
+ . "--bindir= "
+ . "--host="
+ . $node->host . " "
+ . "--port="
+ . $node->port . " "
+ . "--schedule=$srcdir/src/test/regress/parallel_schedule "
+ . "--max-concurrent-tests=20 "
+ . "--inputdir=\"$inputdir\" "
+ . "--outputdir=\"$outputdir\"");

I am not sure that it is a good idea to add a full regression test
cycle while we have already 027_stream_regress.pl that would be enough
to test some dump scenarios.  These are very expensive and easy to
notice even with a high level of parallelization of the tests.
--
Michael


signature.asc
Description: PGP signature


Re: RFC: Logging plan of the running query

2024-02-21 Thread James Coleman
On Mon, Feb 19, 2024 at 11:53 PM Robert Haas  wrote:
>
> On Fri, Feb 16, 2024 at 12:29 AM Andres Freund  wrote:
> > If we went with something like that approach, I think we'd have to do 
> > something
> > like redirecting node->ExecProcNode to a wrapper, presumably from within a
> > CFI. That wrapper could then implement the explain support, without slowing
> > down the normal execution path.
>
> That's an annoying complication; maybe there's some better way to
> handle this. But I think we need to do something different than what
> the patch does currently because...
>
> > > It's really hard for me to accept that the heavyweight lock problem
> > > for which the patch contains a workaround is the only one that exists.
> > > I can't see any reason why that should be true.
> >
> > I suspect you're right.
>
> ...I think the current approach is just plain dead, because of this
> issue. We can't take an approach that creates an unbounded number of
> unclear reentrancy issues and then insert hacks one by one to cure
> them (or hack around them, more to the point) as they're discovered.
>
> The premise has to be that we only allow logging the query plan at
> points where we know it's safe, rather than, as at present, allowing
> it in places that are unsafe and then trying to compensate with code
> elsewhere. That's not likely to ever be as stable as we want
> PostgreSQL to be.

This is potentially a bit of a wild idea, but I wonder if having some
kind of argument to CHECK_FOR_INTERRUPTS() signifying we're in
"normal" as opposed to "critical" (using that word differently than
the existing critical sections) would be worth it.

Regards,
James Coleman




Re: About a recently-added message

2024-02-21 Thread Kyotaro Horiguchi
At Thu, 22 Feb 2024 09:36:43 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Yes, I'm happy with all of the changes.  The proposed patch appears to
> cover all instances related to slotsync.c, and it looks fine to
> me. Thanks!

I'd like to raise another potential issue outside the patch. The patch
needed to change only one test item even though it changed nine
messages. This means eight out of nine messages that the patch changed
are not covered by our test. I doubt all of them are worth additional
test items; however, I think we want to increase coverage.

Do you think some additional tests for the rest of the messages are
worth the trouble?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: About a recently-added message

2024-02-21 Thread Kyotaro Horiguchi
At Wed, 21 Feb 2024 14:57:42 +0530, Amit Kapila  wrote 
in 
> On Tue, Feb 20, 2024 at 3:21 PM shveta malik  wrote:
> >
> > okay, attached v2 patch with changed error msgs and double quotes
> > around logical.
> >
> 
> Horiguchi-San, does this address all your concerns related to
> translation with these new messages?

Yes, I'm happy with all of the changes.  The proposed patch appears to
cover all instances related to slotsync.c, and it looks fine to
me. Thanks!

I found that logical.c is also using the policy that I complained
about, but it is a separate issue.

./logical.c:122:	errmsg("logical decoding requires wal_level >= logical")));

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Patch: Add parse_type Function

2024-02-21 Thread David E. Wheeler
On Feb 21, 2024, at 17:19, Erik Wienhold  wrote:

> Thanks David!  LGTM.

Thanks. Anyone else? Or is it ready for committer?

Best,

David





Re: DSA_ALLOC_NO_OOM doesn't work

2024-02-21 Thread Heikki Linnakangas

On 22/02/2024 01:03, Thomas Munro wrote:
> On Thu, Feb 22, 2024 at 10:30 AM Thomas Munro  wrote:
>> collisions arbitrarily far apart (just decide how many bits to use).
>
> . o O ( Perhaps if you also allocated slots using a FIFO freelist,
> instead of the current linear search for the first free slot, you
> could maximise the time before a slot is reused, improving the
> collision-avoiding power of a generation scheme? )


We could also enlarge dsm_handle from 32-bits to 64-bits, if we're 
worried about collisions.


I actually experimented with something like that too: I encoded the "is 
this in main region" in one of the high bits and let the implementation 
use the low bits. One small issue with that is that we have a few places 
that pass a DSM handle as the 'bgw_main_arg' argument when launching a 
worker process, and on 32-bit platforms that would not be wide enough. 
Those could be changed to use the wider 'bgw_extra' field instead, though.
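
For illustration, the high-bit encoding could look roughly like this
(hypothetical macro names; dsm_handle is a 32-bit type today):

#include "storage/dsm_impl.h"	/* dsm_handle (currently uint32) */

/* Reserve the top bit to mark handles living in the main shmem region. */
#define DSM_HANDLE_MAIN_REGION_BIT	((dsm_handle) 1 << 31)

#define DSM_HANDLE_SET_MAIN_REGION(h)	((h) | DSM_HANDLE_MAIN_REGION_BIT)
#define DSM_HANDLE_IS_MAIN_REGION(h)	(((h) & DSM_HANDLE_MAIN_REGION_BIT) != 0)
#define DSM_HANDLE_IMPL_BITS(h)			((h) & ~DSM_HANDLE_MAIN_REGION_BIT)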


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Add lookup table for replication slot invalidation causes

2024-02-21 Thread Michael Paquier
On Wed, Feb 21, 2024 at 12:50:00PM +0530, Bharath Rupireddy wrote:
> Please see the attached v3 patch.

Seems globally OK, so applied.  I've simplified the comments a bit,
painted some extra const, and kept the variable name as conflict_reason,
since the other routines of slot.h already use "name" to refer to slot
names, and that was a bit confusing IMO.
--
Michael


signature.asc
Description: PGP signature


Re: Experiments with Postgres and SSL

2024-02-21 Thread Matthias van de Meent
On Wed, 10 Jan 2024 at 09:31, Heikki Linnakangas  wrote:
>
> Some more comments on this:
>
> 1. It feels weird that the combination of "gssencmode=require
> sslnegotiation=direct" combination is forbidden. Sure, the ssl
> negotiation will never happen with gssencmode=require, so the
> sslnegotiation option has no effect. But by that token, should we also
> forbid the combination "sslmode=disable sslnegotiation=direct"? I think
> not. The sslnegotiation option should mean "if we are going to try SSL,
> should we try it in direct or negotiated mode?"

I'm not sure about this either. The 'gssencmode' option is already
quite weird in that it seems to override the "require"d priority of
"sslmode=require", which it IMO really shouldn't.

> 2. Should we allow direct SSL only at the very beginning of a TCP
> connection, or should we also allow it after we have requested GSS and
> the server said no? Like this:
>
> Client: GSSENCRequest
> Server: 'N' (gss not supported)
> Client: TLS client Hello
>
> On one hand, why not? It saves you a round-trip in this case too. If we
> don't allow it, the client will have to send SSLRequest and wait for
> response, or reconnect to try direct SSL. On the other hand, flexibility
> is not necessarily a good thing in security-critical code like this.

I think this should be "no".
Once we start accepting PostgreSQL protocol packets (such as the
GSSENCRequest packet) I don't think we should start treating data
stream corruption as attempted SSL connections.

> The patch set is confused on whether that's allowed or not. The server
> rejects it. But if you use "gssencmode=prefer
> sslnegotiation=requiredirect", libpq will attempt to do it, and fail.

That should then be detected as an incorrect combination of flags in
psql: you can't have direct-to-ssl and put something in front of it.

> 3. With "sslmode=verify-full sslnegotiation=direct", if the direct SSL
> connection fails because of a problem with the certificate, libpq will
> try again in negotiated SSL mode. That seems pointless. If the server
> responded to the direct TLS Client Hello message with a valid
> ServerHello, that indicates that the server supports direct SSL. If
> anything goes wrong after that, retrying in negotiated mode is not going
> to help.

This makes sense.

> 4. The number of combinations of sslmode, gssencmode and sslnegotiation
> settings is scary. And we have very few tests for them.

Yeah, it's not great. We could easily automate this better though. I
mean, can't we run the tests using a "cube" configuration, i.e. test
every combination of parameters? We would use a mapping function of
(psql connection parameter values -> expectations), which would be
along the lines of the attached pl testfile. I feel it's a bit more
approachable than the lists of manual option configurations, and makes
it a bit easier to program the logic of which connection security
option we should have used to connect.
The attached file would be a drop-in replacement; it's tested to work
with SSL only - without GSS - because I've been having issues getting
GSS working on my machine.

> I'm going to put this down for now. The attached patch set is even more
> raw than v6, but I'm including it here to "save the work".

v6 doesn't apply cleanly anymore after 774bcffe, but here are some notes:

Several patches are still very much WIP. Reviewing them on a
patch-by-patch basis is therefore nigh impossible; the specific
reviews below are thus on changes that could be traced back to a
specific patch. A round of cleanup would be appreciated.

> 0003: Direct SSL connections postmaster support
> [...]
> -extern bool pq_buffer_has_data(void);
> +extern size_t pq_buffer_has_data(void);

This should probably be renamed to pq_buffer_remaining_data or such,
if we change the signature like this.

> +/* Read from the "unread" buffered data first. c.f. libpq-be.h */
> +if (port->raw_buf_remaining > 0)
> +{
> +/* consume up to len bytes from the raw_buf */
> +if (len > port->raw_buf_remaining)
> +len = port->raw_buf_remaining;

Shouldn't we also try to read from the socket, instead of only
consuming bytes from the raw buffer if it contains bytes?
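
Something along these lines is what I have in mind - a rough sketch
only, where raw_buf_pos and the surrounding function shape are my
invention, not the patch's code:

/* Sketch: drain the "unread" buffer first, then top up from the
 * socket, so a single call can serve from both sources. */
ssize_t nread = 0;

if (port->raw_buf_remaining > 0)
{
    size_t      n = Min(len, (size_t) port->raw_buf_remaining);

    memcpy(ptr, port->raw_buf + port->raw_buf_pos, n);
    port->raw_buf_pos += n;
    port->raw_buf_remaining -= n;
    nread += n;
}

if ((size_t) nread < len)
{
    ssize_t     r = secure_raw_read(port, (char *) ptr + nread, len - nread);

    if (r > 0)
        nread += r;
    else if (nread == 0)
        return r;       /* report error/EOF only if nothing was buffered */
}

return nread;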

> 0008: Allow pipelining data after ssl request
> +/*
> + * At this point we should have no data already buffered.  If we 
> do,
> + * it was received before we performed the SSL handshake, so it 
> wasn't
> + * encrypted and indeed may have been injected by a 
> man-in-the-middle.
> + * We report this case to the client.
> + */
> +if (port->raw_buf_remaining > 0)
> +ereport(FATAL,
> +(errcode(ERRCODE_PROTOCOL_VIOLATION),
> + errmsg("received unencrypted data after SSL 
> request"),
> + errdetail("This could be either a client-software 
> bug or evidence of an attempted man-in-the-middle attack.")));

Re: Unlinking Parallel Hash Join inner batch files sooner

2024-02-21 Thread Thomas Munro
On Wed, Feb 21, 2024 at 7:34 PM Andrei Lepikhov
 wrote:
> I see in [1] that the reporter mentioned a delay between the error
> message in parallel HashJoin and the return of control back to PSQL. Your
> patch might reduce this delay.
> Also, I have the same complaint from users who processed gigabytes of
> data in parallel HashJoin. Presumably, they also got stuck in the unlink
> of tons of temporary files. So, are you going to do something with this
> code?

Yeah, right.  I will aim to get this into the tree next week.  First,
there are a couple of minor issues to resolve around freeing that
Heikki mentioned.  Then there is the question of whether we think this
might be a candidate for back-patching, given the complaints you
mention.  Opinions?

I would add that the problems you reach when you get to very large
number of partitions are hard (see several very long threads about
extreme skew for one version of the problem, but even with zero/normal
skewness and perfect estimation of the number of partitions, if you
ask a computer to partition 42TB of data into partitions that fit in a
work_mem suitable for a Commodore 64, it's gonna hurt on several
levels) and this would only slightly improve one symptom.  One idea
that might improve just the directory entry and file descriptor
aspect, would be to scatter the partitions into (say) 1MB chunks
within the file, and hope that the file system supports holes (a bit
like logtape.c's multiplexing but I wouldn't do it quite like that).
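
To make that concrete, a toy version of the layout I mean (chunk size
and names invented here, and real code would have to deal with growth):

/*
 * Toy sketch: one sparse file shared by all partitions, carved into
 * fixed 1MB chunks assigned round-robin, so chunk n of partition p is
 * at a computable offset and unwritten chunks cost nothing on file
 * systems that support holes.
 */
#define HJ_CHUNK_BYTES (1024 * 1024)

static inline off_t
hj_chunk_offset(int npartitions, int partition, int chunk)
{
    return ((off_t) chunk * npartitions + partition) * HJ_CHUNK_BYTES;
}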




RE: Psql meta-command conninfo+

2024-02-21 Thread Maiquel Grassi
Hi!

(v19)

Adjusted the description of each column in the documentation as promised.
I used the existing documentation as a basis for each SQL function used,
as well as for the functions and views related to SSL and GSSAPI.

If you can validate it, I'd appreciate it.

Regards,
Maiquel Grassi.


v19-0001-psql-meta-command-conninfo-plus.patch
Description: v19-0001-psql-meta-command-conninfo-plus.patch


Re: DSA_ALLOC_NO_OOM doesn't work

2024-02-21 Thread Thomas Munro
On Thu, Feb 22, 2024 at 10:30 AM Thomas Munro  wrote:
> collisions arbitrarily far apart (just decide how many bits to use).

. o O ( Perhaps if you also allocated slots using a FIFO freelist,
instead of the current linear search for the first free slot, you
could maximise the time before a slot is reused, improving the
collision-avoiding power of a generation scheme? )




Discover PostgreSQL's Graph Power with Apache AGE!

2024-02-21 Thread Nandhini Jayakumar
Hello PostgreSQL Community,

Excited to share how Apache AGE enhances PostgreSQL with smooth graph
features! It handles complex data and supports SQL and Cypher. Join our
awesome community, check out the tutorials, and let's dive into those data
projects!

More info: Apache AGE GitHub & Website


Regards,
Nandhini Jayakumar


Re: Patch: Add parse_type Function

2024-02-21 Thread Erik Wienhold
On 2024-02-21 22:49 +0100, David E. Wheeler wrote:
> On Feb 21, 2024, at 16:43, Erik Wienhold  wrote:
> 
> > The docs still state that to_regtypemod() has a named parameter, which
> > is not the case per pg_proc.dat.
> 
> Bah, I cleaned it up before but somehow put it back. Thanks for the
> catch. Fixed.

Thanks David!  LGTM.

-- 
Erik




Re: Patch: Add parse_type Function

2024-02-21 Thread David E. Wheeler
On Feb 21, 2024, at 16:43, Erik Wienhold  wrote:

> The docs still state that to_regtypemod() has a named parameter, which
> is not the case per pg_proc.dat.

Bah, I cleaned it up before but somehow put it back. Thanks for the catch. 
Fixed.

Best,

David



v9-0001-Add-to_regtypemod-SQL-function.patch
Description: Binary data


Re: Patch: Add parse_type Function

2024-02-21 Thread Erik Wienhold
On 2024-02-21 17:51 +0100, David E. Wheeler wrote:
> On Feb 21, 2024, at 11:18, Erik Wienhold  wrote:
> 
> > Thanks.  But it's an applefile again, not a patch :P
> 
> Let’s try that again.

Thanks.

> +to_regtypemod ( type 
> text )

The docs still state that to_regtypemod() has a named parameter, which
is not the case per pg_proc.dat.

Besides that, LGTM.

-- 
Erik




Re: DSA_ALLOC_NO_OOM doesn't work

2024-02-21 Thread Thomas Munro
On Thu, Feb 22, 2024 at 8:19 AM Heikki Linnakangas  wrote:
> - Separate dsm_handle, used by backend code to interact with the high
> level interface in dsm.c, from dsm_impl_handle, which is used to
> interact with the low-level functions in dsm_impl.c. This gets rid of
> the convention in dsm.c of reserving odd numbers for DSM segments stored
> in the main shmem area. There is now an explicit flag for that the
> control slot. For generating dsm_handles, we now use the same scheme we
> used to use for main-area shm segments for all DSM segments, which
> includes the slot number in the dsm_handle. The implementations use
> their own mechanisms for generating the low-level dsm_impl_handles (all
> but the SysV implementation generate a random handle and retry on
> collision).

Could we use slot number and generation number, instead of slot number
and random number?  I have never liked the random number thing, which
IIUC was needed because of SysV key space management problems 'leaking'
up to the handle level (yuck).  With generations, you could keep
collisions arbitrarily far apart (just decide how many bits to use).
Collisions aren't exactly likely, but if there is no need for that
approach, I'm not sure why we'd keep it.  (I remember dealing with
actual collisions in the wild due to lack of PRNG seeding in workers,
which admittedly should be vanishingly rare now).
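
Concretely, I'm thinking of an encoding along these lines (the bit
split is an arbitrary assumption, just for illustration):

/*
 * Hypothetical dsm_handle layout: the low bits give the control slot
 * directly, and the remaining bits carry that slot's generation
 * counter, bumped each time the slot is reused.
 */
#define DSM_HANDLE_SLOT_BITS    10
#define DSM_HANDLE_SLOT_MASK    ((1U << DSM_HANDLE_SLOT_BITS) - 1)

static inline dsm_handle
make_dsm_handle(uint32 slot, uint32 generation)
{
    return (dsm_handle) ((generation << DSM_HANDLE_SLOT_BITS) |
                         (slot & DSM_HANDLE_SLOT_MASK));
}

static inline uint32
dsm_handle_slot(dsm_handle handle)
{
    /* the control slot falls out directly, no search needed */
    return (uint32) (handle & DSM_HANDLE_SLOT_MASK);
}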

If the slot number is encoded into the handle, why do we still need a
linear search for the slot?

> - create() no longer returns the mapped_size. The old Windows
> implementation had some code to read the actual mapped size after
> creating the mapping, and returned that in *mapped_size. Others just
> returned the requested size. In principle reading the actual size might
> be useful; the caller might be able to make use of the whole mapped size
> when it's larger than requested. In practice, the callers didn't do
> that. Also, POSIX shmem on FreeBSD has similar round-up-to-page-size
> behavior but the implementation did not query the actual mapped size
> after creating the segment, so you could not rely on it.

I think that is an interesting issue with the main shmem area.  There,
we can set huge_page_size to fantastically large sizes up to 16GB on
some architectures, but we have nothing to make sure we don't waste
some or most of the final page.  But I agree that there's not much
point in worrying about this for DSM.

> - Added a test that exercises basic create, detach, attach functionality
> using all the different implementations supported on the current platform.

I wonder how we could test the cleanup-after-crash behaviour.




Re: [PATCH] Exponential backoff for auth_delay

2024-02-21 Thread Tomas Vondra
Hi,

Thanks for the patch. I took a closer look at v3, so let me share some
review comments. Please push back if you happen to disagree with some of
it, some of this is going to be more a matter of personal preference.


1) I think it's a bit weird to have two options specifying an amount of
time, but one is in seconds and one in milliseconds. Won't that be
unnecessarily confusing? Could we do both in milliseconds?


2) The C code defines the GUC as auth_delay.exponential_backoff, but the
SGML docs say auth_delay.exp_backoff.


3) Do we actually need the exponential_backoff GUC? Wouldn't it be
sufficient to have the maximum value, and if it's -1 treat that as no
backoff?


4) I think the SGML docs should more clearly explain that the delay is
initialized to auth_delay.milliseconds, and reset to this value after
successful authentication. The wording kinda implies that, but it's not
quite clear, I think. (See the sketch after this list.)


5) I've been really confused by the change from

 if (status != STATUS_OK)

   to

 if (status == STATUS_ERROR)

in auth_delay_checks, until I realized those two codes are not the only
ones, and we intentionally ignore STATUS_EOF. I think it'd be good to
mention that in a comment, it's not quite obvious (I only realized it
because the e-mail mentioned it).


6) I kinda like the custom that functions in a contrib module start with
a clear common prefix, say auth_delay_ in this case. Matter of personal
preference, ofc.


7) I'm a bit skeptical about some acr_array details. Firstly, why would
50 entries be enough? Seems like a pretty low and arbitrary number ...
Also, what happens if someone attempts to authenticate, fails to do so,
and then disconnects and never tries again? Or just changes IP? Seems
like the entry will remain in the array forever, no?

Seems like a way to cause a "limited" DoS - do auth failure from 50
different hosts, to fill the table, and now everyone has to wait the
maximum amount of time (if they fail to authenticate).

I think it'd be good to consider:

- Making the acr_array a hash table, and larger than 50 entries (I
wonder if that should be dynamic / customizable by GUC?).

- Make sure the entries are eventually expired, based on time (for
example after 2*max_delay?).

- It would be a good idea to log something when we get into the "full
table" state and start delaying everyone by max_delay_seconds. (How would
anyone know that's what's happening? It seems rather confusing.)
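
To illustrate 1), 3) and 4) together, the backoff step I'd expect, with
both limits in milliseconds (the function and GUC names here are my
assumptions, not the patch's):

/*
 * Sketch: start at auth_delay_milliseconds on the first failure,
 * double on each subsequent failure, clamp at
 * auth_delay_max_milliseconds (-1 = no backoff), and reset the stored
 * value to zero after successful authentication.
 */
static int
auth_delay_next_ms(int current_ms)
{
    if (auth_delay_max_milliseconds < 0)
        return auth_delay_milliseconds;     /* no backoff configured */
    if (current_ms <= 0)
        return auth_delay_milliseconds;     /* first failure */
    return Min(current_ms * 2, auth_delay_max_milliseconds);
}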


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: margay fails assertion in stats/dsa/dsm code

2024-02-21 Thread Thomas Munro
On Sat, Jul 2, 2022 at 11:10 AM Thomas Munro  wrote:
> On Sat, Jul 2, 2022 at 1:15 AM Robert Haas  wrote:
> > Changing the default on certain platforms to 'posix' or 'sysv'
> > according to what works best on that platform seems reasonable to me.
>
> Ok, I'm going to make that change in 15 + master.

For the record, I asked a Solaris kernel engineer about that
shm_open() problem and learned that a fix shipped about a month after
we had this discussion (though I haven't tested it myself):

https://twitter.com/casperdik/status/1730288613722562986

I also reported the issue to illumos, since I'd like to be able to
revert 94ebf811 eventually...:

https://www.illumos.org/issues/16093




Re: WIP Incremental JSON Parser

2024-02-21 Thread Jacob Champion
On Wed, Feb 21, 2024 at 6:50 AM Jacob Champion
 wrote:
> On Tue, Feb 20, 2024 at 9:32 PM Andrew Dunstan  wrote:
> > *sigh* That's weird. I wonder why you can reproduce it and I can't. Can
> > you give me details of the build? OS, compiler, path to source, build
> > setup etc.? Anything that might be remotely relevant.

This construction seems suspect, in json_lex_number():

>   if (lex->incremental && !lex->inc_state->is_last_chunk &&
>   len >= lex->input_length)
>   {
>   appendStringInfoString(&lex->inc_state->partial_token,
>  lex->token_start);
>   return JSON_INCOMPLETE;
>   }

appendStringInfoString() isn't respecting the end of the chunk: if
there's extra data after the chunk boundary (as
AppendIncrementalManifestData() does) then all of that will be stuck
onto the end of the partial_token.
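
For reference, the shape of the fix I have in mind - a sketch only,
assuming the current chunk ends at lex->input + lex->input_length:

/* Sketch: append only the bytes that belong to this chunk, instead of
 * relying on NUL termination past the chunk boundary. */
if (lex->incremental && !lex->inc_state->is_last_chunk &&
    len >= lex->input_length)
{
    appendBinaryStringInfo(&lex->inc_state->partial_token,
                           lex->token_start,
                           lex->input + lex->input_length - lex->token_start);
    return JSON_INCOMPLETE;
}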

I'm about to context-switch off of this for the day, but I can work on
a patch tomorrow if that'd be helpful. It looks like this is not the
only call to appendStringInfoString().

--Jacob




Re: LogwrtResult contended spinlock

2024-02-21 Thread Jeff Davis
On Wed, 2024-02-21 at 20:00 +0530, Bharath Rupireddy wrote:
> The local copy of LogwrtResult is so frequently used in the backends,
> if we were to replace it with atomic accesses, won't the atomic reads
> be costly and start showing up in perf profiles?

I don't see exactly where the extra cost would come from. What is your
concern?

In functions that access it several times, it may make sense to copy it
to a function-local variable, of course. But having a global non-shared
LogwrtResult doesn't seem particularly useful to me.

>  In any case, I prefer
> discussing this separately once we get to a conclusion on converting
> shared memory Write and Flush ptrs to atomics.

I agree that it shouldn't block the earlier patches, but it seems on-
topic for this thread.

> 
> 1. I guess we need to initialize the new atomics with
> pg_atomic_init_u64 initially in XLOGShmemInit:

Thank you.

> 2. I guess we need to update both the Write and Flush local copies in
> AdvanceXLInsertBuffer.

I agree. Whenever we update the non-shared LogwrtResult, let's update
the whole thing. Otherwise it's confusing.

> 3.
> @@ -6371,7 +6373,9 @@ GetFlushRecPtr(TimeLineID *insertTLI)
>  {
>  Assert(XLogCtl->SharedRecoveryState == RECOVERY_STATE_DONE);
> 
> +    pg_read_barrier();
>  LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
>  LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);
> 
> Do we need a read barrier here to not reorder things when more than
> one process is accessing the flush and write ptrs?

The above seems wrong: we don't want to reorder a load of Write before
a load of Flush, otherwise it could look like Write < Flush.
(Similarly, we never want to reorder a store of Flush before a store of
Write.) So I think we should have another read barrier between those
two atomic reads.

I also think we need the read barrier at the beginning because the
caller doesn't expect a stale value of the Flush pointer. It might not
be strictly required for correctness, because I don't think that stale
value can be inconsistent with anything else, but getting an up-to-date
value is better.
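
In other words, something like this (a sketch of the ordering I mean,
not the exact patch):

/* Two read barriers keep the loads ordered, so the invariant
 * Write >= Flush cannot appear violated to the reader. */
pg_read_barrier();      /* don't consume a stale Flush value */
LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
pg_read_barrier();      /* load Flush before Write */
LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);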

>  If at all, a read
> barrier is warranted here, we can use atomic read with full barrier

I don't think we need a full barrier but I'm fine with using
pg_atomic_read_membarrier_u64() if it's better for whatever reason.


> +    pg_atomic_write_u64(&XLogCtl->LogwrtResult.Write, EndOfLog);
> +    pg_atomic_write_u64(&XLogCtl->LogwrtResult.Flush, EndOfLog);
>  XLogCtl->LogwrtRqst.Write = EndOfLog;
> 
> pg_atomic_write_u64 here seems fine to me as no other process is
> active writing WAL yet, otherwise, we need write with full barrier

I don't see any memory ordering hazard. In any case, there's an
LWLockAcquire a few lines later, which acts as a full barrier.

> something like pg_write_barrier + pg_atomic_write_u64 or
> pg_atomic_exchange_u64 like in LWLockUpdateVar or use atomic write
> with full barrier semantics as proposed here (when that's in) -
> https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13
> .
> 
> 5. I guess we'd better use pg_atomic_read_u64_impl and
> pg_atomic_compare_exchange_u64_impl in
> pg_atomic_monotonic_advance_u64
> to reduce one level of function indirections.

Aren't they inlined?

> 6.
> + * Full barrier semantics (even when value is unchanged).
> 
> +    currval = pg_atomic_read_u64(ptr);
> +    if (currval >= target_)
> +    {
> +    pg_memory_barrier();
> 
> Perhaps, we can use atomic read with full barrier semantics proposed
> here -
> https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13

I don't think they are exactly equivalent: in the current patch, the
first pg_atomic_read_u64() could be reordered with earlier reads;
whereas that wouldn't work if using pg_atomic_read_membarrier_u64() it
could not be. I'm not sure whether that could create a performance
problem or not.
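
For reference, here is roughly how I read the helper from the quoted
fragments - a reconstruction for discussion, not the patch's exact code:

static inline void
pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
{
    uint64      currval;

    currval = pg_atomic_read_u64(ptr);
    if (currval >= target_)
    {
        /* full barrier semantics, even when the value is unchanged */
        pg_memory_barrier();
        return;
    }

    while (!pg_atomic_compare_exchange_u64(ptr, &currval, target_))
    {
        if (currval >= target_)
        {
            /* somebody else advanced it past the target */
            pg_memory_barrier();
            return;
        }
        /* currval was refreshed by the failed exchange; retry */
    }
    /* a successful compare-exchange implies a full barrier */
}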

> Just for the sake of completeness, do we need
> pg_atomic_monotonic_advance_u32 as well?

+1.

> 8. I'd split the addition of these new monotonic functions into 0001,
> then 0002 adding XLogwrtAtomic and 0003 adding Copy to XLogwrtAtomic.

+1

> 9.
> +    copyptr = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Copy);
> +    if (startptr + count > copyptr)
> +    ereport(WARNING,
> +    (errmsg("request to read past end of generated WAL; request %X/%X, current position %X/%X",
> +    LSN_FORMAT_ARGS(startptr + count), LSN_FORMAT_ARGS(copyptr))));
> 
> Any specific reason for this to be a WARNING rather than an ERROR?

Good question. WaitXLogInsertionsToFinish() uses a LOG level message
for the same situation. They should probably be the same log level, and
I would think it would be either PANIC or WARNING. I have no idea why
LOG was chosen.

> I've addressed some of my review comments (#1, #5, #7 and #8) above,
> merged the v11j-0002-fixup.patch into 0002, ran pgindent. Please see
> the attached v12 patch set. FWIW, CF bot is happy with these 

RE: Popcount optimization using AVX512

2024-02-21 Thread Amonson, Paul D
Hi,

I am encountering a problem that I don't think I understand. I cannot get the 
MSVC build to link in CI. I added 2 files to the build, but the linker is 
complaining that a symbol from the original pg_bitutils.c file is missing 
(specifically 'pg_popcount'). To my knowledge my changes did not change linking 
for the offending file, and I see pg_bitutils.c being compiled in all 3 libs in 
the build. All other builds are compiling.

Any help on this issue would be greatly appreciated.

My fork is at https://github.com/paul-amonson/postgresql/tree/popcnt_patch and 
the CI build is at https://cirrus-ci.com/task/4927666021728256.

Thanks,
Paul

-Original Message-
From: Andres Freund  
Sent: Monday, February 12, 2024 12:37 PM
To: Amonson, Paul D 
Cc: Alvaro Herrera ; Shankaran, Akash 
; Nathan Bossart ; Noah 
Misch ; Tom Lane ; Matthias van de Meent 
; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512

Hi,

On 2024-02-12 20:14:06 +, Amonson, Paul D wrote:
> > > +# Check for header immintrin.h
> > ...
> > Do these all actually have to link?  Invoking the linker is slow.
> > I think you might be able to just use cc.has_header_symbol().
>
> I took this to mean the last of the 3 new blocks.

Yep.


> > Does this work with msvc?
>
> I think it will work but I have no way to validate it. I propose we remove 
> the AVX-512 popcount feature from MSVC builds. Sound ok?

CI [1] would be able to test at least building. Including via cfbot, 
automatically run for each commitfest entry - you can see prior runs at [2].  
They run on Zen 3 epyc instances, so unfortunately runtime won't be tested.  If 
you look at [3], you can see that currently it doesn't seem to be considered 
supported at configure time:

...
[00:23:48.480] Checking if "__get_cpuid" : links: NO
[00:23:48.480] Checking if "__cpuid" : links: YES
...
[00:23:48.492] Checking if "x86_64: popcntq instruction" compiles: NO ...

Unfortunately CI currently is configured to not upload the build logs if the 
build succeeds, so we don't have enough details to see why.


> > This will build all of pgport with the avx flags, which wouldn't be 
> > correct, I think? The compiler might inject automatic uses of avx512 in 
> > places, which would cause problems, no?
>
> This will take me some time to learn how to do this in meson. Any 
> pointers here would be helpful.

Should be fairly simple, add it to the replace_funcs_pos and add the relevant 
cflags to pgport_cflags, similar to how it's done for crc.


> > While you don't do the same for make, isn't even just using the avx512 for 
> > all of pg_bitutils.c broken for exactly that reson? That's why the existing 
> > code builds the files for various crc variants as their own file.
>
> I don't think it's broken, nothing else in pg_bitutils.c will make use
> of AVX-512

You can't really guarantee that compiler auto-vectorization won't decide to do 
so, no? I wouldn't call it likely, but it's also hard to be sure it won't 
happen at some point.


> If splitting still makes sense, I propose splitting into 3 files:  
> pg_bitutils.c (entry point +sw popcnt implementation), pg_popcnt_choose.c 
> (CPUID and xgetbv check) and pg_popcnt_x86_64_accel.c (64/512bit x86 
> implementations).
> I'm not an expert in meson, but splitting might add complexity to meson.build.
>
> Could you elaborate if there are other benefits to the split file approach?

It won't lead to SIGILLs ;)

Greetings,

Andres Freund


[1] https://github.com/postgres/postgres/blob/master/src/tools/ci/README
[2] 
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F47%2F4675
[3] https://cirrus-ci.com/task/5645112189911040




Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)

2024-02-21 Thread Dean Rasheed
On Tue, 20 Feb 2024 at 14:49, Dean Rasheed  wrote:
>
> On the face of it, the simplest fix is to tweak is_simple_union_all()
> to prevent UNION ALL subquery pullup for MERGE, forcing a
> subquery-scan plan. A quick test shows that that fixes the reported
> issue.
>
> However, that leaves the question of whether we should do the same for
> UPDATE and DELETE.
>

Attached is a patch that prevents UNION ALL subquery pullup in MERGE only.

I've re-used and extended the isolation test cases added by
1d5caec221, since it's clear that replacing the plain source relation
in those tests with a UNION ALL subquery that returns the same results
should produce the same end result. (Without this patch, the UNION ALL
subquery is pulled up, EPQ rechecking fails to re-find the match, and
a WHEN NOT MATCHED THEN INSERT action is executed instead, resulting
in a primary key violation.)

It's still not quite clear whether preventing UNION ALL subquery
pullup should also apply to UPDATE and DELETE, but I wasn't able to
find any live bug there, so I think they're best left alone.

This fixes the reported issue, though it's worth noting that
concurrent WHEN NOT MATCHED THEN INSERT actions will still lead to
duplicate rows being inserted, which is a limitation that is already
documented [1].

[1] https://www.postgresql.org/docs/current/transaction-iso.html

Regards,
Dean
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index aa83dd3..50ebdd2 100644
--- a/src/backend/optimizer/prep/prepjointree.c
+++ b/src/backend/optimizer/prep/prepjointree.c
@@ -102,7 +102,7 @@ static bool is_simple_values(PlannerInfo
 static Node *pull_up_constant_function(PlannerInfo *root, Node *jtnode,
 	   RangeTblEntry *rte,
 	   AppendRelInfo *containing_appendrel);
-static bool is_simple_union_all(Query *subquery);
+static bool is_simple_union_all(PlannerInfo *root, Query *subquery);
 static bool is_simple_union_all_recurse(Node *setOp, Query *setOpQuery,
 		List *colTypes);
 static bool is_safe_append_member(Query *subquery);
@@ -849,7 +849,7 @@ pull_up_subqueries_recurse(PlannerInfo *
 		 * in set_append_rel_pathlist, not here.)
 		 */
 		if (rte->rtekind == RTE_SUBQUERY &&
-			is_simple_union_all(rte->subquery))
+			is_simple_union_all(root, rte->subquery))
 			return pull_up_simple_union_all(root, jtnode, rte);
 
 		/*
@@ -1888,8 +1888,9 @@ pull_up_constant_function(PlannerInfo *r
  * same datatypes.
  */
 static bool
-is_simple_union_all(Query *subquery)
+is_simple_union_all(PlannerInfo *root, Query *subquery)
 {
+	Query	   *parse = root->parse;
 	SetOperationStmt *topop;
 
 	/* Let's just make sure it's a valid subselect ... */
@@ -1910,6 +1911,21 @@ is_simple_union_all(Query *subquery)
 		subquery->cteList)
 		return false;
 
+	/*
+	 * UPDATE/DELETE/MERGE is also a problem, because preprocess_rowmarks()
+	 * will add a rowmark to the UNION ALL subquery, to ensure that it returns
+	 * the same row if rescanned by EvalPlanQual.  Allowing the subquery to be
+	 * pulled up would render that rowmark ineffective.
+	 *
+	 * In practice, however, this seems to only be a problem for MERGE, which
+	 * allows the subquery to appear under an outer join with the target
+	 * relation --- such an outer join might output multiple not-matched rows
+	 * in addition to the required matched row, breaking EvalPlanQual's
+	 * expectation that rerunning the query returns just one row.
+	 */
+	if (parse->commandType == CMD_MERGE)
+		return false;
+
 	/* Recursively check the tree of set operations */
 	return is_simple_union_all_recurse((Node *) topop, subquery,
 	   topop->colTypes);
diff --git a/src/test/isolation/expected/merge-join.out b/src/test/isolation/expected/merge-join.out
index 57f048c..6cf045a 100644
--- a/src/test/isolation/expected/merge-join.out
+++ b/src/test/isolation/expected/merge-join.out
@@ -146,3 +146,151 @@ id|val
  3| 30
 (3 rows)
 
+
+starting permutation: b1 b2 m1 hj exu m2u c1 c2 s1
+step b1: BEGIN ISOLATION LEVEL READ COMMITTED;
+step b2: BEGIN ISOLATION LEVEL READ COMMITTED;
+step m1: MERGE INTO tgt USING src ON tgt.id = src.id
+ WHEN MATCHED THEN UPDATE SET val = src.val
+ WHEN NOT MATCHED THEN INSERT VALUES (src.id, src.val);
+step hj: SET LOCAL enable_mergejoin = off; SET LOCAL enable_nestloop = off;
+step exu: EXPLAIN (verbose, costs off)
+   MERGE INTO tgt USING (SELECT * FROM src
+ UNION ALL
+ SELECT * FROM src2) src ON tgt.id = src.id
+ WHEN MATCHED THEN UPDATE SET val = src.val
+ WHEN NOT MATCHED THEN INSERT VALUES (src.id, src.val);
+QUERY PLAN   
+-
+Merge on public.tgt  
+  ->  Hash Left Join 
+Output: tgt.ctid, src.val, src.id, src.* 
+

Re: to_regtype() Raises Error

2024-02-21 Thread David E. Wheeler
On Feb 21, 2024, at 11:54 AM, David Wheeler  wrote:

> Merged this change into the [to_regtypemod 
> patch](https://commitfest.postgresql.org/47/4807/), which has exactly the 
> same issue.
> 
> The new status of this patch is: Needs review

Bah, withdrawn.

D





Re: to_regtype() Raises Error

2024-02-21 Thread David Wheeler
Merged this change into the [to_regtypemod 
patch](https://commitfest.postgresql.org/47/4807/), which has exactly the same 
issue.

The new status of this patch is: Needs review


Re: Patch: Add parse_type Function

2024-02-21 Thread David E. Wheeler
On Feb 21, 2024, at 11:18, Erik Wienhold  wrote:

> Thanks.  But it's an applefile again, not a patch :P

WTF. I still have that feature disabled.

Oh, I think I deleted the file after dragging it into Mail but before sending, 
because it’s empty everywhere I look. 🤦‍♂️ Let’s try that again.

Best,

David


v8-0001-Add-to_regtypemod-SQL-function.patch
Description: Binary data


Re: Patch: Add parse_type Function

2024-02-21 Thread Erik Wienhold
On 2024-02-21 16:14 +0100, David E. Wheeler wrote:
>
> V8 attached.

Thanks.  But it's an applefile again, not a patch :P

-- 
Erik




Re: When extended query protocol ends?

2024-02-21 Thread Tom Lane
Vladimir Sitnikov  writes:
>> Would performance suffer that much?

> I have not benchmarked it much, however, the driver sends "autosave"
> queries once (savepoint) or twice(savepoint+release) for every
> user-provided query.
> If we use extended queries (parse+bind+exec) for every savepoint, that
> would result in 3 or 6 messages overhead for every user query.

Those should get bundled into single TCP messages, though.  Assuming
that that's done properly, I share the doubt that you're saving
anything very meaningful.

regards, tom lane




Re: When extended query protocol ends?

2024-02-21 Thread Vladimir Sitnikov
>Would performance suffer that much?

I have not benchmarked it much, however, the driver sends "autosave"
queries once (savepoint) or twice(savepoint+release) for every
user-provided query.
If we use extended queries (parse+bind+exec) for every savepoint, that
would result in 3 or 6 messages overhead for every user query.

From many measurements we know that insert into table(id, name)
values(?,?),(?,?),(?,?) is much more efficient than
sending individual bind-exec-bind-exec-bind-exec-sync messages like "insert
into table(id, name) values(?,?)"
For instance, here are some measurements:
https://www.baeldung.com/spring-jdbc-batch-inserts#performance-comparisons
Based on those measurements, I assume there's a non-trivial per-message
overhead.

Vladimir


Re: When extended query protocol ends?

2024-02-21 Thread Jelte Fennema-Nio
On Wed, 21 Feb 2024 at 16:35, Vladimir Sitnikov
 wrote:
> We can't send complete parse-bind-execute commands for every "savepoint" call 
> as it would hurt performance.

Would performance suffer that much? I'd expect the simple and extended
protocol to have roughly the same overhead for simple queries without
arguments such SAVEPOINT.




Re: When extended query protocol ends?

2024-02-21 Thread Vladimir Sitnikov
>Is it possible for the JDBC
>driver to issue a Sync message before sending SAVEPOINT in simple
>query protocol?

Apparently, sending an extra message would increase the overhead of the
protocol, thus reducing the efficiency of the application.
What is the benefit of sending an extra Sync?

https://www.postgresql.org/docs/current/protocol-overview.html#PROTOCOL-MESSAGE-CONCEPTS
suggests that it is fine to mix both simple and extended messages
depending on the needs of the application.

https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-PIPELINING
reads that clients can omit sending Sync to make error handling work the way
they like.
I am not that sure we must omit Sync there; however, it might be the case.

https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY
reads "simple Query message also destroys the unnamed statement" and
"simple Query message also destroys the unnamed portal"

>Or you can send SAVEPOINT using the extended query protocol.

I am afraid we can't.
The purpose of savepoints at the driver's level is to enable migrating
applications from other databases to PostgreSQL.
In PostgreSQL any SQL exception fails the transaction, including errors
like "prepared statement \"...\" does not exist", and so on.
It might be surprising for users to suddenly get "prepared
statement is no longer valid" errors in case somebody adds a column to a
table.

We can't send complete parse-bind-execute commands for every "savepoint"
call as it would hurt performance.
We can't cache the parsed statement as it could be discarded by a random
"deallocate all".
So the only way out I see is to use simple query mode for savepoint queries.

Vladimir


Re: Patch: Add parse_type Function

2024-02-21 Thread David E. Wheeler
On Feb 20, 2024, at 21:09, Erik Wienhold  wrote:

> The references are printed as "???" right now.  Can be fixed with
> xreflabel="format_type" and xreflabel="to_regtype" on those 
> elements.

Thanks.

> The docs show parameter name "type" but pg_proc.data does not define
> proargnames.  So the to_regtypemod() cannot be called using named
> notation:
> 
> test=> select to_regtypemod(type => 'int'::text);
> ERROR:  function to_regtypemod(type => text) does not exist

Yes, this format is identical to that of to_regtype():

david=# select to_regtype(type => 'int'::text);
ERROR:  function to_regtype(type => text) does not exist

>> +Parses a string of text, extracts a potential type name from it, and
>> +translates its typmod, the modifier for the type, if any. Failure to
> 
> s/typmod, the modifier of the type/type modifier/
> 
> Because format_type() already uses "type modifier" and I think it helps
> searchability to use the same wording.

Yes, definitely better wording, thanks. V8 attached.

Best,

David



v8-0001-Add-to_regtypemod-SQL-function.patch
Description: application/applefile




Re: POC: GROUP BY optimization

2024-02-21 Thread Maxim Orlov
Hi!

Another issue with the test introduced in 0452b461bc405: I think it may be
unstable in some circumstances.
For example, if we try to use a different BLCKSZ. See, I've made a little
change in the number of tuples to be inserted:

$ git diff
diff --git a/src/test/regress/sql/aggregates.sql
b/src/test/regress/sql/aggregates.sql
index d6ed5d0eff..414078d4ec 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1187,7 +1187,7 @@ CREATE TABLE btg AS SELECT
   i % 100 AS y,
   'abc' || i % 10 AS z,
   i AS w
-FROM generate_series(1,1) AS i;
+FROM generate_series(1,11900) AS i;
 CREATE INDEX btg_x_y_idx ON btg(x,y);
 ANALYZE btg;

And bulk extension kicks in, so we get zeroed pages in the relation.
The plan also changes,
switching from an index scan to a seq scan:
@@ -2734,7 +2734,7 @@
   i % 100 AS y,
   'abc' || i % 10 AS z,
   i AS w
-FROM generate_series(1,1) AS i;
+FROM generate_series(1,11900) AS i;
 CREATE INDEX btg_x_y_idx ON btg(x,y);
 ANALYZE btg;
 -- GROUP BY optimization by reorder columns by frequency
@@ -2760,62 +2760,57 @@

 -- Engage incremental sort
 explain (COSTS OFF) SELECT x,y FROM btg GROUP BY x,y,z,w;
-   QUERY PLAN
--
+  QUERY PLAN
+--
  Group
Group Key: x, y, z, w
-   ->  Incremental Sort
+   ->  Sort
  Sort Key: x, y, z, w
- Presorted Key: x, y
- ->  Index Scan using btg_x_y_idx on btg
-(6 rows)
+ ->  Seq Scan on btg
+(5 rows)
... and so on.

So, my proposal is simple. I think we need not just "ANALYZE btg", but
"VACUUM ANALYZE btg", to get rid of zeroed pages in this particular
case. PFA corresponding patch.

-- 
Best regards,
Maxim Orlov.


0001-Force-VACUUM-to-avoid-zeroed-pages-from-bulk-extensi.patch
Description: Binary data


Improve readability by using designated initializers when possible

2024-02-21 Thread Jelte Fennema-Nio
Usage of designated initializers came up in:
https://www.postgresql.org/message-id/flat/ZdWXhAt9Tz4d-lut%40paquier.xyz#9dc17e604e58569ad35643672bf74acc

This converts all arrays that I could find that could clearly benefit
from this without any other code changes being necessary.

There were a few arrays that I didn't convert that seemed like they
could be useful to convert, but where the variables started counting
at 1. So by converting them, the array would grow and elements
would be shifted by one. Changing those might be nice, but would
require some more code changes so I didn't want to combine it with
these simpler refactors. The arrays I'm talking about were
specifically tsearch_op_priority, BT_implies_table, BT_refutes_table,
and BT_implic_table.
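
For anyone unfamiliar with the syntax, the conversion looks like this
(a made-up miniature example, not one of the arrays in the patch):

typedef enum FruitKind { FRUIT_APPLE, FRUIT_PEAR, FRUIT_PLUM } FruitKind;

/* before: correctness silently depends on matching the enum order */
static const char *const fruit_names_old[] = {
    "apple",
    "pear",
    "plum",
};

/* after: each element is explicitly tied to its enum value */
static const char *const fruit_names[] = {
    [FRUIT_APPLE] = "apple",
    [FRUIT_PEAR] = "pear",
    [FRUIT_PLUM] = "plum",
};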


v1-0001-Use-designated-initializer-syntax-to-improve-read.patch
Description: Binary data


Re: WIP Incremental JSON Parser

2024-02-21 Thread Jacob Champion
On Tue, Feb 20, 2024 at 9:32 PM Andrew Dunstan  wrote:
> *sigh* That's weird. I wonder why you can reproduce it and I can't. Can
> you give me details of the build? OS, compiler, path to source, build
> setup etc.? Anything that might be remotely relevant.

Sure:
- guest VM running in UTM (QEMU 7.2) is Ubuntu 22.04 for ARM, default
core count, 8GB
- host is macOS Sonoma 14.3.1, Apple Silicon (M3 Pro), 36GB
- it's a Meson build (plus a diff for the test utilities, attached,
but that's hopefully not relevant to the failure)
- buildtype=debug, optimization=g, cassert=true
- GCC 11.4
- build path is nested a bit (~/src/postgres/worktree-inc-json/build-dev)

--Jacob
commit 0075a88beec160cbb408d9a1e0a11d836fb55bdf
Author: Jacob Champion 
Date:   Wed Feb 21 06:36:55 2024 -0800

WIP: mesonify

diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 8fbe742d38..e5c9bd10cf 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -21,6 +21,7 @@ subdir('test_dsm_registry')
 subdir('test_extensions')
 subdir('test_ginpostinglist')
 subdir('test_integerset')
+subdir('test_json_parser')
 subdir('test_lfind')
 subdir('test_misc')
 subdir('test_oat_hooks')
diff --git a/src/test/modules/test_json_parser/meson.build b/src/test/modules/test_json_parser/meson.build
new file mode 100644
index 00..42eb670864
--- /dev/null
+++ b/src/test/modules/test_json_parser/meson.build
@@ -0,0 +1,39 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+test_json_parser_incremental_sources = files(
+  'test_json_parser_incremental.c',
+)
+
+if host_system == 'windows'
+  test_json_parser_incremental_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+'--NAME', 'test_json_parser_incremental',
+'--FILEDESC', 'standalone json parser tester',
+  ])
+endif
+
+test_json_parser_incremental = executable('test_json_parser_incremental',
+  test_json_parser_incremental_sources,
+  dependencies: [frontend_code],
+  kwargs: default_bin_args + {
+'install': false,
+  },
+)
+
+test_json_parser_perf_sources = files(
+  'test_json_parser_perf.c',
+)
+
+if host_system == 'windows'
+  test_json_parser_perf_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+'--NAME', 'test_json_parser_perf',
+'--FILEDESC', 'standalone json parser tester',
+  ])
+endif
+
+test_json_parser_perf = executable('test_json_parser_perf',
+  test_json_parser_perf_sources,
+  dependencies: [frontend_code],
+  kwargs: default_bin_args + {
+'install': false,
+  },
+)


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-02-21 Thread Bharath Rupireddy
On Wed, Feb 21, 2024 at 5:55 PM Bertrand Drouvot
 wrote:
>
> > As far as the 0001 patch is concerned, it reports the
> > invalidation_reason as long as slot_contents.data.invalidated !=
> > RS_INVAL_NONE. I think this is okay.
> >
> > Thoughts?
>
> Yeah, looking at the code I agree that looks ok. OTOH, that looks confusing,
> maybe we should add a few words about it in the doc?

I'll think about it.

> Looking at v5-0001:
>
> +  
> +   invalidation_reason text
> +  
> +  
>
> My initial thought was to put a "conflict" value in this new field in case of
> conflict (without mentioning the conflict reason in it). With the current proposal
> invalidation_reason could report the same as conflict_reason, which sounds
> weird to me.
>
> Does that make sense to you to use "conflict" as the value in "invalidation_reason"
> when the slot has "conflict_reason" not NULL?

I'm thinking the other way around - how about we revert
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=007693f2a3ac2ac19affcb03ad43cdb36ccff5b5,
that is, put in place "conflict" as a boolean and introduce
invalidation_reason as the text form. So, for logical slots, whenever the
"conflict" column is true, the reason is found in the invalidation_reason
column? How does it sound? Again the debate might be "conflict" vs
"invalidation", but that looks clean IMHO.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: LogwrtResult contended spinlock

2024-02-21 Thread Bharath Rupireddy
On Sat, Feb 17, 2024 at 2:24 AM Jeff Davis  wrote:
>
> Though it looks like we can remove the non-shared LogwrtResult
> entirely. Andres expressed some concern here:
>
> https://www.postgresql.org/message-id/20210130020211.rtu5ir3dpjrbi...@alap3.anarazel.de
>
> But then seemed to favor removing it here:
>
> https://www.postgresql.org/message-id/20240213001150.4uqzh7tinuhvo...@awork3.anarazel.de
>
> I'm inclined to think we can get rid of the non-shared copy.

The local copy of LogwrtResult is so frequently used in the backends,
if we were to replace it with atomic accesses, won't the atomic reads
be costly and start showing up in perf profiles? In any case, I prefer
discussing this separately once we get to a conclusion on converting
shared memory Write and Flush ptrs to atomics.

> A few other comments:
>
>  * Should GetFlushRecPtr()/GetXLogWriteRecPtr() use a read memory
> barrier?
>  * Why did you add pg_memory_barrier() right before a spinlock
> acquisition?
>  * Is it an invariant that Write >= Flush at all times? Are there
> guaranteed to be write barriers in the right place to ensure that?

I'll continue to think about these points.

> I would also like it if we could add a new "Copy" pointer indicating
> how much WAL data has been copied to the WAL buffers. That would be set
> by WaitXLogInsertionsToFinish() so that subsequent calls are cheap.
> Attached a patch (0003) for illustration purposes. It adds to the size
> of XLogCtlData, but it's fairly large already, so I'm not sure if
> that's a problem. If we do add this, there would be an invariant that
> Copy >= Write at all times.

Thanks. I have a few comments on v11j patches.

1. I guess we need to initialize the new atomics with
pg_atomic_init_u64 initially in XLOGShmemInit:

2. I guess we need to update both the Write and Flush local copies in
AdvanceXLInsertBuffer. Overall, I guess we need to update both the
Write and Flush local copies whenever we previously read LogwrtResult,
no? Missing to do that caused regression tests to fail and since we
were able to catch it, we ended up with v11j-0002-fixup.patch. There
might be cases where we aren't able to catch if we didn't update both
Write and Flush local copies. I'd suggest we just wrap both of these
under a macro:

 LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
 LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);

and just use it wherever we did LogwrtResult = XLogCtl->LogwrtResult;
previously but not under spinlock.
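
i.e. something like this (the macro name is just a placeholder):

/* Placeholder sketch of the suggested wrapper macro */
#define REFRESH_LOCAL_LOGWRT_RESULT() \
    do { \
        LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush); \
        LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write); \
    } while (0)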

3.
@@ -6371,7 +6373,9 @@ GetFlushRecPtr(TimeLineID *insertTLI)
 {
 Assert(XLogCtl->SharedRecoveryState == RECOVERY_STATE_DONE);

+pg_read_barrier();
 LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush);
+LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write);

Do we need a read barrier here to not reorder things when more than
one process is accessing the flush and write ptrs? If at all, a read
barrier is warranted here, we can use atomic read with full barrier
sematices as proposed here (when that's in) -
https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13.

4.
+/*
+ * Update local and shared status.  This is OK to do without any locks
+ * because no other process can be reading or writing WAL yet.
+ */
 LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;

+pg_atomic_write_u64(&XLogCtl->LogwrtResult.Write, EndOfLog);
+pg_atomic_write_u64(&XLogCtl->LogwrtResult.Flush, EndOfLog);
 XLogCtl->LogwrtRqst.Write = EndOfLog;

pg_atomic_write_u64 here seems fine to me as no other process is
actively writing WAL yet; otherwise, we'd need a write with full barrier
semantics, something like pg_write_barrier + pg_atomic_write_u64 or
pg_atomic_exchange_u64 like in LWLockUpdateVar, or an atomic write
with full barrier semantics as proposed here (when that's in) -
https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13.

5. I guess we'd better use pg_atomic_read_u64_impl and
pg_atomic_compare_exchange_u64_impl in pg_atomic_monotonic_advance_u64
to reduce one level of function indirection. Of course, we then need the
pointer alignment checks that pg_atomic_compare_exchange_u64 does in the
new monotonic function.

6.
+ * Full barrier semantics (even when value is unchanged).

+currval = pg_atomic_read_u64(ptr);
+if (currval >= target_)
+{
+pg_memory_barrier();

Perhaps, we can use atomic read with full barrier semantics proposed
here -
https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13

7.
+static inline void
+pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)

Just for the sake of completeness, do we need
pg_atomic_monotonic_advance_u32 as well?

8. I'd split the addition of these new monotonic functions into 0001,
then 0002 adding XLogwrtAtomic and 0003 adding Copy to XLogwrtAtomic.

9.
+copyptr = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Copy);
+if (startptr + count > copyptr)
+ereport(WARNING,
+

Re: Shared detoast Datum proposal

2024-02-21 Thread Andy Fan



I saw your reply as I started to write a more high-level
document. Thanks for the step-by-step help!

Tomas Vondra  writes:

> On 2/20/24 19:38, Andy Fan wrote:
>> 
>> ...
>>
>>> I think it should be done the other
>>> way around, i.e. the patch should introduce the main feature first
>>> (using the traditional Bitmapset), and then add Bitset on top of that.
>>> That way we could easily measure the impact and see if it's useful.
>> 
>> Actually v4 used the Bitmapset, and then both perf and pgbench's tps
>> indicated it was too expensive. After talking with David at [2], I
>> introduced bitset and used it here. The test case I used comes from [1].
>> IIRC, there was a 5% performance difference because of this.
>> 
>> create table w(a int, b numeric);
>> insert into w select i, i from generate_series(1, 100)i;
>> select b from w where b > 0;
>> 
>> To reproduce the difference, we can replace the bitset_clear() with
>> 
>> bitset_free(slot->pre_detoasted_attrs);
>> slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);
>> 
>> in ExecFreePreDetoastDatum. then it works same as Bitmapset.
>> 
>
> I understand the bitset was not introduced until v5, after noticing the
> bitmapset is not quite efficient. But I still think the patches should
> be the other way around, i.e. the main feature first, then the bitset as
> an optimization.
>
> That allows everyone to observe the improvement on their own (without
> having to tweak the code), and it also doesn't require commit of the
> bitset part before it gets actually used by anything.

I'm starting to think this is the better way rather than the opposite. The next
version will be:

0001: shared detoast datum feature with a high-level introduction.
0002: introduce bitset and use it in the shared-detoast-datum feature, with the
test case to show the improvement.

>> I will do more testing on the memory leak stuff, since there are so many
>> operations against slots like ExecCopySlot etc, I don't know how to test it
>> fully. the method in my mind now is to use TPCH with 10GB data size, and
>> monitor the query runtime memory usage.
>> 
>
> I think this is exactly the "high level design" description that should
> be in a comment, somewhere.

Got it.

>>> Of course, expanded datums are
>>> not meant to be long-lived, while "shared detoasted values" are meant to
>>> exist (potentially) for the query duration.
>> 
>> hmm, actually the "shared detoast value" just lives in the
>> TupleTableSlot->tts_values[*], rather than for the whole query duration. The
>> simple case is:
>> 
>> SELECT * FROM t WHERE a_text LIKE 'abc%';
>> 
>> when we scan to the next tuple, the detoast value for the previous tuple
>> will be released.
>> 
>
> But if the (detoasted) value is passed to the next executor node, it'll
> be kept, right?

Yes and only one copy for all the executor nodes.

> Unrelated question - could this actually end up being too aggressive?
> That is, could it detoast attributes that we end up not needing? For
> example, what if the tuple gets eliminated for some reason (e.g. because
> of a restriction on the table, or not having a match in a join)? Won't
> we detoast the tuple only to throw it away?

The detoast datum will have exactly the same lifespan as the other
tts_values[*]. If the tuple gets eliminated for any reason, those detoast
datums still exist until the slot is cleared to store the next tuple.

>> typedef struct Join
>> {
>> ..
>>  /*
>>   * Records of var's varattno - 1 where the Var is accessed indirectly by
>>   * any expression, like a > 3.  However a IS [NOT] NULL is not included
>>   * since it doesn't access the tts_values[*] at all.
>>   *
>>   * This is essential information to figure out which attrs should use
>>   * the pre-detoast-attrs logic.
>>   */
>>  Bitmapset  *outer_reference_attrs;
>>  Bitmapset  *inner_reference_attrs;
>> } Join;
>> 
>
> Is it actually necessary to add new fields to these nodes? Also, the
> names are not particularly descriptive of the purpose - it'd be better
> to have "detoast" in the name, instead of generic "reference".

Because of the way the data is transformed, I think we must add the
fields to keep such information. This information is then used to
initialize the necessary information in PlanState. Maybe I have a
fixed mindset; I can't think of a way to avoid that right now.

I used 'reference' rather than 'detoast' because of some implementation
issues. In createplan.c and setref.c, I can't check the atttyplen
effectively, so even a Var with an int type is still held here, which may
have nothing to do with detoasting.

>> 
>> When I worked on the UniqueKey feature, I maintained a
>> UniqueKey.README to summarize all the topics discussed in the threads; the
>> README is designed to save effort for new reviewers. I think I
>> should apply the same logic for this feature.
>> 
>
> Good idea. Either that (a separate README), or a comment in a header of
> some suitable .c/.h file (I 

Re: Shared detoast Datum proposal

2024-02-21 Thread Tomas Vondra
On 2/20/24 19:38, Andy Fan wrote:
> 
> ...
>
>> I think it should be done the other
>> way around, i.e. the patch should introduce the main feature first
>> (using the traditional Bitmapset), and then add Bitset on top of that.
>> That way we could easily measure the impact and see if it's useful.
> 
> Actually v4 used the Bitmapset, and then both perf and pgbench's tps
> indicated it was too expensive. After talking with David at [2], I
> introduced bitset and used it here. The test case I used comes from [1].
> IIRC, there was a 5% performance difference because of this.
> 
> create table w(a int, b numeric);
> insert into w select i, i from generate_series(1, 100)i;
> select b from w where b > 0;
> 
> To reproduce the difference, we can replace the bitset_clear() with
> 
> bitset_free(slot->pre_detoasted_attrs);
> slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);
> 
> in ExecFreePreDetoastDatum. then it works same as Bitmapset.
> 

I understand the bitset was not introduced until v5, after noticing the
bitmapset is not quite efficient. But I still think the patches should
be the other way around, i.e. the main feature first, then the bitset as
an optimization.

That allows everyone to observe the improvement on their own (without
having to tweak the code), and it also doesn't require commit of the
bitset part before it gets actually used by anything.

> 
>> On the whole, my biggest concern is memory usage & leaks. It's not
>> difficult to already have problems with large detoasted values, and if
>> we start keeping more of them, that may get worse. Or at least that's my
>> intuition - it can't really get better by keeping the values longer, right?
>>
>> The other thing is the risk of leaks (in the sense of keeping detoasted
>> values longer than expected). I see the values are allocated in
>> tts_mcxt, and maybe that's the right solution - not sure.
> 
> about the memory usage, first it has the same lifespan as the
> tts_values[*], which can be released pretty quickly once certain
> values of the tuple are not needed. it is true that we keep the detoast
> version longer than before, but that's something we have to pay, I
> think.
> 
> Leaks may happen since tts_mcxt is reset at the end of the *executor*. So if
> we forget to release the memory when the tts_values[*] is invalidated
> somehow, the memory will be leaked until the end of the executor. I think
> that would be enough to cause an issue. Currently, besides releasing such
> memory in ExecClearTuple, I also release such memory whenever we set
> tts_nvalid to 0; the theory used here is:
> 
> /*
>  * tts_values is treated invalidated since tts_nvalid is set to 0, so
>  * let's free the pre-detoast datum.
>  */
> ExecFreePreDetoastDatum(slot);
> 
> I will do more testing on the memory leak stuff, since there are so many
> operations against slots like ExecCopySlot etc, I don't know how to test it
> fully. the method in my mind now is to use TPCH with 10GB data size, and
> monitor the query runtime memory usage.
> 

I think this is exactly the "high level design" description that should
be in a comment, somewhere.

> 
>> FWIW while looking at the patch, I couldn't help but to think about
>> expanded datums. There's similarity in what these two features do - keep
>> detoasted values for a while, so that we don't need to do the expensive
>> processing if we access them repeatedly.
> 
> Could you provide some keywords or function names for the expanded datum
> here? I probably missed this.
> 

see src/include/utils/expandeddatum.h

>> Of course, expanded datums are
>> not meant to be long-lived, while "shared detoasted values" are meant to
>> exist (potentially) for the query duration.
> 
> hmm, actually the "shared detoast value" just lives in the
> TupleTableSlot->tts_values[*], rather than for the whole query duration. The
> simple case is:
> 
> SELECT * FROM t WHERE a_text LIKE 'abc%';
> 
> when we scan to the next tuple, the detoast value for the previous tuple
> will be released.
> 

But if the (detoasted) value is passed to the next executor node, it'll
be kept, right?

>> But maybe there's something
>> we could learn from expanded datums? For example how the varlena pointer
>> is leveraged to point to the expanded object.
> 
> maybe. currently I just use detoast_attr to get the desired version. I'd
> be pleased if we had a more effective way.
> 
> if (!slot->tts_isnull[attnum] &&
> VARATT_IS_EXTENDED(slot->tts_values[attnum]))
> {
>   Datum   oldDatum;
>   MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
> 
>   oldDatum = slot->tts_values[attnum];
>   slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
>   (struct varlena *) oldDatum));
>   Assert(slot->tts_nvalid > attnum);
>   Assert(oldDatum != slot->tts_values[attnum]);
>   

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-02-21 Thread Bertrand Drouvot
Hi,

On Wed, Feb 21, 2024 at 10:55:00AM +0530, Bharath Rupireddy wrote:
> On Tue, Feb 20, 2024 at 12:05 PM Bharath Rupireddy
>  wrote:
> >
> >> [...] and was able to produce something like:
> > >
> > > postgres=# select 
> > > slot_name,slot_type,active,active_pid,wal_status,invalidation_reason from 
> > > pg_replication_slots;
> > >   slot_name  | slot_type | active | active_pid | wal_status | invalidation_reason
> > > -------------+-----------+--------+------------+------------+---------------------
> > >  rep1| physical  | f  || reserved   |
> > >  master_slot | physical  | t  |1482441 | unreserved | wal_removed
> > > (2 rows)
> > >
> > > does that make sense to have an "active/working" slot "invalidated"?
> >
> > Thanks. Can you please provide the steps to generate this error? Are
> > you setting max_slot_wal_keep_size on primary to generate
> > "wal_removed"?
> 
> I'm able to reproduce [1] the state [2] where the slot got invalidated
> first, then its wal_status became unreserved, but the slot is still
> serving after the standby comes back online and catches up with the
> primary by getting the WAL files from the archive. There's a good reason
> for this state -
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/replication/slotfuncs.c;h=d2fa5e669a32f19989b0d987d3c7329851a1272e;hb=ff9e1e764fcce9a34467d614611a34d4d2a91b50#l351.
> This intermittent state can only happen for physical slots, not for
> logical slots because logical subscribers can't get the missing
> changes from the WAL stored in the archive.
> 
> And, the fact looks to be that an invalidated slot can never become
> normal but still can serve a standby if the standby is able to catch
> up by fetching required WAL (this is the WAL the slot couldn't keep
> for the standby) from elsewhere (archive via restore_command).
> 
> As far as the 0001 patch is concerned, it reports the
> invalidation_reason as long as slot_contents.data.invalidated !=
> RS_INVAL_NONE. I think this is okay.
> 
> Thoughts?

Yeah, looking at the code I agree that looks ok. OTOH, that looks confusing,
maybe we should add a few words about it in the doc?

Looking at v5-0001:

+  
+   invalidation_reason text
+  
+  

My initial thought was to put a "conflict" value in this new field in case of
conflict (without mentioning the conflict reason in it). With the current proposal
invalidation_reason could report the same as conflict_reason, which sounds weird
to me.

Does that make sense to you to use "conflict" as the value in "invalidation_reason"
when the slot has "conflict_reason" not NULL?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: 'Shutdown <= SmartShutdown' check while launching processes in postmaster.

2024-02-21 Thread Bharath Rupireddy
On Wed, Feb 21, 2024 at 3:38 PM shveta malik  wrote:
>
> > Children that are stopped by the "if (pmState == PM_STOP_BACKENDS)"
> > stanza in PostmasterStateMachine should not be allowed to start
> > again later if we are trying to shut down.  (But "smart" shutdown
> > doesn't enforce that, since it's a very weak state that only
> > prohibits new client sessions.)  The processes that are allowed
> > to continue beyond that point are ones that are needed to perform
> > the shutdown checkpoint, or useful to make it finish faster.
>
> Thank you for providing the details. It clarifies the situation. Do
> you think it would be beneficial to include this as a code comment in
> postmaster.c to simplify understanding for future readers?

+1 for a note either before the StartChildProcess() or before the
PMState enum definition.
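
Something along these lines, perhaps; the wording below is only a sketch
summarizing Tom's explanation, not committed text:

/*
 * Children stopped by the PM_STOP_BACKENDS stanza in
 * PostmasterStateMachine() must not be relaunched once a fast or
 * immediate shutdown is in progress.  Only processes that are needed to
 * perform the shutdown checkpoint, or that help it finish faster, may
 * keep running beyond that point.  ("Smart" shutdown is weaker and only
 * prohibits new client sessions.)
 */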

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Injection points: some tools to wait and wake

2024-02-21 Thread Bertrand Drouvot
Hi,

On Wed, Feb 21, 2024 at 04:46:00PM +0900, Michael Paquier wrote:
> On Wed, Feb 21, 2024 at 07:08:03AM +0900, Michael Paquier wrote:
> > Well, both you and Andrey are asking for it now, so let's do it.  The
> > implementation is simple:
> > - Store in InjectionPointSharedState an array of wait_counts and an
> > array of names.  There is only one condition variable.
> > - When a point wants to wait, it takes the spinlock and looks within
> > the array of names until it finds a free slot, adds its name into the
> > array to reserve a wait counter at the same position, releases the
> > spinlock.  Then it loops on the condition variable for an update of
> > the counter it has reserved.  It is possible to make something more
> > efficient, but at a small size it would not really matter.
> > - The wakeup takes a point name in argument, acquires the spinlock,
> > and checks if it can find the point into the array, pinpoints the
> > location of the counter to update and updates it.  Then it broadcasts
> > the change.
> > - The wait loop checks its counter, leaves its loop, cancels the
> > sleep, takes the spinlock to unregister from the array, and leaves.
> > 
> > I would just hardcode the number of points that can wait, say 5 of
> > them tracked in shmem?  Does that look like what you are looking at?
> 
> I was looking at that, and it proves to work OK, so you can do stuff
> like waits and wakeups for multiple processes in a controlled manner.
> The attached patch authorizes up to 32 waiters.  I have switched
> things so as the information reported in pg_stat_activity is the name
> of the injection point itself.

Thanks!

I think the approach is fine and the hardcoded value is "large" enough (it would
be surprising, at least to me, to write a test that would need more than 32
waiters).
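
For readers following the design above, here is a minimal sketch of the
wait side. The struct layout and the names (inj_state, INJ_MAX_WAIT,
INJ_NAME_MAXLEN) are assumptions for illustration, not the actual patch:

/* inj_state points at this struct in shared memory, set up at shmem init */
typedef struct InjectionPointSharedState
{
	slock_t		lock;			/* protects the two arrays below */
	int			wait_counts[INJ_MAX_WAIT];	/* bumped by wakeup */
	char		name[INJ_MAX_WAIT][INJ_NAME_MAXLEN];	/* registered points */
	ConditionVariable wait_point;	/* single CV shared by all waiters */
} InjectionPointSharedState;

static void
injection_wait(const char *name)
{
	int			slot = -1;
	int			old_count = 0;

	/* Reserve a free slot in the array under the spinlock. */
	SpinLockAcquire(&inj_state->lock);
	for (int i = 0; i < INJ_MAX_WAIT; i++)
	{
		if (inj_state->name[i][0] == '\0')
		{
			strlcpy(inj_state->name[i], name, INJ_NAME_MAXLEN);
			old_count = inj_state->wait_counts[i];
			slot = i;
			break;
		}
	}
	SpinLockRelease(&inj_state->lock);

	if (slot < 0)
		return;					/* all slots taken, give up */

	/*
	 * Sleep until injection_points_wakeup() bumps our counter.  (A real
	 * implementation would re-read the counter under the spinlock.)
	 */
	ConditionVariablePrepareToSleep(&inj_state->wait_point);
	while (inj_state->wait_counts[slot] == old_count)
		ConditionVariableSleep(&inj_state->wait_point, WAIT_EVENT_EXTENSION);
	ConditionVariableCancelSleep();

	/* Unregister this injection wait name from the array. */
	SpinLockAcquire(&inj_state->lock);
	inj_state->name[slot][0] = '\0';
	SpinLockRelease(&inj_state->lock);
}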

A few comments:

1 ===

+-- Wakes a condition variable

I think "up" is missing at several places in the patch where "wake" is used.
I could be wrong as non native english speaker though.

2 ===

+   /* Counters advancing when injection_points_wakeup() is called */
+   int wait_counts[INJ_MAX_WAIT];

uint32? (here and other places where counter is manipulated).

3 ===

+   /* Remove us from the waiting list */

"Remove this injection wait name from the waiting list" instead?

4 ===

+ * SQL function for waking a condition variable.

s/a condition variable/an injection wait point/ ?

5 ===

+PG_FUNCTION_INFO_V1(injection_points_wakeup);
+Datum
+injection_points_wakeup(PG_FUNCTION_ARGS)

Empty line missing before "Datum"?

6 ===

Also, maybe some comments are missing above injection_point_init_state() and
injection_init_shmem(), but that's more of a nit.

7 ===

While at it, should we add a second injection wait point in
t/041_invalid_checkpoint_after_promote.pl to check that they are woken up
individually?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-02-21 Thread Amit Kapila
On Wed, Feb 21, 2024 at 12:15 PM shveta malik  wrote:
>
>
> Thanks for the feedback. I have addressed it in v93.
>

A few minor comments:
=
1.
+/*
+ * Is stopSignaled set in SlotSyncCtx?
+ */
+bool
+IsStopSignaledSet(void)
+{
+ bool signaled;
+
+ SpinLockAcquire(&SlotSyncCtx->mutex);
+ signaled = SlotSyncCtx->stopSignaled;
+ SpinLockRelease(&SlotSyncCtx->mutex);
+
+ return signaled;
+}
+
+/*
+ * Reset stopSignaled in SlotSyncCtx.
+ */
+void
+ResetStopSignaled(void)
+{
+ SpinLockAcquire(&SlotSyncCtx->mutex);
+ SlotSyncCtx->stopSignaled = false;
+ SpinLockRelease(&SlotSyncCtx->mutex);
+}

I think these newly introduced functions don't need the spinlock to be
acquired, as these are just one-byte reads and writes. Additionally, when
IsStopSignaledSet() is invoked, there shouldn't be any concurrent
process to update that value. What do you think?
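
In other words, something like this minimal sketch (assuming, per the
argument above, that no concurrent writer matters by the time these are
called; not the actual patch):

bool
IsStopSignaledSet(void)
{
	/* single-byte read, no spinlock needed */
	return SlotSyncCtx->stopSignaled;
}

void
ResetStopSignaled(void)
{
	/* single-byte write, no spinlock needed */
	SlotSyncCtx->stopSignaled = false;
}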

2.
+REPL_SLOTSYNC_MAIN "Waiting in main loop of slot sync worker."
+REPL_SLOTSYNC_SHUTDOWN "Waiting for slot sync worker to shut down."

Let's use REPLICATION instead of REPL. I see other wait events using
REPLICATION in their names.

3.
- * In standalone mode and in autovacuum worker processes, we use a fixed
- * ID, otherwise we figure it out from the authenticated user name.
+ * In standalone mode, autovacuum worker processes and slot sync worker
+ * process, we use a fixed ID, otherwise we figure it out from the
+ * authenticated user name.
*/
- if (bootstrap || IsAutoVacuumWorkerProcess())
+ if (bootstrap || IsAutoVacuumWorkerProcess() || IsLogicalSlotSyncWorker())
{
InitializeSessionUserIdStandalone();
am_superuser = true;

IIRC, we discussed this previously and it is safe to make the local
connection as superuser as we don't consult any user tables, so we can
probably add a comment where we invoke InitPostgres in slotsync.c

4.
$publisher->safe_psql('postgres',
- "CREATE PUBLICATION regress_mypub FOR ALL TABLES;");
+ "CREATE PUBLICATION regress_mypub FOR ALL TABLES;"
+);

Why this change is required in the patch?

5.
+# Confirm that restart_lsn and of confirmed_flush_lsn lsub1_slot slot
are synced
+# to the standby

/and of/; looks like a typo

6.
+# Confirm that restart_lsn and of confirmed_flush_lsn lsub1_slot slot
are synced
+# to the standby
+ok( $standby1->poll_query_until(
+ 'postgres',
+ "SELECT '$primary_restart_lsn' = restart_lsn AND
'$primary_flush_lsn' = confirmed_flush_lsn from pg_replication_slots
WHERE slot_name = 'lsub1_slot';"),
+ 'restart_lsn and confirmed_flush_lsn of slot lsub1_slot synced to standby');
+
...
...
+# Confirm the synced slot 'lsub1_slot' is retained on the new primary
+is($standby1->safe_psql('postgres',
+ q{SELECT slot_name FROM pg_replication_slots WHERE slot_name =
'lsub1_slot';}),
+ 'lsub1_slot',
+ 'synced slot retained on the new primary');

In both these checks, we should additionally check the 'synced' and
'temporary' flags to ensure that they are marked appropriately.

-- 
With Regards,
Amit Kapila.




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-02-21 Thread Michail Nikolaev
Hi!

> How do you suppose this would work differently from a long-lived
> normal snapshot, which is how it works right now?

The difference is the ability to take a new visibility snapshot
periodically during the second phase, while rechecking the visibility of
each tuple against the "reference" snapshot (which is taken only once,
like now). It is the approach from (1), but with a workaround for the
issues caused by heap_page_prune_opt.

> Would it be exclusively for that relation?
Yes, only for that affected relation. Other relations are unaffected.

> How would this be integrated with e.g. heap_page_prune_opt?
Probably by some flag in RelationData, but not sure here yet.

If the idea looks sane, I could try to extend my POC - it likely should
not be too hard (I already have tests to make sure it is correct).

(1): 
https://www.postgresql.org/message-id/flat/CANtu0oijWPRGRpaRR_OvT2R5YALzscvcOTFh-%3DuZKUpNJmuZtw%40mail.gmail.com#8141eb2ea177ff560ee713b3f20de404




Re: A problem about partitionwise join

2024-02-21 Thread Richard Guo
On Tue, Aug 2, 2022 at 4:24 AM Jacob Champion 
wrote:

> Once you think you've built up some community support and the patchset
> is ready for review, you (or any interested party) can resurrect the
> patch entry by visiting
>
> https://commitfest.postgresql.org/38/2266/
>
> and changing the status to "Needs Review", and then changing the
> status again to "Move to next CF". (Don't forget the second step;
> hopefully we will have streamlined this in the near future!)


This patch was returned due to 'lack of interest'.  However, upon
verification, it appears that the reported issue still exists, and the
proposed fix in the thread remains valid.  Hence, resurrect this patch
after rebasing it on master.  I've also written a detailed commit
message which hopefully can help people review the changes more
effectively.

Thanks
Richard


v7-0001-Fix-partitionwise-join-with-partially-redundant-join-clauses.patch
Description: Binary data


Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-02-21 Thread Matthias van de Meent
On Wed, 21 Feb 2024 at 09:35, Michail Nikolaev
 wrote:
>
> One more idea is to just forbid HOT prune while the second phase is
> running. It is not possible anyway currently because of the snapshot held.
>
> Possible enhancements:
> * we may apply restriction only to particular tables
> * we may apply restrictions only to part of the tables (not yet
> scanned by R/CICs).
>
> Yes, it is not an elegant solution, limited, not reliable in terms of
> architecture, but a simple one.

How do you suppose this would work differently from a long-lived
normal snapshot, which is how it works right now?
Would it be exclusively for that relation? How would this be
integrated with e.g. heap_page_prune_opt?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Reduce useless changes before reassembly during logical replication

2024-02-21 Thread Andy Fan

Hi Jie,

> Most importantly, if we filter out unpublished relationship-related
> changes after
> constructing the changes but before queuing the changes into a transaction,
> will it reduce the workload of logical decoding and avoid disk or memory 
> growth
> as much as possible?

Thanks for the report!

Discarding the unused changes as soon as possible looks like a valid
optimization to me, but I'd really like more experienced people to
double-check it.

> Attached is the patch I used to implement this optimization.

After a quick look at the patch, I found FilterByTable is too expensive
because of the StartTransaction and AbortTransaction. With your above
setup, I ran the test below:

insert into tbl_t1 select i,repeat('xyzzy', i),repeat('abcba',
i),repeat('dfds', i) from generate_series(0,999100) i;

perf the wal sender of mypub for 30 seconds, then I get:

- 22.04%  1.53%  postgres  postgres  [.] FilterByTable
   - 20.51% FilterByTable
      - AbortTransaction
         - ResourceOwnerReleaseInternal
            - LockReleaseAll
               - hash_seq_search
The main part comes from AbortTransaction, and the 20% is not trivial.

From your patch:
+
+	/*
+	 * Decoding needs access to syscaches et al., which in turn use
+	 * heavyweight locks and such. Thus we need to have enough state around to
+	 * keep track of those.  The easiest way is to simply use a transaction
+	 * internally.
+	 */
+	using_subtxn = IsTransactionOrTransactionBlock();
+
+	if (using_subtxn)
+		BeginInternalSubTransaction("filter change by table");
+	else
+		StartTransactionCommand();

Actually, FilterByTable here is simpler than "decoding": we access the
syscache only when we find an entry in get_rel_sync_entry whose
replicate_valid is false, and the invalid case should be rare.

What I'm thinking now is that we let get_rel_sync_entry build its own
transaction state *only when it finds an invalid entry*. If the caller
has built it already, like the existing cases in master, nothing will
happen except a simple transaction state check. Then in the
FilterByTable case we just leave it to get_rel_sync_entry. See the
attachment for the idea, and the sketch below.
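
The sketch (illustrative names and placement only; the attached patch is
authoritative):

	if (!entry->replicate_valid)
	{
		bool		started_xact = false;

		/* Build transaction state only if the caller has not done so. */
		if (!IsTransactionOrTransactionBlock())
		{
			StartTransactionCommand();
			started_xact = true;
		}

		/* ... rebuild the entry, consulting syscaches as needed ... */

		/* Decoding-internal transactions are thrown away again. */
		if (started_xact)
			AbortCurrentTransaction();
	}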

-- 
Best Regards
Andy Fan

>From 90b8df330df049bd1d7e881dc6e9b108c17b0924 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Wed, 21 Feb 2024 18:40:03 +0800
Subject: [PATCH v1 1/1] Make get_rel_sync_entry less depending on transaction
 state.

get_rel_sync_entry needs a transaction only when a replicate_valid =
false entry is found, which should be a rare case. However, the caller
can't know whether an entry is valid, so it has to prepare the
transaction state before calling this function. Such preparation is
expensive.

This patch makes get_rel_sync_entry manage a transaction state only when
necessary, so the callers don't need to prepare it blindly.
---
 src/backend/replication/pgoutput/pgoutput.c | 60 -
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 998f92d671..25e55590a2 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -13,6 +13,7 @@
 #include "postgres.h"
 
 #include "access/tupconvert.h"
+#include "access/relation.h"
 #include "catalog/partition.h"
 #include "catalog/pg_publication.h"
 #include "catalog/pg_publication_rel.h"
@@ -214,6 +215,11 @@ static void init_rel_sync_cache(MemoryContext cachectx);
 static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data,
 			 Relation relation);
+static RelationSyncEntry *get_rel_sync_entry_by_relid(PGOutputData *data,
+	  Oid relid);
+static RelationSyncEntry *get_rel_sync_entry_internal(PGOutputData *data,
+	  Relation relation,
+	  Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 		  uint32 hashvalue);
@@ -1962,11 +1968,29 @@ set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
  */
 static RelationSyncEntry *
 get_rel_sync_entry(PGOutputData *data, Relation relation)
+{
+	return get_rel_sync_entry_internal(data, relation, InvalidOid);
+}
+
+static RelationSyncEntry *
+__attribute__ ((unused))
+get_rel_sync_entry_by_relid(PGOutputData *data, Oid relid)
+{
+	return get_rel_sync_entry_internal(data, NULL, relid);
+}
+
+static RelationSyncEntry *
+get_rel_sync_entry_internal(PGOutputData *data, Relation 

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-02-21 Thread Matthias van de Meent
On Wed, 21 Feb 2024 at 00:33, Michail Nikolaev
 wrote:
>
> Hello!
> > I think the best way for this to work would be an index method that
> > exclusively stores TIDs, and of which we can quickly determine new
> > tuples, too. I was thinking about something like GIN's format, but
> > using (generation number, tid) instead of ([colno, colvalue], tid) as
> > key data for the internal trees, and would be unlogged (because the
> > data wouldn't have to survive a crash)
>
> Yeah, this seems to be a reasonable approach, but there are some
> doubts related to it - it needs new index type as well as unlogged
> indexes to be introduced - this may make the patch too invasive to be
> merged.

I suppose so, though persistence is usually just to keep things
correct in case of crashes, and this "index" is only there to support
processes that don't expect to survive crashes.
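
For concreteness, the key layout described above could look roughly like
this; the struct is purely hypothetical, nothing like it exists yet:

typedef struct ChangelogKey
{
	uint32		generation;		/* advanced when a new visibility snapshot
								 * is taken by the CIC/RIC backend */
	ItemPointerData tid;		/* heap TID recorded by a concurrent writer */
} ChangelogKey;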

> Also, some way to remove the index from the catalog in case of
> a crash may be required.

That's less of an issue though, we already accept that a crash during
CIC/RIC leaves unusable indexes around, so "needs more cleanup" is not
exactly a blocker.

> A few more thoughts:
> * it is possible to go without generation number - we may provide a
> way to do some kind of fast index lookup (by TID) directly during the
> second table scan phase.

While possible, I don't think this would be more performant than the
combination approach, at the cost of potentially much more random IO
when the table is aggressively being updated.

> * one more option is to maintain a Tuplesorts (instead of an index)
> with TIDs as changelog and merge with index snapshot after taking a
> new visibility snapshot. But it is not clear how to share the same
> Tuplesort with multiple inserting backends.

Tuplesort requires the leader process to wait for concurrent backends
to finish their sort before it can start consuming their runs. This
would make it a very bad alternative to the "changelog index" as the
CIC process would require on-demand actions from concurrent backends
(flush of sort state). I'm not convinced that's somehow easier.

> * crazy idea - what is about to do the scan in the index we are
> building? We have tuple, so, we have all the data indexed in the
> index. We may try to do an index scan using that data to get all
> tuples and find the one with our TID :)

We can't rely on that, because we have no guarantee we can find the
tuple quickly enough. Equality-based indexing is very much optional,
and so are TID-based checks (outside the current vacuum-related APIs),
so finding one TID can (and probably will) take O(indexsize) when the
tuple is not in the index, which is one reason for ambulkdelete() to
exist.

Kind regards,

Matthias van de Meent




Re: BRIN integer overflow

2024-02-21 Thread Daniel Gustafsson
> On 21 Feb 2024, at 06:40, Oleg Tselebrovskiy  
> wrote:

> Function bringetbitmap that is used in BRIN's IndexAmRoutine should return an
> int64 value, but the actual return value is int, since totalpages is int and
> totalpages * 10 is also int. This could lead to integer overflow

(totalpages * 10) overflowing an int seems like a quite theoretical risk which
would be hard to hit in practice.

> I suggest to change totalpages to be int64 to avoid potential overflow.
> Also in all other "amgetbitmap functions" (such as hashgetbitmap,
> gistgetbitmap, gingetbitmap, blgetbitmap) the return value is of correct
> int64 type

That being said, changing it like this seems reasonable since the API is
defined as int64, and it will keep static analyzers quiet.
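
To illustrate the failure mode under discussion, here is a standalone
sketch (not PostgreSQL code; the page count is hypothetical):

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/*
	 * Hypothetical page count big enough that totalpages * 10 exceeds
	 * INT_MAX; the 32-bit multiplication overflows (UB, typically wraps).
	 */
	int			totalpages = INT_MAX / 10 + 1;
	int64_t		broken = totalpages * 10;	/* computed in int, then widened */
	int64_t		fixed = (int64_t) totalpages * 10;	/* widened first, correct */

	printf("broken=%lld fixed=%lld\n", (long long) broken, (long long) fixed);
	return 0;
}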

--
Daniel Gustafsson





Re: POC: GROUP BY optimization

2024-02-21 Thread Alexander Korotkov
Hi, Richard!

> What do you think about the revisions for the test cases?

I've rebased your patch upthread.  Did some minor beautifications.

> * The table 'btg' is inserted with 10000 tuples, which seems a bit
> expensive for a test.  I don't think we need such a big table to test
> what we want.

Your patch reduces the number of rows to 1000 tuples.  I found it
possible to further reduce it to 100 tuples.  That also allowed me to
save the plan in the test case introduced by e1b7fde418.

Please check if you're OK with the patch attached.

--
Regards,
Alexander Korotkov


0001-Multiple-revises-for-the-GROUP-BY-reordering-test-v2.patch
Description: Binary data


Re: 'Shutdown <= SmartShutdown' check while launching processes in postmaster.

2024-02-21 Thread shveta malik
On Wed, Feb 21, 2024 at 10:01 AM Tom Lane  wrote:
>
> shveta malik  writes:
> > I would like to know that why we have 'Shutdown <= SmartShutdown'
> > check before launching few processes (WalReceiver, WalSummarizer,
> > AutoVacuum worker) while rest of the processes (BGWriter, WalWriter,
> > Checkpointer, Archiver etc) do not have any such check. If I have to
> > launch a new process, what shall be the criteria to decide if I need
> > this check?
>
> Children that are stopped by the "if (pmState == PM_STOP_BACKENDS)"
> stanza in PostmasterStateMachine should not be allowed to start
> again later if we are trying to shut down.  (But "smart" shutdown
> doesn't enforce that, since it's a very weak state that only
> prohibits new client sessions.)  The processes that are allowed
> to continue beyond that point are ones that are needed to perform
> the shutdown checkpoint, or useful to make it finish faster.

Thank you for providing the details. It clarifies the situation. Do
you think it would be beneficial to include this as a code comment in
postmaster.c to simplify understanding for future readers?

thanks
Shveta




Re: logical decoding and replication of sequences, take 2

2024-02-21 Thread Dilip Kumar
On Wed, Feb 21, 2024 at 2:52 PM Robert Haas  wrote:
>
> On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar  wrote:
> > So the problem is that we might consider the transaction change as
> > non-transactional and mark this flag as true.
>
> But it's not "might" right? It's absolutely 100% certain that we will
> consider that transaction's changes as non-transactional ... because
> when we're in fast-forward mode, the table of new relfilenodes is not
> built, and so whenever we check whether any transaction made a new
> relfilenode for this sequence, the answer will be no.
>
> > But what would have
> > happened if we would have identified it correctly as transactional?
> > In such cases, we wouldn't have set this flag here but then we would
> > have set this while processing the DecodeAbort/DecodeCommit, so the
> > net effect would be the same no?  You may question what if the
> > Abort/Commit WAL never appears in the WAL, but this flag is
> > specifically for the upgrade case, and in that case we have to do a
> > clean shutdown so may not be an issue.  But in the future, if we try
> > to use 'ctx->processing_required' for something else where the clean
> > shutdown is not guaranteed then this flag can be set incorrectly.
> >
> > I am not arguing that this is a perfect design but I am just making a
> > point about why it would work.
>
> Even if this argument is correct (and I don't know if it is), the code
> and comments need some updating. We should not be testing a flag that
> is guaranteed false with comments that make it sound like the value of
> the flag is trustworthy when it isn't.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: About a recently-added message

2024-02-21 Thread Amit Kapila
On Tue, Feb 20, 2024 at 3:21 PM shveta malik  wrote:
>
> okay, attached v2 patch with changed error msgs and double quotes
> around logical.
>

Horiguchi-San, does this address all your concerns related to
translation with these new messages?

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences, take 2

2024-02-21 Thread Robert Haas
On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar  wrote:
> So the problem is that we might consider the transaction change as
> non-transactional and mark this flag as true.

But it's not "might" right? It's absolutely 100% certain that we will
consider that transaction's changes as non-transactional ... because
when we're in fast-forward mode, the table of new relfilenodes is not
built, and so whenever we check whether any transaction made a new
relfilenode for this sequence, the answer will be no.

> But what would have
> happened if we would have identified it correctly as transactional?
> In such cases, we wouldn't have set this flag here but then we would
> have set this while processing the DecodeAbort/DecodeCommit, so the
> net effect would be the same no?  You may question what if the
> Abort/Commit WAL never appears in the WAL, but this flag is
> specifically for the upgrade case, and in that case we have to do a
> clean shutdown so may not be an issue.  But in the future, if we try
> to use 'ctx->processing_required' for something else where the clean
> shutdown is not guaranteed then this flag can be set incorrectly.
>
> I am not arguing that this is a perfect design but I am just making a
> point about why it would work.

Even if this argument is correct (and I don't know if it is), the code
and comments need some updating. We should not be testing a flag that
is guaranteed false with comments that make it sound like the value of
the flag is trustworthy when it isn't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2024-02-21 Thread Dilip Kumar
On Wed, Feb 21, 2024 at 1:24 PM Robert Haas  wrote:
>
> On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar  wrote:
> > > But I am wondering why this flag is always set to true in
> > > DecodeTXNNeedSkip() irrespective of the commit or abort. Because the
> > > aborted transactions are not supposed to be replayed?  So if my
> > > observation is correct that for the aborted transaction, this
> > > shouldn't be set to true then we have a problem with sequence where we
> > > are identifying the transactional changes as non-transaction changes
> > > because now for transactional changes this should depend upon commit
> > > status.
> >
> > I have checked this case with Amit Kapila.  So it seems in the cases
> > where we have sent the prepared transaction or streamed in-progress
> > transaction we would need to send the abort also, and for that reason,
> > we are setting 'ctx->processing_required' as true so that if these
> > WALs are not streamed we do not allow upgrade of such slots.
>
> I don't find this explanation clear enough for me to understand.


Explanation about why we set 'ctx->processing_required' to true from
DecodeCommit as well as DecodeAbort:
--
For upgrading logical replication slots, it's essential to ensure
these slots are completely synchronized with the subscriber.  To verify
that, we process all the pending WAL in 'fast_forward' mode to find
whether there is any decodable WAL or not.  So, in short, any WAL type
that we stream to the standby in normal (non-fast_forward) mode is
considered decodable, and so is the abort WAL.  That's the reason why
at the end of a transaction commit/abort we need to set
'ctx->processing_required' to true, i.e. some decodable WAL exists, so
we cannot upgrade this slot.

Why is the below check safe?
> + if (ctx->fast_forward)
> + {
> + /*
> + * We need to set processing_required flag to notify the sequence
> + * change existence to the caller. Usually, the flag is set when
> + * either the COMMIT or ABORT records are decoded, but this must be
> + * turned on here because the non-transactional logical message is
> + * decoded without waiting for these records.
> + */
> + if (!transactional)
> + ctx->processing_required = true;
> +
> + return;
> + }

So the problem is that we might consider the transaction change as
non-transactional and mark this flag as true.  But what would have
happened if we would have identified it correctly as transactional?
In such cases, we wouldn't have set this flag here but then we would
have set this while processing the DecodeAbort/DecodeCommit, so the
net effect would be the same no?  You may question what if the
Abort/Commit WAL never appears in the WAL, but this flag is
specifically for the upgrade case, and in that case we have to do a
clean shutdown so may not be an issue.  But in the future, if we try
to use 'ctx->processing_required' for something else where the clean
shutdown is not guaranteed then this flag can be set incorrectly.

I am not arguing that this is a perfect design but I am just making a
point about why it would work.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-02-21 Thread Michail Nikolaev
One more idea is to just forbid HOT prune while the second phase is
running. It is not possible anyway currently because of the snapshot held.

Possible enhancements:
* we may apply restriction only to particular tables
* we may apply restrictions only to part of the tables (not yet
scanned by R/CICs).

Yes, it is not an elegant solution, limited, not reliable in terms of
architecture, but a simple one.




Re: POC: GROUP BY optimization

2024-02-21 Thread Richard Guo
On Fri, Feb 2, 2024 at 12:40 PM Andrei Lepikhov 
wrote:

> On 2/2/2024 11:06, Richard Guo wrote:
> >
> > On Fri, Feb 2, 2024 at 11:32 AM Richard Guo wrote:
> >
> > On Fri, Feb 2, 2024 at 10:02 AM Tom Lane wrote:
> >
> > One of the test cases added by this commit has not been very
> > stable in the buildfarm.  Latest example is here:
> >
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2024-02-01%2021%3A28%3A04
> >
> > and I've seen similar failures intermittently on other machines.
> >
> > I'd suggest building this test atop a table that is more stable
> > than pg_class.  You're just waving a red flag in front of a bull
> > if you expect stable statistics from that during a regression run.
> > Nor do I see any particular reason for pg_class to be especially
> > suited to the test.
> >
> >
> > Yeah, it's not a good practice to use pg_class in this place.  While
> > looking through the test cases added by this commit, I noticed some
> > other minor issues that are not great.  Such as
> >
> > * The table 'btg' is inserted with 10000 tuples, which seems a bit
> > expensive for a test.  I don't think we need such a big table to test
> > what we want.
> >
> > * I don't see why we need to manipulate GUC max_parallel_workers and
> > max_parallel_workers_per_gather.
> >
> > * I think we'd better write the tests with the keywords being all upper
> > or all lower.  A mixed use of upper and lower is not great. Such as in
> >
> >  explain (COSTS OFF) SELECT x,y FROM btg GROUP BY x,y,z,w;
> >
> > * Some comments for the test queries are not easy to read.
> >
> > * For this statement
> >
> >  CREATE INDEX idx_y_x_z ON btg(y,x,w);
> >
> > I think the index name would cause confusion.  It creates an index on
> > columns y, x and w, but the name indicates an index on y, x and z.
> >
> > I'd like to write a draft patch for the fixes.
> >
> >
> > Here is the draft patch that fixes the issues I complained about in
> > upthread.
>


> I passed through the patch. Looks like it doesn't break anything. Why do
> you prefer to use count(*) in EXPLAIN instead of plain targetlist, like
> "SELECT x,y,..."?


Nothing special.  Just making the test cases consistent as much as
possible.


> Also, according to the test mentioned by Tom:
> 1. I see, PG uses IndexScan on (x,y). So, column x will be already
> sorted before the MergeJoin. Why not use Incremental Sort on (x,z,w)
> instead of full sort?


I think that's because the planner chooses to use (z, w, x) to perform
the mergejoin.  I did not delve into the details, but I guess the cost
estimation decides this is cheaper.

Hi Alexander,

What do you think about the revisions for the test cases?

Thanks
Richard


Re: Injection points: some tools to wait and wake

2024-02-21 Thread Bertrand Drouvot
Hi,

On Wed, Feb 21, 2024 at 07:08:03AM +0900, Michael Paquier wrote:
> On Tue, Feb 20, 2024 at 03:55:08PM +, Bertrand Drouvot wrote:
> > +PG_FUNCTION_INFO_V1(injection_points_wake);
> > +Datum
> > +injection_points_wake(PG_FUNCTION_ARGS)
> > +{
> > 
> > I think that this function will wake up all the "wait" injection points.
> > Would that make sense to implement some filtering based on the name? That 
> > could
> > be useful for tests that would need multiple wait injection points and that 
> > want
> > to wake them up "sequentially".
> > 
> > We could think about it if there is such a need in the future though.
> 
> Well, both you and Andrey are asking for it now, so let's do it.

Thanks!

> The implementation is simple:
> - Store in InjectionPointSharedState an array of wait_counts and an
> array of names.  There is only one condition variable.
> - When a point wants to wait, it takes the spinlock and looks within
> the array of names until it finds a free slot, adds its name into the
> array to reserve a wait counter at the same position, releases the
> spinlock.  Then it loops on the condition variable for an update of
> the counter it has reserved.  It is possible to make something more
> efficient, but at a small size it would not really matter.
> - The wakeup takes a point name in argument, acquires the spinlock,
> and checks if it can find the point into the array, pinpoints the
> location of the counter to update and updates it.  Then it broadcasts
> the change.
> - The wait loop checks its counter, leaves its loop, cancels the
> sleep, takes the spinlock to unregister from the array, and leaves.
>

I think that makes sense, and now the "counter" makes more sense to me
(thanks to it, we don't need multiple CVs).
 
> I would just hardcode the number of points that can wait, say 5 of
> them tracked in shmem?  Does that look like what you are looking at?

I think so, yes, and more than 5 points would look like a complicated test, IMHO.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Catalog domain not-null constraints

2024-02-21 Thread jian he
While wandering around the function AlterDomainNotNull, I found that the
following code can fix the previous undesired behavior.
It seems pretty simple; am I missing something?
based on v3-0001-Add-tests-for-domain-related-information-schema-v.patch
and v3-0002-Catalog-domain-not-null-constraints.patch

diff --git a/src/backend/commands/typecmds.c b/src/backend/commands/typecmds.c
index 2f94e375..9069465a 100644
--- a/src/backend/commands/typecmds.c
+++ b/src/backend/commands/typecmds.c
@@ -2904,7 +2904,7 @@ AlterDomainAddConstraint(List *names, Node *newConstraint,
 	Form_pg_type typTup;
 	Constraint *constr;
 	char	   *ccbin;
-	ObjectAddress address;
+	ObjectAddress address = InvalidObjectAddress;
 
 	/* Make a TypeName so we can use standard type lookup machinery */
 	typename = makeTypeNameFromNameList(names);
@@ -3003,6 +3003,12 @@ AlterDomainAddConstraint(List *names, Node *newConstraint,
}
else if (constr->contype == CONSTR_NOTNULL)
{
+		/* Is the domain already set NOT NULL? */
+		if (typTup->typnotnull)
+		{
+			table_close(typrel, RowExclusiveLock);
+			return address;
+		}
 		domainAddNotNullConstraint(domainoid, typTup->typnamespace,
 								   typTup->typbasetype, typTup->typtypmod,
 								   constr, NameStr(typTup->typname), constrAddr);




Re: Removing unneeded self joins

2024-02-21 Thread Andrei Lepikhov

On 21/2/2024 14:26, Richard Guo wrote:

> This patch also includes some cosmetic tweaks for SJE test cases.  It
> does not change the test cases using catalog tables though.  I think
> that should be a separate patch.
Thanks for this catch; it is a really messy thing, setting aside the
question of why we need two different subtrees for the same query.

I will look into your fix.

--
regards,
Andrei Lepikhov
Postgres Professional