Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-16 Thread Dilip Kumar
On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera  wrote:
>
> I just noticed that 0003 does some changes to
> TransactionGroupUpdateXidStatus() that haven't been adequately
> explained AFAICS.  How do you know that these changes are safe?

IMHO this is safe, as well as logical to do w.r.t. performance.  It's
safe because whenever we update any page in a group we acquire the
respective bank lock in exclusive mode, and in the extreme case that a
group contains pages from different banks we switch the lock as well
before updating the pages from a different bank.  And we do not wake
any process in a group until we have done the status update for all
the processes, so there cannot be any race condition either.  Also, it
should not affect performance adversely, and it does not remove the
need for group updates.  The main use case of the group update is to
optimize the situation where most of the processes are contending for
status updates on the same page; processes waiting for status updates
on different pages will go to different groups w.r.t. that page.  So,
in short, on a best-effort basis we try to gather into one group the
processes that are waiting to update the same clog page, which means
that logically all the processes in the group will be waiting on the
same bank lock.  In the extreme situation where there are processes in
the group that are trying to update different pages, or even pages
from different banks, we handle it well by switching the lock.  Someone
may raise the concern that, in cases where there are processes waiting
for different bank locks, we could wake those processes up after
releasing one lock; I think that is not required, because that is the
very situation we are trying to avoid (processes trying to update
different pages in the same group), so there is no point in adding
complexity to optimize that case.
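
To illustrate the locking pattern described above, here is a minimal sketch of
the group-update loop (the helpers clog_bank_lock() and
update_xid_status_on_page(), and the group-members array, are hypothetical
names for illustration, not the actual patch code):

    LWLock     *prevlock = NULL;

    for (int i = 0; i < ngroupmembers; i++)
    {
        int64       pageno = TransactionIdToPage(groupmembers[i]);
        LWLock     *banklock = clog_bank_lock(pageno);  /* assumed helper */

        /* switch the bank lock only when the next page is in another bank */
        if (banklock != prevlock)
        {
            if (prevlock != NULL)
                LWLockRelease(prevlock);
            LWLockAcquire(banklock, LW_EXCLUSIVE);
            prevlock = banklock;
        }
        update_xid_status_on_page(groupmembers[i]);     /* assumed helper */
    }
    if (prevlock != NULL)
        LWLockRelease(prevlock);
    /* only after all members are updated do we wake up the group */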


> 0001 contains one typo in the docs, "cotents".
>
> I'm not a fan of the fact that some CLOG sizing macros moved to clog.h,
> leaving others in clog.c.  Maybe add commentary cross-linking both.
> Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to
> the slru.h-defined limit of 131072 is not that bad, even if it's more
> than could possibly be needed for xact_buffers; nobody is going to use
> 64k buffers, since useful values are below a couple thousand anyhow.

I agree that allowing xact_buffers to grow beyond 65536, up to the
slru.h-defined limit of 131072, is not that bad, so I will change that
in the next version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: remaining sql/json patches

2023-11-16 Thread jian he
Hi.
Minor issues.

In transformJsonFuncExpr(ParseState *pstate, JsonFuncExpr *func),
func.behavior->on_empty->location and
func.behavior->on_error->location are correct.
But in ExecInitJsonExpr, jsestate->jsexpr->on_empty->location is -1 and
jsestate->jsexpr->on_error->location is -1.
Maybe we can preserve these on_empty and on_error token locations in
transformJsonBehavior.

Some enum declarations: do the ending elements need an extra trailing comma?
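
For example, a hypothetical enum just to illustrate (C99 permits the trailing
comma, and it keeps future additions from touching the previous line):

    typedef enum HypotheticalJsonBehaviorType
    {
        HJB_ON_EMPTY,
        HJB_ON_ERROR,   /* trailing comma after the last enumerator */
    } HypotheticalJsonBehaviorType;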

+ /*
+ * ExecEvalJsonExprPath() will set this to the address of the step to
+ * use to coerce the result of JsonPath* evaluation to the RETURNING
+ * type.  Also see the description of possible step addresses this
+ * could be set to in the definition of JsonExprState.ZZ
+ */

"ZZ", typo?




Re: SQL:2011 application time

2023-11-16 Thread jian he
based on v17.

begin;
drop table if exists s1;
CREATE TABLE s1 (id numrange, misc int, misc1 text);
create role  test101 login;
grant update, select  on s1 to test101;
insert into s1 VALUES ('[1,1000]',2);
set session authorization test101;
update s1 set id = '[1,1000]';
savepoint sp1;
update s1 FOR PORTION OF id from 10 to 100 set misc1 = 'test';
table s1;
savepoint sp2;
insert into s1 VALUES ('[2,1000]',12);
rollback;

In UPDATE ... FOR PORTION OF ... FROM x TO y, if the range [x,y) overlaps with
the "source" range,
then the UPDATE action is effectively an UPDATE plus an INSERT.
Should the above UPDATE FOR PORTION OF query then fail?
UPDATE FOR PORTION OF may need the INSERT privilege. We also need to document this.
Similarly, we need to apply the above logic to DELETE FOR PORTION OF.
---
+  
+   If the table has a range column
+   or PERIOD, you may supply a

should be

+ 
+  If the table has a range column or  
+  PERIOD, you may supply a

Similarly, in doc/src/sgml/ref/delete.sgml the link reference is also broken.

  
   If the table has a range column or  
  PERIOD, you may supply a
   FOR PORTION OF clause, and your update will only
affect rows
   that overlap the given interval. Furthermore, if a row's span extends outside
   the FOR PORTION OF bounds, then it will be
truncated to fit
   within the bounds, and new rows spanning the "cut off" duration will be
   inserted to preserve the old values.
  

 "given interval", "cut off" these words,  imho, feel not so clear.
We also need a document that:
 "UPDATE FOR PORTION OF" is UPDATE and INSERT (if overlaps).
If the "UPDATE FOR PORTION OF" range overlaps then
It will invoke triggers in the following order: before row update,
before row insert, after row insert. after row update.
---
src/test/regress/sql/for_portion_of.sql
Do you only need to create two triggers,
since for_portion_of_trigger only raises a NOTICE to output the trigger
meta info?

CREATE TRIGGER trg_for_portion_of_before
  BEFORE INSERT OR UPDATE OR DELETE ON for_portion_of_test
  FOR EACH ROW
  EXECUTE FUNCTION for_portion_of_trigger();
CREATE TRIGGER trg_for_portion_of_after
AFTER INSERT OR UPDATE OR DELETE ON for_portion_of_test
FOR EACH ROW
EXECUTE FUNCTION for_portion_of_trigger();




Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500

2023-11-16 Thread Richard Guo
On Fri, Nov 17, 2023 at 11:38 AM Tom Lane  wrote:

> That line of argument also leads to the conclusion that it'd be
> okay to expose info about the ordering of the CTE result to the
> upper planner.  This patch doesn't do that, and I'm not sufficiently
> excited about the issue to go write some code.  But if someone else
> does, I think we shouldn't exclude doing it on the grounds of wanting
> to preserve an optimization fence.  The fence is sort of one-way
> in this line of thinking: information can propagate up to the outer
> planner level, but not down into the CTE plan.
>
> Thoughts?


Exactly!  Thanks for the detailed explanation.

Thanks
Richard


Re: Use of backup_label not noted in log

2023-11-16 Thread Laurenz Albe
On Thu, 2023-11-16 at 20:18 -0800, Andres Freund wrote:
> I've often had to analyze what caused corruption in PG instances, where the
> symptoms match not having had backup_label in place when bringing up the
> node. However that's surprisingly hard - the only log messages that indicate
> use of backup_label are at DEBUG1.
> 
> Given how crucial use of backup_label is and how frequently people do get it
> wrong, I think we should add a LOG message - it's not like use of backup_label
> is a frequent thing in the life of a postgres instance and is going to swamp
> the log.  And I think we should backpatch that addition.

+1

I am not sure about the backpatch: it is not a bug, and we should not wantonly
introduce new log messages in a minor release.  Some monitoring system may
get confused.

What about adding it to the "redo starts at" message, something like

  redo starts at 12/12345678, taken from control file

or

  redo starts at 12/12345678, taken from backup label
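
A minimal sketch of what that could look like at the existing "redo starts"
ereport site (a sketch only; the haveBackupLabel flag name is an assumption):

    ereport(LOG,
            (errmsg("redo starts at %X/%X, taken from %s",
                    LSN_FORMAT_ARGS(RedoStartLSN),
                    haveBackupLabel ? "backup label" : "control file")));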

Yours,
Laurenz Albe




Re: Hide exposed impl detail of wchar.c

2023-11-16 Thread Nathan Bossart
On Thu, Nov 16, 2023 at 06:06:30PM -0500, Tom Lane wrote:
> I'm generally sympathetic to the idea that simd.h was a rather large
> dependency to add to something as widely used as pg_wchar.h.  So I'd
> favor getting it out of there just on compilation-time grounds,
> independently of whether it's causing active problems.  That argument
> wouldn't justify a back-patch, but "it's causing problems" might.

Given the lack of evidence of anyone else using is_valid_ascii(), I'm
leaning towards back-patching being the better option in this case.  I
don't know if it'll be feasible to keep simd.h out of all headers that
third-party code might want to use forever, but that's not an argument
against doing this right now for pgrx.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: MERGE ... RETURNING

2023-11-16 Thread jian he
>
> Attached is a separate patch with those doc updates, intended to be
> applied and back-patched independently of the main RETURNING patch.
>
> Regards,
> Dean

+   You will require the SELECT privilege and any column(s)
+   of the data_source and
+   target_table_name referred to
+   in any condition or expression.

I think it should be:
+   You will require the SELECT privilege on any column(s)
+   of the data_source and
+   target_table_name referred to
+   in any condition or expression.

Other than that, it looks fine.




Re: Synchronizing slots from primary to standby

2023-11-16 Thread Ajin Cherian
On Tue, Nov 14, 2023 at 12:57 AM Zhijie Hou (Fujitsu)
 wrote:
>
> 2) Raise error if user slot with same name already exists on standby.

"ERROR:  not synchronizing slot test; it is a user created slot"
I just tested this using v35, and to me the error message when this
happens is not very good. It neither sounds like an error, nor does it
make clear what the underlying problem is or how to correct it.
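
Maybe something along these lines would read better (just a sketch to
illustrate the direction; remote_slot_name is a hypothetical variable):

    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("could not synchronize replication slot \"%s\"",
                    remote_slot_name),
             errdetail("A user-created slot with the same name already exists on the standby."),
             errhint("Drop or rename the existing slot, then retry.")));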

regards,
Ajin Cherian
Fujitsu Australia




Use of backup_label not noted in log

2023-11-16 Thread Andres Freund
Hi,

I've often had to analyze what caused corruption in PG instances, where the
symptoms match not having had backup_label in place when bringing up the
node. However that's surprisingly hard - the only log messages that indicate
use of backup_label are at DEBUG1.

Given how crucial use of backup_label is and how frequently people do get it
wrong, I think we should add a LOG message - it's not like use of backup_label
is a frequent thing in the life of a postgres instance and is going to swamp
the log.  And I think we should backpatch that addition.

Medium term I think we should go further, and leave evidence in pg_control
about the last use of ControlFile->backupStartPoint, instead of resetting it.

I realize that there's a discussion about removing backup_label - but I think
that's fairly orthogonal. Should we go with the pg_control approach, we should
still emit a useful message when starting in a state that's "equivalent" to
having used the backup_label.

Thoughts?

Greetings,

Andres Freund




Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500

2023-11-16 Thread Laurenz Albe
On Thu, 2023-11-16 at 22:38 -0500, Tom Lane wrote:
> That line of argument also leads to the conclusion that it'd be
> okay to expose info about the ordering of the CTE result to the
> upper planner.  [...]  The fence is sort of one-way
> in this line of thinking: information can propagate up to the outer
> planner level, but not down into the CTE plan.
> 
> Thoughts?

That agrees with my intuition about MATERIALIZED CTEs.
I think of them as "first calculate the CTE, then calculate the
rest of the query" or an ad-hoc temporary table for the duration
of a query.  I would expect the upper planner to know estimates
and other data about the result of the CTE.

Yours,
Laurenz Albe




Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500

2023-11-16 Thread David G. Johnston
On Thursday, November 16, 2023, Tom Lane  wrote:

>
> That line of argument also leads to the conclusion that it'd be
> okay to expose info about the ordering of the CTE result to the
> upper planner.  This patch doesn't do that, and I'm not sufficiently
> excited about the issue to go write some code.  But if someone else
> does, I think we shouldn't exclude doing it on the grounds of wanting
> to preserve an optimization fence.  The fence is sort of one-way
> in this line of thinking: information can propagate up to the outer
> planner level, but not down into the CTE plan.
>

This is indeed my understanding of what materialized means.  Basically, the
CTE is done first and in isolation; but any knowledge of its result shape
can be used when referencing it.

David J.


Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500

2023-11-16 Thread Tom Lane
Richard Guo  writes:
> On Fri, Nov 17, 2023 at 2:16 AM Tom Lane  wrote:
>> So you could argue that there's more to do here, but I'm hesitant
>> to go further.  Part of the point of MATERIALIZED is to be an
>> optimization fence, so breaking down that fence is something to be
>> wary of.  Maybe we shouldn't even take this patch --- but on
>> balance I think it's an OK compromise.

> Agreed.  I think the patch is still valuable on its own, although it
> does not go down into MATERIALIZED case for further optimization.

Right.  My earlier response was rather rushed, so let me explain
my thinking a bit more.

When I realized that the discrepancy between MATERIALIZED-and-not
plans was due to the upper planner not seeing the pathkeys for the
CTE scan, my first thought was to try to export those pathkeys.
And my second thought was that the CTE should return multiple
potential paths, much as we do for sub-SELECT-in-FROM subqueries,
with the upper planner eventually choosing one of those paths.
But that second idea would break down the optimization fence
almost completely, because the opinions of the upper planner would
influence which plan we choose for the CTE query.  I think we
shouldn't go there, at least not for a CTE explicitly marked
MATERIALIZED.  (Maybe if it's not marked MATERIALIZED, but we
chose not to flatten it for some other reason, we could think
about that differently?  Not sure.)

I think that when we say that MATERIALIZED is meant as an optimization
fence, what we mostly mean is that the upper query shouldn't influence
the choice of plan for the sub-query.  However, we surely allow our
statistics or guesses for the sub-query to subsequently influence what
the upper planner does.  If that weren't true, we shouldn't even
expose any non-default rowcount guess to the upper planner --- but
that would lead to really horrid results, so we allow that information
to percolate up from the sub-query.  It seems like exposing column
statistics to the upper planner, as the proposed patch does, isn't
fundamentally different from exposing rowcount estimates.

That line of argument also leads to the conclusion that it'd be
okay to expose info about the ordering of the CTE result to the
upper planner.  This patch doesn't do that, and I'm not sufficiently
excited about the issue to go write some code.  But if someone else
does, I think we shouldn't exclude doing it on the grounds of wanting
to preserve an optimization fence.  The fence is sort of one-way
in this line of thinking: information can propagate up to the outer
planner level, but not down into the CTE plan.

Thoughts?

regards, tom lane




Re: Wrong results with grouping sets

2023-11-16 Thread Richard Guo
On Thu, Nov 16, 2023 at 11:25 PM Alena Rybakina 
wrote:

> I noticed that this query worked correctly in the main branch with the
> inequality operator:
>
> postgres=# select distinct on (a, b) a, b from (values (3, 1), (2, 2)) as
> t (a, b) where a > b group by grouping sets((a, b), (a));
>  a | b
> ---+---
>  3 | 1
>  3 |
> (2 rows)
>
> So, I think you are right)
>

Thanks for taking an interest in this patch and verifying it.


> I looked at your patch and noticed a few things:
>
> 1. I think you should add a test with the cube operator, because I noticed
> that the order of the query in the result has also changed:
>

Hmm, I'm not sure if that's necessary.  The wrong result order you saw
here is caused by the same reason explained above: the planner fails to
realize that Var 'a' and 'b' are nullable by the grouping sets, making
them no longer always equal to each other.  This issue should have been
covered in the tests added by v1 patch.

Thanks
Richard


Re: [HACKERS] Should logtape.c blocks be of type long?

2023-11-16 Thread Michael Paquier
On Thu, Nov 16, 2023 at 03:03:46PM +0100, Heikki Linnakangas wrote:
> BufFileTellBlock should be adjusted too. Or removed altogether; it's been
> commented out since year 2000. Other than that, looks good to me.

Yes, I recall wondering what to do about this one, so I just let it
be when updating the last version of the patch, as we have many
NOT_USED things in the core code.  After 23 years of being around for
no purpose, I have just applied a patch to remove it for the sake of
this change, then applied the main change.

Thanks for double-checking, Heikki!
--
Michael




Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500

2023-11-16 Thread Richard Guo
On Fri, Nov 17, 2023 at 2:16 AM Tom Lane  wrote:

> So you could argue that there's more to do here, but I'm hesitant
> to go further.  Part of the point of MATERIALIZED is to be an
> optimization fence, so breaking down that fence is something to be
> wary of.  Maybe we shouldn't even take this patch --- but on
> balance I think it's an OK compromise.


Agreed.  I think the patch is still valuable on its own, although it
does not go down into the MATERIALIZED case for further optimization.  Maybe
we can take another query as a regression test to prove its value, in
which the CTE is not inlined without MATERIALIZED, such as

explain (costs off)
with x as (select unique1, unique2 from tenk1 b)
select count(*) from tenk1 a
where unique1 in (select unique1 from x x1) and
  unique1 in (select unique2 from x x2);
                   QUERY PLAN
--------------------------------------------------------
 Aggregate
   CTE x
     ->  Seq Scan on tenk1 b
   ->  Hash Join
         Hash Cond: (a.unique1 = x2.unique2)
         ->  Nested Loop
               ->  HashAggregate
                     Group Key: x1.unique1
                     ->  CTE Scan on x x1
               ->  Index Only Scan using tenk1_unique1 on tenk1 a
                     Index Cond: (unique1 = x1.unique1)
         ->  Hash
               ->  HashAggregate
                     Group Key: x2.unique2
                     ->  CTE Scan on x x2
(15 rows)

vs

explain (costs off)
with x as (select unique1, unique2 from tenk1 b)
select count(*) from tenk1 a
where unique1 in (select unique1 from x x1) and
  unique1 in (select unique2 from x x2);
                   QUERY PLAN
--------------------------------------------------------
 Aggregate
   CTE x
     ->  Seq Scan on tenk1 b
   ->  Hash Semi Join
         Hash Cond: (a.unique1 = x2.unique2)
         ->  Hash Semi Join
               Hash Cond: (a.unique1 = x1.unique1)
               ->  Index Only Scan using tenk1_unique1 on tenk1 a
               ->  Hash
                     ->  CTE Scan on x x1
         ->  Hash
               ->  CTE Scan on x x2
(12 rows)

I believe the second plan is faster in reality too.

Thanks
Richard


RE: Synchronizing slots from primary to standby

2023-11-16 Thread Zhijie Hou (Fujitsu)
On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand 
 wrote:
> On 11/13/23 2:57 PM, Zhijie Hou (Fujitsu) wrote:
> > On Friday, November 10, 2023 4:16 PM Drouvot, Bertrand
>  wrote:
> >> Yeah good point, agree to just error out in all the case then (if we
> >> discard the sync_ reserved wording proposal, which seems to be the
> >> case as probably not worth the extra work).
> >
> > Thanks for the discussion!
> >
> > Here is the V33 patch set which includes the following changes:
> 
> Thanks for working on it!
> 
> >
> > 1) Drop slots with state 'i' in promotion flow after we shut down 
> > WalReceiver.
> 
> @@ -3557,10 +3558,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
>   * this only after failure, so when you promote, we still
>   * finish replaying as much as we can from archive and
>   * pg_wal before failover.
> +*
> +* Drop the slots for which sync is initiated but not yet
> +* completed i.e. they are still waiting for the primary
> +* server to catch up.
>   */
>  if (StandbyMode && CheckForStandbyTrigger())
>  {
>  XLogShutdownWalRcv();
> +   slotsync_drop_initiated_slots();
>  return XLREAD_FAIL;
>  }
> 
> I had a closer look and it seems this is not located at the right place.
> 
> Indeed, it's added here:
> 
> switch (currentSource)
> {
>   case XLOG_FROM_ARCHIVE:
>   case XLOG_FROM_PG_WAL:
> 
> While in our case we are in
> 
>   case XLOG_FROM_STREAM:
> 
> So I think we should move slotsync_drop_initiated_slots() in the
> XLOG_FROM_STREAM case. Maybe before shutting down the sync slot worker?
> (the TODO item number 2 you mentioned up-thread)

Thanks for the comment.

I feel WaitForWALToBecomeAvailable may not be the best place to shut down the
slotsync worker and drop slots. There could be other reasons (other than
promotion), as mentioned in the comments in case XLOG_FROM_STREAM, to reach the
code there. If the intention is to stop slotsync workers on promotion, I think
FinishWalRecovery() may be a better place to do it, as it indicates the end of
recovery and XLogShutdownWalRcv is also called in it.

And I feel we'd better drop the slots after shutting down the slotsync workers,
because otherwise the slotsync workers could create the dropped slot again in
rare cases.
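
So, roughly, the promotion path would become something like this (a sketch
only; ShutDownSlotSyncWorkers() is a hypothetical name for whatever stops the
workers):

    XLogShutdownWalRcv();
    ShutDownSlotSyncWorkers();          /* hypothetical: stop workers first */
    slotsync_drop_initiated_slots();    /* now nothing can recreate 'i' slots */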

Best Regards,
Hou zj




Re: Hide exposed impl detail of wchar.c

2023-11-16 Thread Jubilee Young
On Thu, Nov 16, 2023 at 2:54 PM Nathan Bossart  wrote:
> That being said, it's not unheard of for Postgres to make adjustments for
> third-party code (search for "pljava" in commits 62aba76 and f4aa3a1). I
> read the description of the pgrx problem [2], and I'm still trying to
> understand the scope of the issue. I don't think it's reasonable to expect
> someone building an extension to always use the exact same compiler that
> was used by the packager, but I also don't fully understand why different
> compilations of an inline function would cause problems.
>
> [0] 
> https://postgr.es/m/CAFBsxsHG%3Dg6W8Mie%2B_NO8dV6O0pO2stxrnS%3Dme5ZmGqk--fd5g%40mail.gmail.com
> [1] 
> https://postgr.es/m/CAFBsxsH1Yutrmu%2B6LLHKK8iXY%2BvG--Do6zN%2B2900spHXQNNQKQ%40mail.gmail.com
> [2] https://github.com/pgcentralfoundation/pgrx/issues/1298
>

We don't directly `#include` C into Rust, but use libclang to preprocess and
compile a wrapping C header into a list of symbols Rust will look for at link
time. Our failure is in libclang and how we steer it:
- The Clang-C API (libclang.so) cannot determine where system headers are.
- A clang executable can determine where system headers are, but our bindgen
may be asked to use a libclang.so without a matching clang executable!
- This is partly because system packagers do not agree on what clang parts
must be installed together, nor even on the clang directory's layout.
- Thus, given only a libclang.so, it is currently impossible to determine with
100% accuracy where the version-appropriate system headers are and include
them, nor does libclang do so implicitly.
- Users cannot be expected to always have reasonable directory layouts, nor
always have one clang + libclang.so + clang/"$MAJOR"/include on the system.
- We have tried various solutions and have had several users report, by various
channels, that their builds are breaking, even after they charitably try out
the patches I offer in their CI. Especially after system updates.

The clang-sys and rust-bindgen crates committed a series of unfortunate hacks
that surprisingly work. But the only real solution is actually exposing the
C++ API for header searches to Clang-C, and then moving that up the deps chain.
Perhaps libclang-18.so will not have this problem?

- Jubilee




Re: Faster "SET search_path"

2023-11-16 Thread Jeff Davis
On Tue, 2023-11-14 at 20:13 -0800, Jeff Davis wrote:
> On Thu, 2023-10-19 at 19:01 -0700, Jeff Davis wrote:
> > 0003: Cache for recomputeNamespacePath.
> 
> Committed with some further simplification around the OOM handling.

While I considered OOM during hash key initialization, I missed some
other potential out-of-memory hazards. Attached a fixup patch 0003,
which re-introduces one list copy but it simplifies things
substantially in addition to being safer around OOM conditions.

> > 0004: Use the same cache to optimize check_search_path().
> > 0005: Optimize cache for repeated lookups of the same value.

Also attached new versions of these patches.

Regards,
Jeff Davis

From 6ed0785c29a43ccbd36aa3f9bd3be751131a6bf2 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 19 Oct 2023 15:48:16 -0700
Subject: [PATCH v9 5/5] Optimize SearchPathCache by saving the last entry.

Repeated lookups are common, so it's worth it to check the last entry
before doing another hash lookup.

Discussion: https://postgr.es/m/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com
---
 src/backend/catalog/namespace.c | 88 +
 1 file changed, 57 insertions(+), 31 deletions(-)

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 0785f2b797..f540da7ebc 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -241,7 +241,8 @@ static bool MatchNamedCall(HeapTuple proctup, int nargs, List *argnames,
  *
  * The search path cache is based on a wrapper around a simplehash hash table
  * (nsphash, defined below). The spcache wrapper deals with OOM while trying
- * to initialize a key, and also offers a more convenient API.
+ * to initialize a key, optimizes repeated lookups of the same key, and also
+ * offers a more convenient API.
  */
 
 static inline uint32
@@ -281,6 +282,7 @@ spcachekey_equal(SearchPathCacheKey a, SearchPathCacheKey b)
 #define SPCACHE_RESET_THRESHOLD		256
 
 static nsphash_hash * SearchPathCache = NULL;
+static SearchPathCacheEntry * LastSearchPathCacheEntry = NULL;
 
 /*
  * Create or reset search_path cache as necessary.
@@ -295,6 +297,7 @@ spcache_init(void)
 			return;
 
 	MemoryContextReset(SearchPathCacheContext);
+	LastSearchPathCacheEntry = NULL;
 	/* arbitrary initial starting size of 16 elements */
 	SearchPathCache = nsphash_create(SearchPathCacheContext, 16, NULL);
 	searchPathCacheValid = true;
@@ -307,12 +310,25 @@ spcache_init(void)
 static SearchPathCacheEntry *
 spcache_lookup(const char *searchPath, Oid roleid)
 {
-	SearchPathCacheKey cachekey = {
-		.searchPath = searchPath,
-		.roleid = roleid
-	};
+	if (LastSearchPathCacheEntry &&
+		LastSearchPathCacheEntry->key.roleid == roleid &&
+		strcmp(LastSearchPathCacheEntry->key.searchPath, searchPath) == 0)
+	{
+		return LastSearchPathCacheEntry;
+	}
+	else
+	{
+		SearchPathCacheEntry *entry;
+		SearchPathCacheKey cachekey = {
+			.searchPath = searchPath,
+			.roleid = roleid
+		};
+
+		entry = nsphash_lookup(SearchPathCache, cachekey);
 
-	return nsphash_lookup(SearchPathCache, cachekey);
+		LastSearchPathCacheEntry = entry;
+		return entry;
+	}
 }
 
 /*
@@ -324,35 +340,45 @@ spcache_lookup(const char *searchPath, Oid roleid)
 static SearchPathCacheEntry *
 spcache_insert(const char *searchPath, Oid roleid)
 {
-	SearchPathCacheEntry *entry;
-	SearchPathCacheKey cachekey = {
-		.searchPath = searchPath,
-		.roleid = roleid
-	};
-
-	/*
-	 * searchPath is not saved in SearchPathCacheContext. First perform a
-	 * lookup, and copy searchPath only if we need to create a new entry.
-	 */
-	entry = nsphash_lookup(SearchPathCache, cachekey);
-
-	if (!entry)
+	if (LastSearchPathCacheEntry &&
+		LastSearchPathCacheEntry->key.roleid == roleid &&
+		strcmp(LastSearchPathCacheEntry->key.searchPath, searchPath) == 0)
 	{
-		bool		found;
+		return LastSearchPathCacheEntry;
+	}
+	else
+	{
+		SearchPathCacheEntry *entry;
+		SearchPathCacheKey cachekey = {
+			.searchPath = searchPath,
+			.roleid = roleid
+		};
 
-		cachekey.searchPath = MemoryContextStrdup(SearchPathCacheContext, searchPath);
-		entry = nsphash_insert(SearchPathCache, cachekey, &found);
-		Assert(!found);
+		/*
+		 * searchPath is not saved in SearchPathCacheContext. First perform a
+		 * lookup, and copy searchPath only if we need to create a new entry.
+		 */
+		entry = nsphash_lookup(SearchPathCache, cachekey);
 
-		entry->oidlist = NIL;
-		entry->finalPath = NIL;
-		entry->firstNS = InvalidOid;
-		entry->temp_missing = false;
-		entry->forceRecompute = false;
-		/* do not touch entry->status, used by simplehash */
-	}
+		if (!entry)
+		{
+			bool		found;
+
+			cachekey.searchPath = MemoryContextStrdup(SearchPathCacheContext, searchPath);
+			entry = nsphash_insert(SearchPathCache, cachekey, &found);
+			Assert(!found);
+
+			entry->oidlist = NIL;
+			entry->finalPath = NIL;
+			entry->firstNS = InvalidOid;
+			entry->temp_missing = false;
+			entry->forceRecompute = false;
+			/* do not touch 

Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

2023-11-16 Thread Melanie Plageman
On Thu, Nov 16, 2023 at 3:29 PM Robert Haas  wrote:
>
> On Wed, Nov 15, 2023 at 6:17 PM Andres Freund  wrote:
> > Thoughts on whether to backpatch? It'd probably be at least a bit painful,
> > there have been a lot of changes in the surrounding code in the last 5 
> > years.
>
> I guess the main point that I'd make here is that we shouldn't
> back-patch because it's wrong, but because of whatever consequences it
> being wrong has. If somebody demonstrates that a deadlock occurs, or
> that a painfully long stall can be constructed on a somewhat realistic
> test case, then I think we should back-patch. If it's just something
> that we look at and by visual inspection say "wow, that looks awful,"
> that is not a sufficient reason to back-patch in my view. Such a
> back-patch would still have known risk, but no known reward.

This reasoning makes sense, and I hadn't really thought of it that way.
I was going to offer to take a stab at producing some back patch sets,
but, given this rationale and Andres' downthread agreement and
analysis, it sounds like there is no reason to do so. Thanks for
thinking about my bug report!

- Melanie




Re: EXCLUDE COLLATE in CREATE/ALTER TABLE document

2023-11-16 Thread shihao zhong
Hi Jian,

Thanks for your comments, a new version is attached.

Thanks,
Shihao

On Fri, Nov 10, 2023 at 9:59 AM jian he  wrote:

> On Wed, Nov 1, 2023 at 10:30 AM shihao zhong 
> wrote:
> >
> > Thank you for your feedback on my previous patch. I have fixed the issue
> and attached a new patch for your review. Could you please take a look for
> it if you have a sec? Thanks
> >
>
> Your patch works fine. You can see it here:
> https://cirrus-ci.com/task/6481922939944960
> In an ideal world, since the doc is already built, we could probably
> view it as a plain HTML file by just clicking the CI test result.
>
> in src/sgml/ref/create_table.sgml:
> "Each exclude_element can optionally specify an operator class and/or
> ordering options; these are described fully under CREATE INDEX."
>
> You may need to update this sentence to reflect that exclude_element
> can also optionally specify collation.
>


update_document_exclude_with_collate_v3.patch
Description: Binary data


Re: Hide exposed impl detail of wchar.c

2023-11-16 Thread Tom Lane
Nathan Bossart  writes:
> It looks like is_valid_ascii() was originally added to pg_wchar.h so that
> it could easily be used elsewhere [0] [1], but that doesn't seem to have
> happened yet.

It seems to be new as of v15, so there wouldn't have been a lot of time
for external code to adopt it.  As far as I can tell from Debian Code
Search, nobody has yet.

> Would moving this definition to a separate header file be a viable option?
> That'd still break any existing projects that are using it, but at least
> there'd be an easy fix.

That would provide a little bit of cover, at least, compared to just
hiding it in the .c file.

I'm generally sympathetic to the idea that simd.h was a rather large
dependency to add to something as widely used as pg_wchar.h.  So I'd
favor getting it out of there just on compilation-time grounds,
independently of whether it's causing active problems.  That argument
wouldn't justify a back-patch, but "it's causing problems" might.

regards, tom lane




Re: Hide exposed impl detail of wchar.c

2023-11-16 Thread Nathan Bossart
On Thu, Nov 16, 2023 at 12:10:59PM -0800, Jubilee Young wrote:
> On the off-chance that everyone agrees with me without reserve, the
> attached patch moves the relevant code around (and includes the gory
> details). This seems to be unlikely to be the only mildly-exotic build
> system failure caused by such an overexposed implementation detail, so
> while I'm not married to this particular code motion, it seems best to
> improve this some way.

It looks like is_valid_ascii() was originally added to pg_wchar.h so that
it could easily be used elsewhere [0] [1], but that doesn't seem to have
happened yet.

Would moving this definition to a separate header file be a viable option?
That'd still break any existing projects that are using it, but at least
there'd be an easy fix.  I'm not sure there _are_ any other projects using
it, anyway.  However, both of these proposals feel like they might be
slightly beyond what we'd ordinarily consider back-patching.

That being said, it's not unheard of for Postgres to make adjustments for
third-party code (search for "pljava" in commits 62aba76 and f4aa3a1).  I
read the description of the pgrx problem [2], and I'm still trying to
understand the scope of the issue.  I don't think it's reasonable to expect
someone building an extension to always use the exact same compiler that
was used by the packager, but I also don't fully understand why different
compilations of an inline function would cause problems.
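
As far as I can tell, the hazard has roughly this shape: the static inline in
pg_wchar.h drags in simd.h, whose contents depend on compiler-provided
intrinsics headers that only the compiler actually preprocessing the
translation unit knows how to locate. A simplified sketch of simd.h's
structure, not its full contents:

    #if defined(__x86_64__) || defined(_M_AMD64)
    #include <emmintrin.h>      /* compiler-builtin header; its location is
                                 * known only to the compiler in use */
    #define USE_SSE2
    typedef __m128i Vector8;
    #else
    #define USE_NO_SIMD
    typedef uint64 Vector8;     /* scalar fallback */
    #endif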

[0] 
https://postgr.es/m/CAFBsxsHG%3Dg6W8Mie%2B_NO8dV6O0pO2stxrnS%3Dme5ZmGqk--fd5g%40mail.gmail.com
[1] 
https://postgr.es/m/CAFBsxsH1Yutrmu%2B6LLHKK8iXY%2BvG--Do6zN%2B2900spHXQNNQKQ%40mail.gmail.com
[2] https://github.com/pgcentralfoundation/pgrx/issues/1298

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Hide exposed impl detail of wchar.c

2023-11-16 Thread Tom Lane
Jubilee Young  writes:
> I have root-caused the exact problem, but the bug is... social, rather
> than technical in nature: people with inadequate options at their
> disposal have implemented increasingly clever educated guesses that
> are increasingly prone to going wrong, rather than just asking anyone
> to help them increase their options. Rather than continuing this
> trend, I figured I would simply start doing things to hopefully draw
> the line here. I will be looking to follow up with the bindgen tools
> that fail to handle this correctly, but it would be nice if this
> stopped shipping in Postgres 16."${PG_NEXT_MINOR}", as pgrx does need
> the definitions in pg_wchar.h to have enough data to correctly
> determine database encoding and preserve certain Rust library
> invariants ("all Rust strings are correctly-formed UTF-8, anything
> else is just a sequence of bytes") without also obliterating
> performance.

It would be nice if you would state your problem straightforwardly,
rather than burying us in irrelevant-to-us details; but apparently
what you are unhappy about is that pg_wchar.h now #include's simd.h.
That evidently stems from commit 121d2d3d7 trying to make
is_valid_ascii() faster.

Currently the only caller of is_valid_ascii() is in wchar.c,
and so we could easily fix this by moving is_valid_ascii()
into wchar.c as your patch proposes.  However ... I suppose the
point of having it as a "static inline" in a header file is to
be able to optimize other call sites too.  So I wonder if there
used to be some, or this was just somebody's over-eagerness to
expose stuff they thought possibly might be useful.  And maybe
more to the point, are we worried about there being other
callers in future?  I'm really not sure.

regards, tom lane




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-11-16 Thread Andrei Zubkov
Hi hackers,

Patch rebased to the current master
-- 
regards, Andrei Zubkov
Postgres Professional
From ff2ff96352a843d32a1213a1e953503fd1525b7b Mon Sep 17 00:00:00 2001
From: Andrei Zubkov 
Date: Fri, 17 Nov 2023 00:27:24 +0300
Subject: [PATCH 1/2] pg_stat_statements tests: Add NOT NULL checking of
 pg_stat_statements_reset

This is a preliminary patch. It adds NOT NULL checking for the result of
the pg_stat_statements_reset() function. It is needed for the upcoming patch
"Track statement entry timestamp", which will change the result type of
this function to the timestamp of the reset performed.

Author: Andrei Zubkov (zubkov)

Reviewed by: Julien Rouhaud (rjuju), Hayato Kuroda (ha-kun),
  Yuki Seino (seinoyu), Chengxi Sun (martin-sun),
  Anton Melnikov (antonmel), Darren Rush (darrenr),
  Michael Paquier (michael-kun), Sergei Kornilov (melkij),
  Alena Rybakina (a.rybakina), Andrei Lepikhov (lepikhov)

Discussion: https://www.postgresql.org/message-id/flat/72e80e7b160a6eb189df9ef6f068cce3765d37f8.camel%40moonset.ru
---
 .../pg_stat_statements/expected/cursors.out   |  28 +--
 contrib/pg_stat_statements/expected/dml.out   |  28 +--
 .../expected/level_tracking.out   |  80 
 .../pg_stat_statements/expected/planning.out  |  18 +-
 .../pg_stat_statements/expected/select.out|  44 ++---
 .../expected/user_activity.out| 120 ++--
 .../pg_stat_statements/expected/utility.out   | 178 +-
 contrib/pg_stat_statements/expected/wal.out   |  10 +-
 contrib/pg_stat_statements/sql/cursors.sql|   6 +-
 contrib/pg_stat_statements/sql/dml.sql|   6 +-
 .../pg_stat_statements/sql/level_tracking.sql |  12 +-
 contrib/pg_stat_statements/sql/planning.sql   |   4 +-
 contrib/pg_stat_statements/sql/select.sql |  10 +-
 .../pg_stat_statements/sql/user_activity.sql  |  15 +-
 contrib/pg_stat_statements/sql/utility.sql|  32 ++--
 contrib/pg_stat_statements/sql/wal.sql|   2 +-
 16 files changed, 299 insertions(+), 294 deletions(-)

diff --git a/contrib/pg_stat_statements/expected/cursors.out b/contrib/pg_stat_statements/expected/cursors.out
index 46375ea9051..0fc4b2c098d 100644
--- a/contrib/pg_stat_statements/expected/cursors.out
+++ b/contrib/pg_stat_statements/expected/cursors.out
@@ -3,10 +3,10 @@
 --
 -- These tests require track_utility to be enabled.
 SET pg_stat_statements.track_utility = TRUE;
-SELECT pg_stat_statements_reset();
- pg_stat_statements_reset 
---
- 
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
 (1 row)
 
 -- DECLARE
@@ -20,13 +20,13 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
 ---+--+---
  2 |0 | CLOSE cursor_stats_1
  2 |0 | DECLARE cursor_stats_1 CURSOR WITH HOLD FOR SELECT $1
- 1 |1 | SELECT pg_stat_statements_reset()
+ 1 |1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
 (3 rows)
 
-SELECT pg_stat_statements_reset();
- pg_stat_statements_reset 
---
- 
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
 (1 row)
 
 -- FETCH
@@ -59,12 +59,12 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
  1 |0 | DECLARE cursor_stats_2 CURSOR WITH HOLD FOR SELECT $1
  1 |1 | FETCH 1 IN cursor_stats_1
  1 |1 | FETCH 1 IN cursor_stats_2
- 1 |1 | SELECT pg_stat_statements_reset()
+ 1 |1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
 (9 rows)
 
-SELECT pg_stat_statements_reset();
- pg_stat_statements_reset 
---
- 
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
 (1 row)
 
diff --git a/contrib/pg_stat_statements/expected/dml.out b/contrib/pg_stat_statements/expected/dml.out
index ede47a71acc..f6ac8da5ca2 100644
--- a/contrib/pg_stat_statements/expected/dml.out
+++ b/contrib/pg_stat_statements/expected/dml.out
@@ -81,16 +81,16 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
  1 |   12 | SELECT * FROM pgss_dml_tab ORDER BY a
  2 |4 | SELECT * FROM pgss_dml_tab WHERE a > $1 ORDER BY a
  1 |8 | SELECT * FROM pgss_dml_tab WHERE a IN ($1, $2, $3, $4, $5)
- 1 |1 | SELECT pg_stat_statements_reset()
+ 1 |1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
  1 |0 | SET pg_stat_statements.track_utility = FALSE
  6 |6 | UPDATE pgss_dml_tab SET b = $1 WHERE a = $2
  1 |3 | UPDATE pgss_dml_tab SET b = $1 WHERE a > $2
 (10 rows)
 
-SELECT pg_stat_statements_reset();
- pg_stat_statements_reset 
---
- 
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
 (1 row)
 
 -- MERGE
@@ -136,16 +136,16 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
|  |  WHEN NOT MATCHED THEN INSERT (a, b) VALUES ($1, $2)
  1 |0 | MERGE INTO pgss_dml_tab USING pgss_dml_tab st 

Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

2023-11-16 Thread Daniel Gustafsson
> On 30 Jul 2020, at 14:11, Robert Haas  wrote:
> 
> On Mon, Jul 20, 2020 at 3:47 PM Robert Haas  wrote:
>> But I also agree that what pg_start_backup() was doing before v13 was
>> wrong; that's why I committed
>> 303640199d0436c5e7acdf50b837a027b5726594. The only reason I didn't
>> back-patch it is because the consequences are so minor I didn't think
>> it was worth worrying about. We could, though. I'd be somewhat
>> inclined to both do that and also change LLVM to use on_proc_exit() in
>> master, but I don't feel super-strongly about it.
> 
> Unless somebody complains pretty soon, I'm going to go ahead and do
> what is described above.

When backpatching 9dce22033d5d I ran into this in v13 and below, since it needs
llvm_shutdown to happen via on_proc_exit in order for all llvm_release_context
calls to have finished.  Unless anyone objects I will backpatch bab150045bd97
to v12 and v13 as part of my backpatch.
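
For reference, the registration that needs to be in effect in those branches
is simply (a sketch of the relevant call):

    /*
     * Run llvm_shutdown at proc_exit time, i.e. after the before_shmem_exit
     * callbacks (which include the llvm_release_context users) have finished.
     */
    on_proc_exit(llvm_shutdown, 0);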

--
Daniel Gustafsson





Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

2023-11-16 Thread Robert Haas
On Thu, Nov 16, 2023 at 3:49 PM Andres Freund  wrote:
> I think the main reason it's not all that bad, even when hitting this path, is
> that one stall every 8GB just isn't that much and that the stalls aren't that
> long - the leaf page fsm updates don't use the strategy, so they're still
> somewhat likely to be in s_b, and there's "just" ~2MB of FSM to read. I tried
> to reproduce it here, and it was a few ms, even though I dropped filesystem
> caches in a loop.

So just fix it in master then. If it turns out later there are worse
scenarios, you can back-patch then.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Adding deprecation notices to pgcrypto documentation

2023-11-16 Thread Daniel Gustafsson
In the "Allow tests to pass in OpenSSL FIPS mode" thread [0] it was discovered
that 3DES is joining the ranks of NIST disallowed algorithms.  The attached
patch adds a small note to the pgcrypto documentation about deprecated uses of
algorithms.  I've kept it to "official" notices such as RFC's and NIST SP's.
There might be more that deserve a notice, but this seemed like a good start.

Any thoughts on whether this would be helpful?

--
Daniel Gustafsson

[0] https://postgr.es/m/2825088.1696539...@sss.pgh.pa.us



pgcrypto_disallow_v2.diff
Description: Binary data


Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

2023-11-16 Thread Andres Freund
Hi,

On 2023-11-16 15:29:38 -0500, Robert Haas wrote:
> On Wed, Nov 15, 2023 at 6:17 PM Andres Freund  wrote:
> > Thoughts on whether to backpatch? It'd probably be at least a bit painful,
> > there have been a lot of changes in the surrounding code in the last 5 
> > years.
> 
> I guess the main point that I'd make here is that we shouldn't
> back-patch because it's wrong, but because of whatever consequences it
> being wrong has. If somebody demonstrates that a deadlock occurs, or
> that a painfully long stall can be constructed on a somewhat realistic
> test case, then I think we should back-patch.

Yea, that'd make it easy :)


> If it's just something that we look at and by visual inspection say "wow,
> that looks awful," that is not a sufficient reason to back-patch in my
> view. Such a back-patch would still have known risk, but no known reward.


> Until just now, I hadn't quite absorbed the fact that this only
> affected the indexes == 0 case; that case is probably extremely rare
> in real life. It's possible that accounts for why this hasn't caused
> more trouble. But there could also be reasons why, even if you do have
> that use case, this tends not to be too serious.

I think the main reason it's not all that bad, even when hitting this path, is
that one stall every 8GB just isn't that much and that the stalls aren't that
long - the leaf page fsm updates don't use the strategy, so they're still
somewhat likely to be in s_b, and there's "just" ~2MB of FSM to read. I tried
to reproduce it here, and it was a few ms, even though I dropped filesystem
caches in a loop.

Greetings,

Andres Freund




Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

2023-11-16 Thread Robert Haas
On Wed, Nov 15, 2023 at 6:17 PM Andres Freund  wrote:
> Thoughts on whether to backpatch? It'd probably be at least a bit painful,
> there have been a lot of changes in the surrounding code in the last 5 years.

I guess the main point that I'd make here is that we shouldn't
back-patch because it's wrong, but because of whatever consequences it
being wrong has. If somebody demonstrates that a deadlock occurs, or
that a painfully long stall can be constructed on a somewhat realistic
test case, then I think we should back-patch. If it's just something
that we look at and by visual inspection say "wow, that looks awful,"
that is not a sufficient reason to back-patch in my view. Such a
back-patch would still have known risk, but no known reward.

Until just now, I hadn't quite absorbed the fact that this only
affected the indexes == 0 case; that case is probably extremely rare
in real life. It's possible that accounts for why this hasn't caused
more trouble. But there could also be reasons why, even if you do have
that use case, this tends not to be too serious.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Hide exposed impl detail of wchar.c

2023-11-16 Thread Jubilee Young
Hello,

I work on pgrx, a Rust crate (really, a set of them) that allows
people to use Rust to write extensions against Postgres, exporting
what Postgres sees as ordinary "C" dynamic libraries. Unfortunately,
the build system for this is a touch complicated, as it cannot simply
run pgxs.mk, and as of Postgres 16 it has been periodically failing on
platforms it used to do fine on, due to troubles involving the
SIMD extension headers.

I have root-caused the exact problem, but the bug is... social, rather
than technical in nature: people with inadequate options at their
disposal have implemented increasingly clever educated guesses that
are increasingly prone to going wrong, rather than just asking anyone
to help them increase their options. Rather than continuing this
trend, I figured I would simply start doing things to hopefully draw
the line here. I will be looking to follow up with the bindgen tools
that fail to handle this correctly, but it would be nice if this
stopped shipping in Postgres 16."${PG_NEXT_MINOR}", as pgrx does need
the definitions in pg_wchar.h to have enough data to correctly
determine database encoding and preserve certain Rust library
invariants ("all Rust strings are correctly-formed UTF-8, anything
else is just a sequence of bytes") without also obliterating
performance.

On the off-chance that everyone agrees with me without reserve, the
attached patch moves the relevant code around (and includes the gory
details). This seems to be unlikely to be the only mildly-exotic build
system failure caused by such an overexposed implementation detail, so
while I'm not married to this particular code motion, it seems best to
improve this some way.
From f23b3675e38d32e7d52450573237f206286a9d61 Mon Sep 17 00:00:00 2001
From: Jubilee Young 
Date: Mon, 13 Nov 2023 11:59:27 -0800
Subject: [PATCH] Keep simd.h out of unrelated headers

The PGRX ("PostGres Rust eXtensions") build system has trouble inferring
where the appropriate compiler headers would be, due to the Clang-C APIs
it relies on not exposing an API to locate them. Because of this, it uses
an educated guess. When multiple clang installations are on the system,
this guess has a possibility of being wrong in disastrous ways.
Compiler builtin headers cannot be meaningfully "swapped" between
compiler versions, and we are lucky if they only type error on build.

Because of this, it is useful if pg_wchar.h does not expose simd.h details
to satisfy only one caller in wchar.c, as the header is still required by
PGRX (and many other users besides) in order to maintain Rust invariants.
Move is_valid_ascii into wchar.c in order to keep these implementation
details private to Postgres, and pray that other upstreams improve their
tooling so this problem doesn't trouble Postgres 17.
---
 src/common/wchar.c| 68 ++
 src/include/mb/pg_wchar.h | 69 ---
 2 files changed, 68 insertions(+), 69 deletions(-)

diff --git a/src/common/wchar.c b/src/common/wchar.c
index fb9d9f5c85..ea3c327fa5 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -14,6 +14,8 @@
 
 #include "mb/pg_wchar.h"
 
+#include "port/simd.h"
+
 
 /*
  * Operations on multi-byte encodings are driven by a table of helper
@@ -1911,6 +1913,72 @@ utf8_advance(const unsigned char *s, uint32 *state, int len)
 	*state &= 31;
 }
 
+/*
+ * Verify a chunk of bytes for valid ASCII.
+ *
+ * Returns false if the input contains any zero bytes or bytes with the
+ * high-bit set. Input len must be a multiple of the chunk size (8 or 16).
+ */
+static inline bool
+is_valid_ascii(const unsigned char *s, int len)
+{
+	const unsigned char *const s_end = s + len;
+	Vector8		chunk;
+	Vector8		highbit_acc = vector8_broadcast(0);
+#ifdef USE_NO_SIMD
+	Vector8		zero_acc = vector8_broadcast(0x80);
+#endif
+
+	Assert(len % sizeof(chunk) == 0);
+
+	while (s < s_end)
+	{
+		vector8_load(&chunk, s);
+
+		/* Capture any zero bytes in this chunk. */
+#ifdef USE_NO_SIMD
+
+		/*
+		 * First, add 0x7f to each byte. This sets the high bit in each byte,
+		 * unless it was a zero. If any resulting high bits are zero, the
+		 * corresponding high bits in the zero accumulator will be cleared.
+		 *
+		 * If none of the bytes in the chunk had the high bit set, the max
+		 * value each byte can have after the addition is 0x7f + 0x7f = 0xfe,
+		 * and we don't need to worry about carrying over to the next byte. If
+		 * any input bytes did have the high bit set, it doesn't matter
+		 * because we check for those separately.
+		 */
+		zero_acc &= (chunk + vector8_broadcast(0x7F));
+#else
+
+		/*
+		 * Set all bits in each lane of the highbit accumulator where input
+		 * bytes are zero.
+		 */
+		highbit_acc = vector8_or(highbit_acc,
+ vector8_eq(chunk, vector8_broadcast(0)));
+#endif
+
+		/* Capture all set bits in this chunk. */
+		highbit_acc = vector8_or(highbit_acc, chunk);
+
+		s += sizeof(chunk);
+	}
+
+	/* Check if any high bits in 

Re: On non-Windows, hard depend on uselocale(3)

2023-11-16 Thread Thomas Munro
On Thu, Nov 16, 2023 at 12:06 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Perhaps we could use snprintf_l() and strtod_l() where available.
> > They're not standard, but they are obvious extensions that NetBSD and
> > Windows have, and those are the two systems for which we are doing
> > grotty things in that code.
>
> Oooh, shiny.  I do not see any man page for strtod_l, but I do see
> that it's declared on mamba's host.  I wonder how long they've had it?
> The man page for snprintf_l appears to be quite ancient, so we could
> hope that strtod_l is available on all versions anyone cares about.

A decade[1].  And while I'm doing archeology, I noticed that POSIX has
agreed[2] in principle that *all* functions affected by the thread's
current locale should have a _l() variant, it's just that no one has
sent in the patch.

> > That would amount to extending
> > pg_locale.c's philosophy: either you must have uselocale(), or the
> > full set of _l() functions (that POSIX failed to standardise, dunno
> > what the history is behind that, seems weird).
>
> Yeah.  I'd say the _l functions should be preferred over uselocale()
> if available, but sadly they're not there on common systems.  (It
> looks like glibc has strtod_l but not snprintf_l, which is odd.)
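
For reference, on systems that do have strtod_l(), the call pattern would be
roughly this (a sketch; strtod_l() is non-standard, so availability has to be
probed at configure time, and the wrapper name here is illustrative):

    #define _GNU_SOURCE             /* for strtod_l() on glibc */
    #include <locale.h>
    #include <stdlib.h>

    /* Parse a double in the C locale, regardless of the thread's locale. */
    static double
    strtod_c(const char *nptr, char **endptr)
    {
        static locale_t clocale = (locale_t) 0;

        if (clocale == (locale_t) 0)
            clocale = newlocale(LC_ALL_MASK, "C", (locale_t) 0);
        return strtod_l(nptr, endptr, clocale);
    }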

Here is a first attempt.  In this version, new functions are exported
by pgtypeslib.  I realised that I had to do it in there because ECPG's
uselocale() jiggery-pokery is clearly intended to affect the
conversions happening in there too, and we probably don't want
circular dependencies between pgtypeslib and ecpglib.  I think this
means that pgtypeslib is actually subtly b0rked if you use it
independently without an ECPG connection (is that a thing people do?),
because all that code copied-and-pasted from the backend, when run in
frontend code with e.g. a French locale, will produce e.g. "0,42"; this
patch doesn't change that.
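
(The effect is easy to demonstrate standalone, assuming a French locale is
installed on the system:)

    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* assumes fr_FR.UTF-8 is installed */
        setlocale(LC_NUMERIC, "fr_FR.UTF-8");
        printf("%g\n", 0.42);   /* prints "0,42", not "0.42" */
        return 0;
    }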

I also had a go[3] at doing it with static inlined functions, to avoid
creating a load of new exported functions and associated function call
overheads.  It worked fine, except on Windows: I needed a global
variable PGTYPESclocale that all the inlined functions can see when
called from ecpglib or pgtypeslib code, but if I put that in the
exports list then on that platform it seems to contain garbage; there
is probably some other magic needed to export non-function symbols
from the DLL or something like that, I didn't look into it.  See CI
failure + crash dumps.

[1] 
https://github.com/NetBSD/src/commit/c99aac45e540bc210cc660619a6b5323cbb5c17f
[2] https://www.austingroupbugs.net/view.php?id=1004
[3] https://github.com/macdice/postgres/tree/strtod_l_inline


0001-ecpg-Use-thread-safe-_l-functions-if-possible.patch
Description: Binary data


Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500

2023-11-16 Thread Tom Lane
Richard Guo  writes:
> I think the second plan (patched) makes more sense.  In the first plan
> (unpatched), the HashAggregate node actually does not reduce the the
> number of rows because it groups by 'unique1', but planner does not know
> that because it lacks statistics for Vars referencing the CTE.

Yeah.  It's faster in reality too:

regression=# explain analyze with x as MATERIALIZED (select unique1 from tenk1 b)
select count(*) from tenk1 a where unique1 in (select * from x);
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=692.29..692.30 rows=1 width=8) (actual time=15.186..15.188 rows=1 loops=1)
   CTE x
     ->  Index Only Scan using tenk1_unique1 on tenk1 b  (cost=0.29..270.29 rows=1 width=4) (actual time=0.028..0.754 rows=1 loops=1)
           Heap Fetches: 0
   ->  Nested Loop  (cost=225.28..409.50 rows=5000 width=0) (actual time=3.652..14.733 rows=1 loops=1)
         ->  HashAggregate  (cost=225.00..227.00 rows=200 width=4) (actual time=3.644..4.510 rows=1 loops=1)
               Group Key: x.unique1
               Batches: 1  Memory Usage: 929kB
               ->  CTE Scan on x  (cost=0.00..200.00 rows=1 width=4) (actual time=0.030..1.932 rows=1 loops=1)
         ->  Index Only Scan using tenk1_unique1 on tenk1 a  (cost=0.29..0.90 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
               Index Cond: (unique1 = x.unique1)
               Heap Fetches: 0
 Planning Time: 0.519 ms
 Execution Time: 15.479 ms
(14 rows)

vs

regression=# explain analyze with x as MATERIALIZED (select unique1 from tenk1 b)
select count(*) from tenk1 a where unique1 in (select * from x);
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1028.07..1028.08 rows=1 width=8) (actual time=4.578..4.579 rows=1 loops=1)
   CTE x
     ->  Index Only Scan using tenk1_unique1 on tenk1 b  (cost=0.29..270.29 rows=1 width=4) (actual time=0.011..0.751 rows=1 loops=1)
           Heap Fetches: 0
   ->  Hash Semi Join  (cost=325.28..732.78 rows=1 width=0) (actual time=2.706..4.305 rows=1 loops=1)
         Hash Cond: (a.unique1 = x.unique1)
         ->  Index Only Scan using tenk1_unique1 on tenk1 a  (cost=0.29..270.29 rows=1 width=4) (actual time=0.011..0.676 rows=1 loops=1)
               Heap Fetches: 0
         ->  Hash  (cost=200.00..200.00 rows=1 width=4) (actual time=2.655..2.655 rows=1 loops=1)
               Buckets: 16384  Batches: 1  Memory Usage: 480kB
               ->  CTE Scan on x  (cost=0.00..200.00 rows=1 width=4) (actual time=0.012..1.963 rows=1 loops=1)
 Planning Time: 0.504 ms
 Execution Time: 4.821 ms
(13 rows)

Now, what you get if you remove MATERIALIZED is faster yet:

regression=# explain analyze with x as (select unique1 from tenk1 b)
select count(*) from tenk1 a where unique1 in (select * from x);
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=715.57..715.58 rows=1 width=8) (actual time=2.681..2.682 rows=1 loops=1)
   ->  Merge Semi Join  (cost=0.57..690.57 rows=1 width=0) (actual time=0.016..2.408 rows=1 loops=1)
         Merge Cond: (a.unique1 = b.unique1)
         ->  Index Only Scan using tenk1_unique1 on tenk1 a  (cost=0.29..270.29 rows=1 width=4) (actual time=0.007..0.696 rows=1 loops=1)
               Heap Fetches: 0
         ->  Index Only Scan using tenk1_unique1 on tenk1 b  (cost=0.29..270.29 rows=1 width=4) (actual time=0.007..0.655 rows=1 loops=1)
               Heap Fetches: 0
 Planning Time: 0.160 ms
 Execution Time: 2.718 ms
(9 rows)

I poked into that and found that the reason we don't get a mergejoin
with the materialized CTE is that the upper planner invocation doesn't
know that the CTE's output is sorted, so it thinks a separate sort
step would be needed.

So you could argue that there's more to do here, but I'm hesitant
to go further.  Part of the point of MATERIALIZED is to be an
optimization fence, so breaking down that fence is something to be
wary of.  Maybe we shouldn't even take this patch --- but on
balance I think it's an OK compromise.

regards, tom lane




Re: trying again to get incremental backup

2023-11-16 Thread Robert Haas
On Thu, Nov 16, 2023 at 12:34 PM Alvaro Herrera  wrote:
> Putting those two thoughts together, I think pg_basebackup with
> --progress could tell you "still waiting for the summary file up to LSN
> %X/%X to appear, and the walsummarizer is currently handling lsn %X/%X"
> or something like that.  This would probably require two concurrent
> connections, one to run BASE_BACKUP and another to inquire server state;
> but this should be easy enough to integrate together with parallel
> basebackup later.

I had similar thoughts, except I was thinking it would be better to
have the warnings be generated on the server side. That would save the
need for a second libpq connection, which would be good, because I
think adding that would result in a pretty large increase in
complexity and some not-so-great user-visible consequences. In fact,
my latest thought is to just remove the timeout altogether, and emit
warnings like this:

WARNING:  still waiting for WAL summarization to reach %X/%X after %d
seconds, currently at %X/%X

We could emit that every 30 seconds or so until either the situation
resolves itself or the user hits ^C. I think that would be good enough
here. If we want, the interval between messages can be a GUC, but I
don't know how much real need there will be to tailor that.
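
For illustration, the shape of that loop might be something like this
(a sketch only; the progress function and wait-event names here are
assumptions, not identifiers from the patch):

    int         elapsed = 0;
    XLogRecPtr  summarized;

    /* Wait until summarization catches up to target_lsn. */
    while ((summarized = GetWalSummarizerProgress()) < target_lsn)
    {
        CHECK_FOR_INTERRUPTS();     /* let the user hit ^C */

        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         30 * 1000L,    /* warning interval, in ms */
                         WAIT_EVENT_WAL_SUMMARY_READY);
        ResetLatch(MyLatch);

        elapsed += 30;      /* simplified; real code would track time */
        ereport(WARNING,
                errmsg("still waiting for WAL summarization to reach %X/%X after %d seconds, currently at %X/%X",
                       LSN_FORMAT_ARGS(target_lsn), elapsed,
                       LSN_FORMAT_ARGS(summarized)));
    }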

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: ALTER TABLE uses a bistate but not for toast tables

2023-11-16 Thread Justin Pryzby
@cfbot: rebased
>From a9bb61421c24bd3273a8519362febb4073c297a1 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 21 Jun 2022 22:28:06 -0500
Subject: [PATCH v4] WIP: use BulkInsertState for toast tuples, too

DONE: ALTER, CLUSTER
TODO: copyto, copyfrom?

slot_getsomeattrs slot_deform_heap_tuple fetchatt
heap_getnextslot => heapgettup => heapgetpage => ReadBufferExtended
initscan
table_beginscan table_scan_getnextslot
RelationCopyStorageUsingBuffer ReadBufferWithoutRelcache

(gdb) bt
 #0  table_open (relationId=relationId@entry=16390, lockmode=lockmode@entry=1) at table.c:40
 #1  0x56444cb23d3c in toast_fetch_datum (attr=attr@entry=0x7f67933cc6cc) at detoast.c:372
 #2  0x56444cb24217 in detoast_attr (attr=attr@entry=0x7f67933cc6cc) at detoast.c:123
 #3  0x56444d07a4c8 in pg_detoast_datum_packed (datum=datum@entry=0x7f67933cc6cc) at fmgr.c:1743
 #4  0x56444d042c8d in text_to_cstring (t=0x7f67933cc6cc) at varlena.c:224
 #5  0x56444d0434f9 in textout (fcinfo=) at varlena.c:573
 #6  0x56444d078f10 in FunctionCall1Coll (flinfo=flinfo@entry=0x56444e4706b0, collation=collation@entry=0, arg1=arg1@entry=140082828592844) at fmgr.c:1124
 #7  0x56444d079d7f in OutputFunctionCall (flinfo=flinfo@entry=0x56444e4706b0, val=val@entry=140082828592844) at fmgr.c:1561
 #8  0x56444ccb1665 in CopyOneRowTo (cstate=cstate@entry=0x56444e470898, slot=slot@entry=0x56444e396d20) at copyto.c:975
 #9  0x56444ccb2c7d in DoCopyTo (cstate=cstate@entry=0x56444e470898) at copyto.c:891
 #10 0x56444ccab4c2 in DoCopy (pstate=pstate@entry=0x56444e396bb0, stmt=stmt@entry=0x56444e3759b0, stmt_location=0, stmt_len=48, processed=processed@entry=0x7ffc212a6310) at copy.c:308

cluster:
heapam_relation_copy_for_cluster
reform_and_rewrite_tuple
rewrite_heap_tuple
raw_heap_insert

//-os-only: linux
---
 src/backend/access/common/toast_internals.c |  6 --
 src/backend/access/heap/heapam.c| 24 +++--
 src/backend/access/heap/heaptoast.c | 11 ++
 src/backend/access/heap/rewriteheap.c   |  6 +-
 src/backend/access/table/toast_helper.c |  6 --
 src/include/access/heaptoast.h  |  4 +++-
 src/include/access/hio.h|  2 ++
 src/include/access/toast_helper.h   |  3 ++-
 src/include/access/toast_internals.h|  4 +++-
 9 files changed, 48 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 371515395a5..3b52cba5eb1 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -118,7 +118,8 @@ toast_compress_datum(Datum value, char cmethod)
  */
 Datum
 toast_save_datum(Relation rel, Datum value,
- struct varlena *oldexternal, int options)
+ struct varlena *oldexternal, int options,
+ BulkInsertState bistate)
 {
 	Relation	toastrel;
 	Relation   *toastidxs;
@@ -321,7 +322,8 @@ toast_save_datum(Relation rel, Datum value,
		memcpy(VARDATA(&chunk_data), data_p, chunk_size);
 		toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
 
-		heap_insert(toastrel, toasttup, mycid, options, NULL);
+		heap_insert(toastrel, toasttup, mycid, options, bistate ?
+	bistate->toast_state : NULL);
 
 		/*
 		 * Create the index entry.  We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 14de8158d49..470ffa7e708 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -74,7 +74,8 @@
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
-	 TransactionId xid, CommandId cid, int options);
+	 TransactionId xid, CommandId cid, int options,
+	 BulkInsertState bistate);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
   Buffer newbuf, HeapTuple oldtup,
   HeapTuple newtup, HeapTuple old_key_tuple,
@@ -1765,9 +1766,15 @@ GetBulkInsertState(void)
 	bistate = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
 	bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
 	bistate->current_buf = InvalidBuffer;
+
 	bistate->next_free = InvalidBlockNumber;
 	bistate->last_free = InvalidBlockNumber;
 	bistate->already_extended_by = 0;
+
+	bistate->toast_state = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
+	bistate->toast_state->strategy = GetAccessStrategy(BAS_BULKWRITE);
+	bistate->toast_state->current_buf = InvalidBuffer;
+
 	return bistate;
 }
 
@@ -1779,6 +1786,10 @@ FreeBulkInsertState(BulkInsertState bistate)
 {
 	if (bistate->current_buf != InvalidBuffer)
 		ReleaseBuffer(bistate->current_buf);
+	if (bistate->toast_state->current_buf != InvalidBuffer)
+		ReleaseBuffer(bistate->toast_state->current_buf);
+
+	FreeAccessStrategy(bistate->toast_state->strategy);
 	FreeAccessStrategy(bistate->strategy);
 	pfree(bistate);
 }
@@ -1844,7 +1855,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 

Re: trying again to get incremental backup

2023-11-16 Thread Alvaro Herrera
On 2023-Nov-16, Alvaro Herrera wrote:

> On 2023-Oct-04, Robert Haas wrote:

> > - Right now, I have a hard-coded 60 second timeout for WAL
> > summarization. If you try to take an incremental backup and the WAL
> > summaries you need don't show up within 60 seconds, the backup times
> > out. I think that's a reasonable default, but should it be
> > configurable? If yes, should that be a GUC or, perhaps better, a
> > pg_basebackup option?
> 
> I'd rather have a way for the server to provide diagnostics on why the
> summaries aren't being produced.  Maybe a server running under valgrind
> is going to fail and need a longer one, but otherwise a hardcoded
> timeout seems sufficient.
> 
> You did say later that you thought summary files would just go from one
> checkpoint to the next.  So the only question is at what point the file
> for the last checkpoint (i.e. from the previous one up to the one
> requested by pg_basebackup) is written.  If walsummarizer keeps almost
> the complete state in memory and just waits for the checkpoint record to
> write it, then it's probably okay.

On 2023-Nov-16, Alvaro Herrera wrote:

> On 2023-Nov-16, Robert Haas wrote:
> 
> > On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera  
> > wrote:

> > > It's not clear to me if WalSummarizerCtl->pending_lsn is fulfilling some
> > > purpose or it's just a leftover from prior development.  I see it's only
> > > read in an assertion ... Maybe if we think this cross-check is
> > > important, it should be turned into an elog?  Otherwise, I'd remove it.
> > 
> > I've been thinking about that. One thing I'm not quite sure about
> > though is introspection. Maybe there should be a function that shows
> > summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
> > should expose pending_lsn too.
> 
> True.

Putting those two thoughts together, I think pg_basebackup with
--progress could tell you "still waiting for the summary file up to LSN
%X/%X to appear, and the walsummarizer is currently handling lsn %X/%X"
or something like that.  This would probably require two concurrent
connections, one to run BASE_BACKUP and another to inquire server state;
but this should be easy enough to integrate together with parallel
basebackup later.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: trying again to get incremental backup

2023-11-16 Thread Alvaro Herrera
On 2023-Nov-16, Robert Haas wrote:

> On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera  
> wrote:
> > I meant code like this
> >
> > memcpy(&key, rlocator, sizeof(RelFileLocator));
> > key.forknum = forknum;
> > entry = blockreftable_lookup(brtab->hash, key);
> 
> Ah, I hadn't thought about that. Another way of handling that might be
> to add = {0} to the declaration of key. But I can do the initializer
> thing too if you think it's better. I'm not sure if there's an
> argument that the initializer might optimize better.

I think the {0} initializer is good enough, given a comment to indicate
why.
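
For the record, a sketch of what that would look like (using the names
from the fragment quoted above; the key struct name is my assumption):

    /*
     * Zero-initialize the whole key, including any padding bytes, so
     * that padding can't carry garbage into the hash lookup.
     */
    BlockRefTableKey key = {0};

    memcpy(&key, rlocator, sizeof(RelFileLocator));
    key.forknum = forknum;
    entry = blockreftable_lookup(brtab->hash, key);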

> > It's not clear to me if WalSummarizerCtl->pending_lsn is fulfilling some
> > purpose or it's just a leftover from prior development.  I see it's only
> > read in an assertion ... Maybe if we think this cross-check is
> > important, it should be turned into an elog?  Otherwise, I'd remove it.
> 
> I've been thinking about that. One thing I'm not quite sure about
> though is introspection. Maybe there should be a function that shows
> summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
> should expose pending_lsn too.

True.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: trying again to get incremental backup

2023-11-16 Thread Alvaro Herrera
On 2023-Oct-04, Robert Haas wrote:

> - I would like some feedback on the generation of WAL summary files.
> Right now, I have it enabled by default, and summaries are kept for a
> week. That means that, with no additional setup, you can take an
> incremental backup as long as the reference backup was taken in the
> last week. File removal is governed by mtimes, so if you change the
> mtimes of your summary files or whack your system clock around, weird
> things might happen. But obviously this might be inconvenient. Some
> people might not want WAL summary files to be generated at all because
> they don't care about incremental backup, and other people might want
> them retained for longer, and still other people might want them to be
> not removed automatically or removed automatically based on some
> criteria other than mtime. I don't really know what's best here. I
> don't think the default policy that the patches implement is
> especially terrible, but it's just something that I made up and I
> don't have any real confidence that it's wonderful. One point to
> consider here is that, if WAL summarization is enabled, checkpoints
> can't remove WAL that isn't summarized yet. Mostly that's not a
> problem, I think, because the WAL summarizer is pretty fast. But it
> could increase disk consumption for some people. I don't think that we
> need to worry about the summaries themselves being a problem in terms
> of space consumption; at least in all the cases I've tested, they're
> just not very big.

So, wal_summary is no longer turned on by default, I think following a
comment from Peter E.  I think this is a good decision, as we're only
going to need them on servers from which incremental backups are going
to be taken, which is a strict subset of all servers; and furthermore,
people that need them are going to realize that very easily, while if we
went the other way around most people would not realize that they need to
turn them off to save some resource consumption.

Granted, the amount of resources additionally used is probably not very
big.  But since it can be changed with a reload, not a restart, it doesn't
seem problematic.

... oh, I just noticed that this patch now fails to compile because of
the MemoryContextResetAndDeleteChildren removal.

(Typo in the pg_walsummary manpage: "since WAL summary files primary
exist" -> "primarily")

> - On a related note, I haven't yet tested this on a standby, which is
> a thing that I definitely need to do. I don't know of a reason why it
> shouldn't be possible for all of this machinery to work on a standby
> just as it does on a primary, but then we need the WAL summarizer to
> run there too, which could end up being a waste if nobody ever tries
> to take an incremental backup. I wonder how that should be reflected
> in the configuration. We could do something like what we've done for
> archive_mode, where on means "only on if this is a primary" and you
> have to say always if you want it to run on standbys as well ... but
> I'm not sure if that's a design pattern that we really want to
> replicate into more places. I'd be somewhat inclined to just make
> whatever configuration parameters we need to configure this thing on
> the primary also work on standbys, and you can set each server up as
> you please. But I'm open to other suggestions.

I think it should default to off in primary and standby, and the user
has to enable it in whichever server they want to take backups from.

> - We need to settle the question of whether to send the whole backup
> manifest to the server or just the LSN. In a previous attempt at
> incremental backup, we decided the whole manifest was necessary,
> because flat-copying files could make new data show up with old LSNs.
> But that version of the patch set was trying to find modified blocks
> by checking their LSNs individually, not by summarizing WAL. And since
> the operations that flat-copy files are WAL-logged, the WAL summary
> approach seems to eliminate that problem - maybe an LSN (and the
> associated TLI) is good enough now. This also relates to Jakub's
> question about whether this machinery could be used to fast-forward a
> standby, which is not exactly a base backup but  ... perhaps close
> enough? I'm somewhat inclined to believe that we can simplify to an
> LSN and TLI; however, if we do that, then we'll have big problems if
> later we realize that we want the manifest for something after all. So
> if anybody thinks that there's a reason to keep doing what the patch
> does today -- namely, upload the whole manifest to the server --
> please speak up.

I don't understand this point.  Currently, the protocol is that
UPLOAD_MANIFEST is used to send the manifest prior to requesting the
backup.  You seem to be saying that you're thinking of removing support
for UPLOAD_MANIFEST and instead just give the LSN as an option to the
BASE_BACKUP command?

> - It's regrettable that we don't have incremental JSON parsing;

We now do 

Re: trying again to get incremental backup

2023-11-16 Thread Robert Haas
On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera  wrote:
> I meant code like this
>
> memcpy(&key, rlocator, sizeof(RelFileLocator));
> key.forknum = forknum;
> entry = blockreftable_lookup(brtab->hash, key);

Ah, I hadn't thought about that. Another way of handling that might be
to add = {0} to the declaration of key. But I can do the initializer
thing too if you think it's better. I'm not sure if there's an
argument that the initializer might optimize better.

> An output pointer, you mean :-)  (Should it be const?)

I'm bad at const, but that seems to work, so sure.

> When the return value is BACK_UP_FILE_FULLY, it's not clear what happens
> to these output values; we modify some, but why?  Maybe they should be
> left alone?  In that case, the "if size == 0" test should move a couple
> of lines up, in the brtentry == NULL block.

OK.

> BTW, you could do the qsort() after deciding to backup the file fully if
> more than 90% needs to be replaced.

OK.

> BTW, in sendDir() why do
>  lookup_path = pstrdup(pathbuf + basepathlen + 1);
> when you could do
>  lookup_path = pstrdup(tarfilename);
> ?

No reason, changed.

> > If we want to inject more underscores here, my vote is to go all the
> > way and make it per_wal_range_cb.
>
> +1

Will look into this.

> Yeah, I just think that an endless stream of hex chars is hard to read,
> and I've found myself following digits in the screen with my fingers in
> order to parse file names.  I guess you could say thousands separators
> for regular numbers aren't needed either, but we do have them for
> readability sake.

Sigh.

> I think a new section in chapter 30 "Reliability and the Write-Ahead
> Log" is warranted.  It would explain the summarization process, what the
> summary files are used for, and the deletion mechanism.  I can draft
> something if you want.

Sure, if you want to take a crack at it, that's great.

> It's not clear to me if WalSummarizerCtl->pending_lsn is fulfilling some
> purpose or it's just a leftover from prior development.  I see it's only
> read in an assertion ... Maybe if we think this cross-check is
> important, it should be turned into an elog?  Otherwise, I'd remove it.

I've been thinking about that. One thing I'm not quite sure about
though is introspection. Maybe there should be a function that shows
summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
should expose pending_lsn too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-16 Thread Robert Haas
On Tue, Nov 14, 2023 at 8:12 AM Alvaro Herrera  wrote:
> 0001 looks OK to push, and since it stands on its own I would get it out
> of the way soon rather than waiting for the rest of the series to be
> further reviewed.

All right, done.

> 0003:
> AmWalSummarizerProcess() is unused.  Remove?

The intent seems to be to have one of these per enum value, whether it
gets used or not. Some of the others aren't used, either.

> MaybeWalSummarizer() is called on each ServerLoop() in postmaster.c?
> This causes a function call to be emitted every time through.  That
> looks odd.  All other process starts have some triggering condition.

I'm not sure how much this matters, really. I would expect that the
function call overhead here wouldn't be very noticeable. Generally I
think that when ServerLoop returns from WaitEventSetWait it's going to
be because we need to fork a process. That's pretty expensive compared
to a function call. If we can iterate through this loop lots of times
without doing any real work then it might matter, but I feel like
that's probably not the case, and probably something we would want to
fix if it were the case.

Now, I could nevertheless move some of the triggering conditions in
MaybeStartWalSummarizer(), but moving, say, just the summarize_wal
condition wouldn't be enough to avoid having MaybeStartWalSummarizer()
called repeatedly when there was no work to do, because summarize_wal
could be true and the summarizer could already be running. Similarly, if I
move just the WalSummarizerPID == 0 condition, the function gets
called repeatedly without doing anything when summarize_wal = off. So
at a minimum you have to move both of those if you care about avoiding
the function call overhead, and then you have to wonder if you care
about the corner cases where the function would be called repeatedly
for no gain even then.

Another approach would be to make the function static inline rather
than just static. Or we could delete the whole function and just
duplicate the logic it contains at both call sites. Personally I'm
inclined to just leave it how it is in the absence of some evidence
that there's a real problem here. It's nice to have all the triggering
conditions in one place with nothing duplicated.
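
For concreteness, the function under discussion is roughly of this
shape (a simplified sketch, not the patch's exact code; the starter
function name is an assumption):

    static void
    MaybeStartWalSummarizer(void)
    {
        /* All triggering conditions live here, in one place. */
        if (summarize_wal && WalSummarizerPID == 0 && pmState == PM_RUN)
            WalSummarizerPID = StartWalSummarizer();
    }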

> GetOldestUnsummarizedLSN uses while(true), but WaitForWalSummarization
> and SummarizeWAL use while(1). Maybe settle on one style?

OK.

> 0004:
> in PrepareForIncrementalBackup(), the logic that determines
> earliest_wal_range_tli and latest_wal_range_tli looks pretty weird.  I
> think it works fine if there's a single timeline, but not otherwise.
> Or maybe the trick is that it relies on timelines returned by
> readTimeLineHistory being sorted backwards?  If so, maybe add a comment
> about that somewhere; I don't think other callers of readTimeLineHistory
> make that assumption.

It does indeed rely on that assumption, and the comment at the top of
the for (i = 0; i < num_wal_ranges; ++i) loop explains that. Note also
the comment just below that begins "If we found this TLI in the
server's history". I agree with you that this logic looks strange, and
it's possible that there's some better way to encode the idea than
what I've done here, but I think it might be just that the particular
calculation we're trying to do here is strange. It's almost easier to
understand the logic if you start by reading the sanity checks
("manifest requires WAL from initial timeline %u starting at %X/%X,
but that timeline begins at %X/%X" et. al.), look at the triggering
conditions for those, and then work upward to see how
earliest/latest_wal_range_tli get set, and then look up from there to
see how saw_earliest/latest_wal_range_tli are used in computing those
values.

We do rely on the ordering assumption elsewhere. For example, in
XLogFileReadAnyTLI, see if (tli < curFileTLI) break. We also use it to
set expectedTLEs, which is documented to have this property. And
AddWALInfoToBackupManifest relies on it too; see the comment "Because
the timeline history file lists newer timelines before older ones" in
AddWALInfoToBackupManifest. We're not entirely consistent about this,
e.g., unlike XLogFileReadAnyTLI, tliInHistory() and
tliOfPointInHistory() don't have an early exit provision, but we do
use it in some places.
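
The early-exit pattern that relies on the newest-first ordering looks
roughly like this (a sketch based on the XLogFileReadAnyTLI test
mentioned above):

    ListCell   *cell;

    foreach(cell, expectedTLEs)
    {
        TimeLineHistoryEntry *tle = (TimeLineHistoryEntry *) lfirst(cell);

        if (tle->tli < curFileTLI)
            break;      /* list is newest-first; only older TLIs remain */

        /* otherwise this timeline is still a candidate for the segment */
    }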

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgsql: doc: fix wording describing the checkpoint_flush_after GUC

2023-11-16 Thread Robert Haas
On Tue, Nov 14, 2023 at 3:01 PM Andres Freund  wrote:
> Hm, I really can't get excited about this. To me the "you" sounds worse, but
> whatever...

To me, it seems flat-out incorrect without the "you".

It might be better to rephrase the whole thing entirely so that it
doesn't need to address the reader, like allows you to force ->
forces.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Do away with a few backwards compatibility macros

2023-11-16 Thread Nathan Bossart
On Thu, Nov 16, 2023 at 07:11:41PM +0530, Bharath Rupireddy wrote:
> After a recent commit 6a72c42f (a related discussion [1]) which
> removed MemoryContextResetAndDeleteChildren(), I think there are a
> couple of other backward compatibility macros out there that can be
> removed. These macros are tuplestore_donestoring() which was
> introduced by commit dd04e95 21 years ago and SPI_push() and friends
> which were made no-ops macros by commit 1833f1a 7 years ago. Debian
> code search shows very minimal usages of these macros. Here's a patch
> attached to remove them.

I'm fine with this because all of these macros are no-ops for all supported
versions of Postgres.  Even if an extension is using them today, you'll get
the same behavior as before if you remove the uses and rebuild against
v12-v16.
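
For reference, the macros in question are (as far as I can see) all
no-op definitions along these lines:

    /* tuplestore.h, no-op since dd04e95 */
    #define tuplestore_donestoring(state)   ((void) 0)

    /* spi.h, no-ops since 1833f1a */
    #define SPI_push()  ((void) 0)
    #define SPI_pop()   ((void) 0)

so removing them cannot change any behavior; it can only break compiles
of code that still spells them out.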

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Wrong results with grouping sets

2023-11-16 Thread Alena Rybakina

Hi! Thank you for your work on the subject.

On 25.09.2023 10:11, Richard Guo wrote:

I think I've come across a wrong result issue with grouping sets, as
shown by the query below.

-- result is correct with only grouping sets
select a, b
from (values (1, 1), (2, 2)) as t (a, b) where a = b
group by grouping sets((a, b), (a));
 a | b
---+---
 1 | 1
 1 |
 2 | 2
 2 |
(4 rows)

-- result is NOT correct with grouping sets and distinct on
select distinct on (a, b) a, b
from (values (1, 1), (2, 2)) as t (a, b) where a = b
group by grouping sets((a, b), (a));
 a | b
---+---
 1 | 1
 2 | 2
(2 rows)

The distinct on expressions include both 'a' and 'b', so rows (1, 1) and
(1, NULL) should not be considered equal.  (The same for rows (2, 2) and
(2, NULL).)


I noticed that this query worked correctly in the main branch with the 
inequality operator:


postgres=# select distinct on (a, b) a, b from (values (3, 1), (2, 2))
as t (a, b) where a > b group by grouping sets((a, b), (a));
 a | b
---+---
 3 | 1
 3 |
(2 rows)


So, I think you are right)



I think the root cause is that when we generate distinct_pathkeys, we
failed to realize that Var 'b' might be nullable by the grouping sets,
so it's no longer always equal to Var 'a'.  It's not correct to deem
that the PathKey for 'b' is redundant and thus remove it from the
pathkeys list.

We have the same issue when generating sort_pathkeys.  As a result, we
may have the final output in the wrong order.  There were several
reports about this issue before, such as [1][2].

To fix this issue, I'm thinking that we mark the grouping expressions
nullable by grouping sets with a dummy RTE for grouping sets, something
like attached.  In practice we do not need to create a real RTE for
that, what we need is just a RT index.  In the patch I use 0, because
it's not a valid outer join relid, so it would not conflict with the
outer-join-aware-Var infrastructure.

If the grouping expression is a Var or PHV, we can just set its
nullingrels, very straightforward.   For an expression that is neither a
Var nor a PHV, I'm not quite sure how to set the nullingrels.  I tried
the idea of wrapping it in a new PHV to carry the nullingrels, but that
may cause some unnecessary plan diffs.  In the patch for such an
expression I just set the nullingrels of Vars or PHVs that are contained
in it.  This is not really 'correct' in theory, because it is the whole
expression that can be nullable by grouping sets, not its individual
vars.  But it works in practice, because what we need is that the
expression can be somehow distinguished from the same expression in ECs,
and marking its vars is sufficient for this purpose.  But what if the
expression is variable-free?  This is the point that needs more work.
Fow now the patch just handles variable-free expressions of type Const,
by wrapping it in a new PHV.
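
For the Var/PHV cases, the core of the idea is just the following
sketch (with 0 as the dummy RT index; a real patch would of course need
to copy the node before scribbling on it):

    if (IsA(node, Var))
    {
        Var        *var = (Var *) node;

        var->varnullingrels = bms_add_member(var->varnullingrels, 0);
    }
    else if (IsA(node, PlaceHolderVar))
    {
        PlaceHolderVar *phv = (PlaceHolderVar *) node;

        phv->phnullingrels = bms_add_member(phv->phnullingrels, 0);
    }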

There are some places where we need to artificially remove the RT index
of grouping sets from the nullingrels, such as when we generate
PathTarget for initial input to grouping nodes, or when we generate
PathKeys for the grouping clauses, because the expressions there are
logically below the grouping sets.  We also need to do that in
set_upper_references when we update the targetlist of an Agg node, so
that we can perform exact match for nullingrels, rather than superset
match.

Since the fix depends on the nullingrels stuff, it seems not easy for
back-patching.  I'm not sure what we should do in back branches.

Any thoughts?

[1] 
https://www.postgresql.org/message-id/cambws48atqtqgk37msydk_eagdo3y0ia_lzvuvgq2uo_wh2...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/17071-24dc13fbfa296...@postgresql.org



I looked at your patch and noticed a few things:

1. I think you should add a test with the cube operator, because I 
noticed that the row order in the result also changes:


master:
postgres=# select a,b from (values(1,1),(2,2)) as t (a,b) where a=b
group by cube (a, (a,b)) order by b,a;
 a | b
---+---
 1 | 1
 1 | 1
 1 |
 2 | 2
 2 | 2
 2 |
   |
(7 rows)


with patch:

postgres=# select a, b from (values (1, 1), (2, 2)) as t (a, b) where a
= b group by cube(a, (a, b)) order by b,a;
 a | b
---+---
 1 | 1
 1 | 1
 2 | 2
 2 | 2
 1 |
 2 |
   |
(7 rows)


--
Regards,
Alena Rybakina


Re: BUG #18097: Immutable expression not allowed in generated at

2023-11-16 Thread Tom Lane
Aleksander Alekseev  writes:
>> True, but from the perspective of the affected code, the question is
>> basically "did you call expression_planner() yet".  So I like this
>> naming for that connection, whereas something based on "transformation"
>> doesn't really connect to anything in existing function names.

> Fair enough.

Pushed like that, then.  Thanks for reviewing!

regards, tom lane




Re: psql not responding to SIGINT upon db reconnection

2023-11-16 Thread Heikki Linnakangas

On 06/11/2023 19:16, Tristan Partin wrote:

That sounds like a much better solution. Attached you will find a v4
that implements your suggestion. Please let me know if there is
something that I missed. I can confirm that the patch works.


This patch is missing a select(). It will busy-loop until the connection 
is established or cancelled.
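
Something along these lines, for example (a sketch only, with error
handling elided):

    /*
     * Drive PQconnectPoll(), but sleep in select() between rounds
     * instead of busy-looping; the short timeout keeps cancel
     * processing responsive.
     */
    PostgresPollingStatusType status = PGRES_POLLING_WRITING;

    while (status != PGRES_POLLING_OK && status != PGRES_POLLING_FAILED &&
           !CancelRequested)
    {
        fd_set          mask;
        struct timeval  timeout = {0, 100 * 1000};  /* 100ms */
        int             sock = PQsocket(conn);

        FD_ZERO(&mask);
        FD_SET(sock, &mask);
        if (status == PGRES_POLLING_READING)
            (void) select(sock + 1, &mask, NULL, NULL, &timeout);
        else
            (void) select(sock + 1, NULL, &mask, NULL, &timeout);

        status = PQconnectPoll(conn);
    }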


Shouldn't we also clear CancelRequested after we have cancelled the 
attempt? Otherwise, any subsequent attempts will immediately fail too.


Should we use 'cancel_pressed' here rather than CancelRequested? To be 
honest, I don't understand the difference, so that's a genuine question. 
There was an attempt at unifying them in the past but it was reverted in 
commit 5d43c3c54d.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: [HACKERS] Should logtape.c blocks be of type long?

2023-11-16 Thread Heikki Linnakangas

On 26/09/2023 07:15, Michael Paquier wrote:

On Sun, Sep 24, 2023 at 10:42:49AM +0900, Michael Paquier wrote:

Indeed, or Windows decides that making long 8-byte is wiser, but I
doubt that's ever going to happen on backward-compatibility ground.


While looking more at that, I've noticed that I missed BufFileAppend()
and BufFileSeekBlock(), that themselves rely on long.  The other code
paths calling these two routines rely on BlockNumber (aka uint32), so
that seems to be the bottom of it.


BufFileTellBlock should be adjusted too. Or removed altogether; it's 
been commented out since year 2000. Other than that, looks good to me.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: retire MemoryContextResetAndDeleteChildren backwards compatibility macro

2023-11-16 Thread Bharath Rupireddy
On Tue, Nov 14, 2023 at 11:29 PM Nathan Bossart
 wrote:
>
> On Tue, Nov 14, 2023 at 10:46:25PM +0530, Bharath Rupireddy wrote:
> > FWIW, there are other backward compatibility macros out there like
> > tuplestore_donestoring which was introduced by commit dd04e95 21 years
> > ago and SPI_push() and its friends which were made no-ops macros by
> > commit 1833f1a 7 years ago. Debian code search shows very minimal
> > usages of the above macros. Can we do away with
> > tuplestore_donestoring?
>
> Can we take these other things to a separate thread?

Sure. Here it is -
https://www.postgresql.org/message-id/CALj2ACVeO58JM5tK2Qa8QC-%3DkC8sdkJOTd4BFU%3DK8zs4gGYpjQ%40mail.gmail.com.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Do away with a few backwards compatibility macros

2023-11-16 Thread Bharath Rupireddy
Hi,

After a recent commit 6a72c42f (a related discussion [1]) which
removed MemoryContextResetAndDeleteChildren(), I think there are a
couple of other backward compatibility macros out there that can be
removed. These macros are tuplestore_donestoring() which was
introduced by commit dd04e95 21 years ago and SPI_push() and friends
which were made no-ops macros by commit 1833f1a 7 years ago. Debian
code search shows very minimal usages of these macros. Here's a patch
attached to remove them.

Thoughts?

[1] https://www.postgresql.org/message-id/20231114175953.GD2062604%40nathanxps13

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Do-away-with-a-few-backwards-compatibility-macros.patch
Description: Binary data


Re: Trigger violates foreign key constraint

2023-11-16 Thread Aleksander Alekseev
Hi,

> Attached is a slightly modified version of the patch.

The patch was marked as "Needs Review" so I decided to take a look.

I believe it's a good patch. The text is well written, has the
necessary references, it warns the user but doesn't scare him/her too
much.

> I couldn't make it clearer.  I'd have to write an example to make it clearer,
> and that would certainly be out of scope.

Personally I don't think we should document particular steps to break
FK constraints. In a similar situation, when we documented the
limitations around XID wraparound, we didn't document particular steps
to trigger it. I don't recall encountering such examples in the
documentation, so I don't think we should add them here, if only for
consistency.

All in all the patch seems to be as good as it will get. I suggest merging it.

-- 
Best regards,
Aleksander Alekseev




Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression

2023-11-16 Thread Amul Sul
On Thu, Nov 16, 2023 at 2:50 AM Peter Eisentraut 
wrote:

> On 15.11.23 13:26, Amul Sul wrote:
> > Question: Why are you using AT_PASS_ADD_OTHERCONSTR?  I don't know if
> > it's right or wrong, but if you have a specific reason, it would be
> > good
> > to know.
> >
> > I referred to ALTER COLUMN DEFAULT and used that.
>
> Hmm, I'm not sure if that is a good comparison.  For ALTER TABLE, SET
> DEFAULT is just a catalog manipulation, it doesn't change any data, so
> it's pretty easy.  SET EXPRESSION changes data, which other phases might
> want to inspect?  For example, if you do SET EXPRESSION and add a
> constraint in the same ALTER TABLE statement, do those run in the
> correct order?
>

I think you are correct, but currently AT_PASS_ADD_OTHERCONSTR is only
used for AT_CookedColumnDefault, AT_ColumnDefault, and AT_AddIdentity.
AT_CookedColumnDefault is only supported for CREATE TABLE, and
AT_ColumnDefault and AT_AddIdentity raise errors when operating on a
generated column, so the anomaly does not exist today, though it could
appear with a future addition. I think it is better to use AT_PASS_MISC
so that this operation runs last.

While testing this, I found a serious problem with the patch: CHECK and
FOREIGN KEY constraint checks do not happen at rewrite. See this:

create table a (y int primary key);
insert into a values(1),(2);
create table b (x int, y int generated always as(x) stored, foreign key(y)
references a(y));
insert into b values(1),(2);
insert into b values(3);    <-- an error, as expected

alter table b alter column y set expression as (x*100);  <-- no error,
NOT expected

select * from b;
 x |  y
---+-
 1 | 100
 2 | 200
(2 rows)

Also,

delete from a;   <-- no error, NOT expected.
select * from a;
 y
---
(0 rows)

Shouldn't that have been handled by the ATRewriteTables() facility
implicitly
like NOT NULL constraints?  Or should we prepare a list of CHECK and FK
constraints and pass it through tab->constraints?


> > Tiny comment: The error message in ATExecSetExpression() does not
> need
> > to mention "stored", since it would be also applicable to virtual
> > generated columns in the future.
> >
> > I had the same thought, but later decided that when we do the
> > virtual column work, we could simply change it then. I am fine with
> > doing that change now as well; let me know your thoughts.
>
> Not a big deal, but I would change it now.
>
> Another small thing I found:  In ATExecColumnDefault(), there is an
> errhint() that suggests DROP EXPRESSION instead of DROP DEFAULT.  You
> could now add another hint that suggests SET EXPRESSION instead of SET
> DEFAULT.
>

Ok.

Regards,
Amul Sul


RE: pg_upgrade and logical replication

2023-11-16 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for updating the patch! Here are some comments.
They are mainly cosmetic because I have not read yours these days.

01. binary_upgrade_add_sub_rel_state()

```
+/* We must check these things before dereferencing the arguments */
+if (PG_ARGISNULL(0) || PG_ARGISNULL(1) || PG_ARGISNULL(2))
+elog(ERROR, "null argument to binary_upgrade_add_sub_rel_state is not 
allowed")
```

But the fourth argument can be NULL, right? I know you copied this from
other functions, but they do not accept NULL for all arguments. One
approach is to have pg_dump explicitly write InvalidXLogRecPtr as the
fourth argument.

02. binary_upgrade_add_sub_rel_state()

```
+if (!OidIsValid(relid))
+ereport(ERROR,
+errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("invalid relation identifier used: %u", relid));
+
+tup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
+if (!HeapTupleIsValid(tup))
+ereport(ERROR,
+errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("relation %u does not exist", relid))
```

I'm not sure these should be ereport(). Won't they simply never occur?
Other upgrade functions do not use ereport(), and I don't think these
messages need to be translated.

03. binary_upgrade_replorigin_advance()

IIUC this function is very similar to pg_replication_origin_advance(). Can we
extract a common part of them? I think pg_replication_origin_advance() will be
just a wrapper, and binary_upgrade_replorigin_advance() will get the name of
origin and pass to it.

04. binary_upgrade_replorigin_advance()

Even if you do not accept 03, some variable names could be made to follow
that function.

05. getSubscriptions()

```
+appendPQExpBufferStr(query, "o.remote_lsn AS suboriginremotelsn\n")
```

Hmm, this value is fetched anyway, but will be dumped only when the
cluster is PG17+. Should we avoid fetching the value, as is done for
subrunasowner and subpasswordrequired? Not sure...

06. dumpSubscriptionTable()

Can we assert that the remote version is PG17+?

07. check_for_subscription_state()

IIUC, this function is used only for the old cluster. Should we follow
check_old_cluster_for_valid_slots()?

08. check_for_subscription_state()

```
+fprintf(script, "database:%s subscription:%s schema:%s relation:%s 
state:%s not in required state\n",
+active_db->db_name,
+PQgetvalue(res, i, 0),
+PQgetvalue(res, i, 1),
+PQgetvalue(res, i, 2),
+PQgetvalue(res, i, 3));
```

IIRC, format strings should be double-quoted.

09. check_new_cluster_logical_replication_slots()

Checks for the replication origin were added in
check_new_cluster_logical_replication_slots(),
but I feel it has become a super-function. Can we divide it?

10. check_new_cluster_logical_replication_slots()

Even if you reject the above, it should be renamed.

11. pg_upgrade.h

```
+int subscription_count; /* number of subscriptions */
```

Based on other struct, it should be "nsubscriptions".

12. 004_subscription.pl

```
+use File::Path qw(rmtree);
```

I think this is not used.

13. 004_subscription.pl

```
+my $bindir = $new_sub->config_data('--bindir');
```
For extensibility, it might be better to have separate old/new bindirs.

14. 004_subscription.pl

```
+my $synced_query =
+  "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'r'";
+$old_sub->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
```

Actually, I'm not sure it is really needed. wait_for_subscription_sync()
on line 163 already ensures the sync is done, doesn't it? Are there any
holes here?

15. 004_subscription.pl

```
+# Check the number of rows for each table on each server
+my $result =
+  $publisher->safe_psql('postgres', "SELECT count(*) FROM tab_upgraded");
+is($result, qq(50), "check initial tab_upgraded table data on publisher");
+$result =
+  $publisher->safe_psql('postgres', "SELECT count(*) FROM tab_not_upgraded");
+is($result, qq(1), "check initial tab_upgraded table data on publisher");
+$result =
+  $old_sub->safe_psql('postgres', "SELECT count(*) FROM tab_upgraded");
+is($result, qq(50),
+"check initial tab_upgraded table data on the old subscriber");
+$result =
+  $old_sub->safe_psql('postgres', "SELECT count(*) FROM tab_not_upgraded");
+is($result, qq(0),
+"check initial tab_not_upgraded table data on the old subscriber");
```

I'm not sure they are really needed. When pg_upgrade --check is called,
it won't change the state of the clusters.

16. pg_proc.dat

```
+{ oid => '8404', descr => 'for use by pg_upgrade (relation for 
pg_subscription_rel)',
+  proname => 'binary_upgrade_add_sub_rel_state', proisstrict => 'f',
+  provolatile => 'v', proparallel => 'u', prorettype => 'void',
+  proargtypes => 'text oid char pg_lsn',
+  prosrc => 'binary_upgrade_add_sub_rel_state' },
+{ oid => '8405', descr => 'for use by pg_upgrade 

Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)

2023-11-16 Thread Bharath Rupireddy
On Thu, Nov 16, 2023 at 4:01 PM Alvaro Herrera  wrote:
>
> On 2023-Nov-16, Peter Smith wrote:
>
> > I searched HEAD code and did not find any "translator:" comments for
> > just ordinary slot name substitutions like these; AFAICT that comment
> > is not necessary anymore.
>
> True.  Lose that.

Removed.

> The rationale I have is to consider whether a translator looking at the
> original message in isolation is going to understand what the %s
> means.  If it's possible to tell what it is without having to go read
> the source code that leads to the message, then you don't need a
> "translator:" comment.  Otherwise you do.

Agreed. I think it's easy to guess that the slot name follows
"replication slot %s", so I removed the translator comment.

> > SUGGESTION (#1a and #1b)
> >
> > ereport(log_replication_commands ? LOG : DEBUG1,
> > errmsg(SlotIsLogical(s)
> >? "acquired logical replication slot \"%s\""
> >: "acquired physical replication slot \"%s\"",
> >NameStr(s->data.name)));
>
> The bad thing about this is that gettext() is not going to pick up these
> strings into the translation catalog.  You could fix that by adding
> gettext_noop() calls around them:
>
>  ereport(log_replication_commands ? LOG : DEBUG1,
>  errmsg(SlotIsLogical(s)
> ? gettext_noop("acquired logical replication slot \"%s\"")
> : gettext_noop("acquired physical replication slot \"%s\""),
> NameStr(s->data.name)));
>
> but at that point it's not clear that it's really better than putting
> the ternary in the outer scope.

I retained wrapping each message in its own errmsg("...").
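That is, each branch keeps its own complete errmsg() call, roughly:

    if (SlotIsLogical(s))
        ereport(log_replication_commands ? LOG : DEBUG1,
                errmsg("acquired logical replication slot \"%s\"",
                       NameStr(s->data.name)));
    else
        ereport(log_replication_commands ? LOG : DEBUG1,
                errmsg("acquired physical replication slot \"%s\"",
                       NameStr(s->data.name)));

so that gettext picks up both strings, as the catalog entries quoted
below show.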

> You can verify this by doing "make update-po" and then searching for the
> messages in postgres.pot.

Translation gives me [1] with the v18 patch.

PSA v18 patch.

[1]
#: replication/slot.c:545
#, c-format
msgid "acquired logical replication slot \"%s\""
msgstr ""

#: replication/slot.c:547
#, c-format
msgid "acquired physical replication slot \"%s\""
msgstr ""

#: replication/slot.c:622
#, c-format
msgid "released logical replication slot \"%s\""
msgstr ""

#: replication/slot.c:624
#, c-format
msgid "released physical replication slot \"%s\""
msgstr ""

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v18-0001-Log-messages-for-replication-slot-acquisition-an.patch
Description: Binary data


RE: Popcount optimization using AVX512

2023-11-16 Thread Shankaran, Akash
Sorry for the late response here. We spent some time researching and measuring 
the frequency impact of AVX512 instructions used here.

>How does this compare to older CPUs, and more mixed workloads? IIRC,
the use of AVX512 (which I believe this instruction to be included in)
has significant implications for core clock frequency when those
instructions are being executed, reducing overall performance if
they're not a large part of the workload.

AVX512 has light and heavy instructions. While the heavy AVX512 instructions 
have clock frequency implications, the light instructions do not, or much 
less so. See [0] for more details. We captured EMON data for the benchmark 
used in this work, and see that the instructions run at the license level 
used for light operations, not the one associated with heavy AVX512 
operations. This means the popcount instructions, _mm512_popcnt_epi64() and 
_mm512_reduce_add_epi64(), are not going to have any significant impact on 
CPU clock frequency. 
Clock frequency impact aside, we measured the same benchmark for gains on 
older Intel hardware and observe up to 18% better performance on Intel 
Icelake. On still older Intel hardware, the 512-bit popcntdq instruction is 
not present, so it won't work there at all. If clock frequency is not 
affected, the rest of the workload should not be impacted in the case of 
mixed workloads. 
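
For reference, the kernel being measured combines those two intrinsics
roughly as follows (a simplified sketch, not the patch's exact code; it
assumes the word count is a multiple of eight, with tail handling
omitted):

    #include <immintrin.h>

    static uint64_t
    popcount_avx512(const uint64_t *buf, size_t nwords)
    {
        __m512i     accum = _mm512_setzero_si512();

        /* Eight 64-bit lanes of popcount per iteration. */
        for (size_t i = 0; i < nwords; i += 8)
        {
            __m512i v = _mm512_loadu_si512((const void *) (buf + i));

            accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(v));
        }
        return (uint64_t) _mm512_reduce_add_epi64(accum);
    }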

>Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?

Testing on smaller block sizes (< 8KiB) shows AVX512 performing slightly 
worse than the current 64-bit implementation, but with a large variance, so 
we cannot conclude much from it. The testing with the ANALYZE benchmark by 
Nathan also points to no visible impact from using AVX512. The gains on 
larger datasets are clearly evident, with less variance. 
What are your thoughts on introducing AVX512 popcount for smaller sizes as 
an optional feature initially, and then testing it more thoroughly over 
time on this particular use case? 

Regarding enablement, following the other responses related to function 
inlining, using ifunc and enabling future intrinsic support, it seems a 
concrete solution would require further discussion. We’re attaching a patch to 
enable AVX512, which can use AVX512 flags during build. For example:
  >make -E CFLAGS_AVX512="-mavx -mavx512dq -mavx512vpopcntdq -mavx512vl 
-march=icelake-server -DAVX512_POPCNT=1"

Thoughts or feedback on the approach in the patch? This solution should not 
impact anyone who doesn't use the feature (AVX512). Open to additional 
ideas if this doesn't seem like the right approach here.

[0] 
https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

-Original Message-
From: Nathan Bossart  
Sent: Tuesday, November 7, 2023 12:15 PM
To: Noah Misch 
Cc: Tom Lane ; Matthias van de Meent 
; Amonson, Paul D ; 
pgsql-hackers@lists.postgresql.org; Shankaran, Akash 
Subject: Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 09:53:15PM -0800, Noah Misch wrote:
> On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:
>> On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:
>> > The glibc/gcc "ifunc" mechanism was designed to solve this problem 
>> > of choosing a function implementation based on the runtime CPU, 
>> > without incurring function pointer overhead.  I would not attempt 
>> > to use AVX512 on non-glibc systems, and I would use ifunc to select the 
>> > desired popcount implementation on glibc:
>> > https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html
>> 
>> Thanks, that seems promising for the function pointer cases.  I'll 
>> plan on trying to convert one of the existing ones to use it.  BTW it 
>> looks like LLVM has something similar [0].
>> 
>> IIUC this unfortunately wouldn't help for cases where we wanted to 
>> keep stuff inlined, such as is_valid_ascii() and the functions in 
>> pg_lfind.h, unless we applied it to the calling functions, but that 
>> doesn't sound particularly maintainable.
> 
> Agreed, it doesn't solve inline cases.  If the gains are big enough, 
> we should move toward packages containing N CPU-specialized copies of 
> the postgres binary, with bin/postgres just exec'ing the right one.

I performed a quick test with ifunc on my x86 machine that ordinarily uses the 
runtime checks for the CRC32C code, and I actually see a consistent 3.5% 
regression for pg_waldump -z on 100M 65-byte records.  I've attached the patch 
used for testing.
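
For anyone following along, the ifunc mechanism tested here looks
roughly like this (illustrative only, not the attached patch; the
ifunc'd symbol name is made up):

    #include "port/pg_crc32c.h"

    /* Resolver runs once at dynamic-link time and picks the variant. */
    static pg_crc32c (*resolve_pg_comp_crc32c(void)) (pg_crc32c, const void *, size_t)
    {
        __builtin_cpu_init();
        return __builtin_cpu_supports("sse4.2") ?
            pg_comp_crc32c_sse42 : pg_comp_crc32c_sb8;
    }

    pg_crc32c pg_comp_crc32c_ifunc(pg_crc32c crc, const void *data, size_t len)
        __attribute__((ifunc("resolve_pg_comp_crc32c")));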

The multiple-copies-of-the-postgres-binary idea seems interesting.  That's 
probably not something that could be enabled by default, but perhaps we could 
add support for a build option.

--
Nathan Bossart
Amazon Web 

Re: serial and partitioned table

2023-11-16 Thread Ashutosh Bapat
On Mon, Nov 13, 2023 at 3:39 PM Peter Eisentraut  wrote:
>
> On 17.10.23 09:25, Ashutosh Bapat wrote:
> > #create table tpart (a serial primary key, src varchar) partition by 
> > range(a);
> > CREATE TABLE
> > #create table t_p4 (a int primary key, src varchar);
> > CREATE TABLE
> > To appease the gods of surprises I need to add a NOT NULL constraint. See 
> > [1].
> > #alter table t_p4 alter column a set not null;
> > ALTER TABLE
> > #alter table tpart attach partition t_p4 for values from (7) to (9);
> > ALTER TABLE
> > #\d t_p4
> >   Table "public.t_p4"
> >   Column |   Type| Collation | Nullable | Default
> > +---+---+--+-
> >   a  | integer   |   | not null |
> >   src| character varying |   |  |
> > Partition of: tpart FOR VALUES FROM (7) TO (9)
> > Indexes:
> >  "t_p4_pkey" PRIMARY KEY, btree (a)
> >
> > The partition was attached but the gods of surprises forgot to set the
> > default value for a, which gets set when we create a partition
> > directly.
> > #create table t_p3 partition of tpart for values from (5) to (7);
> > CREATE TABLE
> > #\d t_p3
> >   Table "public.t_p3"
> >   Column |   Type| Collation | Nullable |
> > Default
> > +---+---+--+--
> >   a  | integer   |   | not null |
> > nextval('tpart_a_seq'::regclass)
> >   src| character varying |   |  |
> > Partition of: tpart FOR VALUES FROM (5) TO (7)
> > Indexes:
> >  "t_p3_pkey" PRIMARY KEY, btree (a)
>
> Partitions can have default values different from the parent table.  So
> it would not be correct for the attach operation to just overwrite the
> defaults on the table being attached.  One might think, it should only
> adjust the default if no default was explicitly specified.  But we don't
> have a way to tell apart "no default" from "null default was actually
> intended".
>
> So, while I agree that there is some potential for confusion here, I
> think this might be intentional behavior.  Or at least there is no
> better possible behavior.

Ok.

>
> >
> > Gods of surprises have another similar gift.
> > #create table t_p2(a serial primary key, src varchar);
> > CREATE TABLE
> > #alter table tpart attach partition t_p2 for values from (3) to (5);
> > ALTER TABLE
> > #\d t_p2
> >   Table "public.t_p2"
> >   Column |   Type| Collation | Nullable |
> > Default
> > +---+---+--+-
> >   a  | integer   |   | not null |
> > nextval('t_p2_a_seq'::regclass)
> >   src| character varying |   |  |
> > Partition of: tpart FOR VALUES FROM (3) TO (5)
> > Indexes:
> >  "t_p2_pkey" PRIMARY KEY, btree (a)
> > Observe that t_p2 uses a different sequence, not the sequence used by
> > the partitioned table tpart.
>
> I think this is also correct if you consider the definition of serial as
> a macro that creates a sequence.  Of course, the behavior is silly,
> which is why we are plotting to get rid of the current definition of serial.

Ok.

If we implement the identity behaviour as per the discussion in the
partitioning and identity thread [1], the behaviour of serial columns
will differ from that of identity columns. The behaviour of the
identity column would be saner, of course. When and if we redirect
serial to identity, there will be some surprises because of these
differences. I think this is moving things in a better direction. We
just need to acknowledge and agree on this.

[1] 
https://www.postgresql.org/message-id/flat/8801cade-20d2-4c9c-a583-b3754beb9be3%40eisentraut.org#9ce279be53b86dd9ab5fce027c94687d

-- 
Best Wishes,
Ashutosh Bapat




Re: partitioning and identity column

2023-11-16 Thread Ashutosh Bapat
On Mon, Nov 13, 2023 at 3:51 PM Peter Eisentraut  wrote:
>
> On 27.10.23 13:32, Ashutosh Bapat wrote:
> > I think we should fix these anomalies as follows
> > 1. Allow identity columns to be added to the partitioned table
> > irrespective of whether they have partitions of not.
> > 2. Propagate identity property to partitions.
> > 3. Use the same underlying sequence for getting default value of an
> > identity column when INSERTing directly in a partition.
> > 4. Disallow attaching a partition with identity column.
> >
> > 1 will fix inconsistencies in Behaviour 3 and 4. 2 and 3 will fix
> > anomalies in Behaviour 1. 4 will fix Behaviour 2.
>
> This makes sense to me.
>
> Note, here is a writeup about the behavior of generated columns with
> partitioning:
> https://www.postgresql.org/docs/devel/ddl-generated-columns.html.  It
> would be useful if we documented the behavior of identity columns
> similarly.  (I'm not saying the behavior has to match.)

Yes. Will add the documentation while working on the code.

>
> One thing that's not clear to me is what should happen if you have a
> partitioned table with an identity column and you try to attach a
> partition that has its own identity definition for that column.  I
> suppose we shouldn't allow that.

That's point 4 above. We shouldn't allow that case.

> (The equivalent case for generated
> columns is allowed.)
>

There might be some weird behaviour because of that, like virtual
columns from the same partition reporting different values depending on
the table used in the SELECT clause, or stored generated columns having
different values for two rows with the same underlying columns just
because they were INSERTed into different tables (partitioned vs.
partition). That may have some impact on logical replication. I
haven't tested this myself. Maybe a topic for a separate thread.

-- 
Best Wishes,
Ashutosh Bapat




Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)

2023-11-16 Thread Alvaro Herrera
On 2023-Nov-16, Peter Smith wrote:

> I searched HEAD code and did not find any "translator:" comments for
> just ordinary slot name substitutions like these; AFAICT that comment
> is not necessary anymore.

True.  Lose that.

The rationale I have is to consider whether a translator looking at the
original message in isolation is going to understand what the %s
means.  If it's possible to tell what it is without having to go read
the source code that leads to the message, then you don't need a
"translator:" comment.  Otherwise you do.

You also need to assume the translator is not stupid, but that seems an
OK assumption.

> SUGGESTION (#1a and #1b)
> 
> ereport(log_replication_commands ? LOG : DEBUG1,
> errmsg(SlotIsLogical(s)
>? "acquired logical replication slot \"%s\""
>: "acquired physical replication slot \"%s\"",
>NameStr(s->data.name)));

The bad thing about this is that gettext() is not going to pick up these
strings into the translation catalog.  You could fix that by adding
gettext_noop() calls around them:

 ereport(log_replication_commands ? LOG : DEBUG1,
 errmsg(SlotIsLogical(s)
? gettext_noop("acquired logical replication slot \"%s\"")
: gettext_noop("acquired physical replication slot \"%s\""),
NameStr(s->data.name)));

but at that point it's not clear that it's really better than putting
the ternary in the outer scope.

You can verify this by doing "make update-po" and then searching for the
messages in postgres.pot.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Cada quien es cada cual y baja las escaleras como quiere" (JMSerrat)




Re: WaitEventSet resource leakage

2023-11-16 Thread Heikki Linnakangas

On 16/11/2023 01:08, Tom Lane wrote:

Heikki Linnakangas  writes:

On 09/03/2023 20:51, Tom Lane wrote:

After further thought that seems like a pretty ad-hoc solution.
We probably can do no better in the back branches, but shouldn't
we start treating WaitEventSets as ResourceOwner-managed resources?
Otherwise, transient WaitEventSets are going to be a permanent
source of headaches.



Let's change it so that it's always allocated in TopMemoryContext, but
pass a ResourceOwner instead:
WaitEventSet *
CreateWaitEventSet(ResourceOwner owner, int nevents)
And use owner == NULL to mean session lifetime.


WFM.  (I didn't study your back-branch patch.)


And here is a patch to implement that on master.

--
Heikki Linnakangas
Neon (https://neon.tech)
From cc88af75011208fc7e9a2bba6b27e437edfab952 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 16 Nov 2023 11:16:58 +0100
Subject: [PATCH v1 1/1] Use ResourceOwner to track WaitEventSets.

A WaitEventSet holds file descriptors or event handles (on Windows).
If FreeWaitEventSet is not called, those fds or handles are leaked.
Use ResourceOwners to track WaitEventSets, to clean those up
automatically on error.

This was a live bug in async Append nodes, if an FDW's
ForeignAsyncRequest function failed. (In back branches, I will apply a
more localized fix for that based on PG_TRY-PG_FINALLY.) The added
test doesn't check for leaking resources, so it passed even before
this commit. But at least it covers the code path.

In passing, fix a misleading comment about what the 'nevents' argument
to WaitEventSetWait means.

Report by Alexander Lakhin, analysis and suggestion for the fix by
Tom Lane. Fixes bug #17828.

Discussion: https://www.postgresql.org/message-id/472235.1678387...@sss.pgh.pa.us
---
 .../postgres_fdw/expected/postgres_fdw.out|  7 +++
 contrib/postgres_fdw/sql/postgres_fdw.sql |  6 ++
 src/backend/executor/nodeAppend.c |  5 +-
 src/backend/libpq/pqcomm.c|  2 +-
 src/backend/postmaster/postmaster.c   |  2 +-
 src/backend/postmaster/syslogger.c|  2 +-
 src/backend/storage/ipc/latch.c   | 63 +--
 src/include/storage/latch.h   |  4 +-
 src/include/utils/resowner.h  |  1 +
 9 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 64bcc66b8d..22cae37a1e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10809,6 +10809,13 @@ SELECT * FROM result_tbl ORDER BY a;
 (2 rows)
 
 DELETE FROM result_tbl;
+-- Test error handling, if accessing one of the foreign partitions errors out
+CREATE FOREIGN TABLE async_p_broken PARTITION OF async_pt FOR VALUES FROM (1) TO (10001)
+  SERVER loopback OPTIONS (table_name 'non_existent_table');
+SELECT * FROM async_pt;
+ERROR:  relation "public.non_existent_table" does not exist
+CONTEXT:  remote SQL command: SELECT a, b, c FROM public.non_existent_table
+DROP FOREIGN TABLE async_p_broken;
 -- Check case where multiple partitions use the same connection
 CREATE TABLE base_tbl3 (a int, b int, c text);
 CREATE FOREIGN TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000)
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 2d14eeadb5..075da4ff86 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3607,6 +3607,12 @@ INSERT INTO result_tbl SELECT a, b, 'AAA' || c FROM async_pt WHERE b === 505;
 SELECT * FROM result_tbl ORDER BY a;
 DELETE FROM result_tbl;
 
+-- Test error handling, if accessing one of the foreign partitions errors out
+CREATE FOREIGN TABLE async_p_broken PARTITION OF async_pt FOR VALUES FROM (1) TO (10001)
+  SERVER loopback OPTIONS (table_name 'non_existent_table');
+SELECT * FROM async_pt;
+DROP FOREIGN TABLE async_p_broken;
+
 -- Check case where multiple partitions use the same connection
 CREATE TABLE base_tbl3 (a int, b int, c text);
 CREATE FOREIGN TABLE async_p3 PARTITION OF async_pt FOR VALUES FROM (3000) TO (4000)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 609df6b9e6..af8e37205f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -1025,7 +1025,8 @@ ExecAppendAsyncEventWait(AppendState *node)
 	/* We should never be called when there are no valid async subplans. */
 	Assert(node->as_nasyncremain > 0);
 
-	node->as_eventset = CreateWaitEventSet(CurrentMemoryContext, nevents);
+	Assert(node->as_eventset == NULL);
+	node->as_eventset = CreateWaitEventSet(CurrentResourceOwner, nevents);
 	AddWaitEventToSet(node->as_eventset, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
 	  NULL, NULL);
 
@@ -1050,7 +1051,7 @@ ExecAppendAsyncEventWait(AppendState *node)
 		return;
 	}
 
-	/* We wait on at most 

Re: trying again to get incremental backup

2023-11-16 Thread Alvaro Herrera
On 2023-Nov-13, Robert Haas wrote:

> On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera  
> wrote:

> > Also, it would be good to provide and use a
> > function to initialize a BlockRefTableKey from the RelFileNode and
> > forknum components, and ensure that any padding bytes are zeroed.
> > Otherwise it's not going to be a great hash key.  On my platform there
> > aren't any (padding bytes), but I think it's unwise to rely on that.
> 
> I'm having trouble understanding the second part of this suggestion.
> Note that in a frontend context, SH_RAW_ALLOCATOR is pg_malloc0, and
> in a backend context, we get the default, which is
> MemoryContextAllocZero. Maybe there's some case this doesn't cover,
> though?

I meant code like this

memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
key.forknum = forknum;
entry = blockreftable_lookup(brtab->hash, key);

where any padding bytes in "key" could have arbitrary values, because
they're not initialized.  So I'd have a (maybe static inline) function
  BlockRefTableKeyInit(&key, rlocator, forknum)
that fills it in for you.
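
A minimal sketch of such an initializer (the member names are assumed
from the snippet above):

  static inline void
  BlockRefTableKeyInit(BlockRefTableKey *key,
                       const RelFileLocator *rlocator,
                       ForkNumber forknum)
  {
      /* zero the whole struct so any padding bytes are defined */
      memset(key, 0, sizeof(BlockRefTableKey));
      key->rlocator = *rlocator;
      key->forknum = forknum;
  }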

Note:
  #define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
AFAICT the new simplehash uses in this patch series are the only ones
that use memcmp() as SH_EQUAL, so we don't necessarily have precedent on
lack of padding bytes initialization in existing uses of simplehash.
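
To illustrate why the padding matters with a memcmp()-based SH_EQUAL
(a self-contained sketch, not code from the patch):

  #include <string.h>

  typedef struct DemoKey
  {
      char    forknum;        /* padding bytes typically follow here */
      int     blockno;
  } DemoKey;

  static int
  padding_mismatch(void)
  {
      DemoKey a, b;

      memset(&a, 0, sizeof(a));   /* padding zeroed */
      a.forknum = 0;
      a.blockno = 42;

      b.forknum = 0;              /* padding left uninitialized */
      b.blockno = 42;

      /* may be nonzero even though all members are equal */
      return memcmp(&a, &b, sizeof(DemoKey));
  }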

> > These forward struct declarations are not buying you anything, I'd
> > remove them:
> 
> I've had problems from time to time when I don't do this. I'll remove
> it here, but I'm not convinced that it's always useless.

Well, certainly there are places where they are necessary.

> > I don't much like the way header files in src/bin/pg_combinebackup files
> > are structured.  Particularly, causing a simplehash to be "instantiated"
> > just because load_manifest.h is included seems poised to cause pain.  I
> > think there should be a file with the basic struct declarations (no
> > simplehash); and then maybe since both pg_basebackup and
> > pg_combinebackup seem to need the same simplehash, create a separate
> > header file containing just that..  But, did you notice that anything
> > that includes reconstruct.h will instantiate the simplehash stuff,
> > because it includes load_manifest.h?  It may be unwise to have the
> > simplehash in a header file.  Maybe just declare it in each .c file that
> > needs it.  The duplicity is not that large.
> 
> I think that I did this correctly.

Oh, I hadn't grokked that we had this SH_SCOPE thing and a separate
SH_DECLARE for it being extern.  OK, please ignore that.

> > Why leave unnamed arguments in function declarations?
> 
> I mean, I've changed it now, but I don't think it's worth getting too
> excited about.

Well, we did get into consistency arguments on this point previously.  I
agree it's not *terribly* important, but on thread
https://www.postgresql.org/message-id/flat/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg%40mail.gmail.com
people got really worked up about this stuff.

> > In GetFileBackupMethod(), which arguments are in and which are out?
> > The comment doesn't say, and it's not obvious why we pass both the file
> > path as well as the individual constituent pieces for it.
> 
> The header comment does document which values are potentially set on
> return. I guess I thought it was clear enough that the stuff not
> documented to be output parameters was input parameters. Most of them
> aren't even pointers, so they have to be input parameters. The only
> exception is 'path', which I have some difficulty thinking that anyone
> is going to imagine to be an input pointer.

An output pointer, you mean :-)  (Should it be const?)

When the return value is BACK_UP_FILE_FULLY, it's not clear what happens
to these output values; we modify some, but why?  Maybe they should be
left alone?  In that case, the "if size == 0" test should move a couple
of lines up, in the brtentry == NULL block.

BTW, you could do the qsort() after deciding to back up the file fully
if more than 90% needs to be replaced.

BTW, in sendDir() why do
 lookup_path = pstrdup(pathbuf + basepathlen + 1);
when you could do
 lookup_path = pstrdup(tarfilename);
?

> > There are two functions named record_manifest_details_for_file() in
> > different programs.
> 
I had trouble figuring out how to name this stuff. I did notice the
awkwardness, but surely nobody can think that two functions with the
same name in different binaries are actually the same function.

Of course not, but when cscope-jumping around, it is weird.

> If we want to inject more underscores here, my vote is to go all the
> way and make it per_wal_range_cb.

+1

> > In walsummarizer.c, HandleWalSummarizerInterrupts is called in
> > summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that.
> > Maybe it should?
> 
> I replaced all the CHECK_FOR_INTERRUPTS() in that file with
> HandleWalSummarizerInterrupts(). Does that seem right?

Looks to be 

Re: Synchronizing slots from primary to standby

2023-11-16 Thread Amit Kapila
On Wed, Nov 15, 2023 at 5:21 PM shveta malik  wrote:
>
> PFA v34.
>

Few comments on v34-0001*
===
1.
+ char buf[100];
+
+ buf[0] = '\0';
+
+ if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
+ strcat(buf, "twophase");
+ if (MySubscription->failoverstate == LOGICALREP_FAILOVER_STATE_PENDING)
+ {
+ if (buf[0] != '\0')
+ strcat(buf, " and ");
+ strcat(buf, "failover");
+ }
+
  ereport(LOG,
- (errmsg("logical replication apply worker for subscription \"%s\"
will restart so that two_phase can be enabled",
- MySubscription->name)));
+ (errmsg("logical replication apply worker for subscription \"%s\"
will restart so that %s can be enabled",
+ MySubscription->name, buf)));

I feel it is better to use separate elogs rather than constructing the
string; that would be easier for translation.
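
For instance (a sketch with assumed helper booleans, not the patch's
actual code):

    bool    twophase_pending =
        (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING);
    bool    failover_pending =
        (MySubscription->failoverstate == LOGICALREP_FAILOVER_STATE_PENDING);

    if (twophase_pending && failover_pending)
        ereport(LOG,
                errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase and failover can be enabled",
                       MySubscription->name));
    else if (twophase_pending)
        ereport(LOG,
                errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
                       MySubscription->name));
    else
        ereport(LOG,
                errmsg("logical replication apply worker for subscription \"%s\" will restart so that failover can be enabled",
                       MySubscription->name));

That way each message reaches the translators as one complete string.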

2.
-
 /* Initialize walsender process before entering the main command loop */

Spurious line removal

3.
@@ -440,17 +448,8 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)

  if (startlsn < moveto)
  {
- SpinLockAcquire(&MyReplicationSlot->mutex);
- MyReplicationSlot->data.restart_lsn = moveto;
- SpinLockRelease(&MyReplicationSlot->mutex);
+ PhysicalConfirmReceivedLocation(moveto);
  retlsn = moveto;
-
- /*
- * Dirty the slot so as it is written out at the next checkpoint. Note
- * that the LSN position advanced may still be lost in the event of a
- * crash, but this makes the data consistent after a clean shutdown.
- */
- ReplicationSlotMarkDirty();
  }

I think this change has been made so that we can wake up logical
walsenders from a central location. In general, this is a good idea,
but it seems calling PhysicalConfirmReceivedLocation() would make an
additional call to ReplicationSlotsComputeRequiredLSN(), which is
already called in the caller of
pg_physical_replication_slot_advance(), so I am not sure such
unification is a good idea here.

4.
+ * Here logical walsender associated with failover logical slot waits
+ * for physical standbys corresponding to physical slots specified in
+ * standby_slot_names GUC.
+ */
+void
+WalSndWaitForStandbyConfirmation(XLogRecPtr wait_for_lsn)

In the above comments, we don't seem to follow the 80-col limit.
Please check all other comments in the patch for similar problems.

5.
+static void
+WalSndRereadConfigAndSlots(List **standby_slots)
+{
+ char*pre_standby_slot_names = pstrdup(standby_slot_names);
+
+ ProcessConfigFile(PGC_SIGHUP);
+
+ if (strcmp(pre_standby_slot_names, standby_slot_names) != 0)
+ {
+ list_free(*standby_slots);
+ *standby_slots = GetStandbySlotList(true);
+ }
+
+ pfree(pre_standby_slot_names);
+}

The function name is misleading w.r.t. the functionality. Can we name
it along the lines of WalSndRereadConfigAndReInitSlotList()? I know it
is a bit longer but couldn't come up with anything better.

6.
+ /*
+ * Fast path to entering the loop in case we already know we have
+ * enough WAL available and all the standby servers has confirmed
+ * receipt of WAL upto RecentFlushPtr.

I think this comment is a bit misleading because it is a fast path to
avoid entering the loop. I think we can keep the existing comment
here: "Fast path to avoid acquiring the spinlock in case we already
know ..."

7.
@@ -3381,7 +3673,9 @@ WalSndWait(uint32 socket_events, long timeout,
uint32 wait_event)
  * And, we use separate shared memory CVs for physical and logical
  * walsenders for selective wake ups, see WalSndWakeup() for more details.
  */
- if (MyWalSnd->kind == REPLICATION_KIND_PHYSICAL)
+ if (wait_for_standby)
+ ConditionVariablePrepareToSleep(&WalSndCtl->wal_confirm_rcv_cv);
+ else if (MyWalSnd->kind == REPLICATION_KIND_PHYSICAL)

The comment above this change needs to be updated for the usage of this new CV.

8.
+WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION "Waiting for physical
standby confirmation in WAL sender process."

I feel the above description is not clear. How about being more
specific with something along the lines of: "Waiting for the WAL to be
received by physical standby in WAL sender process."

9.
+ {"standby_slot_names", PGC_SIGHUP, REPLICATION_PRIMARY,
+ gettext_noop("List of streaming replication standby server slot "
+ "names that logical walsenders waits for."),

I think we can slightly simplify it by saying: "Lists streaming
replication standby server slot names that logical WAL sender
processes wait for."  That would be more consistent with a few other
similar variables.

10.
+ gettext_noop("List of streaming replication standby server slot "
+ "names that logical walsenders waits for."),
+ gettext_noop("Decoded changes are sent out to plugins by logical "
+ "walsenders only after specified replication slots "
+ "confirm receiving WAL."),

Instead of walsenders, let's use WAL sender processes.

11.
@@ -6622,10 +6623,12 @@ describeSubscriptions(const char *pattern, bool verbose)
  appendPQExpBuffer(&buf,
", suborigin AS \"%s\"\n"
", subpasswordrequired AS \"%s\"\n"
-   ", subrunasowner AS \"%s\"\n",
+   ", subrunasowner AS \"%s\"\n"
+   ", subfailoverstate AS \"%s\"\n",

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-16 Thread Alvaro Herrera
I just noticed that 0003 does some changes to
TransactionGroupUpdateXidStatus() that haven't been adequately
explained AFAICS.  How do you know that these changes are safe?

0001 contains one typo in the docs, "cotents".

I'm not a fan of the fact that some CLOG sizing macros moved to clog.h,
leaving others in clog.c.  Maybe add commentary cross-linking both.
Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to
the slru.h-defined limit of 131072 is not that bad, even if it's more
than could possibly be needed for xact_buffers; nobody is going to use
64k buffers, since useful values are below a couple thousand anyhow.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Tom: There seems to be something broken here.
Teodor: I'm in sackcloth and ashes...  Fixed.
   http://postgr.es/m/482d1632.8010...@sigaev.ru




Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500

2023-11-16 Thread Richard Guo
On Thu, Nov 9, 2023 at 6:45 AM Tom Lane  wrote:

> The existing RTE_SUBQUERY stanza has most of what we need for this,
> so I experimented with extending that to also handle RTE_CTE.  It
> seems to work, though I soon found out that it needed tweaking for
> the case where the CTE is INSERT/UPDATE/DELETE RETURNING.


The change looks good to me.  To nitpick, should we modify the comment
of examine_simple_variable to also mention 'CTEs'?

  * This is split out as a subroutine so that we can recurse to deal with
- * Vars referencing subqueries.
+ * Vars referencing subqueries or CTEs.


> Interestingly, this does not change any existing regression test
> results.  I'd supposed there might be at least one place with a
> visible plan change, but nope.  Running a coverage test does show
> that the new code paths are exercised, but I wonder if we ought
> to try to devise a regression test that proves it more directly.


I think we ought to.  Here is one regression test that proves that this
change improves query plans in some cases.

Unpatched:

explain (costs off)
with x as MATERIALIZED (select unique1 from tenk1 b)
select count(*) from tenk1 a where unique1 in (select * from x);
 QUERY PLAN

 Aggregate
   CTE x
 ->  Index Only Scan using tenk1_unique1 on tenk1 b
   ->  Nested Loop
 ->  HashAggregate
   Group Key: x.unique1
   ->  CTE Scan on x
 ->  Index Only Scan using tenk1_unique1 on tenk1 a
   Index Cond: (unique1 = x.unique1)
(9 rows)

Patched:

explain (costs off)
with x as MATERIALIZED (select unique1 from tenk1 b)
select count(*) from tenk1 a where unique1 in (select * from x);
 QUERY PLAN

 Aggregate
   CTE x
 ->  Index Only Scan using tenk1_unique1 on tenk1 b
   ->  Hash Semi Join
 Hash Cond: (a.unique1 = x.unique1)
 ->  Index Only Scan using tenk1_unique1 on tenk1 a
 ->  Hash
   ->  CTE Scan on x
(8 rows)

I think the second plan (patched) makes more sense.  In the first plan
(unpatched), the HashAggregate node does not actually reduce the
number of rows, because it groups by 'unique1'; but the planner does
not know that, because it lacks statistics for Vars referencing the CTE.

Thanks
Richard


Re: remaining sql/json patches

2023-11-16 Thread Amit Langote
On Thu, Nov 16, 2023 at 2:11 AM Andres Freund  wrote:
> On 2023-11-15 22:00:41 +0900, Amit Langote wrote:
> > > This causes a nontrivial increase in the size of the parser (~5% in an
> > > optimized build here), I wonder if we can do better.
> >
> > Hmm, sorry if I sound ignorant but what do you mean by the parser here?
>
> gram.o, in an optimized build.
>
> > I can see that the byte-size of gram.o increases by 1.66% after the
> > above additions  (1.72% with previous versions).
>
> I'm not sure anymore how I measured it, but if you just looked at the total
> file size, that might not show the full gain, because of debug symbols
> etc. You can use the size command to look at just the code and data size.

$ size /tmp/gram.*
   textdata bss dec hex filename
 661808   0   0  661808   a1930 /tmp/gram.o.unpatched
 672800   0   0  672800   a4420 /tmp/gram.o.patched

That's still a 1.66% increase in the code size:

(672800 - 661808) / 661808 = 1.66%

As for gram.c, the increase is a bit larger:

$ ll /tmp/gram.*
-rw-rw-r--. 1 amit amit 2605925 Nov 16 16:18 /tmp/gram.c.unpatched
-rw-rw-r--. 1 amit amit 2709422 Nov 16 16:22 /tmp/gram.c.patched

(2709422 - 2605925) / 2605925 = 3.97%

> > I've also checked
> > using log_parser_stats that there isn't much slowdown in the
> > raw-parsing speed.
>
> What does "isn't much slowdown" mean in numbers?

Sure, the benchmark I used measured the elapsed time (using
log_parser_stats) of parsing a simple select statement (*) averaged
over 1 repetitions of the same query performed with `psql -c`:

Unpatched: 0.61 seconds
Patched: 0.61 seconds

Here's a look at the perf:

Unpatched:
   0.59%  [.] AllocSetAlloc
   0.51%  [.] hash_search_with_hash_value
   0.47%  [.] base_yyparse

Patched:
   0.63%  [.] AllocSetAlloc
   0.52%  [.] hash_search_with_hash_value
   0.44%  [.] base_yyparse

(*) select 1, 2, 3 from foo where a = 1

Is there a more relevant benchmark I could use?

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-16 Thread torikoshia

On 2023-11-16 16:48, Michael Paquier wrote:

On Wed, Nov 15, 2023 at 04:25:14PM +0900, Michael Paquier wrote:

Other than that, it looks OK.


Tweaked the queries of this one slightly, and applied.  So I think
that we are now good for this thread.  Thanks, all!


Thanks for the modification and for applying it!




--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation