Re: type cache cleanup improvements

2024-04-03 Thread Andrei Lepikhov

On 3/15/24 17:57, Teodor Sigaev wrote:

> Okay, I've applied this piece for now.  Not sure I'll have much room
> to look at the rest.


Thank you very much!
I have spent some time reviewing this feature. I think we can discuss 
and apply it step by step, so the 0001-* patch is what is under review 
at this moment.
The feature addresses the issue of TypCache being bloated by intensive 
usage of non-standard types and domains. This bloat adds significant 
overhead during relcache invalidation, which thoroughly scans this hash 
table.
IMO, this feature will be handy soon, as we already see some patches 
where TypCache is intensively used for storing composite types—for 
example, look into solutions proposed in [1].
One of my main concerns with this feature is the possibility of stale 
entries, which could be mistakenly used by relations with the same OID 
in the future. This seems particularly possible in cases with multiple 
temporary tables. The author has attempted to address this by replacing 
the typrelid and type_id fields in the mapRelType on each call of 
lookup_type_cache. However, I believe we could further improve this by 
removing the entry from mapRelType on invalidation, thus avoiding this 
potential issue.
While reviewing the patch, I made some minor changes (see attachment) 
that you're free to adopt or reject. However, it's crucial that the 
patch includes a detailed explanation, not just a single sentence, to 
ensure everyone understands the changes.
Upon closer inspection, I noticed that the current implementation only 
invalidates the cache entry. While this is acceptable for standard 
types, it may not be sufficient when numerous custom types are 
maintained (as in the example in the initial letter) or in cases where 
whole-row vars are heavily used. In such scenarios, removing the entry 
and reducing the hash table's size might be more efficient.
In toto, the 0001-* patch looks good, and I would be glad to see it in 
the core.


[1] 
https://www.postgresql.org/message-id/flat/CAKcux6ktu-8tefLWtQuuZBYFaZA83vUzuRd7c1YHC-yEWyYFpg%40mail.gmail.com


--
regards,
Andrei Lepikhov
Postgres Professional
diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c
index e3c32c7848..ed321603d5 100644
--- a/src/backend/utils/cache/typcache.c
+++ b/src/backend/utils/cache/typcache.c
@@ -74,16 +74,17 @@
 #include "utils/typcache.h"
 
 
-/* The main type cache hashtable searched by lookup_type_cache */
-static HTAB *TypeCacheHash = NULL;
-
 /* The map from relation's oid to its type oid */
-typedef struct mapRelTypeEntry
+typedef struct RelTypeMapEntry
 {
 	Oid	typrelid;
 	Oid type_id;
-} mapRelTypeEntry;
+} RelTypeMapEntry;
+
+/* The main type cache hashtable searched by lookup_type_cache */
+static HTAB *TypeCacheHash = NULL;
 
+/* Utility hash table to speed up processing of relcache invalidation events. */
 static HTAB *mapRelType = NULL;
 
 /* List of type cache entries for domain types */
@@ -368,7 +369,7 @@ lookup_type_cache(Oid type_id, int flags)
 									&ctl, HASH_ELEM | HASH_BLOBS);
 
 		ctl.keysize = sizeof(Oid);
-		ctl.entrysize = sizeof(mapRelTypeEntry);
+		ctl.entrysize = sizeof(RelTypeMapEntry);
 		mapRelType = hash_create("Map reloid to typeoid", 64,
 								 &ctl, HASH_ELEM | HASH_BLOBS);
 
@@ -492,11 +493,11 @@ lookup_type_cache(Oid type_id, int flags)
 	 */
 	if (OidIsValid(typentry->typrelid) && typentry->typtype == TYPTYPE_COMPOSITE)
 	{
-		mapRelTypeEntry *relentry;
+		RelTypeMapEntry *relentry;
 
-		relentry = (mapRelTypeEntry*) hash_search(mapRelType,
-												  &typentry->typrelid,
-												  HASH_ENTER, NULL);
+		relentry = (RelTypeMapEntry *) hash_search(mapRelType,
+												   &typentry->typrelid,
+												   HASH_ENTER, NULL);
 
 		relentry->typrelid = typentry->typrelid;
 		relentry->type_id = typentry->type_id;
@@ -2297,7 +2298,7 @@ SharedRecordTypmodRegistryAttach(SharedRecordTypmodRegistry *registry)
 }
 
 static void
-invalidateCompositeTypeCacheEntry(TypeCacheEntry *typentry)
+invalidateTypeCacheEntry(TypeCacheEntry *typentry)
 {
 	/* Delete tupdesc if we have it */
 	if (typentry->tupDesc != NULL)
@@ -2348,11 +2349,11 @@ TypeCacheRelCallback(Datum arg, Oid relid)
 
 	if (OidIsValid(relid))
 	{
-		mapRelTypeEntry *relentry;
+		RelTypeMapEntry *relentry;
 
-		relentry = (mapRelTypeEntry *) hash_search(mapRelType,
-												   &relid,
-												   HASH_FIND, NULL);
+		relentry = (RelTypeMapEntry *) hash_search(mapRelType,
+												   &relid,
+												   HASH_FIND, NULL);
 
 		if (relentry != NULL)
 		{
@@ -2365,7 +2366,7 @@ TypeCacheRelCallback(Datum arg, Oid relid)
 Assert(typentry->typtype == TYPTYPE_COMPOSITE);
 Assert(relid == typentry->typrelid);
 
-invalidateCompositeTypeCacheEntry(typentry);
+invalidateTypeCacheEntry(typentry);
 			}
 		}
 
@@ -2397,7 +2398,7 @@ TypeCacheRelCallback(Datum arg, Oid relid)
 		{
 			if (typentry->typtype == TYPTYPE_COMPOSITE)
 			{
-invalidateCompositeTypeCacheEntry(typentry);
+invalidateTypeCacheEntry(typentry);
 			}
 			else if 

Re: remaining sql/json patches

2024-04-03 Thread jian he
On Wed, Apr 3, 2024 at 11:30 AM jian he  wrote:
>
> On Tue, Apr 2, 2024 at 9:57 PM Amit Langote  wrote:
> >
> > Please let me know if you have further comments on 0001.  I'd like to
> > get that in before spending more energy on 0002.
> >

--- a/src/backend/parser/parse_target.c
+++ b/src/backend/parser/parse_target.c
@@ -2019,6 +2019,9 @@ FigureColnameInternal(Node *node, char **name)
  case JSON_VALUE_OP:
  *name = "json_value";
  return 2;
+ case JSON_TABLE_OP:
+ *name = "json_table";
+ return 2;
  default:
  elog(ERROR, "unrecognized JsonExpr op: %d",
  (int) ((JsonFuncExpr *) node)->op);

Is the "case JSON_TABLE_OP" part needed, given that json_table output
must provide a column name and type?

I did some minor refactoring of transformJsonTableColumns to make the
comments align with the function's intention.
In v48-0001, in transformJsonTableColumns we can `Assert(rawc->name);`,
since a non-nested JsonTableColumn must specify a column name.
In v48-0002, we can change that to `if (rawc->coltype != JTC_NESTED)
Assert(rawc->name);`



SELECT * FROM JSON_TABLE(jsonb '1', '$' COLUMNS (a int PATH '$.a' )
ERROR ON ERROR) jt;
ERROR:  no SQL/JSON item

I thought it should just return NULL.
In this case, the top-level (not column-level) ERROR ON ERROR should
not interfere with "COLUMNS (a int PATH '$.a' )".

+-- Other miscellanous checks
"miscellanous" should be "miscellaneous".


overall the coverage is pretty high.
the current test output looks fine.


v48-0001-minor-refactor-transformJsonTableColumns.only_for_v48_0001
Description: Binary data


Re: RFC: Additional Directory for Extensions

2024-04-03 Thread Daniel Gustafsson
> On 3 Apr 2024, at 09:13, Alvaro Herrera  wrote:
> 
> On 2024-Apr-02, David E. Wheeler wrote:
> 
>> That quotation comes from this Debian patch[2] maintained by Christoph
>> Berg. I’d like to formally propose integrating this patch into the
>> core. And not only because it’s overhead for package maintainers like
>> Christoph, but because a number of use cases have emerged since we
>> originally discussed something like this back in 2013[3]:
> 
> I support the idea of there being a second location from where to load
> shared libraries

Agreed, the case made upthread that installing an extension breaks the app
signing seems like a compelling reason to do this.

The implementation of this needs to make sure the directory is properly set up,
however, to avoid problems similar to those that CVE-2019-10211 showed.

--
Daniel Gustafsson





Re: Add missing error codes to PANIC/FATAL error reports in xlog.c and relcache.c

2024-04-03 Thread Daniel Gustafsson
> On 6 Mar 2024, at 09:59, Daniel Gustafsson  wrote:

> Nothing sticks out from reading through these patches, they seem quite
> ready to me.

Took another look at this today and committed it. Thanks!

--
Daniel Gustafsson





Re: Synchronizing slots from primary to standby

2024-04-03 Thread Amit Kapila
On Wed, Apr 3, 2024 at 11:13 AM Amit Kapila  wrote:
>
> On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy
>  wrote:
>
> > I quickly looked at v8, and have a nit, rest all looks good.
> >
> > +if (DecodingContextReady(ctx) && found_consistent_snapshot)
> > +*found_consistent_snapshot = true;
> >
> > Can the found_consistent_snapshot be checked first to help avoid the
> > function call DecodingContextReady() for pg_replication_slot_advance
> > callers?
> >
>
> Okay, changed. Additionally, I have updated the comments and commit
> message. I'll push this patch after some more testing.
>

Pushed!

-- 
With Regards,
Amit Kapila.




Re: Support prepared statement invalidation when result types change

2024-04-03 Thread Jelte Fennema-Nio
On Sun, 7 Jan 2024 at 07:55, vignesh C  wrote:
> One of the test has aborted in CFBot at [1] with:

Rebased the patchset and removed patch 0003 since it was causing the
CI issue reported by vignesh and it seems much less useful and more
controversial to me anyway (due to the extra required roundtrip).


On Sun, 7 Jan 2024 at 09:17, Shay Rojansky  wrote:
> Just to point out, FWIW, that the .NET Npgsql driver does indeed cache 
> RowDescriptions... The whole point of preparation is to optimize things as 
> much as possible for repeated execution of the query; I get that the value 
> there is much lower than e.g. doing another network roundtrip, but that's 
> still extra work that's better off being cut if it can be.

Hmm, interesting. I totally agree that it's always good to do less
when possible. The problem is that in the face of server side prepared
statement invalidations due to DDL changes to the table or search path
changes, the row types might change. Or the server needs to constantly
throw an error, like it does now, but that seems worse imho.

I'm wondering though if we can create a middle ground, where a client
can still cache the RowDescription client-side when no DDL or
search_path changes are happening, but still tell the client about a
new RowDescription when they do happen.

The only way of doing that I can think of is changing the Postgres
protocol in a way similar to this: Allow Execute to return a
RowDescription too, but have the server only do so when the previously
received RowDescription for this prepared statement is now invalid.

This would definitely require some additional tracking at PgBouncer to
make it work, i.e. per client and per server it should now keep track
of the last RowDescription for each prepared statement. But that's
definitely something we could do.

This would make this patch much more involved though, as now it would
suddenly involve an actual protocol change, and that basically depends
on this patch moving forward[1].

[1]: 
https://www.postgresql.org/message-id/flat/cageczqtg2hcmb5gau53uuwcdc7gcnjfll6mnw0wnhwhgq9u...@mail.gmail.com


v6-0002-Completely-remove-fixed_result-from-CachedPlanSou.patch
Description: Binary data


v6-0001-Support-prepared-statement-invalidation-when-resu.patch
Description: Binary data


Re: Add missing error codes to PANIC/FATAL error reports in xlog.c and relcache.c

2024-04-03 Thread Nazir Bilal Yavuz
Hi,

On Wed, 3 Apr 2024 at 12:11, Daniel Gustafsson  wrote:
>
> > On 6 Mar 2024, at 09:59, Daniel Gustafsson  wrote:
>
> > Nothing sticks out from reading through these patches, they seem quite
> > ready to me.
>
> Took another look at this today and committed it. Thanks!

Thanks for the commit!

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




[PATCH] Modify pg_ctl to detect presence of geek user

2024-04-03 Thread Panda Developpeur
Dear PostgreSQL Hackers,

I am submitting a patch to modify pg_ctl to detect the presence of a geek
user on the system and adapt its behavior accordingly. This patch
introduces the following changes:

   1. *Detection of geek user*: The modified pg_ctl now checks the users
      created on the computer.

   2. *No documentation or tests*: Please note that I have not included new
      documentation or tests in this patch submission. However, I am open to
      adding them based on the community's feedback.

   3. *Performance impact*: The performance impact of these changes is
      minimal, with an expected delay of 500ms in specific scenarios only.


Please review the patch and provide your feedback. I am open to making any
necessary improvements based on the community's suggestions.

Thank you for considering my contribution.

Best regards,


0001-Geek-detection.patch
Description: Binary data


Re: psql not responding to SIGINT upon db reconnection

2024-04-03 Thread Jelte Fennema-Nio
On Tue, 2 Apr 2024 at 16:33, Robert Haas  wrote:
> Committed it, I did. My thanks for working on this issue, I extend.

Looking at the committed version of this patch, the pg_unreachable
calls seemed weird to me. 1 is actually incorrect, thus possibly
resulting in undefined behaviour. And for the other call an imho
better fix would be to remove the now 21 year unused enum variant,
instead of introducing its only reference in the whole codebase.

Attached are two trivial patches, feel free to remove both of the
pg_unreachable calls.
From 7f4279bb939e2d2707fdd0f471893d734959ab91 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Wed, 3 Apr 2024 15:21:52 +0200
Subject: [PATCH v11 2/2] Remove PGRES_POLLING_ACTIVE

This enum variant has been unused for at least 21 years
44aba280207740d0956160c0288e61f28f024a71. It was left for backwards
compatibility for "awhile". I think that time has come.
---
 src/bin/psql/command.c  | 2 --
 src/interfaces/libpq/libpq-fe.h | 2 --
 2 files changed, 4 deletions(-)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index dc3cc7c10b7..9a163cf22cd 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -3811,8 +3811,6 @@ wait_until_connected(PGconn *conn)
 			case PGRES_POLLING_WRITING:
 forRead = false;
 continue;
-			case PGRES_POLLING_ACTIVE:
-pg_unreachable();
 		}
 	}
 }
diff --git a/src/interfaces/libpq/libpq-fe.h b/src/interfaces/libpq/libpq-fe.h
index 8d3c5c6f662..2459b0cf5e1 100644
--- a/src/interfaces/libpq/libpq-fe.h
+++ b/src/interfaces/libpq/libpq-fe.h
@@ -91,8 +91,6 @@ typedef enum
 	PGRES_POLLING_READING,		/* These two indicate that one may	  */
 	PGRES_POLLING_WRITING,		/* use select before polling again.   */
 	PGRES_POLLING_OK,
-	PGRES_POLLING_ACTIVE		/* unused; keep for awhile for backwards
- * compatibility */
 } PostgresPollingStatusType;
 
 typedef enum
-- 
2.34.1

From 29100578b5f0b4cec68ee1b8c572c4f280335da3 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Wed, 3 Apr 2024 15:19:04 +0200
Subject: [PATCH v11 1/2] Fix actually reachable pg_unreachable call

In cafe1056558fe07cdc52b95205588fcd80870362 a call to pg_unreachable was
introduced that was actually reachable, because there were break
statements in the loop. This addresses that by both changing the break
statements to return and removing the call to pg_unreachable, which
seemed of dubious necessity anyway.
---
 src/bin/psql/command.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index c005624e9c3..dc3cc7c10b7 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -3771,7 +3771,7 @@ wait_until_connected(PGconn *conn)
 		 * user has requested a cancellation.
 		 */
 		if (cancel_pressed)
-			break;
+			return;
 
 		/*
 		 * Do not assume that the socket remains the same across
@@ -3779,7 +3779,7 @@ wait_until_connected(PGconn *conn)
 		 */
 		sock = PQsocket(conn);
 		if (sock == -1)
-			break;
+			return;
 
 		/*
 		 * If the user sends SIGINT between the cancel_pressed check, and
@@ -3815,8 +3815,6 @@ wait_until_connected(PGconn *conn)
 pg_unreachable();
 		}
 	}
-
-	pg_unreachable();
 }
 
 void

base-commit: 936e3fa3787a51397280c1081587586e83c20399
-- 
2.34.1



Re: Combine Prune and Freeze records emitted by vacuum

2024-04-03 Thread Melanie Plageman
On Wed, Apr 3, 2024 at 8:39 AM Heikki Linnakangas  wrote:
>
> On 02/04/2024 16:11, Heikki Linnakangas wrote:
> > On 01/04/2024 20:22, Melanie Plageman wrote:
> >> Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
> >> you didn't modify them much/at all, but I noticed some things in my code
> >> that could be better.
> >
> > Ok, here's what I have now. I made a lot of small comment changes here
> > and there, and some minor local refactorings, but nothing major. I lost
> > track of all the individual changes I'm afraid, so I'm afraid you'll
> > have to just diff against the previous version if you want to see what's
> > changed. I hope I didn't break anything.
> >
> > I'm pretty happy with this now. I will skim through it one more time
> > later today or tomorrow, and commit. Please review once more if you have
> > a chance.
> >
> >> This probably doesn't belong here. I noticed spgdoinsert.c had a static
> >> function for sorting OffsetNumbers, but I didn't see anything general
> >> purpose anywhere else.
> >
> > I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would
> > be nice to have just one copy of this in some common place, but I also
> > wasn't sure where to put it.
>
> One more version, with two small fixes:
>
> 1. I fumbled the offsetnumber-cmp function at the last minute so that it
> didn't compile. Fixed that.

I noticed you didn't make the comment updates I suggested in my
version 13 here [1]. A few of them are outdated references to
heap_page_prune() and one to a now deleted variable name
(all_visible_except_removable).

I applied them to your v13 and attached the diff.

> Off-list, Melanie reported that there is a small regression with the
> benchmark script she posted yesterday, after all, but I'm not able to
> reproduce that.

Actually, I think it was noise.

- Melanie

[1] 
https://www.postgresql.org/message-id/CAAKRu_aPqZkThyfr0USaHp-3cN_ruEdAHBKtNQJqXDTjWUz0rw%40mail.gmail.com
From 882e937c122f5e83bc9ba643443c1a27c807d82e Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 3 Apr 2024 10:12:44 -0400
Subject: [PATCH 3/3] comment updates

---
 src/backend/access/heap/pruneheap.c | 53 +
 src/backend/storage/ipc/procarray.c |  6 ++--
 src/include/access/heapam.h |  2 +-
 3 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8ed44ba93d..4a6a4cee4d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -328,7 +328,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * presult contains output parameters needed by callers, such as the number of
  * tuples removed and the offsets of dead items on the page after pruning.
- * heap_page_prune_and_freeze() is responsible for initializing it.
+ * heap_page_prune_and_freeze() is responsible for initializing it. Required by
+ * all callers.
  *
  * reason indicates why the pruning is performed.  It is included in the WAL
  * record for debugging and analysis purposes, but otherwise has no effect.
@@ -393,6 +394,7 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer,
 	prstate.pagefrz.freeze_required = false;
 	if (prstate.freeze)
 	{
+		Assert(new_relfrozen_xid && new_relmin_mxid);
 		prstate.pagefrz.FreezePageRelfrozenXid = *new_relfrozen_xid;
 		prstate.pagefrz.NoFreezePageRelfrozenXid = *new_relfrozen_xid;
 		prstate.pagefrz.FreezePageRelminMxid = *new_relmin_mxid;
@@ -415,19 +417,19 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer,
 	prstate.deadoffsets = presult->deadoffsets;
 
 	/*
-	 * Caller may update the VM after we're done.  We keep track of whether
-	 * the page will be all_visible and all_frozen, once we're done with the
-	 * pruning and freezing, to help the caller to do that.
+	 * Caller may update the VM after we're done.  We can keep track of
+	 * whether the page will be all-visible and all-frozen after pruning and
+	 * freezing to help the caller to do that.
 	 *
 	 * Currently, only VACUUM sets the VM bits.  To save the effort, only do
-	 * only the bookkeeping if the caller needs it.  Currently, that's tied to
-	 * HEAP_PAGE_PRUNE_FREEZE, but it could be a separate flag, if you wanted
-	 * to update the VM bits without also freezing, or freezing without
+	 * the bookkeeping if the caller needs it.  Currently, that's tied to
+	 * HEAP_PAGE_PRUNE_FREEZE, but it could be a separate flag if you wanted
+	 * to update the VM bits without also freezing or freeze without also
 	 * setting the VM bits.
 	 *
 	 * In addition to telling the caller whether it can set the VM bit, we
 	 * also use 'all_visible' and 'all_frozen' for our own decision-making. If
-	 * the whole page will become frozen, we consider opportunistically
+	 * the whole page would become frozen, we consider opportunistically
 	 * freezing tuples.  We will not be able to freeze the whole page if there
 	 * are tuples present that are 

Re: psql not responding to SIGINT upon db reconnection

2024-04-03 Thread Jelte Fennema-Nio
On Wed, 3 Apr 2024 at 16:55, Tristan Partin  wrote:
> Removing from the switch statement causes a warning:
>
> > [1/2] Compiling C object src/bin/psql/psql.p/command.c.o
> > ../src/bin/psql/command.c: In function ‘wait_until_connected’:
> > ../src/bin/psql/command.c:3803:17: warning: enumeration value 
> > ‘PGRES_POLLING_ACTIVE’ not handled in switch [-Wswitch]
> >  3803 | switch (PQconnectPoll(conn))
> >   | ^~


Of course... fixed now
From 29100578b5f0b4cec68ee1b8c572c4f280335da3 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Wed, 3 Apr 2024 15:19:04 +0200
Subject: [PATCH v13 1/2] Fix actually reachable pg_unreachable call

In cafe1056558fe07cdc52b95205588fcd80870362 a call to pg_unreachable was
introduced that was actually reachable, because there were break
statements in the loop. This addresses that by both changing the break
statements to return and removing the call to pg_unreachable, which
seemed of dubious necessity anyway.
---
 src/bin/psql/command.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index c005624e9c3..dc3cc7c10b7 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -3771,7 +3771,7 @@ wait_until_connected(PGconn *conn)
 		 * user has requested a cancellation.
 		 */
 		if (cancel_pressed)
-			break;
+			return;
 
 		/*
 		 * Do not assume that the socket remains the same across
@@ -3779,7 +3779,7 @@ wait_until_connected(PGconn *conn)
 		 */
 		sock = PQsocket(conn);
 		if (sock == -1)
-			break;
+			return;
 
 		/*
 		 * If the user sends SIGINT between the cancel_pressed check, and
@@ -3815,8 +3815,6 @@ wait_until_connected(PGconn *conn)
 pg_unreachable();
 		}
 	}
-
-	pg_unreachable();
 }
 
 void

base-commit: 936e3fa3787a51397280c1081587586e83c20399
-- 
2.34.1

From 76cef25162b44adc20172afee47836ca765d3b5c Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Wed, 3 Apr 2024 15:21:52 +0200
Subject: [PATCH v13 2/2] Remove PGRES_POLLING_ACTIVE case

This enum variant has been unused for at least 21 years
44aba280207740d0956160c0288e61f28f024a71. It's been there only for
backwards compatibility of the ABI, so no need to check for it.

Also update the comment on PGRES_POLLING_ACTIVE, since we're stuck with
it forever due to ABI compatibility guarantees.
---
 src/bin/psql/command.c  | 7 ++-
 src/interfaces/libpq/libpq-fe.h | 3 +--
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index dc3cc7c10b7..5b31d08bf70 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -3802,17 +3802,14 @@ wait_until_connected(PGconn *conn)
 
 		switch (PQconnectPoll(conn))
 		{
-			case PGRES_POLLING_OK:
-			case PGRES_POLLING_FAILED:
-return;
 			case PGRES_POLLING_READING:
 forRead = true;
 continue;
 			case PGRES_POLLING_WRITING:
 forRead = false;
 continue;
-			case PGRES_POLLING_ACTIVE:
-pg_unreachable();
+			default:
+return;
 		}
 	}
 }
diff --git a/src/interfaces/libpq/libpq-fe.h b/src/interfaces/libpq/libpq-fe.h
index 8d3c5c6f662..c184e853889 100644
--- a/src/interfaces/libpq/libpq-fe.h
+++ b/src/interfaces/libpq/libpq-fe.h
@@ -91,8 +91,7 @@ typedef enum
 	PGRES_POLLING_READING,		/* These two indicate that one may	  */
 	PGRES_POLLING_WRITING,		/* use select before polling again.   */
 	PGRES_POLLING_OK,
-	PGRES_POLLING_ACTIVE		/* unused; keep for awhile for backwards
- * compatibility */
+	PGRES_POLLING_ACTIVE		/* unused; keep for backwards compatibility */
 } PostgresPollingStatusType;
 
 typedef enum
-- 
2.34.1



Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-04-03 Thread Melih Mutlu
Hi,

Michael Paquier , 14 Şub 2024 Çar, 10:23 tarihinde
şunu yazdı:

> On Fri, Jan 19, 2024 at 05:41:45PM +0900, torikoshia wrote:
> > We already have additional description below the table which explains
> each
> > column of the system view. For example pg_locks:
> > https://www.postgresql.org/docs/devel/view-pg-locks.html
>
> I was reading the patch, and using int[] as a representation of the
> path of context IDs up to the top-most parent looks a bit strange to
> me, with the relationship between each parent -> child being
> preserved, visibly, based on the order of the elements in this array
> made of temporary IDs compiled on-the-fly during the function
> execution.  Am I the only one finding that a bit strange?  Could it be
> better to use a different data type for this path and perhaps switch
> to the names of the contexts involved?
>

Do you find having the path column strange all together? Or only using
temporary IDs to generate that column? The reason why I avoid using context
names is because there can be multiple contexts with the same name. This
makes it difficult to figure out which context, among those with that
particular name, is actually included in the path. I couldn't find any
other information that is unique to each context.

Thanks,
-- 
Melih Mutlu
Microsoft


Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Jacob Champion
On Tue, Apr 2, 2024 at 11:55 AM Daniel Gustafsson  wrote:
> The attached removes 1.0.2 support (meson build parts untested yet) with a few
> small touch ups of related documentation.  I haven't yet done the research on
> where that leaves LibreSSL since we don't really define anywhere what we
> support (so for we've gotten by assuming it's kind of sort 1.0.2 for the parts
> we care about which is skating on fairly thin ice).

As far as I can tell, no versions of LibreSSL so far provide
X509_get_signature_info(), so this patch is probably a bit too
aggressive.

--Jacob




Re: psql not responding to SIGINT upon db reconnection

2024-04-03 Thread Tristan Partin

On Wed Apr 3, 2024 at 8:32 AM CDT, Jelte Fennema-Nio wrote:

> On Tue, 2 Apr 2024 at 16:33, Robert Haas  wrote:
> > Committed it, I did. My thanks for working on this issue, I extend.
>
> Looking at the committed version of this patch, the pg_unreachable
> calls seemed weird to me. 1 is actually incorrect, thus possibly
> resulting in undefined behaviour. And for the other call an imho
> better fix would be to remove the now 21 year unused enum variant,
> instead of introducing its only reference in the whole codebase.
>
> Attached are two trivial patches, feel free to remove both of the
> pg_unreachable calls.


Patches look good. Sorry about causing you to do some work.

--
Tristan Partin
Neon (https://neon.tech)




Re: psql not responding to SIGINT upon db reconnection

2024-04-03 Thread Tom Lane
Jelte Fennema-Nio  writes:
> Looking at the committed version of this patch, the pg_unreachable
> calls seemed weird to me. 1 is actually incorrect, thus possibly
> resulting in undefined behaviour. And for the other call an imho
> better fix would be to remove the now 21 year unused enum variant,
> instead of introducing its only reference in the whole codebase.

If we do the latter, we will almost certainly get pushback from
distros who check for library ABI breaks.  I fear the comment
suggesting that we could remove it someday is too optimistic.

regards, tom lane




Re: psql not responding to SIGINT upon db reconnection

2024-04-03 Thread Tristan Partin

On Wed Apr 3, 2024 at 10:05 AM CDT, Jelte Fennema-Nio wrote:

> On Wed, 3 Apr 2024 at 16:55, Tristan Partin  wrote:
> > Removing from the switch statement causes a warning:
> >
> > > [1/2] Compiling C object src/bin/psql/psql.p/command.c.o
> > > ../src/bin/psql/command.c: In function ‘wait_until_connected’:
> > > ../src/bin/psql/command.c:3803:17: warning: enumeration value
> > > ‘PGRES_POLLING_ACTIVE’ not handled in switch [-Wswitch]
> > >  3803 | switch (PQconnectPoll(conn))
> > >   | ^~
>
> Of course... fixed now


I think patch 2 makes it worse. The value in -Wswitch is that when new 
enum variants are added, the developer knows the locations to update. 
Adding a default case makes -Wswitch pointless.


Patch 1 is still good. The comment change in patch 2 is good too!

--
Tristan Partin
Neon (https://neon.tech)




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Bharath Rupireddy
On Wed, Apr 3, 2024 at 6:46 PM Bertrand Drouvot
 wrote:
>
> Just one comment on v32-0001:
>
> +# Synced slot on the standby must get its own inactive_since.
> +is( $standby1->safe_psql(
> +   'postgres',
> +   "SELECT '$inactive_since_on_primary'::timestamptz <= 
> '$inactive_since_on_standby'::timestamptz AND
> +   '$inactive_since_on_standby'::timestamptz <= 
> '$slot_sync_time'::timestamptz;"
> +   ),
> +   "t",
> +   'synchronized slot has got its own inactive_since');
> +
>
> By using <= we are not testing that it must get its own inactive_since (as we
> allow them to be equal in the test). I think we should just add some usleep()
> where appropriate and deny equality during the tests on inactive_since.

Thanks. It looks like we can ignore the equality in all of the
inactive_since comparisons. IIUC, all the TAP tests do run with
primary and standbys on the single BF animals. And, it looks like
assigning the inactive_since timestamps to perl variables is giving
the microseconds precision level
(./tmp_check/log/regress_log_040_standby_failover_slots_sync:inactive_since
2024-04-03 14:30:09.691648+00). FWIW, we already have some TAP and SQL
tests relying on stats_reset timestamps without equality. So, I've
left the equality for the inactive_since tests.

> Except for the above, v32-0001 LGTM.

Thanks. Please see the attached v33-0001 patch after removing equality
on inactive_since TAP tests.

On Wed, Apr 3, 2024 at 1:47 PM Bertrand Drouvot
 wrote:
>
> Some comments regarding v31-0002:
>
> === testing the behavior
>
> T1 ===
>
> > - synced slots don't get invalidated due to inactive timeout because
> > such slots not considered active at all as they don't perform logical
> > decoding (of course, they will perform in fast_forward mode to fix the
> > other data loss issue, but they don't generate changes for them to be
> > called as *active* slots)
>
> It behaves as described. OTOH non synced logical slots on the standby and
> physical slots on the standby are invalidated which is what is expected.

Right.

> T2 ===
>
> In case the slot is invalidated on the primary,
>
> primary:
>
> postgres=# select slot_name, inactive_since, invalidation_reason from 
> pg_replication_slots where slot_name = 's1';
>  slot_name |inactive_since | invalidation_reason
> ---+---+-
>  s1| 2024-04-03 06:56:28.075637+00 | inactive_timeout
>
> then on the standby we get:
>
> standby:
>
> postgres=# select slot_name, inactive_since, invalidation_reason from 
> pg_replication_slots where slot_name = 's1';
>  slot_name |inactive_since| invalidation_reason
> ---+--+-
>  s1| 2024-04-03 07:06:43.37486+00 | inactive_timeout
>
> shouldn't the slot be dropped/recreated instead of updating inactive_since?

The sync slots that are invalidated on the primary aren't dropped and
recreated on the standby. There's no point in doing so because
invalidated slots on the primary can't be made useful. However, I
found that the synced slot is acquired and released unnecessarily
after the invalidation_reason is synced from the primary. I added a
skip check in synchronize_one_slot to skip acquiring and releasing the
slot if it's locally found inactive. With this, inactive_since won't
get updated for invalidated sync slots on the standby as we don't
acquire and release the slot.

> === code
>
> CR1 ===
>
> +Invalidates replication slots that are inactive for longer the
> +specified amount of time
>
> s/for longer the/for longer than/?

Fixed.

> CR2 ===
>
> +true) as such synced slots don't actually perform
> +logical decoding.
>
> We're switching to fast-forward logical decoding due to [1], so I'm not sure
> that's 100% accurate here. I'm not sure we need to specify a reason.

Fixed.

> CR3 ===
>
> + errdetail("This slot has been invalidated because it was inactive for more 
> than the time specified by replication_slot_inactive_timeout parameter.")));
>
> I think we can remove "parameter" (see for example the error message in
> validate_remote_info()) and reduce it a bit, something like?
>
> "This slot has been invalidated because it was inactive for more than 
> replication_slot_inactive_timeout"?

Done.

> CR4 ===
>
> + appendStringInfoString(_detail, _("The slot has been inactive for more 
> than the time specified by replication_slot_inactive_timeout parameter."));
>
> Same.

Done. Changed it to "The slot has been inactive for more than
replication_slot_inactive_timeout."

> CR5 ===
>
> +   /*
> +* This function isn't expected to be called for inactive timeout 
> based
> +* invalidation. A separate function 
> InvalidateInactiveReplicationSlot is
> +* to be used for that.
>
> Do you think it's worth to explain why?

Hm, I just wanted to point out the actual function here. I modified 

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-04-03 Thread Alexander Lakhin

Hello Alvaro,

27.02.2024 20:33, Alvaro Herrera wrote:

Here's the complete set, with these two names using the singular.



I've managed to trigger an assert added by 53c2a97a9.
Please try the following script against a server compiled with
-DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the
define, it just simplifies reproducing...):
# initdb & start ...

createdb test
echo "
SELECT pg_current_xact_id() AS tx
\gset

SELECT format('CREATE TABLE t%s(i int)', g)
  FROM generate_series(1, 1022 - :tx) g
\gexec

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT pg_current_xact_id();
SELECT pg_sleep(5);
" | psql test &

echo "
SELECT pg_sleep(1);
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT 1 INTO a;
COMMIT;

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT 2 INTO b;
" | psql test

It fails for me with the following stack trace:
TRAP: failed Assert("LWLockHeldByMeInMode(SimpleLruGetBankLock(ctl, pageno), LW_EXCLUSIVE)"), File: "slru.c", Line: 366, 
PID: 21711

ExceptionalCondition at assert.c:52:13
SimpleLruZeroPage at slru.c:369:11
SerialAdd at predicate.c:921:20
SummarizeOldestCommittedSxact at predicate.c:1521:2
GetSerializableTransactionSnapshotInt at predicate.c:1787:16
GetSerializableTransactionSnapshot at predicate.c:1691:1
GetTransactionSnapshot at snapmgr.c:264:21
exec_simple_query at postgres.c:1162:4
...

Best regards,
Alexander




Re: LogwrtResult contended spinlock

2024-04-03 Thread Alvaro Herrera
On 2024-Apr-03, Alvaro Herrera wrote:

> So what I do in the attached 0001 is stop using the XLogwrtResult struct
> in XLogCtl and replace it with separate Write and Flush values, and add
> the macro XLogUpdateLocalLogwrtResult() that copies the values of Write
> and Flush from the shared XLogCtl to the local variable given as macro
> argument.  (I also added our idiomatic do {} while(0) to the macro
> definition, for safety).  The new members are XLogCtl->logWriteResult
> and XLogCtl->logFlushResult and as of 0001 are just XLogRecPtr, so
> essentially identical semantics as the previous code.  No atomic access
> yet!

BTW I forgot: I didn't like the XLogUpdateLocalLogwrtResult() name
much.  What do you think about RefreshXLogWriteResult() instead?

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it." (ncm, http://lwn.net/Articles/174769/)




Re: Streaming I/O, vectored I/O (WIP)

2024-04-03 Thread Melanie Plageman
On Tue, Apr 2, 2024 at 8:32 PM Thomas Munro  wrote:
>
> Here are the remaining patches discussed in this thread.  They give
> tablespace-specific io_combine_limit, effective_io_readahead_window
> (is this useful?), and up-to-1MB io_combine_limit (is this useful?).
> I think the first two would probably require teaching reloption.c how
> to use guc.c's parse_int() and unit flags, but I won't have time to
> look at that for this release so I'll just leave these here.
>
> On the subject of guc.c, this is a terrible error message... did I do
> something wrong?
>
> postgres=# set io_combine_limit = '42MB';
> ERROR:  5376 8kB is outside the valid range for parameter
> "io_combine_limit" (1 .. 32)

Well, GUC_UNIT_BLOCKS interpolates the block limit into the error
message string (get_config_unit_name()). But, I can't imagine this
error message is clear for any of the GUCs using GUC_UNIT_BLOCKS. I
would think some combination of the two would be helpful, like "43008
kB (5376 blocks) is outside of the valid range for parameter". The
user can check what their block size is. I don't think we need to
interpolate and print the block size in the error message.

On another note, since io_combine_limit, when specified in size,
rounds up to the nearest multiple of blocksize, it might be worth
mentioning this in the io_combine_limit docs at some point. I checked
docs for another GUC_UNIT_BLOCKS guc, backend_flush_after, and it
alludes to this.

- Melanie




Re: Combine Prune and Freeze records emitted by vacuum

2024-04-03 Thread Heikki Linnakangas

On 02/04/2024 16:11, Heikki Linnakangas wrote:

On 01/04/2024 20:22, Melanie Plageman wrote:

Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
you didn't modify them much/at all, but I noticed some things in my code
that could be better.


Ok, here's what I have now. I made a lot of small comment changes here
and there, and some minor local refactorings, but nothing major. I lost
track of all the individual changes, I'm afraid, so you'll have to just
diff against the previous version if you want to see what's changed. I
hope I didn't break anything.

I'm pretty happy with this now. I will skim through it one more time
later today or tomorrow, and commit. Please review once more if you have
a chance.


This probably doesn't belong here. I noticed spgdoinsert.c had a static
function for sorting OffsetNumbers, but I didn't see anything general
purpose anywhere else.


I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would
be nice to have just one copy of this in some common place, but I also
wasn't sure where to put it.


One more version, with two small fixes:

1. I fumbled the offsetnumber-cmp function at the last minute so that it 
didn't compile. Fixed that.


2. On VACUUM on an unlogged or temp table, the logic always thought that 
we would be generating an FPI, causing it to always freeze when it 
could. But of course, you never generate FPIs on an unlogged table. 
Fixed that. (Perhaps we should indeed freeze more aggressively on an 
unlogged table, but changing the heuristic is out of scope for this patch.)


Off-list, Melanie reported that there is a small regression with the 
benchmark script she posted yesterday, after all, but I'm not able to 
reproduce that.


--
Heikki Linnakangas
Neon (https://neon.tech)
From e04bda4666d7eaff0c520a9c9e1468a9c4cc9f51 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 2 Apr 2024 15:47:06 +0300
Subject: [PATCH v13 1/3] Refactor how heap_prune_chain() updates prunable_xid

In preparation for freezing and counting tuples which are not
candidates for pruning, split heap_prune_record_unchanged() into
multiple functions, depending on the kind of line pointer. That's not too
interesting right now, but makes the next commit smaller.

Recording the lowest soon-to-be prunable xid is one of the actions we
take for unchanged LP_NORMAL item pointers but not for others, so move
that to the new heap_prune_record_unchanged_lp_normal() function. The
next commit will add more actions to these functions.

Author: Melanie Plageman 
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
---
 src/backend/access/heap/pruneheap.c | 125 
 1 file changed, 92 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index ef563e19aa5..1b5bf990d21 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -78,7 +78,11 @@ static void heap_prune_record_redirect(PruneState *prstate,
 static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum, bool was_normal);
 static void heap_prune_record_dead_or_unused(PruneState *prstate, OffsetNumber offnum, bool was_normal);
 static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum, bool was_normal);
-static void heap_prune_record_unchanged(PruneState *prstate, OffsetNumber offnum);
+
+static void heap_prune_record_unchanged_lp_unused(Page page, PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_unchanged_lp_normal(Page page, int8 *htsv, PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_unchanged_lp_dead(Page page, PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_unchanged_lp_redirect(PruneState *prstate, OffsetNumber offnum);
 
 static void page_verify_redirects(Page page);
 
@@ -311,7 +315,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/* Nothing to do if slot doesn't contain a tuple */
 		if (!ItemIdIsUsed(itemid))
 		{
-			heap_prune_record_unchanged(, offnum);
+			heap_prune_record_unchanged_lp_unused(page, , offnum);
 			continue;
 		}
 
@@ -324,7 +328,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 			if (unlikely(prstate.mark_unused_now))
 heap_prune_record_unused(, offnum, false);
 			else
-heap_prune_record_unchanged(, offnum);
+heap_prune_record_unchanged_lp_dead(page, , offnum);
 			continue;
 		}
 
@@ -434,7 +438,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 			}
 		}
 		else
-			heap_prune_record_unchanged(, offnum);
+			heap_prune_record_unchanged_lp_normal(page, presult->htsv, , offnum);
 	}
 
 	/* We should now have processed every tuple exactly once  */
@@ -652,9 +656,6 @@ heap_prune_chain(Page page, BlockNumber blockno, OffsetNumber maxoff,
 		 */
 		chainitems[nchain++] = offnum;
 
-		/*
-		 * Check tuple's visibility status.
-		 */
 		switch 

Re: RFC: Additional Directory for Extensions

2024-04-03 Thread David E. Wheeler
On Apr 3, 2024, at 3:57 AM, walt...@technowledgy.de wrote:

> I can also imagine that it would be very helpful in a container setup to be 
> able to set an environment variable with this path instead of having to 
> recompile all of postgres to change it.

Yes, I like the suggestion to make it require a restart, which lets the 
sysadmin control it rather than being limited to whatever the person who 
compiled it thought would make sense.

Best,

David





Re: RFC: Additional Directory for Extensions

2024-04-03 Thread David E. Wheeler
On Apr 3, 2024, at 8:54 AM, David E. Wheeler  wrote:

> Yes, I like the suggestion to make it require a restart, which lets the 
> sysadmin control it rather than being limited to whatever the person who 
> compiled it thought would make sense.

Or SIGHUP?

D



Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Bertrand Drouvot
Hi,

On Wed, Apr 03, 2024 at 05:12:12PM +0530, Bharath Rupireddy wrote:
> On Wed, Apr 3, 2024 at 4:19 PM Amit Kapila  wrote:
> >
> > + 'postgres',
> > + "SELECT '$inactive_since_on_primary'::timestamptz <
> > '$inactive_since_on_standby'::timestamptz AND
> > + '$inactive_since_on_standby'::timestamptz < 
> > '$slot_sync_time'::timestamptz;"
> >
> > Shall we do <= check as we are doing in the main function
> > get_slot_inactive_since_value as the time duration is less so it can
> > be the same as well? Similarly, please check other tests.
> 
I get you. The tests are so fast that losing a bit of precision
might cause them to fail. So, I've added an equality check for all the
tests.

> Please find the attached v32-0001 patch with the above review comments
> addressed.

Thanks!

Just one comment on v32-0001:

+# Synced slot on the standby must get its own inactive_since.
+is( $standby1->safe_psql(
+   'postgres',
+   "SELECT '$inactive_since_on_primary'::timestamptz <= 
'$inactive_since_on_standby'::timestamptz AND
+   '$inactive_since_on_standby'::timestamptz <= 
'$slot_sync_time'::timestamptz;"
+   ),
+   "t",
+   'synchronized slot has got its own inactive_since');
+

By using <= we are not testing that it must get its own inactive_since (as we
allow them to be equal in the test). I think we should just add some usleep()
where appropriate and deny equality during the tests on inactive_since.

Except for the above, v32-0001 LGTM.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Modify pg_ctl to detect presence of geek user

2024-04-03 Thread Bruce Momjian
On Wed, Apr  3, 2024 at 04:17:21PM +0300, Panda Developpeur wrote:
> Dear PostgreSQL Hackers,
> 
> I am submitting a patch to modify pg_ctl to detect the presence of a geek user
> on the system and adapt its behavior accordingly. This patch introduces the
> following changes:
> 
>  1. Detection of geek user: The modified pg_ctl now checks the users created on
> the computer.
> 
>  2. No documentation or tests: Please note that I have not included new
> documentation or tests in this patch submission. However, I am open to
> adding them based on the community's feedback.
> 
>  3. Performance impact: The performance impact of these changes is minimal,
> with an expected delay of 500ms in specific scenarios only.
> 
> 
> Please review the patch and provide your feedback. I am open to making any
> necessary improvements based on the community's suggestions.
> 
> Thank you for considering my contribution.

Aside from an extra newline in the patch, I think this is ready to go!

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: remaining sql/json patches

2024-04-03 Thread jian he
hi.
+  
+   json_table is an SQL/JSON function which
+   queries JSON data
+   and presents the results as a relational view, which can be accessed as a
+   regular SQL table. You can only use
json_table inside the
+   FROM clause of a SELECT,
+   UPDATE, DELETE, or
MERGE
+   statement.
+  

the only issue is that the MERGE synopsis doesn't have a
FROM clause.
other than that, it's quite correct.
see the following test demo:

drop table ss;
create table ss(a int);
insert into ss select 1;
delete from ss using JSON_TABLE(jsonb '1', '$' COLUMNS (a int PATH '$'
) ERROR ON ERROR) jt where jt.a = 1;
insert into ss select 2;
update ss set a = 1 from JSON_TABLE(jsonb '2', '$' COLUMNS (a int PATH
'$')) jt where jt.a = 2;
DROP TABLE IF EXISTS target;
CREATE TABLE target (tid integer, balance integer) WITH
(autovacuum_enabled=off);
INSERT INTO target VALUES (1, 10),(2, 20),(3, 30);
MERGE INTO target USING JSON_TABLE(jsonb '2', '$' COLUMNS (a int PATH
'$' ) ERROR ON ERROR) source(sid)
ON target.tid = source.sid
WHEN MATCHED THEN UPDATE SET balance = 0
returning *;
--

+  
+   To split the row pattern into columns, json_table
+   provides the COLUMNS clause that defines the
+   schema of the created view. For each column, a separate path expression
+   can be specified to be evaluated against the row pattern to get a
+   SQL/JSON value that will become the value for the specified column in
+   a given output row.
+  
should be "an SQL/JSON".

+
+ Inserts a SQL/JSON value obtained by applying
+ path_expression against the row pattern into
+ the view's output row after coercing it to specified
+ type.
+
should be "an SQL/JSON".

"coercing it to specified type"
should be
"coercing it to the specified type"?
---
+
+ The value corresponds to whether evaluating the PATH
+ expression yields any SQL/JSON values.
+
maybe we can change to
+
+ The value corresponds to whether applying the
path_expression
+ expression yields any SQL/JSON values.
+
so it looks more consistent with the preceding paragraph.

+
+ Optionally, ON ERROR can be used to specify whether
+ to throw an error or return the specified value to handle structural
+ errors, respectively.  The default is to return a boolean value
+ FALSE.
+
we don't need "respectively" here?

+ if (jt->on_error &&
+ jt->on_error->btype != JSON_BEHAVIOR_ERROR &&
+ jt->on_error->btype != JSON_BEHAVIOR_EMPTY &&
+ jt->on_error->btype != JSON_BEHAVIOR_EMPTY_ARRAY)
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("invalid ON ERROR behavior"),
+ errdetail("Only EMPTY or ERROR is allowed for ON ERROR in JSON_TABLE()."),
+ parser_errposition(pstate, jt->on_error->location));

errdetail("Only EMPTY or ERROR is allowed for ON ERROR in JSON_TABLE()."),
maybe change to something like:
`
errdetail("Only EMPTY or ERROR is allowed for ON ERROR in the
top-level JSON_TABLE() ").
`
I guess mentioning "top-level" is fine, since "top-level" already has
19 appearances in functions-json.html.




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Tom Lane
Jacob Champion  writes:
> As far as I can tell, no versions of LibreSSL so far provide
> X509_get_signature_info(), so this patch is probably a bit too
> aggressive.

Another problem with cutting support is how many buildfarm members
will we lose.  I scraped recent configure logs and got the attached
results.  I count 3 machines running 1.0.1, 18 running some flavor
of 1.0.2, and 7 running various LibreSSL versions.  We could
probably retire or update the 1.0.1 installations, but the rest
would represent a heavier lift.  Notably, it seems that what macOS
is shipping is LibreSSL.

regards, tom lane

sysname|  l 
  
---+--
 alabio| configure: using openssl: OpenSSL 1.1.1w  11 Sep 2023
 alimoche  | configure: using openssl: OpenSSL 1.1.1n  15 Mar 2022
 arowana   | configure: using openssl: OpenSSL 1.0.2k-fips  26 Jan 2017
 avocet| configure: using openssl: OpenSSL 1.1.1l-fips  24 Aug 2021 
SUSE release 150400.7.60.2
 ayu   | configure: using openssl: OpenSSL 1.1.0l  10 Sep 2019
 babbler   | configure: using openssl: OpenSSL 1.1.1k  FIPS 25 Mar 2021
 basilisk  | configure: using openssl: OpenSSL 3.1.4 24 Oct 2023 (Library: 
OpenSSL 3.1.4 24 Oct 2023)
 batfish   | configure: using openssl: OpenSSL 1.0.2g  1 Mar 2016
 batta | configure: using openssl: OpenSSL 1.1.1w  11 Sep 2023
 blackneck | configure: using openssl: OpenSSL 1.1.1n  15 Mar 2022
 boa   | configure: using openssl: OpenSSL 1.0.2k-fips  26 Jan 2017
 boomslang | configure: using openssl: OpenSSL 1.1.1k  25 Mar 2021
 broadbill | configure: using openssl: OpenSSL 1.1.1k  FIPS 25 Mar 2021
 bulbul| configure: using openssl: OpenSSL 1.1.1k  FIPS 25 Mar 2021
 buri  | configure: using openssl: OpenSSL 1.0.2k-fips  26 Jan 2017
 bushmaster| configure: using openssl: OpenSSL 3.1.5 30 Jan 2024 (Library: 
OpenSSL 3.1.5 30 Jan 2024)
 butterflyfish | configure: using openssl: OpenSSL 1.0.2p-fips  14 Aug 2018
 caiman| configure: using openssl: OpenSSL 3.2.1 30 Jan 2024 (Library: 
OpenSSL 3.2.1 30 Jan 2024)
 canebrake | configure: using openssl: OpenSSL 3.1.5 30 Jan 2024 (Library: 
OpenSSL 3.1.5 30 Jan 2024)
 cascabel  | configure: using openssl: OpenSSL 1.1.1n  15 Mar 2022 
(Library: OpenSSL 1.1.1w  11 Sep 2023)
 cavefish  | configure: using openssl: OpenSSL 1.1.1  11 Sep 2018
 chevrotain| configure: using openssl: OpenSSL 1.1.1n  15 Mar 2022 
(Library: OpenSSL 1.1.1w  11 Sep 2023)
 chimaera  | configure: using openssl: OpenSSL 1.1.0l  10 Sep 2019
 chipmunk  | configure: using openssl: OpenSSL 1.0.1t  3 May 2016
 cisticola | configure: using openssl: OpenSSL 1.1.1g FIPS  21 Apr 2020
 clam  | configure: using openssl: OpenSSL 1.0.2k-fips  26 Jan 2017
 conchuela | configure: using openssl: LibreSSL 3.2.5
 copperhead| configure: using openssl: OpenSSL 1.1.1k  25 Mar 2021
 culpeo| configure: using openssl: OpenSSL 3.0.11 19 Sep 2023 (Library: 
OpenSSL 3.0.11 19 Sep 2023)
 cuon  | configure: using openssl: OpenSSL 1.0.2g  1 Mar 2016
 demoiselle| configure: using openssl: OpenSSL 1.1.0h-fips  27 Mar 2018
 desman| configure: using openssl: OpenSSL 3.0.9 30 May 2023 (Library: 
OpenSSL 3.0.9 30 May 2023)
 dhole | configure: using openssl: OpenSSL 1.0.2k-fips  26 Jan 2017
 dikkop| configure: using openssl: OpenSSL 3.0.10 1 Aug 2023 (Library: 
OpenSSL 3.0.10 1 Aug 2023)
 elasmobranch  | configure: using openssl: OpenSSL 1.1.0h-fips  27 Mar 2018
 gokiburi  | configure: using openssl: OpenSSL 1.1.1w  11 Sep 2023
 grison| configure: using openssl: OpenSSL 1.1.0l  10 Sep 2019
 grison| configure: using openssl: OpenSSL 1.1.1n  15 Mar 2022
 guaibasaurus  | configure: using openssl: OpenSSL 1.1.1w  11 Sep 2023
 gull  | configure: using openssl: OpenSSL 1.1.1w  11 Sep 2023
 habu  | configure: using openssl: OpenSSL 3.0.9 30 May 2023 (Library: 
OpenSSL 3.0.9 30 May 2023)
 hachi | configure: using openssl: OpenSSL 1.1.1w  11 Sep 2023
 hake  | configure: using openssl: OpenSSL 1.0.2u  20 Dec 2019
 hippopotamus  | configure: using openssl: OpenSSL 1.1.1l-fips  24 Aug 2021 
SUSE release 150400.7.60.2
 indri | configure: using openssl: OpenSSL 3.2.0 23 Nov 2023 (Library: 
OpenSSL 3.2.0 23 Nov 2023)
 jackdaw   | configure: using openssl: OpenSSL 1.1.1n  15 Mar 2022 
(Library: OpenSSL 1.1.1w  11 Sep 2023)
 jay   | configure: using openssl: OpenSSL 1.1.1l-fips  24 Aug 2021 
SUSE release 150400.7.60.2
 kingsnake | configure: using openssl: OpenSSL 3.0.9 30 May 2023 (Library: 
OpenSSL 3.0.9 30 May 2023)
 krait | configure: using openssl: OpenSSL 1.1.1k  FIPS 25 Mar 2021
 lancehead | configure: using 

Re: Separate memory contexts for relcache and catcache

2024-04-03 Thread Melih Mutlu
On Sat, 27 Jan 2024 at 06:01, vignesh C wrote:

> On Wed, 3 Jan 2024 at 16:56, Melih Mutlu  wrote:
> CFBot shows that the patch does not apply anymore as in [1]:
> === Applying patches on top of PostgreSQL commit ID
> 729439607ad210dbb446e31754e8627d7e3f7dda ===
> === applying patch
> ./v2-0001-Separate-memory-contexts-for-relcache-and-catcach.patch
> patching file src/backend/utils/cache/catcache.c
> ...
> Hunk #8 FAILED at 1933.
> Hunk #9 succeeded at 2253 (offset 84 lines).
> 1 out of 9 hunks FAILED -- saving rejects to file
> src/backend/utils/cache/catcache.c.rej
>
> Please post an updated version for the same.
>
> [1] - http://cfbot.cputube.org/patch_46_4554.log
>
> Regards,
> Vignesh
>

Rebased. PSA.


-- 
Melih Mutlu
Microsoft


v3-0001-Separate-memory-contexts-for-relcache-and-catcach.patch
Description: Binary data


Re: pg_combinebackup --copy-file-range

2024-04-03 Thread Jakub Wartak
On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
 wrote:
>
> Hi,
>
> I've been running some benchmarks and experimenting with various stuff,
> trying to improve the poor performance on ZFS, and the regression on XFS
> when using copy_file_range. And oh boy, did I find interesting stuff ...

[..]

Congratulations on great results!

> 4) after each pg_combinebackup run to pg_verifybackup, start the cluster
> to finish recovery, run pg_checksums --check (to check the patches don't
> produce something broken)

I've performed some follow-up small testing on all patches mentioned
here (1..7), with the nano-incremental-backup-tests developed earlier,
which helped Robert detect some issues during the original
development. They all went fine in both cases:
- no special options when using pg_combinebackup
- using pg_combinebackup --copy-file-range --manifest-checksums=NONE

Those were:
test_across_wallevelminimal.sh
test_full_pri__incr_stby__restore_on_pri.sh
test_full_pri__incr_stby__restore_on_stby.sh
test_full_stby__incr_stby__restore_on_pri.sh
test_full_stby__incr_stby__restore_on_stby.sh
test_incr_after_timelineincrease.sh
test_incr_on_standby_after_promote.sh
test_many_incrementals_dbcreate_duplicateOID.sh
test_many_incrementals_dbcreate_filecopy_NOINCR.sh
test_many_incrementals_dbcreate_filecopy.sh
test_many_incrementals_dbcreate.sh
test_many_incrementals.sh
test_multixact.sh
test_pending_2pc.sh
test_reindex_and_vacuum_full.sh
test_repro_assert_RP.sh
test_repro_assert.sh
test_standby_incr_just_backup.sh
test_stuck_walsum.sh
test_truncaterollback.sh
test_unlogged_table.sh

> Now to the findings 
>
>
> 1) block alignment

[..]

> And I think we probably want to do this now, because this affects all
> tools dealing with incremental backups - even if someone writes a custom
> version of pg_combinebackup, it will have to deal with misaligned data.
> Perhaps there might be something like pg_basebackup that "transforms"
> the data received from the server (and also the backup manifest), but
> that does not seem like a great direction.

If anything is on the table, then I think in the far future
pg_refresh_standby_using_incremental_backup_from_primary would be the
only other tool using the format ?

> 2) prefetch
> ---
[..]
> I think this means we may need a "--prefetch" option, that'd force
> prefetching, probably both before pread and copy_file_range. Otherwise
> people on ZFS are doomed and will have poor performance.

Right, we could optionally cover in the docs later on the various options
to get the performance (on XFS use $this, but without $that, and so
on). It's kind of madness dealing with all those performance
variations.

Another idea: remove that hardcoded 128 in the posix_fadvise() call in 0002
and add a getopt variant like --prefetch[=HOWMANY], with 128 as the default?

-J.




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-04-03 Thread Alvaro Herrera
Hello,

On 2024-Apr-03, Alexander Lakhin wrote:

> I've managed to trigger an assert added by 53c2a97a9.
> Please try the following script against a server compiled with
> -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the
> define, it just simplifies reproducing...):

Ah yes, absolutely, we're failing to trade to the correct SLRU bank lock
there.  This rewrite of that small piece should fix it.  Thanks for
reporting this.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Pido que me den el Nobel por razones humanitarias" (Nicanor Parra)
>From 44c39cf4bf258fb0b65aa1acc5f84e5d7f729eb1 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 3 Apr 2024 16:00:24 +0200
Subject: [PATCH] Fix zeroing of pg_serial page without SLRU bank lock

---
 src/backend/storage/lmgr/predicate.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 3f378c0099..d5bbfbd4c6 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -137,7 +137,7 @@
  *	SerialControlLock
  *		- Protects SerialControlData members
  *
- *	SerialSLRULock
+ *	SLRU per-bank locks
  *		- Protects SerialSlruCtl
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
@@ -908,20 +908,25 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
 	if (isNewPage)
 		serialControl->headPage = targetPage;
 
-	LWLockAcquire(lock, LW_EXCLUSIVE);
-
 	if (isNewPage)
 	{
-		/* Initialize intervening pages. */
-		while (firstZeroPage != targetPage)
+		/* Initialize intervening pages; might involve trading locks */
+		for (;;)
 		{
-			(void) SimpleLruZeroPage(SerialSlruCtl, firstZeroPage);
+			lock = SimpleLruGetBankLock(SerialSlruCtl, firstZeroPage);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			slotno = SimpleLruZeroPage(SerialSlruCtl, firstZeroPage);
+			if (firstZeroPage == targetPage)
+break;
 			firstZeroPage = SerialNextPage(firstZeroPage);
+			LWLockRelease(lock);
 		}
-		slotno = SimpleLruZeroPage(SerialSlruCtl, targetPage);
 	}
 	else
+	{
+		LWLockAcquire(lock, LW_EXCLUSIVE);
 		slotno = SimpleLruReadPage(SerialSlruCtl, targetPage, true, xid);
+	}
 
 	SerialValue(slotno, xid) = minConflictCommitSeqNo;
 	SerialSlruCtl->shared->page_dirty[slotno] = true;
-- 
2.39.2



Re: psql not responding to SIGINT upon db reconnection

2024-04-03 Thread Tristan Partin

On Wed Apr 3, 2024 at 9:50 AM CDT, Jelte Fennema-Nio wrote:

On Wed, 3 Apr 2024 at 16:31, Tom Lane  wrote:
> If we do the latter, we will almost certainly get pushback from
> distros who check for library ABI breaks.  I fear the comment
> suggesting that we could remove it someday is too optimistic.

Alright, changed patch 0002 to keep the variant. But remove it from
the recently added switch/case. And also updated the comment to remove
the "for awhile".


Removing from the switch statement causes a warning:


[1/2] Compiling C object src/bin/psql/psql.p/command.c.o
../src/bin/psql/command.c: In function ‘wait_until_connected’:
../src/bin/psql/command.c:3803:17: warning: enumeration value 
‘PGRES_POLLING_ACTIVE’ not handled in switch [-Wswitch]
 3803 | switch (PQconnectPoll(conn))
  | ^~


--
Tristan Partin
Neon (https://neon.tech)




Re: New Table Access Methods for Multi and Single Inserts

2024-04-03 Thread Bharath Rupireddy
On Wed, Apr 3, 2024 at 2:32 PM Bharath Rupireddy
 wrote:
>
> I too prefer the latter so that the caller doesn't have to have two
> paths. The new API can just transparently fall back to single inserts.
> I've implemented that in the attached v17 patch. I also tested the
> default APIs manually, but I'll see if I can add some tests for the
> default API.

Fixed a compiler warning found via CF bot. Please find the attached
v18 patches. I'm sorry for the noise.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From ff1278b77e0d6ac6a49f0826602bd948e78c7a91 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 3 Apr 2024 11:56:14 +
Subject: [PATCH v18 1/2] Introduce new table modify access methods

---
 src/backend/access/heap/heapam.c | 189 ++-
 src/backend/access/heap/heapam_handler.c |   5 +
 src/backend/access/table/tableam.c   |  86 +++
 src/backend/access/table/tableamapi.c|   8 +
 src/include/access/heapam.h  |  41 +
 src/include/access/tableam.h | 108 +
 src/tools/pgindent/typedefs.list |   3 +
 7 files changed, 439 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b661d9811e..69f8c597d8 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -64,6 +64,7 @@
 #include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
+#include "utils/memutils.h"
 #include "utils/relcache.h"
 #include "utils/snapmgr.h"
 #include "utils/spccache.h"
@@ -107,7 +108,8 @@ static int	bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
 static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
 static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
 		bool *copy);
-
+static void heap_modify_buffer_flush(TableModifyState *state);
+static void heap_modify_insert_end(TableModifyState *state);
 
 /*
  * Each tuple lock mode has a corresponding heavyweight lock, and one or two
@@ -2441,6 +2443,191 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	*insert_indexes = true;
 }
 
+/*
+ * Initialize heap modify state.
+ */
+TableModifyState *
+heap_modify_begin(Relation rel, int modify_flags, CommandId cid,
+  int options)
+{
+	TableModifyState *state;
+	MemoryContext context;
+	MemoryContext oldcontext;
+
+	context = AllocSetContextCreate(CurrentMemoryContext,
+	"heap_modify memory context",
+	ALLOCSET_DEFAULT_SIZES);
+
+	oldcontext = MemoryContextSwitchTo(context);
+	state = palloc0(sizeof(TableModifyState));
+	state->rel = rel;
+	state->modify_flags = modify_flags;
+	state->mctx = context;
+	state->cid = cid;
+	state->options = options;
+	state->insert_indexes = false;
+	state->modify_end_cb = NULL;	/* To be installed lazily */
+	MemoryContextSwitchTo(oldcontext);
+
+	return state;
+}
+
+/*
+ * Store passed-in tuple into in-memory buffered slots. When full, insert
+ * multiple tuples from the buffers into heap.
+ */
+void
+heap_modify_buffer_insert(TableModifyState *state,
+		  TupleTableSlot *slot)
+{
+	TupleTableSlot *dstslot;
+	HeapInsertState *istate;
+	HeapMultiInsertState *mistate;
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(state->mctx);
+
+	/* First time through, initialize heap insert state */
+	if (state->data == NULL)
+	{
+		istate = (HeapInsertState *) palloc0(sizeof(HeapInsertState));
+		istate->bistate = NULL;
+		istate->mistate = NULL;
+		state->data = istate;
+
+		if ((state->modify_flags & TM_FLAG_MULTI_INSERTS) != 0)
+		{
+			mistate = (HeapMultiInsertState *) palloc0(sizeof(HeapMultiInsertState));
+			mistate->slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * HEAP_MAX_BUFFERED_SLOTS);
+			istate->mistate = mistate;
+		}
+
+		if ((state->modify_flags & TM_FLAG_BAS_BULKWRITE) != 0)
+			istate->bistate = GetBulkInsertState();
+
+		state->modify_end_cb = heap_modify_insert_end;
+	}
+
+	istate = (HeapInsertState *) state->data;
+	Assert(istate->mistate != NULL);
+	mistate = istate->mistate;
+	Assert(istate->bistate != NULL);
+
+	dstslot = mistate->slots[mistate->cur_slots];
+	if (dstslot == NULL)
+	{
+		/*
+		 * We use virtual tuple slots for the buffered slots to leverage the
+		 * optimization they provide for minimizing physical data copying. The
+		 * virtual slot gets materialized when we copy (via ExecCopySlot
+		 * below) the tuples from the source slot, which can be of any type.
+		 * This ensures that the tuple storage doesn't depend on external
+		 * memory, because all the datums that aren't passed by value are
+		 * copied into the slot's memory context.
+		 */
+		dstslot = MakeTupleTableSlot(RelationGetDescr(state->rel),
+									 &TTSOpsVirtual);
+		mistate->slots[mistate->cur_slots] = dstslot;
+	}
+
+	ExecClearTuple(dstslot);
+	ExecCopySlot(dstslot, slot);
+
+	mistate->cur_slots++;
+

Re: [PATCH] Modify pg_ctl to detect presence of geek user

2024-04-03 Thread Bruce Momjian
On Wed, Apr  3, 2024 at 09:25:02AM -0400, Bruce Momjian wrote:
> On Wed, Apr  3, 2024 at 04:17:21PM +0300, Panda Developpeur wrote:
> > Dear PostgreSQL Hackers,
> > 
> > I am submitting a patch to modify pg_ctl to detect the presence of a geek 
> > user
> > on the system and adapt its behavior accordingly. This patch introduces the
> > following changes:
> > 
> >  1. Detection of geek user: The modified pg_ctl now checks the users
> > created on the computer.
> > 
> >  2. No documentation or tests: Please note that I have not included new
> > documentation or tests in this patch submission. However, I am open to
> > adding them based on the community's feedback.
> > 
> >  3. Performance impact: The performance impact of these changes is minimal,
> > with an expected delay of 500ms in specific scenarios only.
> > 
> > 
> > Please review the patch and provide your feedback. I am open to making any
> > necessary improvements based on the community's suggestions.
> > 
> > Thank you for considering my contribution.
> 
> Aside from an extra newline in the patch, I think this is ready to go!

Also, it feels like the deadline for this patch was two days ago.  ;-)

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: RFC: Additional Directory for Extensions

2024-04-03 Thread David E. Wheeler
On Apr 3, 2024, at 8:54 AM, David E. Wheeler  wrote:

> Yes, I like the suggestion to make it require a restart, which lets the 
> sysadmin control it and not limited to whatever the person who compiled it 
> thought would make sense.

Here’s a revision of the Debian patch that requires a server start.

However, in studying the patch, it appears that the `extension_directory` is 
searched for *all* shared libraries, not just those being loaded for an 
extension. Am I reading the `expand_dynamic_library_name()` function right?

If so, this seems like a good way for a bad actor to muck with things, by 
putting an exploited libpgtypes library into the extension directory, where it 
would be loaded in preference to the core libpgtypes library, if they couldn’t 
exploit the original.

I’m thinking it would be better to have the dynamic library lookup for 
extension libraries (and LOAD libraries?) separate, so that the 
`extension_directory` would not be used for core libraries.

This would also allow the lookup of extension libraries prefixed by the 
directory field from the control file, which would enable much tidier extension 
installation: The control file, SQL scripts, and DSOs could all be in a single 
directory for an extension.

Thoughts?

Best,

David




v1-0001-Add-extension_directory-GUC.patch
Description: Binary data


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-04-03 Thread Dilip Kumar
On Wed, Apr 3, 2024 at 7:40 PM Alvaro Herrera  wrote:
>
> Hello,
>
> On 2024-Apr-03, Alexander Lakhin wrote:
>
> > I've managed to trigger an assert added by 53c2a97a9.
> > Please try the following script against a server compiled with
> > -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the
> > define, it just simplifies reproducing...):
>
> Ah yes, absolutely, we're neglecting to take the correct SLRU bank lock
> there.  This rewrite of that small piece should fix it.  Thanks for
> reporting this.
>

Yeah, we missed acquiring the bank lock w.r.t. intervening pages,
thanks for reporting.  Your fix looks correct to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Is it safe to cache data by GiST consistent function

2024-04-03 Thread Tom Lane
Michał Kłeczek  writes:
> When implementing a GiST consistent function I found the need to cache 
> pre-processed query across invocations.
> I am not sure if it is safe to do (or I need to perform some steps to make 
> sure cached info is not leaked between rescans).

AFAIK it works.  I don't see any of the in-core ones doing so,
but at least range_gist_consistent and multirange_gist_consistent
are missing a bet by repeating their cache search every time.

> The comment in gistrescan says:

>   /*
>* If this isn't the first time through, preserve the fn_extra
>* pointers, so that if the consistentFns are using them to 
> cache
>* data, that data is not leaked across a rescan.
>*/

> which seems to me self-contradictory as fn_extra is preserved between rescans 
> (so leaks are indeed possible).

I think you're reading it wrong.  If we cleared fn_extra during
rescan, access to the old extra value would be lost so a new one
would have to be created, leaking the old value for the rest of
the query.

regards, tom lane




Re: psql not responding to SIGINT upon db reconnection

2024-04-03 Thread Jelte Fennema-Nio
On Wed, 3 Apr 2024 at 16:31, Tom Lane  wrote:
> If we do the latter, we will almost certainly get pushback from
> distros who check for library ABI breaks.  I fear the comment
> suggesting that we could remove it someday is too optimistic.

Alright, changed patch 0002 to keep the variant. But remove it from
the recently added switch/case. And also updated the comment to remove
the "for awhile".
From 29100578b5f0b4cec68ee1b8c572c4f280335da3 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Wed, 3 Apr 2024 15:19:04 +0200
Subject: [PATCH v12 1/2] Fix actually reachable pg_unreachable call

In cafe1056558fe07cdc52b95205588fcd80870362 a call to pg_unreachable was
introduced that was actually reachable, because there were break
statements in the loop. This addresses that by both changing the break
statements to return and removing the call to pg_unreachable, which
seemed of dubious necessity anyway.
---
 src/bin/psql/command.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index c005624e9c3..dc3cc7c10b7 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -3771,7 +3771,7 @@ wait_until_connected(PGconn *conn)
 		 * user has requested a cancellation.
 		 */
 		if (cancel_pressed)
-			break;
+			return;
 
 		/*
 		 * Do not assume that the socket remains the same across
@@ -3779,7 +3779,7 @@ wait_until_connected(PGconn *conn)
 		 */
 		sock = PQsocket(conn);
 		if (sock == -1)
-			break;
+			return;
 
 		/*
 		 * If the user sends SIGINT between the cancel_pressed check, and
@@ -3815,8 +3815,6 @@ wait_until_connected(PGconn *conn)
 pg_unreachable();
 		}
 	}
-
-	pg_unreachable();
 }
 
 void

base-commit: 936e3fa3787a51397280c1081587586e83c20399
-- 
2.34.1

From f71634361a0c9ba3416c75fc3ff6d55536f134c4 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Wed, 3 Apr 2024 15:21:52 +0200
Subject: [PATCH v12 2/2] Remove PGRES_POLLING_ACTIVE case

This enum variant has been unused for at least 21 years, since commit
44aba280207740d0956160c0288e61f28f024a71. It's been there only for
backwards compatibility of the ABI, so no need to check for it.

Also update the comment on PGRES_POLLING_ACTIVE, since we're stuck with
it forever due to ABI compatibility guarantees.
---
 src/bin/psql/command.c  | 2 --
 src/interfaces/libpq/libpq-fe.h | 3 +--
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index dc3cc7c10b7..9a163cf22cd 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -3811,8 +3811,6 @@ wait_until_connected(PGconn *conn)
 			case PGRES_POLLING_WRITING:
 forRead = false;
 continue;
-			case PGRES_POLLING_ACTIVE:
-pg_unreachable();
 		}
 	}
 }
diff --git a/src/interfaces/libpq/libpq-fe.h b/src/interfaces/libpq/libpq-fe.h
index 8d3c5c6f662..c184e853889 100644
--- a/src/interfaces/libpq/libpq-fe.h
+++ b/src/interfaces/libpq/libpq-fe.h
@@ -91,8 +91,7 @@ typedef enum
 	PGRES_POLLING_READING,		/* These two indicate that one may	  */
 	PGRES_POLLING_WRITING,		/* use select before polling again.   */
 	PGRES_POLLING_OK,
-	PGRES_POLLING_ACTIVE		/* unused; keep for awhile for backwards
- * compatibility */
+	PGRES_POLLING_ACTIVE		/* unused; keep for backwards compatibility */
 } PostgresPollingStatusType;
 
 typedef enum
-- 
2.34.1



Re: Missing LWLock protection in pgstat_reset_replslot()

2024-04-03 Thread Michael Paquier
On Tue, Apr 02, 2024 at 03:18:24PM -0400, Tom Lane wrote:
> I've closed the CF entry for this [1] as committed.  Please re-open
> it if there was something left to do here.
> 
> [1] https://commitfest.postgresql.org/47/4878/

Thanks, I was not aware of this one.
--
Michael




Re: RFC: Additional Directory for Extensions

2024-04-03 Thread Alvaro Herrera
On 2024-Apr-02, David E. Wheeler wrote:

> That quotation comes from this Debian patch[2] maintained by Christoph
> Berg. I’d like to formally propose integrating this patch into the
> core. And not only because it’s overhead for package maintainers like
> Christoph, but because a number of use cases have emerged since we
> originally discussed something like this back in 2013[3]:

I support the idea of there being a second location from where to load
shared libraries ... but I don't like the idea of making it
runtime-configurable.  If we want to continue to tighten up what
superuser can do, then one of the things that has to go away is the
ability to load shared libraries from arbitrary locations
(dynamic_library_path).  I think we should instead look at making those
locations hardcoded at compile time.  The packager can then decide where
those things go, and the superuser no longer has the ability to load
arbitrary code from arbitrary locations.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
In the beginning there was UNIX, and UNIX spoke and said: "Hello world\n".
It did not say "Hello New Jersey\n", nor "Hello USA\n".




RE: Psql meta-command conninfo+

2024-04-03 Thread Maiquel Grassi
Hello Sami,

(v26)

>  Here is an example [1] where the session information functions are 
> referenced.

>  The public doc for this example is in [2].


>  Something similar can be done here. For example:


>  The session user's name; see the session_user() 
> function in

>   for more details.

I updated the patch.

Thank you for this help.

Regards,
Maiquel Grassi.


v26-0001-psql-meta-command-conninfo-plus.patch
Description: v26-0001-psql-meta-command-conninfo-plus.patch


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Amit Kapila
On Wed, Apr 3, 2024 at 2:58 PM shveta malik  wrote:
>
> On Wed, Apr 3, 2024 at 11:17 AM Bharath Rupireddy
>  wrote:
> >
> > On Wed, Apr 3, 2024 at 8:38 AM shveta malik  wrote:
> > >
> > > > > Or a simple solution is that the slotsync worker updates
> > > > > inactive_since as it does for non-synced slots, and disables
> > > > > timeout-based slot invalidation for synced slots.
> > >
> > > I like this idea better, it takes care of such a case too when the
> > > user is relying on sync-function rather than worker and does not want
> > > to get the slots invalidated in between 2 sync function calls.
> >
> > Please find the attached v31 patches implementing the above idea:
> >
>
> Thanks for the patches, please find few comments:
>
> v31-001:
>
> 1)
> system-views.sgml:
> value will get updated  after every synchronization from the
> corresponding remote slot on the primary.
>
> --This is confusing. It will be good to rephrase it.
>
> 2)
> update_synced_slots_inactive_since()
>
> --Maybe we should mention in the header that this function is called
> only during promotion.
>
> 3) 040_standby_failover_slots_sync.pl:
> We capture inactive_since_on_primary when we do this for the first time at 
> #175
> ALTER SUBSCRIPTION regress_mysub1 DISABLE"
>
> But we again recreate the sub and disable it at line #280.
> Do you think we shall get inactive_since_on_primary again here, to be
> compared with inactive_since_on_new_primary later?
>

I think so.

Few additional comments on tests:
1.
+is( $standby1->safe_psql(
+ 'postgres',
+ "SELECT '$inactive_since_on_primary'::timestamptz <
'$inactive_since_on_standby'::timestamptz AND
+ '$inactive_since_on_standby'::timestamptz < '$slot_sync_time'::timestamptz;"

Shall we do a <= check, as we do in the main function
get_slot_inactive_since_value, since the duration is short and the
times can be equal? Similarly, please check the other tests.

2.
+=item $node->get_slot_inactive_since_value(self, slot_name, reference_time)
+
+Get inactive_since column value for a given replication slot validating it
+against optional reference time.
+
+=cut
+
+sub get_slot_inactive_since_value

I see that all callers validate against reference time. It is better
to name it validate_slot_inactive_since rather than using get_* as the
main purpose is to validate the passed value.

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Andy Fan


> I also can confirm that a lot of users would be very happy to have
> "read your writes consistency" and ready to do something to achieve
> this at an application level.  However, they typically don't know what
> exactly they need.
>
> So, blogging about pg_wal_replay_wait() and spreading words about it
> at conferences would be highly appreciated.

Sure, once it is committed, I promise I will do a knowledge-sharing
session in our organization and write an article in Chinese.

-- 
Best Regards
Andy Fan





Re: New Table Access Methods for Multi and Single Inserts

2024-04-03 Thread Bharath Rupireddy
On Wed, Apr 3, 2024 at 1:10 AM Jeff Davis  wrote:
>
> On Sun, 2024-03-31 at 21:18 +0530, Bharath Rupireddy wrote:
> > if (table_modify_buffer_insert() is defined)
> >table_modify_buffer_insert(...);
> > else
> > {
> >   myState->bistate = GetBulkInsertState();
> >   table_tuple_insert(...);
> > }
>
> We can't alloc/free the bulk insert state for every insert call. I see
> two options:
>
> * Each caller needs to support two code paths: if the buffered insert
> APIs are defined, then use those; otherwise the caller needs to manage
> the bulk insert state itself and call the plain insert API.
>
> * Have default implementation for the new API methods, so that the
> default for the begin method would allocate the bulk insert state, and
> the default for the buffered insert method would be to call plain
> insert using the bulk insert state.
>
> I'd prefer the latter, at least in the long term. But I haven't really
> thought through the details, so perhaps we'd need to use the former.

I too prefer the latter so that the caller doesn't have to have two
paths. The new API can just transparently fallback to single inserts.
I've implemented that in the attached v17 patch. I also tested the
default APIs manually, but I'll see if I can add some tests for the
default APIs.

> > > After we have these new APIs fully in place and used by COPY, what
> > > will
> > > happen to those other APIs? Will they be deprecated or will there
> > > be a
> > > reason to keep them?
> >
> > Deprecated perhaps?
>
> Including Alexander on this thread, because he's making changes to the
> multi-insert API. We need some consensus on where we are going with
> these APIs before we make more changes, and what incremental steps make
> sense in v17.
>
> Here's where I think this API should go:
>
> 1. Have table_modify_begin/end and table_modify_buffer_insert, like
> those that are implemented in your patch.
>
> 2. Add some kind of flush callback that will be called either while the
> tuples are being flushed or after the tuples are flushed (but before
> they are freed by the AM). (Aside: do we need to call it while the
> tuples are being flushed to get the right visibility semantics for
> after-row triggers?)
>
> 3. Add table_modify_buffer_{update|delete} APIs.
>
> 4. Some kind of API tweaks to help manage memory when modifying
> partitioned tables, so that the buffering doesn't get out of control.
> Perhaps just reporting memory usage and allowing the caller to force
> flushes would be enough.
>
> 5. Use these new methods for CREATE/REFRESH MATERIALIZED VIEW. This is
> fairly straightforward, I believe, and handled by your patch. Indexes
> are (re)built afterward, and no triggers are possible.
>
> 6. Use these new methods for CREATE TABLE ... AS. This is fairly
> straightforward, I believe, and handled by your patch. No indexes or
> triggers are possible.
>
> 7. Use these new methods for COPY. We have to be careful to avoid
> regressions for the heap method, because it's already managing its own
> buffers. If the AM manages the buffering, then it may require
> additional copying of slots, which could be a disadvantage. To solve
> this, we may need some minor API tweaks to avoid copying when the
> caller guarantees that the memory will not be freed to early, or
> perhaps expose the AM's memory context to copyfrom.c. Another thing to
> consider is that the buffering in copyfrom.c is also used for FDWs, so
> that buffering code path needs to be preserved in copyfrom.c even if
> not used for AMs.
>
> 8. Use these new methods for INSERT INTO ... SELECT. One potential
> challenge here is that execution nodes are not always run to
> completion, so we need to be sure that the flush isn't forgotten in
> that case.
>
> 9. Use these new methods for DELETE, UPDATE, and MERGE. MERGE can use
> the buffer_insert/update/delete APIs; we don't need a separate merge
> method. This probably requires that the AM maintain 3 separate buffers
> to distinguish different kinds of changes at flush time (obviously
> these can be initialized lazily to avoid overhead when not being used).
>
> 10. Use these new methods for logical apply.
>
> 11. Deprecate the multi_insert API.
>
> Thoughts on this plan? Does your patch make sense in v17 as a stepping
> stone, or should we try to make all of these API changes together in
> v18?

I'd like to see the new multi insert API (as proposed in the v17
patches) for PG17 if possible. The basic idea with these new APIs is
to let the AM implementers choose the right buffered insert strategy
(one can choose the AM specific slot type to buffer the tuples, choose
the AM specific memory and flushing decisions etc.). Another advantage
with these new multi insert API is that the CREATE MATERIALIZED VIEW,
REFRESH MATERIALIZED VIEW, CREATE TABLE AS commands for heap AM got
faster by 62.54%, 68.87%, 74.31% or 2.67, 3.21, 3.89 times
respectively. The performance improvement in REFRESH MATERIALIZED VIEW
can benefit customers running analytical 

Re: LogwrtResult contended spinlock

2024-04-03 Thread Alvaro Herrera
Thanks for keeping this moving forward.  I gave your proposed patches a
look.   One thing I didn't like much is that we're adding a new member
(Copy) to XLogwrtAtomic -- but this struct is supposed to be a mirror of
XLogwrtResult for use with atomic access.  Since this new member is not
added to XLogwrtResult (because it's not needed there), the whole idea
of there being symmetry between those two structs crumbles down.
Because we later stop using struct-assign anyway, meaning we no longer
need the structs to match, we can instead spell out the members in
XLogCtl and call it a day.

So what I do in the attached 0001 is stop using the XLogwrtResult struct
in XLogCtl and replace it with separate Write and Flush values, and add
the macro XLogUpdateLocalLogwrtResult() that copies the values of Write
and Flush from the shared XLogCtl to the local variable given as macro
argument.  (I also added our idiomatic do {} while(0) to the macro
definition, for safety).  The new members are XLogCtl->logWriteResult
and XLogCtl->logFlushResult and as of 0001 are just XLogRecPtr, so
essentially identical semantics as the previous code.  No atomic access
yet!

0002 then adds pg_atomic_monotonic_advance_u64.  (I don't add the _u32
variant, because I don't think it's a great idea to add dead code.  If
later we see a need for it we can put it in.)  It also changes the two
new members to be atomics, changes the macro to use atomic read, and
XLogWrite now uses monotonic increment.  A couple of other places can
move the macro calls to occur outside the spinlock.  Also, XLogWrite
gains the invariant checking that involves Write and Flush.

Finally, 0003 adds the Copy pointer to XLogCtl alongside Write and
Flush, and updates WALReadFromBuffers to test that instead of the Write
pointer, and adds in XLogWrite the invariant checks that involve the
Copy pointer.

I haven't rerun Bharath's test loop yet; will do so shortly.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"You're _really_ hosed if the person doing the hiring doesn't understand
relational systems: you end up with a whole raft of programmers, none of
whom has had a Date with the clue stick."  (Andrew Sullivan)
https://postgr.es/m/20050809113420.gd2...@phlogiston.dyndns.org
>From e6ddbe87d598c6a1090e39845a4a308060f0af1e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 3 Apr 2024 11:23:12 +0200
Subject: [PATCH v15 1/3] split XLogCtl->LogwrtResult into separate struct
 members

---
 src/backend/access/transam/xlog.c | 59 ++-
 1 file changed, 35 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 87ea03954b..6213d99561 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -291,16 +291,15 @@ static bool doPageWrites;
  *
  * LogwrtRqst indicates a byte position that we need to write and/or fsync
  * the log up to (all records before that point must be written or fsynced).
- * LogwrtResult indicates the byte positions we have already written/fsynced.
- * These structs are identical but are declared separately to indicate their
- * slightly different functions.
+ * The positions already written/fsynced are maintained in logWriteResult
+ * and logFlushResult.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * To read XLogCtl->logWriteResult or ->logFlushResult, you must hold either
+ * info_lck or WALWriteLock.  To update them, you need to hold both locks.
+ * The point of this arrangement is that the value can be examined by code
+ * that already holds WALWriteLock without needing to grab info_lck as well.
+ * In addition to the shared variable, each backend has a private copy of
+ * both in LogwrtResult, which is updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
  * (protected by info_lck), but we don't need to cache any copies of it.
@@ -478,7 +477,8 @@ typedef struct XLogCtlData
 	 * Protected by info_lck and WALWriteLock (you must hold either lock to
 	 * read it, but both to update)
 	 */
-	XLogwrtResult LogwrtResult;
+	XLogRecPtr	logWriteResult; /* last byte + 1 written out */
+	XLogRecPtr	logFlushResult; /* last byte + 1 flushed */
 
 	/*
 	 * Latest initialized page in the cache (last byte position + 1).
@@ -614,6 +614,15 @@ static int	UsableBytesInSegment;
  */
 static XLogwrtResult LogwrtResult = {0, 0};
 
+/*
+ * Update local copy of shared XLogCtl->log{Write,Flush}Result
+ */
+#define XLogUpdateLocalLogwrtResult(_target) \
+	do { \
+		_target.Write = 

Re: Fix for timestamp lag issue from emit_log_hook when GUC log_line_prefix has '%m'

2024-04-03 Thread Michael Paquier
On Mon, Mar 18, 2024 at 02:12:57PM +0900, Michael Paquier wrote:
> I may be able to get this one committed just before the feature freeze
> of v17, as timestamp consistency for hooks that call
> log_status_format() is narrow.  For now, I have added an entry in the
> CF app to keep track of it:
> https://commitfest.postgresql.org/48/4901/

While looking again at that. there were two more comments that missed
a refresh about JSON in get_formatted_log_time() and
get_formatted_start_time().  It's been a few weeks since the last
update, but I'll be able to wrap that tomorrow, updating these
comments on the way.
--
Michael




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Alexander Korotkov
Hi, Alvaro!

Thank you for your feedback.

On Wed, Apr 3, 2024 at 9:58 AM Alvaro Herrera  wrote:
> Hello, I noticed that commit 06c418e163e9 uses waitLSN->mutex (a
> spinlock) to protect the contents of waitLSN -- and it's used to walk an
> arbitrary long list of processes waiting ... and also, an arbitrary
> number of processes could be calling this code.  I think using a
> spinlock for this is unwise, as it'll cause busy-waiting whenever
> there's contention.  Wouldn't it be better to use an LWLock for this?
> Then the processes would sleep until the lock is freed.
>
> While nosing about the code, other things struck me:
>
> I think there should be more comments about WaitLSNProcInfo and
> WaitLSNState in waitlsn.h.
>
> In addLSNWaiter it'd be better to assign 'cur' before acquiring the
> lock.
>
> Is a plan array really the most efficient data structure for this,
> considering that you have to reorder each time you add an element?
> Maybe it is, but wouldn't it make sense to use memmove() when adding one
> element rather iterating all the remaining elements to the end of the
> queue?
>
> I think the include list in waitlsn.c could be tightened a bit:

I've just pushed a commit which shortens the include list and fixes the
order of assigning 'cur' and taking the spinlock in addLSNWaiter().

Regarding the shmem data structure for LSN waiters.  I didn't pick
LWLock or ConditionVariable, because I needed the ability to wake up
only those waiters whose LSN is already replayed.  In my experience
waking up a process is way slower than scanning a short flat array.

However, I agree that when the number of waiters is very high, a flat
array may become a problem.  It seems that the pairing heap is not
hard to use for shmem structures.  The only memory allocation call in
pairingheap.c is in pairingheap_allocate().  So, it's only needed to
be able to initialize the pairing heap in-place, and it will be fine
for shmem.

I'll come back with switching to the pairing heap shortly.

--
Regards,
Alexander Korotkov




Re: Use streaming read API in ANALYZE

2024-04-03 Thread Jakub Wartak
On Tue, Apr 2, 2024 at 9:24 AM Nazir Bilal Yavuz  wrote:
[..]
> v4 is rebased on top of v14 streaming read API changes.

Hi Nazir, so with the streaming read API committed, I gave this patch a try.
With autovacuum=off and a 30GB table on NVMe (with standard readahead of
256kb and ext4, Debian 12, kernel 6.1.0, shared_buffers = 128MB
default) created using: create table t as select repeat('a', 100) || i
|| repeat('b', 500) as filler from generate_series(1, 4500) as i;

On master, the effect of maintenance_io_concurrency [default 10] looks like
this (when resetting the fs cache after each ANALYZE):

m_io_c = 0:
Time: 3137.914 ms (00:03.138)
Time: 3094.540 ms (00:03.095)
Time: 3452.513 ms (00:03.453)

m_io_c = 1:
Time: 2972.751 ms (00:02.973)
Time: 2939.551 ms (00:02.940)
Time: 2904.428 ms (00:02.904)

m_io_c = 2:
Time: 1580.260 ms (00:01.580)
Time: 1572.132 ms (00:01.572)
Time: 1558.334 ms (00:01.558)

m_io_c = 4:
Time: 938.304 ms
Time: 931.772 ms
Time: 920.044 ms

m_io_c = 8:
Time: 666.025 ms
Time: 660.241 ms
Time: 648.848 ms

m_io_c = 16:
Time: 542.450 ms
Time: 561.155 ms
Time: 539.683 ms

m_io_c = 32:
Time: 538.487 ms
Time: 541.705 ms
Time: 538.101 ms

with patch applied:

m_io_c = 0:
Time: 3106.469 ms (00:03.106)
Time: 3140.343 ms (00:03.140)
Time: 3044.133 ms (00:03.044)

m_io_c = 1:
Time: 2959.817 ms (00:02.960)
Time: 2920.265 ms (00:02.920)
Time: 2911.745 ms (00:02.912)

m_io_c = 2:
Time: 1581.912 ms (00:01.582)
Time: 1561.444 ms (00:01.561)
Time: 1558.251 ms (00:01.558)

m_io_c = 4:
Time: 908.116 ms
Time: 901.245 ms
Time: 901.071 ms

m_io_c = 8:
Time: 619.870 ms
Time: 620.327 ms
Time: 614.266 ms

m_io_c = 16:
Time: 529.885 ms
Time: 526.958 ms
Time: 528.474 ms

m_io_c = 32:
Time: 521.185 ms
Time: 520.713 ms
Time: 517.729 ms

No difference visible to me, which seems good. I've double-checked that
the patch used the new code path:

acquire_sample_rows -> heapam_scan_analyze_next_block ->
ReadBufferExtended -> ReadBuffer_common (inlined) -> WaitReadBuffers
-> mdreadv -> FileReadV -> pg_preadv (inlined)
acquire_sample_rows -> heapam_scan_analyze_next_block ->
ReadBufferExtended -> ReadBuffer_common (inlined) -> StartReadBuffer
-> ...

I also gave io_combine_limit = 32 (the max, 256kB) a try and got these
slightly better results:

[..]
m_io_c = 16:
Time: 494.599 ms
Time: 496.345 ms
Time: 973.500 ms

m_io_c = 32:
Time: 461.031 ms
Time: 449.037 ms
Time: 443.375 ms

and that (the last one) apparently was able to push it into the ~50-60k
random IOPS range; the rareq-sz was still ~8 (9.9) kB as ANALYZE was
still reading randomly, so I assume no merging was done:

Device: nvme0n1
  r/s 61212.00  rMB/s 591.82   rrqm/s 0.00  %rrqm 0.00  r_await 0.10  rareq-sz 9.90
  w/s 2.00      wMB/s  0.02    wrqm/s 0.00  %wrqm 0.00  w_await 0.00  wareq-sz 12.00
  d/s 0.00      dMB/s  0.00    drqm/s 0.00  %drqm 0.00  d_await 0.00  dareq-sz 0.00
  f/s 0.00      f_await 0.00   aqu-sz 6.28  %util 85.20

So in short it looks good to me.

-Jakub Wartak.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Bharath Rupireddy
On Wed, Apr 3, 2024 at 12:20 PM Bertrand Drouvot
 wrote:
>
> > Please find the attached v31 patches implementing the above idea:
>
> Some comments related to v31-0001:
>
> === testing the behavior
>
> T1 ===
>
> > - synced slots get their own inactive_since just like any other slot
>
> It behaves as described.
>
> T2 ===
>
> > - synced slots inactive_since is set to current timestamp after the
> > standby gets promoted to help inactive_since interpret correctly just
> > like any other slot.
>
> It behaves as described.

Thanks for testing.

> CR1 ===
>
> +inactive_since value will get updated
> +after every synchronization
>
> indicates the last synchronization time? (I think that after every 
> synchronization
> could lead to confusion).

Done.

> CR2 ===
>
> +   /*
> +* Set the time since the slot has become inactive 
> after shutting
> +* down slot sync machinery. This helps correctly 
> interpret the
> +* time if the standby gets promoted without a 
> restart.
> +*/
>
> It looks to me that this comment is not at the right place because there is
> nothing after the comment that indicates that we shutdown the "slot sync 
> machinery".
>
> Maybe a better place is before the function definition and mention that this 
> is
> currently called when we shutdown the "slot sync machinery"?

Done.

> CR3 ===
>
> +* We get the current time beforehand and only once 
> to avoid
> +* system calls overhead while holding the lock.
>
> s/avoid system calls overhead while holding the lock/avoid system calls while 
> holding the spinlock/?

Done.

> CR4 ===
>
> +* Set the time since the slot has become inactive. We get the current
> +* time beforehand to avoid system call overhead while holding the 
> lock
>
> Same.

Done.

> CR5 ===
>
> +   # Check that the captured time is sane
> +   if (defined $reference_time)
> +   {
>
> s/Check that the captured time is sane/Check that the inactive_since is sane/?
>
> Sorry if some of those comments could have been done while I did review 
> v29-0001.

Done.

On Wed, Apr 3, 2024 at 2:58 PM shveta malik  wrote:
>
> Thanks for the patches, please find few comments:
>
> v31-001:
>
> 1)
> system-views.sgml:
> value will get updated  after every synchronization from the
> corresponding remote slot on the primary.
>
> --This is confusing. It will be good to rephrase it.

Done as per Bertrand's suggestion.

> 2)
> update_synced_slots_inactive_since()
>
> --May be, we should mention in the header that this function is called
> only during promotion.

Done as per Bertrand's suggestion.

> 3) 040_standby_failover_slots_sync.pl:
> We capture inactive_since_on_primary when we do this for the first time at 
> #175
> ALTER SUBSCRIPTION regress_mysub1 DISABLE"
>
> But we again recreate the sub and disable it at line #280.
> Do you think we shall get inactive_since_on_primary again here, to be
> compared with inactive_since_on_new_primary later?

Hm. Done that. I'm now recapturing both slot_creation_time_on_primary and
inactive_since_on_primary before and after the CREATE SUBSCRIPTION that
creates the slot again on the primary/publisher.

On Wed, Apr 3, 2024 at 3:32 PM Amit Kapila  wrote:
>
> > CR2 ===
> >
> > +   /*
> > +* Set the time since the slot has become inactive 
> > after shutting
> > +* down slot sync machinery. This helps correctly 
> > interpret the
> > +* time if the standby gets promoted without a 
> > restart.
> > +*/
> >
> > It looks to me that this comment is not at the right place because there is
> > nothing after the comment that indicates that we shutdown the "slot sync 
> > machinery".
> >
> > Maybe a better place is before the function definition and mention that 
> > this is
> > currently called when we shutdown the "slot sync machinery"?
> >
> Won't it be better to have an assert for SlotSyncCtx->pid? IIRC, we
> have some existing issues where we don't ensure that no one is running
> sync API before shutdown is complete but I think we can deal with that
> separately and here we can still have an Assert.

That can work to ensure the slot sync worker isn't running as
SlotSyncCtx->pid gets updated only for the slot sync worker. I added
this assertion for now.

We need to ensure (in a separate patch and thread) there is no backend
acquiring it and performing sync while the slot sync worker is
shutting down. Otherwise, some of the slots can get resynced and some
are not while we are shutting down the slot sync worker as part of the
standby promotion which might leave the slots in an inconsistent
state.

> > CR3 ===
> >
> > +* We get the current time beforehand and only once 
> > to avoid
> > +* system calls overhead while holding 

Re: Improve eviction algorithm in ReorderBuffer

2024-04-03 Thread Jeff Davis
On Mon, 2024-04-01 at 12:42 +0900, Masahiko Sawada wrote:
> While reviewing the patches, I realized the comment of
> binearyheap_allocate() should also be updated. So I've attached the
> new patches.

In sift_{up|down}, each loop iteration calls set_node(), and each call
to set_node does a hash lookup. I didn't measure it, but that feels
wasteful.

I don't even think you really need the hash table. The key to the hash
table is a pointer, so it's not really doing anything that couldn't be
done more efficiently by just following the pointer.

I suggest that you add a "heap_index" field to ReorderBufferTXN that
would point to the index into the heap's array (the same as
bh_nodeidx_entry.index in your patch). Each time an element moves
within the heap array, just follow the pointer to the ReorderBufferTXN
object and update the heap_index -- no hash lookup required.

That's not easy to do with the current binaryheap API. But a binary
heap is not a terribly complex structure, so you can just do an inline
implementation of it where sift_{up|down} know to update the heap_index
field of the ReorderBufferTXN.

Regards,
Jeff Davis





Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Amit Kapila
On Wed, Apr 3, 2024 at 12:20 PM Bertrand Drouvot
 wrote:
>
> On Wed, Apr 03, 2024 at 11:17:41AM +0530, Bharath Rupireddy wrote:
> > On Wed, Apr 3, 2024 at 8:38 AM shveta malik  wrote:
> > >
> > > > > Or a simple solution is that the slotsync worker updates
> > > > > inactive_since as it does for non-synced slots, and disables
> > > > > timeout-based slot invalidation for synced slots.
> > >
> > > I like this idea better, it takes care of such a case too when the
> > > user is relying on sync-function rather than worker and does not want
> > > to get the slots invalidated in between 2 sync function calls.
> >
> > Please find the attached v31 patches implementing the above idea:
>
> Thanks!
>
> Some comments related to v31-0001:
>
> === testing the behavior
>
> T1 ===
>
> > - synced slots get their own inactive_since just like any other slot
>
> It behaves as described.
>
> T2 ===
>
> > - synced slots inactive_since is set to current timestamp after the
> > standby gets promoted to help inactive_since interpret correctly just
> > like any other slot.
>
> It behaves as described.
>
> CR1 ===
>
> +inactive_since value will get updated
> +after every synchronization
>
> indicates the last synchronization time? (I think that after every 
> synchronization
> could lead to confusion).
>

+1.

> CR2 ===
>
> +   /*
> +* Set the time since the slot has become inactive 
> after shutting
> +* down slot sync machinery. This helps correctly 
> interpret the
> +* time if the standby gets promoted without a 
> restart.
> +*/
>
> It looks to me that this comment is not at the right place because there is
> nothing after the comment that indicates that we shutdown the "slot sync 
> machinery".
>
> Maybe a better place is before the function definition and mention that this 
> is
> currently called when we shutdown the "slot sync machinery"?
>

Won't it be better to have an assert for SlotSyncCtx->pid? IIRC, we
have some existing issues where we don't ensure that no one is running
sync API before shutdown is complete but I think we can deal with that
separately and here we can still have an Assert.

> CR3 ===
>
> +* We get the current time beforehand and only once 
> to avoid
> +* system calls overhead while holding the lock.
>
> s/avoid system calls overhead while holding the lock/avoid system calls while 
> holding the spinlock/?
>

Is it valid to say that there is overhead of this call while holding
spinlock? Because I don't think at the time of promotion we expect any
other concurrent slot activity. The first reason seems good enough.

One other observation:
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -42,6 +42,7 @@
 #include "access/transam.h"
 #include "access/xlog_internal.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogutils.h"

Is there a reason for this inclusion? I don't see any change which
should need this one.

-- 
With Regards,
Amit Kapila.




Re: Fix out-of-bounds in the function PQescapeinternal (src/interfaces/libpq/fe-exec.c)

2024-04-03 Thread Ranier Vilela
Em ter., 2 de abr. de 2024 às 18:13, Tom Lane  escreveu:

> Ranier Vilela  writes:
> > While I working in [1], Coverity reported some errors:
> > src/bin/pg_basebackup/pg_createsubscriber.c
> > CID 1542690: (#1 of 2): Out-of-bounds access (OVERRUN)
> > alloc_strlen: Allocating insufficient memory for the terminating null of
> > the string. [Note: The source code implementation of the function has
> been
> > overridden by a builtin model.]
> > CID 1542690: (#2 of 2): Out-of-bounds access (OVERRUN)
> > alloc_strlen: Allocating insufficient memory for the terminating null of
> > the string. [Note: The source code implementation of the function has
> been
> > overridden by a builtin model.]
>
> Yeah, we saw that in the community run too.  I'm tempted to call it
> an AI hallucination.  The "Note" seems to mean that they're not
> actually analyzing our code but some made-up substitute.
>
Yeah, the message is a little confusing.
It seems to me that Coverity replaces the malloc function with its own
model, with some kind of security cookie,
like /GS (Buffer Security Check).



> > The source of errors is the function PQescapeInternal.
> > The slow path has bugs when num_quotes or num_backslashes are greater
> than
> > zero.
> > For each num_quotes or num_backslashes we need to allocate two more.
>
> Nonsense.  The quote or backslash is already counted in input_len,
> so we need to add just one more.
>
Right, the quote or backslash is counted in input_len.
In the test I did, the string had 10 quotes, so
input_len had at least 10 bytes for quotes.
But we write 20 quotes, in the slow path.

if (*s == quote_char || (!as_ident && *s == '\\'))
    *rp++ = *s;
*rp++ = *s;

|0| = quote_char
|1| = quote_char
|2| = quote_char
|3| = quote_char
|4| = quote_char
|5| = quote_char
|6| = quote_char
|7| = quote_char
|8| = quote_char
|9| = quote_char
|10| = quote_char <--memory overwrite
|11| = quote_char
|12| = quote_char
|13| = quote_char
|14| = quote_char
|15| = quote_char
|16| = quote_char
|17| = quote_char
|18| = quote_char
|19| = quote_char



> If there were anything wrong here, I'm quite sure our testing under
> e.g. valgrind would have caught it years ago.  However, just to be
> sure, I tried adding an Assert that the allocated space is filled
> exactly, as attached.  It gets through check-world just fine.
>
It seems to me that only the fast path is being validated by the assert.

if (num_quotes == 0 && (num_backslashes == 0 || as_ident))
{
    memcpy(rp, str, input_len);
    rp += input_len;
}

best regards,
Ranier Vilela


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Alvaro Herrera
Hello, I noticed that commit 06c418e163e9 uses waitLSN->mutex (a
spinlock) to protect the contents of waitLSN -- and it's used to walk an
arbitrary long list of processes waiting ... and also, an arbitrary
number of processes could be calling this code.  I think using a
spinlock for this is unwise, as it'll cause busy-waiting whenever
there's contention.  Wouldn't it be better to use an LWLock for this?
Then the processes would sleep until the lock is freed.

While nosing about the code, other things struck me:

I think there should be more comments about WaitLSNProcInfo and
WaitLSNState in waitlsn.h.

In addLSNWaiter it'd be better to assign 'cur' before acquiring the
lock.

Is a plain array really the most efficient data structure for this,
considering that you have to reorder each time you add an element?
Maybe it is, but wouldn't it make sense to use memmove() when adding one
element rather than iterating over all the remaining elements to the end
of the queue?

I think the include list in waitlsn.c could be tightened a bit:

@@ -18,28 +18,18 @@
 #include 
 
 #include "pgstat.h"
-#include "fmgr.h"
-#include "access/transam.h"
-#include "access/xact.h"
 #include "access/xlog.h"
-#include "access/xlogdefs.h"
 #include "access/xlogrecovery.h"
-#include "catalog/pg_type.h"
 #include "commands/waitlsn.h"
-#include "executor/spi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "storage/ipc.h"
 #include "storage/latch.h"
-#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
-#include "storage/sinvaladt.h"
-#include "utils/builtins.h"
 #include "utils/pg_lsn.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 #include "utils/fmgrprotos.h"
+#include "utils/wait_event_types.h"

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Las cosas son buenas o malas segun las hace nuestra opinión" (Lisias)




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Daniel Gustafsson
Buildfarm animal mamba (NetBSD 10.0 on macppc) started failing on this with
what seems like a bogus compiler warning:

waitlsn.c: In function 'WaitForLSN':
waitlsn.c:275:24: error: 'endtime' may be used uninitialized in this function 
[-Werror=maybe-uninitialized]
  275 |delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
  |   ~^~~~
cc1: all warnings being treated as errors

endtime is indeed initialized further up, but initializing endtime at
declaration seems innocent enough and should make this compiler happy and the
buildfarm greener.

diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index 6679378156..17ad0057ad 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -226,7 +226,7 @@ void
 WaitForLSN(XLogRecPtr targetLSN, int64 timeout)
 {
XLogRecPtr  currentLSN;
-   TimestampTz endtime;
+   TimestampTz endtime = 0;

/* Shouldn't be called when shmem isn't initialized */
Assert(waitLSN);

--
Daniel Gustafsson





Is it safe to cache data by GiST consistent function

2024-04-03 Thread Michał Kłeczek
Hello,

When implementing a GiST consistent function I found the need to cache 
pre-processed query across invocations.
I am not sure if it is safe to do (or I need to perform some steps to make sure 
cached info is not leaked between rescans).

The comment in gistrescan says:

/*
 * If this isn't the first time through, preserve the fn_extra
 * pointers, so that if the consistentFns are using them to 
cache
 * data, that data is not leaked across a rescan.
 */

which seems to me self-contradictory as fn_extra is preserved between rescans 
(so leaks are indeed possible).

Am I missing something?

Thanks,
Michal



Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread shveta malik
On Wed, Apr 3, 2024 at 11:17 AM Bharath Rupireddy
 wrote:
>
> On Wed, Apr 3, 2024 at 8:38 AM shveta malik  wrote:
> >
> > > > Or a simple solution is that the slotsync worker updates
> > > > inactive_since as it does for non-synced slots, and disables
> > > > timeout-based slot invalidation for synced slots.
> >
> > I like this idea better, it takes care of such a case too when the
> > user is relying on sync-function rather than worker and does not want
> > to get the slots invalidated in between 2 sync function calls.
>
> Please find the attached v31 patches implementing the above idea:
>

Thanks for the patches, please find few comments:

v31-001:

1)
system-views.sgml:
value will get updated  after every synchronization from the
corresponding remote slot on the primary.

--This is confusing. It will be good to rephrase it.

2)
update_synced_slots_inactive_since()

--May be, we should mention in the header that this function is called
only during promotion.

3) 040_standby_failover_slots_sync.pl:
We capture inactive_since_on_primary when we do this for the first time at #175
ALTER SUBSCRIPTION regress_mysub1 DISABLE"

But we again recreate the sub and disable it at line #280.
Do you think we shall get inactive_since_on_primary again here, to be
compared with inactive_since_on_new_primary later?


v31-002:
(I had reviewed v29-002 but missed to post comments,  I think these
are still applicable)

1) I think replication_slot_inactivity_timeout was recommended here
(instead of replication_slot_inactive_timeout, so please give it a
thought):
https://www.postgresql.org/message-id/202403260739.udlp7lxixktx%40alvherre.pgsql

2) Commit msg:
a)
"It is often easy for developers to set a timeout of say 1
or 2 or 3 days at slot level, after which the inactive slots get
dropped."

Shall we say invalidated rather than dropped?

b)
"To achieve the above, postgres introduces a GUC allowing users
set inactive timeout and then a slot stays inactive for this much
amount of time it invalidates the slot."

Broken sentence.



thanks
Shveta




Re: Confusing #if nesting in hmac_openssl.c

2024-04-03 Thread Michael Paquier
On Tue, Apr 02, 2024 at 03:56:13PM +0200, Daniel Gustafsson wrote:
> > On 2 Apr 2024, at 15:50, Tom Lane  wrote:
> 
> > I'll go ahead with what I have.
> 
> +1

+#ifdef USE_RESOWNER_FOR_HMAC 

Why not, that's cleaner.  Thanks for the commit.  The interactions
between this code and b8bff07da are interesting.
--
Michael




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Bertrand Drouvot
Hi,

On Wed, Apr 03, 2024 at 11:17:41AM +0530, Bharath Rupireddy wrote:
> On Wed, Apr 3, 2024 at 8:38 AM shveta malik  wrote:
> >
> > > > Or a simple solution is that the slotsync worker updates
> > > > inactive_since as it does for non-synced slots, and disables
> > > > timeout-based slot invalidation for synced slots.
> >
> > I like this idea better, it takes care of such a case too when the
> > user is relying on sync-function rather than worker and does not want
> > to get the slots invalidated in between 2 sync function calls.
> 
> Please find the attached v31 patches implementing the above idea:

Thanks!

Some comments related to v31-0001:

=== testing the behavior

T1 ===

> - synced slots get their own inactive_since just like any other slot

It behaves as described.

T2 ===

> - synced slots inactive_since is set to current timestamp after the
> standby gets promoted to help inactive_since interpret correctly just
> like any other slot.
 
It behaves as described.

CR1 ===

+inactive_since value will get updated
+after every synchronization

indicates the last synchronization time? (I think that after every 
synchronization
could lead to confusion).

CR2 ===

+   /*
+* Set the time since the slot has become inactive 
after shutting
+* down slot sync machinery. This helps correctly 
interpret the
+* time if the standby gets promoted without a restart.
+*/

It looks to me that this comment is not at the right place because there is
nothing after the comment that indicates that we shutdown the "slot sync 
machinery".

Maybe a better place is before the function definition and mention that this is
currently called when we shutdown the "slot sync machinery"?

CR3 ===

+* We get the current time beforehand and only once to 
avoid
+* system calls overhead while holding the lock.

s/avoid system calls overhead while holding the lock/avoid system calls while 
holding the spinlock/?

CR4 ===

+* Set the time since the slot has become inactive. We get the current
+* time beforehand to avoid system call overhead while holding the lock

Same.

CR5 ===

+   # Check that the captured time is sane
+   if (defined $reference_time)
+   {

s/Check that the captured time is sane/Check that the inactive_since is sane/?

Sorry if some of those comments could have been done while I did review 
v29-0001.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Daniel Gustafsson
> On 3 Apr 2024, at 19:38, Tom Lane  wrote:
> 
> Jacob Champion  writes:
>> The RHEL7-alikes are the biggest set, but that's already discussed
>> above. Looks like SUSE 12 goes EOL later this year (October 2024), and
>> it ships OpenSSL 1.1.1 as an option. Already-dead distros are Ubuntu
>> 16.04 (April 2021), Photon 2 (January 2023), and Photon 3 (March
>> 2024). That leaves AL2, OpenIndiana Hipster, and Solaris 11.4, all of
>> which appear to have newer versions of OpenSSL shipped and selectable.
> 
> The discussion we had last year concluded that we were OK with
> dropping 1.0.1 support when RHEL6 goes out of extended support
> (June 2024 per this thread, I didn't check it).  Seems like we
> should have the same policy for RHEL7.  Also, calling Photon 3
> dead because it went EOL three days ago seems over-hasty.
> 
> Bottom line for me is that pulling 1.0.1 support now is OK,
> but I think pulling 1.0.2 is premature.

Is Red Hat building and shipping v17 packages for RHEL7 ELS customers?  If
not then it seems mostly academic to tie our dependencies to RHEL ELS unless
I'm missing something.

--
Daniel Gustafsson





Re: Use streaming read API in ANALYZE

2024-04-03 Thread Nazir Bilal Yavuz
Hi,

Thank you for looking into this!

On Wed, 3 Apr 2024 at 20:17, Heikki Linnakangas  wrote:
>
> On 03/04/2024 13:31, Nazir Bilal Yavuz wrote:
> > Streaming API has been committed but the committed version has a minor
> > change, the read_stream_begin_relation function takes Relation instead
> > of BufferManagerRelation now. So, here is a v5 which addresses this
> > change.
>
> I'm getting a repeatable segfault / assertion failure with this:
>
> postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10);
> CREATE TABLE
> postgres=# insert into tengiga select g, repeat('x', 900) from
> generate_series(1, 140) g;
> INSERT 0 140
> postgres=# set default_statistics_target = 10; ANALYZE tengiga;
> SET
> ANALYZE
> postgres=# set default_statistics_target = 100; ANALYZE tengiga;
> SET
> ANALYZE
> postgres=# set default_statistics_target =1000; ANALYZE tengiga;
> SET
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File:
> "heapam_handler.c", Line: 1079, PID: 262232
> postgres: heikki postgres [local]
> ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8]
> postgres: heikki postgres [local]
> ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34]
> postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34]
> postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a]
> postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9]
> postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0]
> postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe]
> postgres: heikki postgres [local]
> ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9]
> postgres: heikki postgres [local]
> ANALYZE(ProcessUtility+0x136)[0x564889f0afb1]
> postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8]
> postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b]
> postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015]
> postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6]
> postgres: heikki postgres [local]
> ANALYZE(PostgresMain+0x80c)[0x564889f06fd7]
> postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876]
> postgres: heikki postgres [local]
> ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3]
> postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e]
> postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0]
> postgres: heikki postgres [local]
> ANALYZE(PostmasterMain+0x152b)[0x564889e2214d]
> postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4]
> /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305]
> postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61]
> 2024-04-03 20:15:49.157 EEST [262101] LOG:  server process (PID 262232)
> was terminated by signal 6: Aborted

I ran into the same error while working on Jakub's benchmarking results.

Cause: I was using the nblocks variable to check how many blocks will
be returned from the streaming API. But I realized that sometimes the
number returned from BlockSampler_Init() is not equal to the number of
blocks that BlockSampler_Next() will return, as the block sampling
algorithm decides how many blocks to return on the fly using some
random seeds.

There are a couple of solutions I thought of:

1- Use BlockSampler_HasMore() instead of nblocks in the main loop in
the acquire_sample_rows():

The streaming API uses this function to prefetch block numbers.
BlockSampler_HasMore() will reach the end first, since it is used while
prefetching, so it will start returning false while there are still
buffers left to return from the streaming API. That would cause some
buffers at the end to not be processed.

2- Expose something (function, variable etc.) from the streaming API
to understand if the read is finished and there is no buffer to
return:

I think this works but I am not sure if the streaming API allows
something like that.

3- Check every buffer returned from the streaming API, if it is
invalid stop the main loop in the acquire_sample_rows():

This solves the problem, but there will be two if-checks for each
buffer returned:
- in heapam_scan_analyze_next_block(), to check if the returned buffer is invalid
- in acquire_sample_rows(), to break the main loop if
heapam_scan_analyze_next_block() returns false
One of the checks can be avoided by moving
heapam_scan_analyze_next_block()'s code into the main loop of
acquire_sample_rows().

I implemented the third solution, here is v6.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 3 Apr 2024 15:14:15 +0300
Subject: [PATCH v6] Use streaming read API in ANALYZE

ANALYZE command gets random tuples using 

Re: psql not responding to SIGINT upon db reconnection

2024-04-03 Thread Robert Haas
On Wed, Apr 3, 2024 at 11:17 AM Tristan Partin  wrote:
> I think patch 2 makes it worse. The value in -Wswitch is that when new
> enum variants are added, the developer knows the locations to update.
> Adding a default case makes -Wswitch pointless.
>
> Patch 1 is still good. The comment change in patch 2 is good too!

It seems to me that 0001 should either remove the pg_unreachable()
call or change the break to a return, but not both. The commit message
tries to justify doing both by saying that the pg_unreachable() call
doesn't have much value, but there's not much value in omitting
pg_unreachable() from unreachable places, either, so I'm not
convinced.

I agree with Tristan's analysis of 0002.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Statistics Import and Export

2024-04-03 Thread Tom Lane
Corey Huinker  writes:
>> As far as that goes, it shouldn't be that hard to deal with, at least
>> not for "soft" errors which hopefully cover most input-function
>> failures these days.  You should be invoking array_in via
>> InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
>> (Look at pg_input_error_info() for useful precedent.)

> Ah, my understanding may be out of date. I was under the impression that
> that mechanism relied on the cooperation of the per-element input
> function, so even if we got all the builtin datatypes to play nice with
> *Safe(), we were always going to be at risk with a user-defined input
> function.

That's correct, but it's silly not to do what we can.  Also, I imagine
that there is going to be high evolutionary pressure on UDTs to
support soft error mode for COPY, so over time the problem will
decrease --- as long as we invoke the soft error mode.

> 1. NULL input =>  Return NULL. (because strict).
> 2. Actual error (permissions, cache lookup not found, etc) => Raise ERROR
> (thus ruining binary upgrade)
> 3. Call values are so bad (examples: attname not found, required stat
> missing) that nothing can recover => WARN, return FALSE.
> 4. At least one stakind-stat is wonky (impossible for datatype, missing
> stat pair, wrong type on input parameter), but that's the worst of it => 1
> to N WARNs, write stats that do make sense, return TRUE.
> 5. Hunky-dory. => No warns. Write all stats. return TRUE.

> Which of those seem like good ereturn candidates to you?

I'm good with all those behaviors.  On reflection, the design I was
vaguely imagining wouldn't cope with case 4 (multiple WARNs per call)
so never mind that.

regards, tom lane




Re: Security lessons from liblzma - libsystemd

2024-04-03 Thread Andres Freund
Hi,

On 2024-04-03 17:58:55 -0400, Tom Lane wrote:
> Magnus Hagander  writes:
> > On Wed, Apr 3, 2024 at 7:57 PM Andres Freund  wrote:
> >> Openssh has now integrated [1] a patch to remove the dependency on
> >> libsystemd
> >> for triggering service manager readyness notifications, by inlining the
> >> necessary function. That's not hard, the protocol is pretty simple.
> >> I suspect we should do the same. We're not even close to being a target as
> >> attractive as openssh, but still, it seems unnecessary.
> 
> > +1.
> 
> I didn't read the patch, but if it's short and stable enough then this
> seems like a good idea.

It's basically just checking for an env var, opening the unix socket indicated
by that, writing a string to it and closing the socket again.


> (If openssh and we are using such a patch, that will probably be a big
> enough stake in the ground to prevent somebody deciding to change the
> protocol ...)

One version of the openssh patch to remove liblzma was submitted by one of the
core systemd devs, so I think they agree that it's a stable API.  The current
protocol supports adding more information by adding attributes, so it should
be extensible enough anyway.


> >> An argument could be made to instead just remove support, but I think it's
> >> quite valuable to have intra service dependencies that can rely on the
> >> server actually having started up.
> 
> > If we remove support we're basically just asking most of our linux
> > packagers to add it back in, and they will add it back in the same way we
> > did it. I think we do everybody a disservice if we do that. It's useful
> > functionality.
> 
> Yeah, that idea seems particularly silly in view of the desire
> expressed earlier in this thread to reduce the number of patches
> carried by packagers.  People packaging for systemd-using distros
> will not consider that this functionality is optional.

Yep.

Greetings,

Andres Freund




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Thomas Munro
On Thu, Apr 4, 2024 at 11:51 AM Peter Eisentraut  wrote:
> On 30.03.24 22:27, Thomas Munro wrote:
> > Hmm, OK so it doesn't have 3 available in parallel from base repos.
> > But it's also about to reach end of "full support" in 2 months[1], so
> > if we applied the policies we discussed in the LLVM-vacuuming thread
> > (to wit: build farm - EOL'd OSes), then...  One question I'm unclear
> > on is whether v17 will be packaged for RHEL8.
>
> The rest of the thread talks about the end of support of RHEL 7, but you
> are here talking about RHEL 8.  It is true that "full support" for RHEL
> 8 ended in May 2024, but that is not the one we are tracking.  We
> are tracking the 10-year one, which I suppose is now called "maintenance
> support".

I might have confused myself with the two EOLs and some wishful
thinking.  I am a lot less worked up about this general topic now that
RHEL has moved to "rolling" LLVM updates in minor releases, removing a
physical-pain-inducing 10-year vacuuming horizon (that's 20 LLVM major
releases and they only fix bugs in one...).  I will leave openssl
discussions to those more knowledgeable about that.

> So if the above package list is correct, then we ought to keep
> supporting openssl 1.1.* until 2029.

That's a shame.  But it sounds like the developer burden isn't so
different from 1.1.1 to 3.x, so maybe it's not such a big deal from
our point of view.  (I have no opinion on the security ramifications
of upstream's EOL, but as a layman it sounds completely bonkers to use
it.  I wonder why the packaging community wouldn't just arrange to
have a supported-by-upstream 3.x package in their RPM repo when they
supply the newest PostgreSQL versions for the oldest RHEL, but again
not my area so I'll shut up).




Re: RFC: Additional Directory for Extensions

2024-04-03 Thread walther

Alvaro Herrera:

I support the idea of there being a second location from where to load
shared libraries ... but I don't like the idea of making it
runtime-configurable.  If we want to continue to tighten up what
superuser can do, then one of the things that has to go away is the
ability to load shared libraries from arbitrary locations
(dynamic_library_path).  I think we should instead look at making those
locations hardcoded at compile time.  The packager can then decide where
those things go, and the superuser no longer has the ability to load
arbitrary code from arbitrary locations.


The use-case for runtime configuration of this seems to be build-time 
testing of extensions against an already installed server. For this 
purpose it should be enough to be able to set this directory at startup 
- it doesn't need to be changed while the server is actually running. 
Then you could spin up a temporary postgres instance with the extension 
directory pointing at the build directory and test.


Would startup-configurable be better than runtime-configurable regarding 
your concerns?


I can also imagine that it would be very helpful in a container setup to 
be able to set an environment variable with this path instead of having 
to recompile all of postgres to change it.


Best,

Wolfgang




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Bertrand Drouvot
Hi,

On Wed, Apr 03, 2024 at 11:17:41AM +0530, Bharath Rupireddy wrote:
> On Wed, Apr 3, 2024 at 8:38 AM shveta malik  wrote:
> >
> > > > Or a simple solution is that the slotsync worker updates
> > > > inactive_since as it does for non-synced slots, and disables
> > > > timeout-based slot invalidation for synced slots.
> >
> > I like this idea better, it takes care of such a case too when the
> > user is relying on sync-function rather than worker and does not want
> > to get the slots invalidated in between 2 sync function calls.
> 
> Please find the attached v31 patches implementing the above idea:

Thanks!

Some comments regarding v31-0002:

=== testing the behavior

T1 ===

> - synced slots don't get invalidated due to inactive timeout because
> such slots not considered active at all as they don't perform logical
> decoding (of course, they will perform in fast_forward mode to fix the
> other data loss issue, but they don't generate changes for them to be
> called as *active* slots)

It behaves as described. OTOH non synced logical slots on the standby and
physical slots on the standby are invalidated which is what is expected.

T2 ===

In case the slot is invalidated on the primary,

primary:

postgres=# select slot_name, inactive_since, invalidation_reason from pg_replication_slots where slot_name = 's1';
 slot_name |        inactive_since         | invalidation_reason
-----------+-------------------------------+---------------------
 s1        | 2024-04-03 06:56:28.075637+00 | inactive_timeout

then on the standby we get:

standby:

postgres=# select slot_name, inactive_since, invalidation_reason from pg_replication_slots where slot_name = 's1';
 slot_name |        inactive_since        | invalidation_reason
-----------+------------------------------+---------------------
 s1        | 2024-04-03 07:06:43.37486+00 | inactive_timeout

shouldn't the slot be dropped/recreated instead of updating inactive_since?

=== code

CR1 ===

+Invalidates replication slots that are inactive for longer the
+specified amount of time

s/for longer the/for longer than/?

CR2 ===

+true) as such synced slots don't actually perform
+logical decoding.

We're switching to fast-forward logical decoding due to [1], so I'm not sure that's 100%
accurate here. I'm not sure we need to specify a reason.

CR3 ===

+ errdetail("This slot has been invalidated because it was inactive for more than the time specified by replication_slot_inactive_timeout parameter.")));

I think we can remove "parameter" (see for example the error message in
validate_remote_info()) and reduce it a bit, something like?

"This slot has been invalidated because it was inactive for more than replication_slot_inactive_timeout"?

CR4 ===

+ appendStringInfoString(&err_detail, _("The slot has been inactive for more than the time specified by replication_slot_inactive_timeout parameter."));

Same.

CR5 ===

+   /*
+* This function isn't expected to be called for inactive timeout based
+* invalidation. A separate function InvalidateInactiveReplicationSlot is
+* to be used for that.

Do you think it's worth explaining why?

CR6 ===

+   if (replication_slot_inactive_timeout == 0)
+   return false;
+   else if (slot->inactive_since > 0)

"else" is not needed here.

CR7 ===

+   SpinLockAcquire(&slot->mutex);
+
+   /*
+* Check if the slot needs to be invalidated due to
+* replication_slot_inactive_timeout GUC. We do this with the spinlock
+* held to avoid race conditions -- for example the inactive_since
+* could change, or the slot could be dropped.
+*/
+   now = GetCurrentTimestamp();

We should not call GetCurrentTimestamp() while holding a spinlock.

CR8 ===

+# Testcase start: Invalidate streaming standby's slot as well as logical
+# failover slot on primary due to inactive timeout GUC. Also, check the logical

s/inactive timeout GUC/replication_slot_inactive_timeout/?

CR9 ===

+# Start: Helper functions used for this test file
+# End: Helper functions used for this test file

I think that's the first TAP test with this comment. Not saying we should not,
but why did you feel the need to add those?

[1]: 
https://www.postgresql.org/message-id/os0pr01mb5716b3942ae49f3f725aca9294...@os0pr01mb5716.jpnprd01.prod.outlook.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns

2024-04-03 Thread Jakub Wartak
Hi Andrey,

On Thu, Mar 28, 2024 at 1:09 PM Andrey M. Borodin  wrote:
>
>
>
> > On 8 Aug 2023, at 12:31, John Naylor  wrote:
> >
> > > > Also the shared counter is the cause of the slowdown, but not the 
> > > > reason for the numeric limit.
> > >
> > > Isn't it both? typedef Oid is unsigned int = 2^32, and according to 
> > > GetNewOidWithIndex() logic if we exhaust the whole OID space it will hang 
> > > indefinitely which has the same semantics as "being impossible"/permanent 
> > > hang (?)
> >
> > Looking again, I'm thinking the OID type size is more relevant for the 
> > first paragraph, and the shared/global aspect is more relevant for the 
> > second.
> >
> > The last issue is how to separate the notes at the bottom, since there are 
> > now two topics.
>
> Jakub, do you have plans to address this feedback? Is the CF entry still 
> relevant?

Yes; I've forgotten about this one and clearly I had problems
formulating it in proper shape to be accepted. I've moved it to the
next CF now as this is not critical and I would prefer to help current
time-sensitive CF. Anyone is welcome to help amend the patch...

-J.




Re: remaining sql/json patches

2024-04-03 Thread jian he
On Wed, Apr 3, 2024 at 3:15 PM jian he  wrote:
>
> On Wed, Apr 3, 2024 at 11:30 AM jian he  wrote:
> >
> > On Tue, Apr 2, 2024 at 9:57 PM Amit Langote  wrote:
> > >
> > > Please let me know if you have further comments on 0001.  I'd like to
> > > get that in before spending more energy on 0002.
> > >

more doc issue with v48. 0001, 0002.

 The optional json_path_name serves as an
 identifier of the provided path_expression.
 The path name must be unique and distinct from the column names.

"path name" should be
json_path_name


git diff --check
doc/src/sgml/func.sgml:19192: trailing whitespace.
+ id |   kind   |  title  | director


+  
+   JSON data stored at a nested level of the row pattern can be extracted using
+   the NESTED PATH clause.  Each
+   NESTED PATH clause can be used to generate one or more
+   columns using the data from a nested level of the row pattern, which can be
+   specified using a COLUMNS clause.  Rows constructed from
+   such columns are called child rows and are joined
+   agaist the row constructed from the columns specified in the parent
+   COLUMNS clause to get the row in the final view.  Child
+   columns may themselves contain a NESTED PATH
+   specifification thus allowing to extract data located at arbitrary nesting
+   levels.  Columns produced by NESTED PATHs at the same
+   level are considered to be siblings and are joined
+   with each other before joining to the parent row.
+  

"agaist" should be "against".
"specifification" should be "specification".
+Rows constructed from
+   such columns are called child rows and are joined
+   agaist the row constructed from the columns specified in the parent
+   COLUMNS clause to get the row in the final view.
this sentence is long, not easy to comprehend, maybe we can rephrase it
or split it into two.



+  | NESTED PATH json_path_specification
 AS path_name 
+COLUMNS ( json_table_column
, ... )
v48, 0002 patch.
in the json_table synopsis section, put these two lines into one line,
I think would make it more readable.
also the following sgml code will render the html into one line.

  NESTED PATH
json_path_specification 
AS json_path_name

  COLUMNS (
json_table_column ,
... )


also path_name should be
json_path_name.



+
+ The NESTED PATH syntax is recursive,
+ so you can go down multiple nested levels by specifying several
+ NESTED PATH subclauses within each other.
+ It allows to unnest the hierarchy of JSON objects and arrays
+ in a single function invocation rather than chaining several
+ JSON_TABLE expressions in an SQL statement.
+
"The NESTED PATH syntax is recursive"
should be
`
The NESTED PATH syntax can be recursive,
you can go down multiple nested levels by specifying several
NESTED PATH subclauses within each other.
`




Re: Use streaming read API in ANALYZE

2024-04-03 Thread Nazir Bilal Yavuz
Hi,

On Tue, 2 Apr 2024 at 10:23, Nazir Bilal Yavuz  wrote:
>
> v4 is rebased on top of v14 streaming read API changes.

Streaming API has been committed but the committed version has a minor
change, the read_stream_begin_relation function takes Relation instead
of BufferManagerRelation now. So, here is a v5 which addresses this
change.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 43e2f2b32e2fdb7e1fd787b1d8595768741f4792 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 3 Apr 2024 01:22:59 +0300
Subject: [PATCH v5] Use streaming read API in ANALYZE

ANALYZE command gets random tuples using BlockSampler algorithm. Use
streaming reads to get these tuples by using BlockSampler algorithm in
streaming read API prefetch logic.
---
 src/include/access/heapam.h  |  4 +-
 src/backend/access/heap/heapam_handler.c | 16 ++---
 src/backend/commands/analyze.c   | 85 
 3 files changed, 36 insertions(+), 69 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b632fe953c4..4e35caeb42e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -374,8 +375,7 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 
 /* in heap/heapam_handler.c*/
 extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-		   BlockNumber blockno,
-		   BufferAccessStrategy bstrategy);
+		   ReadStream *stream);
 extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
 		   TransactionId OldestXmin,
 		   double *liverows, double *deadrows,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c86000d245b..0533d9660c4 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,16 +1054,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 /*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
+ * Prepare to analyze block returned from streaming object.  The scan has been
+ * started with SO_TYPE_ANALYZE option.
  *
  * This routine holds a buffer pin and lock on the heap page.  They are held
  * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
  * items of the heap page are analyzed.
  */
 void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-			   BufferAccessStrategy bstrategy)
+heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 
@@ -1076,11 +1075,12 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	 * doing much work per tuple, the extra lock traffic is probably better
 	 * avoided.
 	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-		blockno, RBM_NORMAL, bstrategy);
+	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
+	hscan->rs_cindex = FirstOffsetNumber;
 }
 
 /*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..105285c3ea2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+   void *user_data,
+   void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1154,10 +1168,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	ReadStream *stream;
 
 	Assert(targrows > 0);
 
@@ -1170,13 +1181,6 @@ acquire_sample_rows(Relation onerel, int elevel,
randseed = pg_prng_uint32(&pg_global_prng_state);
nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, 

Re: Opportunistically pruning page before update

2024-04-03 Thread Melanie Plageman
On Fri, Jan 26, 2024 at 8:33 PM James Coleman  wrote:
>
> On Tue, Jan 23, 2024 at 2:46 AM Dilip Kumar  wrote:
> >
> > On Tue, Jan 23, 2024 at 7:18 AM James Coleman  wrote:
> > >
> > > On Mon, Jan 22, 2024 at 8:21 PM James Coleman  wrote:
> > > >
> > > > See rebased patch attached.
> > >
> > > I just realized I left a change in during the rebase that wasn't 
> > > necessary.
> > >
> > > v4 attached.
> >
> > I have noticed that you are performing the opportunistic pruning after
> > we decided that the updated tuple can not fit in the current page and
> > then we are performing the pruning on the new target page.  Why don't
> > we first perform the pruning on the existing page of the tuple itself?
> >  Or this is already being done before this patch? I could not find
> > such existing pruning so got this question because such pruning can
> > convert many non-hot updates to the HOT update right?
>
> First off I noticed that I accidentally sent a different version of
> the patch I'd originally worked on. Here's the one from the proper
> branch. It's still similar, but I want to make sure the right one is
> being reviewed.

I finally got back around to looking at this. Sorry for the delay.

I don't feel confident enough to say at a high level whether or not it
is a good idea in the general case to try pruning every block
RelationGetBufferForTuple() considers as a target block.

But, I did have a few thoughts on the implementation:

heap_page_prune_opt() checks PageGetHeapFreeSpace() twice. You will
have already done that in RelationGetBufferForTuple(). And you don't
even need to do it both of those times because you have a lock
(which is why heap_page_prune_opt() does it twice). This all seems a
bit wasteful. And then, you do it again after pruning.

This made me think, vacuum cares how much space heap_page_prune() (now
heap_page_prune_and_freeze()) freed up. Now if we add another caller
who cares how much space pruning freed up, perhaps it is worth
calculating this while pruning and returning it. I know
PageGetHeapFreeSpace() isn't the most expensive function in the world,
but it also seems like a useful, relevant piece of information to
inform the caller of.

You don't have to implement the above, it was just something I was
thinking about.

Looking at the code also made me wonder if it is worth changing
RelationGetBufferForTuple() to call PageGetHeapFreeSpace() before
taking a lock (which won't give a totally accurate result, but that's
probably okay) and then call heap_page_prune_opt() without a lock when
PageGetHeapFreeSpace() says there isn't enough space.

Also do we want to do GetVisibilityMapPins() before calling
heap_page_prune_opt()? I don't quite get why we do that before knowing
if we are going to actually use the target block in the current code.

Anyway, I'm not sure I like just adding a parameter to
heap_page_prune_opt() to indicate it already has an exclusive lock. It
does a bunch of stuff that was always done without a lock and now you
are doing it with an exclusive lock.

- Melanie




Re: On disable_cost

2024-04-03 Thread Greg Sabino Mullane
On Wed, Apr 3, 2024 at 3:21 PM Robert Haas  wrote:

> It's also pretty clear to me that the fact that enable_indexscan
> and enable_indexonlyscan work completely differently from each other
> is surprising at best, wrong at worst, but here again, what this patch
> does about that is not above reproach.


Yes, that is wrong; surely there is a reason we have two vars. Thanks for
digging into this: if nothing else, the code will be better for this
discussion, even if we do nothing for now with disable_cost.

Cheers,
Greg


Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
On Wed, Apr 03, 2024 at 12:41:27PM -0500, Nathan Bossart wrote:
> I committed v23-0001.  Here is a rebased version of the remaining patches.
> I intend to test the masking idea from Ants next.

0002 was missing a cast that is needed for the 32-bit builds.  I've fixed
that in v25.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From fe001e38b3f209c2fe615a2c4c64109d5e4d3da1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v25 1/2] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |  29 ++-
 src/port/pg_popcount_avx512.c|  49 ++
 src/port/pg_popcount_avx512_choose.c |  71 
 src/test/regress/expected/bit.out|  24 +++
 src/test/regress/sql/bit.sql |   4 +
 15 files changed, 673 insertions(+), 5 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# -
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include 
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes"
+else
+  pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: 

Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Alexander Korotkov
On Wed, Apr 3, 2024 at 10:04 PM Pavel Borisov  wrote:
> On Wed, 3 Apr 2024 at 22:18, Alexander Korotkov  wrote:
>>
>> On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera  
>> wrote:
>> >
>> > On 2024-Apr-03, Alexander Korotkov wrote:
>> >
>> > > Regarding the shmem data structure for LSN waiters.  I didn't pick
>> > > LWLock or ConditionVariable, because I needed the ability to wake up
>> > > only those waiters whose LSN is already replayed.  In my experience
>> > > waking up a process is way slower than scanning a short flat array.
>> >
>> > I agree, but I think that's unrelated to what I was saying, which is
>> > just the patch I attach here.
>>
>> Oh, sorry for the confusion.  I'd re-read your message.  Indeed you
>> meant this very clearly!
>>
>> I'm good with the patch.  Attached revision contains a bit of a commit 
>> message.
>>
>> > > However, I agree that when the number of waiters is very high and flat
>> > > array may become a problem.  It seems that the pairing heap is not
>> > > hard to use for shmem structures.  The only memory allocation call in
>> > > paritingheap.c is in pairingheap_allocate().  So, it's only needed to
>> > > be able to initialize the pairing heap in-place, and it will be fine
>> > > for shmem.
>> >
>> > Ok.
>> >
>> > With the code as it stands today, everything in WaitLSNState apart from
>> > the pairing heap is accessed without any locking.  I think this is at
>> > least partly OK because each backend only accesses its own entry; but it
>> > deserves a comment.  Or maybe something more, because WaitLSNSetLatches
>> > does modify the entry for other backends.  (Admittedly, this could only
>> > happens for backends that are already sleeping, and it only happens
>> > with the lock acquired, so it's probably okay.  But clearly it deserves
>> > a comment.)
>>
>> Please, check 0002 patch attached.  I found it easier to move two
>> assignments we previously moved out of lock, into the lock; then claim
>> WaitLSNState.procInfos is also protected by WaitLSNLock.
>
> Could you re-attach 0002. Seems it failed to attach to the previous message.

I actually forgot both!

--
Regards,
Alexander Korotkov


v2-0001-Use-an-LWLock-instead-of-a-spinlock-in-waitlsn.c.patch
Description: Binary data


v2-0002-Clarify-what-is-protected-by-WaitLSNLock.patch
Description: Binary data


Re: Streaming read-ready sequential scan code

2024-04-03 Thread Melanie Plageman
On Wed, Apr 3, 2024 at 4:28 PM Thomas Munro  wrote:
>
> On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman
>  wrote:
> > On the topic of BAS_BULKREAD buffer access strategy, I think the least
> > we could do is add an assert like this to read_stream_begin_relation()
> > after calculating max_pinned_buffers.
> >
> > Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
> >
> > Perhaps we should do more? I think with a ring size of 16 MB, large
> > SELECTs are safe for now. But I know future developers will make
> > changes and it would be good not to assume they will understand that
> > pinning more buffers than the size of the ring effectively invalidates
> > the ring.
>
> Yeah I think we should clamp max_pinned_buffers if we see a strategy.
> What do you think about:
>
> if (strategy)
> {
> int strategy_buffers = GetAccessStrategyBufferCount(strategy);
> max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
> }
>
> I just don't know where to get that '2'.  The idea would be to
> hopefully never actually be constrained by it in practice, except in
> tiny/toy setups, so we can't go too wrong with our number '2' there.

Yea, I don't actually understand why not just use strategy_buffers - 1
or something. 1/2 seems like a big limiting factor for those
strategies with small rings.

I don't really think it will come up for SELECT queries since they
rely on readahead and not prefetching.
It does seem like it could easily come up for analyze.

But I am on board with the idea of clamping for sequential scan/TID
range scan. For vacuum, we might have to think harder. If the user
specifies buffer_usage_limit and io_combine_limit and they are never
reaching IOs of io_combine_limit because of their buffer_usage_limit
value, then we should probably inform them.

- Melanie




Re: Streaming read-ready sequential scan code

2024-04-03 Thread Andres Freund
Hi,

On 2024-04-04 09:27:35 +1300, Thomas Munro wrote:
> Anecdotally by all reports I've seen so far, all-in-cache seems to be
> consistently a touch faster than master if anything, for streaming seq
> scan and streaming ANALYZE.  That's great!, but it doesn't seem to be
> due to intentional changes.  No efficiency is coming from batching
> anything.  Perhaps it has to do with CPU pipelining effects: though
> it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work
> itself is cut up into stages in a kind of pipeline:
> read_stream_next_buffer() chooses page n + 2, pins page n + 1 and
> returns page n each time you call it, so maybe we get more CPU
> parallelism due to spreading the data dependencies out?

Another theory is that it's due to the plain ReadBuffer() path needing to do
RelationGetSmgr(reln) on every call, whereas the streaming read path doesn't
need to.



> > On the topic of BAS_BULKREAD buffer access strategy, I think the least
> > we could do is add an assert like this to read_stream_begin_relation()
> > after calculating max_pinned_buffers.
> >
> > Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
> >
> > Perhaps we should do more? I think with a ring size of 16 MB, large
> > SELECTs are safe for now. But I know future developers will make
> > changes and it would be good not to assume they will understand that
> > pinning more buffers than the size of the ring effectively invalidates
> > the ring.
>
> Yeah I think we should clamp max_pinned_buffers if we see a strategy.
> What do you think about:
>
> if (strategy)
> {
> int strategy_buffers = GetAccessStrategyBufferCount(strategy);
> max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
> }

> I just don't know where to get that '2'.  The idea would be to
> hopefully never actually be constrained by it in practice, except in
> tiny/toy setups, so we can't go too wrong with our number '2' there.

The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so,
should we apply that only if the strategy doesn't use the
StrategyRejectBuffer() logic?

I think it's fine to add a handwavy justification for the 2, saying that we
want to balance readahead distance and reducing WAL write frequency, and that
at some point more sophisticated logic will be needed.


> Then we should increase the default ring sizes for BAS_BULKREAD and
> BAS_VACUUM to make the previous statement true.

I'm not sure it's right tying them together. The concerns for both are fairly
different.


> The size of main memory and L2 cache have increased dramatically since those
> strategies were invented.  I think we should at least double them, and more
> likely quadruple them.  I realise you already made them configurable per
> command in commit 1cbbee033, but I mean the hard coded default 256 in
> freelist.c.  It's not only to get more space for our prefetching plans, it's
> also to give the system more chance of flushing WAL asynchronously/in some
> other backend before you crash into dirty data; as you discovered,
> prefetching accidentally makes that effect worse in a BAS_VACUUM strategy,
> by taking away space that is effectively deferring WAL flushes, so I'd at
> least like to double the size for if we use "/ 2" above.

I think for VACUUM we should probably go a bit further. There's no comparable
L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more
expensive, compared to a seqscan. I'd go for at least 4x-8x.

Greetings,

Andres Freund




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Daniel Gustafsson
> On 4 Apr 2024, at 00:06, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> On 3 Apr 2024, at 19:38, Tom Lane  wrote:
>>> Bottom line for me is that pulling 1.0.1 support now is OK,
>>> but I think pulling 1.0.2 is premature.
> 
>> Is Red Hat building and shipping v17 packages for RHEL7 ELS customers?  If
>> not then it seems mostly academical to tie our dependencies to RHEL ELS
>> unless I'm missing something.
> 
> True, they won't be doing that, and neither will Devrim.  So maybe
> we can leave RHEL7 out of the discussion, in which case there's
> not a lot of reason to keep 1.0.2 support.  We'll need to notify
> buildfarm owners to adjust their configurations.

The patch will also need to be adjusted to work with LibreSSL, but I know Jacob
was looking into that so ideally we should have something to review before
the weekend.

--
Daniel Gustafsson





Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 11:30:39PM +0300, Ants Aasma wrote:
> On Tue, 2 Apr 2024 at 00:31, Nathan Bossart  wrote:
>> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
>> > What about using the masking capabilities of AVX-512 to handle the
>> > tail in the same code path? Masked out portions of a load instruction
>> > will not generate an exception. To allow byte level granularity
>> > masking, -mavx512bw is needed. Based on wikipedia this will only
>> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
>> > VPOPCNTQ implies availability of BW.
>>
>> Sounds promising.  IMHO we should really be sure that these kinds of loads
>> won't generate segfaults and the like due to the masked-out portions.  I
>> searched around a little bit but haven't found anything that seemed
>> definitive.
> 
> After sleeping on the problem, I think we can avoid this question
> altogether while making the code faster by using aligned accesses.
> Loads that straddle cache line boundaries run internally as 2 load
> operations. Gut feel says that there are enough out-of-order resources
> available to make it not matter in most cases. But even so, not doing
> the extra work is surely better. Attached is another approach that
> does aligned accesses, and thereby avoids going outside bounds.
> 
> Would be interesting to see how well that fares in the small use case.
> Anything that fits into one aligned cache line should be constant
> speed, and there is only one branch, but the mask setup and folding
> the separate popcounts together should add up to about 20-ish cycles
> of overhead.

I tested your patch in comparison to v25 and saw the following:

  bytes       v25   v25+ants
      2  1108.205   1033.132
      4  1311.227   1289.373
      8  1927.954   2360.113
     16  2281.091   2365.408
     32  3856.992   2390.688
     64  3648.72    3242.498
    128  4108.549   3607.148
    256  4910.076   4496.852

For 2 bytes and 4 bytes, the inlining should take effect, so any difference
there is likely just noise.  At 8 bytes, we are calling the function
pointer, and there is a small regression with the masking approach.
However, by 16 bytes, the masking approach is on par with v25, and it wins
for all larger buffers, although the gains seem to taper off a bit.

If we can verify this approach won't cause segfaults and can stomach the
regression between 8 and 16 bytes, I'd happily pivot to this approach so
that we can avoid the function call dance that I have in v25.

Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_combinebackup --copy-file-range

2024-04-03 Thread Tomas Vondra
Hi,

Here's a much more polished and cleaned up version of the patches,
fixing all the issues I've been aware of, and with various parts merged
into much more cohesive parts (instead of keeping them separate to make
the changes/evolution more obvious).

I decided to reorder the changes like this:

1) block alignment - As I said earlier, I think we should definitely do
this, even if only to make future improvements possible. After chatting
about this with Robert off-list a bit, he pointed out I actually forgot
to not align the headers for files without any blocks, so this version
fixes that.

2) add clone/copy_file_range for the case that copies whole files. This
is pretty simple, but it also adds the --clone/copy-file-range options,
and similar infrastructure. The one slightly annoying bit is that we now
have the ifdef stuff in two places - when parsing the options, and then
in the copy_file_XXX methods, and the pg_fatal() calls should not be
reachable in practice. But that doesn't seem harmful, and might be a
a useful protection against someone calling a function that does nothing.

This also merges the original two parts, where the first one only did
this cloning/CoW stuff when checksum did not need to be calculated, and
the second extended it to those cases too (by also reading the data, but
still doing the copy the old way).

I think this is the right way - if that's not desirable, it's easy to
either add --no-manifest or not use the CoW options. Or just not change
the checksum type. There's no other way.

3) add copy_file_range to write_reconstructed_file, by using roughly the
same logic/reasoning as (2). If --copy-file-range is specified and a
checksum should be calculated, the data is read for the checksum, but
the copy is done using copy_file_range.

I did rework the write_reconstructed_file() flow a bit, because
tracking what exactly needs to be read/written in each of the cases
(which copy method, zeroed block, checksum calculated) made the flow
really difficult to follow. Instead I introduced a function to
read/write a block, and call them from different places.

I think this is much better, and it also makes the following part
dealing with batches of blocks much easier / smaller change.

4) prefetching - This is mostly unrelated to the CoW stuff, but it has
tremendous benefits, especially for ZFS. I haven't been able to tune ZFS
to get decent performance, and ISTM it'd be rather unusable for backup
purposes without this.

5) batching in write_reconstructed_file - This benefits especially the
copy_file_range case, where I've seen it to yield massive speedups (see
the message from Monday for better data).

6) batching for prefetch - Similar logic to (5), but for fadvise. This
is the only part where I'm not entirely sure whether it actually helps
or not. Needs some more analysis, but I'm including it for completeness.


I do think the parts (1)-(4) are in pretty good shape, good enough to
make them committable in a day or two. I see it mostly a matter of
testing and comment improvements rather than code changes.

(5) is in pretty good shape too, but I'd like to maybe simplify and
refine the write_reconstructed_file changes a bit more. I don't think
it's broken, but it feels a bit too cumbersome.

Not sure about (6) yet.

I changed how I think about this a bit - I don't really see the CoW copy
methods as necessary faster than the regular copy (even though it can be
with (5)). I think the main benefits are the space savings, enabled by
patches (1)-(3). If (4) and (5) get in, that's a bonus, but even without
that I don't think the performance is an issue - everything has a cost.


On 4/3/24 15:39, Jakub Wartak wrote:
> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
>  wrote:
>>
>> Hi,
>>
>> I've been running some benchmarks and experimenting with various stuff,
>> trying to improve the poor performance on ZFS, and the regression on XFS
>> when using copy_file_range. And oh boy, did I find interesting stuff ...
> 
> [..]
> 
> Congratulations on great results!
> 
>> 4) after each pg_combinebackup run to pg_verifybackup, start the cluster
>> to finish recovery, run pg_checksums --check (to check the patches don't
>> produce something broken)
> 
> I've performed some follow-up small testing on all patches mentioned
> here  (1..7), with the earlier developed nano-incremental-backup-tests
> that helped detect some issues for Robert earlier during original
> development. They all went fine in both cases:
> - no special options when using pg_combinebackup
> - using pg_combinebackup --copy-file-range --manifest-checksums=NONE
> 
> Those were:
> test_across_wallevelminimal.sh
> test_full_pri__incr_stby__restore_on_pri.sh
> test_full_pri__incr_stby__restore_on_stby.sh
> test_full_stby__incr_stby__restore_on_pri.sh
> test_full_stby__incr_stby__restore_on_stby.sh
> test_incr_after_timelineincrease.sh
> test_incr_on_standby_after_promote.sh
> test_many_incrementals_dbcreate_duplicateOID.sh
> 

Re: Security lessons from liblzma - libsystemd

2024-04-03 Thread Peter Eisentraut

On 03.04.24 23:19, Magnus Hagander wrote:
When the code is this simple, we should definitely consider carrying it 
ourselves. At least if we don't expect to need *other* functionality 
from the same library in the future, which I doubt we will from libsystemd.


Well, I've long had it on my list to do some integration to log directly 
to the journal, so you can preserve metadata better.  I'm not sure right 
now whether this would use libsystemd, but it's not like there is 
absolutely no other systemd-related functionality that could be added.


Personally, I think this proposed change is trying to close a barndoor 
after a horse has bolted.  There are many more interesting and scary 
libraries in the dependency tree of "postgres", so just picking off one 
right now doesn't really accomplish anything.  The next release of 
libsystemd will drop all the compression libraries as hard dependencies, 
so the issue in that sense is gone anyway.  Also, fun fact: liblzma is 
also a dependency via libxml2.






Re: On disable_cost

2024-04-03 Thread Robert Haas
On Tue, Apr 2, 2024 at 12:58 PM Tom Lane  wrote:
> Not really.  But if you had, say, a join of a promoted path to a
> disabled path, that would be treated as on-par with a join of two
> regular paths, which seems like it'd lead to odd choices.  Maybe
> it'd be fine, but my gut says it'd likely not act nicely.  As you
> say, it's a lot easier to believe that only-promoted or only-disabled
> situations would behave sanely.

Makes sense.

I wanted to further explore the idea of just not generating plans of
types that are currently disabled. I looked into doing this for
enable_indexscan and enable_indexonlyscan. As a first step, I
investigated how those settings work now, and was horrified. I don't
know whether I just wasn't paying attention back when the original
index-only scan work was done -- I remember discussing
enable_indexonlyscan with you at the time -- or whether it got changed
subsequently. Anyway, the current behavior is:

[A] enable_indexscan=false adds disable_cost to the cost of every
Index Scan path *and also* every Index-Only Scan path. So disabling
index-scans also in effect discourages the use of index-only scans,
which would make sense if we didn't have a separate setting called
enable_indexonlyscan, but we do. Given that, I think this is
completely and utterly wrong.

[B] enable_indexonlyscan=false causes index-only scan paths not to be
generated at all, but instead, we generate index-scan paths to do the
same thing that we would not have generated otherwise. This is weird
because it means that disabling one plan type causes us to consider
additional plans of another type, which seems like a thing that a user
might not expect. It's more defensible than [A], though, because you
could argue that we only omit the index scan path as an optimization,
on the theory that it will always lose to the index-only scan path,
and thus if the index-only scan path is not generated, there's a point
to generating the index scan path after all, so we should. However, it
seems unlikely to me that someone reading the one line of
documentation that we have about this parameter would be able to guess
that it works this way.

Here's an example of how the current system behaves:

robert.haas=# explain select count(*) from pgbench_accounts;
   QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=2854.29..2854.30 rows=1 width=8)
   ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts
 (cost=0.29..2604.29 rows=10 width=0)
(2 rows)

robert.haas=# set enable_indexscan=false;
SET
robert.haas=# explain select count(*) from pgbench_accounts;
  QUERY PLAN
--
 Aggregate  (cost=2890.00..2890.01 rows=1 width=8)
   ->  Seq Scan on pgbench_accounts  (cost=0.00..2640.00 rows=10 width=0)
(2 rows)

robert.haas=# set enable_seqscan=false;
SET
robert.haas=# set enable_bitmapscan=false;
SET
robert.haas=# explain select count(*) from pgbench_accounts;
QUERY PLAN
--
 Aggregate  (cost=1002854.29..1002854.30 rows=1 width=8)
   ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts
 (cost=100.29..1002604.29 rows=10 width=0)
(2 rows)

robert.haas=# set enable_indexonlyscan=false;
SET
robert.haas=# explain select count(*) from pgbench_accounts;
  QUERY PLAN
---
 Aggregate  (cost=1002890.00..1002890.01 rows=1 width=8)
   ->  Seq Scan on pgbench_accounts
(cost=100.00..1002640.00 rows=10 width=0)
(2 rows)

The first time we run the query, it picks an index-only scan because
it's the cheapest. When index scans are disabled, the query now picks
a sequential scan, even though it wasn't using an index scan, because
the index-only scan that it was using is perceived to have become
very expensive. When we then shut off sequential scans and bitmap
scans, it switches back to an index-only scan, because setting
enable_indexscan=false didn't completely disable index-only scans, but
just made them expensive. But now everything looks expensive, so we go
back to the same plan we had initially, except with the cost increased
by a bazillion. Finally, when we disable index-only scans, that
removes that plan from the pool, so now we pick the second-cheapest
plan overall, which in this case is a sequential scan.

So just to see what would happen, I wrote a patch to make
enable_indexscan and enable_indexonlyscan do exactly what they say on
the tin: when you set one of them to false, paths of that type are not
generated, 

Re: Security lessons from liblzma

2024-04-03 Thread Daniel Gustafsson
> On 3 Apr 2024, at 20:09, Peter Eisentraut  wrote:
> 
> On 30.03.24 00:14, Daniel Gustafsson wrote:
>> One take-away for me is how important it is to ship recipes for regenerating
>> any testdata which is included in generated/compiled/binary format.  Kind of
>> how we in our tree ship the config for test TLS certificates and keys which 
>> can
>> be manually inspected, and used to rebuild the testdata (although the risk 
>> for
>> injections in this particular case seems low).  Bad things can still be
>> injected, but formats which allow manual review at least goes some way 
>> towards
>> lowering risk.
> 
> I actually find the situation with the test TLS files quite unsatisfactory.  
> While we have build rules, the output files are not reproducible, both 
> because of some inherent randomness in the generation, and because different 
> OpenSSL versions produce different details.

This testdata is by nature not reproducible, and twisting arms to make it so
will only result in testing against synthetic data, which risks hiding bugs IMO.

> So you just have to "trust" that what's there now makes sense.

Not entirely, you can review the input files which are used to generate the
test data and verify that those make sense..

> Of course, you can use openssl tools to unpack these files, but that is 
> difficult and unreliable unless you know exactly what you are looking for.

..and check like you mention above, but it's as you say not entirely trivial.
It would be nice to improve this but I'm not sure how.  Document how to
inspect the data in src/test/ssl/README perhaps?

> It would be better if we created the required test files as part of the test 
> run.  (Why not?  Too slow?)

The make sslfiles step requires OpenSSL 1.1.1, which is higher than what we
require to be installed to build postgres.  The higher-than-base requirement is
due to it building test data only used when running 1.1.1 or higher, so
technically it could be made to work if anyone is interested in investing time
in 1.0.2.

Time is one aspect, on my crusty old laptop it takes ~2 seconds to regenerate
the files.  That in itself isn't that much, but we've rejected test-time
additions far less than that.  We could however make CI and the Buildfarm run
the regeneration and leave it up to each developer to decide locally?  Or
remove them and regenerate on the first SSL test run and then use the cached
ones after that?

On top of time I have a feeling the regeneration won't run on Windows.  When
it's converted to use Meson then that might be fixed though.

--
Daniel Gustafsson





Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Andrew Dunstan


On 2024-04-03 We 15:12, Daniel Gustafsson wrote:

The
fact that very few animals run the ssl tests is a pet peeve of mine, it would
be nice if we could get broader coverage there.



Well, the only reason for that is that the SSL tests need to be listed 
in PG_TEST_EXTRA, and the only reason for that is that there's a 
possible hazard on multi-user servers. But I bet precious few buildfarm 
animals run in such an environment. Mine don't - I'm the only user.


Maybe we could send out an email to the buildfarm-owners list asking 
people to consider allowing the ssl tests.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: LogwrtResult contended spinlock

2024-04-03 Thread Jeff Davis
On Wed, 2024-04-03 at 13:19 +0200, Alvaro Herrera wrote:
> So what I do in the attached 0001 is stop using the XLogwrtResult struct
> in XLogCtl and replace it with separate Write and Flush values, and add
> the macro XLogUpdateLocalLogwrtResult() that copies the values of Write
> and Flush from the shared XLogCtl to the local variable given as macro
> argument.

+1

> 0002 then adds pg_atomic_monotonic_advance_u64.  (I don't add the _u32
> variant, because I don't think it's a great idea to add dead code.  If
> later we see a need for it we can put it in.)  It also changes the two
> new members to be atomics, changes the macro to use atomic read, and
> XLogWrite now uses monotonic increment.  A couple of other places can
> move the macro calls to occur outside the spinlock.  Also, XLogWrite
> gains the invariant checking that involves Write and Flush.

To maintain the invariant that Write >= Flush, I believe you need to
always store to Write first, then Flush; and load from Flush first,
then from Write. So RefreshXLogWriteResult should load Flush then load
Write, and the same for the Assert code. And in 0003, loading from Copy
should happen last.

Also, should pg_atomic_monotonic_advance_u64() return currval? I don't
think it's important for the current patches, but I could imagine a
caller wanting the most up-to-date value possible, even if it's beyond
what the caller requested. Returning void seems slightly wasteful.

Other than that, it looks good to me.

> Finally, 0003 adds the Copy pointer to XLogCtl alongside Write and
> Flush, and updates WALReadFromBuffers to test that instead of the Write
> pointer, and adds in XLogWrite the invariant checks that involve the
> Copy pointer.

Looks good to me.

Regards,
Jeff Davis





Re: Streaming read-ready sequential scan code

2024-04-03 Thread Thomas Munro
On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman
 wrote:
> On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas  wrote:
> > On 01/04/2024 22:58, Melanie Plageman wrote:
> > > Attached v7 has version 14 of the streaming read API as well as a few
> > > small tweaks to comments and code.
> >
> > I saw benchmarks in this thread to show that there's no regression when
> > the data is in cache, but I didn't see any benchmarks demonstrating the
> > benefit of this. So I ran this quick test:
>
> Good point! It would be good to show why we would actually want this
> patch. Attached v8 is rebased over current master (which now has the
> streaming read API).

Anecdotally by all reports I've seen so far, all-in-cache seems to be
consistently a touch faster than master if anything, for streaming seq
scan and streaming ANALYZE.  That's great, but it doesn't seem to be
due to intentional changes.  No efficiency is coming from batching
anything.  Perhaps it has to do with CPU pipelining effects: though
it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work
itself is cut up into stages in a kind of pipeline:
read_stream_next_buffer() chooses page n + 2, pins page n + 1 and
returns page n each time you call it, so maybe we get more CPU
parallelism due to spreading the data dependencies out?  (Makes me
wonder what happens if you insert a memory prefetch for the page
header into that production line, and if there are more opportunities
to spread dependencies out eg hashing the buffer tag ahead of time.)

> On the topic of BAS_BULKREAD buffer access strategy, I think the least
> we could do is add an assert like this to read_stream_begin_relation()
> after calculating max_pinned_buffers.
>
> Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
>
> Perhaps we should do more? I think with a ring size of 16 MB, large
> SELECTs are safe for now. But I know future developers will make
> changes and it would be good not to assume they will understand that
> pinning more buffers than the size of the ring effectively invalidates
> the ring.

Yeah I think we should clamp max_pinned_buffers if we see a strategy.
What do you think about:

if (strategy)
{
int strategy_buffers = GetAccessStrategyBufferCount(strategy);
max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
}

I just don't know where to get that '2'.  The idea would be to
hopefully never actually be constrained by it in practice, except in
tiny/toy setups, so we can't go too wrong with our number '2' there.

Then we should increase the default ring sizes for BAS_BULKREAD and
BAS_VACUUM to make the previous statement true.  The size of main
memory and L2 cache have increased dramatically since those strategies
were invented.  I think we should at least double them, and more
likely quadruple them.  I realise you already made them configurable
per command in commit 1cbbee033, but I mean the hard coded default 256
in freelist.c.  It's not only to get more space for our prefetching
plans, it's also to give the system more chance of flushing WAL
asynchronously/in some other backend before you crash into dirty data;
as you discovered, prefetching accidentally makes that effect worse
in a BAS_VACUUM strategy, by taking away space that is effectively
deferring WAL flushes, so I'd at least like to double the size for if
we use "/ 2" above.




Re: On disable_cost

2024-04-03 Thread David Rowley
On Thu, 4 Apr 2024 at 08:21, Robert Haas  wrote:
> I wanted to further explore the idea of just not generating plans of
> types that are currently disabled. I looked into doing this for
> enable_indexscan and enable_indexonlyscan. As a first step, I
> investigated how those settings work now, and was horrified. I don't
> know whether I just wasn't paying attention back when the original
> index-only scan work was done -- I remember discussing
> enable_indexonlyscan with you at the time -- or whether it got changed
> subsequently. Anyway, the current behavior is:
>
> [A] enable_indexscan=false adds disable_cost to the cost of every
> Index Scan path *and also* every Index-Only Scan path. So disabling
> index-scans also in effect discourages the use of index-only scans,
> which would make sense if we didn't have a separate setting called
> enable_indexonlyscan, but we do. Given that, I think this is
> completely and utterly wrong.
>
> [B] enable_indexonlyscan=false causes index-only scan paths not to be
> generated at all, but instead, we generate index-scan paths to do the
> same thing that we would not have generated otherwise.

FWIW, I think changing this is a bad idea and I don't think the
behaviour that's in your patch is useful.  With your patch, if I SET
enable_indexonlyscan=false, any index that *can* support an IOS for my
query will now not be considered at all!

With your patch applied, I see:

-- default enable_* GUC values.
postgres=# explain select oid from pg_class order by oid;
QUERY PLAN
---
 Index Only Scan using pg_class_oid_index on pg_class
(cost=0.27..22.50 rows=415 width=4)
(1 row)


postgres=# set enable_indexonlyscan=0; -- no index scan?
SET
postgres=# explain select oid from pg_class order by oid;
   QUERY PLAN
-
 Sort  (cost=36.20..37.23 rows=415 width=4)
   Sort Key: oid
   ->  Seq Scan on pg_class  (cost=0.00..18.15 rows=415 width=4)
(3 rows)

postgres=# set enable_seqscan=0; -- still no index scan!
SET
postgres=# explain select oid from pg_class order by oid;
 QUERY PLAN

 Sort  (cost=136.20..137.23 rows=415 width=4)
   Sort Key: oid
   ->  Seq Scan on pg_class  (cost=100.00..118.15
rows=415 width=4)
(3 rows)

postgres=# explain select oid from pg_class order by oid,relname; --
now an index scan?!
 QUERY PLAN
-
 Incremental Sort  (cost=0.43..79.50 rows=415 width=68)
   Sort Key: oid, relname
   Presorted Key: oid
   ->  Index Scan using pg_class_oid_index on pg_class
(cost=0.27..60.82 rows=415 width=68)
(4 rows)

I don't think this is good as pg_class_oid_index effectively won't be
used as long as the particular query could use that index with an
index *only* scan. You can see above that as soon as I adjust the
query slightly so that IOS isn't possible, the index can be used
again. I think an Index Scan would have been a much better option for
the 2nd query than the seq scan and sort.

I think if I do SET enable_indexonlyscan=0; the index should still be
used with an Index Scan and it definitely shouldn't result in Index
Scan also being disabled if that index happens to contain all the
columns required to support an IOS.

FWIW, I'm fine with the current behaviour.  It looks like we've
assumed that, when possible, IOS are always superior to Index Scan, so
there's no point in generating an Index Scan path when we can generate
an IOS path. I think this makes sense. For that not to be true,
checking the all visible flag would have to be more costly than
visiting the heap.  Perhaps that could be true if the visibility map
page had to come from disk and the heap page was cached and the disk
was slow, but I don't think that scenario is worthy of considering
both Index scan and IOS path types when IOS is possible. We've no way
to accurately cost that anyway.

This all seems similar to enable_sort vs enable_incremental_sort. For
a while, we did consider both an incremental sort and a sort when an
incremental sort was possible, but it seemed to me that an incremental
sort would always be better when it was possible, so I changed that in
4a29eabd1. I've not seen anyone complain.  I made it so that when an
incremental sort is possible but is disabled, we do a sort instead.
That seems fairly similar to how master handles
enable_indexonlyscan=false.

In short, I don't find it strange that disabling one node type results
in considering another type that we'd otherwise not consider in cases
where we assume that the disabled node type is always superior and
should always be used when it is possible.

I do 

Re: Built-in CTYPE provider

2024-04-03 Thread Jeff Davis
On Tue, 2024-03-26 at 08:14 +0100, Peter Eisentraut wrote:
> 
> Full vs. simple case mapping is more of a legacy compatibility question,
> in my mind.  There is some expectation/precedent that C.UTF-8 uses
> simple case mapping, but beyond that, I don't see a reason why someone
> would want to explicitly opt for simple case mapping, other than if they
> need length preservation or something, but if they need that, then they
> are going to be in a world of pain in Unicode anyway.

I mostly agree, though there are some other purposes for the simple
mapping:

* a substitute for case folding: lower() with simple case mapping will
work better for that purpose than lower() with full case mapping (after
we have casefold(), this won't be a problem)

* simple case mapping is conceptually simpler, and that's a benefit by
itself in some situations -- maybe the 1:1 assumption exists other
places in their application

> > There's also another reason to consider it an argument rather than a
> > collation property, which is that it might be dependent on some other
> > field in a row. I could imagine someone wanting to do:
> > 
> >     SELECT
> >   UPPER(some_field,
> >     full => true,
> >     dotless_i => CASE other_field WHEN ...)
> >     FROM ...
> 
> Can you index this usefully?  It would only work if the user query 
> matches exactly this pattern?

In that example, UPPER is used in the target list -- the WHERE clause
might be indexable. The UPPER is just used for display purposes, and
may depend on some locale settings stored in another table associated
with a particular user.

Regards,
Jeff Davis





RE: Can't compile PG 17 (master) from git under Msys2 autoconf

2024-04-03 Thread Regina Obe
> Hi Regina,
> 
> On 2024-Mar-27, Regina Obe wrote:
> 
> > The error is
> >
> > rm -f '../../src/include/storage/lwlocknames.h'
> > cp -pR ../../backend/storage/lmgr/lwlocknames.h
> > '../../src/include/storage/lwlocknames.h'
> > cp: cannot stat '../../backend/storage/lmgr/lwlocknames.h': No such
> > file or directory
> > make[1]: *** [Makefile:143: ../../src/include/storage/lwlocknames.h]
> > Error 1
> > make[1]: Leaving directory '/projects/postgresql/postgresql-
> git/src/backend'
> > make: *** [../../src/Makefile.global:382: submake-generated-headers]
> > Error 2
> 
> Hmm, I changed these rules again in commit da952b415f44, maybe your
> problem is with that one?  I wonder if changing the references to
> ../include/storage/lwlocklist.h to $(topdir)/src/include/storage/lwlocklist.h
> (and similar things in
> src/backend/storage/lmgr/Makefile) would fix it.
> 
> Do you run configure in the source directory or a separate build one?
> 
> --
> Álvaro Herrera         Breisgau, Deutschland  —
> https://www.EnterpriseDB.com/
> "If it is not right, do not do it.
> If it is not true, do not say it." (Marcus Aurelius, Meditations)

I run in the source directory, but tried doing in a separate build directory 
and ran into the same issue.

Let me look at that commit and get back to you if that makes a difference.

Thanks,
Regina





Re: Use streaming read API in ANALYZE

2024-04-03 Thread Melanie Plageman
On Wed, Apr 03, 2024 at 10:25:01PM +0300, Nazir Bilal Yavuz wrote:
>
> I realized the same error while working on Jakub's benchmarking results.
> 
> Cause: I was using the nblocks variable to check how many blocks will
> be returned from the streaming API. But I realized that sometimes the
> number returned from BlockSampler_Init() is not equal to the number of
> blocks that BlockSampler_Next() will return as BlockSampling algorithm
> decides how many blocks to return on the fly by using some random
> seeds.
> 
> There are a couple of solutions I thought of:
> 
> 1- Use BlockSampler_HasMore() instead of nblocks in the main loop in
> the acquire_sample_rows():
> 
> Streaming API uses this function to prefetch block numbers.
> BlockSampler_HasMore() will reach to the end first as it is used while
> prefetching, so it will start to return false while there are still
> buffers to return from the streaming API. That will cause some buffers
> at the end to not be processed.
> 
> 2- Expose something (function, variable etc.) from the streaming API
> to understand if the read is finished and there is no buffer to
> return:
> 
> I think this works but I am not sure if the streaming API allows
> something like that.
> 
> 3- Check every buffer returned from the streaming API; if it is
> invalid, stop the main loop in acquire_sample_rows():
> 
> This solves the problem, but there will be two if checks for each
> buffer returned:
> - in heapam_scan_analyze_next_block(), to check if the returned buffer
> is invalid
> - in acquire_sample_rows(), to break the main loop if
> heapam_scan_analyze_next_block() returns false
> One of the if checks can be bypassed by moving
> heapam_scan_analyze_next_block()'s code into the main loop in
> acquire_sample_rows().
> 
> I implemented the third solution, here is v6.

I've reviewed the patches inline below and attached a patch that has
some of my ideas on top of your patch.

> From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001
> From: Nazir Bilal Yavuz 
> Date: Wed, 3 Apr 2024 15:14:15 +0300
> Subject: [PATCH v6] Use streaming read API in ANALYZE
> 
> ANALYZE command gets random tuples using BlockSampler algorithm. Use
> streaming reads to get these tuples by using BlockSampler algorithm in
> streaming read API prefetch logic.
> ---
>  src/include/access/heapam.h  |  6 +-
>  src/backend/access/heap/heapam_handler.c | 22 +++---
>  src/backend/commands/analyze.c   | 85 
>  3 files changed, 42 insertions(+), 71 deletions(-)
> 
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index a307fb5f245..633caee9d95 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -25,6 +25,7 @@
>  #include "storage/bufpage.h"
>  #include "storage/dsm.h"
>  #include "storage/lockdefs.h"
> +#include "storage/read_stream.h"
>  #include "storage/shm_toc.h"
>  #include "utils/relcache.h"
>  #include "utils/snapshot.h"
> @@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
> struct 
> GlobalVisState *vistest);
>  
>  /* in heap/heapam_handler.c*/
> -extern void heapam_scan_analyze_next_block(TableScanDesc scan,
> -                                           BlockNumber blockno,
> -                                           BufferAccessStrategy bstrategy);
> +extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
> +                                           ReadStream *stream);
>  extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
>                                             TransactionId OldestXmin,
>                                             double *liverows, double *deadrows,
> diff --git a/src/backend/access/heap/heapam_handler.c 
> b/src/backend/access/heap/heapam_handler.c
> index 0952d4a98eb..d83fbbe6af3 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, 
> Relation NewHeap,
>  }
>  
>  /*
> - * Prepare to analyze block `blockno` of `scan`.  The scan has been started
> - * with SO_TYPE_ANALYZE option.
> + * Prepare to analyze block returned from streaming object.  If the block
> + * returned from streaming object is valid, true is returned; otherwise
> + * false is returned.  The scan has been started with SO_TYPE_ANALYZE option.
>   *
>   * This routine holds a buffer pin and lock on the heap page.  They are held
>   * until heapam_scan_analyze_next_tuple() returns false.  That is until all 
> the
>   * items of the heap page are analyzed.
>   */
> -void
> -heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
> - 

Re: Security lessons from liblzma - libsystemd

2024-04-03 Thread Magnus Hagander
On Wed, Apr 3, 2024 at 7:57 PM Andres Freund  wrote:

> Hi,
>
> As most will know by now, the way xz debacle was able to make sshd
> vulnerable
> was through a dependency from sshd to libsystemd and then from libsystemd
> to
> liblzma. One lesson from this is that unnecessary dependencies can still
> increase risk.
>

Yeah, I think that's something to consider for every dependency added. I
think we're fairly often protected against "adding too many libraries"
because many libraries simply don't exist for all the platforms we want to
build on. But it's nevertheless something to think about each time.


It's worth noting that we have an optional dependency on libsystemd as well.
>
> Openssh has now integrated [1] a patch to remove the dependency on
> libsystemd
> for triggering service manager readyness notifications, by inlining the
> necessary function. That's not hard, the protocol is pretty simple.
>
> I suspect we should do the same. We're not even close to being a target as
> attractive as openssh, but still, it seems unnecessary.
>

+1.

When the code is this simple, we should definitely consider carrying it
ourselves. At least if we don't expect to need *other* functionality from
the same library in the future, which I doubt we will from libsystemd.


An argument could be made to instead just remove support, but I think it's
> quite valuable to have intra service dependencies that can rely on the
> server
> actually having started up.
>
>
If we remove support we're basically just asking most of our linux
packagers to add it back in, and they will add it back in the same way we
did it. I think we do everybody a disservice if we do that. It's useful
functionality.

//Magnus


Re: Security lessons from liblzma - libsystemd

2024-04-03 Thread Tom Lane
Magnus Hagander  writes:
> On Wed, Apr 3, 2024 at 7:57 PM Andres Freund  wrote:
>> Openssh has now integrated [1] a patch to remove the dependency on
>> libsystemd
>> for triggering service manager readyness notifications, by inlining the
>> necessary function. That's not hard, the protocol is pretty simple.
>> I suspect we should do the same. We're not even close to being a target as
>> attractive as openssh, but still, it seems unnecessary.

> +1.

I didn't read the patch, but if it's short and stable enough then this
seems like a good idea.  (If openssh and we are using such a patch,
that will probably be a big enough stake in the ground to prevent
somebody deciding to change the protocol ...)

>> An argument could be made to instead just remove support, but I think it's
>> quite valuable to have intra service dependencies that can rely on the
>> server actually having started up.

> If we remove support we're basically just asking most of our linux
> packagers to add it back in, and they will add it back in the same way we
> did it. I think we do everybody a disservice if we do that. It's useful
> functionality.

Yeah, that idea seems particularly silly in view of the desire
expressed earlier in this thread to reduce the number of patches
carried by packagers.  People packaging for systemd-using distros
will not consider that this functionality is optional.

regards, tom lane




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-04-03 Thread Michael Paquier
On Wed, Apr 03, 2024 at 04:20:39PM +0300, Melih Mutlu wrote:
> Michael Paquier , 14 Şub 2024 Çar, 10:23 tarihinde
> şunu yazdı:
>> I was reading the patch, and using int[] as a representation of the
>> path of context IDs up to the top-most parent looks a bit strange to
>> me, with the relationship between each parent -> child being
>> preserved, visibly, based on the order of the elements in this array
>> made of temporary IDs compiled on-the-fly during the function
>> execution.  Am I the only one finding that a bit strange?  Could it be
>> better to use a different data type for this path and perhaps switch
>> to the names of the contexts involved?
> 
> Do you find having the path column strange altogether? Or only using
> temporary IDs to generate that column? The reason why I avoid using context
> names is because there can be multiple contexts with the same name. This
> makes it difficult to figure out which context, among those with that
> particular name, is actually included in the path. I couldn't find any
> other information that is unique to each context.

I've been re-reading the patch again to remember what this is about,
and I'm OK with having this "path" column in the catalog.  However,
I'm somewhat confused by the choice of having a temporary number that
shows up in the catalog representation, because this may not be
constant across multiple calls so this still requires a follow-up
temporary ID <-> name mapping in any SQL querying this catalog.  A
second thing is that the array does not show the hierarchy of the
path; the patch relies on the order of the elements in the output
array instead.
--
Michael




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Daniel Gustafsson
> On 3 Apr 2024, at 17:29, Tom Lane  wrote:
> 
> Jacob Champion  writes:
>> As far as I can tell, no versions of LibreSSL so far provide
>> X509_get_signature_info(), so this patch is probably a bit too
>> aggressive.
> 
> Another problem with cutting support is how many buildfarm members
> will we lose.  I scraped recent configure logs and got the attached
> results.  I count 3 machines running 1.0.1,

Support for 1.0.1 was removed with 8e278b657664 in July 2023, so those animals
are already not building with OpenSSL enabled.

> 18 running some flavor of 1.0.2,

massasauga and snakefly run the ssl_passphrase_callback-check test, but none of
these run the ssl-check tests AFAICT, so we have very low coverage as it is.
The fact that very few animals run the ssl tests is a pet peeve of mine; it
would be nice if we could get broader coverage there.

Worth noting is that the OpenSSL check in configure.ac only reports what the
version of the OpenSSL binary in $PATH is, not which version of the library
that we build against (using --with-libs/--with-includes etc).

--
Daniel Gustafsson





Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-04-03 Thread Tom Lane
Alvaro Herrera  writes:
> Great, thanks for looking.  Pushed now, I'll be closing the commitfest
> entry shortly.

On my machine, headerscheck does not like this:

$ src/tools/pginclude/headerscheck --cplusplus
In file included from /tmp/headerscheck.4gTaW5/test.cpp:3:
./src/include/libpq/libpq-be-fe-helpers.h: In function 'char* 
libpqsrv_cancel(PGconn*, TimestampTz)':
./src/include/libpq/libpq-be-fe-helpers.h:393:10: warning: ISO C++ forbids 
converting a string constant to 'char*' [-Wwrite-strings]
   return "out of memory";
  ^~~
./src/include/libpq/libpq-be-fe-helpers.h:421:13: warning: ISO C++ forbids 
converting a string constant to 'char*' [-Wwrite-strings]
 error = "cancel request timed out";
 ^~

The second part of that could easily be fixed by declaring "error" as
"const char *".  As for the first part, can we redefine the whole
function as returning "const char *"?  (If not, this coding is very
questionable anyway.)

regards, tom lane




Re: Detoasting optionally to make Explain-Analyze less misleading

2024-04-03 Thread Tom Lane
Matthias van de Meent  writes:
> On Tue, 2 Apr 2024 at 17:47, Tom Lane  wrote:
>> IIUC, it's not possible to use the SERIALIZE option when explaining
>> CREATE TABLE AS, because you can't install the instrumentation tuple
>> receiver when the IntoRel one is needed.  I think that's fine because
>> no serialization would happen in that case anyway, but should we
>> throw an error if that combination is requested?  Blindly reporting
>> that zero bytes were serialized seems like it'd confuse people.

> I think it's easily explained as no rows were transferred to the
> client. If there is actual confusion, we can document it, but
> confusing disk with network is not a case I'd protect against. See
> also: EXPLAIN (ANALYZE, SERIALIZE) INSERT without the RETURNING
> clause.

Fair enough.  There were a couple of spots in the code where I thought
this was important to comment about, though.

>> However, isn't the bottom half of serializeAnalyzeStartup doing
>> exactly what the comment above it says we don't do, that is accounting
>> for the RowDescription message?  Frankly I agree with the comment that
>> it's not worth troubling over, so I'd just drop that code rather than
>> finding a solution for the magic-number problem.

> I'm not sure I agree with not including the size of RowDescription
> itself though, as wide results can have a very large RowDescription
> overhead; up to several times the returned data in cases where few
> rows are returned.

Meh --- if we're rounding off to kilobytes, you're unlikely to see it.
In any case, if we start counting overhead messages, where shall we
stop?  Should we count the eventual CommandComplete or ReadyForQuery,
for instance?  I'm content to say that this measures data only; that
seems to jibe with other aspects of EXPLAIN's behavior.

> Removed. I've instead added buffer usage, as I realised that wasn't
> covered yet, and quite important to detect excessive detoasting (it's
> not covered at the top-level scan).

Duh, good catch.

I've pushed this after a good deal of cosmetic polishing -- for
example, I spent some effort on making serializeAnalyzeReceive
look as much like printtup as possible, in hopes of making it
easier to keep the two functions in sync in future.

regards, tom lane




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Tom Lane
Daniel Gustafsson  writes:
> On 3 Apr 2024, at 19:38, Tom Lane  wrote:
>> Bottom line for me is that pulling 1.0.1 support now is OK,
>> but I think pulling 1.0.2 is premature.

> Is Red Hat building and shipping v17 packages for RHEL7 ELS customers?  If
> not then it seems mostly academic to tie our dependencies to RHEL ELS unless
> I'm missing something.

True, they won't be doing that, and neither will Devrim.  So maybe
we can leave RHEL7 out of the discussion, in which case there's
not a lot of reason to keep 1.0.2 support.  We'll need to notify
buildfarm owners to adjust their configurations.

regards, tom lane




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Daniel Gustafsson
> On 3 Apr 2024, at 21:48, Andrew Dunstan  wrote:
> On 2024-04-03 We 15:12, Daniel Gustafsson wrote:
>>   The
>> fact that very few animals run the ssl tests is a pet peeve of mine, it would
>> be nice if we could get broader coverage there.
> 
> Well, the only reason for that is that the SSL tests need to be listed in
> PG_TEST_EXTRA, and the only reason for that is that there's a possible
> hazard on multi-user servers. But I bet precious few buildfarm animals run
> in such an environment. Mine don't - I'm the only user.
> 
> Maybe we could send out an email to the buildfarm-owners list asking people 
> to consider allowing the ssl tests.

I think that sounds like a good idea.

--
Daniel Gustafsson





Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Peter Eisentraut

On 30.03.24 22:27, Thomas Munro wrote:
> On Sun, Mar 31, 2024 at 9:59 AM Tom Lane  wrote:
>> Thomas Munro  writes:
>>> I was reminded of this thread by ambient security paranoia.  As it
>>> stands, we require 1.0.2 (but we very much hope that package
>>> maintainers and others in control of builds don't decide to use it).
>>> Should we skip 1.1.1 and move to requiring 3 for v17?
>> 
>> I'd be kind of sad if I couldn't test SSL stuff anymore on my
>> primary workstation, which has
>> 
>> $ rpm -q openssl
>> openssl-1.1.1k-12.el8_9.x86_64
>> 
>> I think it's probably true that <=1.0.2 is not in any distro that
>> we still need to pay attention to, but I reject the contention
>> that RHEL8 is not in that set.
> 
> Hmm, OK so it doesn't have 3 available in parallel from base repos.
> But it's also about to reach end of "full support" in 2 months[1], so
> if we applied the policies we discussed in the LLVM-vacuuming thread
> (to wit: build farm - EOL'd OSes), then...  One question I'm unclear
> on is whether v17 will be packaged for RHEL8.

The rest of the thread talks about the end of support of RHEL 7, but
you are here talking about RHEL 8.  It is true that "full support" for
RHEL 8 ends in May 2024, but that is not the one we are tracking.  We
are tracking the 10-year one, which I suppose is now called
"maintenance support".

So if the above package list is correct, then we ought to keep
supporting OpenSSL 1.1.* until 2029.






Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Michael Paquier
On Wed, Apr 03, 2024 at 01:38:50PM -0400, Tom Lane wrote:
> The discussion we had last year concluded that we were OK with
> dropping 1.0.1 support when RHEL6 goes out of extended support
> (June 2024 per this thread, I didn't check it).  Seems like we
> should have the same policy for RHEL7.  Also, calling Photon 3
> dead because it went EOL three days ago seems over-hasty.

Yeah.  A bunch of users of Photon are VMware (or you could say
Broadcom) product appliances, and I'd suspect that quite a lot of them
rely on Photon 3 for their base OS image.  Upgrading that stuff is not
easy work in my experience because they need to cope with a bunch of
embedded services.

> Bottom line for me is that pulling 1.0.1 support now is OK,
> but I think pulling 1.0.2 is premature.

Yeah, I guess so.  At least that seems like the safest conclusion
currently here.  The build-time check on X509_get_signature_info()
would still be required.

I'd love being able to rip out the internal locking logic currently in
libpq as LibreSSL has traces of CRYPTO_lock(), as far as I've checked,
and we rely on its existence.
--
Michael




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-04-03 Thread Alvaro Herrera
On 2024-Apr-03, Dilip Kumar wrote:

> Yeah, we missed acquiring the bank lock w.r.t. intervening pages,
> thanks for reporting.  Your fix looks correct to me.

Thanks for the quick review!  And thanks to Alexander for the report.
Pushed the fix.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Alvaro Herrera
Hello Alexander,

On 2024-Apr-03, Alexander Korotkov wrote:

> Regarding the shmem data structure for LSN waiters.  I didn't pick
> LWLock or ConditionVariable, because I needed the ability to wake up
> only those waiters whose LSN is already replayed.  In my experience
> waking up a process is way slower than scanning a short flat array.

I agree, but I think that's unrelated to what I was saying, which is
just the patch I attach here.

> However, I agree that when the number of waiters is very high, a flat
> array may become a problem.  It seems that the pairing heap is not
> hard to use for shmem structures.  The only memory allocation call in
> pairingheap.c is in pairingheap_allocate().  So, it's only necessary to
> be able to initialize the pairing heap in place, and it will be fine
> for shmem.

Ok.

With the code as it stands today, everything in WaitLSNState apart from
the pairing heap is accessed without any locking.  I think this is at
least partly OK because each backend only accesses its own entry; but it
deserves a comment.  Or maybe something more, because WaitLSNSetLatches
does modify the entry for other backends.  (Admittedly, this can only
happen for backends that are already sleeping, and it only happens
with the lock acquired, so it's probably okay.  But clearly it deserves
a comment.)

Don't we need to WaitLSNCleanup() during error recovery or something?

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"(Andrew Morton)
From 4079dc6a6a6893055b32dee75f178b324bbaef77 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 3 Apr 2024 18:35:15 +0200
Subject: [PATCH] Use an LWLock instead of a spinlock in waitlsn.c

---
 src/backend/commands/waitlsn.c  | 15 +++
 src/backend/utils/activity/wait_event_names.txt |  1 +
 src/include/commands/waitlsn.h  |  5 +
 src/include/storage/lwlocklist.h|  1 +
 4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index 51a34d422e..a57b818a2d 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -58,7 +58,6 @@ WaitLSNShmemInit(void)
 			   );
 	if (!found)
 	{
-		SpinLockInit(&waitLSN->waitersHeapMutex);
		pg_atomic_init_u64(&waitLSN->minWaitedLSN, PG_UINT64_MAX);
		pairingheap_initialize(&waitLSN->waitersHeap, lsn_cmp, NULL);
		memset(&waitLSN->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
@@ -115,13 +114,13 @@ addLSNWaiter(XLogRecPtr lsn)
 	procInfo->procnum = MyProcNumber;
 	procInfo->waitLSN = lsn;
 
-	SpinLockAcquire(&waitLSN->waitersHeapMutex);
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
	pairingheap_add(&waitLSN->waitersHeap, &procInfo->phNode);
 	procInfo->inHeap = true;
 	updateMinWaitedLSN();
 
-	SpinLockRelease(&waitLSN->waitersHeapMutex);
+	LWLockRelease(WaitLSNLock);
 }
 
 /*
@@ -132,11 +131,11 @@ deleteLSNWaiter(void)
 {
	WaitLSNProcInfo *procInfo = &waitLSN->procInfos[MyProcNumber];
 
-	SpinLockAcquire(&waitLSN->waitersHeapMutex);
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
 	if (!procInfo->inHeap)
 	{
-		SpinLockRelease(&waitLSN->waitersHeapMutex);
+		LWLockRelease(WaitLSNLock);
 		return;
 	}
 
@@ -144,7 +143,7 @@ deleteLSNWaiter(void)
 	procInfo->inHeap = false;
 	updateMinWaitedLSN();
 
-	SpinLockRelease(&waitLSN->waitersHeapMutex);
+	LWLockRelease(WaitLSNLock);
 }
 
 /*
@@ -160,7 +159,7 @@ WaitLSNSetLatches(XLogRecPtr currentLSN)
 
 	wakeUpProcNums = palloc(sizeof(int) * MaxBackends);
 
-	SpinLockAcquire(&waitLSN->waitersHeapMutex);
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
 	/*
 	 * Iterate the pairing heap of waiting processes till we find LSN not yet
@@ -182,7 +181,7 @@ WaitLSNSetLatches(XLogRecPtr currentLSN)
 
 	updateMinWaitedLSN();
 
-	SpinLockRelease(&waitLSN->waitersHeapMutex);
+	LWLockRelease(WaitLSNLock);
 
 	/*
 	 * Set latches for processes, whose waited LSNs are already replayed. This
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0d288d6b3d..ec43a78d60 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -330,6 +330,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared pg_serial state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
index 0d80248682..b3d9eed64d 100644
--- a/src/include/commands/waitlsn.h
+++ b/src/include/commands/waitlsn.h
@@ -55,13 +55,10 @@ typedef struct WaitLSNState
 
 	/*
 	 * A pairing heap of waiting processes order by LSN values (least LSN is
-	 * on top).
+	 * on top).  Protected by WaitLSNLock.
 	 */
 	pairingheap 

Re: LogwrtResult contended spinlock

2024-04-03 Thread Alvaro Herrera
On 2024-Apr-03, Bharath Rupireddy wrote:

> On Wed, Apr 3, 2024 at 4:49 PM Alvaro Herrera  wrote:

> > So what I do in the attached 0001 is stop using the XLogwrtResult struct
> > in XLogCtl and replace it with separate Write and Flush values, and add
> > the macro XLogUpdateLocalLogwrtResult() that copies the values of Write
> > and Flush from the shared XLogCtl to the local variable given as macro
> > argument.  (I also added our idiomatic do {} while(0) to the macro
> > definition, for safety).  The new members are XLogCtl->logWriteResult
> > and XLogCtl->logFlushResult and as of 0001 are just XLogRecPtr, so
> > essentially identical semantics as the previous code.  No atomic access
> > yet!
> 
> +1.

> Also, I'm fine with renaming XLogUpdateLocalLogwrtResult() to
> RefreshXLogWriteResult().

Okay, I have pushed 0001 with the name change, will see about getting
the others in tomorrow.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



