Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Bharath Rupireddy
On Thu, Apr 4, 2024 at 10:48 AM Amit Kapila  wrote:
>
> The v33-0001 looks good to me. I have made minor changes in the
> comments/commit message and removed one part of the test which was a
> bit confusing and didn't seem to add much value. Let me know what you
> think of the attached?

Thanks for the changes. v34-0001 LGTM.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-03 Thread Amit Kapila
On Thu, Apr 4, 2024 at 10:53 AM Ajin Cherian  wrote:
>
> On Wed, Mar 6, 2024 at 1:29 AM Давыдов Виталий  
> wrote:
>>
>> In usual work, the subscription has two_phase = on. I have to change this 
>> option at catchup stage only, but this parameter can not be altered. There 
>> was a patch proposal in past to implement altering of two_phase option, but 
>> it was rejected. I think, the recreation of the subscription with two_phase 
>> = off will not work.
>>
>>
>
> The altering of two_phase was restricted because if there was a previously 
> prepared transaction on the subscriber when the two_phase was on, and then it 
> was turned off, the apply worker on the subscriber would re-apply the 
> transaction a second time and this might result in an inconsistent replica.
> Here's a patch that allows toggling two_phase option provided that there are 
> no pending uncommitted prepared transactions on the subscriber for that 
> subscription.
>

I think this would probably be better than the current situation but
can we think of a solution to allow toggling the value of two_phase
even when prepared transactions are present? Can you please summarize
the reason for the problems in doing that and the solutions, if any?

-- 
With Regards,
Amit Kapila.




Re: Fix for timestamp lag issue from emit_log_hook when GUC log_line_prefix has '%m'

2024-04-03 Thread Michael Paquier
On Wed, Apr 03, 2024 at 03:13:13PM +0900, Michael Paquier wrote:
> While looking again at that, there were two more comments that missed
> a refresh about JSON in get_formatted_log_time() and
> get_formatted_start_time().  It's been a few weeks since the last
> update, but I'll be able to wrap that tomorrow, updating these
> comments on the way.

And done with 2a217c371799, before the feature freeze.
--
Michael




Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-03 Thread Ajin Cherian
On Wed, Mar 6, 2024 at 1:29 AM Давыдов Виталий 
wrote:

> In usual work, the subscription has two_phase = on. I have to change this
> option at catchup stage only, but this parameter can not be altered. There
> was a patch proposal in past to implement altering of two_phase option, but
> it was rejected. I think, the recreation of the subscription with two_phase
> = off will not work.
>
>
>
The altering of two_phase was restricted because if there was a previously
prepared transaction on the subscriber when the two_phase was on, and then
it was turned off, the apply worker on the subscriber would re-apply the
transaction a second time and this might result in an inconsistent replica.
Here's a patch that allows toggling two_phase option provided that there
are no pending uncommitted prepared transactions on the subscriber for that
subscription.
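With the patch applied, the intended workflow for the catchup scenario
would presumably look roughly like this (syntax as proposed in the
attached patch, shown here as an assumption; the subscription name is
illustrative):

```sql
-- two_phase can only be changed while the subscription is disabled, and
-- the ALTER is rejected if prepared transactions are still pending for
-- this subscription.
ALTER SUBSCRIPTION sub DISABLE;
ALTER SUBSCRIPTION sub SET (two_phase = off);
ALTER SUBSCRIPTION sub ENABLE;
```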

Thanks to Kuroda-san for working on the patch.

regards,
Ajin Cherian
Fujitsu Australia


v1-0001-Allow-altering-of-two_phase-option-in-subscribers.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Amit Kapila
On Wed, Apr 3, 2024 at 8:28 PM Bharath Rupireddy
 wrote:
>
> On Wed, Apr 3, 2024 at 6:46 PM Bertrand Drouvot
>  wrote:
> >
> > Just one comment on v32-0001:
> >
> > +# Synced slot on the standby must get its own inactive_since.
> > +is( $standby1->safe_psql(
> > +   'postgres',
> > +   "SELECT '$inactive_since_on_primary'::timestamptz <= 
> > '$inactive_since_on_standby'::timestamptz AND
> > +   '$inactive_since_on_standby'::timestamptz <= 
> > '$slot_sync_time'::timestamptz;"
> > +   ),
> > +   "t",
> > +   'synchronized slot has got its own inactive_since');
> > +
> >
> > By using <= we are not testing that it must get its own inactive_since (as 
> > we
> > allow them to be equal in the test). I think we should just add some 
> > usleep()
> > where appropriate and deny equality during the tests on inactive_since.
>
> Thanks. It looks like we can ignore the equality in all of the
> inactive_since comparisons. IIUC, all the TAP tests do run with
> primary and standbys on the single BF animals. And, it looks like
> assigning the inactive_since timestamps to perl variables is giving
> the microseconds precision level
> (./tmp_check/log/regress_log_040_standby_failover_slots_sync:inactive_since
> 2024-04-03 14:30:09.691648+00). FWIW, we already have some TAP and SQL
> tests relying on stats_reset timestamps without equality. So, I've
> left the equality for the inactive_since tests.
>
> > Except for the above, v32-0001 LGTM.
>
> Thanks. Please see the attached v33-0001 patch after removing equality
> on inactive_since TAP tests.
>

The v33-0001 looks good to me. I have made minor changes in the
comments/commit message and removed one part of the test which was a
bit confusing and didn't seem to add much value. Let me know what you
think of the attached?

-- 
With Regards,
Amit Kapila.


v34-0001-Allow-synced-slots-to-have-their-inactive_since.patch
Description: Binary data


Re: Improve eviction algorithm in ReorderBuffer

2024-04-03 Thread Jeff Davis
On Thu, 2024-04-04 at 09:31 +0900, Masahiko Sawada wrote:
> IIUC, with your suggestion, sift_{up|down} needs to update the
> heap_index field as well. Does it mean that the caller needs to pass
> the address of heap_index down to sift_{up|down}?

I'm not sure quite how binaryheap should be changed. Bringing the heap
implementation into reorderbuffer.c would obviously work, but that
would be more code. Another option might be to make the API of
binaryheap look a little more like simplehash, where some #defines
control optional behavior and can tell the implementation where to find
fields in the structure.

Perhaps it's not worth the effort though, if performance is already
good enough?

Regards,
Jeff Davis





Re: CREATE TABLE creates a composite type corresponding to the table row, which is and is not there

2024-04-03 Thread Erik Wienhold
On 2024-04-04 03:29 +0200, David G. Johnston wrote:
> On Thu, Mar 28, 2024 at 8:02 PM Erik Wienhold  wrote:
> 
> > Thanks, that sounds better.  I incorporated that with some minor edits
> > in the attached v3.
> >
> 
> You added my missing ( but dropped the comma after "i.e."

Thanks, fixed in v4.  Looks like American English prefers that comma and
it's also more common in our docs.

-- 
Erik
From 8b29c5852762bacb637fab021a06b12ab5cd7f93 Mon Sep 17 00:00:00 2001
From: Erik Wienhold 
Date: Fri, 8 Mar 2024 04:21:56 +0100
Subject: [PATCH v4] Document that typed tables rely on CREATE TYPE

CREATE TABLE OF requires a stand-alone composite type.  Clarify that in
the error message.  Also reword the docs to better explain the
connection between created table and stand-alone composite type.

Reworded docs provided by David G. Johnston.
---
 doc/src/sgml/ref/create_table.sgml| 18 +-
 src/backend/commands/tablecmds.c  |  2 +-
 src/test/regress/expected/typed_table.out |  2 +-
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index dfb7822985..586ccb190b 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -249,19 +249,19 @@ WITH ( MODULUS numeric_literal, REM
 OF type_name
 
  
-  Creates a typed table, which takes its
-  structure from the specified composite type (name optionally
-  schema-qualified).  A typed table is tied to its type; for
-  example the table will be dropped if the type is dropped
-  (with DROP TYPE ... CASCADE).
+  Creates a typed table, which takes its structure
+  from an existing (name optionally schema-qualified) stand-alone composite
+  type (i.e., created using <xref linkend="sql-createtype"/>) though it
+  still produces a new composite type as well.  The table will have
+  a dependency on the referenced type such that cascaded alter and drop
+  actions on the type will propagate to the table.
  
 
  
-  When a typed table is created, then the data types of the
-  columns are determined by the underlying composite type and are
-  not specified by the CREATE TABLE command.
+  A typed table always has the same column names and data types as the
+  type it is derived from, and you cannot specify additional columns.
   But the CREATE TABLE command can add defaults
-  and constraints to the table and can specify storage parameters.
+  and constraints to the table, as well as specify storage parameters.
  
 

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 317b89f67c..d756d2b200 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6994,7 +6994,7 @@ check_of_type(HeapTuple typetuple)
if (!typeOk)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-errmsg("type %s is not a composite type",
+errmsg("type %s is not a stand-alone composite type",
format_type_be(typ->oid))));
 }
 
diff --git a/src/test/regress/expected/typed_table.out b/src/test/regress/expected/typed_table.out
index 2e47ecbcf5..745fbde811 100644
--- a/src/test/regress/expected/typed_table.out
+++ b/src/test/regress/expected/typed_table.out
@@ -89,7 +89,7 @@ drop cascades to function get_all_persons()
 drop cascades to table persons2
 drop cascades to table persons3
 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used
-ERROR:  type stuff is not a composite type
+ERROR:  type stuff is not a stand-alone composite type
 DROP TABLE stuff;
 -- implicit casting
 CREATE TYPE person_type AS (id int, name text);
-- 
2.44.0



Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Bharath Rupireddy
On Thu, Apr 4, 2024 at 9:42 AM Masahiko Sawada  wrote:
>
> @@ -1368,6 +1416,7 @@ ShutDownSlotSync(void)
> if (SlotSyncCtx->pid == InvalidPid)
> {
> SpinLockRelease(&SlotSyncCtx->mutex);
> +   update_synced_slots_inactive_since();
> return;
> }
> SpinLockRelease(&SlotSyncCtx->mutex);
> @@ -1400,6 +1449,8 @@ ShutDownSlotSync(void)
> }
>
> SpinLockRelease(&SlotSyncCtx->mutex);
> +
> +   update_synced_slots_inactive_since();
>  }
>
> Why do we want to update all synced slots' inactive_since values at
> shutdown in spite of updating the value every time when releasing the
> slot? It seems to contradict the fact that inactive_since is updated
> when releasing or restoring the slot.

It is to get the inactive_since right for the cases where the standby
is promoted without a restart similar to when a standby is promoted
with restart in which case the inactive_since is set to current time
in RestoreSlotFromDisk.

Imagine the slot is synced last time at time t1 and then a few hours
passed, the standby is promoted without a restart. If we don't set
inactive_since to current time in this case in ShutDownSlotSync, the
inactive timeout invalidation mechanism can kick in immediately.
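As a quick sanity check after promotion, the timestamps can be inspected
directly (inactive_since and synced are the columns this patch series
touches; the exact column set depends on the committed version):

```sql
-- On the promoted standby: synced slots should show a fresh
-- inactive_since, not the hours-old value from the last sync cycle.
SELECT slot_name, synced, inactive_since,
       now() - inactive_since AS idle_for
  FROM pg_replication_slots
 WHERE NOT temporary;
```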

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Masahiko Sawada
On Wed, Apr 3, 2024 at 11:58 PM Bharath Rupireddy
 wrote:
>
>
> Please find the attached v33 patches.

@@ -1368,6 +1416,7 @@ ShutDownSlotSync(void)
if (SlotSyncCtx->pid == InvalidPid)
{
SpinLockRelease(&SlotSyncCtx->mutex);
+   update_synced_slots_inactive_since();
return;
}
SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1400,6 +1449,8 @@ ShutDownSlotSync(void)
}

SpinLockRelease(&SlotSyncCtx->mutex);
+
+   update_synced_slots_inactive_since();
 }

Why do we want to update all synced slots' inactive_since values at
shutdown in spite of updating the value every time when releasing the
slot? It seems to contradict the fact that inactive_since is updated
when releasing or restoring the slot.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Streaming read-ready sequential scan code

2024-04-03 Thread David Rowley
On Thu, 4 Apr 2024 at 14:38, Melanie Plageman  wrote:
> Looking at it in the code, I am wondering if we should call
> heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear
> that it is prepping a page to be scanned. You choose whatever you
> think is best.

I ended up calling it heap_prepare_pagescan() as I started to think
prep/prepare should come first. I don't think it's perfect as the
intended meaning is heap_prepare_page_for_scanning_in_pagemode(), but
that's obviously too long.

I've pushed the v9-0001 with that rename done.

David




Re: Popcount optimization using AVX512

2024-04-03 Thread David Rowley
On Thu, 4 Apr 2024 at 11:50, Nathan Bossart  wrote:
> If we can verify this approach won't cause segfaults and can stomach the
> regression between 8 and 16 bytes, I'd happily pivot to this approach so
> that we can avoid the function call dance that I have in v25.
>
> Thoughts?

If we're worried about regressions with some narrow range of byte
values, wouldn't it make more sense to compare that to cc4826dd5~1 at
the latest rather than to some version that's already probably faster
than PG16?

David




Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-04-03 Thread Thomas Munro
On Fri, Mar 8, 2024 at 6:20 AM Maxim Orlov  wrote:
> Quite an interesting patch, in my opinion.  I've decided to work on it a bit, 
> did some refactoring (sorry) and add
> basic tests. Also, I try to take into account as much as possible notes on 
> the patch, mentioned by Cédric Villemain.

Thanks!  Unfortunately I don't think it's possible to include a
regression test that looks at the output, because it'd be
non-deterministic.  Any other backend could pin or dirty the buffer
you try to evict, changing the behaviour.

> > and maybe better to go with FlushOneBuffer() ?
> It's a good idea, but I'm not sure at the moment.  I'll try to dig some 
> deeper into it.  At least, FlushOneBuffer does
> not work for local buffers.  So, we have to decide whether 
> pg_buffercache_invalidate should or should not
> work for local buffers.  For now, I don't see why it should not.  Maybe I 
> miss something?

I think it's OK to ignore local buffers for now.  pg_buffercache
generally doesn't support/show them so I don't feel inclined to
support them for this.  I removed a few traces of local support.

It didn't seem appropriate to use the pg_monitor role for this, so I
made it superuser-only.  I don't think it makes much sense to use this
on any kind of production system so I don't think we need a new role
for it, and existing roles don't seem too appropriate.  pageinspect et
al use the same approach.

I added a VOLATILE qualifier to the function.

I added some documentation.

I changed the name to pg_buffercache_evict().

I got rid of the 'force' flag which was used to say 'I only want to
evict this buffer if it is clean'.  I don't really see the point in that,
we might as well keep it simple.  You could filter buffers on
"isdirty" if you want.

I added comments to scare anyone off using EvictBuffer() for anything
much, and marking it as something for developer convenience.  (I am
aware of an experimental patch that uses this same function as part of
a buffer pool resizing operation, but that has other infrastructure to
make that safe and would adjust those remarks accordingly.)

I wondered whether it should really be testing for  BM_TAG_VALID
rather than BM_VALID.  Arguably, but it doesn't seem important for
now.  The distinction would arise if someone had tried to read in a
buffer, got an I/O error and abandoned ship, leaving a buffer with a
valid tag but not valid contents.  Anyone who tries to ReadBuffer() it
will then try to read it again, but in the meantime this function
won't be able to evict it (it'll just return false).  Doesn't seem
that obvious to me that this obscure case needs to be handled.  That
doesn't happen *during* a non-error case, because then it's pinned and
we already return false in this code for pins.

I contemplated whether InvalidateBuffer() or InvalidateVictimBuffer()
would be better here and realised that Andres's intuition was probably
right when he suggested the latter up-thread.  It is designed with the
right sort of arbitrary concurrent activity in mind, where the former
assumes things about locking and dropping, which could get us into
trouble if not now maybe in the future.

I ran the following diabolical buffer blaster loop while repeatedly
running installcheck:

do
$$
begin
  loop
perform pg_buffercache_evict(bufferid)
   from pg_buffercache
  where random() <= 0.25;
  end loop;
End;
$$;

The only ill-effect was a hot laptop.

Thoughts, objections, etc?

Very simple example of use:

create or replace function uncache_relation(name text)
returns boolean
begin atomic;
  select bool_and(pg_buffercache_evict(bufferid))
from pg_buffercache
   where reldatabase = (select oid
  from pg_database
 where datname = current_database())
 and relfilenode = pg_relation_filenode(name);
end;

More interesting for those of us hacking on streaming I/O stuff was
the ability to evict just parts of things and see how the I/O merging
and I/O depth react.


v5-0001-Add-pg_buffercache_evict-function-for-testing.patch
Description: Binary data


Re: Is it safe to cache data by GiST consistent function

2024-04-03 Thread Michał Kłeczek


> On 3 Apr 2024, at 19:02, Tom Lane  wrote:
> 
> Michał Kłeczek  writes:
> 
>> pg_trgm consistent caches trigrams but it has some logic to make sure cached 
>> values are recalculated:
> 
>>  cache = (gtrgm_consistent_cache *) fcinfo->flinfo->fn_extra;
>>  if (cache == NULL ||
>>  cache->strategy != strategy ||
>>  VARSIZE(cache->query) != querysize ||
>>  memcmp((char *) cache->query, (char *) query, querysize) != 0)
> 
>> What I don’t understand is if it is necessary or it is enough to check 
>> fn_extra==NULL.
> 
> Ah, I didn't think to search contrib.  Yes, you need to validate the
> cache entry.  In this example, a rescan could insert a new query
> value.  In general, an opclass support function could get called using
> a pretty long-lived FunctionCallInfo (e.g. one in the index's relcache
> entry), so it's unwise to assume that cached data is relevant to the
> current call without checking.

This actually sounds scary - looks like there is no way to perform cache 
clean-up after rescan then?

Do you think it might be useful to introduce a way for per-rescan caching (ie. 
setting up a dedicated memory context in gistrescan and passing it to support 
functions)?

—
Michal



Re: On disable_cost

2024-04-03 Thread David Rowley
On Thu, 4 Apr 2024 at 10:15, David Rowley  wrote:
> In short, I don't find it strange that disabling one node type results
> in considering another type that we'd otherwise not consider in cases
> where we assume that the disabled node type is always superior and
> should always be used when it is possible.

In addition to what I said earlier, I think the current
enable_indexonlyscan is implemented in a way that has the planner do
what it did before IOS was added.  I think that goal makes sense with
any patch that makes the planner try something new. We want to have
some method to get the previous behaviour for the cases where the
planner makes a dumb choice or to avoid some bug in the new feature.

I think using that logic, the current scenario with enable_indexscan
and enable_indexonlyscan makes complete sense. I mean, including
enable_indexscan=0 adding disable_cost to IOS Paths.

David




Re: Using the %m printf format more

2024-04-03 Thread Michael Paquier
On Fri, Mar 22, 2024 at 01:58:24PM +, Dagfinn Ilmari Mannsåker wrote:
> Here's an updated patch that adds such a comment.  I'll add it to the
> commitfest later today unless someone commits it before then.

I see no problem to do that now rather than later.  So, done to make
pg_regress able to use %m.
--
Michael




Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-03 Thread David Rowley
On Mon, 18 Mar 2024 at 15:50, Michael Paquier  wrote:
> Additionally, the RMT has set the feature freeze to be **April 8, 2024
> at 0:00 AoE** (see [1]).  This is the last time to commit features for
> PostgreSQL 17.  In other words, no new PostgreSQL 17 feature can be
> committed after April 8, 2024 at 0:00 AoE.  As mentioned last year in
> [2], this uses the "standard" feature freeze date/time.

Someone asked me about this, so thought it might be useful to post here.

To express this as UTC, it's:

postgres=# select '2024-04-08 00:00-12:00' at time zone 'UTC';
      timezone
---------------------
 2024-04-08 12:00:00

Or, time remaining, relative to now:

select '2024-04-08 00:00-12:00' - now();

David

> [1]: https://en.wikipedia.org/wiki/Anywhere_on_Earth
> [2]: 
> https://www.postgresql.org/message-id/9fbe60ec-fd1b-6ee0-240d-af7fc4442...@postgresql.org




Re: Refactoring of pg_resetwal/t/001_basic.pl

2024-04-03 Thread Michael Paquier
On Tue, Mar 26, 2024 at 02:53:35PM +0300, Svetlana Derevyanko wrote:
> What do you think?
>
> +use constant SLRU_PAGES_PER_SEGMENT => 32;

Well, I disagree with what you are doing here, adding a hardcoded
dependency between the test code and the backend code.  I would
suggest to use a more dynamic approach and retrieve such values
directly from the headers.  See scan_server_header() in
039_end_of_wal.pl as one example.  7b5275eec3a5 is newer than
bae868caf222, so the original commit could have used that, as well.
--
Michael




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-04-03 Thread David Rowley
On Thu, 4 Apr 2024 at 12:34, Michael Paquier  wrote:
> I've been re-reading the patch again to remember what this is about,
> and I'm OK with having this "path" column in the catalog.  However,
> I'm somewhat confused by the choice of having a temporary number that
> shows up in the catalog representation, because this may not be
> constant across multiple calls so this still requires a follow-up
> temporary ID <-> name mapping in any SQL querying this catalog.  A
> second thing is that array does not show the hierarchy of the path;
> the patch relies on the order of the elements in the output array
> instead.

My view on this is that there are a couple of things with the patch
which could be considered separately:

1. Should we have a context_id in the view?
2. Should we also have an array of all parents?

My view is that we really need #1 as there's currently no reliable way
to determine a context's parent as the names are not unique.   I do
see that Melih has mentioned this is temporary in:

+  
+   Current context id. Note that the context id is a temporary id and may
+   change in each invocation
+  

For #2, I'm a bit less sure about this. I know Andres would like to
see this array added, but equally WITH RECURSIVE would work.  Does the
array of parents completely eliminate the need for recursive queries?
I think the array works for anything that requires all parents or some
fixed (would be) recursive level, but there might be some other
condition to stop recursion other than the recursion level that
someone needs to do.What I'm trying to get at is; do we need to
document the WITH RECURSIVE stuff anyway? and if we do, is it still
worth having the parents array?
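For comparison, the kind of query that needs WITH RECURSIVE if only an id
and parent id are exposed (the column names context_id/parent_id follow
this discussion and are assumptions, not the committed schema; note the
recursion can stop on an arbitrary predicate, not just a depth):

```sql
WITH RECURSIVE ancestry AS (
    SELECT name, context_id, parent_id, total_bytes, 1 AS dist
      FROM pg_backend_memory_contexts
     WHERE name = 'CacheMemoryContext'
    UNION ALL
    SELECT c.name, c.context_id, c.parent_id, c.total_bytes, a.dist + 1
      FROM pg_backend_memory_contexts c
      JOIN ancestry a ON c.context_id = a.parent_id
     WHERE c.total_bytes > 8192   -- arbitrary non-depth stop condition
)
SELECT * FROM ancestry;
```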

David




Re: Add new error_action COPY ON_ERROR "log"

2024-04-03 Thread Michael Paquier
On Tue, Apr 02, 2024 at 09:53:57AM +0900, Masahiko Sawada wrote:
> Pushed.

Thanks for following up with this thread.
--
Michael




Re: Streaming read-ready sequential scan code

2024-04-03 Thread Melanie Plageman
On Wed, Apr 3, 2024 at 9:08 PM David Rowley  wrote:
>
> On Thu, 4 Apr 2024 at 06:03, Melanie Plageman  
> wrote:
> > Attached v8 is rebased over current master (which now has the
> > streaming read API).
>
> I've looked at the v8-0001 patch.

Thanks for taking a look!

> I wasn't too keen on heapbuildvis() as a function name for an external
> function.  Since it also does pruning work, it seemed weird to make it
> sound like it only did visibility work.  Per our offline discussion
> about names, I've changed it to what you suggested which is
> heap_page_prep().

Looking at it in the code, I am wondering if we should call
heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear
that it is prepping a page to be scanned. You choose whatever you
think is best.

> Aside from that, there was an outdated comment: "In pageatatime mode,
> heapgetpage() already did visibility checks," which was no longer true
> as that's done in heapbuildvis() (now heap_page_prep()).
>
> I also did a round of comment adjustments as there were a few things I
> didn't like, e.g:
>
> + * heapfetchbuf - subroutine for heapgettup()
>
> This is also used in heapgettup_pagemode(), so I thought it was a bad
> idea for a function to list places it thinks it's being called.   I
> also thought it redundant to write "This routine" in the function head
> comment. I think "this routine" is implied by the context. I ended up
> with:
>
> /*
>  * heapfetchbuf - read and pin the given MAIN_FORKNUM block number.
>  *
>  * Read the specified block of the scan relation into a buffer and pin that
>  * buffer before saving it in the scan descriptor.
>  */
>
> I'm happy with your changes to heapam_scan_sample_next_block(). I did
> adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively
> the same as the seqscan version, just with "seqscan" swapped for
> "sample scan".

That all is fine with me.

> The only other thing I adjusted there was to use "blockno" in some
> places where you were using hscan->rs_cblock.  These all come after
> the "hscan->rs_cblock = blockno;" line. My thoughts here are that this
> is more likely to avoid reading the value from the struct again if the
> compiler isn't happy that the two values are still equivalent for some
> reason.  Even if the compiler is happy today, it would only take a
> code change to pass hscan to some external function for the compiler
> to perhaps be uncertain if that function has made an adjustment to
> rs_cblock and go with the safe option of pulling the value out the
> struct again which is a little more expensive as it requires some
> maths to figure out the offset.
>
> I've attached v9-0001 and a delta of just my changes from v8.

All sounds good and LGTM

- Melanie




Re: CREATE TABLE creates a composite type corresponding to the table row, which is and is not there

2024-04-03 Thread David G. Johnston
On Thu, Mar 28, 2024 at 8:02 PM Erik Wienhold  wrote:

> Thanks, that sounds better.  I incorporated that with some minor edits
> in the attached v3.
>

Looks good.

You added my missing ( but dropped the comma after "i.e."

diff --git a/doc/src/sgml/ref/create_table.sgml
b/doc/src/sgml/ref/create_table.sgml
index dc69a3f5dc..b2e9e97b93 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -251,7 +251,7 @@ WITH ( MODULUS numeric_literal, REM
  
   Creates a typed table, which takes its
structure
   from an existing (name optionally schema-qualified) stand-alone
composite
-  type (i.e. created using <xref linkend="sql-createtype"/>) though it
+  type (i.e., created using <xref linkend="sql-createtype"/>) though it
   still produces a new composite type as well.  The table will have
   a dependency on the referenced type such that cascaded alter and drop
   actions on the type will propagate to the table.

David J.


Re: Statistics Import and Export

2024-04-03 Thread Michael Paquier
On Mon, Apr 01, 2024 at 01:21:53PM -0400, Tom Lane wrote:
> I'm not sure.  I think if we put our heads down we could finish
> the changes I'm suggesting and resolve the other issues this week.
> However, it is starting to feel like the sort of large, barely-ready
> patch that we often regret cramming in at the last minute.  Maybe
> we should agree that the first v18 CF would be a better time to
> commit it.

There are still 4 days remaining, so there's still time, but my
overall experience on the matter with my RMT hat on is telling me that
we should not rush this patch set.  Redesigning portions close to the
end of a dev cycle is not a good sign, I am afraid, especially if the
sub-parts of the design don't fit well in the global picture as that
could mean more maintenance work on stable branches in the long term.
Still, it is very good to be aware of the problems because you'd know
what to tackle to reach the goals of this patch set.
--
Michael




Re: Streaming read-ready sequential scan code

2024-04-03 Thread David Rowley
On Thu, 4 Apr 2024 at 06:03, Melanie Plageman  wrote:
> Attached v8 is rebased over current master (which now has the
> streaming read API).

I've looked at the v8-0001 patch.

I wasn't too keen on heapbuildvis() as a function name for an external
function.  Since it also does pruning work, it seemed weird to make it
sound like it only did visibility work.  Per our offline discussion
about names, I've changed it to what you suggested which is
heap_page_prep().

Aside from that, there was an outdated comment: "In pageatatime mode,
heapgetpage() already did visibility checks," which was no longer true
as that's done in heapbuildvis() (now heap_page_prep()).

I also did a round of comment adjustments as there were a few things I
didn't like, e.g:

+ * heapfetchbuf - subroutine for heapgettup()

This is also used in heapgettup_pagemode(), so I thought it was a bad
idea for a function to list places it thinks it's being called.   I
also thought it redundant to write "This routine" in the function head
comment. I think "this routine" is implied by the context. I ended up
with:

/*
 * heapfetchbuf - read and pin the given MAIN_FORKNUM block number.
 *
 * Read the specified block of the scan relation into a buffer and pin that
 * buffer before saving it in the scan descriptor.
 */

I'm happy with your changes to heapam_scan_sample_next_block(). I did
adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively
the same as the seqscan version, just with "seqscan" swapped for
"sample scan".

The only other thing I adjusted there was to use "blockno" in some
places where you were using hscan->rs_cblock.  These all come after
the "hscan->rs_cblock = blockno;" line. My thoughts here are that this
is more likely to avoid reading the value from the struct again if the
compiler isn't happy that the two values are still equivalent for some
reason.  Even if the compiler is happy today, it would only take a
code change to pass hscan to some external function for the compiler
to perhaps be uncertain if that function has made an adjustment to
rs_cblock and go with the safe option of pulling the value out the
struct again which is a little more expensive as it requires some
maths to figure out the offset.

I've attached v9-0001 and a delta of just my changes from v8.

David
From 90bfc63097c556d0921d8f9165731fb07ec04167 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sat, 27 Jan 2024 18:39:37 -0500
Subject: [PATCH v9] Split heapgetpage into two parts

heapgetpage(), a per-block utility function used in heap scans, read a
passed-in block into a buffer and then, if page-at-a-time processing was
enabled, pruned the page and built an array of its visible tuples. This
was used for sequential and sample scans.

Future commits will add support for read streams. The streaming read API
will read in the blocks specified by a callback, but any significant
per-page processing should be done synchronously on the buffer yielded
by the streaming read API. To support this, separate the logic in
heapgetpage() to read a block into a buffer from that which prunes the
page and builds an array of the visible tuples. The former is now
heapfetchbuf() and the latter is now heapbuildvis().

Future commits will push the logic for selecting the next block into
heapfetchbuf() in cases when streaming reads are not supported (such as
backwards sequential scans). Because this logic differs for sample scans
and sequential scans, inline the code to read the block into a buffer
for sample scans.

This has the added benefit of allowing for a bit of refactoring in
heapam_scan_sample_next_block(), including unpinning the previous buffer
before invoking the callback to select the next block.
---
 src/backend/access/heap/heapam.c | 78 ++--
 src/backend/access/heap/heapam_handler.c | 38 
 src/include/access/heapam.h  |  2 +-
 3 files changed, 74 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a9d5b109a5..6524fc44a5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -360,17 +360,17 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * heapgetpage - subroutine for heapgettup()
+ * heap_page_prep - Prepare the current scan page to be scanned in pagemode
  *
- * This routine reads and pins the specified page of the relation.
- * In page-at-a-time mode it performs additional work, namely determining
- * which tuples on the page are visible.
+ * Preparation currently consists of 1. pruning the scan's rs_cbuf page, and 2.
+ * filling the rs_vistuples array with the OffsetNumbers of visible tuples.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heap_page_prep(TableScanDesc sscan)
 {
HeapScanDesc scan = (HeapScanDesc) sscan;
-   Buffer  buffer;
+   Buffer  buffer = scan->rs_cbuf;
+   BlockNumber block = 

Re: Allow non-superuser to cancel superuser tasks.

2024-04-03 Thread Michael Paquier
On Tue, Apr 02, 2024 at 04:35:28PM +0500, Andrey M. Borodin wrote:
> We can add tests just like [0] with injection points.
> I mean replace that "sleep 1" with something like
> "$node->wait_for_event('autovacuum worker', 'autovacuum-running');".
> Currently we have no infrastructure to wait for autovacuum of
> particular table, but I think it's doable.
> Also I do not like that test is changing system-wide autovac
> settings, AFAIR these settings can be set for particular table.

Yeah, hardcoded sleeps are not really acceptable.  On fast machines
they eat into the global runtime, making the whole run slower and impacting the CI.
On slow machines, that's not going to be stable and we have a lot of
buildfarm animals starved on CPU, like the ones running valgrind or
just because their environment is slow (one of my animals runs on a
RPI, for example).  Note that slow machines have a lot of value
because they're usually better at catching race conditions.  Injection
points would indeed make the tests more deterministic by controlling
the waits and wakeups you'd like to have in the patch's tests.

eeefd4280f6e would be a good example of how to implement a test.
--
Michael


signature.asc
Description: PGP signature


Re: Reports on obsolete Postgres versions

2024-04-03 Thread David G. Johnston
On Tue, Apr 2, 2024 at 1:47 PM Bruce Momjian  wrote:

> On Tue, Apr  2, 2024 at 11:34:46AM +0200, Magnus Hagander wrote:
>
> Okay, I changed "superseded" to "old", and changed "latest" to
> "current", patch attached.
>
>
I took a pass at this and found a few items of note.  Changes on top of
Bruce's patch.

diff --git a/templates/support/versioning.html b/templates/support/versioning.html
index 0ed79f4f..d4762967 100644
--- a/templates/support/versioning.html
+++ b/templates/support/versioning.html
@@ -21,9 +21,9 @@ a release available outside of the minor release roadmap.

 
 The PostgreSQL Global Development Group supports a major version for 5
years
-after its initial release. After its five year anniversary, a major
version will
-have one last minor release containing any fixes and will be considered
-end-of-life (EOL) and no longer supported.
+after its initial release. After its fifth anniversary, a major version
will
+have one last minor release and will be considered
+end-of-life (EOL), meaning no new bug fixes will be written for it.
 

# "fifth anniversary" seems more correct than "five year anniversary".
# The fact it gets a minor release implies it receives fixes.
# I've always had an issue with our use of the phrasing "no longer
supported".  It seems vague when practically it just means we stop writing
patches for it.  Whether individual community members refrain from
answering questions or helping people on these older releases is not a
project decision but a personal one.  Also, since we already say it is
supported for 5 years it seems a bit redundant to then also say "after 5
years it is unsupported".


 Version Numbering
@@ -45,11 +45,12 @@ number, e.g. 9.5.3 to 9.5.4.
 Upgrading

 
-Major versions usually change the internal format of the system tables.
-These changes are often complex, so we do not maintain backward
-compatibility of all stored data.  A dump/reload of the database or use of
the
-pg_upgrade module is required
-for major upgrades. We also recommend reading the
+Major versions need the data directory to be initialized so that the
system tables
+specific to that version can be created.  There are two options to then
+migrate the user data from the old directory to the new one: a dump/reload
+of the database or using the
+pg_upgrade module.
+We also recommend reading the
 upgrading section of the major
 version you are planning to upgrade to. You can upgrade from one major
version
 to another without upgrading to intervening versions, but we recommend
reading
@@ -58,14 +59,15 @@ versions prior to doing so.
 

# This still talked about "stored data" when really that isn't the main
concern and if it was pg_upgrade wouldn't work as an option.
# The choice to say "data directory" here seems reasonable if arguable.
But it implies the location of user data and we state it also contains
version-specific system tables.  It can go unsaid that they are
version-specific because the collection as a whole and the layout of
individual tables can and almost certainly will change between versions.

 
-Minor release upgrades do not require a dump and restore;  you simply stop
+Minor release upgrades do not impact the data directory, only the
binaries.
+You simply stop
 the database server, install the updated binaries, and restart the server.
-Such upgrades might require manual changes to complete so always read
+However, some upgrades might require manual changes to complete so always
read
 the release notes first.
 

# The fact minor releases don't require dump/reload doesn't directly
preclude them from needing pg_upgrade and writing "Such upgrades" seems
like it could lead someone to think that.
# Data Directory seems generic enough to be understood here and since I
mention it in the Major Version as to why data migration is needed,
mentioning here
# as something not directly altered and thus precluding the data migration
has a nice symmetry.  The potential need for manual changes becomes clearer
as well.


 
-Minor releases only fix frequently-encountered bugs, security issues, and data corruption
 problems, so such upgrades are very low risk.  The community
 considers performing minor upgrades to be less risky than continuing to

# Reality mostly boils down to trusting our judgement when it comes to bugs
as each one is evaluated individually.  We do not promise to leave simply
buggy behavior alone in minor releases which is the only policy that would
nearly eliminate upgrade risk.  We back-patch 5 year old bugs quite often
which by definition are not frequently encountered.  I cannot think of a
good adjective to put there nor does one seem necessary even if I agree it
reads a bit odd otherwise.  I think that has more to do with this being
just the wrong structure to get our point across.  Can we pick out some
statistics from our long history of publishing minor releases to present an
objective reality to the reader to convince them to trust our track-record
in this matter?

David J.


Re: Silence Meson warning on HEAD

2024-04-03 Thread Michael Paquier
On Mon, Apr 01, 2024 at 06:46:01PM -0500, Tristan Partin wrote:
> 1. version-check.diff
> 
> Wrap the offending line in a Meson version check.
> 
> 2. perl-string.diff
> 
> Pass the perl program as a string via its .path() method.
> 
> 3. meson-bump.diff
> 
> Bump the minimum meson version from 0.54 to 0.55, at least.

Among these three options, 2) seems like the most appealing to me at
this stage of the development cycle.  Potential breakages and more
backpatch fuzz is never really appealing even if meson is a rather new
thing, particularly considering that there is a simple solution that
does not require a minimal version bump.

> Seems like as good an opportunity to bump Meson to 0.56 for Postgres 17,
> which I have found to exist in the EPEL for RHEL 7. I don't know what
> version exists in the base repo. Perhaps it is 0.55, which would still get
> rid of the aforementioned warning.

Perhaps we could shave even more versions for PG18 if it helps with
future development?
--
Michael


signature.asc
Description: PGP signature


Re: Allow non-superuser to cancel superuser tasks.

2024-04-03 Thread Leung, Anthony
Update - the condition should be && 

if (pgstat_get_backend_type(proc->backendId) == B_AUTOVAC_WORKER)
{
    if (!has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_AUTOVACUUM) &&
        !superuser())
return SIGNAL_BACKEND_NOAUTOVACUUM;
}

Thanks
--
Anthony Leung
Amazon Web Services: https://aws.amazon.com



Re: Improve eviction algorithm in ReorderBuffer

2024-04-03 Thread Masahiko Sawada
Hi,

On Thu, Apr 4, 2024 at 2:32 AM Jeff Davis  wrote:
>
> On Wed, 2024-04-03 at 01:45 -0700, Jeff Davis wrote:
> > I suggest that you add a "heap_index" field to ReorderBufferTXN that
> > would point to the index into the heap's array (the same as
> > bh_nodeidx_entry.index in your patch). Each time an element moves
> > within the heap array, just follow the pointer to the
> > ReorderBufferTXN
> > object and update the heap_index -- no hash lookup required.
>
> It looks like my email was slightly too late, as the work was already
> committed.

Thank you for the suggestions! I should have informed it earlier.

>
> My suggestion is not required for 17, and so it's fine if this waits
> until the next CF. If it turns out to be a win we can consider
> backporting to 17 just to keep the code consistent, otherwise it can go
> in 18.

IIUC, with your suggestion, sift_{up|down} needs to update the
heap_index field as well. Does it mean that the caller needs to pass
the address of heap_index down to sift_{up|down}?
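For illustration, one way to avoid passing the address down is to update the back-pointer through the stored element pointer every time an element is placed, so sift_{up|down} never needs a separate heap_index argument (a hypothetical sketch, not the committed binaryheap code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element with a back-pointer into the heap array. */
typedef struct elem
{
	int			key;
	int			heap_index;		/* position of this elem in nodes[] */
} elem;

typedef struct binheap
{
	elem	  **nodes;
	int			size;
} binheap;

/* Place "e" at position "pos" and keep its back-pointer current. */
static void
bh_set(binheap *h, int pos, elem *e)
{
	h->nodes[pos] = e;
	e->heap_index = pos;
}

/* Min-heap sift-down; no address of heap_index is passed in. */
static void
bh_sift_down(binheap *h, int pos)
{
	elem	   *e = h->nodes[pos];

	for (;;)
	{
		int			child = 2 * pos + 1;

		if (child >= h->size)
			break;
		if (child + 1 < h->size &&
			h->nodes[child + 1]->key < h->nodes[child]->key)
			child++;
		if (h->nodes[child]->key >= e->key)
			break;
		bh_set(h, pos, h->nodes[child]);
		pos = child;
	}
	bh_set(h, pos, e);
}
```

Since every move goes through bh_set(), the back-pointer stays consistent without a hash lookup or an extra out-parameter.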

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Allow non-superuser to cancel superuser tasks.

2024-04-03 Thread Leung, Anthony
> I don't think we should rely on !OidIsValid(proc->roleId) for signaling
> autovacuum workers.  That might not always be true, and I don't see any
> need to rely on that otherwise.  IMHO we should just add a few lines before
> the existing code, which doesn't need to be changed at all:
> 
>   if (pg_stat_is_backend_autovac_worker(proc->backendId) &&
>   !has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_AUTOVACUUM))
>   return SIGNAL_BACKEND_NOAUTOVACUUM;

I tried to add them above the existing code. When I test it locally, a user 
without pg_signal_autovacuum will actually fail at this block because the user 
is not superuser and !OidIsValid(proc->roleId) is also true in the following:

/*
 * Only allow superusers to signal superuser-owned backends.  Any 
process
 * not advertising a role might have the importance of a superuser-owned
 * backend, so treat it that way.
 */
if ((!OidIsValid(proc->roleId) || superuser_arg(proc->roleId)) &&
!superuser())
return SIGNAL_BACKEND_NOSUPERUSER;

This is what I'm planning to do - if the backend is an autovacuum worker and the 
user is neither a superuser nor a member of the pg_signal_autovacuum role, we 
return the new value and provide the relevant error message:

	/*
	 * If the backend is an autovacuum worker, allow a user with privileges
	 * of the pg_signal_autovacuum role to signal the backend.
	 */
if (pgstat_get_backend_type(proc->backendId) == B_AUTOVAC_WORKER)
{
if (!has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_AUTOVACUUM) 
|| !superuser())
return SIGNAL_BACKEND_NOAUTOVACUUM;
}
/*
 * Only allow superusers to signal superuser-owned backends.  Any 
process
 * not advertising a role might have the importance of a superuser-owned
 * backend, so treat it that way.
*/
else if ((!OidIsValid(proc->roleId) || superuser_arg(proc->roleId)) &&
 !superuser())
{
return SIGNAL_BACKEND_NOSUPERUSER;
}
/* Users can signal backends they have role membership in. */
else if (!has_privs_of_role(GetUserId(), proc->roleId) &&
 !has_privs_of_role(GetUserId(), 
ROLE_PG_SIGNAL_BACKEND))
{
return SIGNAL_BACKEND_NOPERMISSION;
}


> We can add tests just like [0] with injection points.
> I mean replace that "sleep 1" with something like 
> "$node->wait_for_event('autovacuum worker', 'autovacuum-running');".
> Currently we have no infrastructure to wait for autovacuum of particular 
> table, but I think it's doable.
> Also I do not like that test is changing system-wide autovac settings, AFAIR 
> these settings can be set for particular table.

Thanks for the suggestion. I will take a look at this. Let me also separate the 
test into a different patch file.

--
Anthony Leung
Amazon Web Services: https://aws.amazon.com






Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Thomas Munro
On Thu, Apr 4, 2024 at 11:51 AM Peter Eisentraut  wrote:
> On 30.03.24 22:27, Thomas Munro wrote:
> > Hmm, OK so it doesn't have 3 available in parallel from base repos.
> > But it's also about to reach end of "full support" in 2 months[1], so
> > if we applied the policies we discussed in the LLVM-vacuuming thread
> > (to wit: build farm - EOL'd OSes), then...  One question I'm unclear
> > on is whether v17 will be packaged for RHEL8.
>
> The rest of the thread talks about the end of support of RHEL 7, but you
> are here talking about RHEL 8.   It is true that "full support" for RHEL
> 8 ended in May 2024, but that is not the one we are tracking.  We
> are tracking the 10-year one, which I suppose is now called "maintenance
> support".

I might have confused myself with the two EOLs and some wishful
thinking.  I am a lot less worked up about this general topic now that
RHEL has moved to "rolling" LLVM updates in minor releases, removing a
physical-pain-inducing 10-year vacuuming horizon (that's 20 LLVM major
releases and they only fix bugs in one...).  I will leave openssl
discussions to those more knowledgeable about that.

> So if the above package list is correct, then we ought to keep
> supporting openssl 1.1.* until 2029.

That's a shame.  But it sounds like the developer burden isn't so
different from 1.1.1 to 3.x, so maybe it's not such a big deal from
our point of view.  (I have no opinion on the security ramifications
of upstream's EOL, but as a layman it sounds completely bonkers to use
it.  I wonder why the packaging community wouldn't just arrange to
have a supported-by-upstream 3.x package in their RPM repo when they
supply the newest PostgreSQL versions for the oldest RHEL, but again
not my area so I'll shut up).




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-04-03 Thread Michael Paquier
On Wed, Apr 03, 2024 at 04:20:39PM +0300, Melih Mutlu wrote:
> On Wed, 14 Feb 2024 at 10:23, Michael Paquier wrote:
>> I was reading the patch, and using int[] as a representation of the
>> path of context IDs up to the top-most parent looks a bit strange to
>> me, with the relationship between each parent -> child being
>> preserved, visibly, based on the order of the elements in this array
>> made of temporary IDs compiled on-the-fly during the function
>> execution.  Am I the only one finding that a bit strange?  Could it be
>> better to use a different data type for this path and perhaps switch
>> to the names of the contexts involved?
> 
> Do you find having the path column strange all together? Or only using
> temporary IDs to generate that column? The reason why I avoid using context
> names is because there can be multiple contexts with the same name. This
> makes it difficult to figure out which context, among those with that
> particular name, is actually included in the path. I couldn't find any
> other information that is unique to each context.

I've been re-reading the patch again to remember what this is about,
and I'm OK with having this "path" column in the catalog.  However,
I'm somewhat confused by the choice of having a temporary number that
shows up in the catalog representation, because this may not be
constant across multiple calls so this still requires a follow-up
temporary ID <-> name mapping in any SQL querying this catalog.  A
second thing is that array does not show the hierarchy of the path;
the patch relies on the order of the elements in the output array
instead.
--
Michael


signature.asc
Description: PGP signature


Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Michael Paquier
On Wed, Apr 03, 2024 at 01:38:50PM -0400, Tom Lane wrote:
> The discussion we had last year concluded that we were OK with
> dropping 1.0.1 support when RHEL6 goes out of extended support
> (June 2024 per this thread, I didn't check it).  Seems like we
> should have the same policy for RHEL7.  Also, calling Photon 3
> dead because it went EOL three days ago seems over-hasty.

Yeah.  A lot of Photon deployments are VMware (or, you could say,
Broadcom) product appliances, and I'd suspect that quite a lot of them
rely on Photon 3 for their base OS image.  Upgrading that stuff is not
easy work in my experience because they need to cope with a bunch of
embedded services.

> Bottom line for me is that pulling 1.0.1 support now is OK,
> but I think pulling 1.0.2 is premature.

Yeah, I guess so.  At least that seems like the safest conclusion
currently here.  The build-time check on X509_get_signature_info()
would still be required.

I'd love being able to rip out the internal locking logic currently in
libpq as LibreSSL has traces of CRYPTO_lock(), as far as I've checked,
and we rely on its existence.
--
Michael


signature.asc
Description: PGP signature


Re: Built-in CTYPE provider

2024-04-03 Thread Jeff Davis
On Tue, 2024-03-26 at 08:14 +0100, Peter Eisentraut wrote:
> 
> Full vs. simple case mapping is more of a legacy compatibility
> question, 
> in my mind.  There is some expectation/precedent that C.UTF-8 uses 
> simple case mapping, but beyond that, I don't see a reason why
> someone 
> would want to explicitly opt for simple case mapping, other than if
> they 
> need length preservation or something, but if they need that, then
> they 
> are going to be in a world of pain in Unicode anyway.

I mostly agree, though there are some other purposes for the simple
mapping:

* a substitute for case folding: lower() with simple case mapping will
work better for that purpose than lower() with full case mapping (after
we have casefold(), this won't be a problem)

* simple case mapping is conceptually simpler, and that's a benefit by
itself in some situations -- maybe the 1:1 assumption exists other
places in their application

> > There's also another reason to consider it an argument rather than
> > a
> > collation property, which is that it might be dependent on some
> > other
> > field in a row. I could imagine someone wanting to do:
> > 
> >     SELECT
> >   UPPER(some_field,
> >     full => true,
> >     dotless_i => CASE other_field WHEN ...)
> >     FROM ...
> 
> Can you index this usefully?  It would only work if the user query 
> matches exactly this pattern?

In that example, UPPER is used in the target list -- the WHERE clause
might be indexable. The UPPER is just used for display purposes, and
may depend on some locale settings stored in another table associated
with a particular user.

Regards,
Jeff Davis





Re: Security lessons from liblzma - libsystemd

2024-04-03 Thread Peter Eisentraut

On 03.04.24 23:19, Magnus Hagander wrote:
When the code is this simple, we should definitely consider carrying it 
ourselves. At least if we don't expect to need *other* functionality 
from the same library in the future, which I doubt we will from libsystemd.


Well, I've long had it on my list to do some integration to log directly 
to the journal, so you can preserve metadata better.  I'm not sure right 
now whether this would use libsystemd, but it's not like there is 
absolutely no other systemd-related functionality that could be added.


Personally, I think this proposed change is trying to close a barndoor 
after a horse has bolted.  There are many more interesting and scary 
libraries in the dependency tree of "postgres", so just picking off one 
right now doesn't really accomplish anything.  The next release of 
libsystemd will drop all the compression libraries as hard dependencies, 
so the issue in that sense is gone anyway.  Also, fun fact: liblzma is 
also a dependency via libxml2.






Re: pg_combinebackup --copy-file-range

2024-04-03 Thread Tomas Vondra
Hi,

Here's a much more polished and cleaned up version of the patches,
fixing all the issues I've been aware of, and with various parts merged
into much more cohesive parts (instead of keeping them separate to make
the changes/evolution more obvious).

I decided to reorder the changes like this:

1) block alignment - As I said earlier, I think we should definitely do
this, even if only to make future improvements possible. After chatting
about this with Robert off-list a bit, he pointed out I actually forgot
to not align the headers for files without any blocks, so this version
fixes that.

2) add clone/copy_file_range for the case that copies whole files. This
is pretty simple, but it also adds the --clone/copy-file-range options,
and similar infrastructure. The one slightly annoying bit is that we now
have the ifdef stuff in two places - when parsing the options, and then
in the copy_file_XXX methods, and the pg_fatal() calls should not be
reachable in practice. But that doesn't seem harmful, and might be a
useful protection against someone calling function that does nothing.

This also merges the original two parts, where the first one only did
this cloning/CoW stuff when checksum did not need to be calculated, and
the second extended it to those cases too (by also reading the data, but
still doing the copy the old way).

I think this is the right way - if that's not desirable, it's easy to
either add --no-manifest or not use the CoW options. Or just not change
the checksum type. There's no other way.

3) add copy_file_range to write_reconstructed_file, by using roughly the
same logic/reasoning as (2). If --copy-file-range is specified and a
checksum should be calculated, the data is read for the checksum, but
the copy is done using copy_file_range.

I did rework the flow write_reconstructed_file() flow a bit, because
tracking what exactly needs to be read/written in each of the cases
(which copy method, zeroed block, checksum calculated) made the flow
really difficult to follow. Instead I introduced a function to
read/write a block, and call them from different places.

I think this is much better, and it also makes the following part
dealing with batches of blocks much easier / smaller change.

4) prefetching - This is mostly unrelated to the CoW stuff, but it has
tremendous benefits, especially for ZFS. I haven't been able to tune ZFS
to get decent performance, and ISTM it'd be rather unusable for backup
purposes without this.

5) batching in write_reconstructed_file - This benefits especially the
copy_file_range case, where I've seen it to yield massive speedups (see
the message from Monday for better data).

6) batching for prefetch - Similar logic to (5), but for fadvise. This
is the only part where I'm not entirely sure whether it actually helps
or not. Needs some more analysis, but I'm including it for completeness.
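For the copy paths in (2) and (3), the "try copy_file_range, fall back to a plain read/write loop" shape can be sketched in user space like this (a hypothetical helper assuming Linux, not the pg_combinebackup code itself):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Copy "len" bytes between two descriptors at the given offsets, trying
 * copy_file_range() first (which can use reflinks/CoW on filesystems
 * such as XFS, Btrfs and recent ZFS) and falling back to pread/pwrite
 * when the kernel or filesystem combination cannot do it.
 * Returns 0 on success, -1 on failure.
 */
static int
copy_range_with_fallback(int src_fd, off_t src_off,
						 int dst_fd, off_t dst_off, size_t len)
{
	while (len > 0)
	{
		ssize_t		nb = copy_file_range(src_fd, &src_off,
										 dst_fd, &dst_off, len, 0);

		if (nb > 0)
		{
			len -= (size_t) nb;
			continue;
		}
		if (nb == 0)
			return -1;			/* unexpected EOF on source */
		if (errno == EXDEV || errno == ENOSYS ||
			errno == EOPNOTSUPP || errno == EINVAL)
			break;				/* fall back to read/write below */
		return -1;
	}

	while (len > 0)
	{
		char		buf[8192];
		size_t		chunk = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t		nr = pread(src_fd, buf, chunk, src_off);

		if (nr <= 0)
			return -1;
		if (pwrite(dst_fd, buf, (size_t) nr, dst_off) != nr)
			return -1;
		src_off += nr;
		dst_off += nr;
		len -= (size_t) nr;
	}
	return 0;
}
```

Because copy_file_range() only advances the offsets on success, the fallback loop can pick up exactly where the fast path stopped.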


I do think the parts (1)-(4) are in pretty good shape, good enough to
make them committable in a day or two. I see it mostly a matter of
testing and comment improvements rather than code changes.

(5) is in pretty good shape too, but I'd like to maybe simplify and
refine the write_reconstructed_file changes a bit more. I don't think
it's broken, but it feels a bit too cumbersome.

Not sure about (6) yet.

I changed how I think about this a bit - I don't really see the CoW copy
methods as necessarily faster than the regular copy (even though they can be
with (5)). I think the main benefits are the space savings, enabled by
patches (1)-(3). If (4) and (5) get in, that's a bonus, but even without
that I don't think the performance is an issue - everything has a cost.


On 4/3/24 15:39, Jakub Wartak wrote:
> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
>  wrote:
>>
>> Hi,
>>
>> I've been running some benchmarks and experimenting with various stuff,
>> trying to improve the poor performance on ZFS, and the regression on XFS
>> when using copy_file_range. And oh boy, did I find interesting stuff ...
> 
> [..]
> 
> Congratulations on great results!
> 
>> 4) after each pg_combinebackup run to pg_verifybackup, start the cluster
>> to finish recovery, run pg_checksums --check (to check the patches don't
>> produce something broken)
> 
> I've performed some follow-up small testing on all patches mentioned
> here  (1..7), with the earlier developed nano-incremental-backup-tests
> that helped detect some issues for Robert earlier during original
> development. They all went fine in both cases:
> - no special options when using pg_combinebackup
> - using pg_combinebackup --copy-file-range --manifest-checksums=NONE
> 
> Those were:
> test_across_wallevelminimal.sh
> test_full_pri__incr_stby__restore_on_pri.sh
> test_full_pri__incr_stby__restore_on_stby.sh
> test_full_stby__incr_stby__restore_on_pri.sh
> test_full_stby__incr_stby__restore_on_stby.sh
> test_incr_after_timelineincrease.sh
> test_incr_on_standby_after_promote.sh
> test_many_incrementals_dbcreate_duplicateOID.sh
> 

Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Peter Eisentraut

On 30.03.24 22:27, Thomas Munro wrote:

On Sun, Mar 31, 2024 at 9:59 AM Tom Lane  wrote:

Thomas Munro  writes:

I was reminded of this thread by ambient security paranoia.  As it
stands, we require 1.0.2 (but we very much hope that package
maintainers and others in control of builds don't decide to use it).
Should we skip 1.1.1 and move to requiring 3 for v17?


I'd be kind of sad if I couldn't test SSL stuff anymore on my
primary workstation, which has

$ rpm -q openssl
openssl-1.1.1k-12.el8_9.x86_64

I think it's probably true that <=1.0.2 is not in any distro that
we still need to pay attention to, but I reject the contention
that RHEL8 is not in that set.


Hmm, OK so it doesn't have 3 available in parallel from base repos.
But it's also about to reach end of "full support" in 2 months[1], so
if we applied the policies we discussed in the LLVM-vacuuming thread
(to wit: build farm - EOL'd OSes), then...  One question I'm unclear
on is whether v17 will be packaged for RHEL8.


The rest of the thread talks about the end of support of RHEL 7, but you 
are here talking about RHEL 8.   It is true that "full support" for RHEL 
8 ended in May 2024, but that is not the one we are tracking.  We 
are tracking the 10-year one, which I suppose is now called "maintenance 
support".


So if the above package list is correct, then we ought to keep 
supporting openssl 1.1.* until 2029.






Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 11:30:39PM +0300, Ants Aasma wrote:
> On Tue, 2 Apr 2024 at 00:31, Nathan Bossart  wrote:
>> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
>> > What about using the masking capabilities of AVX-512 to handle the
>> > tail in the same code path? Masked out portions of a load instruction
>> > will not generate an exception. To allow byte level granularity
>> > masking, -mavx512bw is needed. Based on wikipedia this will only
>> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
>> > VPOPCNTQ implies availability of BW.
>>
>> Sounds promising.  IMHO we should really be sure that these kinds of loads
>> won't generate segfaults and the like due to the masked-out portions.  I
>> searched around a little bit but haven't found anything that seemed
>> definitive.
> 
> After sleeping on the problem, I think we can avoid this question
> altogether while making the code faster by using aligned accesses.
> Loads that straddle cache line boundaries run internally as 2 load
> operations. Gut feel says that there are enough out-of-order resources
> available to make it not matter in most cases. But even so, not doing
> the extra work is surely better. Attached is another approach that
> does aligned accesses, and thereby avoids going outside bounds.
> 
> Would be interesting to see how well that fares in the small use case.
> Anything that fits into one aligned cache line should be constant
> speed, and there is only one branch, but the mask setup and folding
> the separate popcounts together should add up to about 20-ish cycles
> of overhead.
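The masking arithmetic can be illustrated at 64-bit scale in portable C (a little-endian sketch of the concept only; the actual patch uses AVX-512BW masked loads, and unlike this sketch relies on masked loads being exception-free outside the buffer):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Count set bits in bytes [start, end) of a buffer backed by aligned
 * 64-bit words.  All word loads are aligned; bytes outside the
 * requested range are cleared with byte-granular masks, much as an
 * AVX-512BW masked load clears masked-out lanes.  The caller must
 * guarantee the containing words are valid storage.  Assumes a
 * little-endian layout (byte k of a word in bits 8k..8k+7).
 */
static uint64_t
popcount_range(const uint64_t *words, size_t start, size_t end)
{
	uint64_t	count = 0;

	if (start >= end)
		return 0;

	for (size_t i = start / 8; i <= (end - 1) / 8; i++)
	{
		uint64_t	w = words[i];

		if (i == start / 8 && start % 8 != 0)
			w &= ~UINT64_C(0) << (8 * (start % 8));		/* drop leading bytes */
		if (i == (end - 1) / 8 && end % 8 != 0)
			w &= ~UINT64_C(0) >> (8 * (8 - end % 8));	/* drop trailing bytes */
		count += (uint64_t) __builtin_popcountll(w);
	}
	return count;
}
```

Every load hits an aligned word, so no access straddles a cache line, which is the property the aligned-access approach is after.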

I tested your patch in comparison to v25 and saw the following:

  bytes       v25  v25+ants
      2  1108.205  1033.132
      4  1311.227  1289.373
      8  1927.954  2360.113
     16  2281.091  2365.408
     32  3856.992  2390.688
     64  3648.72   3242.498
    128  4108.549  3607.148
    256  4910.076  4496.852

For 2 bytes and 4 bytes, the inlining should take effect, so any difference
there is likely just noise.  At 8 bytes, we are calling the function
pointer, and there is a small regression with the masking approach.
However, by 16 bytes, the masking approach is on par with v25, and it wins
for all larger buffers, although the gains seem to taper off a bit.

If we can verify this approach won't cause segfaults and can stomach the
regression between 8 and 16 bytes, I'd happily pivot to this approach so
that we can avoid the function call dance that I have in v25.

Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Daniel Gustafsson
> On 4 Apr 2024, at 00:06, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> On 3 Apr 2024, at 19:38, Tom Lane  wrote:
>>> Bottom line for me is that pulling 1.0.1 support now is OK,
>>> but I think pulling 1.0.2 is premature.
> 
>> Is Red Hat building and shipping v17 packages for RHEL7 ELS customers?  If
>> not then it seems mostly academic to tie our dependencies to RHEL ELS unless
>> I'm missing something.
> 
> True, they won't be doing that, and neither will Devrim.  So maybe
> we can leave RHEL7 out of the discussion, in which case there's
> not a lot of reason to keep 1.0.2 support.  We'll need to notify
> buildfarm owners to adjust their configurations.

The patch will also need to be adjusted to work with LibreSSL, but I know Jacob
was looking into that so ideally we should have something to review before
the weekend.

--
Daniel Gustafsson





Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Daniel Gustafsson
> On 3 Apr 2024, at 21:48, Andrew Dunstan  wrote:
> On 2024-04-03 We 15:12, Daniel Gustafsson wrote:
>>   The
>> fact that very few animals run the ssl tests is a pet peeve of mine, it would
>> be nice if we could get broader coverage there.
> 
> Well, the only reason for that is that the SSL tests need to be listed in 
> PG_TEST_EXTRA, and the only reason for that is that   there's a possible 
> hazard on multi-user servers. But I bet precious few buildfarm animals run in 
> such an environment. Mine don't - I'm the only user. 
> 
> Maybe we could send out an email to the buildfarm-owners list asking people 
> to consider allowing the ssl tests.

I think that sounds like a good idea.

--
Daniel Gustafsson





Re: Security lessons from liblzma - libsystemd

2024-04-03 Thread Andres Freund
Hi,

On 2024-04-03 17:58:55 -0400, Tom Lane wrote:
> Magnus Hagander  writes:
> > On Wed, Apr 3, 2024 at 7:57 PM Andres Freund  wrote:
> >> Openssh has now integrated [1] a patch to remove the dependency on
> >> libsystemd
> >> for triggering service manager readiness notifications, by inlining the
> >> necessary function. That's not hard, the protocol is pretty simple.
> >> I suspect we should do the same. We're not even close to being a target as
> >> attractive as openssh, but still, it seems unnecessary.
> 
> > +1.
> 
> I didn't read the patch, but if it's short and stable enough then this
> seems like a good idea.

It's basically just checking for an env var, opening the unix socket indicated
by that, writing a string to it and closing the socket again.


> (If openssh and we are using such a patch, that will probably be a big
> enough stake in the ground to prevent somebody deciding to change the
> protocol ...)

One version of the openssh patch to remove liblzma was submitted by one of the
core systemd devs, so I think they agree that it's a stable API.  The current
protocol supports adding more information by adding attributes, so it should
be extensible enough anyway.


> >> An argument could be made to instead just remove support, but I think it's
> >> quite valuable to have intra service dependencies that can rely on the
> >> server actually having started up.
> 
> > If we remove support we're basically just asking most of our linux
> > packagers to add it back in, and they will add it back in the same way we
> > did it. I think we do everybody a disservice if we do that. It's useful
> > functionality.
> 
> Yeah, that idea seems particularly silly in view of the desire
> expressed earlier in this thread to reduce the number of patches
> carried by packagers.  People packaging for systemd-using distros
> will not consider that this functionality is optional.

Yep.

Greetings,

Andres Freund




Re: Streaming read-ready sequential scan code

2024-04-03 Thread Andres Freund
Hi,

On 2024-04-04 09:27:35 +1300, Thomas Munro wrote:
> Anecdotally by all reports I've seen so far, all-in-cache seems to be
> consistently a touch faster than master if anything, for streaming seq
> scan and streaming ANALYZE.  That's great, but it doesn't seem to be
> due to intentional changes.  No efficiency is coming from batching
> anything.  Perhaps it has to do with CPU pipelining effects: though
> it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work
> itself is cut up into stages in a kind of pipeline:
> read_stream_next_buffer() chooses page n + 2, pins page n + 1 and
> returns page n each time you call it, so maybe we get more CPU
> parallelism due to spreading the data dependencies out?

Another theory is that it's due to the plain ReadBuffer() path needing to do
RelationGetSmgr(reln) on every call, whereas the streaming read path doesn't
need to.



> > On the topic of BAS_BULKREAD buffer access strategy, I think the least
> > we could do is add an assert like this to read_stream_begin_relation()
> > after calculating max_pinned_buffers.
> >
> > Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
> >
> > Perhaps we should do more? I think with a ring size of 16 MB, large
> > SELECTs are safe for now. But I know future developers will make
> > changes and it would be good not to assume they will understand that
> > pinning more buffers than the size of the ring effectively invalidates
> > the ring.
>
> Yeah I think we should clamp max_pinned_buffers if we see a strategy.
> What do you think about:
>
> if (strategy)
> {
> int strategy_buffers = GetAccessStrategyBufferCount(strategy);
> max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
> }

> I just don't know where to get that '2'.  The idea would be to
> hopefully never actually be constrained by it in practice, except in
> tiny/toy setups, so we can't go too wrong with our number '2' there.

The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so,
should we apply that only if the strategy doesn't use the
StrategyRejectBuffer() logic?

I think it's fine to add a handwavy justification for the 2, saying that we
want to balance readahead distance and reducing WAL write frequency, and that
at some point more sophisticated logic will be needed.


> Then we should increase the default ring sizes for BAS_BULKREAD and
> BAS_VACUUM to make the previous statement true.

I'm not sure it's right tying them together. The concerns for both are fairly
different.


> The size of main memory and L2 cache have increased dramatically since those
> strategies were invented.  I think we should at least double them, and more
> likely quadruple them.  I realise you already made them configurable per
> command in commit 1cbbee033, but I mean the hard coded default 256 in
> freelist.c.  It's not only to get more space for our prefetching plans, it's
> also to give the system more chance of flushing WAL asynchronously/in some
> other backend before you crash into dirty data; as you discovered,
> prefetching accidentally makes that effect worse in a BAS_VACUUM strategy,
> by taking away space that is effectively deferring WAL flushes, so I'd at
> least like to double the size for if we use "/ 2" above.

I think for VACUUM we should probably go a bit further. There's no comparable
L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more
expensive, compared to a seqscan. I'd go for at least 4x-8x.

Greetings,

Andres Freund




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Tom Lane
Daniel Gustafsson  writes:
> On 3 Apr 2024, at 19:38, Tom Lane  wrote:
>> Bottom line for me is that pulling 1.0.1 support now is OK,
>> but I think pulling 1.0.2 is premature.

> Is Red Hat building and shipping v17 packages for RHEL7 ELS customers?  If
> not then it seems mostly academic to tie our dependencies to RHEL ELS unless
> I'm missing something.

True, they won't be doing that, and neither will Devrim.  So maybe
we can leave RHEL7 out of the discussion, in which case there's
not a lot of reason to keep 1.0.2 support.  We'll need to notify
buildfarm owners to adjust their configurations.

regards, tom lane




Re: Security lessons from liblzma - libsystemd

2024-04-03 Thread Tom Lane
Magnus Hagander  writes:
> On Wed, Apr 3, 2024 at 7:57 PM Andres Freund  wrote:
>> Openssh has now integrated [1] a patch to remove the dependency on
>> libsystemd
>> for triggering service manager readiness notifications, by inlining the
>> necessary function. That's not hard, the protocol is pretty simple.
>> I suspect we should do the same. We're not even close to being a target as
>> attractive as openssh, but still, it seems unnecessary.

> +1.

I didn't read the patch, but if it's short and stable enough then this
seems like a good idea.  (If openssh and we are using such a patch,
that will probably be a big enough stake in the ground to prevent
somebody deciding to change the protocol ...)

>> An argument could be made to instead just remove support, but I think it's
>> quite valuable to have intra service dependencies that can rely on the
>> server actually having started up.

> If we remove support we're basically just asking most of our linux
> packagers to add it back in, and they will add it back in the same way we
> did it. I think we do everybody a disservice if we do that. It's useful
> functionality.

Yeah, that idea seems particularly silly in view of the desire
expressed earlier in this thread to reduce the number of patches
carried by packagers.  People packaging for systemd-using distros
will not consider that this functionality is optional.

regards, tom lane




Re: Detoasting optionally to make Explain-Analyze less misleading

2024-04-03 Thread Tom Lane
Matthias van de Meent  writes:
> On Tue, 2 Apr 2024 at 17:47, Tom Lane  wrote:
>> IIUC, it's not possible to use the SERIALIZE option when explaining
>> CREATE TABLE AS, because you can't install the instrumentation tuple
>> receiver when the IntoRel one is needed.  I think that's fine because
>> no serialization would happen in that case anyway, but should we
>> throw an error if that combination is requested?  Blindly reporting
>> that zero bytes were serialized seems like it'd confuse people.

> I think it's easily explained as no rows were transfered to the
> client. If there is actual confusion, we can document it, but
> confusing disk with network is not a case I'd protect against. See
> also: EXPLAIN (ANALYZE, SERIALIZE) INSERT without the RETURNING
> clause.

Fair enough.  There were a couple of spots in the code where I thought
this was important to comment about, though.

>> However, isn't the bottom half of serializeAnalyzeStartup doing
>> exactly what the comment above it says we don't do, that is accounting
>> for the RowDescription message?  Frankly I agree with the comment that
>> it's not worth troubling over, so I'd just drop that code rather than
>> finding a solution for the magic-number problem.

> I'm not sure I agree with not including the size of RowDescription
> itself though, as wide results can have a very large RowDescription
> overhead; up to several times the returned data in cases where few
> rows are returned.

Meh --- if we're rounding off to kilobytes, you're unlikely to see it.
In any case, if we start counting overhead messages, where shall we
stop?  Should we count the eventual CommandComplete or ReadyForQuery,
for instance?  I'm content to say that this measures data only; that
seems to jibe with other aspects of EXPLAIN's behavior.

> Removed. I've instead added buffer usage, as I realised that wasn't
> covered yet, and quite important to detect excessive detoasting (it's
> not covered at the top-level scan).

Duh, good catch.

I've pushed this after a good deal of cosmetic polishing -- for
example, I spent some effort on making serializeAnalyzeReceive
look as much like printtup as possible, in hopes of making it
easier to keep the two functions in sync in future.

regards, tom lane




Re: Streaming read-ready sequential scan code

2024-04-03 Thread Melanie Plageman
On Wed, Apr 3, 2024 at 4:28 PM Thomas Munro  wrote:
>
> On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman
>  wrote:
> > On the topic of BAS_BULKREAD buffer access strategy, I think the least
> > we could do is add an assert like this to read_stream_begin_relation()
> > after calculating max_pinned_buffers.
> >
> > Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
> >
> > Perhaps we should do more? I think with a ring size of 16 MB, large
> > SELECTs are safe for now. But I know future developers will make
> > changes and it would be good not to assume they will understand that
> > pinning more buffers than the size of the ring effectively invalidates
> > the ring.
>
> Yeah I think we should clamp max_pinned_buffers if we see a strategy.
> What do you think about:
>
> if (strategy)
> {
> int strategy_buffers = GetAccessStrategyBufferCount(strategy);
> max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
> }
>
> I just don't know where to get that '2'.  The idea would be to
> hopefully never actually be constrained by it in practice, except in
> tiny/toy setups, so we can't go too wrong with our number '2' there.

Yea, I don't actually understand why not just use strategy_buffers - 1
or something. 1/2 seems like a big limiting factor for those
strategies with small rings.

I don't really think it will come up for SELECT queries since they
rely on readahead and not prefetching.
It does seem like it could easily come up for analyze.

But I am on board with the idea of clamping for sequential scan/TID
range scan. For vacuum, we might have to think harder. If the user
specifies buffer_usage_limit and io_combine_limit and they are never
reaching IOs of io_combine_limit because of their buffer_usage_limit
value, then we should probably inform them.

- Melanie




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Alexander Korotkov
On Wed, Apr 3, 2024 at 10:04 PM Pavel Borisov  wrote:
> On Wed, 3 Apr 2024 at 22:18, Alexander Korotkov  wrote:
>>
>> On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera  
>> wrote:
>> >
>> > On 2024-Apr-03, Alexander Korotkov wrote:
>> >
>> > > Regarding the shmem data structure for LSN waiters.  I didn't pick
>> > > LWLock or ConditionVariable, because I needed the ability to wake up
>> > > only those waiters whose LSN is already replayed.  In my experience
>> > > waking up a process is way slower than scanning a short flat array.
>> >
>> > I agree, but I think that's unrelated to what I was saying, which is
>> > just the patch I attach here.
>>
>> Oh, sorry for the confusion.  I'd re-read your message.  Indeed you
>> meant this very clearly!
>>
>> I'm good with the patch.  Attached revision contains a bit of a commit 
>> message.
>>
>> > > However, I agree that when the number of waiters is very high and flat
>> > > array may become a problem.  It seems that the pairing heap is not
>> > > hard to use for shmem structures.  The only memory allocation call in
>> > > pairingheap.c is in pairingheap_allocate().  So, it's only needed to
>> > > be able to initialize the pairing heap in-place, and it will be fine
>> > > for shmem.
>> >
>> > Ok.
>> >
>> > With the code as it stands today, everything in WaitLSNState apart from
>> > the pairing heap is accessed without any locking.  I think this is at
>> > least partly OK because each backend only accesses its own entry; but it
>> > deserves a comment.  Or maybe something more, because WaitLSNSetLatches
>> > does modify the entry for other backends.  (Admittedly, this could only
>> > happens for backends that are already sleeping, and it only happens
>> > with the lock acquired, so it's probably okay.  But clearly it deserves
>> > a comment.)
>>
>> Please, check 0002 patch attached.  I found it easier to move two
>> assignments we previously moved out of lock, into the lock; then claim
>> WaitLSNState.procInfos is also protected by WaitLSNLock.
>
> Could you re-attach 0002. Seems it failed to attach to the previous message.

I actually forgot both!

--
Regards,
Alexander Korotkov


v2-0001-Use-an-LWLock-instead-of-a-spinlock-in-waitlsn.c.patch
Description: Binary data


v2-0002-Clarify-what-is-protected-by-WaitLSNLock.patch
Description: Binary data


Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-04-03 Thread Tom Lane
Alvaro Herrera  writes:
> Great, thanks for looking.  Pushed now, I'll be closing the commitfest
> entry shortly.

On my machine, headerscheck does not like this:

$ src/tools/pginclude/headerscheck --cplusplus
In file included from /tmp/headerscheck.4gTaW5/test.cpp:3:
./src/include/libpq/libpq-be-fe-helpers.h: In function 'char* libpqsrv_cancel(PGconn*, TimestampTz)':
./src/include/libpq/libpq-be-fe-helpers.h:393:10: warning: ISO C++ forbids converting a string constant to 'char*' [-Wwrite-strings]
    return "out of memory";
           ^~~~~~~~~~~~~~~
./src/include/libpq/libpq-be-fe-helpers.h:421:13: warning: ISO C++ forbids converting a string constant to 'char*' [-Wwrite-strings]
  error = "cancel request timed out";
          ^~~~~~~~~~~~~~~~~~~~~~~~~~

The second part of that could easily be fixed by declaring "error" as
"const char *".  As for the first part, can we redefine the whole
function as returning "const char *"?  (If not, this coding is very
questionable anyway.)

regards, tom lane




Re: Security lessons from liblzma - libsystemd

2024-04-03 Thread Magnus Hagander
On Wed, Apr 3, 2024 at 7:57 PM Andres Freund  wrote:

> Hi,
>
> As most will know by now, the way the xz debacle was able to make sshd
> vulnerable
> was through a dependency from sshd to libsystemd and then from libsystemd
> to
> liblzma. One lesson from this is that unnecessary dependencies can still
> increase risk.
>

Yeah, I think that's something to consider for every dependency added. I
think we're fairly often protected against "adding too many libraries"
because many libraries simply don't exist for all the platforms we want to
build on. But it's nevertheless something to think about each time.


It's worth noting that we have an optional dependency on libsystemd as well.
>
> Openssh has now integrated [1] a patch to remove the dependency on
> libsystemd
> for triggering service manager readiness notifications, by inlining the
> necessary function. That's not hard, the protocol is pretty simple.
>
> I suspect we should do the same. We're not even close to being a target as
> attractive as openssh, but still, it seems unnecessary.
>

+1.

When the code is this simple, we should definitely consider carrying it
ourselves. At least if we don't expect to need *other* functionality from
the same library in the future, which I doubt we will from libsystemd.


An argument could be made to instead just remove support, but I think it's
> quite valuable to have intra service dependencies that can rely on the
> server
> actually having started up.
>
>
If we remove support we're basically just asking most of our linux
packagers to add it back in, and they will add it back in the same way we
did it. I think we do everybody a disservice if we do that. It's useful
functionality.

//Magnus


Re: On disable_cost

2024-04-03 Thread David Rowley
On Thu, 4 Apr 2024 at 08:21, Robert Haas  wrote:
> I wanted to further explore the idea of just not generating plans of
> types that are currently disabled. I looked into doing this for
> enable_indexscan and enable_indexonlyscan. As a first step, I
> investigated how those settings work now, and was horrified. I don't
> know whether I just wasn't paying attention back when the original
> index-only scan work was done -- I remember discussing
> enable_indexonlyscan with you at the time -- or whether it got changed
> subsequently. Anyway, the current behavior is:
>
> [A] enable_indexscan=false adds disable_cost to the cost of every
> Index Scan path *and also* every Index-Only Scan path. So disabling
> index-scans also in effect discourages the use of index-only scans,
> which would make sense if we didn't have a separate setting called
> enable_indexonlyscan, but we do. Given that, I think this is
> completely and utterly wrong.
>
> [b] enable_indexonlyscan=false causes index-only scan paths not to be
> generated at all, but instead, we generate index-scan paths to do the
> same thing that we would not have generated otherwise.

FWIW, I think changing this is a bad idea and I don't think the
behaviour that's in your patch is useful.  With your patch, if I SET
enable_indexonlyscan=false, any index that *can* support an IOS for my
query will now not be considered at all!

With your patch applied, I see:

-- default enable_* GUC values.
postgres=# explain select oid from pg_class order by oid;
QUERY PLAN
---
 Index Only Scan using pg_class_oid_index on pg_class
(cost=0.27..22.50 rows=415 width=4)
(1 row)


postgres=# set enable_indexonlyscan=0; -- no index scan?
SET
postgres=# explain select oid from pg_class order by oid;
   QUERY PLAN
-
 Sort  (cost=36.20..37.23 rows=415 width=4)
   Sort Key: oid
   ->  Seq Scan on pg_class  (cost=0.00..18.15 rows=415 width=4)
(3 rows)

postgres=# set enable_seqscan=0; -- still no index scan!
SET
postgres=# explain select oid from pg_class order by oid;
 QUERY PLAN

 Sort  (cost=136.20..137.23 rows=415 width=4)
   Sort Key: oid
   ->  Seq Scan on pg_class  (cost=100.00..118.15
rows=415 width=4)
(3 rows)

postgres=# explain select oid from pg_class order by oid,relname; --
now an index scan?!
 QUERY PLAN
-
 Incremental Sort  (cost=0.43..79.50 rows=415 width=68)
   Sort Key: oid, relname
   Presorted Key: oid
   ->  Index Scan using pg_class_oid_index on pg_class
(cost=0.27..60.82 rows=415 width=68)
(4 rows)

I don't think this is good as pg_class_oid_index effectively won't be
used as long as the particular query could use that index with an
index *only* scan. You can see above that as soon as I adjust the
query slightly so that IOS isn't possible, the index can be used
again. I think an Index Scan would have been a much better option for
the 2nd query than the seq scan and sort.

I think if I do SET enable_indexonlyscan=0; the index should still be
used with an Index Scan and it definitely shouldn't result in Index
Scan also being disabled if that index happens to contain all the
columns required to support an IOS.

FWIW, I'm fine with the current behaviour.  It looks like we've
assumed that, when possible, IOS are always superior to Index Scan, so
there's no point in generating an Index Scan path when we can generate
an IOS path. I think this makes sense. For that not to be true,
checking the all visible flag would have to be more costly than
visiting the heap.  Perhaps that could be true if the visibility map
page had to come from disk and the heap page was cached and the disk
was slow, but I don't think that scenario is worthy of considering
both Index scan and IOS path types when IOS is possible. We've no way
to accurately cost that anyway.

This all seems similar to enable_sort vs enable_incremental_sort. For
a while, we did consider both an incremental sort and a sort when an
incremental sort was possible, but it seemed to me that an incremental
sort would always be better when it was possible, so I changed that in
4a29eabd1. I've not seen anyone complain.  I made it so that when an
incremental sort is possible but is disabled, we do a sort instead.
That seems fairly similar to how master handles
enable_indexonlyscan=false.

In short, I don't find it strange that disabling one node type results
in considering another type that we'd otherwise not consider in cases
where we assume that the disabled node type is always superior and
should always be used when it is possible.

I do 

Re: Use streaming read API in ANALYZE

2024-04-03 Thread Melanie Plageman
On Wed, Apr 03, 2024 at 10:25:01PM +0300, Nazir Bilal Yavuz wrote:
>
> I realized the same error while working on Jakub's benchmarking results.
> 
> Cause: I was using the nblocks variable to check how many blocks will
> be returned from the streaming API. But I realized that sometimes the
> number returned from BlockSampler_Init() is not equal to the number of
> blocks that BlockSampler_Next() will return as BlockSampling algorithm
> decides how many blocks to return on the fly by using some random
> seeds.
> 
> There are a couple of solutions I thought of:
> 
> 1- Use BlockSampler_HasMore() instead of nblocks in the main loop in
> the acquire_sample_rows():
> 
> Streaming API uses this function to prefetch block numbers.
> BlockSampler_HasMore() will reach to the end first as it is used while
> prefetching, so it will start to return false while there are still
> buffers to return from the streaming API. That will cause some buffers
> at the end to not be processed.
> 
> 2- Expose something (function, variable etc.) from the streaming API
> to understand if the read is finished and there is no buffer to
> return:
> 
> I think this works but I am not sure if the streaming API allows
> something like that.
> 
> 3- Check every buffer returned from the streaming API, if it is
> invalid stop the main loop in the acquire_sample_rows():
> 
> This solves the problem but there will be two if checks for each
> buffer returned,
> - in heapam_scan_analyze_next_block() to check if the returned buffer is 
> invalid
> - to break main loop in acquire_sample_rows() if
> heapam_scan_analyze_next_block() returns false
> One of the if cases can be bypassed by moving
> heapam_scan_analyze_next_block()'s code to the main loop in the
> acquire_sample_rows().
> 
> I implemented the third solution, here is v6.

I've reviewed the patches inline below and attached a patch that has
some of my ideas on top of your patch.

> From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001
> From: Nazir Bilal Yavuz 
> Date: Wed, 3 Apr 2024 15:14:15 +0300
> Subject: [PATCH v6] Use streaming read API in ANALYZE
> 
> ANALYZE command gets random tuples using BlockSampler algorithm. Use
> streaming reads to get these tuples by using BlockSampler algorithm in
> streaming read API prefetch logic.
> ---
>  src/include/access/heapam.h  |  6 +-
>  src/backend/access/heap/heapam_handler.c | 22 +++---
>  src/backend/commands/analyze.c   | 85 
>  3 files changed, 42 insertions(+), 71 deletions(-)
> 
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index a307fb5f245..633caee9d95 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -25,6 +25,7 @@
>  #include "storage/bufpage.h"
>  #include "storage/dsm.h"
>  #include "storage/lockdefs.h"
> +#include "storage/read_stream.h"
>  #include "storage/shm_toc.h"
>  #include "utils/relcache.h"
>  #include "utils/snapshot.h"
> @@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
>  										  struct GlobalVisState *vistest);
>  
>  /* in heap/heapam_handler.c*/
> -extern void heapam_scan_analyze_next_block(TableScanDesc scan,
> -										   BlockNumber blockno,
> -										   BufferAccessStrategy bstrategy);
> +extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
> +										   ReadStream *stream);
>  extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
>  										   TransactionId OldestXmin,
>  										   double *liverows, double *deadrows,
> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index 0952d4a98eb..d83fbbe6af3 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
>  }
>  
>  /*
> - * Prepare to analyze block `blockno` of `scan`.  The scan has been started
> - * with SO_TYPE_ANALYZE option.
> + * Prepare to analyze block returned from streaming object.  If the block returned
> + * from streaming object is valid, true is returned; otherwise false is returned.
> + * The scan has been started with SO_TYPE_ANALYZE option.
>   *
>   * This routine holds a buffer pin and lock on the heap page.  They are held
>   * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
>   * items of the heap page are analyzed.
>   */
> -void
> -heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
> - 

RE: Can't compile PG 17 (master) from git under Msys2 autoconf

2024-04-03 Thread Regina Obe
> Hi Regina,
> 
> On 2024-Mar-27, Regina Obe wrote:
> 
> > The error is
> >
> > rm -f '../../src/include/storage/lwlocknames.h'
> > cp -pR ../../backend/storage/lmgr/lwlocknames.h
> > '../../src/include/storage/lwlocknames.h'
> > cp: cannot stat '../../backend/storage/lmgr/lwlocknames.h': No such
> > file or directory
> > make[1]: *** [Makefile:143: ../../src/include/storage/lwlocknames.h]
> > Error 1
> > make[1]: Leaving directory '/projects/postgresql/postgresql-
> git/src/backend'
> > make: *** [../../src/Makefile.global:382: submake-generated-headers]
> > Error 2
> 
> Hmm, I changed these rules again in commit da952b415f44, maybe your
> problem is with that one?  I wonder if changing the references to
> ../include/storage/lwlocklist.h to $(topdir)/src/include/storage/lwlocklist.h
> (and similar things in
> src/backend/storage/lmgr/Makefile) would fix it.
> 
> Do you run configure in the source directory or a separate build one?
> 
> --
> Álvaro Herrera        Breisgau, Deutschland  —
> https://www.EnterpriseDB.com/
> "If it is not right, do not do it.
> If it is not true, do not say it." (Marcus Aurelius, Meditations)

I run it in the source directory, but tried doing it in a separate build
directory and ran into the same issue.

Let me look at that commit and get back to you if that makes a difference.

Thanks,
Regina





Re: Streaming read-ready sequential scan code

2024-04-03 Thread Thomas Munro
On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman
 wrote:
> On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas  wrote:
> > On 01/04/2024 22:58, Melanie Plageman wrote:
> > > Attached v7 has version 14 of the streaming read API as well as a few
> > > small tweaks to comments and code.
> >
> > I saw benchmarks in this thread to show that there's no regression when
> > the data is in cache, but I didn't see any benchmarks demonstrating the
> > benefit of this. So I ran this quick test:
>
> Good point! It would be good to show why we would actually want this
> patch. Attached v8 is rebased over current master (which now has the
> streaming read API).

Anecdotally by all reports I've seen so far, all-in-cache seems to be
consistently a touch faster than master if anything, for streaming seq
scan and streaming ANALYZE.  That's great, but it doesn't seem to be
due to intentional changes.  No efficiency is coming from batching
anything.  Perhaps it has to do with CPU pipelining effects: though
it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work
itself is cut up into stages in a kind of pipeline:
read_stream_next_buffer() chooses page n + 2, pins page n + 1 and
returns page n each time you call it, so maybe we get more CPU
parallelism due to spreading the data dependencies out?  (Makes me
wonder what happens if you insert a memory prefetch for the page
header into that production line, and if there are more opportunities
to spread dependencies out eg hashing the buffer tag ahead of time.)

> On the topic of BAS_BULKREAD buffer access strategy, I think the least
> we could do is add an assert like this to read_stream_begin_relation()
> after calculating max_pinned_buffers.
>
> Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
>
> Perhaps we should do more? I think with a ring size of 16 MB, large
> SELECTs are safe for now. But I know future developers will make
> changes and it would be good not to assume they will understand that
> pinning more buffers than the size of the ring effectively invalidates
> the ring.

Yeah I think we should clamp max_pinned_buffers if we see a strategy.
What do you think about:

if (strategy)
{
int strategy_buffers = GetAccessStrategyBufferCount(strategy);
max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
}

I just don't know where to get that '2'.  The idea would be to
hopefully never actually be constrained by it in practice, except in
tiny/toy setups, so we can't go too wrong with our number '2' there.

Then we should increase the default ring sizes for BAS_BULKREAD and
BAS_VACUUM to make the previous statement true.  The size of main
memory and L2 cache have increased dramatically since those strategies
were invented.  I think we should at least double them, and more
likely quadruple them.  I realise you already made them configurable
per command in commit 1cbbee033, but I mean the hard coded default 256
in freelist.c.  It's not only to get more space for our prefetching
plans, it's also to give the system more chance of flushing WAL
asynchronously/in some other backend before you crash into dirty data;
as you discovered, prefetching accidentally makes that effect worse
in a BAS_VACUUM strategy, by taking away space that is effectively
deferring WAL flushes, so I'd at least like to double the size for if
we use "/ 2" above.




Re: LogwrtResult contended spinlock

2024-04-03 Thread Jeff Davis
On Wed, 2024-04-03 at 13:19 +0200, Alvaro Herrera wrote:
> So what I do in the attached 0001 is stop using the XLogwrtResult
> struct
> in XLogCtl and replace it with separate Write and Flush values, and
> add
> the macro XLogUpdateLocalLogwrtResult() that copies the values of
> Write
> and Flush from the shared XLogCtl to the local variable given as
> macro
> argument.

+1

> 0002 then adds pg_atomic_monotonic_advance_u64.  (I don't add the
> _u32
> variant, because I don't think it's a great idea to add dead code. 
> If
> later we see a need for it we can put it in.)  It also changes the
> two
> new members to be atomics, changes the macro to use atomic read, and
> XLogWrite now uses monotonic increment.  A couple of other places can
> move the macro calls to occur outside the spinlock.  Also, XLogWrite
> gains the invariant checking that involves Write and Flush.

To maintain the invariant that Write >= Flush, I believe you need to
always store to Write first, then Flush; and load from Flush first,
then from Write. So RefreshXLogWriteResult should load Flush then load
Write, and the same for the Assert code. And in 0003, loading from Copy
should happen last.
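As a standalone illustration of that ordering constraint (using C11 atomics as a stand-in for the pg_atomic API; the variable and function names here are hypothetical, not the patch's): the writer publishes Write before Flush, and the reader loads Flush before Write, so the reader can never observe Flush ahead of Write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Shared positions; invariant: Write >= Flush. */
static _Atomic uint64_t shared_write;
static _Atomic uint64_t shared_flush;

/* Writer side: advance Write first, then Flush. */
static void
advance(uint64_t new_write, uint64_t new_flush)
{
    atomic_store(&shared_write, new_write); /* store Write first */
    atomic_store(&shared_flush, new_flush); /* then Flush */
}

/* Reader side: load Flush first, then Write. */
static void
refresh(uint64_t *write, uint64_t *flush)
{
    *flush = atomic_load(&shared_flush);    /* load Flush first */
    *write = atomic_load(&shared_write);    /* then Write */
    assert(*write >= *flush);               /* invariant holds for the reader */
}
```

With this ordering, a concurrent refresh() may see a stale Flush alongside a fresh Write, but never the reverse, so the assertion cannot fire.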

Also, should pg_atomic_monotonic_advance_u64() return currval? I don't
think it's important for the current patches, but I could imagine a
caller wanting the most up-to-date value possible, even if it's beyond
what the caller requested. Returning void seems slightly wasteful.
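For illustration, a return-the-current-value variant might look like this sketch (C11 atomics stand in for pg_atomic_uint64 and the port/atomics.h compare-exchange; this is not the proposed patch's code):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Advance *ptr to target if target is larger, and return whatever value
 * ended up being current -- which may exceed target if another process
 * advanced it further first.
 */
static uint64_t
monotonic_advance(_Atomic uint64_t *ptr, uint64_t target)
{
    uint64_t    cur = atomic_load(ptr);

    while (cur < target)
    {
        /* On failure, cur is reloaded with the latest value. */
        if (atomic_compare_exchange_weak(ptr, &cur, target))
            return target;
    }
    /* Someone else already advanced it at least as far. */
    return cur;
}
```

The caller then gets the freshest value the operation observed at no extra cost, instead of having to issue a separate read.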

Other than that, it looks good to me.

> Finally, 0003 adds the Copy pointer to XLogCtl alongside Write and
> Flush, and updates WALReadFromBuffers to test that instead of the
> Write
> pointer, and adds in XLogWrite the invariant checks that involve the
> Copy pointer.

Looks good to me.

Regards,
Jeff Davis





Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
On Wed, Apr 03, 2024 at 12:41:27PM -0500, Nathan Bossart wrote:
> I committed v23-0001.  Here is a rebased version of the remaining patches.
> I intend to test the masking idea from Ants next.

0002 was missing a cast that is needed for the 32-bit builds.  I've fixed
that in v25.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From fe001e38b3f209c2fe615a2c4c64109d5e4d3da1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v25 1/2] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |  29 ++-
 src/port/pg_popcount_avx512.c|  49 ++
 src/port/pg_popcount_avx512_choose.c |  71 
 src/test/regress/expected/bit.out|  24 +++
 src/test/regress/sql/bit.sql |   4 +
 15 files changed, 673 insertions(+), 5 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# -
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include 
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes"
+else
+  pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: 

Re: Opportunistically pruning page before update

2024-04-03 Thread Melanie Plageman
On Fri, Jan 26, 2024 at 8:33 PM James Coleman  wrote:
>
> On Tue, Jan 23, 2024 at 2:46 AM Dilip Kumar  wrote:
> >
> > On Tue, Jan 23, 2024 at 7:18 AM James Coleman  wrote:
> > >
> > > On Mon, Jan 22, 2024 at 8:21 PM James Coleman  wrote:
> > > >
> > > > See rebased patch attached.
> > >
> > > I just realized I left a change in during the rebase that wasn't 
> > > necessary.
> > >
> > > v4 attached.
> >
> > I have noticed that you are performing the opportunistic pruning after
> > we decided that the updated tuple can not fit in the current page and
> > then we are performing the pruning on the new target page.  Why don't
> > we first perform the pruning on the existing page of the tuple itself?
> >  Or this is already being done before this patch? I could not find
> > such existing pruning so got this question because such pruning can
> > convert many non-hot updates to the HOT update right?
>
> First off I noticed that I accidentally sent a different version of
> the patch I'd originally worked on. Here's the one from the proper
> branch. It's still similar, but I want to make sure the right one is
> being reviewed.

I finally got back around to looking at this. Sorry for the delay.

I don't feel confident enough to say at a high level whether or not it
is a good idea in the general case to try pruning every block
RelationGetBufferForTuple() considers as a target block.

But, I did have a few thoughts on the implementation:

heap_page_prune_opt() checks PageGetHeapFreeSpace() twice. You will
have already done that in RelationGetBufferForTuple(). And you don't
need even need to do it both of those times because you have a lock
(which is why heap_page_prune_opt() does it twice). This all seems a
bit wasteful. And then, you do it again after pruning.

This made me think, vacuum cares how much space heap_page_prune() (now
heap_page_prune_and_freeze()) freed up. Now if we add another caller
who cares how much space pruning freed up, perhaps it is worth
calculating this while pruning and returning it. I know
PageGetHeapFreeSpace() isn't the most expensive function in the world,
but it also seems like a useful, relevant piece of information to
inform the caller of.

You don't have to implement the above, it was just something I was
thinking about.

Looking at the code also made me wonder if it is worth changing
RelationGetBufferForTuple() to call PageGetHeapFreeSpace() before
taking a lock (which won't give a totally accurate result, but that's
probably okay) and then call heap_page_prune_opt() without a lock when
PageGetHeapFreeSpace() says there isn't enough space.

Also do we want to do GetVisibilityMapPins() before calling
heap_page_prune_opt()? I don't quite get why we do that before knowing
if we are going to actually use the target block in the current code.

Anyway, I'm not sure I like just adding a parameter to
heap_page_prune_opt() to indicate it already has an exclusive lock. It
does a bunch of stuff that was always done without a lock and now you
are doing it with an exclusive lock.

- Melanie




Re: On disable_cost

2024-04-03 Thread Greg Sabino Mullane
On Wed, Apr 3, 2024 at 3:21 PM Robert Haas  wrote:

> It's also pretty clear to me that the fact that enable_indexscan
> and enable_indexonlyscan work completely differently from each other
> is surprising at best, wrong at worst, but here again, what this patch
> does about that is not above reproach.


Yes, that is wrong, surely there is a reason we have two vars. Thanks for
digging into this: if nothing else, the code will be better for this
discussion, even if we do nothing for now with disable_cost.

Cheers,
Greg


Re: Statistics Import and Export

2024-04-03 Thread Tom Lane
Corey Huinker  writes:
>> As far as that goes, it shouldn't be that hard to deal with, at least
>> not for "soft" errors which hopefully cover most input-function
>> failures these days.  You should be invoking array_in via
>> InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
>> (Look at pg_input_error_info() for useful precedent.)

> Ah, my understanding may be out of date. I was under the impression that
> that mechanism relied on the the cooperation of the per-element input
> function, so even if we got all the builtin datatypes to play nice with
> *Safe(), we were always going to be at risk with a user-defined input
> function.

That's correct, but it's silly not to do what we can.  Also, I imagine
that there is going to be high evolutionary pressure on UDTs to
support soft error mode for COPY, so over time the problem will
decrease --- as long as we invoke the soft error mode.

> 1. NULL input =>  Return NULL. (because strict).
> 2. Actual error (permissions, cache lookup not found, etc) => Raise ERROR
> (thus ruining binary upgrade)
> 3. Call values are so bad (examples: attname not found, required stat
> missing) that nothing can recover => WARN, return FALSE.
> 4. At least one stakind-stat is wonky (impossible for datatype, missing
> stat pair, wrong type on input parameter), but that's the worst of it => 1
> to N WARNs, write stats that do make sense, return TRUE.
> 5. Hunky-dory. => No warns. Write all stats. return TRUE.

> Which of those seem like good ereturn candidates to you?

I'm good with all those behaviors.  On reflection, the design I was
vaguely imagining wouldn't cope with case 4 (multiple WARNs per call)
so never mind that.

regards, tom lane




Re: psql not responding to SIGINT upon db reconnection

2024-04-03 Thread Robert Haas
On Wed, Apr 3, 2024 at 11:17 AM Tristan Partin  wrote:
> I think patch 2 makes it worse. The value in -Wswitch is that when new
> enum variants are added, the developer knows the locations to update.
> Adding a default case makes -Wswitch pointless.
>
> Patch 1 is still good. The comment change in patch 2 is good too!

It seems to me that 0001 should either remove the pg_unreachable()
call or change the break to a return, but not both. The commit message
tries to justify doing both by saying that the pg_unreachable() call
doesn't have much value, but there's not much value in omitting
pg_unreachable() from unreachable places, either, so I'm not
convinced.

I agree with Tristan's analysis of 0002.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Andrew Dunstan


On 2024-04-03 We 15:12, Daniel Gustafsson wrote:

The
fact that very few animals run the ssl tests is a pet peeve of mine, it would
be nice if we could get broader coverage there.



Well, the only reason for that is that the SSL tests need to be listed 
in PG_TEST_EXTRA, and the only reason for that is that there's a 
possible hazard on multi-user servers. But I bet precious few buildfarm 
animals run in such an environment. Mine don't - I'm the only user.


Maybe we could send out an email to the buildfarm-owners list asking 
people to consider allowing the ssl tests.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Security lessons from liblzma

2024-04-03 Thread Daniel Gustafsson
> On 3 Apr 2024, at 20:09, Peter Eisentraut  wrote:
> 
> On 30.03.24 00:14, Daniel Gustafsson wrote:
>> One take-away for me is how important it is to ship recipes for regenerating
>> any testdata which is included in generated/compiled/binary format.  Kind of
>> how we in our tree ship the config for test TLS certificates and keys which 
>> can
>> be manually inspected, and used to rebuild the testdata (although the risk 
>> for
>> injections in this particular case seems low).  Bad things can still be
>> injected, but formats which allow manual review at least goes some way 
>> towards
>> lowering risk.
> 
> I actually find the situation with the test TLS files quite unsatisfactory.  
> While we have build rules, the output files are not reproducible, both 
> because of some inherent randomness in the generation, and because different 
> OpenSSL versions produce different details.

This testdata is by nature not reproducible, and twisting arms to make it so
will only result in testing against synthetic data, which risks hiding bugs IMO.

> So you just have to "trust" that what's there now makes sense.

Not entirely, you can review the input files which are used to generate the
test data and verify that those make sense..

> Of course, you can use openssl tools to unpack these files, but that is 
> difficult and unreliable unless you know exactly what you are looking for.

..and check like you mention above, but it's as you say not entirely trivial.  
It
would be nice to improve this but I'm not sure how.  Document how to inspect
the data in src/test/ssl/README perhaps?

> It would be better if we created the required test files as part of the test 
> run.  (Why not?  Too slow?)

The make sslfiles step requires OpenSSL 1.1.1, which is higher than what we
require to be installed to build postgres.  The higher-than-base requirement is
due to it building test data only used when running 1.1.1 or higher, so
technically it could be made to work if anyone is interested in investing time
in 1.0.2.

Time is one aspect, on my crusty old laptop it takes ~2 seconds to regenerate
the files.  That in itself isn't that much, but we've rejected test-time
additions far less than that.  We could however make CI and the Buildfarm run
the regeneration and leave it up to each developer to decide locally?  Or
remove them and regenerate on the first SSL test run and then use the cached
ones after that?

On top of time I have a feeling the regeneration won't run on Windows.  When
it's converted to use Meson then that might be fixed though.

--
Daniel Gustafsson





Re: Use streaming read API in ANALYZE

2024-04-03 Thread Nazir Bilal Yavuz
Hi,

Thank you for looking into this!

On Wed, 3 Apr 2024 at 20:17, Heikki Linnakangas  wrote:
>
> On 03/04/2024 13:31, Nazir Bilal Yavuz wrote:
> > Streaming API has been committed but the committed version has a minor
> > change, the read_stream_begin_relation function takes Relation instead
> > of BufferManagerRelation now. So, here is a v5 which addresses this
> > change.
>
> I'm getting a repeatable segfault / assertion failure with this:
>
> postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10);
> CREATE TABLE
> postgres=# insert into tengiga select g, repeat('x', 900) from
> generate_series(1, 140) g;
> INSERT 0 140
> postgres=# set default_statistics_target = 10; ANALYZE tengiga;
> SET
> ANALYZE
> postgres=# set default_statistics_target = 100; ANALYZE tengiga;
> SET
> ANALYZE
> postgres=# set default_statistics_target =1000; ANALYZE tengiga;
> SET
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File:
> "heapam_handler.c", Line: 1079, PID: 262232
> postgres: heikki postgres [local]
> ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8]
> postgres: heikki postgres [local]
> ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34]
> postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34]
> postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a]
> postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9]
> postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0]
> postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe]
> postgres: heikki postgres [local]
> ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9]
> postgres: heikki postgres [local]
> ANALYZE(ProcessUtility+0x136)[0x564889f0afb1]
> postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8]
> postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b]
> postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015]
> postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6]
> postgres: heikki postgres [local]
> ANALYZE(PostgresMain+0x80c)[0x564889f06fd7]
> postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876]
> postgres: heikki postgres [local]
> ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3]
> postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e]
> postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0]
> postgres: heikki postgres [local]
> ANALYZE(PostmasterMain+0x152b)[0x564889e2214d]
> postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4]
> /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305]
> postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61]
> 2024-04-03 20:15:49.157 EEST [262101] LOG:  server process (PID 262232)
> was terminated by signal 6: Aborted

I realized the same error while working on Jakub's benchmarking results.

Cause: I was using the nblocks variable to check how many blocks will
be returned from the streaming API. But I realized that sometimes the
number returned from BlockSampler_Init() is not equal to the number of
blocks that BlockSampler_Next() will return, as the block sampling
algorithm decides on the fly how many blocks to return, using some
random seeds.

There are a couple of solutions I thought of:

1- Use BlockSampler_HasMore() instead of nblocks in the main loop in
the acquire_sample_rows():

Streaming API uses this function to prefetch block numbers.
BlockSampler_HasMore() will reach the end first, as it is used while
prefetching, so it will start to return false while there are still
buffers to return from the streaming API. That will cause some buffers
at the end to not be processed.

2- Expose something (function, variable etc.) from the streaming API
to understand if the read is finished and there is no buffer to
return:

I think this works but I am not sure if the streaming API allows
something like that.

3- Check every buffer returned from the streaming API, if it is
invalid stop the main loop in the acquire_sample_rows():

This solves the problem but there will be two if checks for each
buffer returned,
- in heapam_scan_analyze_next_block() to check if the returned buffer is invalid
- to break main loop in acquire_sample_rows() if
heapam_scan_analyze_next_block() returns false
One of the if cases can be bypassed by moving
heapam_scan_analyze_next_block()'s code to the main loop in the
acquire_sample_rows().
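Outside PostgreSQL, the shape of that third option is a consumer loop that terminates on a sentinel rather than on a precomputed count. In this toy sketch, the stream and InvalidBlock are stand-ins for the streaming read API and InvalidBuffer; none of these names come from the actual patch:

```c
#include <stddef.h>

#define InvalidBlock ((int) -1)

/* A toy stream: hands out blocks[], then the invalid sentinel forever. */
struct toy_stream
{
    const int  *blocks;
    size_t      nblocks;
    size_t      pos;
};

static int
stream_next(struct toy_stream *s)
{
    if (s->pos < s->nblocks)
        return s->blocks[s->pos++];
    return InvalidBlock;
}

/*
 * Consume until the stream reports exhaustion, not for a fixed count, so a
 * mismatch between the predicted and actual number of blocks cannot leave
 * any block unprocessed.
 */
static int
consume_all(struct toy_stream *s)
{
    int         processed = 0;
    int         blk;

    while ((blk = stream_next(s)) != InvalidBlock)
        processed++;            /* process the block here */
    return processed;
}
```

The loop needs only the one sentinel check per iteration, which is the property the single-if variant above is after.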

I implemented the third solution, here is v6.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 3 Apr 2024 15:14:15 +0300
Subject: [PATCH v6] Use streaming read API in ANALYZE

ANALYZE command gets random tuples using 

Re: On disable_cost

2024-04-03 Thread Robert Haas
On Tue, Apr 2, 2024 at 12:58 PM Tom Lane  wrote:
> Not really.  But if you had, say, a join of a promoted path to a
> disabled path, that would be treated as on-par with a join of two
> regular paths, which seems like it'd lead to odd choices.  Maybe
> it'd be fine, but my gut says it'd likely not act nicely.  As you
> say, it's a lot easier to believe that only-promoted or only-disabled
> situations would behave sanely.

Makes sense.

I wanted to further explore the idea of just not generating plans of
types that are currently disabled. I looked into doing this for
enable_indexscan and enable_indexonlyscan. As a first step, I
investigated how those settings work now, and was horrified. I don't
know whether I just wasn't paying attention back when the original
index-only scan work was done -- I remember discussing
enable_indexonlyscan with you at the time -- or whether it got changed
subsequently. Anyway, the current behavior is:

[A] enable_indexscan=false adds disable_cost to the cost of every
Index Scan path *and also* every Index-Only Scan path. So disabling
index-scans also in effect discourages the use of index-only scans,
which would make sense if we didn't have a separate setting called
enable_indexonlyscan, but we do. Given that, I think this is
completely and utterly wrong.

[b] enable_indexonlyscan=false causes index-only scan paths not to be
generated at all, but instead, we generate index-scan paths to do the
same thing that we would not have generated otherwise. This is weird
because it means that disabling one plan type causes us to consider
additional plans of another type, which seems like a thing that a user
might not expect. It's more defensible than [A], though, because you
could argue that we only omit the index scan path as an optimization,
on the theory that it will always lose to the index-only scan path,
and thus if the index-only scan path is not generated, there's a point
to generating the index scan path after all, so we should. However, it
seems unlikely to me that someone reading the one line of
documentation that we have about this parameter would be able to guess
that it works this way.

Here's an example of how the current system behaves:

robert.haas=# explain select count(*) from pgbench_accounts;
   QUERY PLAN
-
 Aggregate  (cost=2854.29..2854.30 rows=1 width=8)
   ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts
 (cost=0.29..2604.29 rows=10 width=0)
(2 rows)

robert.haas=# set enable_indexscan=false;
SET
robert.haas=# explain select count(*) from pgbench_accounts;
  QUERY PLAN
--
 Aggregate  (cost=2890.00..2890.01 rows=1 width=8)
   ->  Seq Scan on pgbench_accounts  (cost=0.00..2640.00 rows=10 width=0)
(2 rows)

robert.haas=# set enable_seqscan=false;
SET
robert.haas=# set enable_bitmapscan=false;
SET
robert.haas=# explain select count(*) from pgbench_accounts;
QUERY PLAN
--
 Aggregate  (cost=1002854.29..1002854.30 rows=1 width=8)
   ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts
 (cost=100.29..1002604.29 rows=10 width=0)
(2 rows)

robert.haas=# set enable_indexonlyscan=false;
SET
robert.haas=# explain select count(*) from pgbench_accounts;
  QUERY PLAN
---
 Aggregate  (cost=1002890.00..1002890.01 rows=1 width=8)
   ->  Seq Scan on pgbench_accounts
(cost=100.00..1002640.00 rows=10 width=0)
(2 rows)

The first time we run the query, it picks an index-only scan because
it's the cheapest. When index scans are disabled, the query now picks
a sequential scan, even though it wasn't use an index-only scan,
because the index scan that it was using is perceived to have become
very expensive. When we then shut off sequential scans and bitmap-only
scans, it switches back to an index-only scan, because setting
enable_indexscan=false didn't completely disable index-only scans, but
just made them expensive. But now everything looks expensive, so we go
back to the same plan we had initially, except with the cost increased
by a bazillion. Finally, when we disable index-only scans, that
removes that plan from the pool, so now we pick the second-cheapest
plan overall, which in this case is a sequential scan.

So just to see what would happen, I wrote a patch to make
enable_indexscan and enable_indexonlyscan do exactly what they say on
the tin: when you set one of them to false, paths of that type are not
generated, 

Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Daniel Gustafsson
> On 3 Apr 2024, at 17:29, Tom Lane  wrote:
> 
> Jacob Champion  writes:
>> As far as I can tell, no versions of LibreSSL so far provide
>> X509_get_signature_info(), so this patch is probably a bit too
>> aggressive.
> 
> Another problem with cutting support is how many buildfarm members
> will we lose.  I scraped recent configure logs and got the attached
> results.  I count 3 machines running 1.0.1,

Support for 1.0.1 was removed with 8e278b657664 in July 2023 so those are not
building with OpenSSL enabled already.

> 18 running some flavor of 1.0.2,

massasauga and snakefly run the ssl_passphrase_callback-check test but none of
these run the ssl-check tests AFAICT, so we have very low coverage as is.  The
fact that very few animals run the ssl tests is a pet peeve of mine, it would
be nice if we could get broader coverage there.

Worth noting is that the OpenSSL check in configure.ac only reports what the
version of the OpenSSL binary in $PATH is, not which version of the library
that we build against (using --with-libs/--with-includes etc).

--
Daniel Gustafsson





Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Daniel Gustafsson
> On 3 Apr 2024, at 19:38, Tom Lane  wrote:
> 
> Jacob Champion  writes:
>> The RHEL7-alikes are the biggest set, but that's already discussed
>> above. Looks like SUSE 12 goes EOL later this year (October 2024), and
>> it ships OpenSSL 1.1.1 as an option. Already-dead distros are Ubuntu
>> 16.04 (April 2021), Photon 2 (January 2023), and Photon 3 (March
>> 2024). That leaves AL2, OpenIndiana Hipster, and Solaris 11.4, all of
>> which appear to have newer versions of OpenSSL shipped and selectable.
> 
> The discussion we had last year concluded that we were OK with
> dropping 1.0.1 support when RHEL6 goes out of extended support
> (June 2024 per this thread, I didn't check it).  Seems like we
> should have the same policy for RHEL7.  Also, calling Photon 3
> dead because it went EOL three days ago seems over-hasty.
> 
> Bottom line for me is that pulling 1.0.1 support now is OK,
> but I think pulling 1.0.2 is premature.

Is Red Hat building and shipping v17 packages for RHEL7 ELS customers?  If
not then it seems mostly academic to tie our dependencies to RHEL ELS unless
I'm missing something.

--
Daniel Gustafsson





Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Pavel Borisov
Hi, Alexander!

On Wed, 3 Apr 2024 at 22:18, Alexander Korotkov 
wrote:

> On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera 
> wrote:
> >
> > On 2024-Apr-03, Alexander Korotkov wrote:
> >
> > > Regarding the shmem data structure for LSN waiters.  I didn't pick
> > > LWLock or ConditionVariable, because I needed the ability to wake up
> > > only those waiters whose LSN is already replayed.  In my experience
> > > waking up a process is way slower than scanning a short flat array.
> >
> > I agree, but I think that's unrelated to what I was saying, which is
> > just the patch I attach here.
>
> Oh, sorry for the confusion.  I'd re-read your message.  Indeed you
> meant this very clearly!
>
> I'm good with the patch.  Attached revision contains a bit of a commit
> message.
>
> > > However, I agree that when the number of waiters is very high and flat
> > > array may become a problem.  It seems that the pairing heap is not
> > > hard to use for shmem structures.  The only memory allocation call in
> > > pairingheap.c is in pairingheap_allocate().  So, it's only needed to
> > > be able to initialize the pairing heap in-place, and it will be fine
> > > for shmem.
> >
> > Ok.
> >
> > With the code as it stands today, everything in WaitLSNState apart from
> > the pairing heap is accessed without any locking.  I think this is at
> > least partly OK because each backend only accesses its own entry; but it
> > deserves a comment.  Or maybe something more, because WaitLSNSetLatches
> > does modify the entry for other backends.  (Admittedly, this could only
> > happen for backends that are already sleeping, and it only happens
> > with the lock acquired, so it's probably okay.  But clearly it deserves
> > a comment.)
>
> Please, check 0002 patch attached.  I found it easier to move two
> assignments we previously moved out of lock, into the lock; then claim
> WaitLSNState.procInfos is also protected by WaitLSNLock.
>
Could you re-attach 0002? It seems it failed to attach to the previous
message.

Regards,
Pavel


Re: Use streaming read API in ANALYZE

2024-04-03 Thread Nazir Bilal Yavuz
Hi Jakub,

Thank you for looking into this and doing a performance analysis.

On Wed, 3 Apr 2024 at 11:42, Jakub Wartak  wrote:
>
> On Tue, Apr 2, 2024 at 9:24 AM Nazir Bilal Yavuz  wrote:
> [..]
> > v4 is rebased on top of v14 streaming read API changes.
>
> Hi Nazir, so with streaming API committed, I gave a try to this patch.
> With autovacuum=off and 30GB table on NVMe (with standard readahead of
> 256kb and ext4, Debian 12, kernel 6.1.0, shared_buffers = 128MB
> default) created using: create table t as select repeat('a', 100) || i
> || repeat('b', 500) as filler from generate_series(1, 4500) as i;
>
> on master, effect of mainteance_io_concurency [default 10] is like
> that (when resetting the fs cache after each ANALYZE):
>
> m_io_c = 0:
> Time: 3137.914 ms (00:03.138)
> Time: 3094.540 ms (00:03.095)
> Time: 3452.513 ms (00:03.453)
>
> m_io_c = 1:
> Time: 2972.751 ms (00:02.973)
> Time: 2939.551 ms (00:02.940)
> Time: 2904.428 ms (00:02.904)
>
> m_io_c = 2:
> Time: 1580.260 ms (00:01.580)
> Time: 1572.132 ms (00:01.572)
> Time: 1558.334 ms (00:01.558)
>
> m_io_c = 4:
> Time: 938.304 ms
> Time: 931.772 ms
> Time: 920.044 ms
>
> m_io_c = 8:
> Time: 666.025 ms
> Time: 660.241 ms
> Time: 648.848 ms
>
> m_io_c = 16:
> Time: 542.450 ms
> Time: 561.155 ms
> Time: 539.683 ms
>
> m_io_c = 32:
> Time: 538.487 ms
> Time: 541.705 ms
> Time: 538.101 ms
>
> with patch applied:
>
> m_io_c = 0:
> Time: 3106.469 ms (00:03.106)
> Time: 3140.343 ms (00:03.140)
> Time: 3044.133 ms (00:03.044)
>
> m_io_c = 1:
> Time: 2959.817 ms (00:02.960)
> Time: 2920.265 ms (00:02.920)
> Time: 2911.745 ms (00:02.912)
>
> m_io_c = 2:
> Time: 1581.912 ms (00:01.582)
> Time: 1561.444 ms (00:01.561)
> Time: 1558.251 ms (00:01.558)
>
> m_io_c = 4:
> Time: 908.116 ms
> Time: 901.245 ms
> Time: 901.071 ms
>
> m_io_c = 8:
> Time: 619.870 ms
> Time: 620.327 ms
> Time: 614.266 ms
>
> m_io_c = 16:
> Time: 529.885 ms
> Time: 526.958 ms
> Time: 528.474 ms
>
> m_io_c = 32:
> Time: 521.185 ms
> Time: 520.713 ms
> Time: 517.729 ms
>
> No difference to me, which seems to be good. I've double checked and
> patch used the new way
>
> acquire_sample_rows -> heapam_scan_analyze_next_block ->
> ReadBufferExtended -> ReadBuffer_common (inlined) -> WaitReadBuffers
> -> mdreadv -> FileReadV -> pg_preadv (inlined)
> acquire_sample_rows -> heapam_scan_analyze_next_block ->
> ReadBufferExtended -> ReadBuffer_common (inlined) -> StartReadBuffer
> -> ...
>
> I gave also io_combine_limit to 32 (max, 256kB) a try and got those
> slightly better results:
>
> [..]
> m_io_c = 16:
> Time: 494.599 ms
> Time: 496.345 ms
> Time: 973.500 ms
>
> m_io_c = 32:
> Time: 461.031 ms
> Time: 449.037 ms
> Time: 443.375 ms
>
> and that (last one) apparently was able to push it to ~50-60k still
> random IOPS range, the rareq-sz was still ~8 (9.9) kB as analyze was
> still reading randomly, so I assume no merging was done:
>
> Devicer/s rMB/s   rrqm/s  %rrqm r_await rareq-sz
> w/s wMB/s   wrqm/s  %wrqm w_await wareq-sz d/s dMB/s
> drqm/s  %drqm d_await dareq-sz f/s f_await  aqu-sz  %util
> nvme0n1   61212.00591.82 0.00   0.000.10 9.90
> 2.00  0.02 0.00   0.000.0012.000.00  0.00
> 0.00   0.000.00 0.000.000.006.28  85.20
>
> So in short it looks good to me.

My results are similar to yours, also I realized a bug while working
on your benchmarking cases. I will share the cause and the fix soon.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Combine Prune and Freeze records emitted by vacuum

2024-04-03 Thread Peter Geoghegan
On Wed, Apr 3, 2024 at 1:04 PM Melanie Plageman
 wrote:
> Thanks! And thanks for updating the commitfest entry!

Nice work!

-- 
Peter Geoghegan




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-04-03 Thread Andres Freund
Hi,

On 2024-02-14 16:23:38 +0900, Michael Paquier wrote:
> It is possible to retrieve this information with some WITH RECURSIVE queries as well, as
> mentioned upthread.  Perhaps we could consider documenting these tricks?

I think it's sufficiently hard that it's not a reasonable way to do this.

Greetings,

Andres Freund




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Jacob Champion
On Wed, Apr 3, 2024 at 11:13 AM Tom Lane  wrote:
> wikipedia says that RHEL7 ends ELS as of June 2026 [1].

I may have misunderstood something in here then:


https://www.redhat.com/en/blog/announcing-4-years-extended-life-cycle-support-els-red-hat-enterprise-linux-7

> ELS for RHEL 7 is now available for 4 years, starting on July 1, 2024.

Am I missing something?

--Jacob




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Alexander Korotkov
On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera  wrote:
>
> On 2024-Apr-03, Alexander Korotkov wrote:
>
> > Regarding the shmem data structure for LSN waiters.  I didn't pick
> > LWLock or ConditionVariable, because I needed the ability to wake up
> > only those waiters whose LSN is already replayed.  In my experience
> > waking up a process is way slower than scanning a short flat array.
>
> I agree, but I think that's unrelated to what I was saying, which is
> just the patch I attach here.

Oh, sorry for the confusion.  I've re-read your message.  Indeed you
meant this very clearly!

I'm good with the patch.  Attached revision contains a bit of a commit message.

> > However, I agree that when the number of waiters is very high, a flat
> > array may become a problem.  It seems that the pairing heap is not
> > hard to use for shmem structures.  The only memory allocation call in
> > pairingheap.c is in pairingheap_allocate().  So, it's only needed to
> > be able to initialize the pairing heap in-place, and it will be fine
> > for shmem.
>
> Ok.
>
> With the code as it stands today, everything in WaitLSNState apart from
> the pairing heap is accessed without any locking.  I think this is at
> least partly OK because each backend only accesses its own entry; but it
> deserves a comment.  Or maybe something more, because WaitLSNSetLatches
> does modify the entry for other backends.  (Admittedly, this could only
> happens for backends that are already sleeping, and it only happens
> with the lock acquired, so it's probably okay.  But clearly it deserves
> a comment.)

Please, check 0002 patch attached.  I found it easier to move two
assignments we previously moved out of lock, into the lock; then claim
WaitLSNState.procInfos is also protected by WaitLSNLock.

> Don't we need to WaitLSNCleanup() during error recovery or something?

Yes, there is WaitLSNCleanup(). It's currently only called from one
place, given that waiting for LSN can't be invoked from background
workers or inside a transaction.

--
Regards,
Alexander Korotkov




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Tom Lane
Jacob Champion  writes:
> On Wed, Apr 3, 2024 at 10:38 AM Tom Lane  wrote:
>> Bottom line for me is that pulling 1.0.1 support now is OK,
>> but I think pulling 1.0.2 is premature.

> Okay, but IIUC, waiting for it to drop out of extended support means
> we deal with it for four more years. That seems excessive.

wikipedia says that RHEL7 ends ELS as of June 2026 [1].

regards, tom lane

[1] https://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux#Product_life_cycle




Re: Statistics Import and Export

2024-04-03 Thread Corey Huinker
>
>
> As far as that goes, it shouldn't be that hard to deal with, at least
> not for "soft" errors which hopefully cover most input-function
> failures these days.  You should be invoking array_in via
> InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
> (Look at pg_input_error_info() for useful precedent.)
>

Ah, my understanding may be out of date. I was under the impression that
that mechanism relied on the cooperation of the per-element input
function, so even if we got all the builtin datatypes to play nice with
*Safe(), we were always going to be at risk with a user-defined input
function.


> There might be something to be said for handling all the error
> cases via an ErrorSaveContext and use of ereturn() instead of
> ereport().  Not sure if it's worth the trouble or not.
>

It would help us tailor the user experience. Right now we have several
endgames. To recap:

1. NULL input =>  Return NULL. (because strict).
2. Actual error (permissions, cache lookup not found, etc) => Raise ERROR
(thus ruining binary upgrade)
3. Call values are so bad (examples: attname not found, required stat
missing) that nothing can recover => WARN, return FALSE.
4. At least one stakind-stat is wonky (impossible for datatype, missing
stat pair, wrong type on input parameter), but that's the worst of it => 1
to N WARNs, write stats that do make sense, return TRUE.
5. Hunky-dory. => No warns. Write all stats. return TRUE.

Which of those seem like good ereturn candidates to you?


Re: Security lessons from liblzma

2024-04-03 Thread Peter Eisentraut

On 30.03.24 00:14, Daniel Gustafsson wrote:

One take-away for me is how important it is to ship recipes for regenerating
any testdata which is included in generated/compiled/binary format.  Kind of
how we in our tree ship the config for test TLS certificates and keys which can
be manually inspected, and used to rebuild the testdata (although the risk for
injections in this particular case seems low).  Bad things can still be
injected, but formats which allow manual review at least goes some way towards
lowering risk.


I actually find the situation with the test TLS files quite 
unsatisfactory.  While we have build rules, the output files are not 
reproducible, both because of some inherent randomness in the 
generation, and because different OpenSSL versions produce different 
details.  So you just have to "trust" that what's there now makes sense. 
 Of course, you can use openssl tools to unpack these files, but that 
is difficult and unreliable unless you know exactly what you are looking 
for.  Also, for example, do we even know whether all the files that are 
there now are even used by any tests?


A few years ago, some guy on the internet sent in a purported update to 
one of the files [0], which I ended up committing, but I remember that 
that situation gave me quite some pause at the time.


It would be better if we created the required test files as part of the 
test run.  (Why not?  Too slow?)  Alternatively, I have been thinking 
that maybe we could make the output more reproducible by messing with 
whatever random seed OpenSSL uses.  Or maybe use a Python library to 
create the files.  Some things to think about.


[0]: 
https://www.postgresql.org/message-id/FEF81714-D479-4512-839B-C769D2605F8A%40yesql.se






Re: LogwrtResult contended spinlock

2024-04-03 Thread Alvaro Herrera
On 2024-Apr-03, Bharath Rupireddy wrote:

> On Wed, Apr 3, 2024 at 4:49 PM Alvaro Herrera  wrote:

> > So what I do in the attached 0001 is stop using the XLogwrtResult struct
> > in XLogCtl and replace it with separate Write and Flush values, and add
> > the macro XLogUpdateLocalLogwrtResult() that copies the values of Write
> > and Flush from the shared XLogCtl to the local variable given as macro
> > argument.  (I also added our idiomatic do {} while(0) to the macro
> > definition, for safety).  The new members are XLogCtl->logWriteResult
> > and XLogCtl->logFlushResult and as of 0001 are just XLogRecPtr, so
> > essentially identical semantics as the previous code.  No atomic access
> > yet!
> 
> +1.

> Also, I'm fine with renaming XLogUpdateLocalLogwrtResult() to
> RefreshXLogWriteResult().

Okay, I have pushed 0001 with the name change, will see about getting
the others in tomorrow.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: Security lessons from liblzma - libsystemd

2024-04-03 Thread Andres Freund
Hi,

As most will know by now, the way xz debacle was able to make sshd vulnerable
was through a dependency from sshd to libsystemd and then from libsystemd to
liblzma. One lesson from this is that unnecessary dependencies can still
increase risk.

It's worth noting that we have an optional dependency on libsystemd as well.

Openssh has now integrated [1] a patch to remove the dependency on libsystemd
for triggering service manager readyness notifications, by inlining the
necessary function. That's not hard, the protocol is pretty simple.

I suspect we should do the same. We're not even close to being a target as
attractive as openssh, but still, it seems unnecessary.

Intro into the protocol is at [2], with real content and outline of the
relevant code at [3].


An argument could be made to instead just remove support, but I think it's
quite valuable to have intra-service dependencies that can rely on the server
actually having started up.

Greetings,

Andres Freund

[1] https://bugzilla.mindrot.org/show_bug.cgi?id=2641
[2] 
https://www.freedesktop.org/software/systemd/man/devel/systemd.html#Readiness%20Protocol
[3] https://www.freedesktop.org/software/systemd/man/devel/sd_notify.html#Notes




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Jacob Champion
On Wed, Apr 3, 2024 at 10:38 AM Tom Lane  wrote:
> Also, calling Photon 3
> dead because it went EOL three days ago seems over-hasty.

Well, March 1, but either way I thought "dead" for the purposes of
this thread meant "you can't build the very latest version of Postgres
on it", not "we've forgotten it exists". Back branches will continue
to need support and testing.

> Bottom line for me is that pulling 1.0.1 support now is OK,
> but I think pulling 1.0.2 is premature.

Okay, but IIUC, waiting for it to drop out of extended support means
we deal with it for four more years. That seems excessive.

--Jacob




Re: Can't compile PG 17 (master) from git under Msys2 autoconf

2024-04-03 Thread Alvaro Herrera
Hi Regina,

On 2024-Mar-27, Regina Obe wrote:

> The error is 
> 
> rm -f '../../src/include/storage/lwlocknames.h'
> cp -pR ../../backend/storage/lmgr/lwlocknames.h
> '../../src/include/storage/lwlocknames.h'
> cp: cannot stat '../../backend/storage/lmgr/lwlocknames.h': No such file or
> directory
> make[1]: *** [Makefile:143: ../../src/include/storage/lwlocknames.h] Error 1
> make[1]: Leaving directory '/projects/postgresql/postgresql-git/src/backend'
> make: *** [../../src/Makefile.global:382: submake-generated-headers] Error 2

Hmm, I changed these rules again in commit da952b415f44, maybe your
problem is with that one?  I wonder if changing the references to
../include/storage/lwlocklist.h to
$(topdir)/src/include/storage/lwlocklist.h (and similar things in
src/backend/storage/lmgr/Makefile) would fix it.

Do you run configure in the source directory or a separate build one?

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"If it is not right, do not do it.
If it is not true, do not say it." (Marcus Aurelius, Meditations)




Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
I committed v23-0001.  Here is a rebased version of the remaining patches.
I intend to test the masking idea from Ants next.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 295b03530de5f42fe876b4489191da2f8dc83194 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v24 1/2] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |  29 ++-
 src/port/pg_popcount_avx512.c|  49 ++
 src/port/pg_popcount_avx512_choose.c |  71 
 src/test/regress/expected/bit.out|  24 +++
 src/test/regress/sql/bit.sql |   4 +
 15 files changed, 673 insertions(+), 5 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# -
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes"
+else
+  pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define 

Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Tom Lane
Jacob Champion  writes:
> The RHEL7-alikes are the biggest set, but that's already discussed
> above. Looks like SUSE 12 goes EOL later this year (October 2024), and
> it ships OpenSSL 1.1.1 as an option. Already-dead distros are Ubuntu
> 16.04 (April 2021), Photon 2 (January 2023), and Photon 3 (March
> 2024). That leaves AL2, OpenIndiana Hipster, and Solaris 11.4, all of
> which appear to have newer versions of OpenSSL shipped and selectable.

The discussion we had last year concluded that we were OK with
dropping 1.0.1 support when RHEL6 goes out of extended support
(June 2024 per this thread, I didn't check it).  Seems like we
should have the same policy for RHEL7.  Also, calling Photon 3
dead because it went EOL three days ago seems over-hasty.

Bottom line for me is that pulling 1.0.1 support now is OK,
but I think pulling 1.0.2 is premature.

regards, tom lane




Re: Psql meta-command conninfo+

2024-04-03 Thread Imseih (AWS), Sami
Building the docs fails for v26. The error is:

ref/psql-ref.sgml:1042: element member: validity error : Element term is not
declared in member list of possible children

I am able to build up to v24 before the  was replaced with 

I tested building with a modified structure as below; the  is a 
that has a  within it.  Each field is a simple list member and
the name of the fields should be .

  
    \conninfo[+]
    
    
    Outputs a string displaying information about the current
    database connection. When + is appended,
    more details about the connection are displayed in table
    format:
    
    Database:The database name of the connection.
    Authenticated User:The authenticated database user of the connection.
    Current User:The user name of the current execution context;
      see the current_user() function in
       for more details.
    
    
    
  

Regards,

Sami


Re: Improve eviction algorithm in ReorderBuffer

2024-04-03 Thread Jeff Davis
On Wed, 2024-04-03 at 01:45 -0700, Jeff Davis wrote:
> I suggest that you add a "heap_index" field to ReorderBufferTXN that
> would point to the index into the heap's array (the same as
> bh_nodeidx_entry.index in your patch). Each time an element moves
> within the heap array, just follow the pointer to the
> ReorderBufferTXN
> object and update the heap_index -- no hash lookup required.

It looks like my email was slightly too late, as the work was already
committed.

My suggestion is not required for 17, and so it's fine if this waits
until the next CF. If it turns out to be a win we can consider
backporting to 17 just to keep the code consistent, otherwise it can go
in 18.

Regards,
Jeff Davis





Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Jacob Champion
On Wed, Apr 3, 2024 at 8:29 AM Tom Lane  wrote:
> I count 3 machines running 1.0.1, 18 running some flavor
> of 1.0.2, and 7 running various LibreSSL versions.

I don't know all the tradeoffs with buildfarm wrangling, but IMO all
those 1.0.2 installations are the most problematic, so I dug in a bit:

  arowana CentOS 7
  batfish Ubuntu 16.04.3
  boa RHEL 7
  buri CentOS 7
  butterflyfish Photon 2.0
  clam RHEL 7.1
  cuon Ubuntu 16.04
  dhole CentOS 7.4
  hake OpenIndiana hipster
  mantid CentOS 7.9
  margay Solaris 11.4.42
  massasauga Amazon Linux 2
  myna Photon 3.0
  parula Amazon Linux 2
  rhinoceros CentOS 7.1
  shelduck SUSE 12SP5
  siskin RHEL 7.9
  snakefly Amazon Linux 2

The RHEL7-alikes are the biggest set, but that's already discussed
above. Looks like SUSE 12 goes EOL later this year (October 2024), and
it ships OpenSSL 1.1.1 as an option. Already-dead distros are Ubuntu
16.04 (April 2021), Photon 2 (January 2023), and Photon 3 (March
2024). That leaves AL2, OpenIndiana Hipster, and Solaris 11.4, all of
which appear to have newer versions of OpenSSL shipped and selectable.

--Jacob




Re: Statistics Import and Export

2024-04-03 Thread Tom Lane
Corey Huinker  writes:
> - functions strive to not ERROR unless absolutely necessary. The biggest
> exposure is the call to array_in().

As far as that goes, it shouldn't be that hard to deal with, at least
not for "soft" errors which hopefully cover most input-function
failures these days.  You should be invoking array_in via
InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
(Look at pg_input_error_info() for useful precedent.)

There might be something to be said for handling all the error
cases via an ErrorSaveContext and use of ereturn() instead of
ereport().  Not sure if it's worth the trouble or not.
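A rough sketch of the pattern Tom describes, modeled on pg_input_error_info() (this is a backend-side fragment, not runnable standalone; the helper name and surrounding code are invented for illustration):

```c
#include "postgres.h"
#include "fmgr.h"
#include "nodes/miscnodes.h"
#include "utils/lsyscache.h"

/*
 * Hypothetical helper: parse an array literal "softly".  On bad input,
 * InputFunctionCallSafe() returns false and records the error in
 * escontext instead of throwing ERROR.
 */
static bool
soft_array_in(char *str, Oid element_type, int32 typmod, Datum *result)
{
	Oid			arraytype = get_array_type(element_type);
	Oid			typinput;
	Oid			typioparam;
	FmgrInfo	flinfo;
	ErrorSaveContext escontext = {T_ErrorSaveContext};

	getTypeInputInfo(arraytype, &typinput, &typioparam);
	fmgr_info(typinput, &flinfo);

	if (!InputFunctionCallSafe(&flinfo, str, typioparam, typmod,
							   (Node *) &escontext, result))
	{
		/* escontext.error_data holds the would-be error; e.g. WARN here */
		return false;
	}
	return true;
}
```

Note the caveat raised upthread still applies: a user-defined element input function that predates the soft-error mechanism will ERROR out anyway, so this only downgrades failures for cooperating input functions.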

regards, tom lane




Re: Use streaming read API in ANALYZE

2024-04-03 Thread Heikki Linnakangas

On 03/04/2024 13:31, Nazir Bilal Yavuz wrote:

Streaming API has been committed but the committed version has a minor
change, the read_stream_begin_relation function takes Relation instead
of BufferManagerRelation now. So, here is a v5 which addresses this
change.


I'm getting a repeatable segfault / assertion failure with this:

postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10);
CREATE TABLE
postgres=# insert into tengiga select g, repeat('x', 900) from 
generate_series(1, 140) g;

INSERT 0 140
postgres=# set default_statistics_target = 10; ANALYZE tengiga;
SET
ANALYZE
postgres=# set default_statistics_target = 100; ANALYZE tengiga;
SET
ANALYZE
postgres=# set default_statistics_target =1000; ANALYZE tengiga;
SET
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File: 
"heapam_handler.c", Line: 1079, PID: 262232
postgres: heikki postgres [local] 
ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8]
postgres: heikki postgres [local] 
ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34]

postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34]
postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a]
postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9]
postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0]
postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe]
postgres: heikki postgres [local] 
ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9]
postgres: heikki postgres [local] 
ANALYZE(ProcessUtility+0x136)[0x564889f0afb1]

postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8]
postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b]
postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015]
postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6]
postgres: heikki postgres [local] 
ANALYZE(PostgresMain+0x80c)[0x564889f06fd7]

postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876]
postgres: heikki postgres [local] 
ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3]

postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e]
postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0]
postgres: heikki postgres [local] 
ANALYZE(PostmasterMain+0x152b)[0x564889e2214d]

postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305]
postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61]
2024-04-03 20:15:49.157 EEST [262101] LOG:  server process (PID 262232) 
was terminated by signal 6: Aborted


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Adding comments to help understand psql hidden queries

2024-04-03 Thread David Christensen
I got Greg's blessing on squashing the commits down, and now including
a v4 with additional improvements on the output formatting front.
Main changes:

- all generated comments are the same width
- width has been bumped to 80
- removed _() functions for consumers of the new output functions

This patch adds two new helper functions, OutputComment() and
OutputCommentStars() to output and wrap the comments to the
appropriate widths.  Everything should continue to work just fine if
we want to adjust the width by just adjusting the MAX_COMMENT_WIDTH
symbol.

I removed _() in the output of the query/stars since there'd be no
sensible existing translations for the constructed string, which
included the query string itself.  If we need it for the "QUERY"
string, this could be added fairly easily, but the existing piece
would have been nonsensical and never used in practice.

Thanks,

David


v4-0001-Improve-SQL-comments-on-echo-hidden-output.patch
Description: Binary data


Re: Combine Prune and Freeze records emitted by vacuum

2024-04-03 Thread Melanie Plageman
On Wed, Apr 3, 2024 at 12:34 PM Heikki Linnakangas  wrote:
>
> On 03/04/2024 17:18, Melanie Plageman wrote:
> > I noticed you didn't make the comment updates I suggested in my
> > version 13 here [1]. A few of them are outdated references to
> > heap_page_prune() and one to a now deleted variable name
> > (all_visible_except_removable).
> >
> > I applied them to your v13 and attached the diff.
>
> Applied those changes, and committed. Thank you!

Thanks! And thanks for updating the commitfest entry!

- Melanie




Re: RFC: Additional Directory for Extensions

2024-04-03 Thread Christoph Berg
> Fwiw, I wrote this patch to solve the problem of testing extensions at
> build-time where the build process does not have write access to
> PGSHAREDIR. It solves that problem quite well, almost all PG
> extensions have build-time test coverage now (where there was
> basically 0 before).

Also, it's called extension "destdir" because it behaves like DESTDIR
in Makefiles: It prepends the given path to the path that PG is trying
to open when set. So it doesn't allow arbitrary new locations as of
now, just /home/build/foo-1/debian/foo/usr/share/postgresql/17/extension
in addition to /usr/share/postgresql/17/extension. (That is what the
Debian package build process needs, so that restriction/design choice
made sense.)

> Security is not a concern at this point as everything is running as
> the same user, and the test cluster will be wiped right after the
> test. I figured marking the setting as "super user" only was enough
> security at that point, but I would recommend another audit before
> using it together with "trusted" extensions and other things in
> production.

That's also included in the current GUC description:

   This directory is prepended to paths when loading extensions
   (control and SQL files), and to the '$libdir' directive when
   loading modules that back functions. The location is made
   configurable to allow build-time testing of extensions that have not
   been installed to their proper location yet.

Perhaps I should have included a more verbose "NOT FOR PRODUCTION"
there.

As for compatibility, the patch has been part of the PG 9.5..17 now
for several years, and I'm very happy with extra test coverage it
provides, especially on the Debian architectures that don't have
"autopkgtest" runners yet.

Christoph




Re: Streaming read-ready sequential scan code

2024-04-03 Thread Melanie Plageman
On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas  wrote:
>
> On 01/04/2024 22:58, Melanie Plageman wrote:
> > Attached v7 has version 14 of the streaming read API as well as a few
> > small tweaks to comments and code.
>
> I saw benchmarks in this thread to show that there's no regression when
> the data is in cache, but I didn't see any benchmarks demonstrating the
> benefit of this. So I ran this quick test:

Good point! It would be good to show why we would actually want this
patch. Attached v8 is rebased over current master (which now has the
streaming read API).

On the topic of BAS_BULKREAD buffer access strategy, I think the least
we could do is add an assert like this to read_stream_begin_relation()
after calculating max_pinned_buffers.

Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);

Perhaps we should do more? I think with a ring size of 16 MB, large
SELECTs are safe for now. But I know future developers will make
changes and it would be good not to assume they will understand that
pinning more buffers than the size of the ring effectively invalidates
the ring.

- Melanie
From cfccafec650a77c53b1d78180b52db31742181ff Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 27 Mar 2024 20:25:06 -0400
Subject: [PATCH v8 3/3] Sequential scans and TID range scans stream reads

Implementing streaming read support for heap sequential scans and TID
range scans includes three parts:

Allocate the read stream object in heap_beginscan(). On rescan, reset
the stream by releasing all pinned buffers and resetting the prefetch
block.

Implement a callback returning the next block to prefetch to the
read stream infrastructure.

Invoke the read stream API when a new page is needed. When the scan
direction changes, reset the stream.
---
 src/backend/access/heap/heapam.c | 94 
 src/include/access/heapam.h  | 15 +
 2 files changed, 97 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6b26f5bf8af..3546f637c13 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -221,6 +221,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * 
  */
 
+static BlockNumber
+heap_scan_stream_read_next(ReadStream *pgsr, void *private_data,
+		   void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) private_data;
+
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+		   scan->rs_prefetch_block,
+		   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* 
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * 
@@ -323,6 +342,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and heap
+	 * scans usually must go forwards before going backward.
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -459,12 +485,14 @@ heapbuildvis(TableScanDesc sscan)
 /*
  * heapfetchbuf - subroutine for heapgettup()
  *
- * This routine reads the next block of the relation into a buffer and returns
- * with that pinned buffer saved in the scan descriptor.
+ * This routine gets the next block of the relation from the read stream
+ * and saves that pinned buffer in the scan descriptor.
  */
 static inline void
 heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
+	Assert(scan->rs_read_stream);
+
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -479,19 +507,23 @@ heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (unlikely(!scan->rs_inited))
+	/*
+	 * If the scan direction is changing, reset the prefetch block to the
+	 * current block. Otherwise, we will incorrectly prefetch the blocks
+	 * between the prefetch block and the current block again before
+	 * prefetching blocks in the new, correct scan direction.
+	 */
+	if (unlikely(scan->rs_dir != dir))
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
-		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
+		scan->rs_prefetch_block = scan->rs_cblock;
+		read_stream_reset(scan->rs_read_stream);
 	}
-	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
 
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-		   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
+	

Re: Is it safe to cache data by GiST consistent function

2024-04-03 Thread Tom Lane
Michał Kłeczek  writes:
> On 3 Apr 2024, at 16:27, Tom Lane  wrote:
>> AFAIK it works.  I don't see any of the in-core ones doing so,
>> but at least range_gist_consistent and multirange_gist_consistent
>> are missing a bet by repeating their cache search every time.

> pg_trgm consistent caches trigrams but it has some logic to make sure cached 
> values are recalculated:

>   cache = (gtrgm_consistent_cache *) fcinfo->flinfo->fn_extra;
>   if (cache == NULL ||
>   cache->strategy != strategy ||
>   VARSIZE(cache->query) != querysize ||
>   memcmp((char *) cache->query, (char *) query, querysize) != 0)

> What I don’t understand is if it is necessary or it is enough to check 
> fn_extra==NULL.

Ah, I didn't think to search contrib.  Yes, you need to validate the
cache entry.  In this example, a rescan could insert a new query
value.  In general, an opclass support function could get called using
a pretty long-lived FunctionCallInfo (e.g. one in the index's relcache
entry), so it's unwise to assume that cached data is relevant to the
current call without checking.
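For what it's worth, the validation pattern pg_trgm uses boils down to the
following standalone sketch (plain C in place of the fmgr API; the
fn_info/get_query_cache names are made up for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for FmgrInfo: only the fn_extra slot matters for this sketch. */
struct fn_info
{
	void	   *fn_extra;
};

/* Cached, pre-processed form of the query, keyed by everything that can
 * change between calls (and between rescans). */
struct query_cache
{
	int			strategy;
	size_t		querysize;
	char		query[];	/* copy of the key bytes */
};

/*
 * Return a cache valid for (strategy, query).  Rebuild it when the slot is
 * empty or was built for different parameters -- checking fn_extra == NULL
 * alone is not enough, because a rescan can supply a new query value while
 * fn_extra still points at the old cache.
 */
static struct query_cache *
get_query_cache(struct fn_info *flinfo, int strategy,
				const char *query, size_t querysize)
{
	struct query_cache *cache = flinfo->fn_extra;

	if (cache == NULL ||
		cache->strategy != strategy ||
		cache->querysize != querysize ||
		memcmp(cache->query, query, querysize) != 0)
	{
		free(cache);			/* in the server this would live in fn_mcxt */
		cache = malloc(sizeof(*cache) + querysize);
		cache->strategy = strategy;
		cache->querysize = querysize;
		memcpy(cache->query, query, querysize);
		/* ... expensive pre-processing of the query would go here ... */
		flinfo->fn_extra = cache;
	}
	return cache;
}
```

(In the server the allocation would of course come from
fcinfo->flinfo->fn_mcxt rather than malloc.)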

regards, tom lane




Re: LogwrtResult contended spinlock

2024-04-03 Thread Bharath Rupireddy
On Wed, Apr 3, 2024 at 4:49 PM Alvaro Herrera  wrote:
>
> Thanks for keeping this moving forward.  I gave your proposed patches a
> look.   One thing I didn't like much is that we're adding a new member
> (Copy) to XLogwrtAtomic -- but this struct is supposed to be a mirror of
> XLogwrtResult for use with atomic access.  Since this new member is not
> added to XLogwrtResult (because it's not needed there), the whole idea
> of there being symmetry between those two structs crumbles down.
> Because we later stop using struct-assign anyway, meaning we no longer
> need the structs to match, we can instead spell out the members in
> XLogCtl and call it a day.

Hm, I have no objection to having separate variables in XLogCtl.

> So what I do in the attached 0001 is stop using the XLogwrtResult struct
> in XLogCtl and replace it with separate Write and Flush values, and add
> the macro XLogUpdateLocalLogwrtResult() that copies the values of Write
> and Flush from the shared XLogCtl to the local variable given as macro
> argument.  (I also added our idiomatic do {} while(0) to the macro
> definition, for safety).  The new members are XLogCtl->logWriteResult
> and XLogCtl->logFlushResult and as of 0001 are just XLogRecPtr, so
> essentially identical semantics as the previous code.  No atomic access
> yet!

+1.
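For readers following along, the macro's shape can be sketched outside the
server like this (illustrative names only, not the actual patch; the
do { } while (0) wrapper is the idiom Alvaro mentions):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Local mirror of the shared write/flush positions (XLogwrtResult-like). */
typedef struct log_positions
{
	uint64_t	write;
	uint64_t	flush;
} log_positions;

/* Shared state, spelled out as separate variables as 0001 does. */
static _Atomic uint64_t shared_write;
static _Atomic uint64_t shared_flush;

/*
 * Copy the shared values into the local variable given as argument.  The
 * do { } while (0) wrapper makes the macro expand to a single statement,
 * so it is safe to use in unbraced if/else branches.
 */
#define REFRESH_LOCAL_RESULT(dst) \
	do { \
		(dst).write = atomic_load(&shared_write); \
		(dst).flush = atomic_load(&shared_flush); \
	} while (0)
```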

> 0002 then adds pg_atomic_monotonic_advance_u64.  (I don't add the _u32
> variant, because I don't think it's a great idea to add dead code.  If
> later we see a need for it we can put it in.)  It also changes the two
> new members to be atomics, changes the macro to use atomic read, and
> XLogWrite now uses monotonic increment.  A couple of other places can
> move the macro calls to occur outside the spinlock.  Also, XLogWrite
> gains the invariant checking that involves Write and Flush.

I'm fine with not having the 32 bit variant of
pg_atomic_monotonic_advance. However, a recent commit bd5132db5 added
both 32 and 64 bit versions of pg_atomic_read_membarrier even though
32 bit isn't being used.
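The semantics of such a primitive can be sketched with C11 atomics (a
standalone illustration, not the actual pg_atomic_monotonic_advance_u64
implementation):

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Advance *ptr to target unless it is already at target or beyond, and
 * return the value the variable holds afterwards.  Concurrent callers may
 * race, but the value never moves backward -- the property XLogWrite needs
 * for the Write/Flush pointers.
 */
static uint64_t
monotonic_advance_u64(_Atomic uint64_t *ptr, uint64_t target)
{
	uint64_t	cur = atomic_load(ptr);

	while (cur < target)
	{
		/* On failure, cur is reloaded with the value currently stored. */
		if (atomic_compare_exchange_weak(ptr, &cur, target))
			return target;
	}
	return cur;					/* already advanced past target by someone else */
}
```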

> Finally, 0003 adds the Copy pointer to XLogCtl alongside Write and
> Flush, and updates WALReadFromBuffers to test that instead of the Write
> pointer, and adds in XLogWrite the invariant checks that involve the
> Copy pointer.

+1.

The attached patches look good to me.

Also, I'm fine with renaming XLogUpdateLocalLogwrtResult() to
RefreshXLogWriteResult().

> I haven't rerun Bharath test loop yet; will do so shortly.

I ran it a 100 times [1] on top of all the 3 patches, it looks fine.

[1] for i in {1..100}; do make check
PROVE_TESTS="t/027_stream_regress.pl"; if [ $? -ne 0 ]; then echo "The
command failed on iteration $i"; break; fi; done

./configure --prefix=$PWD/pg17/ --enable-debug --enable-tap-tests
--enable-cassert CC=/usr/bin/clang-14 > install.log && make -j 8
install > install.log 2>&1 &

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Detoasting optionally to make Explain-Analyze less misleading

2024-04-03 Thread Matthias van de Meent
On Tue, 2 Apr 2024 at 17:47, Tom Lane  wrote:
>
> Matthias van de Meent  writes:
> > Attached is v9, which is rebased on master to handle 24eebc65's
> > changed signature of pq_sendcountedtext.
> > It now also includes autocompletion, and a second patch which adds
> > documentation to give users insights into this new addition to
> > EXPLAIN.
>
> I took a quick look through this.  Some comments in no particular
> order:

Thanks!

> Documentation is not optional, so I don't really see the point of
> splitting this into two patches.

I've seen the inverse several times, but I've merged them in the
attached version 10.

> IIUC, it's not possible to use the SERIALIZE option when explaining
> CREATE TABLE AS, because you can't install the instrumentation tuple
> receiver when the IntoRel one is needed.  I think that's fine because
> no serialization would happen in that case anyway, but should we
> throw an error if that combination is requested?  Blindly reporting
> that zero bytes were serialized seems like it'd confuse people.

I think it's easily explained as no rows were transferred to the
client. If there is actual confusion, we can document it, but
confusing disk with network is not a case I'd protect against. See
also: EXPLAIN (ANALYZE, SERIALIZE) INSERT without the RETURNING
clause.

> I'd lose the stuff about measuring memory consumption.  Nobody asked
> for that and the total is completely misleading, because in reality
> we'll reclaim the memory used after each row.  It would allow cutting
> the text-mode output down to one line, too, instead of having your
> own format that's not like anything else.

Done.

> I thought the upthread agreement was to display the amount of
> data sent rounded to kilobytes, so why is the code displaying
> an exact byte count?

Probably it was because the other explain code I referenced was using
bytes in the json/yaml format. Fixed.

> I don't especially care for magic numbers like these:
>
> +   /* see printtup.h why we add 18 bytes here. These are the 
> infos
> +* needed for each attribute plus the attribute's name */
> +   receiver->metrics.bytesSent += (int64) namelen + 1 + 18;
>
> If the protocol is ever changed in a way that invalidates this,
> there's about 0 chance that somebody would remember to touch
> this code.
> However, isn't the bottom half of serializeAnalyzeStartup doing
> exactly what the comment above it says we don't do, that is accounting
> for the RowDescription message?  Frankly I agree with the comment that
> it's not worth troubling over, so I'd just drop that code rather than
> finding a solution for the magic-number problem.

In the comment above I intended to explain that it takes negligible
time to serialize the RowDescription message (when compared to all
other tasks of explain), so skipping the actual writing of the message
would be fine.
I'm not sure I agree with not including the size of RowDescription
itself though, as wide results can have a very large RowDescription
overhead; up to several times the returned data in cases where few
rows are returned.

Either way, I've removed that part of the code.

> Don't bother with duplicating valgrind-related logic in
> serializeAnalyzeReceive --- that's irrelevant to actual users.

Removed. I've instead added buffer usage, as I realised that wasn't
covered yet, and quite important to detect excessive detoasting (it's
not covered at the top-level scan).

> This seems like cowboy coding:
>
> +   self->destRecevier.mydest = DestNone;
>
> You should define a new value of the CommandDest enum and
> integrate this receiver type into the support functions
> in dest.c.

Done.

> BTW, "destRecevier" is misspelled...

Thanks, fixed.


Kind regards,

Matthias van de Meent.


v10-0001-Explain-Add-SERIALIZE-option.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Alvaro Herrera
Hello Alexander,

On 2024-Apr-03, Alexander Korotkov wrote:

> Regarding the shmem data structure for LSN waiters.  I didn't pick
> LWLock or ConditionVariable, because I needed the ability to wake up
> only those waiters whose LSN is already replayed.  In my experience
> waking up a process is way slower than scanning a short flat array.

I agree, but I think that's unrelated to what I was saying, which is
just the patch I attach here.

> However, I agree that when the number of waiters is very high and flat
> array may become a problem.  It seems that the pairing heap is not
> hard to use for shmem structures.  The only memory allocation call in
> paritingheap.c is in pairingheap_allocate().  So, it's only needed to
> be able to initialize the pairing heap in-place, and it will be fine
> for shmem.

Ok.
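To make the in-place idea concrete, the allocate/initialize split could look
like this toy sketch (made-up toy_heap names, not pairingheap's actual API):

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef int (*toy_cmp_fn) (int a, int b);

/* Toy analogue of a heap's control struct. */
typedef struct toy_heap
{
	toy_cmp_fn	cmp;
	void	   *root;
} toy_heap;

static int
toy_intcmp(int a, int b)
{
	return (a > b) - (a < b);
}

/*
 * In-place variant: performs no allocation, so 'mem' may point into a
 * shared-memory segment sized with sizeof(toy_heap).
 */
static toy_heap *
toy_heap_initialize(void *mem, toy_cmp_fn cmp)
{
	toy_heap   *h = mem;

	memset(h, 0, sizeof(*h));
	h->cmp = cmp;
	return h;
}

/*
 * Allocating variant layered on top -- the shape an allocate/initialize
 * split of pairingheap.c could take, keeping malloc out of the shmem path.
 */
static toy_heap *
toy_heap_allocate(toy_cmp_fn cmp)
{
	return toy_heap_initialize(malloc(sizeof(toy_heap)), cmp);
}
```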

With the code as it stands today, everything in WaitLSNState apart from
the pairing heap is accessed without any locking.  I think this is at
least partly OK because each backend only accesses its own entry; but it
deserves a comment.  Or maybe something more, because WaitLSNSetLatches
does modify the entry for other backends.  (Admittedly, this could only
happen for backends that are already sleeping, and it only happens
with the lock acquired, so it's probably okay.  But clearly it deserves
a comment.)

Don't we need to WaitLSNCleanup() during error recovery or something?

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"(Andrew Morton)
>From 4079dc6a6a6893055b32dee75f178b324bbaef77 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 3 Apr 2024 18:35:15 +0200
Subject: [PATCH] Use an LWLock instead of a spinlock in waitlsn.c

---
 src/backend/commands/waitlsn.c  | 15 +++
 src/backend/utils/activity/wait_event_names.txt |  1 +
 src/include/commands/waitlsn.h  |  5 +
 src/include/storage/lwlocklist.h|  1 +
 4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index 51a34d422e..a57b818a2d 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -58,7 +58,6 @@ WaitLSNShmemInit(void)
 							   &found);
 	if (!found)
 	{
-		SpinLockInit(&waitLSN->waitersHeapMutex);
 		pg_atomic_init_u64(&waitLSN->minWaitedLSN, PG_UINT64_MAX);
 		pairingheap_initialize(&waitLSN->waitersHeap, lsn_cmp, NULL);
 		memset(&waitLSN->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
@@ -115,13 +114,13 @@ addLSNWaiter(XLogRecPtr lsn)
 	procInfo->procnum = MyProcNumber;
 	procInfo->waitLSN = lsn;
 
-	SpinLockAcquire(&waitLSN->waitersHeapMutex);
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
 	pairingheap_add(&waitLSN->waitersHeap, &procInfo->phNode);
 	procInfo->inHeap = true;
 	updateMinWaitedLSN();
 
-	SpinLockRelease(&waitLSN->waitersHeapMutex);
+	LWLockRelease(WaitLSNLock);
 }
 
 /*
@@ -132,11 +131,11 @@ deleteLSNWaiter(void)
 {
 	WaitLSNProcInfo *procInfo = &waitLSN->procInfos[MyProcNumber];
 
-	SpinLockAcquire(&waitLSN->waitersHeapMutex);
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
 	if (!procInfo->inHeap)
 	{
-		SpinLockRelease(&waitLSN->waitersHeapMutex);
+		LWLockRelease(WaitLSNLock);
 		return;
 	}
 
@@ -144,7 +143,7 @@ deleteLSNWaiter(void)
 	procInfo->inHeap = false;
 	updateMinWaitedLSN();
 
-	SpinLockRelease(&waitLSN->waitersHeapMutex);
+	LWLockRelease(WaitLSNLock);
 }
 
 /*
@@ -160,7 +159,7 @@ WaitLSNSetLatches(XLogRecPtr currentLSN)
 
 	wakeUpProcNums = palloc(sizeof(int) * MaxBackends);
 
-	SpinLockAcquire(&waitLSN->waitersHeapMutex);
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
 	/*
 	 * Iterate the pairing heap of waiting processes till we find LSN not yet
@@ -182,7 +181,7 @@ WaitLSNSetLatches(XLogRecPtr currentLSN)
 
 	updateMinWaitedLSN();
 
-	SpinLockRelease(&waitLSN->waitersHeapMutex);
+	LWLockRelease(WaitLSNLock);
 
 	/*
 	 * Set latches for processes, whose waited LSNs are already replayed. This
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0d288d6b3d..ec43a78d60 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -330,6 +330,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared pg_serial state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
index 0d80248682..b3d9eed64d 100644
--- a/src/include/commands/waitlsn.h
+++ b/src/include/commands/waitlsn.h
@@ -55,13 +55,10 @@ typedef struct WaitLSNState
 
 	/*
 	 * A pairing heap of waiting processes order by LSN values (least LSN is
-	 * on top).
+	 * on top).  Protected by WaitLSNLock.
 	 */
 	pairingheap 

Re: Is it safe to cache data by GiST consistent function

2024-04-03 Thread Michał Kłeczek
Thanks for taking the time to answer. I'm not sure I understand, though.

> On 3 Apr 2024, at 16:27, Tom Lane  wrote:
> 
> Michał Kłeczek  writes:
>> When implementing a GiST consistent function I found the need to cache 
>> pre-processed query across invocations.
>> I am not sure if it is safe to do (or I need to perform some steps to make 
>> sure cached info is not leaked between rescans).
> 
> AFAIK it works.  I don't see any of the in-core ones doing so,
> but at least range_gist_consistent and multirange_gist_consistent
> are missing a bet by repeating their cache search every time.

pg_trgm consistent caches trigrams but it has some logic to make sure cached 
values are recalculated:

cache = (gtrgm_consistent_cache *) fcinfo->flinfo->fn_extra;
if (cache == NULL ||
cache->strategy != strategy ||
VARSIZE(cache->query) != querysize ||
memcmp((char *) cache->query, (char *) query, querysize) != 0)

What I don’t understand is if it is necessary or it is enough to check 
fn_extra==NULL.

> 
>> The comment in gistrescan says:
> 
>>  /*
>>   * If this isn't the first time through, preserve the fn_extra
>>   * pointers, so that if the consistentFns are using them to 
>> cache
>>   * data, that data is not leaked across a rescan.
>>   */
> 
>> which seems to me self-contradictory as fn_extra is preserved between 
>> rescans (so leaks are indeed possible).
> 
> I think you're reading it wrong.  If we cleared fn_extra during
> rescan, access to the old extra value would be lost so a new one
> would have to be created, leaking the old value for the rest of
> the query.

I understand that, but I'm not sure what “that data is not leaked across a
rescan” means.

—
Michal



Re: RFC: Additional Directory for Extensions

2024-04-03 Thread Christoph Berg
Re: David E. Wheeler
> > Yes, I like the suggestion to make it require a restart, which lets the 
> > sysadmin control it and not limited to whatever the person who compiled it 
> > thought would make sense.
> 
> Here’s a revision of the Debian patch that requires a server start.

Thanks for bringing this up, I should have submitted this years ago.
(The patch is originally from September 2020.)

I designed the patch to require a superuser to set it, so it doesn't
matter very much by which mechanism it gets updated. There should be
little reason to vary it at run-time, so I'd be fine with requiring a
restart, but otoh, why restrict the superuser from reloading it if
they know what they are doing?

> However, in studying the patch, it appears that the `extension_directory` is 
> searched for *all* shared libraries, not just those being loaded for an 
> extension. Am I reading the `expand_dynamic_library_name()` function right?
> 
> If so, this seems like a good way for a bad actor to muck with things, by 
> putting an exploited libpgtypes library into the extension directory, where 
> it would be loaded in preference to the core libpgtypes library, if they 
> couldn’t exploit the original.
> 
> I’m thinking it would be better to have the dynamic library lookup for 
> extension libraries (and LOAD libraries?) separate, so that the 
> `extension_directory` would not be used for core libraries.

I'm not sure the concept of "core libraries" exists. PG happens to
dlopen things at run time, and it doesn't know/care if they were
installed by users or by the original PG server. Also, an exploited
libpgtypes library is not worse than any other untrusted "user"
library, so you really don't want to allow users to provide their own
.so files, no matter by what mechanism.

> This would also allow the lookup of extension libraries prefixed by the 
> directory field from the control file, which would enable much tidier 
> extension installation: The control file, SQL scripts, and DSOs could all be 
> in a single directory for an extension.

Nice idea, but that would mean installing .so files into PGSHAREDIR.
Perhaps the whole extension stuff would have to move to PKGLIBDIR
instead.


Fwiw, I wrote this patch to solve the problem of testing extensions at
build-time where the build process does not have write access to
PGSHAREDIR. It solves that problem quite well, almost all PG
extensions have build-time test coverage now (where there was
basically 0 before).

Security is not a concern at this point as everything is running as
the same user, and the test cluster will be wiped right after the
test. I figured marking the setting as "super user" only was enough
security at that point, but I would recommend another audit before
using it together with "trusted" extensions and other things in
production.

Christoph




Re: Combine Prune and Freeze records emitted by vacuum

2024-04-03 Thread Heikki Linnakangas

On 03/04/2024 17:18, Melanie Plageman wrote:

I noticed you didn't make the comment updates I suggested in my
version 13 here [1]. A few of them are outdated references to
heap_page_prune() and one to a now deleted variable name
(all_visible_except_removable).

I applied them to your v13 and attached the diff.


Applied those changes, and committed. Thank you!


Off-list, Melanie reported that there is a small regression with the
benchmark script she posted yesterday, after all, but I'm not able to
reproduce that.


Actually, I think it was noise.


Ok, phew.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: [PATCH] Modify pg_ctl to detect presence of geek user

2024-04-03 Thread Andrey M. Borodin



> On 3 Apr 2024, at 18:17, Panda Developpeur  
> wrote:
> 
> Thank you for considering my contribution.

Looks interesting!

+   usleep(500000);

Don't we need to make the system 500 ms faster instead? Let's change it to

+   usleep(-500000);

Thanks!


Best regards, Andrey Borodin.



Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Bertrand Drouvot
Hi,

On Wed, Apr 03, 2024 at 08:28:04PM +0530, Bharath Rupireddy wrote:
> On Wed, Apr 3, 2024 at 6:46 PM Bertrand Drouvot
>  wrote:
> >
> > Just one comment on v32-0001:
> >
> > +# Synced slot on the standby must get its own inactive_since.
> > +is( $standby1->safe_psql(
> > +   'postgres',
> > +   "SELECT '$inactive_since_on_primary'::timestamptz <= 
> > '$inactive_since_on_standby'::timestamptz AND
> > +   '$inactive_since_on_standby'::timestamptz <= 
> > '$slot_sync_time'::timestamptz;"
> > +   ),
> > +   "t",
> > +   'synchronized slot has got its own inactive_since');
> > +
> >
> > By using <= we are not testing that it must get its own inactive_since (as 
> > we
> > allow them to be equal in the test). I think we should just add some 
> > usleep()
> > where appropriate and deny equality during the tests on inactive_since.
> 
> > Except for the above, v32-0001 LGTM.
> 
> Thanks. Please see the attached v33-0001 patch after removing equality
> on inactive_since TAP tests.

Thanks! v33-0001 LGTM.

> On Wed, Apr 3, 2024 at 1:47 PM Bertrand Drouvot
>  wrote:
> > Some comments regarding v31-0002:
> >
> > T2 ===
> >
> > In case the slot is invalidated on the primary,
> >
> > primary:
> >
> > postgres=# select slot_name, inactive_since, invalidation_reason from 
> > pg_replication_slots where slot_name = 's1';
> >  slot_name |inactive_since | invalidation_reason
> > ---+---+-
> >  s1| 2024-04-03 06:56:28.075637+00 | inactive_timeout
> >
> > then on the standby we get:
> >
> > standby:
> >
> > postgres=# select slot_name, inactive_since, invalidation_reason from 
> > pg_replication_slots where slot_name = 's1';
> >  slot_name |inactive_since| invalidation_reason
> > ---+--+-
> >  s1| 2024-04-03 07:06:43.37486+00 | inactive_timeout
> >
> > shouldn't the slot be dropped/recreated instead of updating inactive_since?
> 
> The sync slots that are invalidated on the primary aren't dropped and
> recreated on the standby.

Yeah, right (I was confused with synced slots that are invalidated locally).

> However, I
> found that the synced slot is acquired and released unnecessarily
> after the invalidation_reason is synced from the primary. I added a
> skip check in synchronize_one_slot to skip acquiring and releasing the
> slot if it's locally found inactive. With this, inactive_since won't
> get updated for invalidated sync slots on the standby as we don't
> acquire and release the slot.

CR1 ===

Yeah, I can see:

@@ -575,6 +575,13 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 							   " name slot \"%s\" already exists on the standby",
 							   remote_slot->name));

+	/*
+	 * Skip the sync if the local slot is already invalidated. We do this
+	 * beforehand to save on slot acquire and release.
+	 */
+	if (slot->data.invalidated != RS_INVAL_NONE)
+		return false;

Thanks to the drop_local_obsolete_slots() call I think we are not missing the 
case
where the slot has been invalidated on the primary, invalidation reason has been
synced on the standby and later the slot is dropped/recreated manually on the
primary (then it should be dropped/recreated on the standby too).

Also it seems we are not missing the case where a sync slot is invalidated
locally due to wal removal (it should be dropped/recreated).

> 
> > CR5 ===
> >
> > +   /*
> > +* This function isn't expected to be called for inactive timeout 
> > based
> > +* invalidation. A separate function 
> > InvalidateInactiveReplicationSlot is
> > +* to be used for that.
> >
> > Do you think it's worth to explain why?
> 
> Hm, I just wanted to point out the actual function here. I modified it
> to something like the following, if others feel we don't need that, I
> can remove it.

Sorry if I was not clear, but I meant to say "Do you think it's worth to explain
why we decided to create a dedicated function"? (currently we "just" explain 
that
we created one).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-04-03 Thread Alvaro Herrera
On 2024-Apr-03, Dilip Kumar wrote:

> Yeah, we missed acquiring the bank lock w.r.t. intervening pages,
> thanks for reporting.  Your fix looks correct to me.

Thanks for the quick review!  And thanks to Alexander for the report.
Pushed the fix.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"




Re: [PATCH] Modify pg_ctl to detect presence of geek user

2024-04-03 Thread Panda Developpeur
Yeah, sorry for the delay; it took me some time to understand how to build,
modify and test the modification.


Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Tom Lane
Jacob Champion  writes:
> As far as I can tell, no versions of LibreSSL so far provide
> X509_get_signature_info(), so this patch is probably a bit too
> aggressive.

Another problem with cutting support is how many buildfarm members
will we lose.  I scraped recent configure logs and got the attached
results.  I count 3 machines running 1.0.1, 18 running some flavor
of 1.0.2, and 7 running various LibreSSL versions.  We could
probably retire or update the 1.0.1 installations, but the rest
would represent a heavier lift.  Notably, it seems that what macOS
is shipping is LibreSSL.

regards, tom lane

 sysname       |                              l
---------------+--------------------------------------------------------------
 alabio| configure: using openssl: OpenSSL 1.1.1w  11 Sep 2023
 alimoche  | configure: using openssl: OpenSSL 1.1.1n  15 Mar 2022
 arowana   | configure: using openssl: OpenSSL 1.0.2k-fips  26 Jan 2017
 avocet| configure: using openssl: OpenSSL 1.1.1l-fips  24 Aug 2021 
SUSE release 150400.7.60.2
 ayu   | configure: using openssl: OpenSSL 1.1.0l  10 Sep 2019
 babbler   | configure: using openssl: OpenSSL 1.1.1k  FIPS 25 Mar 2021
 basilisk  | configure: using openssl: OpenSSL 3.1.4 24 Oct 2023 (Library: 
OpenSSL 3.1.4 24 Oct 2023)
 batfish   | configure: using openssl: OpenSSL 1.0.2g  1 Mar 2016
 batta | configure: using openssl: OpenSSL 1.1.1w  11 Sep 2023
 blackneck | configure: using openssl: OpenSSL 1.1.1n  15 Mar 2022
 boa   | configure: using openssl: OpenSSL 1.0.2k-fips  26 Jan 2017
 boomslang | configure: using openssl: OpenSSL 1.1.1k  25 Mar 2021
 broadbill | configure: using openssl: OpenSSL 1.1.1k  FIPS 25 Mar 2021
 bulbul| configure: using openssl: OpenSSL 1.1.1k  FIPS 25 Mar 2021
 buri  | configure: using openssl: OpenSSL 1.0.2k-fips  26 Jan 2017
 bushmaster| configure: using openssl: OpenSSL 3.1.5 30 Jan 2024 (Library: 
OpenSSL 3.1.5 30 Jan 2024)
 butterflyfish | configure: using openssl: OpenSSL 1.0.2p-fips  14 Aug 2018
 caiman| configure: using openssl: OpenSSL 3.2.1 30 Jan 2024 (Library: 
OpenSSL 3.2.1 30 Jan 2024)
 canebrake | configure: using openssl: OpenSSL 3.1.5 30 Jan 2024 (Library: 
OpenSSL 3.1.5 30 Jan 2024)
 cascabel  | configure: using openssl: OpenSSL 1.1.1n  15 Mar 2022 
(Library: OpenSSL 1.1.1w  11 Sep 2023)
 cavefish  | configure: using openssl: OpenSSL 1.1.1  11 Sep 2018
 chevrotain| configure: using openssl: OpenSSL 1.1.1n  15 Mar 2022 
(Library: OpenSSL 1.1.1w  11 Sep 2023)
 chimaera  | configure: using openssl: OpenSSL 1.1.0l  10 Sep 2019
 chipmunk  | configure: using openssl: OpenSSL 1.0.1t  3 May 2016
 cisticola | configure: using openssl: OpenSSL 1.1.1g FIPS  21 Apr 2020
 clam  | configure: using openssl: OpenSSL 1.0.2k-fips  26 Jan 2017
 conchuela | configure: using openssl: LibreSSL 3.2.5
 copperhead| configure: using openssl: OpenSSL 1.1.1k  25 Mar 2021
 culpeo| configure: using openssl: OpenSSL 3.0.11 19 Sep 2023 (Library: 
OpenSSL 3.0.11 19 Sep 2023)
 cuon  | configure: using openssl: OpenSSL 1.0.2g  1 Mar 2016
 demoiselle| configure: using openssl: OpenSSL 1.1.0h-fips  27 Mar 2018
 desman| configure: using openssl: OpenSSL 3.0.9 30 May 2023 (Library: 
OpenSSL 3.0.9 30 May 2023)
 dhole | configure: using openssl: OpenSSL 1.0.2k-fips  26 Jan 2017
 dikkop| configure: using openssl: OpenSSL 3.0.10 1 Aug 2023 (Library: 
OpenSSL 3.0.10 1 Aug 2023)
 elasmobranch  | configure: using openssl: OpenSSL 1.1.0h-fips  27 Mar 2018
 gokiburi  | configure: using openssl: OpenSSL 1.1.1w  11 Sep 2023
 grison| configure: using openssl: OpenSSL 1.1.0l  10 Sep 2019
 grison| configure: using openssl: OpenSSL 1.1.1n  15 Mar 2022
 guaibasaurus  | configure: using openssl: OpenSSL 1.1.1w  11 Sep 2023
 gull  | configure: using openssl: OpenSSL 1.1.1w  11 Sep 2023
 habu  | configure: using openssl: OpenSSL 3.0.9 30 May 2023 (Library: 
OpenSSL 3.0.9 30 May 2023)
 hachi | configure: using openssl: OpenSSL 1.1.1w  11 Sep 2023
 hake  | configure: using openssl: OpenSSL 1.0.2u  20 Dec 2019
 hippopotamus  | configure: using openssl: OpenSSL 1.1.1l-fips  24 Aug 2021 
SUSE release 150400.7.60.2
 indri | configure: using openssl: OpenSSL 3.2.0 23 Nov 2023 (Library: 
OpenSSL 3.2.0 23 Nov 2023)
 jackdaw   | configure: using openssl: OpenSSL 1.1.1n  15 Mar 2022 
(Library: OpenSSL 1.1.1w  11 Sep 2023)
 jay   | configure: using openssl: OpenSSL 1.1.1l-fips  24 Aug 2021 
SUSE release 150400.7.60.2
 kingsnake | configure: using openssl: OpenSSL 3.0.9 30 May 2023 (Library: 
OpenSSL 3.0.9 30 May 2023)
 krait | configure: using openssl: OpenSSL 1.1.1k  FIPS 25 Mar 2021
 lancehead | configure: using 
