Re: Emitting JSON to file using COPY TO

2024-01-15 Thread jian he
On Tue, Jan 16, 2024 at 11:46 AM jian he  wrote:
>
>
> I think the reason is maybe related to the function copy_dest_startup.
I was wrong about this sentence.

In the function CopyOneRowTo, I changed the else branch of
`if (!cstate->opts.json_mode)` to the following:

else
{
    Datum       rowdata;
    StringInfo  result;

    if (slot->tts_tupleDescriptor->natts == 1)
    {
        /* Flat-copy the attribute array */
        memcpy(TupleDescAttr(slot->tts_tupleDescriptor, 0),
               TupleDescAttr(cstate->queryDesc->tupDesc, 0),
               1 * sizeof(FormData_pg_attribute));
    }
    BlessTupleDesc(slot->tts_tupleDescriptor);
    rowdata = ExecFetchSlotHeapTupleDatum(slot);
    result = makeStringInfo();
    composite_to_json(rowdata, result, false);

    if (json_row_delim_needed &&
        cstate->opts.force_array)
    {
        CopySendChar(cstate, ',');
    }
    else if (cstate->opts.force_array)
    {
        /* first row needs no delimiter */
        CopySendChar(cstate, ' ');
        json_row_delim_needed = true;
    }
    CopySendData(cstate, result->data, result->len);
}

All the cases work now, but this is more of a hack,
because I cannot fully explain why it works.
---
demo


drop function if exists execute_into_test cascade;
NOTICE:  function execute_into_test() does not exist, skipping
DROP FUNCTION
drop type if exists execute_into_test cascade;
NOTICE:  type "execute_into_test" does not exist, skipping
DROP TYPE
create type eitype as (i integer, y integer);
CREATE TYPE
create or replace function execute_into_test() returns eitype as $$
declare
_v eitype;
begin
execute 'select 1,2' into _v;
return _v;
end; $$ language plpgsql;
CREATE FUNCTION

COPY (SELECT 1 from generate_series(1,1) g) TO stdout WITH (format json);
{"?column?":1}
COPY (SELECT g from generate_series(1,1) g) TO stdout WITH (format json);
{"g":1}
COPY (SELECT g,1 from generate_series(1,1) g) TO stdout WITH (format json);
{"g":1,"?column?":1}
COPY (select * from execute_into_test()) TO stdout WITH (format json);
{"i":1,"y":2}
COPY (select * from execute_into_test() sub) TO stdout WITH (format json);
{"i":1,"y":2}
COPY (select sub from execute_into_test() sub) TO stdout WITH (format json);
{"sub":{"i":1,"y":2}}
COPY (select sub.i from execute_into_test() sub) TO stdout WITH (format json);
{"i":1}
COPY (select sub.y from execute_into_test() sub) TO stdout WITH (format json);
{"y":2}
COPY (VALUES (1), (2)) TO stdout WITH (format json);
{"column1":1}
{"column1":2}
 COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);
{"?column?":1}
{"?column?":2}




Introduce a new API for TableAmRoutine

2024-01-15 Thread Japin Li


Hi, hackers

Recently, I have been trying to implement a new TAM for PostgreSQL, and I found
there is no API for handling a table's options.  For example:

CREATE TABLE t (...) USING new_am WITH (...);

Is it possible to add a new API to handle table options in TableAmRoutine?
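
For illustration, here is a minimal sketch of what such a callback could
look like, modeled on the index AM's amoptions hook and the
build_reloptions() helper.  The callback role, the options struct, and
the reloption kind below are all hypothetical:

#include "postgres.h"
#include "access/reloptions.h"

/* hypothetical per-table options struct for the new AM */
typedef struct NewAmOptions
{
    int32       vl_len_;            /* varlena header */
    int         compression_level;  /* example custom option */
} NewAmOptions;

/* registered at load time via add_reloption_kind() */
static relopt_kind new_am_relopt_kind;

/* hypothetical TableAmRoutine callback: parse and validate WITH (...) */
static bytea *
new_am_options(Datum reloptions, bool validate)
{
    static const relopt_parse_elt tab[] = {
        {"compression_level", RELOPT_TYPE_INT,
         offsetof(NewAmOptions, compression_level)},
    };

    return (bytea *) build_reloptions(reloptions, validate,
                                      new_am_relopt_kind,
                                      sizeof(NewAmOptions), tab,
                                      lengthof(tab));
}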

--
Regards,
Japin Li.




Re: Synchronizing slots from primary to standby

2024-01-15 Thread Masahiko Sawada
On Tue, Jan 16, 2024 at 1:07 PM Amit Kapila  wrote:
>
> On Tue, Jan 16, 2024 at 9:03 AM shveta malik  wrote:
> >
> > On Sat, Jan 13, 2024 at 12:54 PM Amit Kapila  
> > wrote:
> > >
> > > On Fri, Jan 12, 2024 at 5:50 PM shveta malik  
> > > wrote:
> > > >
> > > > There are multiple approaches discussed and tried when it comes to
> > > > starting a slot-sync worker. I am summarizing all here:
> > > >
> > > >  1) Make slotsync worker as an Auxiliary Process (like checkpointer,
> > > > walwriter, walreceiver etc). The benefit this approach provides is, it
> > > > can control begin and stop in a more flexible way as each auxiliary
> > > > process could have different checks before starting and can have
> > > > different stop conditions. But it needs code duplication for process
> > > > management(start, stop, crash handling, signals etc) and currently it
> > > > does not support db-connection smoothly (none of the auxiliary process
> > > > has one so far)
> > > >
> > >
> > > As slotsync worker needs to perform transactions and access syscache,
> > > we can't make it an auxiliary process as that doesn't initialize the
> > > required stuff like syscache. Also, see the comment "Auxiliary
> > > processes don't run transactions ..." in AuxiliaryProcessMain() which
> > > means this is not an option.
> > >
> > > >
> > > > 2) Make slotsync worker as a 'special' process like AutoVacLauncher
> > > > which is neither an Auxiliary process nor a bgworker one. It allows
> > > > db-connection and also provides flexibility to have start and stop
> > > > conditions for a process.
> > > >
> > >
> > > Yeah, due to these reasons, I think this option is worth considering
> > > and another plus point is that this allows us to make enable_syncslot
> > > a PGC_SIGHUP GUC rather than a PGC_POSTMASTER.
> > >
> > > >
> > > > 3) Make slotysnc worker a bgworker. Here we just need to register our
> > > > process as a bgworker (RegisterBackgroundWorker()) by providing a
> > > > relevant start_time and restart_time and then the process management
> > > > is well taken care of. It does not need any code-duplication and
> > > > allows db-connection smoothly in registered process. The only thing it
> > > > lacks is that it does not provide flexibility of having
> > > > start-condition which then makes us to have 'enable_syncslot' as
> > > > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I
> > > > feel enable_syncslot is something which will not be changed frequently
> > > > and with the benefits provided by bgworker infra, it seems a
> > > > reasonably good option to choose this approach.
> > > >
> > >
> > > I agree but it may be better to make it a PGC_SIGHUP parameter.
> > >
> > > > 4) Another option is to have Logical Replication Launcher(or a new
> > > > process) to launch slot-sync worker. But going by the current design
> > > > where we have only 1 slotsync worker, it may be an overhead to have an
> > > > additional manager process maintained.
> > > >
> > >
> > > I don't see any good reason to have an additional launcher process here.
> > >
> > > >
> > > > Thus weighing pros and cons of all these options, we have currently
> > > > implemented the bgworker approach (approach 3).  Any feedback is
> > > > welcome.
> > > >
> > >
> > > I vote to go for (2) unless we face difficulties in doing so but (3)
> > > is also okay especially if others also think so.
> >
> > I am not against any of the approaches but I still feel that when we
> > have a standard way of doing things (bgworker) we should not keep
> > adding code to do things in a special way unless there is a strong
> > reason to do so. Now we need to decide if 'enable_syncslot' being
> > PGC_POSTMASTER is a strong reason to go the non-standard way?
> >
>
> Agreed and as said earlier I think it is better to make it a
> PGC_SIGHUP. Also, not sure we can say it is a non-standard way as
> already autovacuum launcher is handled in the same way. One more minor
> thing is it will save us for having a new bgworker state
> BgWorkerStart_ConsistentState_HotStandby as introduced by this patch.

Why do we need to add a new BgWorkerStart_ConsistentState_HotStandby
for the slotsync worker? Isn't it sufficient that the slotsync worker
exits if not in hot standby mode?

Is there any technical difficulty or obstacle to making the slotsync
worker start as a bgworker after reloading the config file?
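
For reference, approach (3) boils down to a registration like the
following minimal sketch (the worker entry-point and name here are
hypothetical, and it uses the existing BgWorkerStart_ConsistentState
rather than the new state the patch adds):

    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
        BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_ConsistentState;
    worker.bgw_restart_time = 60;   /* seconds */
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "postgres");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "SlotSyncWorkerMain");
    snprintf(worker.bgw_name, BGW_MAXLEN, "slot sync worker");
    RegisterBackgroundWorker(&worker);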

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [HACKERS] Allow INSTEAD OF DELETE triggers to modify the tuple for RETURNING

2024-01-15 Thread jian he
On Tue, Jan 9, 2024 at 6:21 PM vignesh C  wrote:
>
> > doc seems to still have an issue.
> >
> > https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F45%2F4617
> >
> > In the regress test, do we need to clean up the created objects after we
> > use them?
> > Tests passed; looking at ExecIRInsertTriggers, your changes look sane.
>
> I have changed the status of the patch to "Waiting on Author" as the
> CFBot issue reported by jian he is not yet handled.


Hi,
it took me a while to figure out why the doc build fails.

Currently your wording is:
For INSERT, UPDATE, and DELETE operations, INSTEAD OF triggers can modify
the data returned by RETURNING. In the case of INSERT and UPDATE, triggers
can modify the NEW row before returning it, while for DELETE, triggers can
modify the OLD row before returning it. This feature is useful when the
returned data needs to be adjusted to match the view or other requirements.

The doc is:
For INSERT and UPDATE operations only, the trigger may modify the NEW row
before returning it. This will change the data returned by INSERT RETURNING
or UPDATE RETURNING, and is useful when the view will not show exactly the
same data that was provided.

To keep the change minimal compared to the existing doc, I think the
following makes sense:
For INSERT and UPDATE operations only, the trigger may modify the NEW row
before returning it.  For DELETE operations, the trigger may modify the OLD
row before returning it.
This will change the data returned by INSERT RETURNING, UPDATE RETURNING,
DELETE RETURNING and is useful when the view will not show exactly the same
data that was provided.
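
As a concrete illustration of that wording (the table, view, and
trigger names below are made up), an INSTEAD OF DELETE trigger could
modify the OLD row and thereby change what DELETE RETURNING reports:

create table base (id int, note text);
create view v as select id, note from base;

create function v_del() returns trigger as $$
begin
  delete from base where id = old.id;
  old.note := 'deleted: ' || old.note;  -- modify OLD before returning it
  return old;
end; $$ language plpgsql;

create trigger v_del_trig instead of delete on v
  for each row execute function v_del();

-- with the patch, RETURNING reflects the modified OLD row:
-- DELETE FROM v WHERE id = 1 RETURNING note;  -->  deleted: ...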

I am not sure the following changes in the function ExecIRDeleteTriggers are
right.
+ else if (newtuple != oldtuple)
+ {
+ ExecForceStoreHeapTuple(newtuple, slot, false);
+
+ if (should_free)
+ heap_freetuple(oldtuple);
+
+ /* signal tuple should be re-fetched if used */
+ newtuple = NULL;

In the ExecIRDeleteTriggers function, all we want is to return the slot, so
that the `if (processReturning && resultRelInfo->ri_projectReturning) {}`
block in nodeModifyTable.c can further process it and materialize it.

If newtuple != oldtuple, that means the INSTEAD OF DELETE trigger changed
the value. ExecForceStoreHeapTuple has already put the new values into the
slot, so shouldn't we just free newtuple, since we don't use it anymore?
Also, maybe we don't need the variable should_free: if (newtuple !=
oldtuple), then we should free both oldtuple and newtuple, because the
content is already in the slot.

Anyway, based on your patch, I modified it and also added a slightly more
complicated test.
From 5f41738b3c7dc7bf3d849539d16a568b52cedc55 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Mon, 15 Jan 2024 16:20:28 +0800
Subject: [PATCH v5 1/1] allow INSTEAD OF DELETE triggers to modify the OLD
 tuple and use that for RETURNING instead of returning the tuple in planSlot.

Currently the tuple returned by INSTEAD OF triggers on DELETEs is only
used to determine whether to pretend that the DELETE happened or not.
Now we can modify the OLD value returned by an INSTEAD OF DELETE trigger.
---
 doc/src/sgml/trigger.sgml  |  7 --
 src/backend/commands/trigger.c | 34 ++
 src/backend/executor/nodeModifyTable.c | 14 +--
 src/include/commands/trigger.h |  2 +-
 src/test/regress/expected/triggers.out | 27 
 src/test/regress/sql/triggers.sql  | 20 +++
 6 files changed, 89 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index a5390ff6..9813e323 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -261,9 +261,12 @@
 modifications in the view.  This will cause the count of the number
 of rows affected by the command to be incremented. For
 INSERT and UPDATE operations only, the trigger
-may modify the NEW row before returning it.  This will
+may modify the NEW row before returning it.
+For DELETE operations, the trigger
+may modify the OLD row before returning it.
+ This will
 change the data returned by
-INSERT RETURNING or UPDATE RETURNING,
+INSERT RETURNING, UPDATE RETURNING, DELETE RETURNING
 and is useful when the view will not show exactly the same data
 that was provided.

diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 0880ca51..d8e95286 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2815,10 +2815,11 @@ ExecARDeleteTriggers(EState *estate,
 
 bool
 ExecIRDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
-	 HeapTuple trigtuple)
+	 TupleTableSlot *slot)
 {
 	TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
-	TupleTableSlot *slot = ExecGetTriggerOldSlot(estate, relinfo);
+	HeapTuple	newtuple = NULL;
+	bool		should_free;
 	TriggerData LocTriggerData = {0};
 	int			i;
 
@@ -2828,12 +2829,10 @@ ExecIRDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
 		TRIGGER_EVENT_INSTEAD;
 	

RE: A failure in t/038_save_logical_slots_shutdown.pl

2024-01-15 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Bharath,

> This is a more strict check because it is possible that even if the
> latest confirmed_flush location is not persisted there is no
> meaningful decodable WAL between whatever the last confirmed_flush
> location saved on disk and the shutdown_checkpoint record.
> Kuroda-San/Vignesh, do you have any suggestion on this one?

I think it should be tested as an explicit test case. There are two reasons:
 
* e0b2eed is a commit for backend code, so it should be tested by src/test/*
  files. Each src/bin/XXX/*.pl file should test only its own executable.
* Assume the feature got broken. In that case 003_logical_slots.pl
  would fail, but we would have no way to recognize the cause on the build
  farm; 038_save_logical_slots_shutdown.pl helps to distinguish the case.
 
Based on that, I think it is OK to add advance_wal() and comments, like 
Bharath's patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: [PATCH] Exponential backoff for auth_delay

2024-01-15 Thread Abhijit Menon-Sen
At 2024-01-04 08:30:36 +0100, mba...@gmx.net wrote:
>
> +typedef struct AuthConnRecord
> +{
> + charremote_host[NI_MAXHOST];
> + boolused;
> + double  sleep_time; /* in milliseconds */
> +} AuthConnRecord;

Do we really need a "used" field here? You could avoid it by setting
remote_host[0] = '\0' in cleanup_conn_record.

>  static void
>  auth_delay_checks(Port *port, int status)
>  {
> + double  delay;

I would just initialise this to auth_delay_milliseconds here, instead of
the harder-to-notice "else" below.

> @@ -43,8 +69,150 @@ auth_delay_checks(Port *port, int status)
>*/
>   if (status != STATUS_OK)
>   {
> - pg_usleep(1000L * auth_delay_milliseconds);
> + if (auth_delay_exp_backoff)
> + {
> + /*
> +  * Exponential backoff per remote host.
> +  */
> + delay = record_failed_conn_auth(port);
> + if (auth_delay_max_seconds > 0)
> + delay = Min(delay, 1000L * 
> auth_delay_max_seconds);
> + }

I would make this comment more specific, maybe something like "Delay by
2^n seconds after each authentication failure from a particular host,
where n is the number of consecutive authentication failures".

It's slightly discordant to see the new delay being returned by a
function named "record_" (rather than "calculate_delay" or
similar). Maybe a name like "backoff_after_failed_auth" would be better?
Or "increase_delay_on_auth_fail"?

> +static double
> +record_failed_conn_auth(Port *port)
> +{
> + AuthConnRecord *acr = NULL;
> + int j = -1;
> +
> + acr = find_conn_record(port->remote_host, &j);
> +
> + if (!acr)
> + {
> + if (j == -1)
> +
> + /*
> +  * No free space, MAX_CONN_RECORDS reached. Wait as 
> long as the
> +  * largest delay for any remote host.
> +  */
> + return find_conn_max_delay();

In this extraordinary situation (where there are lots of hosts with lots
of authentication failures), why not delay by auth_delay_max_seconds
straightaway, especially when the default is only 10s? I don't see much
point in coordinating the delay between fifty known hosts and an unknown
number of others.

> + elog(DEBUG1, "new connection: %s, index: %d", acr->remote_host, 
> j);

I think this should be removed, but if you want to leave it in, the
message should be more specific about what this is actually about, and
where the message is from, so as to not confuse debug-log readers.

(The same applies to the other debug messages.)

> +static AuthConnRecord *
> +find_conn_record(char *remote_host, int *free_index)
> +{
> + int i;
> +
> + *free_index = -1;
> + for (i = 0; i < MAX_CONN_RECORDS; i++)
> + {
> + if (!acr_array[i].used)
> + {
> + if (*free_index == -1)
> + /* record unused element */
> + *free_index = i;
> + continue;
> + }
> + if (strcmp(acr_array[i].remote_host, remote_host) == 0)
> + return &acr_array[i];
> + }
> +
> + return NULL;
> +}

It's not a big deal, but to be honest, I would much prefer to (get rid
of used, as suggested earlier, and) have separate find_acr_for_host()
and find_free_acr() functions.

> +static void
> +record_conn_failure(AuthConnRecord *acr)
> +{
> + if (acr->sleep_time == 0)
> + acr->sleep_time = (double) auth_delay_milliseconds;
> + else
> + acr->sleep_time *= 2;
> +}

I would just roll this function into record_failed_conn_auth (or
whatever it's named), but if you want to keep it a separate function, it
should at least have a name that's sufficiently different from
record_failed_conn_auth.

In terms of the logic, it would have been slightly clearer to store the
number of failures and calculate the delay, but it's easier to double
the sleep time the way you've written it. I think it's fine.
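
For what it's worth, a sketch of that alternative, using a hypothetical
"failures" counter field in AuthConnRecord:

#include <math.h>

static void
record_conn_failure(AuthConnRecord *acr)
{
    acr->failures++;            /* hypothetical counter field */
}

static double
current_delay(AuthConnRecord *acr)
{
    /* base * 2^(n-1): same growth as doubling on each failure */
    return auth_delay_milliseconds * pow(2.0, acr->failures - 1);
}

Either form behaves the same; the counter just makes the growth rule
explicit.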

It's worth noting that there is no time-based reset of the delay with
this feature, which I think is something that people might expect to go
hand-in-hand with exponential backoff. I think that's probably OK too.

> +static void
> +auth_delay_shmem_startup(void)
> +{
> + Sizerequired;
> + boolfound;
> +
> + if (shmem_startup_next)
> + shmem_startup_next();
> +
> + required = sizeof(AuthConnRecord) * MAX_CONN_RECORDS;
> + acr_array = ShmemInitStruct("Array of AuthConnRecord", required,
> + &found);
> + if (found)
> + /* this should not happen ? */
> + elog(DEBUG1, "variable acr_array already exists");
> + /* all fields are set to 0 */
> + memset(acr_array, 0, 

Re: speed up a logical replica setup

2024-01-15 Thread Shubham Khanna
On Thu, Dec 21, 2023 at 11:47 AM Amit Kapila  wrote:
>
> On Wed, Dec 6, 2023 at 12:53 PM Euler Taveira  wrote:
> >
> > On Thu, Nov 9, 2023, at 8:12 PM, Michael Paquier wrote:
> >
> > On Thu, Nov 09, 2023 at 03:41:53PM +0100, Peter Eisentraut wrote:
> > > On 08.11.23 00:12, Michael Paquier wrote:
> > >> - Should the subdirectory pg_basebackup be renamed into something more
> > >> generic at this point?  All these things are frontend tools that deal
> > >> in some way with the replication protocol to do their work.  Say
> > >> a replication_tools?
> > >
> > > Seems like unnecessary churn.  Nobody has complained about any of the 
> > > other
> > > tools in there.
> >
> > Not sure.  We rename things across releases in the tree from time to
> > time, and here that's straight-forward.
> >
> >
> > Based on this discussion it seems we have a consensus that this tool should 
> > be
> > in the pg_basebackup directory. (If/when we agree with the directory 
> > renaming,
> > it could be done in a separate patch.) Besides this move, the v3 provides a 
> > dry
> > run mode. It basically executes every routine but skips those that would do
> > modifications. It is a useful option to check if you will be able to run it
> > without having issues with connectivity, permission, and existing objects
> > (replication slots, publications, subscriptions). Tests were slightly 
> > improved.
> > Messages were changed to *not* provide INFO messages by default and 
> > --verbose
> > provides INFO messages and --verbose --verbose also provides DEBUG 
> > messages. I
> > also refactored the connect_database() function so that the connection
> > will
> > always use the logical replication mode. A bug was fixed in the transient
> > replication slot name. Ashutosh review [1] was included. The code was also 
> > indented.
> >
> > There are a few suggestions from Ashutosh [2] that I will reply in another
> > email.
> >
> > I'm still planning to work on the following points:
> >
> > 1. improve the cleanup routine to point out leftover objects if there is any
> >connection issue.
> >
>
> I think this is an important part. Shall we try to write to some file
> the pending objects to be cleaned up? We do something like that during
> the upgrade.
>
> > 2. remove the physical replication slot if the standby is using one
> >(primary_slot_name).
> > 3. provide instructions to promote the logical replica into primary, I mean,
> >stop the replication between the nodes and remove the replication setup
> >(publications, subscriptions, replication slots). Or even include another
> >action to do it. We could add both too.
> >
> > Point 1 should be done. Points 2 and 3 aren't essential but will provide a 
> > nice
> > UI for users that would like to use it.
> >
>
> Isn't point 2 also essential because how would otherwise such a slot
> be advanced or removed?
>
> A few other points:
> ==
> 1. Previously, I asked whether we need an additional replication slot
> created to get a consistent LSN and I see the following comment in
> the patch:
>
> + *
> + * XXX we should probably use the last created replication slot to get a
> + * consistent LSN but it should be changed after adding pg_basebackup
> + * support.
>
> Yeah, sure, we may want to do that after backup support and we can
> keep a comment for the same but I feel as the patch stands today,
> there is no good reason to keep it. Also, is there a reason that we
> can't create the slots after backup is complete and before we write
> recovery parameters?
>
> 2.
> + appendPQExpBuffer(str,
> +   "CREATE SUBSCRIPTION %s CONNECTION '%s' PUBLICATION %s "
> +   "WITH (create_slot = false, copy_data = false, enabled = false)",
> +   dbinfo->subname, dbinfo->pubconninfo, dbinfo->pubname);
>
> Shouldn't we enable two_phase by default for newly created
> subscriptions? Is there a reason for not doing so?
>
> 3. How about sync slots on the physical standby if present? Do we want
> to retain those as it is or do we need to remove those? We are
> actively working on the patch [1] for the same.
>
> 4. Can we see some numbers with various sizes of databases (cluster)
> to see how it impacts the time for small to large-size databases as
> compared to the traditional method? This might help us with giving
> users advice on when to use this tool. We can do this bit later as
> well when the patch is closer to being ready for commit.

I have done the performance testing and attached the results comparing
the execution time of 'logical replication' and 'pg_subscriber' for
100MB, 1GB and 5GB of data:

                   | 100MB  | 1GB     | 5GB
Logical rep (2 w)  | 1.815s | 14.895s | 75.541s
Logical rep (4 w)  | 1.194s | 9.484s  | 46.938s
Logical rep (8 w)  | 0.828s | 6.422s  | 31.704s
Logical rep (10 w) | 0.646s | 3.843s  | 18.425s
pg_subscriber      | 3.977s | 9.988s  | 12.665s

Here, 'w' stands for 'workers'. I have included the tests to see the
test result variations with 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-15 Thread Masahiko Sawada
On Sun, Jan 14, 2024 at 10:43 PM John Naylor  wrote:
>
> On Fri, Jan 12, 2024 at 3:49 PM Masahiko Sawada  wrote:
> >
> > On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada  
> > wrote:
> > > So I agree to remove both max_bytes and num_items from the control
> > > object.Also, as you mentioned, we can remove the tidstore control
> > > object itself. TidStoreGetHandle() returns a radix tree handle, and we
> > > can pass it to TidStoreAttach().  I'll try it.
>
> Thanks. It's worth looking closely here.
>
> > I realized that if we remove the whole tidstore control object
> > including max_bytes, processes who attached the shared tidstore cannot
> > use TidStoreIsFull() actually as it always returns true.
>
> I imagine that we'd replace that with a function (maybe an earlier
> version had it?) to report the memory usage to the caller, which
> should know where to find max_bytes.
>
> > Also they
> > cannot use TidStoreReset() as well since it needs to pass max_bytes to
> > RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it
> > could be problematic for general use.
>
> HEAD has no problem finding the necessary values, and I don't think
> it'd be difficult to maintain that ability. I'm not actually sure what
> "general use" needs to have, and I'm not sure anyone can guess.
> There's the future possibility of parallel heap-scanning, but I'm
> guessing a *lot* more needs to happen for that to work, so I'm not
> sure how much it buys us to immediately start putting those two fields
> in a special abstraction. The only other concrete use case mentioned
> in this thread that I remember is bitmap heap scan, and I believe that
> would never need to reset, only free the whole thing when finished.
>
> I spent some more time studying parallel vacuum, and have some
> thoughts. In HEAD, we have
>
> -/*
> - * VacDeadItems stores TIDs whose index tuples are deleted by index 
> vacuuming.
> - */
> -typedef struct VacDeadItems
> -{
> - int max_items; /* # slots allocated in array */
> - int num_items; /* current # of entries */
> -
> - /* Sorted array of TIDs to delete from indexes */
> - ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
> -} VacDeadItems;
>
> ...which has the tids, plus two fields that function _very similarly_
> to the two extra fields in the tidstore control object. It's a bit
> strange to me that the patch doesn't have this struct anymore.
>
> I suspect if we keep it around (just change "items" to be the local
> tidstore struct), the patch would have a bit less churn and look/work
> more like the current code. I think it might be easier to read if the
> v17 commits are suited to the current needs of vacuum, rather than try
> to anticipate all uses. Richer abstractions can come later if needed.

Just changing "items" to be the local tidstore struct could make the
code a bit tricky, since max_bytes and num_items live in shared memory
while "items" is a local pointer to the shared tidstore. This is the
reason why I abstracted them behind TidStore. However, IIUC the current
parallel vacuum can fortunately work with such VacDeadItems fields. The
leader process can use a VacDeadItems allocated on DSM, and worker
processes can use a local VacDeadItems whose max_bytes and num_items
are copied from the shared one and whose "items" is a local pointer.
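
A sketch of what that adjusted struct could look like (the exact field
types and layout here are assumptions, not the v52 code):

typedef struct VacDeadItems
{
    size_t      max_bytes;      /* memory limit, copied for workers */
    int64       num_items;      /* current # of stored TIDs */

    /* local pointer to the (possibly shared) tidstore */
    TidStore   *items;
} VacDeadItems;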

Assuming parallel heap scan requires both the leader and workers to
update the shared VacDeadItems concurrently, we may need such richer
abstractions.

I've implemented this idea in the v52 patch set. Here is the summary
of the updates:

0008: Remove the control object from tidstore. Also removed some
unsupported functions such as TidStoreNumTids()
0009: Adjust lazy vacuum integration patch with the control object removal.

I've not updated any locking code yet. Once we confirm this direction,
I'll update the locking code too.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v52-ART.tar.gz
Description: GNU Zip compressed data


Re: Add test module for Table Access Method

2024-01-15 Thread Japin Li


On Tue, 16 Jan 2024 at 13:15, Bharath Rupireddy 
 wrote:
> On Tue, Jan 16, 2024 at 10:28 AM Michael Paquier  wrote:
>>
>> Hmm.  I'd rather have it do something useful in terms of test coverage
>> rather than being just an empty skull.
>>
>> How about adding the same kind of coverage as dummy_index_am with a
>> couple of reloptions then?  That can serve as a point of reference
>> when a table AM needs a few custom options.  A second idea would be to
>> show how to use toast relations when implementing your new AM, where a
>> toast table could be created even in cases where we did not want one
>> with heap, when it comes to size limitations with char and/or varchar,
>> and that makes for a simpler needs_toast_table callback.
>
> I think a test module for a table AM will really help developers. Just
> to add to the above list - how about the table AM implementing a
> simple in-memory (columnar if possible) database storing tables
> in-memory and subsequently providing readers with the access to the
> tables?

That's a good idea.




Re: heavily contended lwlocks with long wait queues scale badly

2024-01-15 Thread Michael Paquier
On Thu, Jan 11, 2024 at 09:47:33AM -0500, Jonathan S. Katz wrote:
> I have similar data sources to Nathan/Michael and I'm trying to avoid piling
> on, but one case that's interesting occurred after a major version upgrade
> from PG10 to PG14 on a database supporting a very active/highly concurrent
> workload. On inspection, it seems like backpatching would help this
> particularly case.
> 
> With 10/11 EOL, I do wonder if we'll see more of these reports on upgrade to
> < PG16.
> 
> (I was in favor of backpatching prior; opinion is unchanged).

Hearing nothing, I have prepared a set of patches for v12~v15,
checking all the lwlock paths for all the branches.  In the end the
set of changes looks rather sane to me regarding the queue handling.

I have also run some numbers on all the branches, and the test case
posted upthread falls off dramatically after 512 concurrent
connections at the top of all the stable branches :(

For example on REL_12_STABLE with and without the patch attached:
num   v12            v12+patch
1     29717.151665   29096.707588
2     63257.709301   61889.476318
4     127921.873393  124575.901330
8     231400.571662  230562.725174
16    343911.185351  312432.897015
32    291748.985280  281011.787701
64    268998.728648  269975.605115
128   297332.597018  286449.176950
256   243902.817657  240559.122309
512   190069.602270  194510.718508
768   58915.650225   165714.707198
1024  39920.950552   149433.836901
2048  16922.391688   108164.301054
4096  6229.063321    69032.338708

I'd like to apply that, just let me know if you have any comments
and/or objections.
--
Michael
From 83706cd5792bc002a668fd54cb93c15011a97063 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sun, 20 Nov 2022 11:56:32 -0800
Subject: [PATCH] lwlock: Fix quadratic behavior with very long wait lists

Until now LWLockDequeueSelf() sequentially searched the list of waiters to see
if the current proc is still on the list of waiters, or has already been
removed. In extreme workloads, where the wait lists are very long, this leads
to quadratic behavior: #backends iterating over a list #backends
long. Additionally, the likelihood of needing to call LWLockDequeueSelf() in
the first place also increases with the increased length of the wait queue, as
it becomes more likely that a lock is released while waiting for the wait list
lock, which is held for longer during lock release.

Due to the exponential back-off in perform_spin_delay() this is surprisingly
hard to detect. We should make that easier, e.g. by adding a wait event around
the pg_usleep() - but that's a separate patch.

The fix is simple - track whether a proc is currently waiting in the wait list
or already removed but waiting to be woken up in PGPROC->lwWaiting.

In some workloads with a lot of clients contending for a small number of
lwlocks (e.g. WALWriteLock), the fix can substantially increase throughput.

As the quadratic behavior arguably is a bug, we might want to decide to
backpatch this fix in the future.

Author: Andres Freund 
Reviewed-by: Bharath Rupireddy 
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6...@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=k8xi7psqboftzozavhaxjatvc14iyalu4m...@mail.gmail.com
---
 src/include/storage/lwlock.h  |  8 
 src/include/storage/proc.h|  2 +-
 src/backend/access/transam/twophase.c |  2 +-
 src/backend/storage/lmgr/lwlock.c | 53 +++
 src/backend/storage/lmgr/proc.c   |  4 +-
 5 files changed, 42 insertions(+), 27 deletions(-)

diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e03d317eea..d88fa4b4ad 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -23,6 +23,14 @@
 
 struct PGPROC;
 
+/* what state of the wait process is a backend in */
+typedef enum LWLockWaitState
+{
+	LW_WS_NOT_WAITING, /* not currently waiting / woken up */
+	LW_WS_WAITING, /* currently waiting */
+	LW_WS_PENDING_WAKEUP /* removed from waitlist, but not yet signalled */
+} LWLockWaitState;
+
 /*
  * Code outside of lwlock.c should not manipulate the contents of this
  * structure directly, but we have to declare it here to allow LWLocks to be
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2579e619eb..2ea773098b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -211,7 +211,7 @@ struct PGPROC
 	bool		recoveryConflictPending;
 
 	/* Info about LWLock the process is currently waiting for, if any. */
-	bool		lwWaiting;		/* true if waiting for an LW lock */
+	uint8		lwWaiting;		/* see LWLockWaitState */
 	uint8		lwWaitMode;		/* lwlock mode being waited for */
 	proclist_node lwWaitLink;	/* position in LW lock wait list */
 
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5293c6920b..ca7037eb2f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -486,7 +486,7 @@ 

Re: minor replication slot docs edits

2024-01-15 Thread Bharath Rupireddy
On Mon, Jan 15, 2024 at 9:08 PM Alvaro Herrera  wrote:
>
> Pursuant to a comment I made a few months ago[1], I propose the attached
> changes to replication slots documentation.  In essence, I want to
> explain that replication slots are good, and the max_size GUC, before
> moving on to explain that the other methods are worse.
>
> [1] https://postgr.es/m/20230413111838.e7yxke2dtwrxw3qy@alvherre.pgsql

Thanks for the patch. The wording looks good to me. However, I have
some comments on the placement of the note:

1. How about bundling this in a <caution> or <warning>?

+   
+Beware that replication slots can retain so many WAL segments that they
+fill up the space allocated for pg_wal.
+max_slot_wal_keep_size can be used to limit the size
+of WAL files retained by replication slots.
+   
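
For illustration, with suggestion 1 the note might end up wrapped like
this (assuming <caution> is the element we pick):

+   <caution>
+    <para>
+     Beware that replication slots can retain so many WAL segments that
+     they fill up the space allocated for pg_wal.
+     max_slot_wal_keep_size can be used to limit the size of WAL files
+     retained by replication slots.
+    </para>
+   </caution>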

2. I think the better place for this note is at the end, after the
"Similarly, wal_keep_size on its own, without ..." paragraph. It will
then read as: we introduce what a replication slot is and why it is
better than other mechanisms to retain WAL, and then we caution the
users about it retaining WAL.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add test module for Table Access Method

2024-01-15 Thread Bharath Rupireddy
On Tue, Jan 16, 2024 at 10:28 AM Michael Paquier  wrote:
>
> Hmm.  I'd rather have it do something useful in terms of test coverage
> rather than being just an empty skull.
>
> How about adding the same kind of coverage as dummy_index_am with a
> couple of reloptions then?  That can serve as a point of reference
> when a table AM needs a few custom options.  A second idea would be to
> show how to use toast relations when implementing your new AM, where a
> toast table could be created even in cases where we did not want one
> with heap, when it comes to size limitations with char and/or varchar,
> and that makes for a simpler needs_toast_table callback.

I think a test module for a table AM will really help developers. Just
to add to the above list - how about the table AM implementing a
simple in-memory (columnar if possible) database storing tables
in-memory and subsequently providing readers with the access to the
tables?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: introduce dynamic shared memory registry

2024-01-15 Thread Bharath Rupireddy
On Sun, Jan 14, 2024 at 3:11 AM Nathan Bossart  wrote:
>
> Here is a new version of the patch set with these changes.

Thanks. Here are some comments on v7-0002.

1.
+GetNamedDSMSegment(const char *name, size_t size,
+   void (*init_callback) (void *ptr), bool *found)
+{
+
+Assert(name);
+Assert(size);
+Assert(found);

I've done some input validation testing on GetNamedDSMSegment():

With an empty key name (""), it works, but do we need that in
practice? Can't we error out saying the name can't be empty?

With a size 0, an assertion is fine, but in production (without the
assertion), I'm seeing the following errors.

2024-01-16 04:49:28.961 UTC client backend[864369]
pg_regress/test_dsm_registry ERROR:  could not resize shared memory
segment "/PostgreSQL.3701090278" to 0 bytes: Invalid argument
2024-01-16 04:49:29.264 UTC postmaster[864357] LOG:  server process
(PID 864370) was terminated by signal 11: Segmentation fault

I think it's better for GetNamedDSMSegment() to error out on an empty
'name' and on size 0. This makes the user-facing function
GetNamedDSMSegment more robust.
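
Something along these lines at the top of the function is what I have
in mind (the error codes and message wording are just illustrative):

+    if (name == NULL || *name == '\0')
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("DSM segment name cannot be empty")));
+
+    if (size == 0)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("DSM segment size must be nonzero")));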

2.
+void *
+GetNamedDSMSegment(const char *name, size_t size,
+   void (*init_callback) (void *ptr), bool *found)

+Assert(found);

Why does the input parameter 'found' need to be passed by the caller?
Neither the test module added, nor the pg_prewarm is using the found
variable. The function will anyway create the DSM segment if one with
the given name isn't found. IMO, found is an optional parameter for
the caller. So, the assert(found) isn't necessary.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add test module for Table Access Method

2024-01-15 Thread Michael Paquier
On Mon, Jan 15, 2024 at 03:40:30PM +0100, Jelte Fennema-Nio wrote:
> On Mon, 15 Jan 2024 at 14:26, Aleksander Alekseev
>  wrote:
>> To be fair, Postgres uses TAM internally, so there is at least one
>> complete and up-to-date real-life example.
> 
> Sure, but that one is quite hard to follow if you don't already know
> lots of details of the heap storage. At least for me, having a minimal
> example was extremely helpful and it made for a great code skeleton to
> start from.

Hmm.  I'd rather have it do something useful in terms of test coverage
rather than being just an empty skull.

How about adding the same kind of coverage as dummy_index_am with a
couple of reloptions then?  That can serve as a point of reference
when a table AM needs a few custom options.  A second idea would be to
show how to use toast relations when implementing your new AM, where a
toast table could be created even in cases where we did not want one
with heap, when it comes to size limitations with char and/or varchar,
and that makes for a simpler needs_toast_table callback.
--
Michael




Re: Synchronizing slots from primary to standby

2024-01-15 Thread Dilip Kumar
On Tue, Jan 16, 2024 at 9:37 AM Amit Kapila  wrote:
>
> On Tue, Jan 16, 2024 at 9:03 AM shveta malik  wrote:
> >
> Agreed and as said earlier I think it is better to make it a
> PGC_SIGHUP. Also, not sure we can say it is a non-standard way as
> already autovacuum launcher is handled in the same way. One more minor
> thing is it will save us for having a new bgworker state
> BgWorkerStart_ConsistentState_HotStandby as introduced by this patch.

Yeah, it's not a nonstandard way.  But bgworker provides a lot of
built-in infrastructure which we would otherwise have to maintain
ourselves if we opt for option 2.  I would have preferred option 3
from a simplicity point of view, but I would prefer making this
PGC_SIGHUP over simplicity.  Anyway, if there are issues in doing so,
then we can keep it PGC_POSTMASTER, but it's worth trying this out.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: reorganize "Shared Memory and LWLocks" section of docs

2024-01-15 Thread Bharath Rupireddy
On Sun, Jan 14, 2024 at 2:58 AM Nathan Bossart  wrote:
>
> Great.  I've attached a v3 with a couple of fixes suggested in the other
> thread [0].  I'll wait a little while longer in case anyone else wants to
> take a look.

The v3 patch looks good to me, except for a nitpick: the input
parameter for RequestAddinShmemSpace is 'Size', not 'int':

 
 void RequestAddinShmemSpace(int size)
 

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-01-15 Thread Amit Kapila
On Tue, Jan 16, 2024 at 9:03 AM shveta malik  wrote:
>
> On Sat, Jan 13, 2024 at 12:54 PM Amit Kapila  wrote:
> >
> > On Fri, Jan 12, 2024 at 5:50 PM shveta malik  wrote:
> > >
> > > There are multiple approaches discussed and tried when it comes to
> > > starting a slot-sync worker. I am summarizing all here:
> > >
> > >  1) Make slotsync worker as an Auxiliary Process (like checkpointer,
> > > walwriter, walreceiver etc). The benefit this approach provides is, it
> > > can control begin and stop in a more flexible way as each auxiliary
> > > process could have different checks before starting and can have
> > > different stop conditions. But it needs code duplication for process
> > > management(start, stop, crash handling, signals etc) and currently it
> > > does not support db-connection smoothly (none of the auxiliary process
> > > has one so far)
> > >
> >
> > As slotsync worker needs to perform transactions and access syscache,
> > we can't make it an auxiliary process as that doesn't initialize the
> > required stuff like syscache. Also, see the comment "Auxiliary
> > processes don't run transactions ..." in AuxiliaryProcessMain() which
> > means this is not an option.
> >
> > >
> > > 2) Make slotsync worker as a 'special' process like AutoVacLauncher
> > > which is neither an Auxiliary process nor a bgworker one. It allows
> > > db-connection and also provides flexibility to have start and stop
> > > conditions for a process.
> > >
> >
> > Yeah, due to these reasons, I think this option is worth considering
> > and another plus point is that this allows us to make enable_syncslot
> > a PGC_SIGHUP GUC rather than a PGC_POSTMASTER.
> >
> > >
> > > 3) Make slotysnc worker a bgworker. Here we just need to register our
> > > process as a bgworker (RegisterBackgroundWorker()) by providing a
> > > relevant start_time and restart_time and then the process management
> > > is well taken care of. It does not need any code-duplication and
> > > allows db-connection smoothly in registered process. The only thing it
> > > lacks is that it does not provide flexibility of having
> > > start-condition which then makes us to have 'enable_syncslot' as
> > > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I
> > > feel enable_syncslot is something which will not be changed frequently
> > > and with the benefits provided by bgworker infra, it seems a
> > > reasonably good option to choose this approach.
> > >
> >
> > I agree but it may be better to make it a PGC_SIGHUP parameter.
> >
> > > 4) Another option is to have Logical Replication Launcher(or a new
> > > process) to launch slot-sync worker. But going by the current design
> > > where we have only 1 slotsync worker, it may be an overhead to have an
> > > additional manager process maintained.
> > >
> >
> > I don't see any good reason to have an additional launcher process here.
> >
> > >
> > > Thus weighing pros and cons of all these options, we have currently
> > > implemented the bgworker approach (approach 3).  Any feedback is
> > > welcome.
> > >
> >
> > I vote to go for (2) unless we face difficulties in doing so but (3)
> > is also okay especially if others also think so.
>
> I am not against any of the approaches but I still feel that when we
> have a standard way of doing things (bgworker) we should not keep
> adding code to do things in a special way unless there is a strong
> reason to do so. Now we need to decide if 'enable_syncslot' being
> PGC_POSTMASTER is a strong reason to go the non-standard way?
>

Agreed and as said earlier I think it is better to make it a
PGC_SIGHUP. Also, not sure we can say it is a non-standard way as
already autovacuum launcher is handled in the same way. One more minor
thing is it will save us for having a new bgworker state
BgWorkerStart_ConsistentState_HotStandby as introduced by this patch.

> If yes,
> then we should think of option 2 else option 3 seems better in my
> understanding (which may be limited due to my short experience here),
> so I am all ears to what others think on this.
>

I also think it would be better if more people share their opinion on
this matter.

-- 
With Regards,
Amit Kapila.




Re: [17] CREATE SUBSCRIPTION ... SERVER

2024-01-15 Thread Bharath Rupireddy
On Tue, Jan 16, 2024 at 7:25 AM Jeff Davis  wrote:
>
> On Fri, 2024-01-12 at 17:17 -0800, Jeff Davis wrote:
> > I think 0004 needs a bit more work, so I'm leaving it off for now,
> > but
> > I'll bring it back in the next patch set.
>
> Here's the next patch set. 0001 - 0003 are mostly the same with some
> improved error messages and some code fixes. I am looking to start
> committing 0001 - 0003 soon, as they have received some feedback
> already and 0004 isn't required for the earlier patches to be useful.

Thanks. Here are some comments on 0001. I'll look at other patches very soon.

1.
+/* Load the library providing us libpq calls. */
+load_file("libpqwalreceiver", false);

At first glance, it looks odd that the libpqwalreceiver library is being
linked to every backend that uses postgresql_fdw_validator. After a
bit of grokking, this feels like the easiest way to avoid linking
libpq to the main postgresql executable, as specified at the beginning
of libpqwalreceiver.c's file comments. Maybe a more descriptive note is
worthwhile here instead of just saying "Load the library providing us
libpq calls."?

2. Why not add the typedef keyword before the ConnectionOption
structure? That way all the "struct ConnectionOption" usages can be
removed, no? I know that previously there was no typedef, but we can
add it now so that the code looks cleaner.

typedef struct ConnectionOption
{
    const char *optname;
    bool        issecret;       /* is option for a password? */
    bool        isdebug;        /* is option a debug option? */
} ConnectionOption;

FWIW, with the above change and removal of struct before every use of
ConnectionOption, the code compiles cleanly for me.

3.
+static const struct ConnectionOption *
+libpqrcv_conninfo_options(void)

Why does libpqrcv_conninfo_options return a const ConnectionOption?
Is it that we don't expect callers to modify the result? I think it's
not needed, given that PQconndefaults doesn't constify its return
value.

4.
+/* skip options that must be overridden */
+if (strcmp(option, "client_encoding") == 0)
+return false;
+

Options that must be overridden, or disallow specifying
"client_encoding" in the SERVER/USER MAPPING definition (just like
dblink)?
dblink)?

/* Disallow "client_encoding" */
if (strcmp(opt->keyword, "client_encoding") == 0)
return false;

5.
"By using the correct libpq options, it no longer needs to be
deprecated, and can be used by the upcoming pg_connection_fdw."

Use of postgresql_fdw_validator for pg_connection_fdw seems a bit odd
to me. I don't mind pg_connection_fdw having its own validator
pg_connection_fdw_validator even if it duplicates the code. To avoid
code duplication we can move the guts to an internal function in
foreign.c so that both postgresql_fdw_validator and
pg_connection_fdw_validator can use it. This way the code is cleaner
and we can just leave postgresql_fdw_validator as deprecated.
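
A rough sketch of that refactoring (the helper names below are made up,
and the per-option check is elided into a hypothetical predicate):

static void
validate_connection_options(List *options_list, Oid catalog)
{
    ListCell   *cell;

    foreach(cell, options_list)
    {
        DefElem    *def = (DefElem *) lfirst(cell);

        if (!is_valid_connection_option(def->defname, catalog))
            ereport(ERROR,
                    (errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
                     errmsg("invalid option \"%s\"", def->defname)));
    }
}

Datum
pg_connection_fdw_validator(PG_FUNCTION_ARGS)
{
    List       *options_list = untransformRelOptions(PG_GETARG_DATUM(0));
    Oid         catalog = PG_GETARG_OID(1);

    validate_connection_options(options_list, catalog);
    PG_RETURN_VOID();
}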

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

2024-01-15 Thread jian he
On Tue, Jan 9, 2024 at 4:40 AM Joe Conway  wrote:
>
> On 1/8/24 14:36, Dean Rasheed wrote:
> > On Thu, 7 Dec 2023 at 01:10, Joe Conway  wrote:
> >>
> >> The attached should fix the CopyOut response to say one column.
> >>
> >
> > Playing around with this, I found a couple of cases that generate an error:
> >
> > COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);
> >
> > COPY (VALUES (1), (2)) TO stdout WITH (format json);
> >
> > both of those generate the following:
> >
> > ERROR:  record type has not been registered
>
>
> Thanks -- will have a look
>
> --
> Joe Conway
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>
>
>

In the function CopyOneRowTo, I tried calling the function BlessTupleDesc again:

+BlessTupleDesc(slot->tts_tupleDescriptor);
rowdata = ExecFetchSlotHeapTupleDatum(slot);

Please check the attachments (one test.sql file, one patch output, one bless-twice output).

Now the error cases are gone and fewer cases return an error,
but the new result is not the expected one.

`COPY (SELECT g from generate_series(1,1) g) TO stdout WITH (format json);`
returns
{"":1}
The expected result would be `{"g":1}`.

I think the reason is maybe related to the function copy_dest_startup.


test.sql
Description: application/sql


patch.out
Description: Binary data


patch_bless_twice.out
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-01-15 Thread shveta malik
On Sat, Jan 13, 2024 at 12:54 PM Amit Kapila  wrote:
>
> On Fri, Jan 12, 2024 at 5:50 PM shveta malik  wrote:
> >
> > There are multiple approaches discussed and tried when it comes to
> > starting a slot-sync worker. I am summarizing all here:
> >
> >  1) Make slotsync worker as an Auxiliary Process (like checkpointer,
> > walwriter, walreceiver etc). The benefit this approach provides is, it
> > can control begin and stop in a more flexible way as each auxiliary
> > process could have different checks before starting and can have
> > different stop conditions. But it needs code duplication for process
> > management(start, stop, crash handling, signals etc) and currently it
> > does not support db-connection smoothly (none of the auxiliary process
> > has one so far)
> >
>
> As slotsync worker needs to perform transactions and access syscache,
> we can't make it an auxiliary process as that doesn't initialize the
> required stuff like syscache. Also, see the comment "Auxiliary
> processes don't run transactions ..." in AuxiliaryProcessMain() which
> means this is not an option.
>
> >
> > 2) Make slotsync worker as a 'special' process like AutoVacLauncher
> > which is neither an Auxiliary process nor a bgworker one. It allows
> > db-connection and also provides flexibility to have start and stop
> > conditions for a process.
> >
>
> Yeah, due to these reasons, I think this option is worth considering
> and another plus point is that this allows us to make enable_syncslot
> a PGC_SIGHUP GUC rather than a PGC_POSTMASTER.
>
> >
> > 3) Make slotysnc worker a bgworker. Here we just need to register our
> > process as a bgworker (RegisterBackgroundWorker()) by providing a
> > relevant start_time and restart_time and then the process management
> > is well taken care of. It does not need any code-duplication and
> > allows db-connection smoothly in registered process. The only thing it
> > lacks is that it does not provide flexibility of having
> > start-condition which then makes us to have 'enable_syncslot' as
> > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I
> > feel enable_syncslot is something which will not be changed frequently
> > and with the benefits provided by bgworker infra, it seems a
> > reasonably good option to choose this approach.
> >
>
> I agree but it may be better to make it a PGC_SIGHUP parameter.
>
> > 4) Another option is to have Logical Replication Launcher(or a new
> > process) to launch slot-sync worker. But going by the current design
> > where we have only 1 slotsync worker, it may be an overhead to have an
> > additional manager process maintained.
> >
>
> I don't see any good reason to have an additional launcher process here.
>
> >
> > Thus weighing pros and cons of all these options, we have currently
> > implemented the bgworker approach (approach 3).  Any feedback is
> > welcome.
> >
>
> I vote to go for (2) unless we face difficulties in doing so but (3)
> is also okay especially if others also think so.

I am not against any of the approaches but I still feel that when we
have a standard way of doing things (bgworker) we should not keep
adding code to do things in a special way unless there is a strong
reason to do so. Now we need to decide if 'enable_syncslot' being
PGC_POSTMASTER is a strong reason to go the non-standard way? If yes,
then we should think of option 2 else option 3 seems better in my
understanding (which may be limited due to my short experience here),
so I am all ears to what others think on this.

thanks
Shveta




Strange Bitmapset manipulation in DiscreteKnapsack()

2024-01-15 Thread David Rowley
While working on [1], I noticed some strange code in
DiscreteKnapsack() which seems to be aiming to copy the Bitmapset.

It's not that obvious at a casual glance, but:

sets[j] = bms_del_members(sets[j], sets[j]);

this is aiming to zero all the words in the set by passing the same
set in both parameters.

Now that 00b41463c changed Bitmapset to have NULL be the only valid
representation of an empty set, this code no longer makes sense.  We
may as well just bms_free() the original set and bms_copy() in the new
set as the bms_del_members() call will always pfree the set anyway.

I've done that in the attached.

I did consider whether we might want bms_merge_members() or
bms_exchange_members() or some other suitably named function
to perform a del/add operation, but given the lack of complaints about
any performance regressions here, I think it's not worthwhile.

The code could also be adjusted to:

sets[j] = bms_add_members(sets[j], sets[ow]);
sets[j] = bms_del_members(sets[j], sets[j]);
sets[j] = bms_add_members(sets[j], sets[ow]); // re-add any deletions

so that the set never becomes fully empty... but ... that's pretty horrid.

00b41463c is in PG16, but I'm not proposing to backpatch this.  The
misleading comment does not seem critical enough and the resulting
behaviour isn't changing, just the performance characteristics.

Unless there's some objection, I plan to push this in the next day or two.

David

[1] 
https://postgr.es/m/caaphdvoxpoadyejmj9e1ihzzzynctgqdappwgpzmamq222n...@mail.gmail.com
diff --git a/src/backend/lib/knapsack.c b/src/backend/lib/knapsack.c
index 13d800718f..b2f32fdd60 100644
--- a/src/backend/lib/knapsack.c
+++ b/src/backend/lib/knapsack.c
@@ -87,11 +87,11 @@ DiscreteKnapsack(int max_weight, int num_items,
 
if (values[j] <= values[ow] + iv)
{
-   /* copy sets[ow] to sets[j] without realloc */
+   /* copy sets[ow] to sets[j] */
if (j != ow)
{
-   sets[j] = bms_del_members(sets[j], sets[j]);
-   sets[j] = bms_add_members(sets[j], sets[ow]);
+   bms_free(sets[j]);
+   sets[j] = bms_copy(sets[ow]);
}
 
sets[j] = bms_add_member(sets[j], i);


Re: Add PQsendSyncMessage() to libpq

2024-01-15 Thread Michael Paquier
On Mon, Jan 15, 2024 at 10:49:56AM +0100, Alvaro Herrera wrote:
> the new function pqPipelineSyncInternal is not a wrapper for these other
> two functions -- the opposite is true actually.  We tend to use the term
> "workhorse" or "internal workhorse" for this kind of thing.

Indeed, makes sense.

> In the docs, after this patch we have
> 
> - PQpipelineSync
> - PQsendFlushRequest
> - PQsendPipelineSync
> 
> Wouldn't it make more sense to add the new function in the middle of the
> two existing ones instead?

Ordering PQsendPipelineSync just after PQpipelineSync is OK by me.
I've applied the patch with all these modifications to move on with
the subject.

> Looking again at the largish comment that's now atop
> pqPipelineSyncInternal(), I think most of it should be removed -- these
> things should be explained in the SGML docs, and I think they are, in
> the "Using Pipeline Mode" section.  We can just have the lines this
> patch is adding.

Hmm.  The first two sentences, about being able to submit more commands
to the pipeline, are documented in the subsection "Issuing Queries".
The third sentence is implied in the second paragraph of this
subsection.  The comment's 4th paragraph, about synchronous commands
not being available until all the results from the pipeline have been
consumed, is mentioned in the first paragraph of "Using Pipeline Mode".
So you are right that this could be entirely removed.

How about the attached to remove all that, then?
--
Michael
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 52d41658c1..152b100624 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -3245,23 +3245,6 @@ PQsendPipelineSync(PGconn *conn)
 /*
  * Workhorse function for PQpipelineSync and PQsendPipelineSync.
  *
- * It's legal to start submitting more commands in the pipeline immediately,
- * without waiting for the results of the current pipeline. There's no need to
- * end pipeline mode and start it again.
- *
- * If a command in a pipeline fails, every subsequent command up to and
- * including the result to the Sync message sent by pqPipelineSyncInternal
- * gets set to PGRES_PIPELINE_ABORTED state. If the whole pipeline is
- * processed without error, a PGresult with PGRES_PIPELINE_SYNC is produced.
- *
- * Queries can already have been sent before pqPipelineSyncInternal is called,
- * but pqPipelineSyncInternal needs to be called before retrieving command
- * results.
- *
- * The connection will remain in pipeline mode and unavailable for new
- * synchronous command execution functions until all results from the pipeline
- * are processed by the client.
- *
  * immediate_flush controls if the flush happens immediately after sending the
  * Sync message or not.
  */




Re: Revise the Asserts added to bimapset manipulation functions

2024-01-15 Thread David Rowley
On Tue, 2 Jan 2024 at 20:18, Richard Guo  wrote:
> I think one purpose of introducing REALLOCATE_BITMAPSETS is to help find
> dangling pointers to Bitmapset's.  From this point of view, I agree that
> we should call bms_copy_and_free() in bms_add_members(), because the
> bitmapset 'a' might be recycled (in-place modified, or pfreed).

I spoke to Andres about this in our weekly meeting and he explained it
in such a way that I now agree it's useful to repalloc on all
modifications.  I'd previously thought that when the comments mention
that some function "recycles its left input", you could be certain
that the Bitmapset would be modified in-place, and therefore any other
pointers to this set would remain valid.  As Andres reminded me,
that's not the case when the set becomes empty, and there's nothing
particularly special about a set becoming empty, so making a clone of
the set will help identify any pointers that don't get updated as a
result of the modification.
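
To make the hazard concrete, here is a minimal sketch (not code from
the patch) of the kind of stale pointer this option is meant to expose:

/*
 * Sketch of the dangling-pointer hazard REALLOCATE_BITMAPSETS targets.
 * 'b' aliases 'a'; bms_add_member() normally modifies 'a' in place, so
 * the bug below usually goes unnoticed.  With REALLOCATE_BITMAPSETS,
 * every modification returns a freshly allocated set and pfrees the
 * old one, so the stale read through 'b' touches freed memory and is
 * caught by CLOBBER_FREED_MEMORY or valgrind.
 */
Bitmapset  *a = bms_make_singleton(1);
Bitmapset  *b = a;              /* alias that is never updated */

a = bms_add_member(a, 2);       /* 'b' may now dangle */
Assert(bms_is_member(1, b));    /* use-after-free under the option */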

I've now adjusted the patch to have all modifications to Bitmapsets
return a newly allocated set. There are a few cases missing in master
that need to be fixed (items 6-10 below):

The patch now includes changes for the following:

1. Document what REALLOCATE_BITMAPSETS is for.
2. Provide a reusable function to check that a set is valid, and use
it to validate sets in cassert builds (Richard)
3. Provide a reusable function to copy a set and pfree the original
and use that instead of duplicating that code all over bitmapset.c
4. Fix Assert duplication (Richard)
5. Make comments in bms_union, bms_intersect, bms_difference clear
that a new set is allocated.  (This has annoyed me for a while)
6. Fix failure to duplicate the set in bms_add_members() when b == NULL.
7. Fix failure to duplicate the set in bms_add_range() when upper < lower
8. Fix failure to duplicate the set in bms_add_range() when the set
has enough words already.
9. Fix failure to duplicate the set in bms_del_members() when b == NULL
10. Fix failure to duplicate the set in bms_join() when a == NULL and
also fix the b == NULL case too
11. Fix hazard in bms_add_members(), bms_int_members(),
bms_del_members() and bms_join(),  where the code added in 7d58f2342
could crash if a == b due to the REALLOCATE_BITMAPSETS code pfreeing
'a'. This happens in knapsack.c:93, although I propose we adjust that
code now that empty sets are always represented as NULL.

David





> According to this criterion, it seems to me that we should also call
> bms_copy_and_free() in bms_join(), since both inputs there might be
> recycled.  But maybe I'm not understanding it correctly.
>
>>
>> > * Do you think we can add Assert(bms_is_valid_set(a)) for bms_free()?
>>
>> I did briefly have the Assert in bms_free(), but I removed it as I
>> didn't think it was that useful to validate the set before freeing it.
>> Certainly, there'd be an argument to do that, but I ended up not
>> putting it there. I wondered if there could be a case where we do
>> something like bms_int_members() which results in an empty set and
>> after checking for and finding the set is now empty, we call
>> bms_free().  If we did that then we'd Assert fail.  In reality, we use
>> pfree() instead of bms_free() as we already know the set is not NULL,
>> so it wouldn't cause a problem for that particular case. I wondered if
>> there might be another one that I didn't spot, so felt it was best not
>> to Assert there.
>>
>> I imagine that if we found some case where the bms_free() Assert
>> failed, we'd likely just remove it rather than trying to make the set
>> valid before freeing it.  So it seems pretty pointless if we'd opt to
>> remove the Assert() at the first sign of trouble.
>
>
> I think I understand your concern.  In some bitmapset manipulation
> functions, like bms_int_members(), the result might be computed as
> empty.  In such cases we need to free the input bitmap.  If we use
> bms_free(), the Assert would fail.
>
> It seems to me that this scenario can only occur within the bitmapset
> manipulation functions.  Outside of bitmapset.c, I think it should be
> quite safe to call bms_free() with this Assert.  So I think it should
> not have problem to have this Assert in bms_free() as long as we are
> careful enough when calling bms_free() inside bitmapset.c.
>
> Thanks
> Richard
From 2a8048beb4623f7b0d98763658252a74619cd887 Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Fri, 5 Jan 2024 13:00:35 +1300
Subject: [PATCH v3] Fix REALLOCATE_BITMAPSETS code

7d58f2342 added a compile-time option to have bitmapset.c reallocate the
set before returning when a set is modified.  That commit failed to do
its job in various cases and returned the input set when it shouldn't
have in these cases.

This commit also adds some documentation about what
REALLOCATE_BITMAPSETS is for.  This is important as future functions
that go inside bitmapset.c need to know if they need to do anything
special when this compile-time option is enabled.

Re: Make mesage at end-of-recovery less scary.

2024-01-15 Thread Kyotaro Horiguchi
Thank you for the comments.

At Fri, 12 Jan 2024 15:03:26 +0300, Aleksander Alekseev wrote:
> ```
> +p = (char *) record;
> +pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
> +
> +while (p < pe && *p == 0)
> +p++;
> +
> +if (p == pe)
> ```
> 
> Just as a random thought: perhaps we should make this a separate
> function, as a part of src/port/. It seems to me that this code could
> benefit from using vector instructions some day, similarly to
> memcmp(), memset() etc. Surprisingly there doesn't seem to be a
> standard C function for this. Alternatively one could argue that one
> cycle doesn't make much code to reuse and that the C compiler will
> place SIMD instructions for us. However a counter-counter argument
> would be that we could use a macro or even better an inline function
> and have the same effect except getting a slightly more readable code.

Creating a function with a name like memcmp_byte() should be
straightforward, but implementing it with SIMD right away seems a bit
challenging. Similar operations are already being performed elsewhere
in the code, probably within the stats collector, where memcmp is used
with a statically allocated area that's filled with zeros. If we can
achieve a performance equivalent to memcmp with this new function,
then it definitely seems worth pursuing.
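
For illustration, a minimal scalar sketch of such a helper; the name
and signature are only this thread's working idea, not an existing API:

/*
 * Sketch of memcmp_byte(): like memcmp(), but compares a buffer
 * against a single repeated byte value.  Returns 0 iff every byte of
 * s equals c.  A vectorized version would process a word or SIMD
 * register at a time; this scalar loop only pins down the semantics.
 */
static inline int
memcmp_byte(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    const unsigned char *pe = p + n;

    while (p < pe && *p == (unsigned char) c)
        p++;
    return (p == pe) ? 0 : (*p - (unsigned char) c);
}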

> ```
> - * This is just a convenience subroutine to avoid duplicated code in
> + * This is just a convenience subroutine to avoid duplicate code in
> ```
> 
> This change doesn't seem to be related to the patch. Personally I
> don't mind it though.

Ah, I'm sorry. That was an unrelated change I mistakenly made at the
last moment.

> All in all I find v28 somewhat scary. It does much more than "making
> one message less scary" as it was initially intended and what bugged
> me personally, and accordingly touches many more places including
> xlogreader.c, xlogrecovery.c, etc.
> 
> Particularly I have mixed feeling about this:
> 
> ```
> +/*
> + * Consider it as end-of-WAL if all subsequent bytes of this page
> + * are zero. We don't bother checking the subsequent pages since
> + * they are not zeroed in the case of recycled segments.
> + */
> ```
> 
> If I understand correctly, if somehow several FS blocks end up being
> zeroed (due to OS bug, bit rot, restoring from a corrupted for
> whatever reason backup, hardware failures, ...) there is non-zero
> chance that PG will interpret this as a normal situation. To my
> knowledge this is not what we typically do - typically PG would report
> an error and ask a human to figure out what happened. Of course the
> possibility of such a scenario is small, but I don't think that as
> DBMS developers we can ignore it.

For now, let me explain the basis for this patch. The fundamental
issue is that these warnings that always appear are, in practice, not
a problem in almost all cases. Some of those who encounter them for
the first time may feel uneasy and reach out with inquiries. On the
other hand, those familiar with these warnings tend to ignore them and
only pay attention to details when actual issues arise. Therefore, the
intention of this patch is to label them as "no issue" unless a
problem is blatantly evident, in order to prevent unnecessary concern.

> Does anyone agree or maybe I'm making things up?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-15 Thread Sutou Kouhei
Hi,

In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 15 Jan 2024 16:03:41 +0900,
  Masahiko Sawada wrote:

>> Defining one more static const struct instead of providing a
>> convenient (but a bit tricky) macro may be straightforward:
>>
>> static const CopyToFormatRoutine testfmt_copyto_routine = {
>> .type = T_CopyToFormatRoutine,
>> .start_fn = testfmt_copyto_start,
>> .onerow_fn = testfmt_copyto_onerow,
>> .end_fn = testfmt_copyto_end
>> };
>>
>> static const CopyFormatRoutine testfmt_copyto_handler = {
>> .type = T_CopyFormatRoutine,
>> .is_from = false,
>> .routine = (Node *) _copyto_routine
>> };
> 
> Yeah, IIUC this is the option 2 you mentioned[1]. I think we can go
> with this idea as it's the simplest.
>
> [1] 
> https://www.postgresql.org/message-id/20240110.120034.501385498034538233.kou%40clear-code.com

Ah, you're right. I forgot it...

>  That is CopyFormatRoutine will be like:
> 
> typedef struct CopyFormatRoutine
> {
> NodeTag type;
> 
> /* either CopyToFormatRoutine or CopyFromFormatRoutine */
> Node   *routine;
> }   CopyFormatRoutine;
> 
> And the core can check the node type of the 'routine' in the
> CopyFormatRoutine returned by extensions.

It makes sense.
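
For illustration, the core-side validation could look roughly like
this (a sketch using assumed names from this thread's design, not a
committed API):

/*
 * Sketch only: validate what a COPY format handler returned.  The
 * handler function and its is_from argument are this thread's working
 * design, not committed code.
 */
CopyFormatRoutine *handler;

handler = (CopyFormatRoutine *)
    DatumGetPointer(OidFunctionCall1(handlerOid, BoolGetDatum(is_from)));

if (handler == NULL || !IsA(handler, CopyFormatRoutine))
    elog(ERROR, "COPY format handler %u did not return a CopyFormatRoutine",
         handlerOid);

if (is_from ? !IsA(handler->routine, CopyFromFormatRoutine) :
    !IsA(handler->routine, CopyToFormatRoutine))
    elog(ERROR, "COPY format handler %u returned an unexpected routine type",
         handlerOid);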


If there are no more comments, I'll start implementing this feature
based on the current design.


Thanks,
-- 
kou




Re: POC: GROUP BY optimization

2024-01-15 Thread Richard Guo
On Mon, Jan 15, 2024 at 3:56 PM Alexander Korotkov wrote:

> On Mon, Jan 15, 2024 at 8:42 AM Richard Guo wrote:
> > On Mon, Jan 15, 2024 at 8:20 AM Alexander Korotkov wrote:
> >>
> >> Thank you for providing the test case relevant for this code change.
> >> The revised patch incorporating this change is attached.  Now the
> >> patchset looks good to me.  I'm going to push it if there are no
> >> objections.
> >
> > Seems I'm late for the party.  Can we hold for several more days?  I'd
> > like to have a review on this patch.
>
> Sure, NP.  I'll hold it till your review.


Thanks.  Appreciate that.

I have briefly reviewed this patch and here are some comments.

* When trying to match the ordering of GROUP BY to that of ORDER BY in
get_useful_group_keys_orderings, this patch checks against the length of
path->pathkeys.  This does not make sense.  I guess this is a typo and
the real intention is to check against root->sort_pathkeys.

--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -504,7 +504,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path)
                                                   root->num_groupby_pathkeys);

     if (n > 0 &&
-        (enable_incremental_sort || n == list_length(path->pathkeys)))
+        (enable_incremental_sort || n == list_length(root->sort_pathkeys)))

However, I think this is also incorrect.  When incremental sort is
disabled, it is reasonable to consider reordering the GROUP BY keys only
if the number of matching pathkeys is equal to the length of
root->group_pathkeys i.e. if 'n == list_length(root->group_pathkeys)'.


* When the original ordering of GROUP BY keys matches the path's
pathkeys or ORDER BY keys, get_useful_group_keys_orderings would
generate duplicate PathKeyInfos.  Consequently, this duplication would
lead to the creation and addition of identical paths multiple times.
This is not great.  Consider the query below:

create table t (a int, b int);
create index on t (a, b);
set enable_hashagg to off;

explain select count(*) from t group by a, b;
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 GroupAggregate  (cost=0.15..97.27 rows=226 width=16)
   Group Key: a, b
   ->  Index Only Scan using t_a_b_idx on t  (cost=0.15..78.06 rows=2260 width=8)
(3 rows)

The same path with group keys {a, b} is created and added twice.


* Part of the work performed in this patch overlaps with that of
preprocess_groupclause.  They are both trying to adjust the ordering of
the GROUP BY keys to match ORDER BY.  I wonder if it would be better to
perform this work only once.


* When reordering the GROUP BY keys, I think the following checks need
some improvements.

+   /*
+* Since 1349d27 pathkey coming from underlying node can be in the
+* root->group_pathkeys but not in the processed_groupClause. So, we
+* should be careful here.
+*/
+   sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref,
+   *group_clauses);
+   if (!sgc)
+   /* The grouping clause does not cover this pathkey */
+   break;

I think we need to check or assert 'pathkey->pk_eclass->ec_sortref' is
valid before calling get_sortgroupref_clause_noerr with it.  Also, how
can we guarantee that the returned GROUP BY item is sortable?  Should we
check that with OidIsValid(sgc->sortop)?
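
Sketched against the quoted hunk, the strengthened checks might look
like this (assumed context, just to make the suggestion concrete):

/* Sketch of the suggested guards, in the context of the quoted hunk */
if (pathkey->pk_eclass->ec_sortref == 0)
    break;              /* pathkey does not come from any GROUP BY item */

sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref,
                                    *group_clauses);
if (sgc == NULL || !OidIsValid(sgc->sortop))
    break;              /* no grouping clause, or it is not sortable */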

Thanks
Richard


Re: make pg_ctl more friendly

2024-01-15 Thread Junwang Zhao
Hi Nathan,

On Tue, Jan 16, 2024 at 5:39 AM Nathan Bossart  wrote:
>
> +   POSTMASTER_RECOVERY_SHUTDOWN,
>
> Perhaps this should be POSTMASTER_SHUTDOWN_IN_RECOVERY to match the state
> in the control file?

Agreed

>
> +   case POSTMASTER_RECOVERY_SHUTDOWN:
> +   print_msg(_("PITR shutdown\n"));
> +   print_msg(_("update configuration for startup 
> again if needed\n"));
> +   break;
>
> I'm not sure I agree that this is a substantially friendlier message.  From
> a quick skim of the thread, it seems like you want to avoid sending a scary
> error message if Postgres was intentionally shut down while in recovery.
> If I got this particular message, I think I would be worried that something
> went wrong during my point-in-time restore, and I'd be scrambling to figure
> out what configuration this message wants me to update.
>
> If I'm correctly interpreting the intent here, it might be worth fleshing
> out the messages a bit more.  For example, instead of "PITR shutdown,"
> perhaps we could say "shut down while in recovery."

Makes sense. Fixed. See v4.

> And maybe we should
> point to the specific settings in the latter message.

I've changed this latter message to:
update recovery target settings for startup again if needed

What do you think?

>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com



-- 
Regards
Junwang Zhao


v4-0001-PITR-shutdown-should-not-report-error-by-pg_ctl.patch
Description: Binary data


Re: Error "initial slot snapshot too large" in create replication slot

2024-01-15 Thread Kyotaro Horiguchi
At Fri, 12 Jan 2024 11:28:09 -0500, Robert Haas wrote:
> However, I wonder whether this whole area is in need of a bigger
> rethink. There seem to be a number of situations in which the split
> into xip and subxip arrays is not very convenient, and also some
> situations where it's quite important. Sometimes we want to record
> what's committed, and sometimes what isn't. It's all a bit messy and
> inconsistent. The necessity of limiting snapshot size is annoying,
> too. I have no real idea what can be done about all of this, but what
> strikes me is that the current system has grown up incrementally: we
> started with a data structure designed for the original use case, and
> now by gradually adding new use cases things have gotten complicated.
> If we were designing things over from scratch, maybe we'd do it
> differently and end up with something less messy. And maybe someone
> can imagine a redesign that is likewise less messy.
> 
> But on the other hand, maybe not. Perhaps we can't really do any
> better than what we have. Then the question becomes whether this case
> is important enough to justify additional code complexity. I don't
> think I've personally seen users run into this problem so I have no
> special reason to think that it's important, but if it's causing
> issues for other people then maybe it is.

Thank you for the deep insights. I have understood your points. As I
can't think of any further simple modifications along this line, I will
withdraw this CF entry. At the moment, I also lack a fundamental,
comprehensive solution, but if I or anyone else comes up with such a
solution in the future, I believe it would be worth a separate
discussion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-01-15 Thread Peter Geoghegan
On Mon, Jan 15, 2024 at 2:32 PM Matthias van de Meent
 wrote:
> Can you pull these planner changes into their own commit(s)?
> As mentioned upthread, it's a significant change in behavior that
> should have separate consideration and reference in the commit log. I
> really don't think it should be buried in the 5th paragraph of an
> "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the
> changes of btree are arguably independent of the planner changes, as
> the btree changes improve performance even if we ignore that it
> implements strict result ordering.

I'm not going to break out the planner changes, because they're *not*
independent in any way. You could say the same thing about practically
any work that changes the planner. They're "buried" in the 5th
paragraph of the commit message. If an interested party can't even
read that far to gain some understanding of a legitimately complicated
piece of work such as this, I'm okay with that.

> An independent thought when reviewing the finaltup / HIKEY scan
> optimization parts of this patch:
> The 'use highkey to check for next page's data range' optimization is
> useful, but can currently only be used for scans that go to the right.
> Could we use a similar optimization in the inverse direction if we
> marked the right page of a split with "the left page has a HIKEY based
> on this page's (un)truncated leftmost value" or "left page's HIKEY was
> truncated to N key attributes"? It'd give one bit of information,
> specifically that it does (or does not) share some (or all) key
> attributes with this page's minimum value, which allows for earlier
> scan termination in boundary cases.

That would have to be maintained in all sorts of different places in
nbtree. And could be broken at any time by somebody inserting a
non-pivot tuple before every existing one on the leaf page. Doesn't
seem worth it to me.

If we were to do something like this then it would be discussed as
independent work. It's akin to adding a low key, which could be used
in several different places. As you say, it's totally independent.

> > +++ b/src/include/access/nbtree.h
> > +#define SK_BT_RDDNARRAY	0x0004	/* redundant in array preprocessing */
>
> I think "made redundant" is more appropriate than just "redundant";
> the array key is not itself redundant in array preprocessing (indeed
> we actually work hard on that key during preprocessing to allow us to
> mark it redundant)

Meh. I did it that way to fit under 78 chars while staying on the same
line. I don't think that it matters.

> I think this was mentioned before, but can't we have an argument or
> version of _bt_checkkeys that makes it not advance the array keys, so
> that we can keep this optimization? Or, isn't that now
> _bt_check_compare?

As I said to Heikki, thinking about this some more is on my todo list.
I mean the way that this worked had substantial revisions on HEAD
right before I posted v9. v9 was only to fix the bit rot that that
caused.

> For an orderlines table with 1000s of lines per order, we would
> greatly benefit from a query `order_id = ANY ('{1, 3, 5}')` that
> handles scan keys more efficiently; the optimization which is being
> disabled here.

The way that these optimizations might work with the mechanism from
the patch isn't some kind of natural extension to what's there
already. We'll need new heuristics to not waste cycles. Applying all
of the optimizations together just isn't trivial, and it's not yet
clear what really makes sense. Combining the two optimizations more or
less adds another dimension of complexity.

> > +++ b/src/backend/access/nbtree/nbtutils.c
> > _bt_preprocess_array_keys(IndexScanDesc scan)
> The _bt_preprocess_array_keys code is now broken due to type
> confusion, as it assumes there is only one array subtype being
> compared per attribute in so.orderProcs.

I've been aware of this for some time, but didn't think that it was
worth bringing up before I had a solution to present...

> I'm also concerned about the implications of this in
> _bt_binsrch_array_skey, as this too assumes the same compare operator
> is always used for all array operations on each attribute. We may need
> one so->orderProcs entry for each array key, but at least one per
> sk_subtype of each array key.

...since right now it's convenient to make so->orderProcs have a 1:1
correspondence with index key columns

> I also notice that the merging of values doesn't seem to be applied
> optimally with mixed typed array operations: num = int[] AND num =
> bigint[] AND num = int[] doesn't seem to merge the first and last
> array ops. I'm also concerned about being (un)able to merge

...which ought to work and be robust, once the cross-type support is
in place. That is, it should work once we really can assume that there
really must be exactly one so->orderProcs entry per equality-strategy
scan key in all cases -- including in highly obscure corner cases
involving a mix of cross-type 

Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

2024-01-15 Thread Kirk Wolak
On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson  wrote:

> > On 15 Jan 2024, at 07:24, Kirk Wolak  wrote:
>
> >   You have a commit [1] that MIGHT fix this.
> > I have a script that recreates the problem, using random data in pg_temp.
> > And a nested cursor.
>
> Running your reproducer script in a tight loop for a fair bit of time on
> the
> v16 HEAD I cannot see any memory growth, so it's plausible that the
> upcoming
> 16.2 will work better in your environment.
>

Okay, I took the latest source off of git (17devel) and got it to work
there in a VM.

It appears this issue is fixed.  It must have been related to the issue
originally tagged.

Thanks!


Re: Synchronizing slots from primary to standby

2024-01-15 Thread Peter Smith
Here are some review comments for patch v61-0002

==
doc/src/sgml/logical-replication.sgml

1.
+  
+   Examples: logical replication failover

The current documentation structure (after the patch is applied) looks
like this:

30.1. Publication
30.2. Subscription
30.2.1. Replication Slot Management
30.2.2. Examples: Set Up Logical Replication
30.2.3. Examples: Deferred Replication Slot Creation
30.2.4. Examples: logical replication failover

I don't think it is ideal.

Firstly, I think this new section is not just "Examples:"; it is more
like instructions for steps to check if a successful failover is
possible. IMO call it something like "Logical Replication Failover" or
"Replication Slot Failover".

Secondly, I don't think this new section strictly belongs underneath
the "Subscription" section anymore because IMO it is just as much
about the promotion of the publications. Now that you are adding this
new (2nd) section about slots, I think the whole structure of this
document should be changed like below:

SUGGESTION #1 (make a new section 30.3 just for slot-related topics)

30.1. Publication
30.2. Subscription
30.2.1. Examples: Set Up Logical Replication
30.3. Logical Replication Slots
30.3.1. Replication Slot Management
30.3.2. Examples: Deferred Replication Slot Creation
30.3.3. Logical Replication Failover

~

SUGGESTION #2 (keep the existing structure, but give the failover its
own new section 30.3)

30.1. Publication
30.2. Subscription
30.2.1. Replication Slot Management
30.2.2. Examples: Set Up Logical Replication
30.2.3. Examples: Deferred Replication Slot Creation
30.3 Logical Replication Failover

~

SUGGESTION #2a (and maybe later you can extract some of the failover
examples further)

30.1. Publication
30.2. Subscription
30.2.1. Replication Slot Management
30.2.2. Examples: Set Up Logical Replication
30.2.3. Examples: Deferred Replication Slot Creation
30.3 Logical Replication Failover
30.3.1. Examples: Checking if failover ready

~~~

2.
+   
+In a logical replication setup, if the publisher server is also the primary
+server of the streaming replication, the logical slots on the
primary server
+can be synchronized to the standby server by specifying
failover = true
+when creating the subscription. Enabling failover ensures a seamless
+transition of the subscription to the promoted standby, allowing it to
+subscribe to the new primary server without any data loss.
+   

I was initially confused by the wording. How about like below:

SUGGESTION
When the publisher server is the primary server of a streaming
replication, the logical slots on that primary server can be
synchronized to the standby server by specifying failover =
true when creating subscriptions for those publications.
Enabling failover ensures a seamless transition of those subscriptions
after the standby is promoted. They can continue subscribing to
publications now on the new primary server without any data loss.

~~~

3.
+   
+However, the replication slots are copied asynchronously, which
means it's necessary
+to confirm that replication slots have been synced to the standby server
+before the failover happens. Additionally, to ensure a successful failover,
+the standby server must not lag behind the subscriber. To confirm
+that the standby server is ready for failover, follow these steps:
+   

Minor rewording

SUGGESTION
Because the slot synchronization logic copies asynchronously, it is
necessary to confirm that replication slots have been synced to the
standby server before the failover happens. Furthermore, to ensure a
successful failover, the standby server must not be lagging behind the
subscriber. To confirm that the standby server is indeed ready for
failover, follow these 2 steps:

~~~

4.
The instructions said "follow these steps", so the next parts should
be rendered as 2 "steps" (using <procedure> markup?)

SUGGESTION (show as steps 1,2  and also some minor rewording of the
step heading)

1. Confirm that all the necessary logical replication slots have been
synced to the standby server.
2. Confirm that the standby server is not lagging behind the subscribers.

~~~

5.
+   
+Check if all the necessary logical replication slots have been synced to
+the standby server.
+   

SUGGESTION
Confirm that all the necessary logical replication slots have been
synced to the standby server.

~~~

6.
+ 
+  
+   On logical subscriber, fetch the slot names that should be synced to the
+   standby that we plan to promote.

SUGGESTION
Firstly, on the subscriber node, use the following SQL to identify the
slot names that should be...

~~~

7.
+
+test_sub=# SELECT
+   array_agg(slotname) AS slots
+   FROM
+   ((
+   SELECT r.srsubid AS subid, CONCAT('pg_' || srsubid || '_sync_' || srrelid || '_' || ctl.system_identifier) AS slotname
+   FROM pg_control_system() 

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-15 Thread torikoshia

On 2024-01-16 00:17, Alexander Korotkov wrote:
On Mon, Jan 15, 2024 at 8:44 AM Masahiko Sawada wrote:


On Mon, Jan 15, 2024 at 8:21 AM Alexander Korotkov wrote:

>
> On Sun, Jan 14, 2024 at 10:35 PM Masahiko Sawada wrote:
> > Thank you for updating the patch. Here are two comments:
> >
> > ---
> > +   if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED &&
> > +   cstate->num_errors > 0)
> > +   ereport(WARNING,
> > +   errmsg("%zd rows were skipped due to data type 
incompatibility",
> > +  cstate->num_errors));
> > +
> > /* Done, clean up */
> > error_context_stack = errcallback.previous;
> >
> > If a malformed input is not the last data, the context message seems odd:
> >
> > postgres(1:1769258)=# create table test (a int);
> > CREATE TABLE
> > postgres(1:1769258)=# copy test from stdin (save_error_to none);
> > Enter data to be copied followed by a newline.
> > End with a backslash and a period on a line by itself, or an EOF signal.
> > >> a
> > >> 1
> > >>
> > 2024-01-15 05:05:53.980 JST [1769258] WARNING:  1 rows were skipped
> > due to data type incompatibility
> > 2024-01-15 05:05:53.980 JST [1769258] CONTEXT:  COPY test, line 3: ""
> > COPY 1
> >
> > I think it's better to report the WARNING after resetting the
> > error_context_stack. Or is a WARNING really appropriate here? The
> > v15-0001-Make-COPY-FROM-more-error-tolerant.patch[1] uses NOTICE but
> > the v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch[2] changes it to
> > WARNING without explanation.
>
> Thank you for noticing this.  I think NOTICE is more appropriate here.
> There is nothing to "worry" about: the user asked to ignore the errors
> and we did.  And yes, it doesn't make sense to use the last line as
> the context.  Fixed.
>
> > ---
> > +-- test missing data: should fail
> > +COPY check_ign_err FROM STDIN WITH (save_error_to none);
> > +1  {1}
> > +\.
> >
> > We might want to cover the extra data cases too.
>
> Agreed, the relevant test is added.

Thank you for updating the patch. I have one minor point:

+   if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED &&
+   cstate->num_errors > 0)
+   ereport(NOTICE,
+   errmsg("%zd rows were skipped due to data type incompatibility",
+  cstate->num_errors));
+

We can use errmsg_plural() instead.


Makes sense.  Fixed.


I have a question about the option values; do you think we need to
have another value of SAVE_ERROR_TO option to explicitly specify the
current default behavior, i.e. not accept any error? With the v4
patch, the user needs to omit SAVE_ERROR_TO option to accept errors
during COPY FROM. If we change the default behavior in the future,
many users will be affected and probably end up changing their
applications to keep the current default behavior.


Valid point.  I've implemented the handling of CopySaveErrorToChoice
in a similar way to CopyHeaderChoice.

Please, check the revised patch attached.
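
For context, usage under the revised patch would look roughly like
this (the option values are this thread's working names, not released
syntax):

-- Sketch only: an explicit value for today's default behavior lets
-- applications pin it, alongside the error-skipping mode.
COPY check_ign_err FROM STDIN WITH (save_error_to error); -- reject bad rows
COPY check_ign_err FROM STDIN WITH (save_error_to none);  -- skip bad rows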


Thanks for updating the patch!

Here is a minor comment:


+/*
+ * Extract a defGetCopySaveErrorToChoice value from a DefElem.
+ */


Should be Extract a "CopySaveErrorToChoice"?


BTW I'm thinking we should add a column to pg_stat_progress_copy that 
counts soft errors. I'll suggest this in another thread.



--
Regards,
Alexander Korotkov


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Add PQsendSyncMessage() to libpq

2024-01-15 Thread Michael Paquier
On Mon, Jan 15, 2024 at 10:01:59AM +0100, Jelte Fennema-Nio wrote:
> Error message should be "second SELECT" not "first SELECT". Same note
> for the error message in the third pipeline, where it still says
> "second SELECT".
>
> same issue: s/first/second/g (and s/second/third/g for the existing
> part of the test).

Ugh, yes.  The note in the test was wrong.  Thanks for
double-checking.
--
Michael




Re: make pg_ctl more friendly

2024-01-15 Thread Nathan Bossart
+   POSTMASTER_RECOVERY_SHUTDOWN,

Perhaps this should be POSTMASTER_SHUTDOWN_IN_RECOVERY to match the state
in the control file?

+   case POSTMASTER_RECOVERY_SHUTDOWN:
+   print_msg(_("PITR shutdown\n"));
+   print_msg(_("update configuration for startup 
again if needed\n"));
+   break;

I'm not sure I agree that this is a substantially friendlier message.  From
a quick skim of the thread, it seems like you want to avoid sending a scary
error message if Postgres was intentionally shut down while in recovery.
If I got this particular message, I think I would be worried that something
went wrong during my point-in-time restore, and I'd be scrambling to figure
out what configuration this message wants me to update.

If I'm correctly interpreting the intent here, it might be worth fleshing
out the messages a bit more.  For example, instead of "PITR shutdown,"
perhaps we could say "shut down while in recovery."  And maybe we should
point to the specific settings in the latter message.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [17] CREATE SUBSCRIPTION ... SERVER

2024-01-15 Thread Jeff Davis
On Mon, 2024-01-15 at 15:53 -0500, Joe Conway wrote:
> I took a quick scan through the patch. The only thing that jumped out
> at 
> me was that it seems like it might make sense to use 
> quote_literal_cstr() rather than defining your own
> appendEscapedValue() 
> function?

The rules are slightly different. Libpq expects a connection string to
escape only single-quote and backslash, and the escape character is
always backslash:

https://www.postgresql.org/docs/16/libpq-connect.html#LIBPQ-CONNSTRING-KEYWORD-VALUE

quote_literal_cstr() has more complicated rules. If there's a backslash
anywhere in the string, it uses the E'' form. If it encounters a
backslash it escapes it with a backslash, but if it encounters a
single quote it escapes it with a single quote. See:

https://www.postgresql.org/docs/16/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
https://www.postgresql.org/docs/16/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-ESCAPE
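
To make the difference concrete, the libpq rule is small enough to
sketch (an illustration of the rules above, not the patch's actual
appendEscapedValue()):

/*
 * Sketch of libpq keyword/value escaping: wrap the value in single
 * quotes and backslash-escape only backslash and single quote.  Not
 * the patch's actual appendEscapedValue() implementation.
 */
static void
append_connstr_value(StringInfo buf, const char *val)
{
    appendStringInfoChar(buf, '\'');
    for (const char *p = val; *p != '\0'; p++)
    {
        if (*p == '\\' || *p == '\'')
            appendStringInfoChar(buf, '\\');
        appendStringInfoChar(buf, *p);
    }
    appendStringInfoChar(buf, '\'');
}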

I'll include some tests and a better comment for it in the next patch
set.

Regards,
Jeff Davis





Re: Fixing backslash dot for COPY FROM...CSV

2024-01-15 Thread Robert Haas
On Mon, Dec 18, 2023 at 3:36 PM Daniel Verite  wrote:
> PFA a patch that attempts to fix the bug that \. on a line
> by itself is handled incorrectly by COPY FROM ... CSV.
> This issue has been discussed several times previously,
> for instance in [1] and [2], and mentioned in the
> doc for \copy in commit 42d3125.

Those links unfortunately seem not to be entirely specific to this
issue. Other, related things seem to be discussed there, and it's not
obvious that everyone agrees on what to do, or really that anyone
agrees on what to do. The best link that I found for this exact issue
is 
https://www.postgresql.org/message-id/e1tdnvq-0001ju...@wrigleys.postgresql.org
but the thread isn't very conclusive and is so old that any
conclusions reached then might no longer be considered valid today.

And I guess the reason I mention is that, supposing your patch were
technically perfect in every way, would everyone agree that it ought
to be committed? If Alice is a user with a CSV file that might contain
\. on a line by itself within a quoted CSV field, then Alice is
currently sad because she can't necessarily load all of her CSV files.
The patch would fix that, and make her happy. On the other hand, if
Bob is a user with a CSV-ish file that definitely doesn't contain \.
on a line by itself within a quoted CSV field but might have been
truncated in the middle of a quoted field, maybe Bob will be sad if
this patch gets committed, because he will no longer be able to append
\n\.\n to whatever junk he's got in the file and let the server sort
out whether to throw an error.

I have to admit that it seems more likely that there are people in the
world with Alice's problem rather than people in the world with Bob's
problem. We'd probably make more people happy with the change than
sad. But it is a definitional change, I think, and that's a little
scary, and maybe somebody will think that's a reason why we should
change nothing here. Part of my hesitancy, I suppose, is that I don't
understand why we even have this strange convention of making \.
terminate the input in the first place -- I mean, why wouldn't that be
done in some kind of out-of-band way, rather than including a special
marker in the data?
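
For the record, Alice's case reduces to something like this
(hypothetical table and data):

-- Hypothetical reproduction of Alice's problem: \. alone on a line
-- inside a quoted, multi-line CSV field is taken as the end-of-data
-- marker instead of field content, so the value is silently cut short.
CREATE TABLE alice (doc text);
COPY alice FROM STDIN WITH (FORMAT csv);
"first line
\.
last line"
\.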

I can't help feeling a bit nervous about this first documentation hunk:

--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -761,11 +761,7 @@ COPY count
 format, \., the end-of-data marker, could also appear
 as a data value.  To avoid any misinterpretation, a \.
 data value appearing as a lone entry on a line is automatically
-quoted on output, and on input, if quoted, is not interpreted as the
-end-of-data marker.  If you are loading a file created by another
-application that has a single unquoted column and might have a
-value of \., you might need to quote that value in the
-input file.
+quoted on output.




It doesn't feel right to me to just replace all of this text with
nothing. That leaves us documenting only the behavior on output,
whereas the previous text documents both the output behavior (we quote
it) and the input behavior (it has to be quoted to avoid being taken
as the EOF marker).

 /*
- * In CSV mode, we only recognize \. alone on a line.  This is because
- * \. is a valid CSV data value.
+ * In CSV mode, backslash is a normal character.
  */
-if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line))
+if (c == '\\' && !cstate->opts.csv_mode)

I don't think that the update comment accurately describes the
behavior. If I understand correctly what you're trying to fix, I'd
expect the new behavior to be that we only recognize \. alone on a
line and even then only if we're not inside a quoting string, but
that's not what the revised comment says. Instead, it claims that
backslash is just a normal character, but if that were true, the whole
if-statement wouldn't exist at all, since its purpose is to provide
special-handling for backslashes -- and indeed the patch does not
change that, since the if-statement is still looking for a backslash
and doing something special if one is found.

Hmm. Looking at the rest of the patch, it seems like you're removing
the logic that prevents us from interpreting

\. lksdghksdhgjskdghjs

as an end-of-file while in CSV mode. But I would have thought based on
what problem you're trying to fix that you would have wanted to keep
that logic and further restrict it so that it only applies when not
within a quoted string.
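
In other words, something along these lines (a sketch; the variable
names are assumptions):

/*
 * Sketch of the restriction described above: keep treating \. as the
 * end-of-data marker in CSV mode, but only at the start of a line
 * that is not inside a quoted field.
 */
if (c == '\\' &&
    (!cstate->opts.csv_mode ||
     (first_char_in_line && !in_quoted_field)))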

Maybe I'm misunderstanding what bug you're trying to fix?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-15 Thread Melanie Plageman
On Mon, Jan 15, 2024 at 12:29:57PM -0500, Robert Haas wrote:
> On Fri, Jan 12, 2024 at 4:05 PM Melanie Plageman
>  wrote:
> > Yea, that works for now. I mean, I think the way we should do it is
> > update the FSM in lazy_scan_noprune(), but, for the purposes of this
> > patch, yes. has_lpdead_items output parameter seems fine to me.
> 
> Here's v2.
> 
> It's not exactly clear to me why you want to update the FSM in
> lazy_scan_[no]prune(). When I first looked at v7-0004, I said to
> myself "well, this is dumb, because now we're just duplicating
> something that is common to both cases". But then I realized that the
> logic was *already* duplicated in lazy_scan_heap(), and that by
> pushing the duplication down to lazy_scan_[no]prune(), you made the
> two copies identical rather than, as at present, having two copies
> that differ from each other. Perhaps that's a sufficient reason to
> make the change all by itself, but it seems like what would be really
> good is if we only needed one copy of the logic. I don't know if
> that's achievable, though.

Upthread in v4-0004, I do have a version which combines the two FSM
updates for lazy_scan_prune() and lazy_scan_noprune().

If you move the VM updates from lazy_scan_heap() into lazy_scan_prune(),
then there is little difference between the prune and no prune cases in
lazy_scan_heap(). The main difference is that, when lazy_scan_noprune()
returns true, you are supposed to avoid calling lazy_scan_new_or_empty()
again and have to avoid calling lazy_scan_prune(). I solved this with a
local variable "do_prune" and checked it before calling
lazy_scan_new_or_empty() and lazy_scan_prune().

I moved away from this approach because it felt odd to test "do_prune"
before calling lazy_scan_new_or_empty(). Though, perhaps it doesn't
matter if we just call lazy_scan_new_or_empty() again. We do that when
lazy_scan_noprune() returns false anyway.
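
In rough pseudocode, the shape described above might look like this
(hand-waved names and arguments, not the actual v4-0004 patch):

/*
 * Sketch of the flow: lazy_scan_noprune() returning true means the
 * page was fully handled without a cleanup lock, so we must skip both
 * lazy_scan_new_or_empty() and lazy_scan_prune().
 */
do_prune = true;
if (!got_cleanup_lock)
    do_prune = !lazy_scan_noprune(vacrel, buf, blkno, page,
                                  &has_lpdead_items);

if (do_prune)
{
    if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, vmbuffer))
        continue;           /* page was new or empty; next block */
    lazy_scan_prune(vacrel, buf, blkno, page, vmbuffer,
                    &has_lpdead_items);
}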

I thought perhaps we could circumvent this issue by moving
lazy_scan_new_or_empty() into lazy_scan_prune(). But, this seemed wrong
to me because, as it stands now, if lazy_scan_new_or_empty() returns
true, we would want to skip the VM update code and FSM update code in
lazy_scan_heap() after lazy_scan_prune(). That would mean
lazy_scan_prune() would have to return something to indicate that.

> More generally, I somewhat question whether it's really right to push
> things from lazy_scan_heap() and into lazy_scan_[no]prune(), mostly
> because of the risk of having to duplicate logic. I'm not even really
> convinced that it's good for us to have both of those functions.
> There's an awful lot of code duplication between them already. Each
> has a loop that tests the status of each item and then, for LP_USED,
> switches on htsv_get_valid_status or HeapTupleSatisfiesVacuum. It
> doesn't seem trivial to unify all that code, but it doesn't seem very
> nice to have two copies of it, either.

I agree that the duplicated logic in both places is undesirable.
I think the biggest issue with combining them would be that when
lazy_scan_noprune() returns false, it needs to go get a cleanup lock and
then invoke the logic of lazy_scan_prune() for all the tuples on the
page. This seems a little hard to get right in a single function.

Then there are more trivial-to-solve differences like invoking
heap_tuple_should_freeze() and not bothering with the whole
heap_prepare_freeze_tuple() if we only have the share lock.

I am willing to try and write a version of this to see if it is better.
I will say, though, my agenda was to eventually push the actions taken
in the loop in lazy_scan_prune() into heap_page_prune() and
heap_prune_chain().

> From 32684f41d1dd50f726aa0dfe8a5d816aa5c42d64 Mon Sep 17 00:00:00 2001
> From: Robert Haas 
> Date: Mon, 15 Jan 2024 12:05:52 -0500
> Subject: [PATCH v2] Be more consistent about whether to update the FSM while
>  vacuuming.

Few small review comments below, but, overall, LGTM.

> Previously, when lazy_scan_noprune() was called and returned true, we would
> update the FSM immediately if the relation had no indexes or if the page
> contained no dead items. On the other hand, when lazy_scan_prune() was
> called, we would update the FSM if either of those things was true or
> if index vacuuming was disabled. Eliminate that behavioral difference by
> considering vacrel->do_index_vacuuming in both cases.
> 
> Also, instead of having lazy_scan_noprune() make the decision
> internally and pass it back to the caller via *recordfreespace, just
> have it pass the number of LP_DEAD items back to the caller, and then

It doesn't pass the number of LP_DEAD items back to the caller. It
passes a boolean.

> let the caller make the decision.  That seems less confusing, since
> the caller also decides in the lazy_scan_prune() case; moreover, this
> way, the whole test is in one place, instead of spread out.

Perhaps it isn't important, but I find this wording confusing. You
mention lazy_scan_prune() and then mention that "the whole 

Re: [17] CREATE SUBSCRIPTION ... SERVER

2024-01-15 Thread Joe Conway

On 1/12/24 20:17, Jeff Davis wrote:

On Fri, 2024-01-05 at 16:11 +0530, Ashutosh Bapat wrote:

I don't think we need to add a test for every FDW. E.g. adding a test
in file_fdw would be pointless. But postgres_fdw is special. The test
could further create a foreign table ftab_foo on subscriber
referencing foo on publisher and then compare the data from foo and
ftab_foo to make sure that the replication is happening. This will
serve as a good starting point for replicated tables setup in a
sharded cluster.



Attached updated patch set with added TAP test for postgres_fdw, which
uses a postgres_fdw server as the source for subscription connection
information.

I think 0004 needs a bit more work, so I'm leaving it off for now, but
I'll bring it back in the next patch set.


I took a quick scan through the patch. The only thing that jumped out at 
me was that it seems like it might make sense to use 
quote_literal_cstr() rather than defining your own appendEscapedValue() 
function?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Custom explain options

2024-01-15 Thread Konstantin Knizhnik


On 15/01/2024 5:08 pm, Tomas Vondra wrote:


My patch does not care about prefetching internal index pages. Yes, it's
a limitation, but my assumption is the internal pages are maybe 0.1% of
the index, and typically very hot / cached. Yes, if the index is not
used very often, this may be untrue. But I consider it a possible future
improvement, for some other patch. FWIW there's a prefetching patch for
inserts into indexes (which only prefetches just the index leaf pages).


We have to prefetch pages at the height-1 level (parents of leaf pages)
for IOS, because otherwise the prefetch pipeline is broken at each
transition to the next leaf page.
When we start with a new leaf page we have to fill the prefetch ring
from scratch, which certainly has a negative impact on performance.




Not sure I understand what this is about. The patch simply calls the
index AM function index_getnext_tid() enough times to fill the prefetch
queue. It does not prefetch the next index leaf page, it however does
prefetch the heap pages. It does not "stall" at the boundary of the
index leaf page, or something.


Ok, now I fully understand your approach. Looks really elegant and works 
for all indexes.

There is still an issue with IOS and seqscan.






Another challenge is how far we should prefetch (as far as I
understand, both your approach and ours use a dynamically extended
prefetch window)


By dynamic extension of prefetch window you mean the incremental growth
of the prefetch distance from 0 to effective_io_concurrency?


Yes


I don't
think there's a better solution.


I tried one more solution: propagate information about the expected
number of fetched rows to the AM. Based on this information it is
possible to choose a proper prefetch distance.
Certainly it is not quite precise: we can scan a large number of rows
but filter only a few of them. This is why this approach was not
committed in Neon.
But I still think that using statistics to determine the prefetch
window is not such a bad idea. Maybe it needs better thinking.
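
A sketch of that idea (assumed names, not Neon's or the patch's actual
code):

/*
 * Sketch only: ramp the prefetch distance up incrementally, but clamp
 * it by the optimizer's row estimate so that scans expected to fetch
 * few rows never build a large prefetch window.
 */
distance = Min(distance ? distance * 2 : 1, effective_io_concurrency);
if (plan_rows_estimate > 0)
    distance = Min(distance, (int) plan_rows_estimate);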





There might be additional information that we could consider (e.g.
expected number of rows for the plan, earlier executions of the scan,
...) but each of these has a failure mode.


I wrote the reply above before reading the next fragment :)
So I have already tried it.


I haven't tried with pgvector, but I don't see why my patch would not
work for all index AMs that can return a TID.



Yes, I agree. But it will be efficient only if getting the next TID is
cheap, i.e. it is located on the same leaf page.






I have also tried to implement an alternative approach to prefetch,
based on access statistics.
It comes from the use case of a seqscan of a table with large toasted
records, where for each record we have to extract its TOAST data.
That is done using a standard index scan, but unfortunately index
prefetch doesn't help much here: there is usually just one TOAST
segment, so prefetch has no chance to do something useful. But as long
as heap records are accessed sequentially, there is a good chance that
the TOAST table will also be accessed mostly sequentially. So we can
simply count the number of sequential requests to each relation, and
if the ratio of seq/rand accesses is above some threshold we can
prefetch the next pages of this relation. This is a really universal
approach, but ... it works mostly for the TOAST table.


Are you're talking about what works / doesn't work in neon, or about
postgres in general?

I'm not sure what you mean by "one TOAST segment" and I'd also guess
that if both tables are accessed mostly sequentially, the read-ahead
will do most of the work (in postgres).


Yes, I agree: in the case of vanilla Postgres the OS will do
read-ahead. But not in Neon.

By "one TOAST segment" I mean one TOAST record, ~2kB.



It's probably true that as we do a separate index scan for each
TOAST-ed value, that can't really ramp up the prefetch distance fast
enough. Maybe we could have a mode where we start with the full distance?


Sorry, I do not understand. Especially in this case a large prefetch
window is undesired.
Most records fit in 2kB, so we need to fetch only one head (TOAST)
record per TOAST index search.




This is exactly the difference. In Neon such an approach doesn't work.
Each backend maintains its own prefetch ring, and if a prefetched page
is not actually received, the whole pipeline is lost.
I.e. a backend prefetched pages 1,5,10, then it needs to read page 2.
So it has to consume the responses for 1,5,10 and issue another request
for page 2. Instead of improving speed we are just doing extra work.
So each backend should prefetch only those pages which it is actually
going to read.
This is why the prefetch approach used in Postgres, for example for
parallel bitmap heap scan, doesn't work for Neon.
If you do `posix_fadvise` then the prefetched page is placed in the OS
cache and can be used by any parallel worker.
But in Neon each parallel worker should be given its own range of pages
to scan, and prefetch only those pages.


I still don't quite see/understand the difference. 

Re: cleanup patches for incremental backup

2024-01-15 Thread Matthias van de Meent
On Mon, 15 Jan 2024 at 17:58, Robert Haas  wrote:
>
> On Sat, Jan 13, 2024 at 1:00 PM Alexander Lakhin  wrote:
> > I've found one more typo in the sgml:
> > summarized_pid
> > And one in a comment:
> > sumamry
> >
> > A trivial fix is attached.
>
> Thanks, committed.

Off-list I was notified that the new WAL summarizer process was not
yet added to the glossary, so PFA a patch that does that.
In passing, it also adds "incremental backup" to the glossary, and
updates the documented types of backends in monitoring.sgml with the
new backend type, too.

Kind regards,

Matthias van de Meent.


v1-0001-incremental-backups-Add-new-items-to-glossary-mon.patch
Description: Binary data


Re: Forbid the use of invalidated physical slots in streaming replication.

2024-01-15 Thread Robert Haas
On Thu, Dec 7, 2023 at 8:00 AM Zhijie Hou (Fujitsu)
 wrote:
> After looking closer, it seems this behavior started from 15f8203 which 
> introduced the
> ReplicationSlotInvalidationCause 'invalidated', after that we check the 
> invalidated enum
> in Xmin/Lsn computation function.

Adding Andres in Cc, as that was his commit.

It's not entirely clear to me how this feature was intended to
interact with physical replication slots. I found seemingly-relevant
documentation in two places:

https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-MAX-SLOT-WAL-KEEP-SIZE
https://www.postgresql.org/docs/current/view-pg-replication-slots.html

In the latter, it says "unreserved means that the slot no longer
retains the required WAL files and some of them are to be removed at
the next checkpoint. This state can return to reserved or extended."
But if a slot becomes invalid in such a way that it cannot return to a
valid state later, then this isn't accurate.

I have a feeling that the actual behavior here has evolved and the
documentation hasn't kept up. And I wonder whether we need a more
comprehensive explanation of the intended behavior in this section:

https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-SLOTS
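
For reference, the states in question can be inspected directly
(column set as of v16):

-- Inspect slot state as discussed above (pg_replication_slots as of
-- v16; 'conflicting' is only meaningful for logical slots).
SELECT slot_name, slot_type, wal_status, safe_wal_size, conflicting
FROM pg_replication_slots;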

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: ALTER TYPE OWNER fails to recurse to multirange

2024-01-15 Thread Robert Haas
On Mon, Jan 15, 2024 at 2:28 PM Tom Lane  wrote:
> I'm reasoning by analogy to array types, which are automatically
> created and automatically updated to keep the same ownership
> etc. properties as their base type.  To the extent that multirange
> types don't act exactly like that, I say it's a bug/oversight in the
> multirange patch.  So I think this is a backend bug, not a pg_dump
> bug.

Oh...

Well, I guess maybe I'm just clueless. I thought that the range and
multirange were two essentially independent objects being created by
the same command. But I haven't studied the implementation so maybe
I'm completely wrong.
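
Either way, the behavior is easy to inspect (hypothetical names):

-- Hypothetical check of the recursion being discussed: the implicitly
-- created multirange (floatmultirange) should follow its range type's
-- owner, the same way array types do.
CREATE TYPE floatrange AS RANGE (subtype = float8);
ALTER TYPE floatrange OWNER TO regress_owner;
SELECT t.typname, r.rolname
  FROM pg_type t JOIN pg_roles r ON r.oid = t.typowner
 WHERE t.typname IN ('floatrange', 'floatmultirange');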

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: gai_strerror() is not thread-safe on Windows

2024-01-15 Thread Robert Haas
On Wed, Dec 6, 2023 at 8:45 PM Kyotaro Horiguchi
 wrote:
> > So I think we should just hard-code the error messages in English and
> > move on.  However, English is my language so perhaps I should abstain
> > and leave it to others to decide how important that is.
>
> I also think that would be a good way.

Considering this remark from Kyotaro Horiguchi, I think the
previously-posted patch could be committed.

Thomas, do you plan to do that, or are there outstanding issues here?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Built-in CTYPE provider

2024-01-15 Thread Jeff Davis
On Mon, 2024-01-15 at 15:30 +0100, Daniel Verite wrote:
> Concerning the target category_test, it produces failures with
> versions of ICU with Unicode < 15. The first one I see with Ubuntu
> 22.04 (ICU 70.1) is:

...

> I find these results interesting because they tell us what contents
> can break regexp-based check constraints on upgrades.

Thank you for collecting and consolidating this information.

> But about category_test as a pass-or-fail kind of test, it can only be
> used when the Unicode version in ICU is the same as in Postgres.

The test has a few potential purposes:

1. To see if there is some error in parsing the Unicode files and
building the arrays in the .h file. For instance, let's say the perl
parser I wrote works fine on the Unicode 15.1 data file, but does
something wrong on the 16.0 data file: the test would fail and we'd
investigate. This is the most important reason for the test.

2. To notice any quirks between how we interpret Unicode vs how ICU
does.

3. To help see "interesting" differences between different Unicode
versions.

For #1 and #2, the best way to test is by using a version of ICU that
uses the same Unicode version as Postgres. The one running
update-unicode can try to recompile with the right one for the purposes
of the test. NB: There might be no version of ICU where the Unicode
version exactly matches what we'd like to update to. In that case, we'd
need to use the closest version and do some manual validation that the
generated tables are sane.

For #3, that is also interesting information to know about, but it's
not directly actionable. As you point out, Unicode does not guarantee
that these properties are static forever, so regexes can change
behavior when we update Unicode for the next PG version. That is a much
lower risk than a collation change, but as you point out, is a risk for
regexes inside of a CHECK constraint. If a user needs zero risk of
semantic changes for regexes, the only option is "C". Perhaps there
should be a separate test target for this mode so that it doesn't exit
early?

(Note: case mapping has much stronger guarantees than the character
classification.)

I will update the README to document how someone running update-unicode
should interpret the test results.

Regards,
Jeff Davis





Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-01-15 Thread Matthias van de Meent
On Thu, 28 Dec 2023 at 18:28, Peter Geoghegan  wrote:
>
> On Sat, Dec 9, 2023 at 10:38 AM Peter Geoghegan  wrote:
> > Attached is v8, which pretty much rips all of this stuff out.
>
> Attached is v9, which I'm posting just to fix bitrot. The patch
> stopped cleanly applying against HEAD due to recent bugfix commit
> 7e6fb5da. No real changes here compared to v8.

I found 2 major issues; one correctness issue in the arraykey
processing code of _bt_preprocess_array_keys, and one assertion error
in _bt_advance_array_keys; both discussed in the relevant sections
below.

> Subject: [PATCH v9] Enhance nbtree ScalarArrayOp execution.
> [...]
> Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
> path keys: path keys on a multicolumn index path, with a SAOP clause on
> any attribute beyond the first/most significant attribute.  These cases
> are now all safe, so we go back to generating path keys without regard
> for the presence of SAOP clauses (just like with any other clause type).
> Also undo changes from follow-up bugfix commit a4523c5a, which taught
> the planner to produce alternative index paths without any low-order
> ScalarArrayOpExpr quals (making the SAOP quals into filter quals).
> We'll no longer generate these alternative paths, which can no longer
> offer any advantage over the index qual paths that we do still generate.

Can you pull these planner changes into their own commit(s)?
As mentioned upthread, it's a significant change in behavior that
should have separate consideration and reference in the commit log. I
really don't think it should be buried in the 5th paragraph of an
"Enhance nbtree ScalarArrayOp execution" commit. Additionally, the
changes of btree are arguably independent of the planner changes, as
the btree changes improve performance even if we ignore that it
implements strict result ordering.

An independent thought when reviewing the finaltup / HIKEY scan
optimization parts of this patch:
The 'use highkey to check for next page's data range' optimization is
useful, but can currently only be used for scans that go to the right.
Could we use a similar optimization in the inverse direction if we
marked the right page of a split with "the left page has a HIKEY based
on this page's (un)truncated leftmost value" or "left page's HIKEY was
truncated to N key attributes"? It'd give one bit of information,
specifically that it does (or does not) share some (or all) key
attributes with this page's minimum value, which allows for earlier
scan termination in boundary cases.

> +++ b/src/include/access/nbtree.h
> +#define SK_BT_RDDNARRAY	0x0004	/* redundant in array preprocessing */

I think "made redundant" is more appropriate than just "redundant";
the array key is not itself redundant in array preprocessing (indeed
we actually work hard on that key during preprocessing to allow us to
mark it redundant)

> +++ b/src/backend/access/nbtree/nbtsearch.c
>  * We skip this for the first page in the scan to evade the possible
> - * slowdown of the point queries.
> + * slowdown of point queries.  Never apply the optimization with a scan
> + * that uses array keys, either, since that breaks certain assumptions.
> + * (Our search-type scan keys change whenever _bt_checkkeys advances the
> + * arrays, invalidating any precheck.  Tracking all that would be 
> tricky.)

I think this was mentioned before, but can't we have an argument or
version of _bt_checkkeys that makes it not advance the array keys, so
that we can keep this optimization? Or, isn't that now
_bt_check_compare?
For an orderlines table with 1000s of lines per order, we would
greatly benefit from a query `order_id = ANY ('{1, 3, 5}')` that
handles scan keys more efficiently; the optimization which is being
disabled here.

> +++ b/src/backend/access/nbtree/nbtutils.c
> _bt_preprocess_array_keys(IndexScanDesc scan)
The _bt_preprocess_array_keys code is now broken due to type
confusion, as it assumes there is only one array subtype being
compared per attribute in so.orderProcs. Reproducer:
CREATE TABLE test AS
SELECT generate_series(1, 1, 1::bigint) num;
CREATE INDEX ON test (num); /* bigint typed */

SELECT num FROM test
WHERE num = ANY ('{1}'::smallint[])
  AND num = ANY ('{1}'::int[]) /* ensure matching lastEqualityArrayAtt,
lastOrderProc for next qual */
  AND num = ANY ('{65537}'::int[]); /* qual is broken due to
application of smallint compare operator on int values that do equal
mod 2^16, but do not equal in their own type */
 num
-----
   1
(1 row)

I'm also concerned about the implications of this in
_bt_binsrch_array_skey, as this too assumes the same compare operator
is always used for all array operations on each attribute. We may need
one so->orderProcs entry for each array key, but at least one per
sk_subtype of each array key.

I also notice that the merging of values doesn't seem to be applied
optimally with mixed typed array operations: num = int[] AND num =
bigint[] AND num = 

Re: ALTER TYPE OWNER fails to recurse to multirange

2024-01-15 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jan 15, 2024 at 1:27 PM Tom Lane  wrote:
>> That's pretty broken, isn't it?  joe would own the multirange if he'd
>> created the range to start with.  Even if you think the ownerships
>> ideally should be separable, this behavior causes existing pg_dump
>> files to restore incorrectly, because pg_dump assumes it need not emit
>> any commands about the multirange.

> I agree that pg_dump doing the wrong thing is bad, but the SQL example
> doesn't look broken if you ignore pg_dump.

I'm reasoning by analogy to array types, which are automatically
created and automatically updated to keep the same ownership
etc. properties as their base type.  To the extent that multirange
types don't act exactly like that, I say it's a bug/oversight in the
multirange patch.  So I think this is a backend bug, not a pg_dump
bug.

> I have a feeling that the
> source of the awkwardness here is that one SQL command is creating two
> objects, and unlike the case of a table and a TOAST table, one is not
> an implementation detail of the other or clearly subordinate to the
> other.

How is a multirange not subordinate to the underlying range type?
It can't exist without it, and we automatically create it without
any further information when you make the range type.  That smells
a lot like the way we handle array types.  The array behavior is of
very long standing and surprises nobody.

> But how does that prevent us from making pg_dump restore the
> ownership and permissions on each separately? If ownership is a
> problem, aren't permissions also?

Probably, and I wouldn't be surprised if we've also failed to make
multiranges follow arrays in the permissions department.  An
array type can't have an ACL of its own, IIRC.

regards, tom lane




Re: Oversight in reparameterize_path_by_child leading to executor crash

2024-01-15 Thread Tom Lane
Robert Haas  writes:
> I know this is on Tom's to-do list which makes me a bit reluctant to
> get too involved here, because certainly he knows this code better
> than I do, maybe better than anyone does, but on the other hand, we
> shouldn't leave server crashes unfixed for too long, so maybe I can do
> something to help at least with that part of it.

This is indeed on my to-do list, and I have every intention of
getting to it before the end of the month.  But if you want to
help push things along, feel free.

regards, tom lane




Re: ALTER TYPE OWNER fails to recurse to multirange

2024-01-15 Thread Robert Haas
On Mon, Jan 15, 2024 at 1:27 PM Tom Lane  wrote:
> That's pretty broken, isn't it?  joe would own the multirange if he'd
> created the range to start with.  Even if you think the ownerships
> ideally should be separable, this behavior causes existing pg_dump
> files to restore incorrectly, because pg_dump assumes it need not emit
> any commands about the multirange.

I agree that pg_dump doing the wrong thing is bad, but the SQL example
doesn't look broken if you ignore pg_dump. I have a feeling that the
source of the awkwardness here is that one SQL command is creating two
objects, and unlike the case of a table and a TOAST table, one is not
an implementation detail of the other or clearly subordinate to the
other. But how does that prevent us from making pg_dump restore the
ownership and permissions on each separately? If ownership is a
problem, aren't permissions also?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] LockAcquireExtended improvement

2024-01-15 Thread Robert Haas
Hello Jingxian Li!

I agree with you that this behavior seems surprising. I don't think
it's quite a bug, more of a limitation. However, I think it would be
nice to fix it if we can find a good way to do that.

On Wed, Nov 29, 2023 at 10:43 PM Jingxian Li  wrote:
> Transaction A already holds an n-mode lock on table test,
> that is, there are no locks held conflicting with the n-mode lock on table
> test.
> If then transaction A requests an m-mode lock on table test,
> as n's conflictTab covers m, it can be concluded that
> there are no locks conflicting with the requested m-mode lock.

This algorithm seems correct to me, but I think Andres is right to be
concerned about overhead. You're proposing to inject a call to
CheckLocalLockConflictTabCover() into the main code path of
LockAcquireExtended(), so practically every lock acquisition will pay
the cost of that function. And that function is not particularly
cheap: every call to LockHeldByMe is a hash table lookup. That sounds
pretty painful. If we could incur the overhead of this only when we
knew for certain that we would otherwise have to fail, that would be
more palatable, but doing it on every non-fastpath heavyweight lock
acquisition seems way too expensive.

Even aside from overhead, the approach the patch takes doesn't seem
quite right to me. As you noted, ProcSleep() has logic to jump the
queue if adding ourselves at the end would inevitably result in
deadlock, which is why your test case doesn't need to wait until
deadlock_timeout for the lock acquisition to succeed. But because that
logic happens in ProcSleep(), it's not reached in the NOWAIT case,
which means that it doesn't help any more once NOWAIT is specified. I
think that the right way to fix the problem would be to reach that
check even in the NOWAIT case, which could be done either by hoisting
some of the logic in ProcSleep() up into LockAcquireExtended(), or by
pushing the nowait flag down into ProcSleep() so that we can fail only
if we're definitely going to sleep. The former seems more elegant in
theory, but the latter looks easier to implement, at least at first
glance.

But the patch as proposed instead invents a new way of making the test
case work, not leveraging the existing logic and, I suspect, not
matching the behavior in all cases.

I also agree with Vignesh that a test case would be a good idea. It
would need to be an isolation test, since the regular regression
tester isn't powerful enough for this (at least, I don't see how to
make it work).
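
For what it's worth, a minimal isolation spec might look something like
the following (a sketch only; the table, step names, and lock modes are
illustrative). s1 takes ACCESS EXCLUSIVE, s2 queues behind it, and s1's
subsequent NOWAIT request is for a mode whose conflicts are already
covered by its ACCESS EXCLUSIVE lock, so it should succeed instead of
failing as it does today:

setup
{
    CREATE TABLE t_covered (a int);
}

teardown
{
    DROP TABLE t_covered;
}

session s1
step s1_ax     { BEGIN; LOCK TABLE t_covered IN ACCESS EXCLUSIVE MODE; }
step s1_nowait { LOCK TABLE t_covered IN EXCLUSIVE MODE NOWAIT; }
step s1_commit { COMMIT; }

session s2
step s2_as     { BEGIN; LOCK TABLE t_covered IN ACCESS SHARE MODE; }
step s2_commit { COMMIT; }

permutation s1_ax s2_as s1_nowait s1_commit s2_commit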

I hope that this input is helpful to you.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: postgres_fdw fails to see that array type belongs to extension

2024-01-15 Thread Tom Lane
David Geier  writes:
> On 12/27/23 18:38, Tom Lane wrote:
>> Hmm.  It seems odd that if an extension defines a type, the type is
>> listed as a member of the extension but the array type is not.
>> That makes it look like the array type is an externally-created
>> thing that happens to depend on the extension, when it's actually
>> part of the extension.  I'm surprised we've not run across other
>> misbehaviors traceable to that.

> Agreed.
> The attached patch just adds a 2nd dependency between the array type and 
> the extension, using recordDependencyOnCurrentExtension().

I don't like this patch too much: it fixes the problem in a place far
away from GenerateTypeDependencies' existing treatment of extension
dependencies, and relatedly to that, fails to update the comments
it has falsified.  I think we should do it more like the attached.
This approach means that relation rowtypes will also be explicitly
listed as extension members, which seems like it is fixing the same
sort of bug as the array case.
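
For a hypothetical extension type bar, the membership this is meant to
produce can be checked along these lines (illustrative query only):

-- After the fix, the implicit array type _bar should itself be recorded
-- as an extension member ('e' dependency), not merely as dependent on
-- the extension.
SELECT deptype
FROM pg_depend
WHERE classid = 'pg_type'::regclass
  AND objid = '_bar'::regtype
  AND refclassid = 'pg_extension'::regclass;
-- expected: deptype = 'e'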

I also noted that you'd not run check-world, because there's a test
case that changes output.  This is good though, because we don't need
to add any new test to prove it does what we want.

There's one big remaining to-do item, I think: experimentation with
pg_upgrade proves that a binary upgrade fails to fix the extension
membership of arrays/rowtypes.  That's because pg_dump needs to
manually reconstruct the membership dependencies, and it thinks that
it doesn't need to do anything for implicit arrays.  Normally the
point of that is that we want to exactly reconstruct the extension's
contents, but I think in this case we'd really like to add the
additional pg_depend entries even if they weren't there before.
Otherwise people wouldn't get the new behavior until they do a
full dump/reload.

I can see two ways we could do that:

* add logic to pg_dump

* modify ALTER EXTENSION ADD TYPE so that it automatically recurses
from a base type to its array type (and I guess we'd need to add
something for relation rowtypes and multiranges, too).

I think I like the latter approach because it's like how we
handle ownership: pg_dump doesn't emit any commands to explicitly
change the ownership of dependent types, either.  (But see [1].)
We could presumably steal some logic from ALTER TYPE OWNER.
I've not tried to code that here, though.

regards, tom lane

[1] https://www.postgresql.org/message-id/1580383.1705343264%40sss.pgh.pa.us

diff --git a/src/backend/catalog/pg_type.c b/src/backend/catalog/pg_type.c
index d7167108fb..5691be26a6 100644
--- a/src/backend/catalog/pg_type.c
+++ b/src/backend/catalog/pg_type.c
@@ -538,7 +538,7 @@ TypeCreate(Oid newTypeOid,
  * that means it doesn't need its own dependencies on owner etc.
  *
  * We make an extension-membership dependency if we're in an extension
- * script and makeExtensionDep is true (and isDependentType isn't true).
+ * script and makeExtensionDep is true.
  * makeExtensionDep should be true when creating a new type or replacing a
  * shell type, but not for ALTER TYPE on an existing type.  Passing false
  * causes the type's extension membership to be left alone.
@@ -598,7 +598,7 @@ GenerateTypeDependencies(HeapTuple typeTuple,
 	ObjectAddressSet(myself, TypeRelationId, typeObjectId);
 
 	/*
-	 * Make dependencies on namespace, owner, ACL, extension.
+	 * Make dependencies on namespace, owner, ACL.
 	 *
 	 * Skip these for a dependent type, since it will have such dependencies
 	 * indirectly through its depended-on type or relation.
@@ -618,11 +618,18 @@ GenerateTypeDependencies(HeapTuple typeTuple,
 
 		recordDependencyOnNewAcl(TypeRelationId, typeObjectId, 0,
  typeForm->typowner, typacl);
-
-		if (makeExtensionDep)
-			recordDependencyOnCurrentExtension(&myself, rebuild);
 	}
 
+	/*
+	 * Make extension dependency if requested.
+	 *
+	 * We used to skip this for dependent types, but it seems better to record
+	 * their extension membership explicitly; otherwise code such as
+	 * postgres_fdw's shippability test will be fooled.
+	 */
+	if (makeExtensionDep)
+		recordDependencyOnCurrentExtension(&myself, rebuild);
+
 	/* Normal dependencies on the I/O and support functions */
 	if (OidIsValid(typeForm->typinput))
 	{
diff --git a/src/test/modules/test_extensions/expected/test_extensions.out b/src/test/modules/test_extensions/expected/test_extensions.out
index 472627a232..9528448e09 100644
--- a/src/test/modules/test_extensions/expected/test_extensions.out
+++ b/src/test/modules/test_extensions/expected/test_extensions.out
@@ -53,7 +53,11 @@ Objects in extension "test_ext7"
  table ext7_table1
  table ext7_table2
  table old_table1
-(6 rows)
+ type ext7_table1
+ type ext7_table1[]
+ type ext7_table2
+ type ext7_table2[]
+(10 rows)
 
 alter extension test_ext7 update to '2.0';
 \dx+ test_ext7
@@ -62,7 +66,9 @@ Objects in extension "test_ext7"
 ---
  sequence ext7_table2_col2_seq
  table ext7_table2
-(2 rows)
+ type 

Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2024-01-15 Thread Robert Haas
On Thu, Jan 11, 2024 at 3:33 AM Kyotaro Horiguchi
 wrote:
> Is it correct to understand that you are requesting changes as follows?
>
> --- a/src/bin/pg_ctl/pg_ctl.c
> +++ b/src/bin/pg_ctl/pg_ctl.c
> @@ -1995,11 +1995,14 @@ pgwin32_find_postmaster_pid(pid_t shell_pid)
>  *
>  * Check for duplicate processes to ensure reliability.
>  *
> -* The launcher shell might start other cmd.exe instances or programs
> -* besides postgres.exe. Verifying the program file name is essential.
> -*
> -* The launcher shell process isn't checked in this function.  It 
> will be
> -* checked by the caller.
> +* The ppe entry to be examined is identified by th32ParentProcessID, 
> which
> +* should correspond to the cmd.exe process that executes the 
> postgres.exe
> +* binary. Additionally, th32ProcessID in the same entry should be 
> the PID
> +* of the launched postgres.exe. However, even though we have 
> launched the
> +* parent cmd.exe with the /D option specified, it is sometimes 
> observed
> +* that another cmd.exe is launched for unknown reasons. Therefore, 
> it is
> +* crucial to verify the program file name to avoid returning the 
> wrong
> +* PID.
>  */

This kind of change looks massively helpful to me. I don't know if it
is exactly right or not, but it would have been a big help to me when
writing my previous review, so +1 for some change of this general
type.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Oversight in reparameterize_path_by_child leading to executor crash

2024-01-15 Thread Robert Haas
On Mon, Jan 8, 2024 at 3:32 AM Richard Guo  wrote:
> Thanks for the suggestion.  Attached is an updated patch which is added
> with a commit message that tries to explain the problem and the fix.

This is great. The references to "the sampling infos" are a little bit
confusing to me. I'm not sure if this is referring to
TableSampleClause objects or what.

> Fair point.  I think we can back-patch a more minimal fix, as Tom
> proposed in [1], which disallows the reparameterization if the path
> contains sampling info that references the outer rel.  But I think we
> need also to disallow the reparameterization of a path if it contains
> restriction clauses that reference the outer rel, as such paths have
> been found to cause incorrect results, or planning errors as in [2].

Do you want to produce a patch for that, to go along with this patch for master?

I know this is on Tom's to-do list which makes me a bit reluctant to
get too involved here, because certainly he knows this code better
than I do, maybe better than anyone does, but on the other hand, we
shouldn't leave server crashes unfixed for too long, so maybe I can do
something to help at least with that part of it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




ALTER TYPE OWNER fails to recurse to multirange

2024-01-15 Thread Tom Lane
d=# create type varbitrange as range (subtype = varbit);
CREATE TYPE
d=# \dT+
                                              List of data types
 Schema |       Name       |  Internal name   | Size | Elements |  Owner   | Access privileges | Description 
--------+------------------+------------------+------+----------+----------+-------------------+-------------
 public | varbitmultirange | varbitmultirange | var  |          | postgres |                   | 
 public | varbitrange      | varbitrange      | var  |          | postgres |                   | 
(2 rows)

d=# create user joe;
CREATE ROLE
d=# alter type varbitrange owner to joe;
ALTER TYPE
d=# \dT+
                                              List of data types
 Schema |       Name       |  Internal name   | Size | Elements |  Owner   | Access privileges | Description 
--------+------------------+------------------+------+----------+----------+-------------------+-------------
 public | varbitmultirange | varbitmultirange | var  |          | postgres |                   | 
 public | varbitrange      | varbitrange      | var  |          | joe      |                   | 
(2 rows)

That's pretty broken, isn't it?  joe would own the multirange if he'd
created the range to start with.  Even if you think the ownerships
ideally should be separable, this behavior causes existing pg_dump
files to restore incorrectly, because pg_dump assumes it need not emit
any commands about the multirange.

A related issue is that you can manually alter the multirange's
ownership:

d=# alter type varbitmultirange owner to joe;
ALTER TYPE

which while it has some value in allowing recovery from this bug,
is inconsistent with our handling of other dependent types such
as arrays:

d=# alter type _varbitrange owner to joe;
ERROR:  cannot alter array type varbitrange[]
HINT:  You can alter type varbitrange, which will alter the array type as well.

Possibly the thing to do about that is to forbid it in HEAD
for consistency, while still allowing it in back branches
so that people can clean up inconsistent ownership if needed.

regards, tom lane




Re: Oom on temp (un-analyzed table caused by JIT) V16.1

2024-01-15 Thread Daniel Gustafsson
> On 15 Jan 2024, at 16:49, Kirk Wolak  wrote:
> On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson  > wrote:

> The script starts by creating 90 Million rows...  In my world that part of 
> the script, plus the indexes, etc.  Takes about 8-9 minutes.
> And has no memory loss. 

That's expected, the memory leak did not affect those operations.

> I used the memory reported by HTOP to track the problem.  I Forgot to mention 
> this.
> I am curious what you used?  (Because it doesn't display symptoms [running 
> dog slow] until the backend has about 85% of the machines memory)

I use a combination of tools, in this case I analyzed a build with Instruments
on macOS.

> There are up to date snapshots of the upcoming v16 minor release which might
> make testing easier than building postgres from source?
> 
> https://www.postgresql.org/download/snapshots/ 
> 
> 
> Thank you.  I have assigned that task to the guy who maintains our 
> VMs/Environments. 
> I will report back to you.

Great!

--
Daniel Gustafsson





Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-15 Thread Robert Haas
On Fri, Jan 12, 2024 at 4:05 PM Melanie Plageman
 wrote:
> Yea, that works for now. I mean, I think the way we should do it is
> update the FSM in lazy_scan_noprune(), but, for the purposes of this
> patch, yes. has_lpdead_items output parameter seems fine to me.

Here's v2.

It's not exactly clear to me why you want to update the FSM in
lazy_scan_[no]prune(). When I first looked at v7-0004, I said to
myself "well, this is dumb, because now we're just duplicating
something that is common to both cases". But then I realized that the
logic was *already* duplicated in lazy_scan_heap(), and that by
pushing the duplication down to lazy_scan_[no]prune(), you made the
two copies identical rather than, as at present, having two copies
that differ from each other. Perhaps that's a sufficient reason to
make the change all by itself, but it seems like what would be really
good is if we only needed one copy of the logic. I don't know if
that's achievable, though.

More generally, I somewhat question whether it's really right to push
things from lazy_scan_heap() and into lazy_scan_[no]prune(), mostly
because of the risk of having to duplicate logic. I'm not even really
convinced that it's good for us to have both of those functions.
There's an awful lot of code duplication between them already. Each
has a loop that tests the status of each item and then, for LP_USED,
switches on htsv_get_valid_status or HeapTupleSatisfiesVacuum. It
doesn't seem trivial to unify all that code, but it doesn't seem very
nice to have two copies of it, either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-Be-more-consistent-about-whether-to-update-the-FS.patch
Description: Binary data


Re: cleanup patches for incremental backup

2024-01-15 Thread Robert Haas
On Sat, Jan 13, 2024 at 1:00 PM Alexander Lakhin  wrote:
> I've found one more typo in the sgml:
> summarized_pid
> And one in a comment:
> sumamry
>
> A trivial fix is attached.

Thanks, committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Rework LogicalOutputPluginWriterUpdateProgress

2024-01-15 Thread vignesh C
On Mon, 13 Mar 2023 at 08:17, wangw.f...@fujitsu.com
 wrote:
>
> On Fri, Mar 10, 2023 20:17 PM Osumi, Takamichi/大墨 昂道 
>  wrote:
> > Hi,
> >
> >
> > On Friday, March 10, 2023 6:32 PM Wang, Wei/王 威 
> > wrote:
> > > Attach the new patch set.
> > Thanks for updating the patch ! One review comment on v7-0005.
>
> Thanks for your comment.
>
> > stream_start_cb_wrapper and stream_stop_cb_wrapper don't call the pair of
> > threshold check and UpdateProgressAndKeepalive unlike other write wrapper
> > functions like below. But, both of them write some data to the output 
> > plugin, set
> > the flag of did_write and thus it updates the subscriber's 
> > last_recv_timestamp
> > used for timeout check in LogicalRepApplyLoop. So, it looks adding the pair 
> > to
> > both functions can be more accurate, in order to reset the counter in
> > changes_count on the publisher ?
> >
> > @@ -1280,6 +1282,8 @@ stream_start_cb_wrapper(ReorderBuffer *cache,
> > ReorderBufferTXN *txn,
> >
> > /* Pop the error context stack */
> > error_context_stack = errcallback.previous;
> > +
> > +   /* No progress has been made, so don't call 
> > UpdateProgressAndKeepalive */
> >  }
>
> Since I think stream_start/stop_cb are different from change_cb, they don't
> represent records in wal, so I think the LSNs corresponding to these two
> messages are the LSNs of other records. So, we don't call the function
> UpdateProgressAndKeepalive here. Also, for the reasons described in [1].#05, I
> didn't reset the counter here.

As there has been no activity in this thread and it seems there is not
much interest on this from the last 9 months, I have changed the
status of the patch to "Returned with Feedback".

Regards,
Vignesh




Re: Oom on temp (un-analyzed table caused by JIT) V16.1

2024-01-15 Thread Pavel Stehule
On Mon, Jan 15, 2024 at 3:03 PM Daniel Gustafsson wrote:

> > On 15 Jan 2024, at 07:24, Kirk Wolak  wrote:
>
> >   You have a commit [1] that MIGHT fix this.
> > I have a script that recreates the problem, using random data in pg_temp.
> > And a nested cursor.
>
> Running your reproducer script in a tight loop for a fair bit of time on
> the
> v16 HEAD I cannot see any memory growth, so it's plausible that the
> upcoming
> 16.2 will work better in your environment.
>

yes



>
> >   It took me a few days to reduce this from actual code that was
> experiencing this.  If I turn off JIT, the problem goes away.  (if I don't
> FETCH the first row, the memory loss does not happen.  Maybe because
> opening a cursor is more decoration/prepare)
> >
> >   I don't have an easy way to test this script right now against the
> commit.
>
> There are up to date snapshots of the upcoming v16 minor release which
> might
> make testing easier than building postgres from source?
>
> https://www.postgresql.org/download/snapshots/
>
> Admittedly I don't know whether those are built with LLVM support but I
> think
> they might be.
>
> > I am hopeful that your fix fixes this.
>
> It seems likely, so it would be very valuable if you could try running the
> pre
> -release version of v16 before 16.2 ships to verify this.
>
> --
> Daniel Gustafsson
>
>


Re: Make attstattarget nullable

2024-01-15 Thread Peter Eisentraut

On 12.01.24 12:16, Alvaro Herrera wrote:

In get_attstattarget() I think we should return 0 for dropped columns
without reading attstattarget, which is useless anyway, and if it did
happen to return non-null, it might cause us to do stuff, which would be
a waste.


I ended up deciding to get rid of get_attstattarget() altogether and just do
the fetching inline in examine_attribute().  Because the previous API and
what you are discussing here is over-designed, since the only caller doesn't
call it with dropped columns or system columns anyway.  This way these
issues are contained in the ANALYZE code, not in a very general place like
lsyscache.c.


Sounds good.


I have committed this first patch.


Maybe instead of having examine_attribute hand a -1 target to the
analyze functions, we could just put default_statistics_target there.
Analyze functions would never receive negative values, and we could
remove that from the analyze functions.  Maybe make
VacAttrStats->attstattarget unsigned while at it.  (This could be a
separate patch.)


But I now remembered why I didn't do this.  The extended statistics code 
needs to know whether the statistics target was set or left as default, 
because it will then apply its own sequence of logic to determine a 
final value.  (Maybe there is a way to untangle this further, but it's 
not as obvious as it seems.)


At which point I then realized that extended statistics have their own 
statistics target catalog field and command, and we really should change 
that to match the changes done to attstattarget.  So here is another 
patch that does all that again for stxstattarget.  It's meant to mirror 
the attstattarget changes exactly.



And of course the 0003 patch gets rid of it anyway.


I again didn't look at 0002 and 0003 very closely, but from 10,000 feet
it looks mostly reasonable -- but I think naming the struct
FormData_pg_attribute_extra is not a great idea, as it looks like there
would have to be a catalog named pg_attribute_extra -- and I don't think
I would make the "non-Data" pointer-to-struct typedef either.


I agree that this naming was problematic.  After some introverted 
bikeshedding, I changed it to FormExtraData_pg_attribute.  Obviously, 
other solutions are possible.  I also removed the typedef as you suggested.
From 9199be09efcbca1b906b5c41e8524e68b1a6b64e Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 15 Jan 2024 16:44:55 +0100
Subject: [PATCH v4 1/3] Make stxstattarget nullable

To match attstattarget change (commit 4f622503d6d).

TODO: catversion
---
 doc/src/sgml/catalogs.sgml  | 28 -
 doc/src/sgml/ref/alter_statistics.sgml  | 13 ++--
 src/backend/commands/statscmds.c| 21 ---
 src/backend/parser/gram.y   |  4 ++--
 src/backend/statistics/extended_stats.c |  4 +++-
 src/bin/pg_dump/pg_dump.c   | 10 +
 src/bin/psql/describe.c |  3 ++-
 src/include/catalog/pg_statistic_ext.h  |  6 +++---
 src/include/nodes/parsenodes.h  |  2 +-
 9 files changed, 56 insertions(+), 35 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c15d861e823..34458df6d4c 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7633,6 +7633,19 @@ pg_statistic_ext 
Columns
   
  
 
+ 
+  
+   stxkeys int2vector
+   (references pg_attribute.attnum)
+  
+  
+   An array of attribute numbers, indicating which table columns are
+   covered by this statistics object;
+   for example a value of 1 3 would
+   mean that the first and the third table columns are covered
+  
+ 
+
  
   
stxstattarget int2
@@ -7642,7 +7655,7 @@ pg_statistic_ext 
Columns
of statistics accumulated for this statistics object by
ANALYZE.
A zero value indicates that no statistics should be collected.
-   A negative value says to use the maximum of the statistics targets of
+   A null value says to use the maximum of the statistics targets of
the referenced columns, if set, or the system default statistics target.
Positive values of stxstattarget
determine the target number of most common values
@@ -7650,19 +7663,6 @@ pg_statistic_ext 
Columns
   
  
 
- 
-  
-   stxkeys int2vector
-   (references pg_attribute.attnum)
-  
-  
-   An array of attribute numbers, indicating which table columns are
-   covered by this statistics object;
-   for example a value of 1 3 would
-   mean that the first and the third table columns are covered
-  
- 
-
  
   
stxkind char[]
diff --git a/doc/src/sgml/ref/alter_statistics.sgml 
b/doc/src/sgml/ref/alter_statistics.sgml
index 73cc9e830de..c16caa0b61b 100644
--- a/doc/src/sgml/ref/alter_statistics.sgml
+++ b/doc/src/sgml/ref/alter_statistics.sgml
@@ -26,7 +26,7 @@
 ALTER STATISTICS name OWNER TO 

Re: Oom on temp (un-analyzed table caused by JIT) V16.1

2024-01-15 Thread Kirk Wolak
On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson  wrote:

> > On 15 Jan 2024, at 07:24, Kirk Wolak  wrote:
>
> >   You have a commit [1] that MIGHT fix this.
> > I have a script that recreates the problem, using random data in pg_temp.
> > And a nested cursor.
>
> Running your reproducer script in a tight loop for a fair bit of time on
> the
> v16 HEAD I cannot see any memory growth, so it's plausible that the
> upcoming
> 16.2 will work better in your environment.
>

The script starts by creating 90 Million rows...  In my world that part of
the script, plus the indexes, etc.  Takes about 8-9 minutes.
And has no memory loss.

I used the memory reported by HTOP to track the problem.  I forgot to
mention this.
I am curious what you used?  (Because it doesn't display symptoms [running
dog slow] until the backend has about 85% of the machine's memory.)


> There are up to date snapshots of the upcoming v16 minor release which
> might
> make testing easier than building postgres from source?
>
> https://www.postgresql.org/download/snapshots/


Thank you.  I have assigned that task to the guy who maintains our
VMs/Environments.
I will report back to you.

Kirk


minor replication slot docs edits

2024-01-15 Thread Alvaro Herrera
Pursuant to a comment I made a few months ago[1], I propose the attached
changes to replication slots documentation.  In essence, I want to
explain that replication slots are good and mention the
max_slot_wal_keep_size GUC, before moving on to explain that the other
methods are worse.

Thanks

[1] https://postgr.es/m/20230413111838.e7yxke2dtwrxw3qy@alvherre.pgsql

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Entristecido, Wutra (canción de Las Barreras)
echa a Freyr a rodar
y a nosotros al mar"
>From c003a2c6ae8d1c177fbd4c1f77b428c5ad24e705 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 19 May 2023 20:15:52 +0200
Subject: [PATCH] Rework paragraphs in replication slots docs

Discussion: https://postgr.es/m/20230413111838.e7yxke2dtwrxw3qy@alvherre.pgsql
---
 doc/src/sgml/high-availability.sgml | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 9dd52ff275..887366bdec 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -930,25 +930,27 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
 recovery conflict even when the
 standby is disconnected.

+   
+Beware that replication slots can retain so many WAL segments that they
+fill up the space allocated for pg_wal.
+ can be used to limit the size
+of WAL files retained by replication slots.
+   

 In lieu of using replication slots, it is possible to prevent the removal
 of old WAL segments using , or by
 storing the segments in an archive using
  or .
-However, these methods often result in retaining more WAL segments than
+A disadvantage of these methods is that they
+often result in retaining more WAL segments than
 required, whereas replication slots retain only the number of segments
-known to be needed.  On the other hand, replication slots can retain so
-many WAL segments that they fill up the space allocated
-for pg_wal;
- limits the size of WAL files
-retained by replication slots.
+known to be needed.


 Similarly,  on its own, without
 also using a replication slot, provides protection against relevant rows
 being removed by vacuum, but provides no protection during any time period
-when the standby is not connected.  Replication slots overcome these
-disadvantages.
+when the standby is not connected.


 Querying and Manipulating Replication Slots
-- 
2.39.2



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-15 Thread Alexander Korotkov
On Mon, Jan 15, 2024 at 8:44 AM Masahiko Sawada  wrote:
>
> On Mon, Jan 15, 2024 at 8:21 AM Alexander Korotkov  
> wrote:
> >
> > On Sun, Jan 14, 2024 at 10:35 PM Masahiko Sawada  
> > wrote:
> > > Thank you for updating the patch. Here are two comments:
> > >
> > > ---
> > > +   if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED &&
> > > +   cstate->num_errors > 0)
> > > +   ereport(WARNING,
> > > +   errmsg("%zd rows were skipped due to data type 
> > > incompatibility",
> > > +  cstate->num_errors));
> > > +
> > > /* Done, clean up */
> > > error_context_stack = errcallback.previous;
> > >
> > > If a malformed input is not the last data, the context message seems odd:
> > >
> > > postgres(1:1769258)=# create table test (a int);
> > > CREATE TABLE
> > > postgres(1:1769258)=# copy test from stdin (save_error_to none);
> > > Enter data to be copied followed by a newline.
> > > End with a backslash and a period on a line by itself, or an EOF signal.
> > > >> a
> > > >> 1
> > > >>
> > > 2024-01-15 05:05:53.980 JST [1769258] WARNING:  1 rows were skipped
> > > due to data type incompatibility
> > > 2024-01-15 05:05:53.980 JST [1769258] CONTEXT:  COPY test, line 3: ""
> > > COPY 1
> > >
> > > I think it's better to report the WARNING after resetting the
> > > error_context_stack. Or is a WARNING really appropriate here? The
> > > v15-0001-Make-COPY-FROM-more-error-tolerant.patch[1] uses NOTICE but
> > > the v1-0001-Add-new-COPY-option-SAVE_ERROR_TO.patch[2] changes it to
> > > WARNING without explanation.
> >
> > Thank you for noticing this.  I think NOTICE is more appropriate here.
> > There is nothing to "worry" about: the user asked to ignore the errors
> > and we did.  And yes, it doesn't make sense to use the last line as
> > the context.  Fixed.
> >
> > > ---
> > > +-- test missing data: should fail
> > > +COPY check_ign_err FROM STDIN WITH (save_error_to none);
> > > +1  {1}
> > > +\.
> > >
> > > We might want to cover the extra data cases too.
> >
> > Agreed, the relevant test is added.
>
> Thank you for updating the patch. I have one minor point:
>
> +   if (cstate->opts.save_error_to != COPY_SAVE_ERROR_TO_UNSPECIFIED &&
> +   cstate->num_errors > 0)
> +   ereport(NOTICE,
> +   errmsg("%zd rows were skipped due to
> data type incompatibility",
> +  cstate->num_errors));
> +
>
> We can use errmsg_plural() instead.

Makes sense.  Fixed.
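
For illustration, with errmsg_plural the interaction would come out
like this (a hypothetical session; two of the four rows are malformed):

CREATE TABLE t (a int);
COPY t FROM STDIN WITH (save_error_to none);
>> 1
>> a
>> b
>> 3
>> \.
NOTICE:  2 rows were skipped due to data type incompatibility
COPY 2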

> I have a question about the option values; do you think we need to
> have another value of SAVE_ERROR_TO option to explicitly specify the
> current default behavior, i.e. not accept any error? With the v4
> patch, the user needs to omit SAVE_ERROR_TO option to accept errors
> during COPY FROM. If we change the default behavior in the future,
> many users will be affected and probably end up changing their
> applications to keep the current default behavior.

Valid point.  I've implemented the handling of CopySaveErrorToChoice
in a similar way to CopyHeaderChoice.

Please, check the revised patch attached.

--
Regards,
Alexander Korotkov


0001-Add-new-COPY-option-SAVE_ERROR_TO-v5.patch
Description: Binary data


Re: Custom explain options

2024-01-15 Thread Tomas Vondra



On 1/15/24 15:22, Konstantin Knizhnik wrote:
> 
> On 14/01/2024 11:47 pm, Tomas Vondra wrote:
>> The thing that was not clear to me is who decides what to prefetch,
>> which code issues the prefetch requests etc. In the github links you
>> shared I see it happens in the index AM code (in nbtsearch.c).
> 
> 
> It is up to the particular plan node (seqscan, indexscan,...) which
> pages to prefetch.
> 
> 
>>
>> That's interesting, because that's what my first prefetching patch did
>> too - not the same way, ofc, but in the same layer. Simply because it
>> seemed like the simplest way to do that. But the feedback was that's the
>> wrong layer, and that it should happen in the executor. And I agree with
>> that - the reasons are somewhere in the other thread.
>>
> I read the arguments in
> 
> https://www.postgresql.org/message-id/flat/8c86c3a6-074e-6c88-3e7e-9452b6a37b9b%40enterprisedb.com#fc792f8d013215ace7971535a5744c83
> 
> Separating prefetch info in index scan descriptor is a really good idea.
> It will be amazing to have a generic prefetch mechanism for all indexes.
> But unfortunately I do not understand how it is possible. The logic of
> index traversal is implemented inside the AM. The executor doesn't know it.
> For example for B-Tree scan we can prefetch:
> 
> - intermediate pages
> - leaf pages
> - referenced by TID heap pages
> 

My patch does not care about prefetching internal index pages. Yes, it's
a limitation, but my assumption is the internal pages are maybe 0.1% of
the index, and typically very hot / cached. Yes, if the index is not
used very often, this may be untrue. But I consider it a possible future
improvement, for some other patch. FWIW there's a prefetching patch for
inserts into indexes (which only prefetches just the index leaf pages).

> Before we load the next intermediate page, we do not know the next leaf
> pages. And before we load the next leaf page, we cannot find out the TIDs
> from this page.
> 

Not sure I understand what this is about. The patch simply calls the
index AM function index_getnext_tid() enough times to fill the prefetch
queue. It does not prefetch the next index leaf page, it however does
prefetch the heap pages. It does not "stall" at the boundary of the
index leaf page, or something.
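
Roughly, that filling works like the fragment below (a paraphrase of the
sentence above in code; the queue variables are invented, not the
patch's actual names):

/* Pull TIDs ahead of consumption and prefetch their heap pages. */
while (queue_len < prefetch_distance)
{
    ItemPointer tid = index_getnext_tid(scan, direction);

    if (tid == NULL)
        break;                  /* no more index entries */
    PrefetchBuffer(heapRel, MAIN_FORKNUM, ItemPointerGetBlockNumber(tid));
    prefetch_queue[queue_len++] = *tid;
}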

> Another challenge is how far we should prefetch (as far as I
> understand, both your and our approaches use a dynamically extended
> prefetch window)
> 

By dynamic extension of prefetch window you mean the incremental growth
of the prefetch distance from 0 to effective_io_concurrency? I don't
think there's a better solution.
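
In code, the ramp-up can be as simple as this (an illustrative sketch,
not the patch's actual logic):

/* Hypothetical prefetch-distance ramp-up: start small and grow
 * toward effective_io_concurrency as the scan keeps going. */
static int
grow_prefetch_distance(int distance, int max_distance)
{
    if (distance < 1)
        return 1;               /* first request: start prefetching */
    distance *= 2;              /* exponential ramp-up */
    return (distance > max_distance) ? max_distance : distance;
}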

There might be additional information that we could consider (e.g.
expected number of rows for the plan, earlier executions of the scan,
...) but each of these has a failure mode.

>> Based on what I saw in the neon code, I think it should be possible for
>> neon to use "my" approach too, but that only works for the index scans,
>> ofc. Not sure what to do about the other places.
> We definitely need prefetch for heap scan (it gives the most advantages
> in performance), for vacuum, and also for pg_prewarm. Also I tried to
> implement it for custom indexes such as pgvector. I am still not sure
> whether it is possible to create some generic solution which will work
> for all indexes.
> 

I haven't tried with pgvector, but I don't see why my patch would not
work for all index AMs that can return TIDs.

> I have also tried to implement alternative approach for prefetch based
> on access statistic.
> It comes from use case of seqscan of table with larger toasted records.
> So for each record we have to extract its TOAST data.
> It is done using standard index scan, but unfortunately index prefetch
> doesn't help much here: there is usually just one TOAST segment and so
> prefetch just have no chance to do something useful. But as far as heap
> records are accessed sequentially, there is good chance that toast table
> will also be accessed mostly sequentially. So we just can count number
> of sequential requests to each relation and if the ratio of seq/rand
> accesses is above some threshold we can prefetch next pages of this
> relation. This is really universal approach but ... working mostly for
> TOAST table.
> 

Are you talking about what works / doesn't work in neon, or about
postgres in general?

I'm not sure what you mean by "one TOAST segment" and I'd also guess
that if both tables are accessed mostly sequentially, the read-ahead
will do most of the work (in postgres).

It's probably true that as we do a separate index scan for each TOAST-ed
value, that can't really ramp-up the prefetch distance fast enough.
Maybe we could have a mode where we start with the full distance?

> 
>>> As I already wrote - prefetch is done locally for each backend. And each
>>> backend has its own connection with page server. It  can be changed in
>>> future when we implement multiplexing of page server connections. But
>>> right now prefetch is local. And certainly prefetch can improve
>>> 

Re: archive modules loose ends

2024-01-15 Thread Nathan Bossart
On Mon, Jan 15, 2024 at 12:21:44PM +, Li, Yong wrote:
> The patch looks good to me.  With the context explained in the thread,
> the patch is easy to understand.
> The patch serves as a refactoring which pulls up common memory management
> and error handling concerns into pgarch.c.  With the patch,
> individual archive callbacks can focus on copying the files and leave the
> boilerplate code to pgarch.c.
> 
> The patch applies cleanly to HEAD.  “make check-world” also runs cleanly
> with no error.

Thanks for reviewing.  I've marked this as ready-for-committer, and I'm
hoping to commit it in the near future.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add test module for Table Access Method

2024-01-15 Thread Jelte Fennema-Nio
On Mon, 15 Jan 2024 at 14:26, Aleksander Alekseev
 wrote:
> To be fair, Postgres uses TAM internally, so there is at least one
> complete and up-to-date real-life example.

Sure, but that one is quite hard to follow if you don't already know
lots of details of the heap storage. At least for me, having a minimal
example was extremely helpful and it made for a great code skeleton to
start from.
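
For readers hitting the same wall, the skeleton in question boils down
to something like this (a hypothetical minimal handler, not the test
module's actual code):

#include "postgres.h"

#include "access/tableam.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(minimal_tableam_handler);

/* Return a TableAmRoutine whose callbacks get filled in one by one
 * as the AM grows; only the node tag is set here. */
Datum
minimal_tableam_handler(PG_FUNCTION_ARGS)
{
    static const TableAmRoutine minimal_methods = {
        .type = T_TableAmRoutine,
        /* .slot_callbacks, .scan_begin, ... to be filled in */
    };

    PG_RETURN_POINTER(&minimal_methods);
}

-- assuming a CREATE FUNCTION minimal_tableam_handler(internal)
--   RETURNS table_am_handler ... LANGUAGE C; has been done first:
CREATE ACCESS METHOD minimal TYPE TABLE HANDLER minimal_tableam_handler;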




Re: Built-in CTYPE provider

2024-01-15 Thread Daniel Verite
Jeff Davis wrote:

> New version attached.

[v16]

Concerning the target category_test, it produces failures with
versions of ICU with Unicode < 15. The first one I see with Ubuntu
22.04 (ICU 70.1) is:

category_test: Postgres Unicode version: 15.1
category_test: ICU Unicode version: 14.0
category_test: FAILURE for codepoint 0x000c04
category_test: Postgres property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 1/0/0/0/0/0
category_test: ICU property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 0/0/0/0/0/0

U+0C04 is a codepoint added in Unicode 11.
https://en.wikipedia.org/wiki/Telugu_(Unicode_block)

In UnicodeData.txt:
0C04;TELUGU SIGN COMBINING ANUSVARA ABOVE;Mn;0;NSM;N;

In Unicode 15, it has been assigned Other_Alphabetic in PropList.txt
$ grep 0C04 PropList.txt
0C04          ; Other_Alphabetic # Mn       TELUGU SIGN COMBINING ANUSVARA ABOVE

But in Unicode 14 it was not there.
As a result its binary property UCHAR_ALPHABETIC has changed from
false to true in ICU 72 vs previous versions.

As I understand, the stability policy says that such things happen.
From https://www.unicode.org/policies/stability_policy.html

   Once a character is encoded, its properties may still be changed,
   but not in such a way as to change the fundamental identity of the
   character.

   The Consortium will endeavor to keep the values of the other
   properties as stable as possible, but some circumstances may arise
   that require changing them. Particularly in the situation where
   the Unicode Standard first encodes less well-documented characters
   and scripts, the exact character properties and behavior initially
   may not be well known.

   As more experience is gathered in implementing the characters,
   adjustments in the properties may become necessary. Examples of
   such properties include, but are not limited to, the following:

- General_Category
- Case mappings
- Bidirectional properties
[...]

I've commented out the exit(1) in category_test to collect all errors, and
built it with versions of ICU from 74 down to 60 (that is Unicode 10.0).
Results are attached. As expected, the older the ICU version, the more
differences are found against Unicode 15.1.

I find these results interesting because they tell us what contents
can break regexp-based check constraints on upgrades.

But about category_test as a pass-or-fail kind of test, it can only be
used when the Unicode version in ICU is the same as in Postgres.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite


results-category-tests-multiple-icu-versions.tar.bz2
Description: Binary data


Re: Custom explain options

2024-01-15 Thread Konstantin Knizhnik


On 14/01/2024 11:47 pm, Tomas Vondra wrote:

The thing that was not clear to me is who decides what to prefetch,
which code issues the prefetch requests etc. In the github links you
shared I see it happens in the index AM code (in nbtsearch.c).



It is up to the particular plan node (seqscan, indexscan,...) which 
pages to prefetch.





That's interesting, because that's what my first prefetching patch did
too - not the same way, ofc, but in the same layer. Simply because it
seemed like the simplest way to do that. But the feedback was that's the
wrong layer, and that it should happen in the executor. And I agree with
that - the reasons are somewhere in the other thread.


I read the arguments in

https://www.postgresql.org/message-id/flat/8c86c3a6-074e-6c88-3e7e-9452b6a37b9b%40enterprisedb.com#fc792f8d013215ace7971535a5744c83

Separating prefetch info in index scan descriptor is a really good idea.
It will be amazing to have a generic prefetch mechanism for all indexes.
But unfortunately I do not understand how it is possible. The logic of
index traversal is implemented inside the AM. The executor doesn't know it.

For example for B-Tree scan we can prefetch:

- intermediate pages
- leaf pages
- referenced by TID heap pages

Before we load the next intermediate page, we do not know the next leaf pages.
And before we load the next leaf page, we cannot find out the TIDs from this page.

Another challenge is how far we should prefetch (as far as I
understand, both your and our approaches use a dynamically extended
prefetch window).



Based on what I saw in the neon code, I think it should be possible for
neon to use "my" approach too, but that only works for the index scans,
ofc. Not sure what to do about the other places.
We definitely need prefetch for heap scan (it gives the most advantages
in performance), for vacuum, and also for pg_prewarm. Also I tried to
implement it for custom indexes such as pgvector. I am still not sure
whether it is possible to create some generic solution which will work
for all indexes.


I have also tried to implement an alternative approach for prefetch, based
on access statistics.
It comes from the use case of a seqscan of a table with large toasted
records, where for each record we have to extract its TOAST data.
That is done using a standard index scan, but unfortunately index prefetch
doesn't help much here: there is usually just one TOAST segment, so
prefetch has no chance to do something useful. But as far as heap
records are accessed sequentially, there is a good chance that the TOAST
table will also be accessed mostly sequentially. So we can just count the
number of sequential requests to each relation, and if the ratio of
seq/rand accesses is above some threshold we can prefetch the next pages
of this relation. This is a really universal approach, but ... working
mostly for the TOAST table.
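
A self-contained sketch of that heuristic (names and threshold invented
for illustration; this is not Neon's actual code):

#include <stdbool.h>
#include <stdint.h>

/* Per-relation access statistics for the seq/rand heuristic above. */
typedef struct AccessStats
{
    uint32_t    last_blkno;     /* last block requested */
    uint64_t    seq_count;      /* requests adjacent to the previous one */
    uint64_t    rand_count;     /* all other requests */
} AccessStats;

/* Record a request for blkno; return true once the access pattern looks
 * sequential enough to justify prefetching the following pages. */
static bool
record_and_decide(AccessStats *st, uint32_t blkno)
{
    if (blkno == st->last_blkno + 1)
        st->seq_count++;
    else
        st->rand_count++;
    st->last_blkno = blkno;

    return st->seq_count > 4 * (st->rand_count + 1);
}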




As I already wrote - prefetch is done locally for each backend. And each
backend has its own connection with page server. It  can be changed in
future when we implement multiplexing of page server connections. But
right now prefetch is local. And certainly prefetch can improve
performance only if we correctly predict subsequent page requests.
If not - then the page server does useless jobs and the backend has to wait
and consume all issued prefetch requests. This is why in prefetch
implementation for most of nodes we  start with minimal prefetch
distance and then increase it. It allows to perform prefetch only for
such queries where it is really efficient (OLAP) and doesn't degrade
performance of simple OLTP queries.


Not sure I understand what's so important about prefetches being "local"
for each backend. I mean even in postgres each backend prefetches its
own buffers, no matter what the other backends do. Although, neon
probably doesn't have the cross-backend sharing through shared buffers
etc. right?



Sorry if my explanation was not clear:(


I mean even in postgres each backend prefetches its own buffers, no matter
what the other backends do.


This is exactly the difference. In Neon such an approach doesn't work.
Each backend maintains its own prefetch ring. And if a prefetched page was not
actually received, then the whole pipe is lost.
I.e. a backend prefetched pages 1,5,10. Then it needs to read page 2. So it has
to consume the responses for 1,5,10 and issue another request for page 2.
Instead of improving speed we are just doing extra work.
So each backend should prefetch only those pages which it is actually going to
read.
This is why prefetch approach used in Postgres for example for parallel bitmap 
heap scan doesn't work for Neon.
If you do `posix_fadvise` then the prefetched page is placed in the OS cache
and can be used by any parallel worker.
But in Neon each parallel worker should be given its own range of pages to scan 
and prefetch only this pages.




Well, my assumption was the following: prefetch is most efficient for OLAP
queries.
Although HTAP (hybrid transactional/analytical processing) is popular
trend 

Re: More new SQL/JSON item methods

2024-01-15 Thread Peter Eisentraut

Attached are two small fixup patches for your patch set.

In the first one, I simplified the grammar for the .decimal() method. 
It seemed a bit overkill to build a whole list structure when all we 
need are 0, 1, or 2 arguments.
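
For reference, the method whose grammar this touches would be invoked
roughly like this (a hypothetical example; the result is what
numeric(6,2) rounding would give):

SELECT jsonb_path_query('1234.5678', '$.decimal(6, 2)');
-- 1234.57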


Per SQL standard, the precision and scale arguments are unsigned 
integers, so unary plus and minus signs are not supported.  So my patch 
removes that support, but I didn't adjust the regression tests for that.


Also note that in your 0002 patch, the datetime precision is similarly 
unsigned, so that's consistent.


By the way, in your 0002 patch, I don't see the need for the separate
datetime_method grammar rule.  You can fold that into accessor_op.


Overall, I think it would be better if you combined all three of these 
patches into one.  Right now, you have arranged these as incremental 
features, and as a result of that, the additions to the JsonPathItemType 
enum and the grammar keywords etc. are ordered in the way you worked on 
these features, I guess.  It would be good to maintain a bit of sanity 
to put all of this together and order all the enums and everything else 
for example in the order they are in the sql_features.txt file (which is 
alphabetical, I suppose).  At this point I suspect we'll end up 
committing this whole feature set together anyway, so we might as well 
organize it that way.
From 81e330a243d85dff7f64adf17815258e2764ea01 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 15 Jan 2024 14:11:36 +0100
Subject: [PATCH 1/2] fixup! Implement jsonpath .number(), .decimal([precision
 [, scale]]), .bigint(), and .integer() methods

---
 src/backend/utils/adt/jsonpath_gram.y | 43 +--
 1 file changed, 8 insertions(+), 35 deletions(-)

diff --git a/src/backend/utils/adt/jsonpath_gram.y 
b/src/backend/utils/adt/jsonpath_gram.y
index 24c31047ffd..9c06c3f6cb9 100644
--- a/src/backend/utils/adt/jsonpath_gram.y
+++ b/src/backend/utils/adt/jsonpath_gram.y
@@ -89,9 +89,9 @@ static bool makeItemLikeRegex(JsonPathParseItem *expr,
 %type	<value>		scalar_value path_primary expr array_accessor
 					any_path accessor_op key predicate delimited_predicate
 					index_elem starts_with_initial expr_or_predicate
-					datetime_template opt_datetime_template csv_elem
+					datetime_template opt_datetime_template
 
-%type	<elems>		accessor_expr csv_list opt_csv_list
+%type	<elems>		accessor_expr
 
 %type	<indexs>	index_list
 
@@ -252,39 +252,12 @@ accessor_op:
| '.' DATETIME_P '(' opt_datetime_template ')'
{ $$ = 
makeItemUnary(jpiDatetime, $4); }
| '?' '(' predicate ')' { $$ = makeItemUnary(jpiFilter, 
$3); }
-   | '.' DECIMAL_P '(' opt_csv_list ')'
-   {
-   if (list_length($4) == 0)
-   $$ = makeItemBinary(jpiDecimal, NULL, NULL);
-   else if (list_length($4) == 1)
-   $$ = makeItemBinary(jpiDecimal, linitial($4), NULL);
-   else if (list_length($4) == 2)
-   $$ = makeItemBinary(jpiDecimal, linitial($4), 
lsecond($4));
-   else
-   ereturn(escontext, false,
-   (errcode(ERRCODE_SYNTAX_ERROR),
-errmsg("invalid input syntax for type 
%s", "jsonpath"),
-errdetail(".decimal() can only have an 
optional precision[,scale].")));
-   }
-   ;
-
-csv_elem:
-   INT_P
-   { $$ = makeItemNumeric(&$1); }
-   | '+' INT_P %prec UMINUS
-   { $$ = makeItemUnary(jpiPlus, makeItemNumeric(&$2)); }
-   | '-' INT_P %prec UMINUS
-   { $$ = makeItemUnary(jpiMinus, makeItemNumeric(&$2)); }
-   ;
-
-csv_list:
-   csv_elem{ $$ = 
list_make1($1); }
-   | csv_list ',' csv_elem { $$ = lappend($1, $3); }
-   ;
-
-opt_csv_list:
-   csv_list{ $$ = $1; }
-   | /* EMPTY */   { $$ = NULL; }
+   | '.' DECIMAL_P '(' ')'
+   { $$ = makeItemBinary(jpiDecimal, NULL, NULL); }
+   | '.' DECIMAL_P '(' INT_P ')'
+   { $$ = makeItemBinary(jpiDecimal, makeItemNumeric(&$4), NULL); }
+   | '.' DECIMAL_P '(' INT_P ',' INT_P ')'
+   { $$ = makeItemBinary(jpiDecimal, makeItemNumeric(&$4), 
makeItemNumeric(&$6)); }
;
 
 datetime_template:
-- 
2.43.0

From f2f9176548d36d3b02bd3baf1d5afdf0a84a84b1 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 15 Jan 2024 14:19:22 +0100
Subject: [PATCH 2/2] fixup! Implement jsonpath .number(), .decimal([precision
 [, scale]]), .bigint(), and .integer() methods

---
 

Re: Oom on temp (un-analyzed table caused by JIT) V16.1

2024-01-15 Thread Daniel Gustafsson
> On 15 Jan 2024, at 07:24, Kirk Wolak  wrote:

>   You have a commit [1] that MIGHT fix this.
> I have a script that recreates the problem, using random data in pg_temp.
> And a nested cursor.

Running your reproducer script in a tight loop for a fair bit of time on the
v16 HEAD I cannot see any memory growth, so it's plausible that the upcoming
16.2 will work better in your environment.

>   It took me a few days to reduce this from actual code that was experiencing 
> this.  If I turn off JIT, the problem goes away.  (If I don't FETCH the first 
> row, the memory loss does not happen; maybe because opening a cursor is mostly 
> decoration/prepare.)
> 
>   I don't have an easy way to test this script right now against the commit.

There are up-to-date snapshots of the upcoming v16 minor release which might
make testing easier than building postgres from source:

https://www.postgresql.org/download/snapshots/

Admittedly I don't know whether those are built with LLVM support but I think
they might be.

> I am hopeful that your fix fixes this.

It seems likely, so it would be very valuable if you could try running the
pre-release version of v16 before 16.2 ships to verify this.

--
Daniel Gustafsson





Re: Add test module for Table Access Method

2024-01-15 Thread Aleksander Alekseev
Hi,

> When trying to implement a table access method in the past I remember
> very well that I was having a really hard time finding an example of
> one.

To be fair, Postgres uses TAM internally, so there is at least one
complete and up-to-date real-life example. The learning curve for TAMs is
indeed steep, and I wonder if we could do a better job in this respect,
e.g. by providing a simpler example. This being said, I know several
people who learned TAM successfully (so far only for R&D tasks), which
indicates that its difficulty is adequate.

-- 
Best regards,
Aleksander Alekseev




Re: postgres_fdw fails to see that array type belongs to extension

2024-01-15 Thread David Geier

Hi,

I realized that ALTER EXTENSION foo ADD TYPE _bar does pretty much the 
same via ExecAlterExtensionContentsStmt(). So the code in the patch 
seems fine.
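
(For anyone following along, a quick way to inspect which dependencies actually
got recorded for the array type; a sketch, assuming the foo/_bar names from the
example exist:

SELECT classid::regclass, objid, refclassid::regclass, refobjid, deptype
FROM pg_depend
WHERE objid = '_bar'::regtype;

One row with deptype 'e' (extension) should show up next to the internal 'i'
dependency on the element type.)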


On 1/8/24 12:21, David Geier wrote:
The attached patch just adds a 2nd dependency between the array type 
and the extension, using recordDependencyOnCurrentExtension(). It 
seems that the other internal dependency on the element type must 
stay? If that seems reasonable I can add a test to 
modules/test_extensions. Can extensions in contrib use test extensions 
in their own tests? It looks like postgres_fdw doesn't test any of the 
shippability logic.



--
David Geier
(ServiceNow)





Re: POC: GROUP BY optimization

2024-01-15 Thread Alena Rybakina

On 15.01.2024 12:46, Andrei Lepikhov wrote:

On 15/1/2024 13:42, Richard Guo wrote:


On Mon, Jan 15, 2024 at 8:20 AM Alexander Korotkov wrote:


    Thank you for providing the test case relevant for this code change.
    The revised patch incorporating this change is attached. Now the
    patchset looks good to me.  I'm going to push it if there are no
    objections.


Seems I'm late for the party.  Can we hold for several more days?  I'd
like to have a review on this patch.
Get on board! It looks like this feature needs as much review as 
possible (likewise SJE).


Hi! Thank you for your work on this issue! I believe that this will help 
the planner build a more optimal query plan here and therefore speed 
up query execution.
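
As a hypothetical illustration of the optimization being discussed here: with
an index on (a, b), reordering the GROUP BY keys lets the planner group
pre-sorted input instead of inserting an explicit Sort:

CREATE TABLE t (a int, b int);
CREATE INDEX ON t (a, b);
-- GROUP BY b, a can be reordered to a, b to match the index ordering
SELECT a, b, count(*) FROM t GROUP BY b, a;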


I have reviewed the patches and noticed some opportunities for code 
refactoring. I have attached a diff file (group_by.diff) to this email.


The changes involve spelling corrections, renaming variables, and moving 
some common parts into one place.



In addition, I have a few questions, since some points in the code 
remained unclear to me.


1.  I didn't understand why we have a question mark in the comment next to 
the enable_group_by_reordering variable in the 
src/backend/optimizer/path/pathkeys.c file; I assumed it was a typo and 
fixed it in the diff file.


2. Why do we set the variable (path = path_save) here (in the 
add_paths_to_grouping_rel function) if we overwrite it just below? Couldn't 
we pass path_save as a parameter instead?


foreach(lc2, pathkey_orderings)
{
    PathKeyInfo *info = (PathKeyInfo *) lfirst(lc2);

    /* restore the path (we replace it in the loop) */
    path = path_save;

    path = make_ordered_path(root,
                         grouped_rel,
                         path,
                         cheapest_path,
                         info->pathkeys);
    if (path == NULL)
    continue;

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 5aac6d66776..8be58fa2b0e 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,7 +29,7 @@
 #include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
-/* Consider reordering of GROUP BY keys? */
+/* Consider reordering of GROUP BY keys */
 bool		enable_group_by_reordering = true;
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -362,7 +362,7 @@ pathkeys_contained_in(List *keys1, List *keys2)
  *
  * Returns the number of GROUP BY keys with a matching pathkey.
  */
-static int
+static PathKeyInfo *
 group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys,
 			   List **group_clauses,
 			   int num_groupby_pathkeys)
@@ -421,7 +421,16 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys,
 	*group_clauses = list_concat_unique_ptr(new_group_clauses,
 			*group_clauses);
 
-	return n;
+	if (n > 0 &&
+		(enable_incremental_sort || n == list_length(*group_pathkeys)))
+	{
+		PathKeyInfo *info = makeNode(PathKeyInfo);
+		info->pathkeys = *group_pathkeys;
+		info->clauses = *group_clauses;
+		return info;
+	}
+
+	return NULL;
 }
 
 /*
@@ -436,7 +445,7 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys,
  *
  * - the original ordering, as specified by the GROUP BY clause,
  * - GROUP BY keys reordered to match 'path' ordering (as much as possible),
- * - GROUP BY keys to match target ORDER BY clause (as much as possible).
+ * - GROUP BY keys should match the target ORDER BY clause (as much as possible).
  */
 List *
 get_useful_group_keys_orderings(PlannerInfo *root, Path *path)
@@ -475,20 +484,11 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path)
 	 */
 	if (path->pathkeys)
 	{
-		int			n;
-
-		n = group_keys_reorder_by_pathkeys(path->pathkeys, &pathkeys, &clauses,
+		info = group_keys_reorder_by_pathkeys(path->pathkeys, &pathkeys, &clauses,
 		   root->num_groupby_pathkeys);
 
-		if (n > 0 &&
-			(enable_incremental_sort || n == list_length(path->pathkeys)))
-		{
-			info = makeNode(PathKeyInfo);
-			info->pathkeys = pathkeys;
-			info->clauses = clauses;
-
+		if (info)
 			infos = lappend(infos, info);
-		}
 	}
 
 	/*
@@ -497,21 +497,12 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path)
 	 */
 	if (root->sort_pathkeys)
 	{
-		int			n;
-
-		n = group_keys_reorder_by_pathkeys(root->sort_pathkeys, &pathkeys,
+		info = group_keys_reorder_by_pathkeys(root->sort_pathkeys, &pathkeys,
 										   &clauses,
 		   root->num_groupby_pathkeys);
 
-		if (n > 0 &&
-			(enable_incremental_sort || n == list_length(path->pathkeys)))
-		{
-			info = makeNode(PathKeyInfo);
-			info->pathkeys = pathkeys;
-			info->clauses = clauses;
-
+		if (info)
 			infos = lappend(infos, info);
-		}
 	}
 
 	return infos;
@@ -2163,27 +2154,26 @@ truncate_useless_pathkeys(PlannerInfo *root,
 		  RelOptInfo *rel,
 		  List *pathkeys)
 {

Re: Fix a possible socket leak at Windows (src/backend/port/win32/socket.c)

2024-01-15 Thread Daniel Gustafsson
> On 13 Jan 2024, at 22:38, Ranier Vilela  wrote:

> In the pgwin32_socket function (src/backend/port/win32/socket.c), there is a 
> possible socket leak if the socket cannot be made non-blocking.

I don't know Windows well enough to comment on the implications of not calling
closesocket here, but it definitely seems like a prudent thing to do,
backpatched down to 12. Unless there are objections I'll do that.
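
For reference, the error path in question with the proposed fix applied would
look roughly like this (a sketch based on the existing pgwin32_socket()
structure):

	u_long		on = 1;

	if (ioctlsocket(s, FIONBIO, &on))
	{
		TranslateSocketError();
		closesocket(s);			/* avoid leaking the socket on failure */
		return INVALID_SOCKET;
	}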

--
Daniel Gustafsson





Re: archive modules loose ends

2024-01-15 Thread Li, Yong


> On Nov 29, 2023, at 01:18, Nathan Bossart  wrote:
> 
> External Email
> 
> Here is a new version of the patch with feedback addressed.
> 
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com

Hi Nathan,

The patch looks good to me.  With the context explained in the thread, the 
patch is easy to understand.
The patch serves as a refactoring which pulls common memory management and 
error handling concerns up into pgarch.c.  With the patch, individual archive 
callbacks can focus on copying the files and leave the boilerplate code to 
pgarch.c. 

The patch applies cleanly to HEAD.  “make check-world” also runs cleanly with 
no errors.


Regards,
Yong

Re: Lockless exit path for ReplicationOriginExitCleanup

2024-01-15 Thread Alvaro Herrera
Thanks, pushed.  I reworded the comment again, though.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Postgres and --config-file option

2024-01-15 Thread Aleksander Alekseev
Hi,

> It might be worthwhile to update the documentation if it would've helped
> prevent confusion here.

> Its documentation also describes this method of specifying parameters in
> the 'Examples' section.

I believe the documentation for 'postgres' already does a decent job
of describing what --NAME=VALUE means, and gives an example. IMO the
actual problem is with the --help message and the specific error message.

> Please read the documentation for the complete list of run-time
> configuration settings and how to set them on the command line or in
> the configuration file

Additionally, the --help message doesn't say which part of the
documentation should be read specifically. This being said, personally
I don't think that providing specific URLs in the --help message would
be a good idea; that would just indicate that the --help message is
written poorly.
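
For context, the two invocation styles under discussion look roughly like this
(paths are hypothetical):

postgres -D /usr/local/pgsql/data
postgres --config-file=/usr/local/pgsql/data/postgresql.conf \
 -c data_directory=/usr/local/pgsql/data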

> As for the help message, I’d minimally add:
>
> “You must specify the --config-file (or equivalent -c) or -D invocation …”

Good idea.

PFA the patch. It's short but I think it mitigates the problem.

-- 
Best regards,
Aleksander Alekseev


v1-0001-Clarify-error-message-about-specifying-config-fil.patch
Description: Binary data


Re: Make NUM_XLOGINSERT_LOCKS configurable

2024-01-15 Thread Jakub Wartak
On Fri, Jan 12, 2024 at 7:33 AM Bharath Rupireddy
 wrote:
>
> On Wed, Jan 10, 2024 at 11:43 AM Tom Lane  wrote:
> >
> > Bharath Rupireddy  writes:
> > > On Wed, Jan 10, 2024 at 10:00 AM Tom Lane  wrote:
> > >> Maybe.  I bet just bumping up the constant by 2X or 4X or so would get
> > >> most of the win for far less work; it's not like adding a few more
> > >> LWLocks is expensive.  But we need some evidence about what to set it to.
> >
> > > I previously made an attempt to improve WAL insertion performance with
> > > varying NUM_XLOGINSERT_LOCKS. IIRC, we will lose what we get by
> > > increasing insertion locks (reduction in WAL insertion lock
> > > acquisition time) to the CPU overhead of flushing the WAL in
> > > WaitXLogInsertionsToFinish as referred to by the following comment.
> >
> > Very interesting --- this is at variance with what the OP said, so
> > we definitely need details about the test conditions in both cases.
> >
> > > Unfortunately, I've lost the test results, I'll run them up again and
> > > come back.
> >
> > Please.
>
> Okay, I'm back with some testing

[..]

> Results with varying NUM_XLOGINSERT_LOCKS (note that we can't allow it
> to be more than MAX_SIMUL_LWLOCKS):
>
> Locks   TPS     WAL insert lock acquire time (ms)   WAL wait for in-progress inserts to finish time (ms)
> 8       18669   12532                                8775
> 16      18076   10641                               13491
> 32      18034    6635                               13997
> 64      17582    3937                               14718
> 128     17782    4563                               20145
>
> Also, check the attached graph. Clearly there's an increase in the
> time spent in waiting for in-progress insertions to finish in
> WaitXLogInsertionsToFinish from 8.7 seconds to 20 seconds. Whereas,
> the time spent to acquire WAL insertion locks decreased from 12.5
> seconds to 4.5 seconds. Overall, this hasn't resulted in any improvement
> in TPS; in fact, a slight reduction was observed.

Hi, I've hastily tested using Bharath's patches too, as I was thinking
it would be a fast win due to contention; however, at least on fast
NVMes, increasing NUM_XLOGINSERT_LOCKS doesn't seem to help.
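
(For reference, the knob being varied here is a compile-time constant rather
than a GUC; the relevant line, assuming current sources of
src/backend/access/transam/xlog.c, is simply:

#define NUM_XLOGINSERT_LOCKS  8

so every data point below means rebuilding the server.)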

With pgbench -P 5 -c 32 -j 32 -T 30 and
- 64vCPU Lsv2 (AMD EPYC), on single NVMe device (with ext4) that can
do 100k RW IOPS@8kB (with fio/libaio, 4jobs)
- shared_buffers = '8GB', max_wal_size = '32GB', track_wal_io_timing = on
- maxed out wal_buffers = '256MB'

tpcb-like with synchronous_commit=off
Locks   TPS     wal_insert_lock_acquire_time   wal_wait_for_insert_to_finish_time
8       30393   24087                          128
32      31205   968                            93

tpcb-like with synchronous_commit=on
Locks   TPS     wal_insert_lock_acquire_time   wal_wait_for_insert_to_finish_time
8       12031   8472                           10722
32      11957   1188                           12563

tpcb-like with synchronous_commit=on and pgbench -c 64 -j 64
Locks   TPS     wal_insert_lock_acquire_time   wal_wait_for_insert_to_finish_time
8       25010   90620                          68318
32      25976   18569                          85319
// same, as Bharath said, it shifted from insert_lock to
// waiting_for_insert_to_finish

insertonly (largeinserts) with synchronous_commit=off (still -c 32 -j 32)
  TPSwal_insert_lock_acquire_time  wal_wait_for_insert_to_finish_time
8 3671914283
32   39387568

insertonly (largeinserts) with synchronous_commit=on (still -c 32 -j 32)
  TPSwal_insert_lock_acquire_time  wal_wait_for_insert_to_finish_time
8 32915950125
32   3102177  316

insertonly was := {
create sequence s1;
create table t (id bigint, t text) partition by hash (id);
create table t_h0 partition of t FOR VALUES WITH (modulus 8, remainder 0);
create table t_h1 partition of t FOR VALUES WITH (modulus 8, remainder 1);
create table t_h2 partition of t FOR VALUES WITH (modulus 8, remainder 2);
create table t_h3 partition of t FOR VALUES WITH (modulus 8, remainder 3);
create table t_h4 partition of t FOR VALUES WITH (modulus 8, remainder 4);
create table t_h5 partition of t FOR VALUES WITH (modulus 8, remainder 5);
create table t_h6 partition of t FOR VALUES WITH (modulus 8, remainder 6);
create table t_h7 partition of t FOR VALUES WITH (modulus 8, remainder 7);

and runtime pgb:
insert into t select nextval('s1'), repeat('A', 1000) from
generate_series(1, 1000);
}

The table was truncated every time, the DB was checkpointed, and of course it
was on master. Without more details from Qingsong it is going to be hard to
explain the boost he witnessed.

-J.




Re: Synchronizing slots from primary to standby

2024-01-15 Thread Amit Kapila
On Mon, Jan 15, 2024 at 2:54 PM Bertrand Drouvot
 wrote:
>
> On Sat, Jan 13, 2024 at 10:05:52AM +0530, Amit Kapila wrote:
> > On Fri, Jan 12, 2024 at 12:07 PM Bertrand Drouvot
> >  wrote:
> > > Maybe the "best" approach would be to have a way to detect that a slot 
> > > has been
> > > re-created on the primary (but that would mean rely on more than the slot 
> > > name
> > > to "identify" a slot and probably add a new member to the struct to do 
> > > so).
> > >
> >
> > Right, I also thought so but not sure further complicating the slot
> > machinery is worth detecting this case explicitly. If we see any
> > problem with the idea discussed then we may need to think something
> > along those lines.
>
> Yeah, let's see. On one side that would require extra work but on the other 
> side
> that would also probably simplify (and less bug prone in the mid-long term?)
> other parts of the code.
>

After following Sawada-San's suggestion to not copy the 'failover'
option there doesn't seem to be much special handling, so there is
probably less to simplify.

-- 
With Regards,
Amit Kapila.




Re: automating RangeTblEntry node support

2024-01-15 Thread Peter Eisentraut

On 06.12.23 21:02, Peter Eisentraut wrote:
I have been looking into what it would take to get rid of the 
custom_read_write and custom_query_jumble for the RangeTblEntry node 
type.  This is one of the larger and more complex exceptions left.


(Similar considerations would also apply to the Constraint node type.)


In this updated patch set, I have also added the treatment of the 
Constraint type.  (I also noted that the manual read/write functions for 
the Constraint type are out-of-sync again, so simplifying this would be 
really helpful.)  I have also added commit messages to each patch.


The way I have re-ordered the patch series now, I think patches 0001 
through 0003 are candidates for inclusion after review, patch 0004 still 
needs a bit more analysis and testing, as described therein.
From f0e896a201367a639be9d08d070066867c46d7cf Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 15 Jan 2024 11:23:39 +0100
Subject: [PATCH v2 1/4] Remove custom Constraint node read/write
 implementations

This is part of an effort to reduce the number of special cases in the
automatically generated node support functions.

Allegedly, only certain fields of the Constraint node are valid based
on contype.  But this has historically not been kept up to date in the
read/write functions.  The Constraint node is only used for debugging
DDL statements, so there are no strong requirements for its output,
and there is no enforcement for its correctness.  (There was no read
support before a6bc3301925.)  Commits e7a552f303c and abf46ad9c7b are
examples of where omissions were fixed.  And the current output is
again incorrect, because the recently added "inhcount" field is not
included.

This patch just removes the custom read/write implementations for the
Constraint node type.  Now we just output all the fields, which is a
bit more than before, but at least we don't have to maintain these
functions anymore.  Also, we lose the string representation of the
contype field, but for this marginal use case that seems tolerable.
This patch also changes the documentation of the Constraint struct to
put less emphasis on grouping fields by constraint type but rather
document for each field how it's used.

Discussion: https://www.postgresql.org/message-id/flat/4b27fc50-8cd6-46f5-ab20-88dbaadca...@eisentraut.org
---
 src/backend/nodes/outfuncs.c   | 126 -
 src/backend/nodes/readfuncs.c  | 143 -
 src/include/nodes/parsenodes.h |  38 +++--
 3 files changed, 13 insertions(+), 294 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 296ba845187..dee2b9e87b2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -699,132 +699,6 @@ _outA_Const(StringInfo str, const A_Const *node)
WRITE_LOCATION_FIELD(location);
 }
 
-static void
-_outConstraint(StringInfo str, const Constraint *node)
-{
-   WRITE_NODE_TYPE("CONSTRAINT");
-
-   WRITE_STRING_FIELD(conname);
-   WRITE_BOOL_FIELD(deferrable);
-   WRITE_BOOL_FIELD(initdeferred);
-   WRITE_LOCATION_FIELD(location);
-
-   appendStringInfoString(str, " :contype ");
-   switch (node->contype)
-   {
-   case CONSTR_NULL:
-   appendStringInfoString(str, "NULL");
-   break;
-
-   case CONSTR_NOTNULL:
-   appendStringInfoString(str, "NOT_NULL");
-   WRITE_NODE_FIELD(keys);
-   WRITE_INT_FIELD(inhcount);
-   WRITE_BOOL_FIELD(is_no_inherit);
-   WRITE_BOOL_FIELD(skip_validation);
-   WRITE_BOOL_FIELD(initially_valid);
-   break;
-
-   case CONSTR_DEFAULT:
-   appendStringInfoString(str, "DEFAULT");
-   WRITE_NODE_FIELD(raw_expr);
-   WRITE_STRING_FIELD(cooked_expr);
-   break;
-
-   case CONSTR_IDENTITY:
-   appendStringInfoString(str, "IDENTITY");
-   WRITE_NODE_FIELD(options);
-   WRITE_CHAR_FIELD(generated_when);
-   break;
-
-   case CONSTR_GENERATED:
-   appendStringInfoString(str, "GENERATED");
-   WRITE_NODE_FIELD(raw_expr);
-   WRITE_STRING_FIELD(cooked_expr);
-   WRITE_CHAR_FIELD(generated_when);
-   break;
-
-   case CONSTR_CHECK:
-   appendStringInfoString(str, "CHECK");
-   WRITE_BOOL_FIELD(is_no_inherit);
-   WRITE_NODE_FIELD(raw_expr);
-   WRITE_STRING_FIELD(cooked_expr);
-   WRITE_BOOL_FIELD(skip_validation);
-   WRITE_BOOL_FIELD(initially_valid);
-   break;
-
-   

Re: System username in pg_stat_activity

2024-01-15 Thread Bertrand Drouvot
Hi,

On Fri, Jan 12, 2024 at 05:16:53PM +0100, Magnus Hagander wrote:
> On Thu, Jan 11, 2024 at 5:55 PM Bertrand Drouvot
>  wrote:
> >
> > I'm wondering if it would make sense to populate it for parallel workers 
> > too.
> > I think it's doable thanks to d951052, but I'm not sure it's worth it (one 
> > could
> > join based on the leader_pid though). OTOH that would be consistent with
> > how the SYSTEM_USER behaves with parallel workers (it's populated).
> 
> I guess one could conceptually argue that "authentication happens int
> he leader". But we do populate it with the other user records, and
> it'd be weird if this one was excluded.
> 
> The tricky thing is that pgstat_bestart() is called long before we
> deserialize the data. But from what I can tell it should be safe to
> change it per the attached? That should be AFAICT an extremely short
> window of time longer before we report it, not enough to matter.

Thanks! Yeah, that seems reasonable to me. Also, I think we should remove the
"MyProcPort" test here then (looking at v3):

+   if (MyProcPort && MyClientConnectionInfo.authn_id)
+       strlcpy(lbeentry.st_auth_identity, MyClientConnectionInfo.authn_id, NAMEDATALEN);
+   else
+       MemSet(&lbeentry.st_auth_identity, 0, sizeof(lbeentry.st_auth_identity));

to get the st_auth_identity propagated to the parallel workers.
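
For example, once propagated, something like the following should show the
identity for parallel workers as well (column name tentative, as discussed in
this thread):

SELECT pid, leader_pid, usename, auth_identity
FROM pg_stat_activity
WHERE backend_type IN ('client backend', 'parallel worker');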

> >
> > Same remark regarding the parallel workers case +:
> >
> > - Would it be better to use the `name` datatype for auth_identity?
> 
> I've been going back and forth. And I think my conclusion is that it's
> not a postgres identifier, so it shouldn't be. See the earlier
> discussion, and for example that that's what we do for cert names when
> SSL is used.

Yeah, Okay let's keep text then.

> 
> > - what about "Contains the same value as the identity part in <xref linkend="system-user" />"?

Not sure, but looks like you missed this comment?

> >
> > +   /*
> > +* Trust doesn't set_authn_id(), but we still need 
> > to store the
> > +* auth_method
> > +*/
> > +   MyClientConnectionInfo.auth_method = uaTrust;
> >
> > +1, I think it is useful here to provide "trust" and not a NULL value in the
> > context of this patch.
> 
> Yeah, that's probably "independently correct", but actually useful here.

+1

> > +# Users with md5 auth should show both auth method and name in 
> > pg_stat_activity
> >
> > what about "show both auth method and identity"?
> 
> Good spot, yeah, I changed it over to identity everywhere else so it
> should be here as well.

Did you forget to share the new revision (aka v4)? I can only see the
"reorder_parallel_worker_bestart.patch" attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add PQsendSyncMessage() to libpq

2024-01-15 Thread Alvaro Herrera
On 2024-Jan-15, Michael Paquier wrote:

Looks good!  Just some small notes,

> +/*
> + * Wrapper for PQpipelineSync and PQsendPipelineSync.
>   *
>   * It's legal to start submitting more commands in the pipeline immediately,
>   * without waiting for the results of the current pipeline. There's no need 
> to

the new function pqPipelineSyncInternal is not a wrapper for these other
two functions -- the opposite is true actually.  We tend to use the term
"workhorse" or "internal workhorse" for this kind of thing.

In the docs, after this patch we have

- PQpipelineSync
- PQsendFlushRequest
- PQsendPipelineSync

Wouldn't it make more sense to add the new function between the two
existing ones instead?


Looking again at the largish comment that's now atop
pqPipelineSyncInternal(), I think most of it should be removed -- these
things should be explained in the SGML docs, and I think they are, in
the "Using Pipeline Mode" section.  We can just have the lines this
patch is adding.
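
For illustration, the caller-visible difference between the two entry points
(a sketch; error handling omitted and the connection assumed to already be in
pipeline mode):

PQsendQueryParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);
PQsendPipelineSync(conn);	/* queues a sync point, does not flush */
/* ... queue more work ... */
PQpipelineSync(conn);		/* queues a sync point and flushes it */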

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu.  Five minutes later I realize that it's also talking
about food" (Donald Knuth)




Re: POC: GROUP BY optimization

2024-01-15 Thread Andrei Lepikhov

On 15/1/2024 13:42, Richard Guo wrote:


On Mon, Jan 15, 2024 at 8:20 AM Alexander Korotkov wrote:


Thank you for providing the test case relevant for this code change.
The revised patch incorporating this change is attached.  Now the
patchset looks good to me.  I'm going to push it if there are no
objections.


Seems I'm late for the party.  Can we hold for several more days?  I'd
like to have a review on this patch.
Get on board! It looks like this feature needs as much review as 
possible (likewise SJE).


--
regards,
Andrei Lepikhov
Postgres Professional





Re: Add test module for Table Access Method

2024-01-15 Thread Jelte Fennema-Nio
On Thu, 28 Sept 2023 at 03:08, Michael Paquier  wrote:
> dummy_index_am has included from the start additional coverage for the
> various internal add_*_reloption routines, that were never covered in
> the core tree.  Except if I am missing something, I am not seeing some
> of the extra usefulness for the patch you've sent here.

When trying to implement a table access method in the past I remember
very well that I was having a really hard time finding an example of
one. I remember seeing the dummy_index_am module and being quite
disappointed that there wasn't a similar one for table access methods.
I believe that I eventually found blackhole_am, but it took me quite a
bit of mailing list spelunking to get there. So I think purely for
documentation purposes this addition would already be useful.
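
For anyone landing here from a similar search, the handler boilerplate itself
is tiny; a minimal sketch, loosely modeled on the blackhole_am approach (the
callback list is elided and the names are illustrative):

#include "postgres.h"
#include "access/tableam.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(mini_tableam_handler);

static const TableAmRoutine mini_tableam_methods = {
	.type = T_TableAmRoutine,
	/* .slot_callbacks, .scan_begin, .tuple_insert, etc. go here */
};

Datum
mini_tableam_handler(PG_FUNCTION_ARGS)
{
	PG_RETURN_POINTER(&mini_tableam_methods);
}

followed by CREATE ACCESS METHOD mini_tableam TYPE TABLE HANDLER
mini_tableam_handler; the hard part is filling in the callbacks.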




Re: On login trigger: take three

2024-01-15 Thread Daniel Gustafsson
> On 13 Jan 2024, at 17:00, Alexander Lakhin  wrote:

> I suspected that this failure was caused by autovacuum, and I've managed to
> reproduce it without Valgrind or slowing down a machine.

This might be due to the fact that the cleanup codepath which removes the flag
when there is no login event trigger doesn't block on locking pg_database, to
avoid stalling connections.  There are no guarantees about when the flag is
cleared, so a test like this will always have the potential to be flaky, it
seems.
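
For reference, the kind of trigger that sets the flag is (a minimal sketch,
names illustrative):

CREATE FUNCTION on_login_proc() RETURNS event_trigger AS $$
BEGIN
	RAISE NOTICE 'login';
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER olt ON login EXECUTE FUNCTION on_login_proc();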

--
Daniel Gustafsson





Re: Synchronizing slots from primary to standby

2024-01-15 Thread Bertrand Drouvot
Hi,

On Sat, Jan 13, 2024 at 10:05:52AM +0530, Amit Kapila wrote:
> On Fri, Jan 12, 2024 at 12:07 PM Bertrand Drouvot
>  wrote:
> > Maybe the "best" approach would be to have a way to detect that a slot has 
> > been
> > re-created on the primary (but that would mean rely on more than the slot 
> > name
> > to "identify" a slot and probably add a new member to the struct to do so).
> >
> 
> Right, I also thought so but not sure further complicating the slot
> machinery is worth detecting this case explicitly. If we see any
> problem with the idea discussed then we may need to think something
> along those lines.

Yeah, let's see. On one side that would require extra work but on the other side
that would also probably simplify (and less bug prone in the mid-long term?)
other parts of the code.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: Documentation to upgrade logical replication cluster

2024-01-15 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for updating the patch!

> > 7.
> > ```
> > +
> > +dba@node1:/opt/PostgreSQL/postgres//bin$ pg_ctl -D
> /opt/PostgreSQL/pub_data stop -l logfile
> > +
> > ```
> >
> > Hmm. I thought you did not have to show the current directory. You were in 
> > the
> > bin dir, but it is not our requirement, right?
> 
> I kept this just to show the version being used
>

Hmm, but by default the current directory is not in PATH. So this example
looks strange to me.

Below lines are my comments for v2 patch.

01.

```
+   
+Upgrade logical replication clusters
+
+
+ Refer logical replication 
upgrade section
+ for details on upgrading logical replication clusters.
+
+
+   
```

I think we do not have to write it as one of the steps. I think we can move it
to the "Usage" part and modify it like:

This page focuses only on nodes which are not logical replication participants.
See  for upgrading such nodes.

02.

```
   with the primary.)  Only logical slots on the primary are copied to the
   new standby, but other slots on the old standby are not copied so must
   be recreated manually.
```

A description of logical slots still remains. If you want to keep it, we must
say that it applies only to PG17+.

03.

I think the numbering seems a bit confusing. sectX sgml tags should be used in
this case. How about formatting like below?

Upgrade (sect1)
--- Prerequisites (sect2)
  --- For upgrading a publisher node  (sect3)
  --- For upgrading a subscriber node  (sect3)
--- Examples (sect2)
  --- Two-node logical replication cluster  (sect3)
  --- Cascaded logical replication cluster  (sect3)
  --- Two-node circular logical replication cluster (sect3)

04. 

Missing introduction in the head of this section. E.g., 

Both publishers and subscribers can be upgraded, but there are some notes.
Before reading this section, you should read  page.

05.

```
+   
+Prepare for publisher upgrades
...
```

Should we describe on this page that publications can be upgraded in any
version?

06.

```
+   
+Prepare for subscriber upgradesnode1.
```

Both "sub1_node1_node2" and "sub2_node1_node2" must be rendered.

09.

```
+   
+
+ Initialize data1_upgraded instance by using the required newer
+ version.
+
```

Missing rendering. All similar paragraphs must be fixed.

10.

```
+ On node2, create any tables that were created in
+ the upgraded publisher node1 server between
+ 
+ when the subscriptions where disabled in 
node2
+ and now, e.g.:
```

a.

I think the link is not correct; it should refer to Step 6. Can we add the step 
number?
All similar paragraphs must be fixed.

b.

Not sure, but s/where disabled/were disabled/ ?
All similar paragraphs must be fixed.

11.

```
+
+ Refresh the node2 subscription's publications using
+ ALTER 
SUBSCRIPTION ... REFRESH PUBLICATION,
+ e.g.:
+
+node2=# ALTER SUBSCRIPTION sub1_node1_node2 REFRESH PUBLICATION;
+ALTER SUBSCRIPTION
+node2=# ALTER SUBSCRIPTION sub2_node1_node2 REFRESH PUBLICATION;
+ALTER SUBSCRIPTION
+
+
```

Not sure, but should we clarify that copy_data must be on?
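
For the record, spelling the option out would look like this; whether the docs
should do so is the open question:

node2=# ALTER SUBSCRIPTION sub1_node1_node2 REFRESH PUBLICATION WITH (copy_data = true);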

12.

```
+   has two subscriptions sub1_node1_node2 and sub2_node1_node2 which is
+   subscribing the changes from node1. The
+   node3 has two subscriptions sub1_node2_node3 and
+   sub2_node2_node3 which is subscribing the changes from
```

Name of subscriptions must be rendered.

13.

```
+
+ On node1, Create any tables that were created in
+ node2 between 
+ when the subscriptions where disabled in 
node2
+ and now, e.g.:
+
+node1=# CREATE TABLE distributors (did integer PRIMARY KEY, name varchar(40));
+CREATE TABLE
+
+
...
+
+ On node2, Create any tables that were created in
+ the upgraded node1 between 
+ when the subscriptions where disabled in 
node1
+ and now, e.g.:
+
+node2=# CREATE TABLE distributors (did integer PRIMARY KEY, name varchar(40));
+CREATE TABLE
+
+
```

Same tables were created, they must have another name.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: On login trigger: take three

2024-01-15 Thread Daniel Gustafsson
> On 13 Jan 2024, at 17:00, Alexander Lakhin  wrote:

> DROP EVENT TRIGGER olt;
> SELECT dathasloginevt FROM pg_database WHERE datname= current_database();
>  dathasloginevt
> 
>  t
> (1 row)
> 
> Although after reconnection (\c, as done in the event_trigger_login test),
> this query returns 'f'.

This is by design in the patch.  The flag isn't changed on DROP, it is only
cleared on logins iff there is no login event trigger.  The description in the
docs is worded to indicate that it shouldn't be taken as truth for monitoring
purposes:

"This flag is used internally by PostgreSQL and should not be manually
altered or read for monitoring purposes."

--
Daniel Gustafsson





Re: Add PQsendSyncMessage() to libpq

2024-01-15 Thread Jelte Fennema-Nio
On Mon, 15 Jan 2024 at 08:50, Michael Paquier  wrote:
> Yeah, I'll go with that after a second look.  Attached is what I am
> finishing with, and I have reproduced some numbers with the pgbench
> metacommand mentioned upthread, which is reeeaaally nice.

Code looks good to me. But a couple of small notes on the test.

+ /* second pipeline */
+ if (PQsendQueryParams(conn, "SELECT $1", 1, dummy_param_oids,
+ dummy_params, NULL, NULL, 0) != 1)
+   pg_fatal("dispatching first SELECT failed: %s",
PQerrorMessage(conn));

Error message should be "second SELECT" not "first SELECT". Same note
for the error message in the third pipeline, where it still says
"second SELECT".


+ res = PQgetResult(conn);
+ if (res == NULL)
+   pg_fatal("PQgetResult returned null when there's a
pipeline item: %s",
+PQerrorMessage(conn));
+
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+   pg_fatal("Unexpected result code %s from first pipeline item",
+PQresStatus(PQresultStatus(res)));
+ PQclear(res);
+ res = NULL;
+
+ if (PQgetResult(conn) != NULL)
+   pg_fatal("PQgetResult returned something extra after first result");

same issue: s/first/second/g (and s/second/third/g for the existing
part of the test).




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-15 Thread Alexander Lakhin

Hello Michael and Bertrand,

15.01.2024 06:59, Michael Paquier wrote:

The WAL records related to standby snapshots are playing a lot with
the randomness of the failures we are seeing.  Alexander has mentioned
offlist something else: using SIGSTOP on the bgwriter to avoid these
records and make the test more stable.  That would not be workable for
Windows, but I could live with that knowing that logical decoding for
standbys has no platform-specific tweak for the code paths we're
testing here; the limitation would be to skip the test for
$windows_os.


I've found a way to implement pause/resume for Windows processes and it
looks acceptable to me if we can afford "use Win32::API;" on Windows
(maybe the test could be skipped only if this perl module is absent).
Please look at the PoC patch for the test 035_standby_logical_decoding.
(The patched test passes for me.)

If this approach looks promising to you, maybe we could add a submodule to
perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
019_replslot_limit) as well.

Personally I think that having such functionality for use in tests
might be useful not only to avoid some "problematic" behaviour but also to
test the opposite cases.
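
(On POSIX systems the equivalent pause/resume is just signals; a sketch:

kill 'STOP', $pid;	# pause the process
# ... run the WAL-sensitive part of the test ...
kill 'CONT', $pid;	# resume it

so only the Windows side needs Win32::API.)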


While thinking about that, a second idea came into my mind: a
superuser-settable developer GUC to disable such WAL records to be
generated within certain areas of the test.  This requires a small
implementation, but nothing really huge, while being portable
everywhere.  And it is not the first time I've been annoyed with these
records when wanting a predictable set of WAL records for some test
case.


I see that the test in question exists in REL_16_STABLE, which means that a
new GUC would not help there?

Best regards,
Alexanderdiff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..8b08c7b5c7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,6 +10,8 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+use Win32::API;
+
 my ($stdin, $stdout, $stderr,
 	$cascading_stdout, $cascading_stderr, $subscriber_stdin,
 	$subscriber_stdout, $subscriber_stderr, $ret,
@@ -28,6 +30,28 @@ my $res;
 my $primary_slotname = 'primary_physical';
 my $standby_physical_slotname = 'standby_physical';
 
+my $OpenProcess = new Win32::API("kernel32", "HANDLE OpenProcess(DWORD dwDesiredAccess, BOOL bInheritHandle, DWORD dwProcessId )");
+my $NtSuspendProcess = new Win32::API("ntdll", 'LONG NtSuspendProcess(HANDLE hProcess)');
+my $NtResumeProcess = new Win32::API("ntdll", 'LONG NtResumeProcess(HANDLE hProcess)');
+my $CloseHandle = new Win32::API("kernel32", "BOOL CloseHandle(HANDLE hObject)");
+my $PROCESS_ALL_ACCESS = 0x001F0FFF;
+
+sub suspend_process
+{
+	my $pid = shift;
+	my $hProcess = $OpenProcess->Call($PROCESS_ALL_ACCESS, 0, $pid);
+	$NtSuspendProcess->Call($hProcess);
+	$CloseHandle->Call($hProcess);
+}
+
+sub resume_process
+{
+	my $pid = shift;
+	my $hProcess = $OpenProcess->Call($PROCESS_ALL_ACCESS, 0, $pid);
+	$NtResumeProcess->Call($hProcess);
+	$CloseHandle->Call($hProcess);
+}
+
 # Fetch xmin columns from slot's pg_replication_slots row, after waiting for
 # given boolean condition to be true to ensure we've reached a quiescent state.
 sub wait_for_xmins
@@ -456,6 +480,9 @@ is($result, qq(10), 'check replicated inserts after subscription on standby');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 $node_subscriber->stop;
 
+my $bgwriterpid = $node_primary->safe_psql('postgres', "SELECT pid FROM pg_stat_activity WHERE backend_type = 'background writer'");
+suspend_process($bgwriterpid);
+
 ##
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
@@ -690,6 +717,7 @@ $node_primary->safe_psql('testdb', qq[INSERT INTO prun VALUES (1, 'A');]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'B';]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'C';]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
+$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(25);]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 
 $node_primary->wait_for_replay_catchup($node_standby);
@@ -709,6 +737,8 @@ check_pg_recvlogical_stderr($handle,
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
+resume_process($bgwriterpid);
+
 ##
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 5: incorrect wal_level on primary.


Re: Create shorthand for including all extra tests

2024-01-15 Thread Nazir Bilal Yavuz
Hi,

On Wed, 10 Jan 2024 at 23:48, Peter Eisentraut  wrote:
>
> On 05.09.23 19:26, Nazir Bilal Yavuz wrote:
> > Thanks for the feedback! I updated the patch, 'needs-private-lo'
> > option enables kerberos, ldap, load_balance and ssl extra tests now.
>
> As was discussed, I don't think "needs private lo" is the only condition
> for these tests.  At least kerberos and ldap also need extra software
> installed, and load_balance might need editing the system's hosts file.
> So someone would still need to familiarize themselves with these tests
> individually before setting a global option like this.
>
> Also, if we were to create test groupings like this, I think the
> implementation should be different.  The way you have it, there is a
> sort of central registry of all affected tests in
> src/test/perl/PostgreSQL/Test/Utils.pm and a mapping of groups to tests.
>   I would prefer a more decentralized approach where each test decides
> on its own whether to run, with pseudo-code conditionals like
>
> if (!(PG_TEST_EXTRA contains "ldap" or PG_TEST_EXTRA contains
> "needs-private-lo"))
>skip_all
>
> Anyway, at the moment, I don't see a sensible way to group these things
> beyond what we have now (effectively, "ldap" is already a group, because
> it affects more than one test suite).  Right now, we have six possible
> values, which is probably just about doable to keep track of manually.
> If we get a lot more, then we need to look into this again, but maybe
> then we'll also have more patterns to group things around.

I see your point. It looks like the best option is to reevaluate this
if there are more PG_TEST_EXTRA options.
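
For the archives, the decentralized per-test check sketched above could look
like this in Perl (option names as discussed, otherwise illustrative):

my %extra = map { $_ => 1 } split /\s+/, ($ENV{PG_TEST_EXTRA} // '');
plan skip_all => 'PG_TEST_EXTRA does not enable this test'
  unless $extra{ldap} || $extra{'needs-private-lo'};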

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-15 Thread Bertrand Drouvot
Hi,

On Mon, Jan 15, 2024 at 01:11:26PM +0900, Michael Paquier wrote:
> On Sun, Jan 14, 2024 at 11:08:39PM -0500, Tom Lane wrote:
> > Michael Paquier  writes:
> >> While thinking about that, a second idea came into my mind: a
> >> superuser-settable developer GUC to disable such WAL records to be
> >> generated within certain areas of the test.  This requires a small
> >> implementation, but nothing really huge, while being portable
> >> everywhere.  And it is not the first time I've been annoyed with these
> >> records when wanting a predictible set of WAL records for some test
> >> case.
> > 
> > Hmm ... I see what you are after, but to what extent would this mean
> > that what we are testing is not our real-world behavior?
> 
> Don't think so.  We don't care much about these records when it comes
> to checking slot invalidation scenarios with a predictable XID
> horizon, AFAIK.

Yeah, we want to test slot invalidation behavior, so we need to ensure that such
an invalidation occurs in the first place (which is not the case if we get a
xl_running_xacts record in the middle).

OTOH I also see Tom's point: for example, I think we'd not have discovered [1]
(outside of the field) with such a developer GUC in place.

We did a few things in this thread, so to sum up what we've discovered:

- a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
- we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
(proposal in v6 attached)
- we need a way to control how frequently xl_running_xacts records are emitted (to
ensure they are not triggered in the middle of an active slot invalidation test).

I'm not sure it's possible to address Tom's concern and keep the test 
"predictable".

So, I think I'd vote for Michael's proposal to implement a superuser-settable
developer GUC (as sending a SIGSTOP to the bgwriter (and bypassing $windows_os)
would still not address Tom's concern anyway).

Another option would be to "sacrifice" the full predictability of the test (in
favor of real-world behavior testing)?

[1]: 
https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From a35a308626b6b61c3994531cbf89fe835f4842c2 Mon Sep 17 00:00:00 2001
From: bdrouvot 
Date: Tue, 9 Jan 2024 05:08:35 +
Subject: [PATCH v6] Fix 035_standby_logical_decoding.pl race condition

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation on
the standby.
---
 .../t/035_standby_logical_decoding.pl | 59 ++-
 1 file changed, 30 insertions(+), 29 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..9bfa8833b5 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -238,6 +238,25 @@ sub check_for_invalidation
 	) or die "Timed out waiting confl_active_logicalslot to be updated";
 }
 
+# Launch $sql and wait for a new snapshot that has a newer horizon before
+# doing the vacuum with $vac_option on $to_vac.
+sub wait_until_vacuum_can_remove
+{
+	my ($vac_option, $sql, $to_vac) = @_;
+
+	# get the current xid horizon
+	my $xid_horizon = $node_primary->safe_psql('testdb', qq[select pg_snapshot_xmin(pg_current_snapshot());]);
+
+	# launch our sql
+	$node_primary->safe_psql('testdb', qq[$sql]);
+
+	$node_primary->poll_query_until('testdb',
+		"SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0")
+	  or die "new snapshot does not have a newer horizon";
+
+	$node_primary->safe_psql('testdb', qq[VACUUM $vac_option verbose $to_vac;
+		  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal;]);
+}
 
 # Initialize primary node
 
@@ -248,6 +267,7 @@ $node_primary->append_conf(
 wal_level = 'logical'
 max_replication_slots = 4
 max_wal_senders = 4
+autovacuum = off
 });
 $node_primary->dump_info;
 $node_primary->start;
@@ -468,13 +488,8 @@ reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
 	0, 1);
 
 # This should trigger the conflict
-$node_primary->safe_psql(
-	'testdb', qq[
-  CREATE TABLE conflict_test(x integer, y text);
-  DROP TABLE conflict_test;
-  VACUUM full pg_class;
-  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
-]);
+wait_until_vacuum_can_remove('full', 'CREATE TABLE conflict_test(x integer, y text);
+ DROP TABLE conflict_test;', 'pg_class');
 
 $node_primary->wait_for_replay_catchup($node_standby);
 
@@ -550,13 +565,8 @@ reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 'row_removal_',
 	0, 1);
 
 # This should trigger the conflict
-$node_primary->safe_psql(
-	'testdb', qq[
-  CREATE