date:20240131

Sorry for a minor correction, but..

At Thu, 01 Feb 2024 14:53:57 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Ah.. Understood. "NaN or Infinity" cannot be used in those
> cases. Additionally, for jpiBoolean and jpiBigint, we lack the text
> representation of the value.

This "Additionally" was merely left in by mistake.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: More new SQL/JSON item methods

At Thu, 1 Feb 2024 09:22:22 +0530, Jeevan Chalke 
 wrote in 
> On Thu, Feb 1, 2024 at 7:24 AM Kyotaro Horiguchi 
> wrote:
> 
> > At Thu, 01 Feb 2024 10:49:57 +0900 (JST), Kyotaro Horiguchi <
> > horikyota@gmail.com> wrote in
> > > By the way, while playing with this feature, I noticed the following
> > > error message:
> > >
> > > > select jsonb_path_query('1.1' , '$.boolean()');
> > > > ERROR:  numeric argument of jsonpath item method .boolean() is out of
> > range for type boolean
> > >
> > > The error message seems a bit off to me. For example, "argument '1.1'
> > > is invalid for type [bB]oolean" seems more appropriate for this
> > > specific issue. (I'm not ceratin about our policy on the spelling of
> > > Boolean..)
> >
> > Or, following our general convention, it would be spelled as:
> >
> > 'invalid argument for type Boolean: "1.1"'
> >
> 
> jsonpath way:

Hmm. I see.

> ERROR:  argument of jsonpath item method .boolean() is invalid for type
> boolean
> 
> or, if we add input value, then
> 
> ERROR:  argument "1.1" of jsonpath item method .boolean() is invalid for
> type boolean
> 
> And this should work for all the error types, like out of range, not valid,
> invalid input, etc, etc. Also, we don't need separate error messages for
> string input as well, which currently has the following form:
> 
> "string argument of jsonpath item method .%s() is not a valid
> representation.."

Agreed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: More new SQL/JSON item methods

At Thu, 1 Feb 2024 09:19:40 +0530, Jeevan Chalke 
 wrote in 
> > As Tom suggested, given that similar situations have already been
> > disregarded elsewhere, worrying about excessively long input strings
> > in this specific instance won't notably improve safety in total.
> >
> > > Also, for non-string input, we need to convert numeric to string just for
> > > the error message, which seems overkill.
> >
> > As I suggested and you seem to agree, using literally "Nan or
> > Infinity" would be sufficient.
> >
> 
> I am more concerned about .bigint() and .integer(). We can have errors when
> the numeric input is out of range, but not NaN or Infinity. At those
> places, we need to convert numeric to string to put that value into the
> error.
> Do you mean we should still put "Nan or Infinity" there?
> 
> This is the case:
>  select jsonb_path_query('12345678901', '$.integer()');
>  ERROR:  numeric argument of jsonpath item method .integer() is out of
> range for type integer

Ah.. Understood. "NaN or Infinity" cannot be used in those
cases. Additionally, for jpiBoolean and jpiBigint, we lack the text
representation of the value.

By a quick grepping, I found that the following functions call
numeric_out to convert the jbvNumeric values back into text
representation.

JsonbValueAstext, populate_scalar, iterate_jsonb_values,
executeItemOptUnrwapTarget, jsonb_put_escaped_value

The function iterate_jsonb_values(), in particular, iterates over a
values array, calling numeric_out on each iteration.

The following functions re-converts the converted numeric into another type.

jsonb_int[248]() converts the numeric value into int2 using numeric_int[248]().
jsonb_float[48]() converts it into float4 using numeric_float[48]().

Given these facts, it seems more efficient for jbvNumber to retain the
original scalar value, converting it only when necessary. If needed,
we could also add a numeric struct member as a cache for better
performance. I'm not sure we refer the values more than once, though.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Synchronizing slots from primary to standby

Here are some review comments for v740001.

==
src/sgml/logicaldecoding.sgml

1.
+   
+Replication Slot Synchronization
+
+ A logical replication slot on the primary can be synchronized to the hot
+ standby by enabling the failover option during slot
+ creation and setting
+ enable_syncslot
+ on the standby. For the synchronization
+ to work, it is mandatory to have a physical replication slot between the
+ primary and the standby, and
+ hot_standby_feedback
+ must be enabled on the standby. It is also necessary to specify a valid
+ dbname in the
+ primary_conninfo
+ string, which is used for slot synchronization and is ignored
for streaming.
+

IMO we don't need to repeat that last part ", which is used for slot
synchronization and is ignored for streaming." because that is a
detail about the primary_conninfo GUC, and the same information is
already described in that GUC section.

==

2. ALTER_REPLICATION_SLOT slot_name ( option [, ...] ) #

  
-  If true, the slot is enabled to be synced to the standbys.
+  If true, the slot is enabled to be synced to the standbys
+  so that logical replication can be resumed after failover.
  

This also should have the sentence "The default is false.", e.g. the
same as the same option in CREATE_REPLICATION_SLOT says.

==
synchronize_one_slot

3.
+ /*
+ * Make sure that concerned WAL is received and flushed before syncing
+ * slot to target lsn received from the primary server.
+ *
+ * This check will never pass if on the primary server, user has
+ * configured standby_slot_names GUC correctly, otherwise this can hit
+ * frequently.
+ */
+ latestFlushPtr = GetStandbyFlushRecPtr(NULL);
+ if (remote_slot->confirmed_lsn > latestFlushPtr)

BEFORE
This check will never pass if on the primary server, user has
configured standby_slot_names GUC correctly, otherwise this can hit
frequently.

SUGGESTION (simpler way to say the same thing?)
This will always be the case unless the standby_slot_names GUC is not
correctly configured on the primary server.

~~~

4.
+ /* User created slot with the same name exists, raise ERROR. */

/User created/User-created/

~~~

5. synchronize_slots, and also drop_obsolete_slots

+ /*
+ * Use shared lock to prevent a conflict with
+ * ReplicationSlotsDropDBSlots(), trying to drop the same slot while
+ * drop-database operation.
+ */

(same code comment is in a couple of places)

SUGGESTION (while -> during, etc.)

Use a shared lock to prevent conflicting with
ReplicationSlotsDropDBSlots() trying to drop the same slot during a
drop-database operation.

~~~

6. validate_parameters_and_get_dbname

strcmp() just for the empty string "" might be overkill.

6a.
+ if (PrimarySlotName == NULL || strcmp(PrimarySlotName, "") == 0)

SUGGESTION
if (PrimarySlotName == NULL || *PrimarySlotName == '\0')

~~

6b.
+ if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0)

SUGGESTION
if (PrimaryConnInfo == NULL || *PrimaryConnInfo == '\0')

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-01-31 Thread Ajin Cherian

On Tue, Jan 30, 2024 at 11:53 PM shveta malik 
wrote:

> On Tue, Jan 30, 2024 at 4:06 PM shveta malik 
> wrote:
> >
> > PFA v73-0001 which addresses the above comments. Other patches will be
> > rebased and posted after pushing this one.
>
> Since v73-0001 is pushed, PFA  rest of the patches. Changes are:
>
> 1) Rebased the patches.
> 2) Ran pg_indent on all.
> 3) patch001: Updated logicaldecoding.sgml for dbname requirement in
> primary_conninfo for slot-synchronization.
>
> thanks
> Shveta
>

Just to test the behaviour, I modified the code to set failover flag to
default to "true" while creating subscription and ran the regression tests.
I only saw the expected errors.
1. Make check in postgres root folder  - all failures are because of
difference when listing subscription as failover flag is now enabled. The
diff is attached for regress.

2. Make check in src/test/subscription - no failures All tests successful.
Files=34, Tests=457, 81 wallclock secs ( 0.14 usr  0.05 sys +  9.53 cusr
13.00 csys = 22.72 CPU)
Result: PASS

3. Make check in src/test/recovery - 3 failures Test Summary Report
---
t/027_stream_regress.pl (Wstat: 256 Tests: 6 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1
t/035_standby_logical_decoding.pl   (Wstat: 7424 Tests: 8 Failed: 0)
  Non-zero exit status: 29
  Parse errors: No plan found in TAP output t/
050_standby_failover_slots_sync.pl (Wstat: 7424 Tests: 5 Failed: 0)
  Non-zero exit status: 29
  Parse errors: No plan found in TAP output

3a. Analysis of t/027_stream_regress.pl - No, 027 fails with the same issue
as "make check" in postgres root folder (for which I attached the diffs).
027 is about running the standard regression tests with streaming
replication. Since the regression tests fail because listing subscription
now has failover enabled, 027 also fails in the same way with streaming
replication.

3b. Analysis of t/035_standby_logical_decoding.pl - In this test case, they
attempt to create a subscription from the subscriber to the standby
##
# Test that we can subscribe on the standby with the publication # created
on the primary.
##

Now, this fails because creating a subscription on the standby with
failover enabled will result in error:
I see the following error in the log:
2024-01-28 23:51:30.425 EST [23332] tap_sub STATEMENT:
 CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput (FAILOVER, SNAPSHOT
'nothing')
2024-01-28 23:51:30.425 EST [23332] tap_sub ERROR:  cannot create
replication slot with failover enabled on the standby I discussed this with
Shveta and she agreed that this is the expected behaviour as we don't
support failover to cascading standby yet.

3c. Analysis of t/050_standby_failover_slots_sync.pl - This is a new test
case created for this patch, and it creates a subscription without failover
enabled to make sure that the Subscription with failover disabled does not
depend on sync on standby, but this fails because we have failover enabled
by default.

In summary, I don't think these issues are actual bugs but expected
behaviour change.

regards,
Ajin Cherian
Fujitsu Australia

Re: POC, WIP: OR-clause support for indexes

2024-01-31 Thread jian he

On Wed, Jan 31, 2024 at 7:10 PM Alena Rybakina
 wrote:
>
> Hi, thank you for your review and interest in this subject.
>
> On 31.01.2024 13:15, jian he wrote:
>
> On Wed, Jan 31, 2024 at 10:55 AM jian he  wrote:
>
> based on my understanding of
> https://www.postgresql.org/docs/current/xoper-optimization.html#XOPER-COMMUTATOR
> I think you need move commutator check right after the `if
> (get_op_rettype(opno) != BOOLOID)` branch
>
> I was wrong about this part. sorry for the noise.
>
>
> I have made some changes (attachment).
> * if the operator expression left or right side type category is
> {array | domain | composite}, then don't do the transformation.
> (i am not 10% sure with composite)
>
> To be honest, I'm not sure about this check, because we check the type of 
> variable there:
>
> if (!IsA(orqual, OpExpr))
> {
> or_list = lappend(or_list, orqual);
> continue;
> }
> And below:
> if (IsA(leftop, Const))
> {
> opno = get_commutator(opno);
>
> if (!OidIsValid(opno))
> {
> /* Commuter doesn't exist, we can't reverse the order */
> or_list = lappend(or_list, orqual);
> continue;
> }
>
> nconst_expr = get_rightop(orqual);
> const_expr = get_leftop(orqual);
> }
> else if (IsA(rightop, Const))
> {
> const_expr = get_rightop(orqual);
> nconst_expr = get_leftop(orqual);
> }
> else
> {
> or_list = lappend(or_list, orqual);
> continue;
> }
>
> Isn't that enough?

alter table tenk1 add column arr int[];
set enable_or_transformation to on;
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1
WHERE arr = '{1,2,3}' or arr = '{1,2}';

the above query will not do the OR transformation. because array type
doesn't have array type.
`
scalar_type = entry->key.exprtype;
if (scalar_type != RECORDOID && OidIsValid(scalar_type))
array_type = get_array_type(scalar_type);
else
array_type = InvalidOid;
`

If either side of the operator expression is array or array related type,
we can be sure it cannot do the transformation
(get_array_type will return InvalidOid for anyarray type).
we can check it earlier, so hash related code will not be invoked for
array related types.


> Besides, some of examples (with ARRAY) works fine:
>
> postgres=# CREATE TABLE sal_emp (
> pay_by_quarter  integer[],
> pay_by_quater1 integer[]
> );
> CREATE TABLE
> postgres=# INSERT INTO sal_emp
> VALUES (
> '{1, 1, 1, 1}',
> '{1,2,3,4}');
> INSERT 0 1
> postgres=# select * from sal_emp where pay_by_quarter[1] = 1 or 
> pay_by_quarter[1]=2;
>   pay_by_quarter   | pay_by_quater1
> ---+
>  {1,1,1,1} | {1,2,3,4}
> (1 row)
>
> postgres=# explain select * from sal_emp where pay_by_quarter[1] = 1 or 
> pay_by_quarter[1]=2;
>   QUERY PLAN
> --
>  Seq Scan on sal_emp  (cost=0.00..21.00 rows=9 width=64)
>Filter: (pay_by_quarter[1] = ANY ('{1,2}'::integer[]))
> (2 rows)
>
> * if the left side of the operator expression node contains volatile
> functions, then don't do the transformation.
>
> I'm also not sure about the volatility check function, because we perform 
> such a conversion at the parsing stage, and at this stage we don't have a 
> RelOptInfo variable and especially a RestictInfo such as PathTarget.
>
 see the example in here:
https://www.postgresql.org/message-id/CACJufxGXhJ823cdAdp2Ho7qC-HZ3_-dtdj-myaAi_u9RQLn45g%40mail.gmail.com

set enable_or_transformation to on;
create or replace function retint(int) returns int as
$func$
begin raise notice 'hello';
return $1 + round(10 * random()); end
$func$ LANGUAGE plpgsql;

SELECT count(*) FROM tenk1 WHERE thousand = 42;
will return 10 rows.

SELECT count(*) FROM tenk1 WHERE thousand = 42 AND (retint(1) = 4 OR
retint(1) = 3);
this query I should return 20 notices 'hello', but now only 10.

EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1
WHERE thousand = 42 AND (retint(1) = 4 OR  retint(1) = 3);
  QUERY PLAN
--
 Aggregate
   ->  Seq Scan on tenk1
 Filter: ((thousand = 42) AND (retint(1) = ANY ('{4,3}'::integer[])))
(3 rows)

Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-01-31 Thread vignesh C

On Tue, 30 Jan 2024 at 17:22, Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, January 30, 2024 11:21 AM vignesh C  wrote:
> >
> > On Tue, 30 Jan 2024 at 07:24, Zhijie Hou (Fujitsu) 
> > wrote:
> > >
> > > On Monday, January 29, 2024 9:22 PM vignesh C 
> > wrote:
> > > >
> > > > On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin 
> > wrote:
> > > > >
> > > > > Hello hackers,
> > > > >
> > > > > After determining a possible cause for intermittent failures of
> > > > > the test subscription/031_column_list [1], I was wondering what
> > > > > makes another subscription test (014_binary) fail on the buildfarm:
> > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&d
> > > > > t=20
> > > > > 24-01-22%2001%3A19%3A03
> > > > >
> > > >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2
> > > > 02
> > > > > 4-01-14%2018%3A19%3A20
> > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt
> > > > > =202
> > > > > 3-12-21%2001%3A11%3A52
> > > > >
> > > >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2
> > > > 02
> > > > > 3-11-27%2001%3A42%3A39
> > > > >
> > > > > All those failures caused by a timeout when waiting for a message
> > > > > expected in _subscriber.log. For example, in the snakefly's case:
> > > > > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with
> > > > > custom type timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)?
> > > > > incorrect binary data format) at
> > > > /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_bin
> > > > ary.pl
> > > > line 269.
> > > > >
> > > > > _subscriber.log contains:
> > > > > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG:  statement:
> > > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > > > > 2023-12-21 01:17:46.756 UTC [40] ERROR:  could not receive
> > > > > data from WAL stream: server closed the connection unexpectedly
> > > > >  This probably means the server terminated abnormally
> > > > >  before or while processing the request.
> > > > > 2023-12-21 01:17:46.760 UTC [405057] LOG:  background worker
> > > > > "logical replication apply worker" (PID 40) exited with exit
> > > > > code 1
> > > > > 2023-12-21 01:17:46.779 UTC [532857] LOG:  logical replication
> > > > > apply worker for subscription "tsub" has started ...
> > > > >
> > > > > While _subscriber.log from a successful test run contains:
> > > > > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG:  statement:
> > > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > > > > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection:
> > > > > session time: 0:00:00.014 user=postgres database=postgres
> > > > > host=[local]
> > > > > 2024-01-26 03:49:07.558 UTC [9729:1] LOG:  logical replication
> > > > > apply worker for subscription "tsub" has started
> > > > > 2024-01-26 03:49:07.563 UTC [9731:1] LOG:  logical replication
> > > > > table synchronization worker for subscription "tsub", table
> > > > > "test_mismatching_types" has started
> > > > > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR:  incorrect binary data
> > > > > format
> > > > > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT:  COPY
> > > > > test_mismatching_types, line 1, column a
> > > > >
> > > > > In this case, "logical replication apply worker for subscription
> > > > > "tsub" has started" appears just after "ALTER SUBSCRIPTION", not 3
> > > > > minutes
> > > > later.
> > > > >
> > > > > I've managed to reproduce this failure locally by running multiple
> > > > > tests in parallel, and my analysis shows that it is caused by a
> > > > > race condition when accessing variable table_states_valid inside
> > tablesync.c.
> > > > >
> > > > > tablesync.c does the following with table_states_valid:
> > > > > /*
> > > > >   * Callback from syscache invalidation.
> > > > >   */
> > > > > void
> > > > > invalidate_syncing_table_states(Datum arg, int cacheid, uint32
> > > > > hashvalue) {
> > > > >  table_states_valid = false;
> > > > > }
> > > > > ...
> > > > > static bool
> > > > > FetchTableStates(bool *started_tx) { ...
> > > > >  if (!table_states_valid)
> > > > >  {
> > > > > ...
> > > > >  /* Fetch all non-ready tables. */
> > > > >  rstates = GetSubscriptionRelations(MySubscription->oid,
> > > > > true); ...
> > > > >  table_states_valid = true;
> > > > >  }
> > > > >
> > > > > So, when syscache invalidation occurs inside the code block "if
> > > > > (!table_states_valid)", that invalidation is effectively ignored.
> > > > >
> > > > > In a failed case I observe the following events:
> > > > > 1. logical replication apply worker performs
> > > > >   LogicalRepApplyLoop() -> process_syncing_tables() ->
> > > > >   process_syncing_tables_for_apply() -> FetchTableStates() 
> > > > > periodically.
> > > > >
> > > > > 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates
> > syscache
> > > > >   for SUBSCRIPTIONRELMAP, and that leads to callin

Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value

2024-01-31 Thread Yugo NAGATA

On Tue, 30 Jan 2024 14:57:20 +0800
jian he  wrote:

> On Tue, Jan 30, 2024 at 1:56 PM Yugo NAGATA  wrote:
> >
> > I attached the correct one, v4.
> >
> 
> +-- Test pg_column_toast_chunk_id:
> +-- Check if the returned chunk_id exists in the TOAST table
> +CREATE TABLE test_chunk_id (v1 text, v2 text);
> +INSERT INTO test_chunk_id VALUES (
> +  repeat('0123456789', 10), -- v1: small enough not to be TOASTed
> +  repeat('0123456789', 10)); -- v2: large enough to be TOASTed
> 
> select pg_size_pretty(10::bigint);
> return 98kb.
> 
> I think this is just too much, maybe I didn't consider the
> implications of compression.
> Anyway, I refactored the tests, making the toast value size be small.

Actually the data is compressed and the size is much smaller,
but I agree with you it is better not to generate large data unnecessarily.
I rewrote the test to disallow compression in the toast data using 
"ALTER TABLE ... SET STORAGE EXTERNAL". In this case, any text larger
than 2k will be TOASTed on disk without compression, and it makes the
test simple, not required to use string_agg.
> 
> I aslo refactor the doc.
> pg_column_toast_chunk_id entry will be right after pg_column_compression 
> entry.
> You can check the screenshot.

I found the document order was not changed between my patch and yours.
In both, pg_column_toast_chunk_id entry is right after 
pg_column_compression.

Here is a updated patch, v6.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
>From dfec0fdb52c6aad8846b5c984e111dc8b2985e1a Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Wed, 29 Mar 2023 09:59:25 +0900
Subject: [PATCH v6] Add pg_column_toast_chunk_id function

This function returns the chunk_id of an on-disk TOASTed value, or
NULL if the value is un-TOASTed or not on disk. This enables users to
know which columns are actually TOASTed. This function is also useful
to identify a problematic row when an error like
  "ERROR:  unexpected chunk number ... (expected ...) for toast value"
occurs.
---
 doc/src/sgml/func.sgml   | 17 
 src/backend/utils/adt/varlena.c  | 41 
 src/include/catalog/pg_proc.dat  |  3 ++
 src/test/regress/expected/misc_functions.out | 21 ++
 src/test/regress/sql/misc_functions.sql  | 23 +++
 5 files changed, 105 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6788ba8ef4..4116aaff7a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28502,6 +28502,23 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset

   
 
+  
+   
+
+ pg_column_toast_chunk_id
+
+pg_column_toast_chunk_id ( "any" )
+oid
+   
+   
+Shows the chunk_id of an on-disk
+TOASTed value. Returns NULL
+if the value is un-TOASTed or not on-disk.
+See  for details about
+TOAST.
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 543afb66e5..84d36781a4 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -5105,6 +5105,47 @@ pg_column_compression(PG_FUNCTION_ARGS)
 	PG_RETURN_TEXT_P(cstring_to_text(result));
 }
 
+/*
+ * Return the chunk_id of the on-disk TOASTed value.
+ * Return NULL if the value is unTOASTed or not on disk.
+ */
+Datum
+pg_column_toast_chunk_id(PG_FUNCTION_ARGS)
+{
+	int			typlen;
+	struct varlena *attr;
+	struct varatt_external toast_pointer;
+
+	/* On first call, get the input type's typlen, and save at *fn_extra */
+	if (fcinfo->flinfo->fn_extra == NULL)
+	{
+		/* Lookup the datatype of the supplied argument */
+		Oid			argtypeid = get_fn_expr_argtype(fcinfo->flinfo, 0);
+
+		typlen = get_typlen(argtypeid);
+		if (typlen == 0)		/* should not happen */
+			elog(ERROR, "cache lookup failed for type %u", argtypeid);
+
+		fcinfo->flinfo->fn_extra = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
+	  sizeof(int));
+		*((int *) fcinfo->flinfo->fn_extra) = typlen;
+	}
+	else
+		typlen = *((int *) fcinfo->flinfo->fn_extra);
+
+	if (typlen != -1)
+		PG_RETURN_NULL();
+
+	attr = (struct varlena *) DatumGetPointer(PG_GETARG_DATUM(0));
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		PG_RETURN_NULL();
+
+	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
+	PG_RETURN_OID(toast_pointer.va_valueid);
+}
+
 /*
  * string_agg - Concatenates values and returns string.
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 29af4ce65d..9ab71646e3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -7454,6 +7454,9 @@
 { oid => '2121', descr => 'compression method for the compressed datum',
   proname => 'pg_column_compression', provolatile => 's', prorettype => 'text',
   proargtypes => 'any', prosrc => 'pg_column_compression' },
+{ oid => '8393', descr => 'chunk ID of on-disk TOASTed value',
+  proname => 'pg

Re: Synchronizing slots from primary to standby

2024-01-31 Thread Amit Kapila

On Thu, Feb 1, 2024 at 8:15 AM Euler Taveira  wrote:
>
> On Mon, Jan 29, 2024, at 10:17 AM, Zhijie Hou (Fujitsu) wrote:
>
> Attach the V72-0001 which addressed above comments, other patches will be
> rebased and posted after pushing first patch. Thanks Shveta for helping 
> address
> the comments.
>
>
> While working on another patch I noticed a new NOTICE message:
>
> NOTICE:  changed the failover state of replication slot "foo" on publisher to 
> false
>
> I wasn't paying much attention to this thread then I start reading the 2
> patches that was recently committed. The message above surprises me because
> pg_createsubscriber starts to emit this message. The reason is that it doesn't
> create the replication slot during the CREATE SUBSCRIPTION. Instead, it 
> creates
> the replication slot with failover = false and no such option is informed
> during CREATE SUBSCRIPTION which means it uses the default value (failover =
> false). I expect that I don't see any message because it is *not* changing the
> behavior. I was wrong. It doesn't check the failover state on publisher, it
> just executes walrcv_alter_slot() and emits a message.
>
> IMO if we are changing an outstanding property on node A from node B, node B
> already knows (or might know) about that behavior change (because it is 
> sending
> the command), however, node A doesn't (unless log_replication_commands = on --
> it is not the default).
>
> Do we really need this message as NOTICE?
>

The reason for adding this NOTICE was to keep it similar to other
Notice messages in these commands like create/drop slot. However, here
the difference is we may not have altered the slot as the property is
already the same as we want to set on the publisher. So, I am not sure
whether we should follow the existing behavior or just get rid of it.
And then do we remove similar NOTICE in AlterSubscription() as well?
Normally, I think NOTICE intends to let users know if we did anything
with slots while executing subscription commands. Does anyone else
have an opinion on this point?

A related point, I think we can avoid setting the 'failover' property
in ReplicationSlotAlter() if it is not changed, the advantage is we
will avoid saving slots. OTOH, this won't be a frequent operation so
we can leave it as it is as well.

-- 
With Regards,
Amit Kapila.

Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-31 Thread Junwang Zhao

On Thu, Feb 1, 2024 at 11:56 AM Michael Paquier  wrote:
>
> On Thu, Feb 01, 2024 at 11:43:07AM +0800, Junwang Zhao wrote:
> > The first 6 rounds are like 10% better than the later 9 rounds, is this 
> > normal?
>
> Even with HEAD?  Perhaps you have some OS cache eviction in play here?
> FWIW, I'm not seeing any of that with longer runs after 7~ tries in a
> loop of 15.

Yeah, with HEAD. I'm on ubuntu 22.04, I did not change any gucs, maybe I should
set a higher shared_buffers? But I dought that's related ;(


> --
> Michael



-- 
Regards
Junwang Zhao

Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-31 Thread Michael Paquier

On Thu, Feb 01, 2024 at 11:43:07AM +0800, Junwang Zhao wrote:
> The first 6 rounds are like 10% better than the later 9 rounds, is this 
> normal?

Even with HEAD?  Perhaps you have some OS cache eviction in play here?
FWIW, I'm not seeing any of that with longer runs after 7~ tries in a
loop of 15.
--
Michael


signature.asc
Description: PGP signature

Re: More new SQL/JSON item methods

2024-01-31 Thread Jeevan Chalke

On Thu, Feb 1, 2024 at 7:24 AM Kyotaro Horiguchi 
wrote:

> At Thu, 01 Feb 2024 10:49:57 +0900 (JST), Kyotaro Horiguchi <
> horikyota@gmail.com> wrote in
> > By the way, while playing with this feature, I noticed the following
> > error message:
> >
> > > select jsonb_path_query('1.1' , '$.boolean()');
> > > ERROR:  numeric argument of jsonpath item method .boolean() is out of
> range for type boolean
> >
> > The error message seems a bit off to me. For example, "argument '1.1'
> > is invalid for type [bB]oolean" seems more appropriate for this
> > specific issue. (I'm not ceratin about our policy on the spelling of
> > Boolean..)
>
> Or, following our general convention, it would be spelled as:
>
> 'invalid argument for type Boolean: "1.1"'
>

jsonpath way:

ERROR:  argument of jsonpath item method .boolean() is invalid for type
boolean

or, if we add input value, then

ERROR:  argument "1.1" of jsonpath item method .boolean() is invalid for
type boolean

And this should work for all the error types, like out of range, not valid,
invalid input, etc, etc. Also, we don't need separate error messages for
string input as well, which currently has the following form:

"string argument of jsonpath item method .%s() is not a valid
representation.."


Thanks


> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


-- 
Jeevan Chalke

*Principal, ManagerProduct Development*



edbpostgres.com

Re: Synchronizing slots from primary to standby

2024-01-31 Thread Amit Kapila

On Wed, Jan 31, 2024 at 9:20 PM Masahiko Sawada  wrote:
>
> On Wed, Jan 31, 2024 at 7:42 PM Amit Kapila  wrote:
> >
> >
> > Considering my previous where we don't want to restart for a required
> > parameter change, isn't it better to avoid repeated restart (say when
> > the user gave an invalid dbname)? BTW, I think this restart interval
> > is added based on your previous complaint [1].
>
> I think it's useful that the slotsync worker restarts immediately when
> a required parameter is changed but waits to restart when it exits
> with an error. IIUC the apply worker does so; if it restarts due to a
> subscription parameter change, it resets the last-start time so that
> the launcher will restart it without waiting.
>

Agreed, this idea sounds good to me.

> >
> > >
> > > ---
> > > When I dropped a database on the primary that has a failover slot, I
> > > got the following logs on the standby:
> > >
> > > 2024-01-31 17:25:21.750 JST [1103933] FATAL:  replication slot "s" is
> > > active for PID 1103935
> > > 2024-01-31 17:25:21.750 JST [1103933] CONTEXT:  WAL redo at 0/3020D20
> > > for Database/DROP: dir 1663/16384
> > > 2024-01-31 17:25:21.751 JST [1103930] LOG:  startup process (PID
> > > 1103933) exited with exit code 1
> > >
> > > It seems that because the slotsync worker created the slot on the
> > > standby, the slot's active_pid is still valid.
> > >
> >
> > But we release the slot after sync. And we do take a shared lock on
> > the database to make the startup process wait for slotsync. There is
> > one gap which is that we don't reset active_pid for temp slots in
> > ReplicationSlotRelease(), so for temp slots such an error can occur
> > but OTOH, we immediately make the slot persistent after sync. As per
> > my understanding, it is only possible to get this error if the initial
> > sync doesn't happen and the slot remains temporary. Is that your case?
> > How did reproduce this?
>
> I created a failover slot manually on the primary and dropped the
> database where the failover slot is created. So this would not happen
> in normal cases.
>

Right, it won't happen in normal cases (say for walsender). This can
happen in some cases even without this patch as noted in comments just
above active_pid check in ReplicationSlotsDropDBSlots(). Now, we need
to think whether we should just update the comments above active_pid
check to explain this case or try to engineer some solution for this
not-so-common case. I guess if we want a solution we need to stop
slotsync worker temporarily till the drop database WAL is applied or
something like that.

> BTW I've tested the following switch/fail-back scenario but it seems
> not to work fine. Am I missing something?
>
> Setup:
> node1 is the primary, node2 is the physical standby for node1, and
> node3 is the subscriber connecting to node1.
>
> Steps:
> 1. [node1]: create a table and a publication for the table.
> 2. [node2]: set enable_syncslot = on and start (to receive WALs from node1).
> 3. [node3]: create a subscription with failover = true for the publication.
> 4. [node2]: promote to the new standby.
> 5. [node3]: alter subscription to connect the new primary, node2.
> 6. [node1]: stop, set enable_syncslot = on (and other required
> parameters), then start as a new standby.
>
> Then I got the error "exiting from slot synchronization because same
> name slot "test_sub" already exists on the standby".
>
> The logical replication slot that was created on the old primary
> (node1) has been synchronized to the old standby (node2). Therefore on
> node2, the slot's "synced" field is true. However, once node1 starts
> as the new standby with slot synchronization, the slotsync worker
> cannot synchronize the slot because the slot's "synced" field on the
> primary is false.
>

Yeah, we avoided doing anything in this case because the user could
have manually created another slot with the same name on standby.
Unlike WAL slots can be modified on standby as we allow decoding on
standby, so we can't allow to overwrite the existing slots. We won't
be able to distinguish whether the existing slot was a slot that the
user wants to sync with primary or a slot created on standby to
perform decoding. I think in this case user first needs to drop the
slot on new standby. We probably need to document it as well unless we
decide to do something else. What do you think?

-- 
With Regards,
Amit Kapila.

Re: More new SQL/JSON item methods

2024-01-31 Thread Jeevan Chalke

On Thu, Feb 1, 2024 at 7:20 AM Kyotaro Horiguchi 
wrote:

> Thank you for the fix!
>
> At Tue, 30 Jan 2024 13:46:17 +0530, Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote in
> > On Mon, Jan 29, 2024 at 11:03 AM Tom Lane  wrote:
> >
> > > Kyotaro Horiguchi  writes:
> > > > Having said that, I'm a bit concerned about the case where an overly
> > > > long string is given there. However, considering that we already have
> > > > "invalid input syntax for type xxx: %x" messages (including for the
> > > > numeric), this concern might be unnecessary.
> > >
> > > Yeah, we have not worried about that in the past.
> > >
> > > > Another concern is that the input value is already a numeric, not a
> > > > string. This situation occurs when the input is NaN of +-Inf.
> Although
> > > > numeric_out could be used, it might cause another error. Therefore,
> > > > simply writing out as "argument NaN and Infinity.." would be better.
> > >
> > > Oh!  I'd assumed that we were discussing a string that we'd failed to
> > > convert to numeric.  If the input is already numeric, then either
> > > the error is unreachable or what we're really doing is rejecting
> > > special values such as NaN on policy grounds.  I would ask first
> > > if that policy is sane at all.  (I'd lean to "not" --- if we allow
> > > it in the input JSON, why not in the output?)  If it is sane, the
> > > error message needs to be far more specific.
> > >
> > > regards, tom lane
> > >
> >
> > *Consistent error message related to type:*
> ...
> > What if we use phrases like "for type double precision" at both places,
> > like:
> > numeric argument of jsonpath item method .%s() is out of range for type
> > double precision
> > string argument of jsonpath item method .%s() is not a valid
> representation
> > for type double precision
> >
> > With this, the rest will be like:
> > for type numeric
> > for type bigint
> > for type integer
> >
> > Suggestions?
>
> FWIW, I prefer consistently using "for type hoge".
>

OK.


>
> > ---
> >
> > *Showing input string in the error message:*
> >
> > argument "...input string here..." of jsonpath item method .%s() is out
> of
> > range for type numeric
> >
> > If we add the input string in the error, then for some errors, it will be
> > too big, for example:
> >
> > -ERROR:  numeric argument of jsonpath item method .double() is out of
> range
> > for type double precision
> > +ERROR:  argument
> > "10"
> > of jsonpath item method .double() is out of range for type double
> precision
>
> As Tom suggested, given that similar situations have already been
> disregarded elsewhere, worrying about excessively long input strings
> in this specific instance won't notably improve safety in total.
>
> > Also, for non-string input, we need to convert numeric to string just for
> > the error message, which seems overkill.
>
> As I suggested and you seem to agree, using literally "Nan or
> Infinity" would be sufficient.
>

I am more concerned about .bigint() and .integer(). We can have errors when
the numeric input is out of range, but not NaN or Infinity. At those
places, we need to convert numeric to string to put that value into the
error.
Do you mean we should still put "Nan or Infinity" there?

This is the case:
 select jsonb_path_query('12345678901', '$.integer()');
 ERROR:  numeric argument of jsonpath item method .integer() is out of
range for type integer


>
> > On another note, irrespective of these changes, is it good to show the
> > given input in the error messages? Error messages are logged and may leak
> > some details.
> >
> > I think the existing way seems ok.
>
> In my opinion, it is quite common to include the error-causing value
> in error messages. Also, we have already many functions that impliy
> the possibility for revealing input values when converting text
> representation into internal format, such as with int4in. However, I
> don't stick to that way.
>
> > ---
> >
> > *NaN and Infinity restrictions:*
> >
> > I am not sure why NaN and Infinity are not allowed in conversion to
> double
> > precision (.double() method). I have used the same restriction for
> > .decimal() and .number(). However, as you said, we should have error
> > messages more specific. I tried that in the attached patch; please have
> > your views. I have the following wordings for that error message:
> > "NaN or Infinity is not allowed for jsonpath item method .%s()"
> >
> > Suggestions...
>
> They seem good to *me*.
>

Thanks


>
> By the way, while playing with this feature, I noticed the following
> error message:
>
> > select jsonb_path_query('1.1' , '$.boolean()');
> > ERROR:  numeric argument of jsonpath item method .boolean() is out of
> range for type boolean
>
> The error message seems a bit off to me. For example, "argument '1.1'
> is invalid for type [bB]oolean" seems more appropriate for this
> specific issue. (I'm not ceratin about our policy on the spelling of
> Boolean..)
>
> regards.
>
> --

Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-31 Thread Michael Paquier

On Thu, Feb 01, 2024 at 10:57:58AM +0900, Michael Paquier wrote:
> And here are the results I get for text and binary (ms, average of 15
> queries after discarding the three highest and three lowest values):
>   test   | master |  v7  | v10  
> -++--+--
>  from_bin_1col   | 1575   | 1546 | 1575
>  from_bin_10col  | 5364   | 5208 | 5230
>  from_text_1col  | 1690   | 1715 | 1684
>  from_text_10col | 4875   | 4793 | 4757
>  to_bin_1col | 1717   | 1730 | 1731
>  to_bin_10col| 7728   | 7707 | 7513
>  to_text_1col| 1710   | 1730 | 1698
>  to_text_10col   | 5998   | 5960 | 5987

Here are some numbers from a second local machine:
  test   | master |  v7  | v10  
-++--+--
 from_bin_1col   | 508| 467  | 461
 from_bin_10col  | 2192   | 2083 | 2098
 from_text_1col  | 510| 499  | 517
 from_text_10col | 1970   | 1678 | 1654
 to_bin_1col | 575| 577  | 573
 to_bin_10col| 2680   | 2678 | 2722
 to_text_1col| 516| 506  | 527
 to_text_10col   | 2250   | 2245 | 2235

This is confirming a speedup with COPY FROM for both text and binary,
with more impact with a larger number of attributes.  That's harder to
conclude about COPY TO in both cases, but at least I'm not seeing any
regression even with some variance caused by what looks like noise.
We need more numbers from more people.  Sutou-san or Sawada-san, or
any volunteers?
--
Michael


signature.asc
Description: PGP signature

Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-31 Thread Junwang Zhao

Hi Michael,

On Thu, Feb 1, 2024 at 9:58 AM Michael Paquier  wrote:
>
> On Wed, Jan 31, 2024 at 02:39:54PM +0900, Michael Paquier wrote:
> > Thanks, I'm looking into that now.
>
> I have much to say about the patch, but for now I have begun running
> some performance tests using the patches, because this thread won't
> get far until we are sure that the callbacks do not impact performance
> in some kind of worst-case scenario.  First, here is what I used to
> setup a set of tables used for COPY FROM and COPY TO (requires [1] to
> feed COPY FROM's data to the void, and note that default values is to
> have a strict control on the size of the StringInfos used in the copy
> paths):
> CREATE EXTENSION blackhole_am;
> CREATE OR REPLACE FUNCTION create_table_cols(tabname text, num_cols int)
> RETURNS VOID AS
> $func$
> DECLARE
>   query text;
> BEGIN
>   query := 'CREATE UNLOGGED TABLE ' || tabname || ' (';
>   FOR i IN 1..num_cols LOOP
> query := query || 'a_' || i::text || ' int default 1';
> IF i != num_cols THEN
>   query := query || ', ';
> END IF;
>   END LOOP;
>   query := query || ')';
>   EXECUTE format(query);
> END
> $func$ LANGUAGE plpgsql;
> -- Tables used for COPY TO
> SELECT create_table_cols ('to_tab_1', 1);
> SELECT create_table_cols ('to_tab_10', 10);
> INSERT INTO to_tab_1 SELECT FROM generate_series(1, 1000);
> INSERT INTO to_tab_10 SELECT FROM generate_series(1, 1000);
> -- Data for COPY FROM
> COPY to_tab_1 TO '/tmp/to_tab_1.bin' WITH (format binary);
> COPY to_tab_10 TO '/tmp/to_tab_10.bin' WITH (format binary);
> COPY to_tab_1 TO '/tmp/to_tab_1.txt' WITH (format text);
> COPY to_tab_10 TO '/tmp/to_tab_10.txt' WITH (format text);
> -- Tables used for COPY FROM
> SELECT create_table_cols ('from_tab_1', 1);
> SELECT create_table_cols ('from_tab_10', 10);
> ALTER TABLE from_tab_1 SET ACCESS METHOD blackhole_am;
> ALTER TABLE from_tab_10 SET ACCESS METHOD blackhole_am;
>
> Then I have run a set of tests using HEAD, v7 and v10 with queries
> like that (adapt them depending on the format and table):
> COPY to_tab_1 TO '/dev/null' WITH (FORMAT text) \watch count=5
> SET client_min_messages TO error; -- for blackhole_am
> COPY from_tab_1 FROM '/tmp/to_tab_1.txt' with (FORMAT 'text') \watch count=5
> COPY from_tab_1 FROM '/tmp/to_tab_1.bin' with (FORMAT 'binary') \watch count=5
>
> All the patches have been compiled with -O2, without assertions, etc.
> Postgres is run in tmpfs mode, on scissors, without fsync.  Unlogged
> tables help a bit in focusing on the execution paths as we don't care
> about WAL, of course.  I have also included v7 in the test of tests,
> as this version uses more simple per-row callbacks.
>
> And here are the results I get for text and binary (ms, average of 15
> queries after discarding the three highest and three lowest values):
>   test   | master |  v7  | v10
> -++--+--
>  from_bin_1col   | 1575   | 1546 | 1575
>  from_bin_10col  | 5364   | 5208 | 5230
>  from_text_1col  | 1690   | 1715 | 1684
>  from_text_10col | 4875   | 4793 | 4757
>  to_bin_1col | 1717   | 1730 | 1731
>  to_bin_10col| 7728   | 7707 | 7513
>  to_text_1col| 1710   | 1730 | 1698
>  to_text_10col   | 5998   | 5960 | 5987
>
> I am getting an interesting trend here in terms of a speedup between
> HEAD and the patches with a table that has 10 attributes filled with
> integers, especially for binary and text with COPY FROM.  COPY TO
> binary also gets nice numbers, while text looks rather stable.  Hmm.
>
> These were on my buildfarm animal, but we need to be more confident
> about all this.  Could more people run these tests?  I am going to do
> a second session on a local machine I have at hand and see what
> happens.  Will publish the numbers here, the method will be the same.
>
> [1]: https://github.com/michaelpq/pg_plugins/tree/main/blackhole_am
> --
> Michael

I'm running the benchmark, but I got some strong numbers:

postgres=# \timing
Timing is on.
postgres=# COPY to_tab_10 TO '/dev/null' WITH (FORMAT binary) \watch count=15
COPY 1000
Time: 3168.497 ms (00:03.168)
COPY 1000
Time: 3255.464 ms (00:03.255)
COPY 1000
Time: 3270.625 ms (00:03.271)
COPY 1000
Time: 3285.112 ms (00:03.285)
COPY 1000
Time: 3322.304 ms (00:03.322)
COPY 1000
Time: 3341.328 ms (00:03.341)
COPY 1000
Time: 3621.564 ms (00:03.622)
COPY 1000
Time: 3700.911 ms (00:03.701)
COPY 1000
Time: 3717.992 ms (00:03.718)
COPY 1000
Time: 3708.350 ms (00:03.708)
COPY 1000
Time: 3704.367 ms (00:03.704)
COPY 1000
Time: 3724.281 ms (00:03.724)
COPY 1000
Time: 3703.335 ms (00:03.703)
COPY 1000
Time: 3728.629 ms (00:03.729)
COPY 1000
Time: 3758.135 ms (00:03.758)

The first 6 rounds are like 10% better than the later 9 rounds, is this normal?

-- 
Regards
Junwang Zhao

Re: set_cheapest without checking pathlist

2024-01-31 Thread David Rowley

On Thu, 1 Feb 2024 at 16:29, Richard Guo  wrote:
>
>
> On Thu, Feb 1, 2024 at 10:04 AM James Coleman  wrote:
>>
>> I don't see any inherent reason why we must always assume that
>> gather_grouping_paths will always result in having at least one entry
>> in pathlist. If, for example, we've disabled incremental sort and the
>> cheapest partial path happens to already be sorted, then I don't
>> believe we'll add any paths. And if that happens then set_cheapest
>> will error with the message "could not devise a query plan for the
>> given query". So I propose we add a similar guard to this call point.
>
>
> I don't believe that would happen.  It seems to me that we should, at
> the very least, have a path which is Gather on top of the cheapest
> partial path (see generate_gather_paths), as long as the
> partially_grouped_rel has partial paths.

It will have partial paths because it's nested inside "if
(partially_grouped_rel && partially_grouped_rel->partial_pathlist)"

David

Re: set_cheapest without checking pathlist

2024-01-31 Thread Richard Guo

On Thu, Feb 1, 2024 at 10:04 AM James Coleman  wrote:

> I don't see any inherent reason why we must always assume that
> gather_grouping_paths will always result in having at least one entry
> in pathlist. If, for example, we've disabled incremental sort and the
> cheapest partial path happens to already be sorted, then I don't
> believe we'll add any paths. And if that happens then set_cheapest
> will error with the message "could not devise a query plan for the
> given query". So I propose we add a similar guard to this call point.

I don't believe that would happen.  It seems to me that we should, at
the very least, have a path which is Gather on top of the cheapest
partial path (see generate_gather_paths), as long as the
partially_grouped_rel has partial paths.

Thanks
Richard

RE: speed up a logical replica setup

2024-01-31 Thread Hayato Kuroda (Fujitsu)

Dear Fabrízio, Euler,

I think you set the primary_slot_name to the standby server, right?
While reading codes, I found below line in v11-0001.
```
if (primary_slot_name != NULL)
{
conn = connect_database(dbinfo[0].pubconninfo);
if (conn != NULL)
{
drop_replication_slot(conn, &dbinfo[0], temp_replslot);
}
```

Now the temp_replslot is temporary one, so it would be removed automatically.
This function will cause the error: replication slot 
"pg_createsubscriber_%u_startpoint" does not exist.
Also, the physical slot is still remained on the primary. 
In short, "temp_replslot" should be "primary_slot_name".

PSA a script file for reproducing.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



test_0201.sh
Description: test_0201.sh

Re: Flushing large data immediately in pqcomm

2024-01-31 Thread Andres Freund

Hi,

On 2024-01-31 14:57:35 -0500, Robert Haas wrote:
> > You're right and I'm open to doing more legwork. I'd also appreciate any
> > suggestion about how to test this properly and/or useful scenarios to
> > test. That would be really helpful.
>
> I think experimenting to see whether the long-short-long-short
> behavior that Heikki postulated emerges in practice would be a really
> good start.
>
> Another experiment that I think would be interesting is: suppose you
> create a patch that sends EVERY message without buffering and compare
> that to master. My naive expectation would be that this will lose if
> you pump short messages through that connection and win if you pump
> long messages through that connection. Is that true? If yes, at what
> point do we break even on performance? Does it depend on whether the
> connection is local or over a network? Does it depend on whether it's
> with or without SSL? Does it depend on Linux vs. Windows vs.
> whateverBSD? What happens if you twiddle the 8kB buffer size up or,
> say, down to just below the Ethernet frame size?

I feel like you're putting up a too high bar for something that can be a
pretty clear improvement on its own, without a downside. The current behaviour
is pretty absurd, doing all this research across all platforms isn't going to
disprove that - and it's a lot of work.  ISTM we can analyze this without
taking concrete hardware into account easily enough.

One thing that I haven't seen mentioned here that's relevant around using
small buffers: Postgres uses TCP_NODELAY and has to do so. That means doing
tiny sends can hurt substantially

> I think that what we really want to understand here is under what
> circumstances the extra layer of buffering is a win vs. being a loss.

It's quite easy to see that doing no buffering isn't viable - we end up with
tiny tiny TCP packets, one for each send(). And then there's the syscall
overhead.

Here's a quickly thrown together benchmark using netperf. First with -D, which
instructs it to use TCP_NODELAY, as we do.

10gbit network, remote host:

$ (fields="request_size,throughput"; echo "$fields";for i in $(seq 0 16); do 
s=$((2**$i));netperf -P0 -t TCP_STREAM -l1 -H alap5-10gbe  -- -r $s,$s -D 1 -o 
"$fields";done)|column -t -s,

request_size  throughput
1 22.73
2 45.77
4 108.64
8 225.78
16560.32
321035.61
642177.91
128   3604.71
256   5878.93
512   9334.70
1024  9031.13
2048  9405.35
4096  9334.60
8192  9275.33
16384 9406.29
32768 9385.52
65536 9399.40

localhost:
request_size  throughput
1 2.76
2 5.10
4 9.89
8 20.51
1643.42
3287.13
64173.72
128   343.70
256   647.89
512   1328.79
1024  2550.14
2048  4998.06
4096  9482.06
8192  17130.76
16384 29048.02
32768 42106.33
65536 48579.95

I'm slightly baffled by the poor performance of localhost with tiny packet
sizes. Ah, I see - it's the NODELA, without that:

localhost:
1 32.02
2 60.58
4 114.32
8 262.71
16558.42
321053.66
642099.39
128   3815.60
256   6566.19
512   11751.79
1024  18976.11
2048  27222.99
4096  33838.07
8192  38219.60
16384 39146.37
32768 44784.98
65536 44214.70

NODELAY triggers many more context switches, because there's immediately data
available for the receiving side. Whereas with real network the interrupts get
coalesced.

I think that's pretty clear evidence that we need buffering.  But I think we
can probably be smarter than we are right now, and then what's been proposed
in the patch. Because of TCP_NODELAY we shouldn't send a tiny buffer on its
own, it may trigger sending a small TCP packet, which is quite inefficient.

While not perfect - e.g. because networks might use jumbo packets / large MTUs
and we don't know how many outstanding bytes there are locally, I think a
decent heuristic could be to always try to send at least one packet worth of
data at once (something like ~1400 bytes), even if that requires copying some
of the input data. It might not be sent on its own, but it should make it
reasonably unlikely to end up with tiny tiny packets.

Greetings,

Andres Freund

Re: speed up a logical replica setup

On Wed, Jan 31, 2024, at 11:09 PM, Hayato Kuroda (Fujitsu) wrote:
> >
> Why? Are you suggesting that the dry run mode covers just the verification
> part? If so, it is not a dry run mode. I would expect it to run until the end
> (or until it accomplish its goal) but *does not* modify data. For pg_resetwal,
> the modification is one of the last steps and the other ones (KillFoo
> functions) that are skipped modify data. It ends the dry run mode when it
> accomplish its goal (obtain the new control data values). If we stop earlier,
> some of the additional steps won't be covered by the dry run mode and a 
> failure
> can happen but could be detected if you run a few more steps.
> >
> 
> Yes, it was my expectation. I'm still not sure which operations can detect by 
> the
> dry_run, but we can keep it for now.

The main goal is to have information for troubleshooting.

> 
> Good point. I included a check for pg_create_subscription role and CREATE
> privilege on the specified database.
> >
> 
> Not sure, but can we do the replication origin functions by these privilege?
> According to the doc[1], these ones seem not to be related.

Hmm. No. :( Better add this check too.


--
Euler Taveira
EDB   https://www.enterprisedb.com/

Re: Oversight in reparameterize_path_by_child leading to executor crash

2024-01-31 Thread Richard Guo

On Wed, Jan 31, 2024 at 11:21 PM Tom Lane  wrote:

> Richard Guo  writes:
> > On Wed, Jan 31, 2024 at 5:12 AM Tom Lane  wrote:
> >> * Why is it okay to just use pull_varnos on a tablesample expression,
> >> when contain_references_to() does something different?
>
> > Hmm, the main reason is that the expression_tree_walker() function does
> > not handle T_RestrictInfo nodes.  So we cannot just use pull_varnos or
> > pull_var_clause on a restrictinfo list.
>
> Right, the extract_actual_clauses step is not wanted for the
> tablesample expression.  But my point is that surely the handling of
> Vars and PlaceHolderVars should be the same, which it's not, and your
> v11 makes it even less so.  How can it be OK to ignore Vars in the
> restrictinfo case?

Hmm, in this specific scenario it seems that it's not possible to have
Vars in the restrictinfo list that laterally reference the outer join
relation; otherwise the clause containing such Vars would not be a
restriction clause but rather a join clause.  So in v11 I did not check
Vars in contain_references_to().

But I agree that we'd better to handle Vars and PlaceHolderVars in the
same way.

> I think the code structure we need to end up with is a routine that
> takes a RestrictInfo-free node tree (and is called directly in the
> tablesample case) with a wrapper routine that strips the RestrictInfos
> (for use on restriction lists).

How about the attached v12 patch?  But I'm a little concerned about
omitting PVC_RECURSE_AGGREGATES and PVC_RECURSE_WINDOWFUNCS in this
case.  Is it possible that aggregates or window functions appear in the
tablesample expression?

Thanks
Richard

v12-0001-Avoid-reparameterizing-Paths-when-it-s-not-suitable.patch
Description: Binary data

Re: Synchronizing slots from primary to standby

On Mon, Jan 29, 2024, at 10:17 AM, Zhijie Hou (Fujitsu) wrote:
> Attach the V72-0001 which addressed above comments, other patches will be
> rebased and posted after pushing first patch. Thanks Shveta for helping 
> address
> the comments.

While working on another patch I noticed a new NOTICE message:

NOTICE:  changed the failover state of replication slot "foo" on publisher to 
false

I wasn't paying much attention to this thread then I start reading the 2
patches that was recently committed. The message above surprises me because
pg_createsubscriber starts to emit this message. The reason is that it doesn't
create the replication slot during the CREATE SUBSCRIPTION. Instead, it creates
the replication slot with failover = false and no such option is informed
during CREATE SUBSCRIPTION which means it uses the default value (failover =
false). I expect that I don't see any message because it is *not* changing the
behavior. I was wrong. It doesn't check the failover state on publisher, it
just executes walrcv_alter_slot() and emits a message.

IMO if we are changing an outstanding property on node A from node B, node B
already knows (or might know) about that behavior change (because it is sending
the command), however, node A doesn't (unless log_replication_commands = on --
it is not the default).

Do we really need this message as NOTICE? I would set it to DEBUG1 if it is
worth or even remove it (if we consider there are other ways to obtain the same
information).

--
Euler Taveira
EDB   https://www.enterprisedb.com/

Re: Parallelize correlated subqueries that execute within each worker

2024-01-31 Thread James Coleman

On Wed, Jan 31, 2024 at 3:18 PM Robert Haas  wrote:
>
> On Tue, Jan 30, 2024 at 9:56 PM James Coleman  wrote:
> > I don't follow the "Idle since July" since it just hasn't received
> > review since then, so there's been nothing to reply to.
>
> It wasn't clear to me if you thought that the patch was ready for
> review since July, or if it was waiting on you since July. Those are
> quite different, IMV.

Agreed they're very different. I'd thought it was actually in "Needs
review" and with no outstanding questions on the thread since July,
but maybe I'm missing something -- I've definitely misunderstood CF
app status before, but usually that's been in the other direction
(forgetting to mark it back to Needs Review after responding to a
Waiting on Author.

Regards,
James Coleman

RE: speed up a logical replica setup

2024-01-31 Thread Hayato Kuroda (Fujitsu)

Dear Euler,

Thanks for giving comments! I want to reply some of them.

>
I didn't provide the whole explanation. I'm envisioning the use case that pg_ctl
doesn't reach the consistent state and the timeout is reached (the consequence
is that pg_createsubscriber aborts the execution). It might occur on a busy
server. The probability that it occurs with the current code is low (LSN gap
for recovery is small). Maybe I'm anticipating issues when the base backup
support is added but better to raise concerns during development.
>

Hmm, actually I didn't know the case. Thanks for explanation. I want to see
how you describe on the doc.

>
pg_upgrade doesn't but others do like pg_rewind, pg_resetwal, pg_controldata,
pg_checksums. It seems newer tools tend to provide short and long options.
>

Oh, you are right.

>
Nothing? If you interrupt the execution, there will be objects left behind and
you, as someone that decided to do it, have to clean things up. What do you
expect this tool to do? The documentation will provide some guidance informing
the object name patterns this tool uses and you can check for leftover objects.
Someone can argue that is a valid feature request but IMO it is not one in the
top of the list.
>

OK, so let's keep current style.

>
Why? Are you suggesting that the dry run mode covers just the verification
part? If so, it is not a dry run mode. I would expect it to run until the end
(or until it accomplish its goal) but *does not* modify data. For pg_resetwal,
the modification is one of the last steps and the other ones (KillFoo
functions) that are skipped modify data. It ends the dry run mode when it
accomplish its goal (obtain the new control data values). If we stop earlier,
some of the additional steps won't be covered by the dry run mode and a failure
can happen but could be detected if you run a few more steps.
>

Yes, it was my expectation. I'm still not sure which operations can detect by 
the
dry_run, but we can keep it for now.

>
Why? See [1]. I prefer the kind mode (always wait until the recovery ends) but
you and Amit are proposing a more aggressive mode. The proposal (-t 60) seems
ok right now, however, if the goal is to provide base backup support in the
future, you certainly should have to add the --recovery-timeout in big clusters
or those with high workloads because base backup is run between replication slot
creation and consistent LSN. Of course, we can change the default when base
backup support is added.
>

Sorry, I was missing your previous post. Let's keep yours.

>
Good point. I included a check for pg_create_subscription role and CREATE
privilege on the specified database.
>

Not sure, but can we do the replication origin functions by these privilege?
According to the doc[1], these ones seem not to be related.

[1]: 
https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-REPLICATION

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/

set_cheapest without checking pathlist

2024-01-31 Thread James Coleman

Hello,

Robert: I've taken the liberty of cc'ing you since you worked on most
of this code. My apologies if that wasn't appropriate.

While working on "Parallelize correlated subqueries that execute
within each worker" [1] I noticed that while in the other call to
set_cheapest (for partially_grouped_rel) in the same function the call
after gather_grouping_paths(root, partially_grouped_rel) is not
similarly guarded with a check for a NIL pathlist on
partially_grouped_rel.

I don't see any inherent reason why we must always assume that
gather_grouping_paths will always result in having at least one entry
in pathlist. If, for example, we've disabled incremental sort and the
cheapest partial path happens to already be sorted, then I don't
believe we'll add any paths. And if that happens then set_cheapest
will error with the message "could not devise a query plan for the
given query". So I propose we add a similar guard to this call point.

I could be convinced that this should be simply part of the patch in
the other thread, but it seemed to me it'd be worth considering
independently because as noted above I don't see any reason why this
couldn't happen separately. That being said, on master I don't have a
case showing this is necessary.

Thanks,
James Coleman

1:
https://www.postgresql.org/message-id/flat/CAAaqYe-Aq6oNf9NPZnpPE7SgRLomXXWJA1Gz9L9ndi_Nv%3D94Yw%40mail.gmail.com#e0b1a803d0fdb97189ce493f15f99c14

v1-0001-Guard-set_cheapest-with-pathlist-NIL-check.patch
Description: Binary data

Re: Make COPY format extendable: Extract COPY TO format implementations

2024-01-31 Thread Michael Paquier

On Wed, Jan 31, 2024 at 02:39:54PM +0900, Michael Paquier wrote:
> Thanks, I'm looking into that now.

I have much to say about the patch, but for now I have begun running
some performance tests using the patches, because this thread won't
get far until we are sure that the callbacks do not impact performance
in some kind of worst-case scenario.  First, here is what I used to
setup a set of tables used for COPY FROM and COPY TO (requires [1] to
feed COPY FROM's data to the void, and note that default values is to
have a strict control on the size of the StringInfos used in the copy
paths):
CREATE EXTENSION blackhole_am;
CREATE OR REPLACE FUNCTION create_table_cols(tabname text, num_cols int)
RETURNS VOID AS
$func$
DECLARE
  query text;
BEGIN
  query := 'CREATE UNLOGGED TABLE ' || tabname || ' (';
  FOR i IN 1..num_cols LOOP
query := query || 'a_' || i::text || ' int default 1';
IF i != num_cols THEN
  query := query || ', ';
END IF;
  END LOOP;
  query := query || ')';
  EXECUTE format(query);
END
$func$ LANGUAGE plpgsql;
-- Tables used for COPY TO
SELECT create_table_cols ('to_tab_1', 1);
SELECT create_table_cols ('to_tab_10', 10);
INSERT INTO to_tab_1 SELECT FROM generate_series(1, 1000);
INSERT INTO to_tab_10 SELECT FROM generate_series(1, 1000);
-- Data for COPY FROM
COPY to_tab_1 TO '/tmp/to_tab_1.bin' WITH (format binary);
COPY to_tab_10 TO '/tmp/to_tab_10.bin' WITH (format binary);
COPY to_tab_1 TO '/tmp/to_tab_1.txt' WITH (format text);
COPY to_tab_10 TO '/tmp/to_tab_10.txt' WITH (format text);
-- Tables used for COPY FROM
SELECT create_table_cols ('from_tab_1', 1);
SELECT create_table_cols ('from_tab_10', 10);
ALTER TABLE from_tab_1 SET ACCESS METHOD blackhole_am;
ALTER TABLE from_tab_10 SET ACCESS METHOD blackhole_am;

Then I have run a set of tests using HEAD, v7 and v10 with queries
like that (adapt them depending on the format and table):
COPY to_tab_1 TO '/dev/null' WITH (FORMAT text) \watch count=5
SET client_min_messages TO error; -- for blackhole_am
COPY from_tab_1 FROM '/tmp/to_tab_1.txt' with (FORMAT 'text') \watch count=5
COPY from_tab_1 FROM '/tmp/to_tab_1.bin' with (FORMAT 'binary') \watch count=5

All the patches have been compiled with -O2, without assertions, etc.
Postgres is run in tmpfs mode, on scissors, without fsync.  Unlogged
tables help a bit in focusing on the execution paths as we don't care
about WAL, of course.  I have also included v7 in the test of tests,
as this version uses more simple per-row callbacks.

And here are the results I get for text and binary (ms, average of 15
queries after discarding the three highest and three lowest values):
  test   | master |  v7  | v10  
-++--+--
 from_bin_1col   | 1575   | 1546 | 1575
 from_bin_10col  | 5364   | 5208 | 5230
 from_text_1col  | 1690   | 1715 | 1684
 from_text_10col | 4875   | 4793 | 4757
 to_bin_1col | 1717   | 1730 | 1731
 to_bin_10col| 7728   | 7707 | 7513
 to_text_1col| 1710   | 1730 | 1698
 to_text_10col   | 5998   | 5960 | 5987

I am getting an interesting trend here in terms of a speedup between
HEAD and the patches with a table that has 10 attributes filled with
integers, especially for binary and text with COPY FROM.  COPY TO
binary also gets nice numbers, while text looks rather stable.  Hmm.

These were on my buildfarm animal, but we need to be more confident
about all this.  Could more people run these tests?  I am going to do
a second session on a local machine I have at hand and see what
happens.  Will publish the numbers here, the method will be the same.

[1]: https://github.com/michaelpq/pg_plugins/tree/main/blackhole_am
--
Michael


signature.asc
Description: PGP signature

Re: More new SQL/JSON item methods

At Thu, 01 Feb 2024 10:49:57 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> By the way, while playing with this feature, I noticed the following
> error message:
> 
> > select jsonb_path_query('1.1' , '$.boolean()');
> > ERROR:  numeric argument of jsonpath item method .boolean() is out of range 
> > for type boolean
> 
> The error message seems a bit off to me. For example, "argument '1.1'
> is invalid for type [bB]oolean" seems more appropriate for this
> specific issue. (I'm not ceratin about our policy on the spelling of
> Boolean..)

Or, following our general convention, it would be spelled as:

'invalid argument for type Boolean: "1.1"'

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: More new SQL/JSON item methods

Thank you for the fix!

At Tue, 30 Jan 2024 13:46:17 +0530, Jeevan Chalke 
 wrote in 
> On Mon, Jan 29, 2024 at 11:03 AM Tom Lane  wrote:
> 
> > Kyotaro Horiguchi  writes:
> > > Having said that, I'm a bit concerned about the case where an overly
> > > long string is given there. However, considering that we already have
> > > "invalid input syntax for type xxx: %x" messages (including for the
> > > numeric), this concern might be unnecessary.
> >
> > Yeah, we have not worried about that in the past.
> >
> > > Another concern is that the input value is already a numeric, not a
> > > string. This situation occurs when the input is NaN of +-Inf. Although
> > > numeric_out could be used, it might cause another error. Therefore,
> > > simply writing out as "argument NaN and Infinity.." would be better.
> >
> > Oh!  I'd assumed that we were discussing a string that we'd failed to
> > convert to numeric.  If the input is already numeric, then either
> > the error is unreachable or what we're really doing is rejecting
> > special values such as NaN on policy grounds.  I would ask first
> > if that policy is sane at all.  (I'd lean to "not" --- if we allow
> > it in the input JSON, why not in the output?)  If it is sane, the
> > error message needs to be far more specific.
> >
> > regards, tom lane
> >
> 
> *Consistent error message related to type:*
...
> What if we use phrases like "for type double precision" at both places,
> like:
> numeric argument of jsonpath item method .%s() is out of range for type
> double precision
> string argument of jsonpath item method .%s() is not a valid representation
> for type double precision
> 
> With this, the rest will be like:
> for type numeric
> for type bigint
> for type integer
> 
> Suggestions?

FWIW, I prefer consistently using "for type hoge".

> ---
> 
> *Showing input string in the error message:*
> 
> argument "...input string here..." of jsonpath item method .%s() is out of
> range for type numeric
> 
> If we add the input string in the error, then for some errors, it will be
> too big, for example:
> 
> -ERROR:  numeric argument of jsonpath item method .double() is out of range
> for type double precision
> +ERROR:  argument
> "10"
> of jsonpath item method .double() is out of range for type double precision

As Tom suggested, given that similar situations have already been
disregarded elsewhere, worrying about excessively long input strings
in this specific instance won't notably improve safety in total.

> Also, for non-string input, we need to convert numeric to string just for
> the error message, which seems overkill.

As I suggested and you seem to agree, using literally "Nan or
Infinity" would be sufficient.

> On another note, irrespective of these changes, is it good to show the
> given input in the error messages? Error messages are logged and may leak
> some details.
> 
> I think the existing way seems ok.

In my opinion, it is quite common to include the error-causing value
in error messages. Also, we have already many functions that impliy
the possibility for revealing input values when converting text
representation into internal format, such as with int4in. However, I
don't stick to that way.

> ---
> 
> *NaN and Infinity restrictions:*
> 
> I am not sure why NaN and Infinity are not allowed in conversion to double
> precision (.double() method). I have used the same restriction for
> .decimal() and .number(). However, as you said, we should have error
> messages more specific. I tried that in the attached patch; please have
> your views. I have the following wordings for that error message:
> "NaN or Infinity is not allowed for jsonpath item method .%s()"
> 
> Suggestions...

They seem good to *me*.

By the way, while playing with this feature, I noticed the following
error message:

> select jsonb_path_query('1.1' , '$.boolean()');
> ERROR:  numeric argument of jsonpath item method .boolean() is out of range 
> for type boolean

The error message seems a bit off to me. For example, "argument '1.1'
is invalid for type [bB]oolean" seems more appropriate for this
specific issue. (I'm not ceratin about our policy on the spelling of
Boolean..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

RE: speed up a logical replica setup

2024-01-31 Thread Hayato Kuroda (Fujitsu)

Dear Fabrízio,

Thanks for reporting. I understood that the issue occurred on v11 and v12.
I will try to reproduce and check the reason.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/

Re: Rename setup_cancel_handler in pg_dump

2024-01-31 Thread Yugo NAGATA

On Tue, 30 Jan 2024 13:44:28 +0100
Daniel Gustafsson  wrote:

> > On 26 Jan 2024, at 01:42, Yugo NAGATA  wrote:
> 
> > I am proposing it because there is a public function with
> > the same name in fe_utils/cancel.c. I know pg_dump/parallel.c
> > does not include fe_utils/cancel.h, so there is no conflict,
> > but I think it is better to use different names to reduce
> > possible confusion. 
> 
> Given that a "git grep setup_cancel_hander" returns hits in pg_dump along with
> other frontend utils, I can see the risk of confusion.

Thank you for looking into it!
 
> -setup_cancel_handler(void)
> +pg_dump_setup_cancel_handler(void)
> 
> We don't have any other functions prefixed with pg_dump_, based on the naming
> of the surrounding code in the file I wonder if set_cancel_handler is a more
> appropriate name?

Agreed. Here is a updated patch.

Regards,
Yugo Nagata

> 
> --
> Daniel Gustafsson
> 


-- 
Yugo NAGATA 
diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index 188186829c..a09247fae4 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -204,7 +204,7 @@ static ParallelSlot *GetMyPSlot(ParallelState *pstate);
 static void archive_close_connection(int code, void *arg);
 static void ShutdownWorkersHard(ParallelState *pstate);
 static void WaitForTerminatingWorkers(ParallelState *pstate);
-static void setup_cancel_handler(void);
+static void set_cancel_handler(void);
 static void set_cancel_pstate(ParallelState *pstate);
 static void set_cancel_slot_archive(ParallelSlot *slot, ArchiveHandle *AH);
 static void RunWorker(ArchiveHandle *AH, ParallelSlot *slot);
@@ -550,7 +550,7 @@ sigTermHandler(SIGNAL_ARGS)
 	/*
 	 * Some platforms allow delivery of new signals to interrupt an active
 	 * signal handler.  That could muck up our attempt to send PQcancel, so
-	 * disable the signals that setup_cancel_handler enabled.
+	 * disable the signals that set_cancel_handler enabled.
 	 */
 	pqsignal(SIGINT, SIG_IGN);
 	pqsignal(SIGTERM, SIG_IGN);
@@ -605,7 +605,7 @@ sigTermHandler(SIGNAL_ARGS)
  * Enable cancel interrupt handler, if not already done.
  */
 static void
-setup_cancel_handler(void)
+set_cancel_handler(void)
 {
 	/*
 	 * When forking, signal_info.handler_set will propagate into the new
@@ -705,7 +705,7 @@ consoleHandler(DWORD dwCtrlType)
  * Enable cancel interrupt handler, if not already done.
  */
 static void
-setup_cancel_handler(void)
+set_cancel_handler(void)
 {
 	if (!signal_info.handler_set)
 	{
@@ -737,7 +737,7 @@ set_archive_cancel_info(ArchiveHandle *AH, PGconn *conn)
 	 * important that this happen at least once before we fork off any
 	 * threads.
 	 */
-	setup_cancel_handler();
+	set_cancel_handler();
 
 	/*
 	 * On Unix, we assume that storing a pointer value is atomic with respect

Re: src/bin/pg_upgrade/t/004_subscription.pl test comment fix

On Wed, Jan 31, 2024 at 7:48 PM Alvaro Herrera  wrote:
>
> How about rewording it more extensively?  It doesn't read great IMO.
> I would use something like
>
> # In the upgraded instance, the running status and failover option of the
> # subscription with the failover option should have been preserved; the other
> # should not.
> # So regress_sub1 should still have subenabled,subfailover set to true,
> # while regress_sub2 should have both set to false.
>

IIUC this suggested comment is implying that the running status is
*only* preserved when the failover option is true. But AFAIK that is
not correct. e.g. I hacked the test to keep subscription regress_sub2
as ENABLED but the result after the upgrade was subenabled/subfailover
= t/f, not f/f.

> I think the symmetry between the two lines confuses more than helps.
> It's not a huge thing but since we're editing anyway, why not?
>

OK. Now using your suggested 2nd sentence:

+# The subscription's running status and failover option should be preserved
+# in the upgraded instance. So regress_sub1 should still have
subenabled,subfailover
+# set to true, while regress_sub2 should have both set to false.

~

I also tweaked some other nearby comments/messages which did not
mention the 'failover' preservation.

PSA v2.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

v2-0001-Fix-pg_upgrade-test-comment.patch
Description: Binary data

Re: speed up a logical replica setup

On Tue, Jan 30, 2024, at 6:26 AM, Hayato Kuroda (Fujitsu) wrote:
> > One open item that is worrying me is how to handle the pg_ctl timeout. This
> > patch does nothing and the user should use PGCTLTIMEOUT environment 
> > variable to
> > avoid that the execution is canceled after 60 seconds (default for pg_ctl).
> > Even if you set a high value, it might not be enough for cases like
> > time-delayed replica. Maybe pg_ctl should accept no timeout as --timeout
> > option. I'll include this caveat into the documentation but I'm afraid it is
> > not sufficient and we should provide a better way to handle this situation.
> 
> I felt you might be confused a bit. Even if the recovery_min_apply_delay is 
> set,
> e.g., 10h., the pg_ctl can start and stop the server. This is because the
> walreceiver serialize changes as soon as they received. The delay is done by 
> the
> startup process. There are no unflushed data, so server instance can be 
> turned off.
> If you meant the combination of recovery-timeout and time-delayed replica, 
> yes,
> it would be likely to occur. But in the case, using like --no-timeout option 
> is
> dangerous. I think we should overwrite recovery_min_apply_delay to zero. 
> Thought?

I didn't provide the whole explanation. I'm envisioning the use case that pg_ctl
doesn't reach the consistent state and the timeout is reached (the consequence
is that pg_createsubscriber aborts the execution). It might occur on a busy
server. The probability that it occurs with the current code is low (LSN gap
for recovery is small). Maybe I'm anticipating issues when the base backup
support is added but better to raise concerns during development.

> Below part contains my comments for v11-0001. Note that the ordering is 
> random.

Hayato, thanks for reviewing v11.

> 01. doc
> ```
> 
>  -D 
>  --pgdata
> 
> ```
> 
> According to other documentation like pg_upgrade, we do not write both longer
> and shorter options in the synopsis section.

pg_upgrade doesn't but others do like pg_rewind, pg_resetwal, pg_controldata,
pg_checksums. It seems newer tools tend to provide short and long options.

> 02. doc
> ```
>   
>pg_createsubscriber takes the publisher and 
> subscriber
>connection strings, a cluster directory from a physical replica and a list 
> of
>database names and it sets up a new logical replica using the physical
>recovery process.
>   
> 
> ```
> 
> I found that you did not include my suggestion without saying [1]. Do you 
> dislike
> the comment or still considering?

Documentation is on my list. I didn't fix the documentation since some design
decisions were changed. I'm still working on it.

> 03. doc
> ```
>   -P   class="parameter">connstr
> ```
> 
> Too many blank after -P.

Fixed.

[documentation related items will be addressed later...]

> 
> 07. general
> I think there are some commenting conversions in PG, but this file breaks it.

It is on my list.

> 08. general
> Some pg_log_error() + exit(1) can be replaced with pg_fatal().

Done. I kept a few pg_log_error() + exit() because there is no
pg_fatal_and_hint() function.

> 
> 09. LogicalRepInfo
> ```
> char*subconninfo; /* subscription connection string for logical
> * replication */
> ```
> 
> As I posted in comment#8[2], I don't think it is "subscription connection". 
> Also,
> "for logical replication" is bit misreading because it would not be passed to
> workers.

Done.

s/publication/publisher/
s/subscription/subscriber/

> 10. get_base_conninfo
> ```
> static char *
> get_base_conninfo(char *conninfo, char *dbname, const char *noderole)
> ...
> /*
> * If --database option is not provided, try to obtain the dbname from
> * the publisher conninfo. If dbname parameter is not available, error
> * out.
> */
> 
> ```
> 
> I'm not sure getting dbname from the conninfo improves user-experience. I felt
> it may trigger an unintended targeting.
> (I still think the publisher-server should be removed)

Why not? Unique database is a common setup. It is unintended if you don't
document it accordingly. I'll make sure it is advertised in the --database and
the --publisher-server options.

> 11. check_data_directory
> ```
> /*
> * Is it a cluster directory? These are preliminary checks. It is far from
> * making an accurate check. If it is not a clone from the publisher, it will
> * eventually fail in a future step.
> */
> static bool
> check_data_directory(const char *datadir)
> ```
> 
> We shoud also check whether pg_createsubscriber can create a file and a 
> directory.
> GetDataDirectoryCreatePerm() verifies it.

Good point. It is included in the next patch.

> 12. main
> ```
> /* Register a function to clean up objects in case of failure. */
> atexit(cleanup_objects_atexit);
> ```
> 
> According to the manpage, callback functions would not be called when it exits
> due to signals:
> 
> > Functions  registered  using atexit() (and on_exit(3)) are not called if a
> > process terminates abnormally because

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On 2024-Jan-29, Dilip Kumar wrote:

> Thank you for working on this.  There is one thing that I feel is
> problematic.  We have kept the allowed values for these GUCs to be in
> multiple of SLRU_BANK_SIZE i.e. 16 and that's the reason the min
> values were changed to 16 but in this refactoring patch for some of
> the buffers you have changed that to 8 so I think that's not good.

Oh, absolutely, you're right.  Restored the minimum to 16.

So, here's the patchset as two pieces.  0001 converts
SlruSharedData->latest_page_number to use atomics.  I don't see any
reason to mix this in with the rest of the patch, and though it likely
won't have any performance advantage by itself (since the lock
acquisition is pretty much the same), it seems better to get it in ahead
of the rest -- I think that simplifies matters for the second patch,
which is large enough.

So, 0002 introduces the rest of the feature.  I removed use of banklocks
in a different amount as banks, and I made commit_ts use a longer
lwlock acquisition at truncation time, rather than forcing all-lwlock
acquisition.

The more I look at 0002, the more I notice that some comments need badly
updated, so please don't read too much into it yet.  But I wanted to
post it anyway for archives and cfbot purposes.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
>From 464a996b85c333ffc781086263c2e491758b248f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 31 Jan 2024 12:27:51 +0100
Subject: [PATCH 1/2] Use atomics for SlruSharedData->latest_page_number

---
 src/backend/access/transam/clog.c  |  7 ++
 src/backend/access/transam/commit_ts.c |  7 +++---
 src/backend/access/transam/multixact.c | 30 --
 src/backend/access/transam/slru.c  | 19 ++--
 src/include/access/slru.h  |  5 -
 5 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f6e7da7ffc..245fd21c8d 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -766,14 +766,11 @@ StartupCLOG(void)
 	TransactionId xid = XidFromFullTransactionId(TransamVariables->nextXid);
 	int64		pageno = TransactionIdToPage(xid);
 
-	LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
-
 	/*
 	 * Initialize our idea of the latest page number.
 	 */
-	XactCtl->shared->latest_page_number = pageno;
-
-	LWLockRelease(XactSLRULock);
+	pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno);
+	pg_write_barrier();
 }
 
 /*
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 61b82385f3..f68705989d 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -689,9 +689,7 @@ ActivateCommitTs(void)
 	/*
 	 * Re-Initialize our idea of the latest page number.
 	 */
-	LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
-	CommitTsCtl->shared->latest_page_number = pageno;
-	LWLockRelease(CommitTsSLRULock);
+	pg_atomic_init_u64(&CommitTsCtl->shared->latest_page_number, pageno);
 
 	/*
 	 * If CommitTs is enabled, but it wasn't in the previous server run, we
@@ -1006,7 +1004,8 @@ commit_ts_redo(XLogReaderState *record)
 		 * During XLOG replay, latest_page_number isn't set up yet; insert a
 		 * suitable value to bypass the sanity test in SimpleLruTruncate.
 		 */
-		CommitTsCtl->shared->latest_page_number = trunc->pageno;
+		pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number,
+			trunc->pageno);
 
 		SimpleLruTruncate(CommitTsCtl, trunc->pageno);
 	}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 59523be901..a886c29892 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2017,13 +2017,17 @@ StartupMultiXact(void)
 	 * Initialize offset's idea of the latest page number.
 	 */
 	pageno = MultiXactIdToOffsetPage(multi);
-	MultiXactOffsetCtl->shared->latest_page_number = pageno;
+	pg_atomic_init_u64(&MultiXactOffsetCtl->shared->latest_page_number,
+	   pageno);
 
 	/*
 	 * Initialize member's idea of the latest page number.
 	 */
 	pageno = MXOffsetToMemberPage(offset);
-	MultiXactMemberCtl->shared->latest_page_number = pageno;
+	pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number,
+	   pageno);
+
+	pg_write_barrier();
 }
 
 /*
@@ -2047,14 +2051,15 @@ TrimMultiXact(void)
 	oldestMXactDB = MultiXactState->oldestMultiXactDB;
 	LWLockRelease(MultiXactGenLock);
 
-	/* Clean up offsets state */
-	LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
-
 	/*
 	 * (Re-)Initialize our idea of the latest page number for offsets.
 	 */
 	pageno = MultiXactIdToOffsetPage(nextMXact);
-	MultiXactOffsetCtl->shared->latest_page_number = pageno;
+	pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number,
+		pageno);
+
+	/* Clean up offsets state */
+	LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
 
 	/*

Re: Documentation: warn about two_phase when altering a subscription

On Wed, Jan 31, 2024 at 4:55 PM Bertrand Drouvot
 wrote:
...
> As a non native English speaker somehow I have to rely on you for those
> suggestions ;-)
>
> They make sense to me so applied both in v2 attached.
>

Patch v2 looks OK to me, but probably there is still room for
improvement. Let's see what others think.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: libpq fails to build with TSAN

2024-01-31 Thread Daniel Gustafsson

> On 31 Jan 2024, at 16:39, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> I think it should, the idea of that check is to catch calls to actual exits,
>> while this is instrumentation which has nothing to do with exit(2).  The
>> attached diff should be enough to handle this.
> 
> +1

Pushed.  It can be argued that it should be backpatched, but for now I've only
pushed it to master.  If there are strong opinions on backpatching I'd be happy
to fix that.

--
Daniel Gustafsson

Re: Improve the connection failure error messages

On Wed, Jan 31, 2024 at 9:58 PM Nisha Moond  wrote:
>
>
>> AFAIK some recent commits patches (e,g [1]  for the "slot sync"
>> development) have created some more cases of "could not connect..."
>> messages. So, you might need to enhance your patch to deal with any
>> new ones in the latest HEAD.
>>
>> ==
>> [1] 
>> https://github.com/postgres/postgres/commit/776621a5e4796fa214b6b29a7ca134f6c138572a
>>
> Thank you for the update.
> The v3 patch has the changes needed as per the latest HEAD.
>

Hi, just going by visual inspection of the v2/v3 patch diffs, the
latest v3 LGTM.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Add system identifier to backup manifest

On Thu, Jan 25, 2024 at 2:52 AM Amul Sul  wrote:
> Thank you for the review-comments, updated version attached.

I generally agree with 0001. I spent a long time thinking about your
decision to make verifier_context contain a pointer to manifest_data
instead of, as it does currently, a pointer to manifest_files_hash. I
don't think that's a horrible idea, but it also doesn't seem to be
used anywhere currently. One advantage of the current approach is that
we know that none of the code downstream of verify_backup_directory()
or verify_backup_checksums() actually cares about anything other than
the manifest_files_hash. That's kind of nice. If we didn't change this
as you have done here, then we would need to continue passing the WAL
ranges to parse_required_walI() and the system identifier would have
to be passed explicitly to the code that checks the system identifier,
but that's not such a bad thing, either. It makes it clear which
functions are using which information.

But before you go change anything there, exactly when should 0002 be
checking the system identifier in the control file? What happens now
is that we first walk over the directory tree and make sure we have
the files (verify_backup_directory) and then go through and verify
checksums in a second pass (verify_backup_checksums). We do this
because it lets us report problems that can be detected cheaply --
like missing files -- relatively quickly, and problems that are more
expensive to detect -- like mismatching checksums -- only after we've
reported all the cheap-to-detect problems. At what stage should we
verify the control file? I don't really like verifying it first, as
you've done, because I think the error message change in
004_options.pl is a clear regression. When the whole directory is
missing, it's much more pleasant to complain about the directory being
missing than some file inside the directory being missing.

What I'd be inclined to suggest is that you have verify_backup_file()
notice when the file it's being asked to verify is the control file,
and have it check the system identifier at that stage. I think if you
do that, then the error message change in 004_options.pl goes away.
Now, to do that, you'd need to have the whole manifest_data available
from the context, not just the manifest_files_hash, so that you can
see the expected system identifier. And, interestingly, if you take
this approach, then it appears to me that 0001 is correct as-is and
doesn't need any changes.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Incorrect cost for MergeAppend

2024-01-31 Thread Alexander Kuzmenkov

On Wed, Jan 31, 2024 at 9:49 PM David Rowley  wrote:
> Pushed to master.
>
> Thanks for the report and the fix, Alexander.

Thank you!

Re: Incorrect cost for MergeAppend

2024-01-31 Thread David Rowley

On Thu, 1 Feb 2024 at 04:32, Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > Since we have a minor coming up very soon, I think it's not a good idea
> > to backpatch right now.  Maybe you can push to master now, and consider
> > whether to backpatch later.
>
> As a rule, we don't back-patch changes like this ever.  People don't
> appreciate plans changing in minor releases.

Pushed to master.

Thanks for the report and the fix, Alexander.

Re: Parallelize correlated subqueries that execute within each worker

On Tue, Jan 30, 2024 at 9:56 PM James Coleman  wrote:
> I don't follow the "Idle since July" since it just hasn't received
> review since then, so there's been nothing to reply to.

It wasn't clear to me if you thought that the patch was ready for
review since July, or if it was waiting on you since July. Those are
quite different, IMV.

> That being said, Vignesh's note in January about a now-failing test is
> relevant activity, and I've just today responded to that, so I'm
> changing the status back from Waiting on Author to Needs Review.

Sounds good.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Flushing large data immediately in pqcomm

On Wed, Jan 31, 2024 at 2:23 PM Melih Mutlu  wrote:
>> That seems like it might be a useful refinement of Melih Mutlu's
>> original proposal, but consider a message stream that consists of
>> messages exactly 8kB in size. If that message stream begins when the
>> buffer is empty, all messages are sent directly. If it begins when
>> there are any number of bytes in the buffer, we buffer every message
>> forever. That's kind of an odd artifact, but maybe it's fine in
>> practice. I say again that it's good to test out a bunch of scenarios
>> and see what shakes out.
>
> Isn't this already the case? Imagine sending exactly 8kB messages, the first 
> pq_putmessage() call will buffer 8kB. Any call after this point simply sends 
> a 8kB message already buffered from the previous call and buffers a new 8kB 
> message. Only difference here is we keep the message in the buffer for a 
> while instead of sending it directly. In theory, the proposed idea should not 
> bring any difference in the number of flushes and the size of data we send in 
> each time, but can remove unnecessary copies to the buffer in this case. I 
> guess the behaviour is also the same with or without the patch in case the 
> buffer has already some bytes.

Yes, it's never worse than today in terms of number of buffer flushes,
but it doesn't feel like great behavior, either. Users tend not to
like it when the behavior of an algorithm depends heavily on
incidental factors that shouldn't really be relevant, like whether the
buffer starts with 1 byte in it or 0 at the beginning of a long
sequence of messages. They see the performance varying "for no reason"
and they dislike it. They don't say "even the bad performance is no
worse than earlier versions so it's fine."

> You're right and I'm open to doing more legwork. I'd also appreciate any 
> suggestion about how to test this properly and/or useful scenarios to test. 
> That would be really helpful.

I think experimenting to see whether the long-short-long-short
behavior that Heikki postulated emerges in practice would be a really
good start.

Another experiment that I think would be interesting is: suppose you
create a patch that sends EVERY message without buffering and compare
that to master. My naive expectation would be that this will lose if
you pump short messages through that connection and win if you pump
long messages through that connection. Is that true? If yes, at what
point do we break even on performance? Does it depend on whether the
connection is local or over a network? Does it depend on whether it's
with or without SSL? Does it depend on Linux vs. Windows vs.
whateverBSD? What happens if you twiddle the 8kB buffer size up or,
say, down to just below the Ethernet frame size?

I think that what we really want to understand here is under what
circumstances the extra layer of buffering is a win vs. being a loss.
If all the stuff I just mentioned doesn't really matter and the answer
is, say, that an 8kB buffer is great and the breakpoint where extra
buffering makes sense is also 8kB, and that's consistent regardless of
other variables, then your algorithm or Jelte's variant or something
of that nature is probably just right. But if it turns out, say, that
the extra buffering is only a win for sub-1kB messages, that would be
rather nice to know before we finalize the approach. Also, if it turns
out that the answer differs dramatically based on whether you're using
a UNIX socket or TCP, that would also be nice to know before
finalizing an algorithm.

> I understand that I should provide more/better analysis around this change to 
> prove that it doesn't hurt (hopefully) but improves some cases even though 
> not all the cases. That may even help us to find a better approach than 
> what's already proposed. Just to clarify, I don't think anyone here suggests 
> that the bar should be at "if it can't lose relative to today, it's good 
> enough". IMHO "a change that improves some cases, but regresses nowhere" does 
> not translate to that.

Well, I thought those were fairly similar sentiments, so maybe I'm not
quite understanding the statement in the way it was meant.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Flushing large data immediately in pqcomm

2024-01-31 Thread Melih Mutlu

Robert Haas , 31 Oca 2024 Çar, 20:23 tarihinde şunu
yazdı:

> On Tue, Jan 30, 2024 at 6:39 PM Jelte Fennema-Nio 
> wrote:
> > I agree that it's hard to prove that such heuristics will always be
> > better in practice than the status quo. But I feel like we shouldn't
> > let perfect be the enemy of good here.
>
> Sure, I agree.
>
> > I one approach that is a clear
> > improvement over the status quo is:
> > 1. If the buffer is empty AND the data we are trying to send is larger
> > than the buffer size, then don't use the buffer.
> > 2. If not, fill up the buffer first (just like we do now) then send
> > that. And if the left over data is then still larger than the buffer,
> > then now the buffer is empty so 1. applies.
>
> That seems like it might be a useful refinement of Melih Mutlu's
> original proposal, but consider a message stream that consists of
> messages exactly 8kB in size. If that message stream begins when the
> buffer is empty, all messages are sent directly. If it begins when
> there are any number of bytes in the buffer, we buffer every message
> forever. That's kind of an odd artifact, but maybe it's fine in
> practice. I say again that it's good to test out a bunch of scenarios
> and see what shakes out.
>

Isn't this already the case? Imagine sending exactly 8kB messages, the
first pq_putmessage() call will buffer 8kB. Any call after this point
simply sends a 8kB message already buffered from the previous call and
buffers a new 8kB message. Only difference here is we keep the message in
the buffer for a while instead of sending it directly. In theory, the
proposed idea should not bring any difference in the number of flushes and
the size of data we send in each time, but can remove unnecessary copies to
the buffer in this case. I guess the behaviour is also the same with or
without the patch in case the buffer has already some bytes.

Robert Haas , 31 Oca 2024 Çar, 21:28 tarihinde şunu
yazdı:

> Personally, I don't think it's likely that anything will get committed
> here without someone doing more legwork than I've seen on the thread
> so far. I don't have any plan to pick up this patch anyway, but if I
> were thinking about it, I would abandon the idea unless I were
> prepared to go test a bunch of stuff myself. I agree with the core
> idea of this work, but not with the idea that the bar is as low as "if
> it can't lose relative to today, it's good enough."
>

You're right and I'm open to doing more legwork. I'd also appreciate any
suggestion about how to test this properly and/or useful scenarios to test.
That would be really helpful.

I understand that I should provide more/better analysis around this change
to prove that it doesn't hurt (hopefully) but improves some cases even
though not all the cases. That may even help us to find a better approach
than what's already proposed. Just to clarify, I don't think anyone here
suggests that the bar should be at "if it can't lose relative to today,
it's good enough". IMHO "a change that improves some cases, but regresses
nowhere" does not translate to that.

Thanks,
-- 
Melih Mutlu
Microsoft

Re: psql not responding to SIGINT upon db reconnection

2024-01-31 Thread Jelte Fennema-Nio

On Wed, 31 Jan 2024 at 19:07, Tristan Partin  wrote:
> I was looking for documentation of PQsocket(), but didn't find any
> standalone (unless I completely missed it). So I just copied how
> PQsocket() is documented in PQconnectPoll(). I am happy to document it
> separately if you think it would be useful.

PQsocket its documentation is here:
https://www.postgresql.org/docs/16/libpq-status.html#LIBPQ-PQSOCKET

I think PQsocketPoll should probably have its API documentation
(describing arguments and return value at minimum) in this section of
the docs: https://www.postgresql.org/docs/16/libpq-async.html
And I think the example in the PQconnnectPoll API docs could benefit
from having PQsocketPoll used in that example.

Re: Flushing large data immediately in pqcomm

On Wed, Jan 31, 2024 at 12:49 PM Jelte Fennema-Nio  wrote:
> Testing a bunch of scenarios to find a good one sounds like a good
> idea, which can probably give us a more optimal heuristic. But it also
> sounds like a lot of work, and probably results in a lot of
> discussion. That extra effort might mean that we're not going to
> commit any change for PG17 (or even at all). If so, then I'd rather
> have a modest improvement from my refinement of Melih's proposal, than
> none at all.

Personally, I don't think it's likely that anything will get committed
here without someone doing more legwork than I've seen on the thread
so far. I don't have any plan to pick up this patch anyway, but if I
were thinking about it, I would abandon the idea unless I were
prepared to go test a bunch of stuff myself. I agree with the core
idea of this work, but not with the idea that the bar is as low as "if
it can't lose relative to today, it's good enough."

Of course, another committer may see it differently.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: scalability bottlenecks with (many) partitions (and more)

2024-01-31 Thread Tomas Vondra



On 1/29/24 16:42, Ronan Dunklau wrote:
> Le lundi 29 janvier 2024, 15:59:04 CET Tomas Vondra a écrit :
>> I'm not sure work_mem is a good parameter to drive this. It doesn't say
>> how much memory we expect the backend to use - it's a per-operation
>> limit, so it doesn't work particularly well with partitioning (e.g. with
>> 100 partitions, we may get 100 nodes, which is completely unrelated to
>> what work_mem says). A backend running the join query with 1000
>> partitions uses ~90MB (judging by data reported by the mempool), even
>> with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.
> 
> I understand your point,  I was basing my previous observations on what a 
> backend typically does during the execution.
> 
>>
>> The mempool could tell us how much memory we need (but we could track
>> this in some other way too, probably). And we could even adjust the mmap
>> parameters regularly, based on current workload.
>>
>> But there's then there's the problem that the mmap parameters don't tell
>> If we > > us how much memory to keep, but how large chunks to release.
>>
>> Let's say we want to keep the 90MB (to allocate the memory once and then
>> reuse it). How would you do that? We could set MMAP_TRIM_TRESHOLD 100MB,
>> but then it takes just a little bit of extra memory to release all the
>> memory, or something.
> 
> For doing this you can set M_TOP_PAD using glibc malloc. Which makes sure a 
> certain amount of memory is always kept. 
> 
> But the way the dynamic adjustment works makes it sort-of work like this. 
> MMAP_THRESHOLD and TRIM_THRESHOLD start with low values, meaning we don't 
> expect to keep much memory around. 
> 
> So even "small" memory allocations will be served using mmap at first. Once 
> mmaped memory is released, glibc's consider it a benchmark for "normal" 
> allocations that can be routinely freed, and adjusts mmap_threshold to the 
> released mmaped region size, and trim threshold to two times that. 
> 
> It means over time the two values will converge either to the max value (32MB 
> for MMAP_THRESHOLD, 64 for trim threshold) or to something big enough to 
> accomodate your released memory, since anything bigger than half trim 
> threshold will be allocated using mmap. 
> 
> Setting any parameter disable that.
> 

Thanks. I gave this a try, and I started the tests with this setting:

export MALLOC_TOP_PAD_=$((64*1024*1024))
export MALLOC_MMAP_THRESHOLD_=$((1024*1024))
export MALLOC_TRIM_THRESHOLD_=$((1024*1024))

which I believe means that:

1) we'll keep 64MB "extra" memory on top of heap, serving as a cache for
future allocations

2) everything below 1MB (so most of the blocks we allocate for contexts)
will be allocated on heap (hence from the cache)

3) we won't trim heap unless there's at least 1MB of free contiguous
space (I wonder if this should be the same as MALLOC_TOP_PAD)

Those are mostly arbitrary values / guesses, and I don't have complete
results yet. But from the results I have it seems this has almost the
same effect as the mempool thing - see the attached PDF, with results
for the "partitioned join" benchmark.

first column - "master" (17dev) with no patches, default glibc

second column - 17dev + locking + mempool, default glibc

third column - 17dev + locking, tuned glibc

The color scale on the right is throughput comparison (third/second), as
a percentage with e.g. 90% meaning tuned glibc is 10% slower than the
mempool results. Most of the time it's slower but very close to 100%,
sometimes it's a bit faster. So overall it's roughly the same.

The color scales below the results is a comparison of each branch to the
master (without patches), showing comparison to current performance.
It's almost the same, although the tuned glibc has a couple regressions
that the mempool does not have.

> But I'm not arguing against the mempool, just chiming in with glibc's malloc 
> tuning possibilities :-)
> 

Yeah. I think the main problem with the glibc parameters is that it's
very implementation-specific and also static - the mempool is more
adaptive, I think. But it's an interesting experiment.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

glibc-malloc-tuning.pdf
Description: Adobe PDF document

Re: Fix some ubsan/asan related issues

2024-01-31 Thread Tristan Partin

On Tue Jan 30, 2024 at 10:00 PM CST, Alexander Lakhin wrote:

Hello,

30.01.2024 18:57, Tristan Partin wrote:
> Patch 1:
>
> Passing NULL as a second argument to memcpy breaks ubsan, ...

Maybe you would like to fix also one more similar place, reached with:
create extension xml2;
select xslt_process('',
$$http://www.w3.org/1999/XSL/Transform";>

$$);

varlena.c:201:26: runtime error: null pointer passed as argument 2, which is 
declared to never be null

There is also an issue with pg_bsd_indent, I stumble upon when doing
`make check-world` with the sanitizers enabled:
https://www.postgresql.org/message-id/591971ce-25c1-90f3-0526-5f54e3ebb32e%40gmail.com

31.01.2024 00:23, Andres Freund wrote:
> The reason asan fails is that it uses a "shadow stack" to track stack variable
> lifetimes. These confuse our stack depth check. CI doesn't have the issue
> because the compiler doesn't yet enable the feature, locally I get around it
> by using ASAN_OPTIONS=detect_stack_use_after_return=0:...

Even with detect_stack_use_after_return=0, clang-18's asan makes the test
012_subtransactions.pl fail:
2024-01-31 03:24:25.691 UTC [4112455] 012_subtransactions.pl LOG: statement: 
SELECT hs_subxids(201);
2024-01-31 03:24:25.714 UTC [4112455] 012_subtransactions.pl ERROR: stack depth 
limit exceeded
2024-01-31 03:24:25.714 UTC [4112455] 012_subtransactions.pl HINT: Increase the configuration parameter max_stack_depth 
(currently 2048kB), after ensuring the platform's stack depth limit is adequate.

(All the other tests pass.)
Though the same test passes when I use clang-16.

Thanks Alexander! I will try and take a look at these.

--
Tristan Partin
Neon (https://neon.tech)

Re: make dist using git archive

On Wed, Jan 31, 2024 at 10:50 AM Eli Schwartz  wrote:
> Ideally git commits should be signed, but that requires large numbers of
> people to have security-minded git commit habits. From a quick check of
> the postgres commit logs, only one person seems to be regularly signing
> commits, which does provide a certain measure of protection -- an
> attacker cannot attack via `git push --force` across that boundary, and
> those commits serve as verifiable states that multiple people have seen.
>
> The tags aren't signed either, which is a big issue for verifiably
> identifying the release artifacts published by the release manager. Even
> if not every commit is signed, having signed tags provides a known
> coordination point of code that has been broadly tested and code-signed
> for mass use.
>
> In summary, my opinion is that using git-get-tar-commit-id provides zero
> security guarantees, and if that's not something you are worried about
> then that's one thing, but if you were expecting it to *replace* signing
> the tarball, then that's very much another thing entirely, and not
> one I can agree at all with.

I read this part with interest. I think there's definitely something
to be said for strengthening some of our practices in this area. At
the same time, I think it's reasonable for Peter to want to pursue the
limited goal he stated in the original post, namely reproducible
tarball generation, without getting tangled up in possible policy
changes that might be controversial and might require a bunch of
planning and coordination. "GPG signatures are good" can be true
without "reproducible tarball generation is good" being false; and if
"git archive" allows for that and "meson dist" doesn't, then we're
unlikely to adopt "meson dist".

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: psql not responding to SIGINT upon db reconnection

2024-01-31 Thread Tristan Partin

On Tue Jan 30, 2024 at 4:42 PM CST, Jelte Fennema-Nio wrote:

On Tue, 30 Jan 2024 at 23:20, Tristan Partin  wrote:
> Not next week, but here is a respin. I've exposed pqSocketPoll as
> PQsocketPoll and am just using that. You can see the diff is so much
> smaller, which is great!

The exports.txt change should be made part of patch 0001, also docs
are missing for the newly exposed function. PQsocketPoll does look
like quite a nice API to expose from libpq.

I was looking for documentation of PQsocket(), but didn't find any 
standalone (unless I completely missed it). So I just copied how 
PQsocket() is documented in PQconnectPoll(). I am happy to document it 
separately if you think it would be useful.

My bad on the exports.txt change being in the wrong commit. Git 
things... I will fix it on the next re-spin after resolving the previous 
paragraph.

--
Tristan Partin
Neon (https://neon.tech)

Re: Flushing large data immediately in pqcomm

2024-01-31 Thread Jelte Fennema-Nio

On Wed, 31 Jan 2024 at 18:23, Robert Haas  wrote:
> That's kind of an odd artifact, but maybe it's fine in
> practice.

I agree it's an odd artifact, but it's not a regression over the
status quo. Achieving that was the intent of my suggestion: A change
that improves some cases, but regresses nowhere.

> I say again that it's good to test out a bunch of scenarios
> and see what shakes out.

Testing a bunch of scenarios to find a good one sounds like a good
idea, which can probably give us a more optimal heuristic. But it also
sounds like a lot of work, and probably results in a lot of
discussion. That extra effort might mean that we're not going to
commit any change for PG17 (or even at all). If so, then I'd rather
have a modest improvement from my refinement of Melih's proposal, than
none at all.

Re: Reducing output size of nodeToString

On Wed, Jan 31, 2024 at 11:17 AM Matthias van de Meent
 wrote:
> I was also thinking about smaller per-attribute expression storage, for index 
> attribute expressions, table default expressions, and functions. Other than 
> that, less memory overhead for the serialized form of these constructs also 
> helps for catalog cache sizes, etc.
> People complained about the size of a fresh initdb, and I agreed with them, 
> so I started looking at low-hanging fruits, and this is one.
>
> I've not done any tests yet on whether it's more performant in general. I'd 
> expect the new code to do a bit better given the extremely verbose nature of 
> the data and the rather complex byte-at-a-time token read method used, but 
> this is currently hypothesis.
> I do think that serialization itself may be slightly slower, but given that 
> this generally happens only in DDL, and that we have to grow the output 
> buffer less often, this too may still be a net win (but, again, this is an 
> untested hypothesis).

I think we're going to have to have separate formats for debugging and
storage if we want to get very far here. The current format sucks for
readability because it's so verbose, and tightening that up where we
can makes sense to me. For me, that can include things like emitting
unset location fields for sure, but delta-encoding of bitmap sets is
more questionable. Turning 1 2 3 4 5 6 7 8 9 10 into 1-10 would be
fine with me because that is both shorter and more readable, but
turning 2 4 6 8 10 into 2 2 2 2 2 is way worse for a human reader.
Such optimizations might make sense in a format that is designed for
computer processing only but not one that has to serve multiple
purposes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: speed up a logical replica setup

On Wed, Jan 31, 2024 at 12:38 PM Euler Taveira  wrote:
>
>
> Hmm. I didn't try it with the failover patch that was recently applied.
Did you
> have any special configuration on primary?
>

Nothing special, here the configurations I've changed after bootstrap:

port = '5432'
wal_level = 'logical'
max_wal_senders = '8'
max_replication_slots = '6'
hot_standby_feedback = 'on'
max_prepared_transactions = '10'
max_locks_per_transaction = '512'

Regards,

--
Fabrízio de Royes Mello

Re: Flushing large data immediately in pqcomm

On Tue, Jan 30, 2024 at 6:39 PM Jelte Fennema-Nio  wrote:
> I agree that it's hard to prove that such heuristics will always be
> better in practice than the status quo. But I feel like we shouldn't
> let perfect be the enemy of good here.

Sure, I agree.

> I one approach that is a clear
> improvement over the status quo is:
> 1. If the buffer is empty AND the data we are trying to send is larger
> than the buffer size, then don't use the buffer.
> 2. If not, fill up the buffer first (just like we do now) then send
> that. And if the left over data is then still larger than the buffer,
> then now the buffer is empty so 1. applies.

That seems like it might be a useful refinement of Melih Mutlu's
original proposal, but consider a message stream that consists of
messages exactly 8kB in size. If that message stream begins when the
buffer is empty, all messages are sent directly. If it begins when
there are any number of bytes in the buffer, we buffer every message
forever. That's kind of an odd artifact, but maybe it's fine in
practice. I say again that it's good to test out a bunch of scenarios
and see what shakes out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Abort if dup fail (src/bin/pg_dump/compress_none.c)

2024-01-31 Thread Ranier Vilela

Hi.

Per Coverity.
CID 1506240: (#1 of 1): Argument cannot be negative (NEGATIVE_RETURNS)
3. negative_returns: dup(fd) is passed to a parameter that cannot be
negative.

pg_dump function open_none, tries to associate a stream to a file
descriptor,
using function dup, which may fail and return negative value.

fdopen cannot receive negative parameters, in this case fail and return
EBADF.

This can be confusing for the user, who will be trying to figure out what's
wrong.
Better abort and report the correct failure to the user.

Patch attached.

Best regards,
Ranier Vilela


abort-if-dup-fail-pg_dump.patch
Description: Binary data

Re: pgsql: Clean pg_walsummary's tmp_check directory.

On Wed, Jan 31, 2024 at 11:51 AM Tom Lane  wrote:
> Clean pg_walsummary's tmp_check directory.

Ugh. The reason I keep doing this is because I've switched to using
meson builds, where of course you don't get complaints about this.

And it seems like CI doesn't tell you either. Nor does the buildfarm.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Reducing output size of nodeToString

2024-01-31 Thread Matthias van de Meent

On Wed, 31 Jan 2024, 09:16 Peter Eisentraut,  wrote:

> On 30.01.24 12:26, Matthias van de Meent wrote:
> >> Most of the other defaults I'm doubtful about.  First, we are colliding
> >> here between the goals of minimizing the storage size and making the
> >> debug output more readable.
> > I've never really wanted to make the output "more readable". The
> > current one is too verbose, yes.
>
> My motivations at the moment to work in this area are (1) to make the
> output more readable, and (2) to reduce maintenance burden of node
> support functions.
>
> There can clearly be some overlap with your goals.  For example, a less
> verbose and less redundant output can ease readability.  But it can also
> go the opposite direction; a very minimalized output can be less readable.
>
> I would like to understand your target more.  You have shown some
> figures how these various changes reduce storage size in pg_rewrite.
> But it's a few hundred kilobytes, if I read this correctly, maybe some
> megabytes if you add a lot of user views.  Does this translate into any
> other tangible benefits, like you can store more views, or processing
> views is faster, or something like that?

I was also thinking about smaller per-attribute expression storage, for
index attribute expressions, table default expressions, and functions.
Other than that, less memory overhead for the serialized form of these
constructs also helps for catalog cache sizes, etc.
People complained about the size of a fresh initdb, and I agreed with them,
so I started looking at low-hanging fruits, and this is one.

I've not done any tests yet on whether it's more performant in general. I'd
expect the new code to do a bit better given the extremely verbose nature
of the data and the rather complex byte-at-a-time token read method used,
but this is currently hypothesis.
I do think that serialization itself may be slightly slower, but given that
this generally happens only in DDL, and that we have to grow the output
buffer less often, this too may still be a net win (but, again, this is an
untested hypothesis).

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Possibility to disable `ALTER SYSTEM`

On Wed, Jan 31, 2024 at 5:16 AM Gabriele Bartolini
 wrote:
> I very much like the idea of a file in the data directory that also controls 
> the copy operations.
>
> Just wanted to highlight though that in our operator we have already applied 
> the read-only postgresql.auto.conf trick to disable the system (see 
> https://cloudnative-pg.io/documentation/current/postgresql_conf/#enabling-alter-system).
>  However, having that file read-only triggered an issue when using pg_rewind 
> to resync a former primary, as pg_rewind immediately bails out when a 
> read-only file is encountered in the PGDATA (see 
> https://github.com/cloudnative-pg/cloudnative-pg/issues/3698).
>
> We might keep this in mind if we go down the path of the separate file.

Yeah. It would be possible to teach pg_rewind and other utilities to
handle unreadable or unwritable files in the data directory, but I'm
not sure that's the best path forward here, and it would require some
consensus that it's the way we want to go.

Another option I thought of would be to control these sorts of things
with a command-line switch. I doubt whether that does anything really
fundamental from a security point of view, but it removes the control
of the toggles from anything in the data directory while still leaving
it within the server administrator's remit.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: make dist using git archive

2024-01-31 Thread Eli Schwartz

On 1/31/24 3:03 AM, Peter Eisentraut wrote:
>> What do you use this for? IMO a more robust way to track the commit used
>> is to use gitattributes export-subst to write a `.git_archival.txt` file
>> containing the commit sha1 and other info -- this can be read even after
>> the file is extracted, which means it can also be used to bake the ID
>> into the built binaries e.g. as part of --version output.
> 
> It's a marginal use case, for sure.  But it is something that git
> provides tooling for that is universally available.  Any alternative
> would be an ad-hoc solution that is specific to our project and would be
> different for the next project.

mercurial has the "archivemeta" config setting that exports similar
information, but forces the filename ".hg_archival.txt".

The setuptools-scm project follows this pattern by requiring the git
file to be called ".git_archival.txt" with a set pattern mimicking the
hg one:

https://setuptools-scm.readthedocs.io/en/latest/usage/#git-archives

So, I guess you could use this and then it would not be specific to your
project. :)

>> Overall I feel like much of this is about requiring dist tarballs to be
>> byte-identical to other dist tarballs, although reproducible builds is
>> mainly about artifacts, not sources, and for sources it doesn't
>> generally matter unless the sources are ephemeral and generated
>> on-demand (in which case it is indeed very important to produce the same
>> tarball each time).
> 
> The source tarball is, in a way, also an artifact.
> 
> I think it's useful that others can easily independently verify that the
> produced tarball matches what they have locally.  It's not an absolute
> requirement, but given that it is possible, it seems useful to take
> advantage of it.
> 
> In a way, this also avoids the need for signing the tarball, which we
> don't do.  So maybe that contributes to a different perspective.

Since you mention signing and not as a simple "aside"...

That's a fascinating perspective. I wonder how people independently
verify that what they have locally (I assume from git clones) matches
what the postgres committers have authorized.

I'm a bit skeptical that you can avoid the need to perform code-signing
at some stage, somewhere, somehow, by suggesting that people can simply
git clone, run some commands and compare the tarball. The point of
signing is to verify that no one has acquired an untraceable API token
they should not have and gotten write access to the authoritative server
then uploaded malicious code under various forged identities, possibly
overwriting previous versions, either in git or out of git.

Ideally git commits should be signed, but that requires large numbers of
people to have security-minded git commit habits. From a quick check of
the postgres commit logs, only one person seems to be regularly signing
commits, which does provide a certain measure of protection -- an
attacker cannot attack via `git push --force` across that boundary, and
those commits serve as verifiable states that multiple people have seen.

The tags aren't signed either, which is a big issue for verifiably
identifying the release artifacts published by the release manager. Even
if not every commit is signed, having signed tags provides a known
coordination point of code that has been broadly tested and code-signed
for mass use.

...

In summary, my opinion is that using git-get-tar-commit-id provides zero
security guarantees, and if that's not something you are worried about
then that's one thing, but if you were expecting it to *replace* signing
the tarball, then that's very much another thing entirely, and not
one I can agree at all with.

-- 
Eli Schwartz

OpenPGP_0x84818A6819AF4A9B.asc
Description: OpenPGP public key

OpenPGP_signature.asc
Description: OpenPGP digital signature

Re: Synchronizing slots from primary to standby

2024-01-31 Thread Masahiko Sawada

On Wed, Jan 31, 2024 at 7:42 PM Amit Kapila  wrote:
>
> On Wed, Jan 31, 2024 at 2:02 PM Masahiko Sawada  wrote:
> >
> > Thank you for updating the patches. As for the slotsync worker patch,
> > is there any reason why 0001, 0002, and 0004 patches are still
> > separated?
> >
>
> No specific reason, it could be easier to review those parts.

Okay, I think we can merge 0001 and 0002 at least as we don't need
bgworker codes.

>
> >
> > Beside, here are some comments on v74 0001, 0002, and 0004 patches:
> >
> > ---
> > +static char *
> > +wait_for_valid_params_and_get_dbname(void)
> > +{
> > +   char   *dbname;
> > +   int rc;
> > +
> > +   /* Sanity check. */
> > +   Assert(enable_syncslot);
> > +
> > +   for (;;)
> > +   {
> > +   if (validate_parameters_and_get_dbname(&dbname))
> > +   break;
> > +   ereport(LOG, errmsg("skipping slot synchronization"));
> > +
> > +   ProcessSlotSyncInterrupts(NULL);
> >
> > When reading this function, I expected that the slotsync worker would
> > resume working once the parameters became valid, but it was not
> > correct. For example, if I changed hot_standby_feedback from off to
> > on, the slotsync worker reads the config file, exits, and then
> > restarts. Given that the slotsync worker ends up exiting on parameter
> > changes anyway, why do we want to have it wait for parameters to
> > become valid?
> >
>
> Right, the reason for waiting is to avoid repeated re-start of
> slotsync worker if the required parameter is not changed. To follow
> that, I think we should simply continue when the required parameter is
> changed and is valid. But, I think during actual slotsync, if
> connection_info is changed then there is no option but to restart.

Agreed.

> >
> > ---
> > +bool
> > +SlotSyncWorkerCanRestart(void)
> > +{
> > +#define SLOTSYNC_RESTART_INTERVAL_SEC 10
> > +
> >
> > IIUC depending on how busy the postmaster is and the timing, the user
> > could wait for 1 min to re-launch the slotsync worker. But I think the
> > user might want to re-launch the slotsync worker more quickly for
> > example when the slotsync worker restarts due to parameter changes.
> > IIUC SloSyncWorkerCanRestart() doesn't consider the fact that the
> > slotsync worker previously exited with 0 or 1.
> >
>
> Considering my previous where we don't want to restart for a required
> parameter change, isn't it better to avoid repeated restart (say when
> the user gave an invalid dbname)? BTW, I think this restart interval
> is added based on your previous complaint [1].

I think it's useful that the slotsync worker restarts immediately when
a required parameter is changed but waits to restart when it exits
with an error. IIUC the apply worker does so; if it restarts due to a
subscription parameter change, it resets the last-start time so that
the launcher will restart it without waiting. But if it exits with an
error, the launcher waits for wal_retrieve_retry_interval. I don't
think the slotsync worker must follow this behavior but I feel it's
useful behavior.

>
> >
> > ---
> > When I dropped a database on the primary that has a failover slot, I
> > got the following logs on the standby:
> >
> > 2024-01-31 17:25:21.750 JST [1103933] FATAL:  replication slot "s" is
> > active for PID 1103935
> > 2024-01-31 17:25:21.750 JST [1103933] CONTEXT:  WAL redo at 0/3020D20
> > for Database/DROP: dir 1663/16384
> > 2024-01-31 17:25:21.751 JST [1103930] LOG:  startup process (PID
> > 1103933) exited with exit code 1
> >
> > It seems that because the slotsync worker created the slot on the
> > standby, the slot's active_pid is still valid.
> >
>
> But we release the slot after sync. And we do take a shared lock on
> the database to make the startup process wait for slotsync. There is
> one gap which is that we don't reset active_pid for temp slots in
> ReplicationSlotRelease(), so for temp slots such an error can occur
> but OTOH, we immediately make the slot persistent after sync. As per
> my understanding, it is only possible to get this error if the initial
> sync doesn't happen and the slot remains temporary. Is that your case?
> How did reproduce this?

I created a failover slot manually on the primary and dropped the
database where the failover slot is created. So this would not happen
in normal cases.

BTW I've tested the following switch/fail-back scenario but it seems
not to work fine. Am I missing something?

Setup:
node1 is the primary, node2 is the physical standby for node1, and
node3 is the subscriber connecting to node1.

Steps:
1. [node1]: create a table and a publication for the table.
2. [node2]: set enable_syncslot = on and start (to receive WALs from node1).
3. [node3]: create a subscription with failover = true for the publication.
4. [node2]: promote to the new standby.
5. [node3]: alter subscription to connect the new primary, node2.
6. [node1]: stop, set enable_syncslot = on (and other required
parameters), then start as a new standby.

Then I got the error "

Re: [PATCH] Add native windows on arm64 support

2024-01-31 Thread Andrew Dunstan




On 2024-01-31 We 10:34, Peter Eisentraut wrote:

On 31.01.24 16:20, Andrew Dunstan wrote:
- PostgreSQL will only build for the x64 architecture on 64-bit 
Windows. + PostgreSQL will only build for the x64 and ARM64 
architecture on 64-bit Windows.


Are there any other 64-bit architectures for Windows?

Possibly, the original sentence was meant to communicate that ARM was 
not supported, in which case it could now be removed?



x86? That is in fact the default setting for VS even on ARM64.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: libpq fails to build with TSAN

2024-01-31 Thread Tom Lane

Daniel Gustafsson  writes:
> I think it should, the idea of that check is to catch calls to actual exits,
> while this is instrumentation which has nothing to do with exit(2).  The
> attached diff should be enough to handle this.

+1

regards, tom lane

Re: speed up a logical replica setup

On Wed, Jan 31, 2024, at 11:55 AM, Fabrízio de Royes Mello wrote:
> 
> On Wed, Jan 31, 2024 at 11:35 AM Euler Taveira  wrote:
> >
> > On Wed, Jan 31, 2024, at 11:25 AM, Fabrízio de Royes Mello wrote:
> >
> > Jumping into this a bit late here... I'm trying a simple 
> > pg_createsubscriber but getting an error:
> >
> >
> > Try v11. It seems v12-0002 is not correct.
> 
> Using v11 I'm getting this error:
> 
> ~/pgsql took 22s 
> ✦ ➜ pg_createsubscriber -d fabrizio -r -D /tmp/replica5434 -S 'host=/tmp 
> port=5434' -P 'host=/tmp port=5432'
> NOTICE:  changed the failover state of replication slot 
> "pg_createsubscriber_16384_706609" on publisher to false
> pg_createsubscriber: error: could not drop replication slot 
> "pg_createsubscriber_706609_startpoint" on database "fabrizio": ERROR:  
> replication slot "pg_createsubscriber_706609_startpoint" does not exist
> Write-ahead log reset

Hmm. I didn't try it with the failover patch that was recently applied. Did you
have any special configuration on primary?


--
Euler Taveira
EDB   https://www.enterprisedb.com/

Re: [PATCH] Add native windows on arm64 support

2024-01-31 Thread Peter Eisentraut


On 31.01.24 16:20, Andrew Dunstan wrote:
- PostgreSQL will only build for the x64 architecture on 64-bit Windows. 
+ PostgreSQL will only build for the x64 and ARM64 architecture on 
64-bit Windows.


Are there any other 64-bit architectures for Windows?

Possibly, the original sentence was meant to communicate that ARM was 
not supported, in which case it could now be removed?

Re: cleanup patches for incremental backup

On Tue, Jan 30, 2024 at 11:52 AM Robert Haas  wrote:
> Here's a patch for that. I now think
> a7097ca630a25dd2248229f21ebce4968d85d10a was actually misguided, and
> served only to mask some of the failures caused by waiting for the WAL
> summary file.

Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Incorrect cost for MergeAppend

2024-01-31 Thread Tom Lane

Alvaro Herrera  writes:
> Since we have a minor coming up very soon, I think it's not a good idea
> to backpatch right now.  Maybe you can push to master now, and consider
> whether to backpatch later.

As a rule, we don't back-patch changes like this ever.  People don't
appreciate plans changing in minor releases.

regards, tom lane

Re: Oversight in reparameterize_path_by_child leading to executor crash

2024-01-31 Thread Tom Lane

Richard Guo  writes:
> On Wed, Jan 31, 2024 at 5:12 AM Tom Lane  wrote:
>> * Why is it okay to just use pull_varnos on a tablesample expression,
>> when contain_references_to() does something different?

> Hmm, the main reason is that the expression_tree_walker() function does
> not handle T_RestrictInfo nodes.  So we cannot just use pull_varnos or
> pull_var_clause on a restrictinfo list.

Right, the extract_actual_clauses step is not wanted for the
tablesample expression.  But my point is that surely the handling of
Vars and PlaceHolderVars should be the same, which it's not, and your
v11 makes it even less so.  How can it be OK to ignore Vars in the
restrictinfo case?

I think the code structure we need to end up with is a routine that
takes a RestrictInfo-free node tree (and is called directly in the
tablesample case) with a wrapper routine that strips the RestrictInfos
(for use on restriction lists).

> But I'm not sure about checking phinfo->ph_eval_at.  It seems to me that
> the ph_eval_at could not overlap the other join relation; otherwise the
> clause would not be a restriction clause but rather a join clause.

At least in the tablesample case, plain Vars as well as PHVs belonging
to the other relation are definitely possible.

regards, tom lane

Re: [PATCH] Add native windows on arm64 support

2024-01-31 Thread Andrew Dunstan

On 2024-01-30 Tu 17:54, Dave Cramer wrote:

On Tue, Jan 30, 2024 at 4:56 PM Andrew Dunstan  
wrote:

On 2024-01-30 Tu 09:50, Dave Cramer wrote:

On Tue, 30 Jan 2024 at 08:38, Andrew Dunstan
 wrote:

On 2024-01-29 Mo 11:20, Dave Cramer wrote:

Dave Cramer
www.postgres.rocks 

On Mon, 29 Jan 2024 at 11:16, Andrew Dunstan
 wrote:

On 2024-01-26 Fr 09:18, Dave Cramer wrote:

On Fri, 26 Jan 2024 at 07:36, Andrew Dunstan
 wrote:

On 2024-01-25 Th 20:32, Michael Paquier wrote:
> On Thu, Jan 25, 2024 at 04:52:30PM -0500, Dave
Cramer wrote:
>> On Thu, 25 Jan 2024 at 16:32, Andrew Dunstan
 wrote:
>>> On 2024-01-25 Th 16:17, Dave Cramer wrote:
>>> Yeah, I think the default Developer Command
Prompt for VS2022 is set up
>>> for x86 builds. AIUI you should start by
executing "vcvarsall x64_arm64".
>> Yup, now I'm in the same state you are
> Wait a minute here. Based on [1], x64_arm64 means
you can use a x64
> host and you'll be able to produce ARM64 builds,
still these will not
> be able to run on the host where they were built.
How much of the
> patch posted upthread is required to produce such
builds?  Basically
> everything from it, I guess, so as build
dependencies can be
> satisfied?
>
> [1]:

https://learn.microsoft.com/en-us/cpp/build/building-on-the-command-line?view=msvc-170

If you look at the table here x86 and x64 are the
only supported host
architectures. But that's OK, the x64 binaries will
run on arm64 (W11
ARM64 has x64 emulation builtin). If that didn't
work Dave and I would
not have got as far as we have. But you want the
x64_arm64 argument to
vcvarsall so you will get ARM64 output.

I've rebuilt it using x64_arm64 and with the attached
(very naive patch) and I still get an x64 binary :(

With this patch I still get a build error, but it's
different :-)

[1406/2088] "link" @src/backend/postgres.exe.rsp
FAILED: src/backend/postgres.exe src/backend/postgres.pdb
"link" @src/backend/postgres.exe.rsp
   Creating library src\backend\postgres.exe.lib

storage_lmgr_s_lock.c.obj : error LNK2019: unresolved
external symbol spin_delay referenced in function
perform_spin_delay

src\backend\postgres.exe : fatal error LNK1120: 1
unresolved externals

Did you add the latest lock.patch ?

I'm a bit confused about exactly what needs to be applied.
Can you supply a complete patch to be applied to a pristine
checkout that will let me build?

cheers

See attached.

No, that is what is giving me the error shown above (just tried
again to be certain). And it's not surprising, as patch 2 #ifdef's
out the definition of spin_delay().

If you can get a complete build with these patches then I suspect
you're not doing a proper ARM64 build.

Okay I will look when I get home in a week

I made some progress. The attached is mostly taken from 

With it applied I was able to get a successful build using the buildfarm 
client. However, there are access violations when running some tests, so 
there is still some work to do, apparently.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index ed5b285a5e..d9b8649dab 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -4150,7 +4150,7 @@ make: *** [postgres] Error 1

 Special Considerations for 64-Bit Windows

- PostgreSQL will only build for the x64 architecture on 64-bit Windows.
+ PostgreSQL will only build for the x64 and ARM64 architecture on 64-bit Windows.

  Mixing 32- and 64-bit versions in the same build tree is not supported.
diff --git a/meson.build b/meson.build
index 8ed51b6aae..14aea924ec 100644
--- a/meson.build
+++ b/meson.build
@@ -2046,8 +2046,11 @@ int main(void)
 elif host_cpu == 'arm' or host_cpu == 'aarch64'

   prog = '''
+#ifdef _MSC_VER
+#include 
+#else
 #include 
-
+#endif
 int main(void)
 {
 unsigned int crc = 0;
@@ -2061,7 +2064,11 @@ int main(void)
 }
 '''

-  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc

Re: speed up a logical replica setup

On Wed, Jan 31, 2024 at 11:35 AM Euler Taveira  wrote:
>
> On Wed, Jan 31, 2024, at 11:25 AM, Fabrízio de Royes Mello wrote:
>
> Jumping into this a bit late here... I'm trying a simple
pg_createsubscriber but getting an error:
>
>
> Try v11. It seems v12-0002 is not correct.

Using v11 I'm getting this error:

~/pgsql took 22s
✦ ➜ pg_createsubscriber -d fabrizio -r -D /tmp/replica5434 -S 'host=/tmp
port=5434' -P 'host=/tmp port=5432'
NOTICE:  changed the failover state of replication slot
"pg_createsubscriber_16384_706609" on publisher to false
pg_createsubscriber: error: could not drop replication slot
"pg_createsubscriber_706609_startpoint" on database "fabrizio": ERROR:
 replication slot "pg_createsubscriber_706609_startpoint" does not exist
Write-ahead log reset

Attached the output log.

Regards,

-- 
Fabrízio de Royes Mello
2024-01-31 11:53:31.882 -03 [706626] LOG:  starting PostgreSQL 17devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2024-01-31 11:53:31.882 -03 [706626] LOG:  listening on IPv6 address "::1", port 5434
2024-01-31 11:53:31.882 -03 [706626] LOG:  listening on IPv4 address "127.0.0.1", port 5434
2024-01-31 11:53:31.990 -03 [706626] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5434"
2024-01-31 11:53:32.082 -03 [706639] LOG:  database system was shut down in recovery at 2024-01-31 11:53:30 -03
2024-01-31 11:53:32.084 -03 [706639] LOG:  entering standby mode
2024-01-31 11:53:32.100 -03 [706639] LOG:  redo starts at 0/428
2024-01-31 11:53:32.189 -03 [706639] LOG:  consistent recovery state reached at 0/69EFF00
2024-01-31 11:53:32.189 -03 [706639] LOG:  invalid record length at 0/69F01D0: expected at least 24, got 0
2024-01-31 11:53:32.189 -03 [706626] LOG:  database system is ready to accept read-only connections
2024-01-31 11:53:32.200 -03 [706640] LOG:  started streaming WAL from primary at 0/600 on timeline 1
2024-01-31 11:53:32.218 -03 [706639] LOG:  recovery stopping after WAL location (LSN) "0/6A1F4C0"
2024-01-31 11:53:32.218 -03 [706639] LOG:  redo done at 0/6A1F4C0 system usage: CPU: user: 0.07 s, system: 0.01 s, elapsed: 0.11 s
2024-01-31 11:53:32.219 -03 [706639] LOG:  last completed transaction was at log time 2024-01-31 11:53:31.745182-03
2024-01-31 11:53:32.220 -03 [706640] FATAL:  terminating walreceiver process due to administrator command
2024-01-31 11:53:32.266 -03 [706639] LOG:  selected new timeline ID: 2
2024-01-31 11:53:32.390 -03 [706639] LOG:  archive recovery complete
2024-01-31 11:53:32.403 -03 [706637] LOG:  checkpoint starting: end-of-recovery immediate wait
2024-01-31 11:53:32.809 -03 [706637] LOG:  checkpoint complete: wrote 2141 buffers (13.1%); 0 WAL file(s) added, 0 removed, 2 recycled; write=0.067 s, sync=0.198 s, total=0.419 s; sync files=51, longest=0.018 s, average=0.004 s; distance=43133 kB, estimate=43133 kB; lsn=0/6A1F4F8, redo lsn=0/6A1F4F8
2024-01-31 11:53:32.827 -03 [706626] LOG:  database system is ready to accept connections
2024-01-31 11:53:33.398 -03 [706626] LOG:  received fast shutdown request
2024-01-31 11:53:33.409 -03 [706650] LOG:  logical replication apply worker for subscription "pg_createsubscriber_16384_706609" has started
2024-01-31 11:53:33.414 -03 [706626] LOG:  aborting any active transactions
2024-01-31 11:53:33.414 -03 [706650] FATAL:  terminating logical replication worker due to administrator command
2024-01-31 11:53:33.415 -03 [706626] LOG:  background worker "logical replication launcher" (PID 706645) exited with exit code 1
2024-01-31 11:53:33.415 -03 [706626] LOG:  background worker "logical replication apply worker" (PID 706650) exited with exit code 1
2024-01-31 11:53:33.415 -03 [706637] LOG:  shutting down
2024-01-31 11:53:33.438 -03 [706637] LOG:  checkpoint starting: shutdown immediate
2024-01-31 11:53:33.642 -03 [706637] LOG:  checkpoint complete: wrote 23 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.052 s, sync=0.047 s, total=0.228 s; sync files=14, longest=0.011 s, average=0.004 s; distance=3 kB, estimate=38820 kB; lsn=0/6A20270, redo lsn=0/6A20270
2024-01-31 11:53:33.648 -03 [706626] LOG:  database system is shut down

Re: remaining sql/json patches

2024-01-31 Thread jian he

Hi.
minor issues.
I am wondering do we need add `pg_node_attr(query_jumble_ignore)`
to some of our created structs in src/include/nodes/parsenodes.h in
v39-0001-Add-SQL-JSON-query-functions.patch

diff --git a/src/backend/parser/parse_jsontable.c
b/src/backend/parser/parse_jsontable.c
new file mode 100644
index 00..25b8204dc6
--- /dev/null
+++ b/src/backend/parser/parse_jsontable.c
@@ -0,0 +1,718 @@
+/*-
+ *
+ * parse_jsontable.c
+ *  parsing of JSON_TABLE
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *  src/backend/parser/parse_jsontable.c
+ *
+ *-
+ */
2022 should change to 2024.

Re: Possibility to disable `ALTER SYSTEM`

On Wed, Jan 31, 2024 at 12:28 AM Tom Lane  wrote:
> You cannot enforce such restrictions within Postgres.
> It has to be done by an outside mechanism.  If you think
> different, you are mistaken.

It seems like the biggest reason why we can't enforce such
restrictions with Postgres is that you won't hear of anyone committing
any code which would let us enforce such restrictions in Postgres. I'm
not saying that there's no other problem here, but you're just digging
in your heels. I wrote upthread "We can't resolve the tension between
those two things in either direction by somebody hammering on the side
of the argument that they believe to be correct and ignoring the other
one" and you replied to that by quoting what I said about the side of
the argument that you believe and hammering on it some more. I really
wish you wouldn't do stuff like that.

One thing that I think might be helpful here is to address the
question of exactly how the superuser can get general-purpose
filesystem access. They can definitely do it if there are any
untrusted PLs installed, but the person who configures the machine can
control that. They can also do it if extensions like adminpack are
available, but the server administrator can control that, too. They
can do it through COPY TO/FROM PROGRAM, but we could provide a way to
restrict that, and I think an awful lot of people want that. I don't
know of any other "normal" way of getting filesystem access, but the
superuser can also hack the system catalogs. That means they can
create a function definition that tries to run an arbitrary function
either in PostgreSQL itself or any .so they can get their hands on --
but this is a much less powerful technique since
5ded4bd21403e143dd3eb66b92d52732fdac1945 removed the version 0 calling
convention. You can no longer manufacture calls to random C functions
that aren't expecting to be called from SQL. The superuser can also
arrange to call a function that *is* intended to be SQL-callable with
the wrong argument types. It's not hard to manufacture a crash that
way, because if you call a function that's expecting a varlena with an
integer, you can induce the function to read more memory than intended
and run right off the stack. I'm not quite sure whether this can be
parlayed into arbitrary code execution; I think it's possible.

And, then, of course, you can use ALTER SYSTEM to set archive_command
or restore_command or similar to a shell command of your choosing.

What else is there? We should actually document the whole list of ways
that a superuser can escape the sandbox. Because right now there are
tons of people, even experienced PG users, who think that superusers
can't escape from PG at all, or that it's just about COPY TO/FROM
PROGRAM. The lack of clarity about what the issues are makes
intelligent discussion difficult. Our documentation hints at the fact
that there's no privilege boundary between the superuser and the OS
user, but it's not very clear or very detailed or in any very central
place, and it's not surprising that not everyone understands the
situation correctly.

At any rate, unless there are way more ways to get filesystem access
than what I've listed here, it's not unreasonable for people to want
to shut off the most obvious ones, like COPY TO/FROM PROGRAM and ALTER
SYSTEM. And there's no real reason we can't provide a way to do that.
It's just sticking your head in the stand to say "well, because we
can't prevent people from crafting a stack overrun attack to access
the filesystem, we shouldn't have a feature that tells them ALTER
SYSTEM is disabled on this instance."

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: speed up a logical replica setup

On Wed, Jan 31, 2024, at 11:25 AM, Fabrízio de Royes Mello wrote:
> Jumping into this a bit late here... I'm trying a simple pg_createsubscriber 
> but getting an error:

Try v11. It seems v12-0002 is not correct.

> Seems we need to escape connection params similar we do in dblink [1]

I think it is a consequence of v12-0003. I didn't review v12 yet but although I
have added a comment saying it might be possible to use primary_conninfo, I'm
not 100% convinced that's the right direction.

/*
 * TODO use primary_conninfo (if available) from subscriber and
 * extract publisher connection string. Assume that there are
 * identical entries for physical and logical replication. If there is
 * not, we would fail anyway.
 */

--
Euler Taveira
EDB   https://www.enterprisedb.com/

Re: speed up a logical replica setup

On Wed, Jan 31, 2024 at 9:52 AM Hayato Kuroda (Fujitsu) <
kuroda.hay...@fujitsu.com> wrote:
>
> Dear Euler,
>
> I extracted some review comments which may require many efforts. I hope
this makes them
> easy to review.
>
> 0001: not changed from yours.
> 0002: avoid to use replication connections. Source: comment #3[1]
> 0003: Remove -P option and use primary_conninfo instead. Source: [2]
> 0004: Exit earlier when dry_run is specified. Source: [3]
> 0005: Refactor data structures. Source: [4]
>
> [1]:
https://www.postgresql.org/message-id/TY3PR01MB9889593399165B9A04106741F5662%40TY3PR01MB9889.jpnprd01.prod.outlook.com
> [2]:
https://www.postgresql.org/message-id/TY3PR01MB98897C85700C6DF942D2D0A3F5792%40TY3PR01MB9889.jpnprd01.prod.outlook.com
> [3]:
https://www.postgresql.org/message-id/TY3PR01MB98897C85700C6DF942D2D0A3F5792%40TY3PR01MB9889.jpnprd01.prod.outlook.com
> [4]:
https://www.postgresql.org/message-id/TY3PR01MB9889C362FF76102C88FA1C29F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com
>

Hey folks,

Jumping into this a bit late here... I'm trying a simple
pg_createsubscriber but getting an error:

~/pgsql took 19s
✦ ➜ pg_createsubscriber -d fabrizio -r -D /tmp/replica5434 -S 'host=/tmp
port=5434'
pg_createsubscriber: error: could not create subscription
"pg_createsubscriber_16384_695617" on database "fabrizio": ERROR:  syntax
error at or near "/"
LINE 1: ..._16384_695617 CONNECTION 'user=fabrizio passfile='/home/fabr...
 ^
pg_createsubscriber: error: could not drop replication slot
"pg_createsubscriber_16384_695617" on database "fabrizio":
pg_createsubscriber: error: could not drop replication slot
"pg_subscriber_695617_startpoint" on database "fabrizio": ERROR:
 replication slot "pg_subscriber_695617_startpoint" does not exist

And the LOG contains the following:

~/pgsql took 12s
✦ ➜ cat
/tmp/replica5434/pg_createsubscriber_output.d/server_start_20240131T110318.730.log

2024-01-31 11:03:19.138 -03 [695632] LOG:  starting PostgreSQL 17devel on
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0,
64-bit
2024-01-31 11:03:19.138 -03 [695632] LOG:  listening on IPv6 address "::1",
port 5434
2024-01-31 11:03:19.138 -03 [695632] LOG:  listening on IPv4 address
"127.0.0.1", port 5434
2024-01-31 11:03:19.158 -03 [695632] LOG:  listening on Unix socket
"/tmp/.s.PGSQL.5434"
2024-01-31 11:03:19.179 -03 [695645] LOG:  database system was shut down in
recovery at 2024-01-31 11:03:18 -03
2024-01-31 11:03:19.180 -03 [695645] LOG:  entering standby mode
2024-01-31 11:03:19.192 -03 [695645] LOG:  redo starts at 0/428
2024-01-31 11:03:19.198 -03 [695645] LOG:  consistent recovery state
reached at 0/504DB08
2024-01-31 11:03:19.198 -03 [695645] LOG:  invalid record length at
0/504DB08: expected at least 24, got 0
2024-01-31 11:03:19.198 -03 [695632] LOG:  database system is ready to
accept read-only connections
2024-01-31 11:03:19.215 -03 [695646] LOG:  started streaming WAL from
primary at 0/500 on timeline 1
2024-01-31 11:03:29.587 -03 [695645] LOG:  recovery stopping after WAL
location (LSN) "0/504F260"
2024-01-31 11:03:29.587 -03 [695645] LOG:  redo done at 0/504F260 system
usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 10.39 s
2024-01-31 11:03:29.587 -03 [695645] LOG:  last completed transaction was
at log time 2024-01-31 11:03:18.761544-03
2024-01-31 11:03:29.587 -03 [695646] FATAL:  terminating walreceiver
process due to administrator command
2024-01-31 11:03:29.598 -03 [695645] LOG:  selected new timeline ID: 2
2024-01-31 11:03:29.680 -03 [695645] LOG:  archive recovery complete
2024-01-31 11:03:29.690 -03 [695643] LOG:  checkpoint starting:
end-of-recovery immediate wait
2024-01-31 11:03:29.795 -03 [695643] LOG:  checkpoint complete: wrote 51
buffers (0.3%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.021 s,
sync=0.034 s, total=0.115 s; sync files=17, longest=0.011 s, average=0.002
s; distance=16700 kB, estimate=16700 kB; lsn=0/504F298, redo lsn=0/504F298
2024-01-31 11:03:29.805 -03 [695632] LOG:  database system is ready to
accept connections
2024-01-31 11:03:30.332 -03 [695658] ERROR:  syntax error at or near "/" at
character 90
2024-01-31 11:03:30.332 -03 [695658] STATEMENT:  CREATE SUBSCRIPTION
pg_createsubscriber_16384_695617 CONNECTION 'user=fabrizio
passfile='/home/fabrizio/.pgpass' channel_binding=prefer host=localhost
port=5432 sslmode=prefer sslcompression=0 sslcertmode=allow sslsni=1
ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres
gssdelegation=0 target_session_attrs=any load_balance_hosts=disable
dbname=fabrizio' PUBLICATION pg_createsubscriber_16384 WITH (create_slot =
false, copy_data = false, enabled = false)

Seems we need to escape connection params similar we do in dblink [1]

Regards,

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/dblink/dblink.c;h=19a362526d21dff5d8b1cdc68b15afebe7d40249;hb=HEAD#l2882

-- 
Fabrízio de Royes Mello

Re: Incorrect cost for MergeAppend

2024-01-31 Thread Daniel Westermann (DWE)

Hi,

>Since we have a minor coming up very soon, I think it's not a good idea
>to backpatch right now.  Maybe you can push to master now, and consider
>whether to backpatch later.

>The problem is -- if somebody has an application that gets good plans
>with the current cost model, and you change the cost model and the plans
>become worse, what do they do?  If you change this in a major release,
>this is not an issue because they must test their queries before
>upgrading and if they fail to realize a problem exists then it's their
>fault.  If you change it in a minor release, then those people will be
>very upset that things were changed suddenly, and they may get wary of
>future minor upgrades, which we don't want.

I agree with this, especially as we tell our customers that such changes do not 
happen from one minor release to another.

Regards
Daniel

Re: Parallelize correlated subqueries that execute within each worker

2024-01-31 Thread James Coleman

On Tue, Jan 30, 2024 at 10:34 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > I've finally had a chance to look at this, and I don't believe there's
> > any real failure here, merely drift of how the planner works on master
> > resulting in this query now being eligible for a different plan shape.
>
> > I was a bit wary at first because the changing test query is one I'd
> > previously referenced in [1] as likely exposing the bug I'd fixed
> > where params where being used across worker boundaries. However
> > looking at the diff in the patch at that point (v10) that particular
> > test query formed a different plan shape (there were two gather nodes
> > being created, and params crossing between them).
>
> > But in the current revision of master with the current patch applied
> > that's no longer true: we have a Gather node, and the Subplan using
> > the param is properly under that Gather node, and the param should be
> > being both generated and consumed within the same worker process.
>
> Hmm ... so the question this raises for me is: was that test intended
> to verify behavior of params being passed across workers?  If so,
> haven't you broken the point of the test?  This doesn't mean that
> your code change is wrong; but I think maybe you need to find a way
> to modify that test case so that it still tests what it's meant to.
> This is a common hazard when changing the planner's behavior.

I'd been thinking it was covered by another test I'd added in 0001,
but looking at it again that test doesn't exercise parallel append
(though it does exercise a different case of cross-worker param
usage), so I'll add another test for the parallel append behavior.

Regards,
James Coleman

Re: Incorrect cost for MergeAppend

On 2024-Jan-31, Alexander Kuzmenkov wrote:

> To put it another way, this change enables our usual cost model for
> MergeAppend to work correctly in the presence of filters. I think we
> can consider this model to be reasonably correct, and we don't
> currently have major problems with MergeAppend being chosen instead of
> Sort + Append in cases where it's suboptimal, right? So applying it
> properly in case with filters is not likely to introduce problems.

Since we have a minor coming up very soon, I think it's not a good idea
to backpatch right now.  Maybe you can push to master now, and consider
whether to backpatch later.

The problem is -- if somebody has an application that gets good plans
with the current cost model, and you change the cost model and the plans
become worse, what do they do?  If you change this in a major release,
this is not an issue because they must test their queries before
upgrading and if they fail to realize a problem exists then it's their
fault.  If you change it in a minor release, then those people will be
very upset that things were changed suddenly, and they may get wary of
future minor upgrades, which we don't want.

Plus, they would need to do careful testing before doing the minor
upgrade.

Maybe plans can only go from bad to good and never from good to bad.
But are we 100% certain that that is the case?

People who are **desperate** to get this improvement can perhaps run a
patched version in the meantime.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: Incorrect cost for MergeAppend

2024-01-31 Thread Alexander Kuzmenkov

On Wed, Jan 31, 2024 at 1:33 PM Alexander Kuzmenkov
 wrote:
> I'd be happy to see this backpatched. What kind of regressions are we
> worried about? I'd say partition-wise sort + merge should be faster
> than append + sort for reasonably sized tables. That's basically what
> tuplesort does inside. Moreso, this can enable index scans on
> partitions, which is an even better plan.

To put it another way, this change enables our usual cost model for
MergeAppend to work correctly in the presence of filters. I think we
can consider this model to be reasonably correct, and we don't
currently have major problems with MergeAppend being chosen instead of
Sort + Append in cases where it's suboptimal, right? So applying it
properly in case with filters is not likely to introduce problems.

Re: Emitting JSON to file using COPY TO

On 2024-Jan-23, jian he wrote:

> > +   | FORMAT_LA copy_generic_opt_arg
> > +   {
> > +   $$ = makeDefElem("format", $2, @1);
> > +   }
> > ;
> >
> > I think it's not necessary. "format" option is already handled in
> > copy_generic_opt_elem.
> 
> test it, I found out this part is necessary.
> because a query with WITH like `copy (select 1)  to stdout with
> (format json, force_array false); ` will fail.

Right, because "FORMAT JSON" is turned into FORMAT_LA JSON by parser.c
(see base_yylex there).  I'm not really sure but I think it might be
better to make it "| FORMAT_LA JSON" instead of invoking the whole
copy_generic_opt_arg syntax.  Not because of performance, but just
because it's much clearer what's going on.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-01-31 Thread Bharath Rupireddy

On Sat, Jan 27, 2024 at 1:18 AM Bharath Rupireddy
 wrote:
>
> On Thu, Jan 11, 2024 at 10:48 AM Bharath Rupireddy
>  wrote:
> >
> > Hi,
> >
> > Replication slots in postgres will prevent removal of required
> > resources when there is no connection using them (inactive). This
> > consumes storage because neither required WAL nor required rows from
> > the user tables/system catalogs can be removed by VACUUM as long as
> > they are required by a replication slot. In extreme cases this could
> > cause the transaction ID wraparound.
> >
> > Currently postgres has the ability to invalidate inactive replication
> > slots based on the amount of WAL (set via max_slot_wal_keep_size GUC)
> > that will be needed for the slots in case they become active. However,
> > the wraparound issue isn't effectively covered by
> > max_slot_wal_keep_size - one can't tell postgres to invalidate a
> > replication slot if it is blocking VACUUM. Also, it is often tricky to
> > choose a default value for max_slot_wal_keep_size, because the amount
> > of WAL that gets generated and allocated storage for the database can
> > vary.
> >
> > Therefore, it is often easy for developers to do the following:
> > a) set an XID age (age of slot's xmin or catalog_xmin) of say 1 or 1.5
> > billion, after which the slots get invalidated.
> > b) set a timeout of say 1 or 2 or 3 days, after which the inactive
> > slots get invalidated.
> >
> > To implement (a), postgres needs a new GUC called max_slot_xid_age.
> > The checkpointer then invalidates all the slots whose xmin (the oldest
> > transaction that this slot needs the database to retain) or
> > catalog_xmin (the oldest transaction affecting the system catalogs
> > that this slot needs the database to retain) has reached the age
> > specified by this setting.
> >
> > To implement (b), first postgres needs to track the replication slot
> > metrics like the time at which the slot became inactive (inactive_at
> > timestamptz) and the total number of times the slot became inactive in
> > its lifetime (inactive_count numeric) in ReplicationSlotPersistentData
> > structure. And, then it needs a new timeout GUC called
> > inactive_replication_slot_timeout. Whenever a slot becomes inactive,
> > the current timestamp and inactive count are stored in
> > ReplicationSlotPersistentData structure and persisted to disk. The
> > checkpointer then invalidates all the slots that are lying inactive
> > for about inactive_replication_slot_timeout duration starting from
> > inactive_at.
> >
> > In addition to implementing (b), these two new metrics enable
> > developers to improve their monitoring tools as the metrics are
> > exposed via pg_replication_slots system view. For instance, one can
> > build a monitoring tool that signals when replication slots are lying
> > inactive for a day or so using inactive_at metric, and/or when a
> > replication slot is becoming inactive too frequently using inactive_at
> > metric.
> >
> > I’m attaching the v1 patch set as described below:
> > 0001 - Tracks invalidation_reason in pg_replication_slots. This is
> > needed because slots now have multiple reasons for slot invalidation.
> > 0002 - Tracks inactive replication slot information inactive_at and
> > inactive_timeout.
> > 0003 - Adds inactive_timeout based replication slot invalidation.
> > 0004 - Adds XID based replication slot invalidation.
> >
> > Thoughts?
>
> Needed a rebase due to c393308b. Please find the attached v2 patch set.

Needed a rebase due to commit 776621a (conflict in
src/test/recovery/meson.build for new TAP test file added). Please
find the attached v3 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From ecfb669fa1f4356d75ef9a8ef0560de804cdaf56 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 31 Jan 2024 12:16:33 +
Subject: [PATCH v3 4/4] Add XID based replication slot invalidation

Currently postgres has the ability to invalidate inactive
replication slots based on the amount of WAL (set via
max_slot_wal_keep_size GUC) that will be needed for the slots in
case they become active. However, choosing a default value for
max_slot_wal_keep_size is tricky. Because the amount of WAL a
customer generates, and their allocated storage will vary greatly
in production, making it difficult to pin down a one-size-fits-all
value. It is often easy for developers to set an XID age (age of
slot's xmin or catalog_xmin) of say 1 or 1.5 billion, after which
the slots get invalidated.

To achieve the above, postgres uses replication slot xmin (the
oldest transaction that this slot needs the database to retain) or
catalog_xmin (the oldest transaction affecting the system catalogs
that this slot needs the database to retain), and a new GUC
max_slot_xid_age. The checkpointer then looks at all replication
slots invalidating the slots based on the age set.
---
 doc/src/sgml/config.sgml  | 21 +
 src/backend/

Re: speed up a logical replica setup

On Thu, Jan 18, 2024 at 6:19 AM Peter Eisentraut 
wrote:
>
> Very early in this thread, someone mentioned the name
> pg_create_subscriber, and of course there is pglogical_create_subscriber
> as the historical predecessor.  Something along those lines seems better
> to me.  Maybe there are other ideas.
>

I've mentioned it upthread because of this pet project [1] that is one of
the motivations behind upstream this facility.

[1] https://github.com/fabriziomello/pg_create_subscriber

-- 
Fabrízio de Royes Mello

Re: Switching XLog source from archive to streaming when primary available

2024-01-31 Thread Bharath Rupireddy

On Wed, Jan 3, 2024 at 4:58 PM Bharath Rupireddy
 wrote:
>
> On Thu, Dec 28, 2023 at 5:26 PM Bharath Rupireddy
>  wrote:
> >
> > I took a closer look at v14 and came up with the following changes:
> >
> > 1. Used advance_wal introduced by commit c161ab74f7.
> > 2. Simplified the core logic and new TAP tests.
> > 3. Reworded the comments and docs.
> > 4. Simplified new DEBUG messages.
> >
> > I've attached the v15 patch for further review.
>
> Per a recent commit c538592, FATAL-ized perl warnings in the newly
> added TAP test and attached the v16 patch.

Needed a rebase due to commit 776621a (conflict in
src/test/recovery/meson.build for new TAP test file added). Please
find the attached v17 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 95a6f6c484dcff68f01ff90d9011abff2f15ad89 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 31 Jan 2024 11:59:02 +
Subject: [PATCH v17] Allow standby to switch WAL source from archive to
 streaming

A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc.. All of these can impact the
recovery performance on standby and increase the replication lag
on primary. In addition, the primary keeps accumulating WAL needed
for the standby while the standby reads WAL from archive because
the standby replication slot stays inactive. To avoid these
problems, one can use this parameter to make standby switch to
stream mode sooner.

This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.

Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  47 +++
 doc/src/sgml/high-availability.sgml   |  15 ++-
 src/backend/access/transam/xlogrecovery.c | 115 --
 src/backend/utils/misc/guc_tables.c   |  12 ++
 src/backend/utils/misc/postgresql.conf.sample |   4 +
 src/include/access/xlogrecovery.h |   1 +
 src/test/recovery/meson.build |   1 +
 src/test/recovery/t/041_wal_source_switch.pl  |  93 ++
 8 files changed, 269 insertions(+), 19 deletions(-)
 create mode 100644 src/test/recovery/t/041_wal_source_switch.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 61038472c5..f0e45cf49d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4867,6 +4867,53 @@ ANY num_sync ( 
+  streaming_replication_retry_interval (integer)
+  
+   streaming_replication_retry_interval configuration parameter
+  
+  
+  
+   
+Specifies amount of time after which standby attempts to switch WAL
+source from archive to streaming replication (i.e., getting WAL from
+primary). However, the standby exhausts all the WAL present in pg_wal
+before switching. If the standby fails to switch to stream mode, it
+falls back to archive mode. If this parameter value is specified
+without units, it is taken as milliseconds. Default is
+5min. With a lower value for this parameter, the
+standby makes frequent WAL source switch attempts. To avoid this, it is
+recommended to set a reasonable value. A setting of 0
+disables the feature. When disabled, the standby typically switches to
+stream mode only after receiving WAL from archive finishes (i.e., no
+more WAL left there) or fails for any reason. This parameter can only
+be set in the postgresql.conf file or on the
+server command line.
+   
+   
+
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ streaming_replication_retry_interval intervals. For
+ example, if the parameter is set to 1min and
+ fetching WAL file from archive takes about 2min,
+ then the source switch attempt happens for the next WAL file after
+ current WAL file fetched from archive is fully applied.
+
+   
+   
+Reading WAL from archive may not always be as efficient and fast as
+reading from primary. This can be due to the differences in disk types,
+IO costs, network latencies etc.. All of these can impact the recovery

Re: libpq fails to build with TSAN

2024-01-31 Thread Daniel Gustafsson

> On 31 Jan 2024, at 04:21, Roman Lozko  wrote:
> 
> Hi, so libpq has this line in its Makefile
> https://github.com/postgres/postgres/blob/6ee26c6a4bafabbd22a85f575d2446fd5ec6ad0d/src/interfaces/libpq/Makefile#L116
> which checks that libpq does not use any "exit" functions. With
> ThreadSanitizer it triggers on function `__tsan_func_exit` which is
> used to record returns from functions. Should it be added to exclusion
> list here?

I think it should, the idea of that check is to catch calls to actual exits,
while this is instrumentation which has nothing to do with exit(2).  The
attached diff should be enough to handle this.

--
Daniel Gustafsson



tsan_exit_exclusion.diff
Description: Binary data

Re: Incorrect cost for MergeAppend

2024-01-31 Thread Alexander Kuzmenkov

I'd be happy to see this backpatched. What kind of regressions are we
worried about? I'd say partition-wise sort + merge should be faster
than append + sort for reasonably sized tables. That's basically what
tuplesort does inside. Moreso, this can enable index scans on
partitions, which is an even better plan.

On Wed, Jan 31, 2024 at 7:46 AM Ashutosh Bapat
 wrote:
>
> On Wed, Jan 31, 2024 at 12:12 PM David Rowley  wrote:
> >
> > What is relevant are things like:
> >
> > For:
> > * It's a clear bug and what's happening now is clearly wrong.
> > * inheritance/partitioned table plan changes for the better in minor 
> > versions
> >
> > Against:
> > * Nobody has complained for 13 years, so maybe it's unlikely anyone is
> > suffering too much.
> > * Possibility of inheritance/partitioned table plans changing for the
> > worse in minor versions
> >
>
> That's what I am thinking as well. And the plans that may change for
> the worse are the ones where the costs with and without the patch are
> close.
>
> Just to be clear, the change is for good and should be committed to
> the master. It's the backpatching I am worried about.
>
> --
> Best Wishes,
> Ashutosh Bapat

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-31 Thread Bharath Rupireddy

On Wed, Jan 31, 2024 at 3:01 PM Alvaro Herrera  wrote:
>
> Looking at 0003, where an XXX comment is added about taking a spinlock
> to read LogwrtResult, I suspect the answer is probably not, because it
> is likely to slow down the other uses of LogwrtResult.

We avoided keeping LogwrtResult latest as the current callers for
WALReadFromBuffers() all determine the flush LSN using
GetFlushRecPtr(), see comment #4 from
https://www.postgresql.org/message-id/CALj2ACV%3DC1GZT9XQRm4iN1NV1T%3DhLA_hsGWNx2Y5-G%2BmSwdhNg%40mail.gmail.com.

> But I wonder if
> a better path forward would be to base further work on my older
> uncommitted patch to make LogwrtResult use atomics.  With that, you
> wouldn't have to block others in order to read the value.  I last posted
> that patch in [1] in case you're curious.
>
> [1] https://postgr.es/m/20220728065920.oleu2jzsatchakfj@alvherre.pgsql
>
> The reason I abandoned that patch is that the performance problem that I
> was fixing no longer existed -- it was fixed in a different way.

Nice. I'll respond in that thread.  FWIW, there's been a recent
attempt at turning unloggedLSN to 64-bit atomic -
https://commitfest.postgresql.org/46/4330/ and that might need
pg_atomic_monotonic_advance_u64. I guess we would have to bring your
patch and the unloggedLSN into a single thread to have a better
discussion.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: POC, WIP: OR-clause support for indexes

2024-01-31 Thread Alena Rybakina


Hi, thank you for your review and interest in this subject.

On 31.01.2024 13:15, jian he wrote:

On Wed, Jan 31, 2024 at 10:55 AM jian he  wrote:

based on my understanding of
https://www.postgresql.org/docs/current/xoper-optimization.html#XOPER-COMMUTATOR
I think you need move commutator check right after the `if
(get_op_rettype(opno) != BOOLOID)` branch


I was wrong about this part. sorry for the noise.


I have made some changes (attachment).
* if the operator expression left or right side type category is
{array | domain | composite}, then don't do the transformation.
(i am not 10% sure with composite)


To be honest, I'm not sure about this check, because we check the type 
of variable there:


if (!IsA(orqual, OpExpr))
    {
    or_list = lappend(or_list, orqual);
    continue;
    }
And below:
if (IsA(leftop, Const))
    {
        opno = get_commutator(opno);

        if (!OidIsValid(opno))
        {
            /* Commuter doesn't exist, we can't reverse the order */
            or_list = lappend(or_list, orqual);
            continue;
        }

        nconst_expr = get_rightop(orqual);
        const_expr = get_leftop(orqual);
    }
    else if (IsA(rightop, Const))
    {
        const_expr = get_rightop(orqual);
        nconst_expr = get_leftop(orqual);
    }
    else
    {
        or_list = lappend(or_list, orqual);
        continue;
    }

Isn't that enough?

Besides, some of examples (with ARRAY) works fine:

postgres=# CREATE TABLE sal_emp (
    pay_by_quarter  integer[],
    pay_by_quater1 integer[]
);
CREATE TABLE
postgres=# INSERT INTO sal_emp
    VALUES (
    '{1, 1, 1, 1}',
    '{1,2,3,4}');
INSERT 0 1
postgres=# select * from sal_emp where pay_by_quarter[1] = 1 or 
pay_by_quarter[1]=2;

  pay_by_quarter   | pay_by_quater1
---+
 {1,1,1,1} | {1,2,3,4}
(1 row)

postgres=# explain select * from sal_emp where pay_by_quarter[1] = 1 
or pay_by_quarter[1]=2;

  QUERY PLAN
--
 Seq Scan on sal_emp  (cost=0.00..21.00 rows=9 width=64)
   Filter: (pay_by_quarter[1] = ANY ('{1,2}'::integer[]))
(2 rows)


* if the left side of the operator expression node contains volatile
functions, then don't do the transformation.


I'm also not sure about the volatility check function, because we 
perform such a conversion at the parsing stage, and at this stage we 
don't have a RelOptInfo variable and especially a RestictInfo such as 
PathTarget.


Speaking of NextValueExpr, I couldn't find any examples where the 
current patch wouldn't work. I wrote one of them below:


postgres=# create table foo (f1 int, f2 int generated always as identity);
CREATE TABLE
postgres=# insert into foo values(1);
INSERT 0 1

postgres=# explain verbose update foo set f1 = 2 where f1=1 or f1=2 ;
    QUERY PLAN
---
 Update on public.foo  (cost=0.00..38.25 rows=0 width=0)
   ->  Seq Scan on public.foo  (cost=0.00..38.25 rows=23 width=10)
 Output: 2, ctid
 Filter: (foo.f1 = ANY ('{1,2}'::integer[]))
(4 rows)

Maybe I missed something. Do you have any examples?


* some other minor  cosmetic changes.

Thank you, I agree with them.

--
Regards,
Alena Rybakina
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company

Re: CI and test improvements

On 2024-Jan-31, vignesh C wrote:

> Are we planning to do anything more on this? I was not sure if we
> should move this to next commitfest or return it.

Well, the patches don't apply anymore since .cirrus.tasks.yml has been
created.  However, I'm sure we still want [some of] the improvements
to the tests in [1].  I can volunteer to rebase the patches in time for the
March commitfest, if Justin is not available to do so.  If you can
please move it forward to the March cf and set it WoA, I'd appreciate
that.

Thanks

[1] https://postgr.es/m/ZA/+mkdx9zwfh...@telsasoft.com

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)

Re: Improve the connection failure error messages

2024-01-31 Thread Nisha Moond

> AFAIK some recent commits patches (e,g [1]  for the "slot sync"
> development) have created some more cases of "could not connect..."
> messages. So, you might need to enhance your patch to deal with any
> new ones in the latest HEAD.
>
> ==
> [1]
> https://github.com/postgres/postgres/commit/776621a5e4796fa214b6b29a7ca134f6c138572a
>
> Thank you for the update.
The v3 patch has the changes needed as per the latest HEAD.

--
Thanks,
Nisha


v3-0001-Improve-the-connection-failure-error-messages.patch
Description: Binary data

Re: Synchronizing slots from primary to standby

2024-01-31 Thread Amit Kapila

On Wed, Jan 31, 2024 at 2:02 PM Masahiko Sawada  wrote:
>
> Thank you for updating the patches. As for the slotsync worker patch,
> is there any reason why 0001, 0002, and 0004 patches are still
> separated?
>

No specific reason, it could be easier to review those parts.

>
> Beside, here are some comments on v74 0001, 0002, and 0004 patches:
>
> ---
> +static char *
> +wait_for_valid_params_and_get_dbname(void)
> +{
> +   char   *dbname;
> +   int rc;
> +
> +   /* Sanity check. */
> +   Assert(enable_syncslot);
> +
> +   for (;;)
> +   {
> +   if (validate_parameters_and_get_dbname(&dbname))
> +   break;
> +   ereport(LOG, errmsg("skipping slot synchronization"));
> +
> +   ProcessSlotSyncInterrupts(NULL);
>
> When reading this function, I expected that the slotsync worker would
> resume working once the parameters became valid, but it was not
> correct. For example, if I changed hot_standby_feedback from off to
> on, the slotsync worker reads the config file, exits, and then
> restarts. Given that the slotsync worker ends up exiting on parameter
> changes anyway, why do we want to have it wait for parameters to
> become valid?
>

Right, the reason for waiting is to avoid repeated re-start of
slotsync worker if the required parameter is not changed. To follow
that, I think we should simply continue when the required parameter is
changed and is valid. But, I think during actual slotsync, if
connection_info is changed then there is no option but to restart.

>
> ---
> +bool
> +SlotSyncWorkerCanRestart(void)
> +{
> +#define SLOTSYNC_RESTART_INTERVAL_SEC 10
> +
>
> IIUC depending on how busy the postmaster is and the timing, the user
> could wait for 1 min to re-launch the slotsync worker. But I think the
> user might want to re-launch the slotsync worker more quickly for
> example when the slotsync worker restarts due to parameter changes.
> IIUC SloSyncWorkerCanRestart() doesn't consider the fact that the
> slotsync worker previously exited with 0 or 1.
>

Considering my previous where we don't want to restart for a required
parameter change, isn't it better to avoid repeated restart (say when
the user gave an invalid dbname)? BTW, I think this restart interval
is added based on your previous complaint [1].

>
> ---
> When I dropped a database on the primary that has a failover slot, I
> got the following logs on the standby:
>
> 2024-01-31 17:25:21.750 JST [1103933] FATAL:  replication slot "s" is
> active for PID 1103935
> 2024-01-31 17:25:21.750 JST [1103933] CONTEXT:  WAL redo at 0/3020D20
> for Database/DROP: dir 1663/16384
> 2024-01-31 17:25:21.751 JST [1103930] LOG:  startup process (PID
> 1103933) exited with exit code 1
>
> It seems that because the slotsync worker created the slot on the
> standby, the slot's active_pid is still valid.
>

But we release the slot after sync. And we do take a shared lock on
the database to make the startup process wait for slotsync. There is
one gap which is that we don't reset active_pid for temp slots in
ReplicationSlotRelease(), so for temp slots such an error can occur
but OTOH, we immediately make the slot persistent after sync. As per
my understanding, it is only possible to get this error if the initial
sync doesn't happen and the slot remains temporary. Is that your case?
How did reproduce this?

 That is why the startup
> process could not drop the slot.
>

[1] - 
https://www.postgresql.org/message-id/CAD21AoApGoTZu7D_7%3DbVYQqKnj%2BPZ2Rz%2Bnc8Ky1HPQMS_XL6%2BA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.

Re: Extending SMgrRelation lifetimes

2024-01-31 Thread Heikki Linnakangas


On 31/01/2024 10:54, Thomas Munro wrote:

On Wed, Nov 29, 2023 at 1:42 PM Heikki Linnakangas  wrote:

I spent some more time digging into this, experimenting with different
approaches. Came up with pretty significant changes; see below:


Hi Heikki,

I think this approach is good.  As I wrote in the first email, I had
briefly considered reference counting, but at the time I figured there
wasn't much point if it's only ever going to be 0 or 1, so I was
trying to find the smallest change.  But as you explained, there is
already an interesting case where it goes to 2, and modelling it that
way removes a weird hack, so it's a net improvement over the unusual
'owner' concept.  +1 for your version.  Are there any further tidying
or other improvements you want to make?


Ok, no, this is good to go then. I'll rebase, fix the typos, run the 
regression tests again, and push this shortly. Thanks!


--
Heikki Linnakangas
Neon (https://neon.tech)

Re: Possibility to disable `ALTER SYSTEM`

2024-01-31 Thread Gabriele Bartolini

Hi there,

I very much like the idea of a file in the data directory that also
controls the copy operations.

Just wanted to highlight though that in our operator we have already
applied the read-only postgresql.auto.conf trick to disable the system (see
https://cloudnative-pg.io/documentation/current/postgresql_conf/#enabling-alter-system).
However, having that file read-only triggered an issue when using pg_rewind
to resync a former primary, as pg_rewind immediately bails out when a
read-only file is encountered in the PGDATA (see
https://github.com/cloudnative-pg/cloudnative-pg/issues/3698).

We might keep this in mind if we go down the path of the separate file.

Thanks,
Gabriele

On Wed, 31 Jan 2024 at 08:43, Peter Eisentraut  wrote:

> On 31.01.24 06:28, Tom Lane wrote:
> >> The idea of adding a file to the data directory appeals to me.
> >>
> >> optional_runtime_features.conf
> >> alter_system=enabled
> >> copy_from_program=enabled
> >> copy_to_program=disabled
> > ... so, exactly what keeps an uncooperative superuser from
> > overwriting that file?
>
> The point of this feature would be to keep the honest people honest.
>
> The first thing I did when ALTER SYSTEM came out however many years ago
> was to install Nagios checks to warn when postgresql.auto.conf exists.
> Because the thing is an attractive nuisance, especially when you want to
> do centralized configuration control.  Of course you can bypass it using
> COPY PROGRAM etc., but then you *know* that you are *bypassing*
> something.  If you just see ALTER SYSTEM, you'll think, "that is
> obviously the appropriate tool", and there is no generally accepted way
> to communicate that, in particular environment, it might not be.
>
>

-- 
Gabriele Bartolini
Vice President, Cloud Native at EDB
enterprisedb.com

Re: POC, WIP: OR-clause support for indexes

2024-01-31 Thread jian he

On Wed, Jan 31, 2024 at 10:55 AM jian he  wrote:
>
> based on my understanding of
> https://www.postgresql.org/docs/current/xoper-optimization.html#XOPER-COMMUTATOR
> I think you need move commutator check right after the `if
> (get_op_rettype(opno) != BOOLOID)` branch
>
I was wrong about this part. sorry for the noise.

I have made some changes (attachment).
* if the operator expression left or right side type category is
{array | domain | composite}, then don't do the transformation.
(i am not 10% sure with composite)

* if the left side of the operator expression node contains volatile
functions, then don't do the transformation.

* some other minor  cosmetic changes.

v14_comments.no-cfbot
Description: Binary data

Re: why there is not VACUUM FULL CONCURRENTLY?