Re: Simplify some logical replication worker type checking
On Tue, Aug 1, 2023 at 12:59 PM Zhijie Hou (Fujitsu) wrote:
>
> About 2,3,4, it seems you should use "if (am_leader_apply_worker())" instead
> of "if (!am_leader_apply_worker())" because only the leader apply worker
> should invoke this function.

Hi Hou-san, Thanks for your review!

Oops. Of course, you are right. My cut/paste typo was mostly confined to the post, but there was one instance also in the patch. Fixed in v2.

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment: v2-0001-Simplify-worker-type-checks.patch
Re: Faster "SET search_path"
On Sat, 2023-07-29 at 12:44 -0400, Isaac Morland wrote:
> Essentially, "just" observe efficiently (somehow) that no change is
> needed, and skip changing it?

I gave this a try and it speeds things up some more.

There might be a surprise factor with an optimization like that, though. If someone becomes accustomed to the function running fast, then changing the search_path in the caller could slow things down a lot, and it would be hard for the user to understand what happened.

Also, there are a few implementation complexities, because I think we still need to trigger the same invalidation that happens in assign_search_path().

Regards,
Jeff Davis
Re: Incorrect handling of OOM in WAL replay leading to data loss
On Tue, Aug 01, 2023 at 01:51:13PM +0900, Kyotaro Horiguchi wrote:
> I believe a database server is not supposed to be executed under such
> a memory-constrained environment.

I don't really follow this argument. The backend and the frontends are reliable on OOM, where we generate ERRORs or even FATALs depending on the code path involved. A memory-bounded environment is something that can easily happen if one's not careful enough with the sizing of the instance. For example, this error can be triggered on a standby with read-only queries that put pressure on the host's memory.

> One issue with changing that behavior is that there's not a simple way
> to detect a broken record before loading it into memory. We might be
> able to implement a fallback mechanism, for example one that loads the
> record into an already-allocated buffer (which is smaller than the
> specified length) just to verify whether it's corrupted. However, I
> question whether it's worth the additional complexity. And I'm not
> sure what to do if the first allocation fails.

Perhaps we could rely more on a fallback memory, especially if it is possible to use that for the header validation. That seems like a separate thing, still.
--
Michael
Fix compilation warnings when CFLAGS -Og is specified
Dear hackers,

# Background

Based on [1], I configured and built with the following options. (I used the Meson build system, but it can be reproduced with Autoconf/Make.)

```
$ meson setup -Dcassert=true -Ddebug=true -Dc_args=-Og ../builder
$ cd ../builder
$ ninja
```

My gcc version is 4.8.5, and ninja is 1.10.2.

# Problem

I got warnings while compiling. I found that these were reported when -Og was specified. All of the raised warnings said that a variable might be used without initialization.

```
[122/1910] Compiling C object src/backend/postgres_lib.a.p/libpq_be-fsstubs.c.o
../postgres/src/backend/libpq/be-fsstubs.c: In function ‘be_lo_export’:
../postgres/src/backend/libpq/be-fsstubs.c:537:24: warning: ‘fd’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  if (CloseTransientFile(fd) != 0)
[390/1910] Compiling C object src/backend/postgres_lib.a.p/commands_trigger.c.o
In file included from ../postgres/src/backend/commands/trigger.c:14:0:
../postgres/src/backend/commands/trigger.c: In function ‘ExecCallTriggerFunc’:
../postgres/src/include/postgres.h:314:2: warning: ‘result’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  return (Pointer) X;
../postgres/src/backend/commands/trigger.c:2316:9: note: ‘result’ was declared here
  Datum result;
[1023/1910] Compiling C object src/backend/postgres_lib.a.p/utils_adt_xml.c.o
../postgres/src/backend/utils/adt/xml.c: In function ‘xml_pstrdup_and_free’:
../postgres/src/backend/utils/adt/xml.c:1359:2: warning: ‘result’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  return result;
[1600/1910] Compiling C object contrib/pg_stat_statements/pg_stat_statements.so.p/pg_stat_statements.c.o
../postgres/contrib/pg_stat_statements/pg_stat_statements.c: In function ‘pgss_planner’:
../postgres/contrib/pg_stat_statements/pg_stat_statements.c:960:2: warning: ‘result’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  return result;
[1651/1910] Compiling C object contrib/postgres_fdw/postgres_fdw.so.p/postgres_fdw.c.o
../postgres/contrib/postgres_fdw/postgres_fdw.c: In function ‘postgresAcquireSampleRowsFunc’:
../postgres/contrib/postgres_fdw/postgres_fdw.c:5346:14: warning: ‘reltuples’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  *totalrows = reltuples;
[1910/1910] Linking target src/interfaces/ecpg/test/thread/alloc
```

# Solution

PSA the patch to keep the compiler quiet. IIUC my compiler considered that assignments inside PG_TRY() might be skipped. I'm not sure whether it is a real problem, but the workarounds are harmless. Or did I miss some reason why the warnings above can be ignored?

[1]: https://wiki.postgresql.org/wiki/Developer_FAQ#:~:text=or%20MSVC%20tracepoints.-,What%20debugging%20features%20are%20available%3F,-Compile%2Dtime

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment: keep_quiet.patch
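The warning pattern above can be reproduced standalone. PG_TRY() wraps sigsetjmp(), and gcc concludes that an assignment placed between the setjmp and the later use of a variable might be skipped by a longjmp, so the variable "may be used uninitialized". This is a minimal sketch (plain setjmp, invented function name) showing the shape of the code and the harmless workaround of initializing the variable up front:

```c
#include <setjmp.h>

/*
 * Standalone illustration of the PG_TRY() pattern (which wraps sigsetjmp):
 * gcc's -Og -Wmaybe-uninitialized can decide the assignment between
 * setjmp() and the use of 'result' might be skipped by a longjmp, so
 * initializing the variable up front is the harmless workaround.
 */
static jmp_buf env;

int
guarded_compute(int fail)
{
    int result = 0;     /* initialized: silences -Wmaybe-uninitialized */

    if (setjmp(env) == 0)
    {
        if (fail)
            longjmp(env, 1);    /* stands in for an elog(ERROR) jump */
        result = 42;            /* the assignment gcc thinks may be skipped */
    }
    else
    {
        /* error path: fall through with the default value */
    }
    return result;
}
```

Note that 'result' is only modified on the path that does not longjmp, so its value is well defined on both paths even without volatile.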
Re: Incorrect handling of OOM in WAL replay leading to data loss
At Tue, 1 Aug 2023 12:43:21 +0900, Michael Paquier wrote in
> A colleague, Ethan Mertz (in CC), has discovered that we don't handle
> correctly WAL records that are failing because of an OOM when
> allocating their required space. In the case of Ethan, we have bumped
> on the failure after an allocation failure on XLogReadRecordAlloc():
> "out of memory while trying to decode a record of length"

I believe a database server is not supposed to be executed under such a memory-constrained environment.

> In crash recovery, any records after the OOM would not be replayed.
> At quick glance, it seems to me that this can also impact standbys,
> where recovery could stop earlier than it should once a consistent
> point has been reached.

Actually, the code is assuming that an OOM happens solely due to a broken record length field. I believe we put that assumption in intentionally.

> A patch is registered in the commit fest to improve the error
> detection handling, but as far as I can see it fails to handle the OOM
> case and replaces ReadRecord() to use a WARNING in the redo loop:
> https://www.postgresql.org/message-id/20200228.160100.2210969269596489579.horikyota.ntt%40gmail.com

It doesn't change behavior except for the case where the last record is followed by zeroed trailing bytes.

> On top of my mind, any solution I can think of needs to add more
> information to XLogReaderState, where we'd track the type of
> error that happened close to errormsg_buf, which is where these errors
> are tracked, but any of that cannot be backpatched, unfortunately.

One issue with changing that behavior is that there's not a simple way to detect a broken record before loading it into memory. We might be able to implement a fallback mechanism, for example one that loads the record into an already-allocated buffer (which is smaller than the specified length) just to verify whether it's corrupted. However, I question whether it's worth the additional complexity.
And I'm not sure what to do if the first allocation fails.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
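The fallback idea discussed above can be sketched standalone. All names here are invented for illustration (none of this is the real xlogreader code): when the full-size allocation fails, the fixed-size header is copied into a small preallocated buffer and sanity-checked, so that a bogus length field can be reported as corruption while an allocation failure for a sane length is reported as a genuine OOM.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch of the fallback discussed above: distinguish a
 * corrupted length field from a real OOM by validating the fixed-size
 * header in a small static buffer.  Toy types, not the xlogreader API.
 */
typedef struct RecordHeader
{
    uint32_t    tot_len;    /* claimed total record length */
    uint32_t    crc;        /* header checksum, simplified away here */
} RecordHeader;

#define MAX_SANE_RECORD_LEN (1024 * 1024)

static char fallback_buf[sizeof(RecordHeader)];

typedef enum { READ_OK, READ_OOM, READ_CORRUPT } ReadResult;

ReadResult
read_record(const char *raw, int simulate_oom)
{
    RecordHeader hdr;

    memcpy(&hdr, raw, sizeof(hdr));
    if (hdr.tot_len < sizeof(hdr) || hdr.tot_len > MAX_SANE_RECORD_LEN)
        return READ_CORRUPT;    /* bogus length: report as corruption */

    char *buf = simulate_oom ? NULL : malloc(hdr.tot_len);
    if (buf == NULL)
    {
        /* Sane length but no memory: re-check the header in the small */
        /* preallocated buffer, then surface this as a real OOM.       */
        memcpy(fallback_buf, raw, sizeof(hdr));
        return READ_OOM;
    }
    free(buf);
    return READ_OK;
}
```

The open question from the thread remains visible in the sketch: the fallback buffer only helps with header validation, not with the record body.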
Extract numeric field in JSONB more effectively
Hi:

Currently, if we want to extract a numeric field from jsonb, we need to use the following expression: cast (a->>'a' as numeric). It turns the numeric into text first and then turns the text back into numeric; see jsonb_object_field_text and JsonbValueAsText. However, the binary format of numeric in JSONB is compatible with the numeric in SQL, so I think we can have an operator to extract the numeric directly. If the value of a given field is not of numeric data type, an error will be raised; this can be documented.

In this patch, I added a new operator for this purpose. Here is the performance gain because of this:

create table tb (a jsonb);
insert into tb select '{"a": 1}'::jsonb from generate_series(1, 10)i;

current method:
select count(*) from tb where cast (a->>'a' as numeric) = 2;
167ms.

new method:
select count(*) from tb where a@->'a' = 2;
65ms.

Is this the right way to go? Testcase, document and catalog version are updated.

--
Best Regards
Andy Fan

Attachment: v1-0001-Add-jsonb-operator-to-return-a-numeric-directly.patch
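The cost being avoided can be modeled standalone. This toy sketch (invented names, a double standing in for numeric) contrasts the two paths: the "old" path serializes the binary value to text and re-parses it, which is what cast(a->>'a' as numeric) does; the "new" path returns the binary value directly, which is what the proposed @-> operator does for jsonb.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Toy model of the point above, with invented names: a container that
 * stores a numeric field in binary form.  The old path round-trips the
 * value through text; the new path reads the binary value directly.
 */
typedef struct { const char *key; double num; } ToyJsonb;

/* old path: numeric -> text -> numeric */
static double
field_via_text(const ToyJsonb *j)
{
    char buf[64];

    snprintf(buf, sizeof(buf), "%.17g", j->num);    /* the ->> step */
    return strtod(buf, NULL);                       /* the cast step */
}

/* new path: read the binary value directly, no serialization */
static double
field_direct(const ToyJsonb *j)
{
    return j->num;
}
```

Both paths yield the same value; the direct path simply skips the format/parse work, which is where the 167ms-vs-65ms difference in the benchmark above comes from.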
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
Hi,

Thanks for your feedback. We have decided to add a role for the extension to solve that problem. Concerning the pg_unwarm table idea, I think we can create a new function to do that, but I think a general user would not need to clear a whole table from the buffercache. We can use bufferid and WHERE clauses to do the same if a superuser/(specific role) requests it.

Thanks.

On Sat, 29 Jul 2023 at 02:55, Cary Huang wrote:
>
> Hello
>
> I had a look at the patch and tested it on CI bot; it compiles and tests fine
> with both autoconf and meson. I noticed that the v2 patch contains the v1 patch
> file as well. Not sure if intended, but probably put there by accident.
>
> > I don't think "invalidating" is the right terminology. Note that we already
> > have InvalidateBuffer() - but it's something we can't allow users to do, as it
> > throws away dirty buffer contents (it's used for things like dropping a
> > table).
> >
> > How about using "discarding" for this functionality?
>
> I think "invalidating" is the right terminology here; it is exactly what the
> feature is doing. It tries to invalidate a buffer ID by calling the
> InvalidateBuffer() routine inside the buffer manager, and calls FlushBuffer()
> before invalidating if the buffer is marked dirty.
>
> The problem here is that InvalidateBuffer() could be dangerous because it
> allows a user to invalidate buffers that may have data in other tables not
> owned by the current user.
>
> I think it all comes down to the purpose of this feature. Based on the
> description in this email thread, I feel like this feature should be
> categorized as a developer-only feature, to be used by PG developers to
> experiment and observe some development work by invalidating one or more
> specific buffers. If this is the case, it may be helpful to add a
> "DEVELOPER_OPTIONS" GUC, which allows or disallows TryInvalidateBuffer()
> from running, or returns an error if the user does not have this developer
> option enabled.
>
> If the purpose of this feature is for general users, then it would make sense
> to have something like pg_unwarm (exactly the opposite of pg_prewarm) that takes
> a table name (instead of a buffer ID) and drops all buffers associated with that
> table. There will be permission checks as well, so a user cannot
> pg_unwarm a table owned by someone else. A user in this case won't be able to
> invalidate a particular buffer, but he/she should not have to as a regular
> user anyway.
>
> thanks!
>
> Cary Huang
> -
> HighGo Software Inc. (Canada)
> cary.hu...@highgo.ca
> www.highgo.ca
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Fri, Jul 28, 2023 at 5:22 PM Peter Smith wrote:
>
> Hi Melih,
>
> BACKGROUND
> --
>
> We wanted to compare performance for the 2 different reuse-worker
> designs, when the apply worker is already busy handling other
> replications, and then simultaneously the test table tablesyncs are
> occurring.
>
> To test this scenario, some test scripts were written (described
> below). For comparisons, the scripts were then run using a build of
> HEAD; design #1 (v21); design #2 (0718).
>
> HOW THE TEST WORKS
> --
>
> Overview:
> 1. The apply worker is made to subscribe to a 'busy_tbl'.
> 2. After the SUBSCRIPTION is created, the publisher side then loops
> (forever) doing INSERTs into that busy_tbl.
> 3. While the apply worker is now busy, the subscriber does an ALTER
> SUBSCRIPTION REFRESH PUBLICATION to subscribe to all the other test
> tables.
> 4. We time how long it takes for all tablesyncs to complete.
> 5. Repeat the above for different numbers of empty tables (10, 100, 1000,
> 2000) and different numbers of sync workers (2, 4, 8, 16).
>
> Scripts
> ---
>
> (PSA 4 scripts to implement this logic)
>
> testrun script
> - this does common setup (do_one_test_setup) and then the pub/sub
> scripts (do_one_test_PUB and do_one_test_SUB -- see below) are run in
> parallel
> - repeat 10 times
>
> do_one_test_setup script
> - init and start instances
> - ipc setup tables and procedures
>
> do_one_test_PUB script
> - ipc setup pub/sub
> - table setup
> - publishes the "busy_tbl", but then waits for the subscriber to
> subscribe to only this one
> - alters the publication to include all other tables (so the subscriber
> will see these only after the ALTER SUBSCRIPTION REFRESH PUBLICATION)
> - enters a busy INSERT loop until it is informed by the subscriber that
> the test is finished
>
> do_one_test_SUB script
> - ipc setup pub/sub
> - table setup
> - subscribes only to "busy_tbl", then informs the publisher when that
> is done (this will cause the publisher to commence the stay-busy loop)
> - after it knows the publishing busy loop has started it does
> ALTER SUBSCRIPTION REFRESH PUBLICATION
> - waits until all the tablesyncs are ready <=== This is the part that
> is timed for the test RESULT
>
> PROBLEM
> ---
>
> Looking at the output files (e.g. *.dat_PUB and *.dat_SUB) they seem
> to confirm the tests are working how we wanted.
>
> Unfortunately, there is some slot problem for the patched builds (both
> designs #1 and #2). e.g. Search "ERROR" in the *.log files and see
> many slot-related errors.
>
> Please note - running these same scripts with a HEAD build gave no such
> errors. So it appears to be a patch problem.
>

Hi,

FYI, here is some more information about the ERRORs seen. The patches were re-tested -- applied in stages (and also against the different scripts) to identify where the problem was introduced. Below are the observations:

~~~

Using the original test scripts:

1. Using only patch v21-0001 - no errors
2. Using patch v21-0001+0002 - no errors
3. Using patch v21-0001+0002+0003 - no errors

~~~

Using the "busy loop" test scripts for long transactions:

1. Using only patch v21-0001 - no errors

2. Using patch v21-0001+0002 - gives errors for the "no copy in progress" issue
e.g. ERROR: could not send data to WAL stream: no COPY in progress

3. Using patch v21-0001+0002+0003 - gives the same "no copy in progress" errors as above
e.g. ERROR: could not send data to WAL stream: no COPY in progress
- and also gives slot consistency point errors
e.g. ERROR: could not create replication slot "pg_16700_sync_16514_7261998170966054867": ERROR: could not find logical decoding starting point
e.g. LOG: could not drop replication slot "pg_16700_sync_16454_7261998170966054867" on publisher: ERROR: replication slot "pg_16700_sync_16454_7261998170966054867" does not exist

--
Kind Regards,
Peter Smith.
Fujitsu Australia
Incorrect handling of OOM in WAL replay leading to data loss
Hi all,

A colleague, Ethan Mertz (in CC), has discovered that we don't handle correctly WAL records that are failing because of an OOM when allocating their required space. In the case of Ethan, we have bumped on the failure after an allocation failure on XLogReadRecordAlloc():
"out of memory while trying to decode a record of length"

As far as I can see, PerformWalRecovery() uses LOG as elevel for its private callback in the xlogreader when going through ReadRecord(), which leads to a failure being reported, but recovery considers that the failure is the end of WAL and decides to abruptly end recovery, leading to some data loss.

In crash recovery, any records after the OOM would not be replayed. At quick glance, it seems to me that this can also impact standbys, where recovery could stop earlier than it should once a consistent point has been reached.

Attached is a patch that can be applied on HEAD to inject an error; then just run the attached script xlogreader_oom.bash, or something similar, to see the failure in the logs:
LOG:  redo starts at 0/1913CD0
LOG:  out of memory while trying to decode a record of length 57
LOG:  redo done at 0/1917358 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

It also looks like recovery_prefetch may mitigate the issue a bit if we do a read in non-blocking mode, but that's not a strong guarantee either, especially if the host is under memory pressure.

A patch is registered in the commit fest to improve the error detection handling, but as far as I can see it fails to handle the OOM case and replaces ReadRecord() to use a WARNING in the redo loop:
https://www.postgresql.org/message-id/20200228.160100.2210969269596489579.horikyota.ntt%40gmail.com

On top of my mind, any solution I can think of needs to add more information to XLogReaderState, where we'd track the type of error that happened close to errormsg_buf, which is where these errors are tracked, but any of that cannot be backpatched, unfortunately.
Comments?
--
Michael

#!/bin/bash
killall -9 postgres

# Setup Postgres
rm -rf /tmp/data
initdb -D /tmp/data
cat <<EOL >> /tmp/data/postgresql.conf
logging_collector = on
log_min_messages = debug1
log_min_error_statement = debug1
#log_line_prefix = '%m [%p]: [%l-1] db=%d,user=%u,app=%a,client=%h '
recovery_prefetch = off
EOL
pg_ctl start -D /tmp/data
createdb $USER
psql= 100)
+ {
+ trigger_oom = true;
+
+ /* Reset counter, to not fail when shutting down WAL */
+ counter = 0;
+ }
+ }
+ }
+#endif
+
+ if (decoded == NULL || trigger_oom)
 {
 /*
 * There is no space in the decode buffer. The caller should help
--
2.40.1
Re: Support to define custom wait events for extensions
Hi,

On 2023-08-01 12:14:49 +0900, Michael Paquier wrote:
> On Tue, Aug 01, 2023 at 11:51:35AM +0900, Masahiro Ikeda wrote:
> > Thanks for committing the main patch.
> >
> > In my understanding, the rest of the work is
> > * to support WaitEventExtensionMultiple()
> > * to replace WAIT_EVENT_EXTENSION with custom wait events
> >
> > Is someone already working on these? If not, I'll consider
> > how to realize them.
>
> Note that postgres_fdw and dblink use WAIT_EVENT_EXTENSION, but have
> no dependency on shared_preload_libraries. Perhaps these could just
> use a dynamic handling, but that deserves a separate discussion because
> of the fact that they'd need shared memory without being able to
> request it. autoprewarm.c is much simpler.

This is why the scheme as implemented doesn't really make sense to me. It'd be much easier to use if we had a shared hashtable with the identifiers than what's been merged now. In plenty of cases it's not realistic for an extension library to run in each backend, while still needing to wait for things.

Greetings,

Andres Freund
Re: Support to define custom wait events for extensions
On Tue, Aug 01, 2023 at 11:51:35AM +0900, Masahiro Ikeda wrote:
> Thanks for committing the main patch.
>
> In my understanding, the rest of the work is
> * to support WaitEventExtensionMultiple()
> * to replace WAIT_EVENT_EXTENSION with custom wait events
>
> Is someone already working on these? If not, I'll consider
> how to realize them.

Note that postgres_fdw and dblink use WAIT_EVENT_EXTENSION, but have no dependency on shared_preload_libraries. Perhaps these could just use a dynamic handling, but that deserves a separate discussion because of the fact that they'd need shared memory without being able to request it. autoprewarm.c is much simpler.
--
Michael
Re: logical decoding and replication of sequences, take 2
On Mon, Jul 31, 2023 at 5:04 PM Tomas Vondra wrote:
>
> On 7/31/23 11:25, Amit Kapila wrote:
> > On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra wrote:
> >>
> >> On 7/28/23 14:44, Ashutosh Bapat wrote:
> >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra wrote:
>
> Anyway, I was thinking about this a bit more, and it seems it's not as
> difficult to use the page LSN to ensure sequences don't go backwards.
> The 0005 change does that, by:
>
> 1) adding pg_sequence_state, that returns both the sequence state and
> the page LSN
>
> 2) copy_sequence returns the page LSN
>
> 3) tablesync then sets this LSN as origin_startpos (which for tables is
> just the LSN of the replication slot)
>
> AFAICS this makes it work - we start decoding at the page LSN, so that
> we skip the increments that could lead to the sequence going backwards.
>
> >>> I like this design very much. It makes things simpler than complex.
> >>> Thanks for doing this.
> >>
> >> I agree it seems simpler. It'd be good to try testing / reviewing it a
> >> bit more, so that it doesn't misbehave in some way.
> >
> > Yeah, I also think this needs a review. This is a sort of new concept
> > where we don't use the LSN of the slot (for cases where copy returned
> > a larger value of LSN) or a full_snapshot created corresponding to the
> > sync slot by the walsender. For the case of the table, we build a full
> > snapshot because we use that for copying the table, but why do we need
> > to build that for copying the sequence, especially when we directly
> > copy it from the sequence relation without caring for any snapshot?
>
> We need the slot to decode/apply changes during catchup. The main
> subscription may get ahead, and we need to ensure the WAL is not
> discarded or something like that. This applies even if the initial sync
> step does not use the slot/snapshot directly.

AFAIK, none of these needs a full_snapshot (see usage of SnapBuild->building_full_snapshot).
The full_snapshot tracks both catalog and non-catalog xacts in the snapshot, where we require tracking the non-catalog ones because we want to copy the table using that snapshot. It is relatively expensive to build a full snapshot, and we don't do that unless it is required. For the current usage of this patch, I think using CRS_NOEXPORT_SNAPSHOT would be sufficient.

--
With Regards,
Amit Kapila.
RE: Simplify some logical replication worker type checking
On Tuesday, August 1, 2023 9:36 AM Peter Smith wrote:
> PROBLEM / SOLUTION
>
> During recent reviews, I noticed some of these conditions are a bit unusual.

Thanks for the patch.

> ==
> worker.c
>
> 1. apply_worker_exit
>
> /*
>  * Reset the last-start time for this apply worker so that the launcher
>  * will restart it without waiting for wal_retrieve_retry_interval if the
>  * subscription is still active, and so that we won't leak that hash table
>  * entry if it isn't.
>  */
> if (!am_tablesync_worker())
>     ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);
>
> ~
>
> In this case, it cannot be a parallel apply worker (there is a check
> prior), so IMO it is better to simplify the condition here to below.
> This also makes the code consistent with all the subsequent
> suggestions in this post.
>
> if (am_leader_apply_worker())
>     ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);

This change looks OK to me.

> ~~~
>
> 2. maybe_reread_subscription
>
> /* Ensure we remove no-longer-useful entry for worker's start time */
> if (!am_tablesync_worker() && !am_parallel_apply_worker())
>     ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);
> proc_exit(0);
>
> ~
>
> Should simplify the above condition to say:
> if (!am_leader_apply_worker())
>
> ~~~
>
> 3. InitializeApplyWorker
>
> /* Ensure we remove no-longer-useful entry for worker's start time */
> if (!am_tablesync_worker() && !am_parallel_apply_worker())
>     ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);
> proc_exit(0);
>
> ~
>
> Ditto. Should simplify the above condition to say:
> if (!am_leader_apply_worker())
>
> ~~~
>
> 4. DisableSubscriptionAndExit
>
> /* Ensure we remove no-longer-useful entry for worker's start time */
> if (!am_tablesync_worker() && !am_parallel_apply_worker())
>     ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);
>
> ~
>
> Ditto.
> Should simplify the above condition to say:
> if (!am_leader_apply_worker())
>
> --
>
> PSA a small patch making those above-suggested changes. The 'make
> check' and TAP subscription tests are all passing OK.

About 2,3,4, it seems you should use "if (am_leader_apply_worker())" instead of "if (!am_leader_apply_worker())" because only the leader apply worker should invoke this function.

Best Regards,
Hou zj
Re: Support to define custom wait events for extensions
On 2023-07-31 19:22, Michael Paquier wrote:
> I am not sure that any of that is necessary. Anyway, I have applied
> v11 to get the basics done.

Thanks for committing the main patch.

In my understanding, the rest of the work is
* to support WaitEventExtensionMultiple()
* to replace WAIT_EVENT_EXTENSION with custom wait events

Is someone already working on these? If not, I'll consider how to realize them.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
Re: Ignore 2PC transaction GIDs in query jumbling
On Tue, Aug 01, 2023 at 11:37:49AM +0900, Michael Paquier wrote:
> On Tue, Aug 01, 2023 at 10:22:09AM +0800, Julien Rouhaud wrote:
> > Looking at the rest of the ignored patterns, the only remaining one would be
> > DEALLOCATE, which AFAICS doesn't have a query_jumble_ignore tag for now.
>
> This one seems to be simple as well with a location field, looking
> quickly at its Node.

Agreed, it should be as trivial to implement as for the 2PC commands :)
Re: Ignore 2PC transaction GIDs in query jumbling
On Tue, Aug 01, 2023 at 10:22:09AM +0800, Julien Rouhaud wrote:
> Looking at the rest of the ignored patterns, the only remaining one would be
> DEALLOCATE, which AFAICS doesn't have a query_jumble_ignore tag for now.

This one seems to be simple as well with a location field, looking quickly at its Node.
--
Michael
Re: Inaccurate comments in ReorderBufferCheckMemoryLimit()
On Mon, Jul 31, 2023 at 8:46 PM Masahiko Sawada wrote:
>
> While reading the code, I realized that the following code comments
> might not be accurate:
>
> /*
>  * Pick the largest transaction (or subtransaction) and evict it from
>  * memory by streaming, if possible. Otherwise, spill to disk.
>  */
> if (ReorderBufferCanStartStreaming(rb) &&
>     (txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
> {
>     /* we know there has to be one, because the size is not zero */
>     Assert(txn && rbtxn_is_toptxn(txn));
>     Assert(txn->total_size > 0);
>     Assert(rb->size >= txn->total_size);
>
>     ReorderBufferStreamTXN(rb, txn);
> }
>
> AFAICS since ReorderBufferLargestStreamableTopTXN() returns only
> top-level transactions, the comment above the if statement is not
> right. It would not pick a subtransaction.

I think the subtransaction case is for the spill-to-disk case, as both cases are explained in the same comment.

> Also, I'm not sure that the second comment "we know there has to be
> one, because the size is not zero" is right since there might not be
> top-level transactions that are streamable.

I think this comment is probably referring to asserts related to the size, similar to the spill-to-disk case.

How about if we just remove "(or subtransaction)" from the following comment: "Pick the largest transaction (or subtransaction) and evict it from memory by streaming, if possible. Otherwise, spill to disk."? Then by referring to the streaming/spill-to-disk cases, one can understand in which cases only top-level xacts are involved and in which cases both are involved.

--
With Regards,
Amit Kapila.
Re: Ignore 2PC transaction GIDs in query jumbling
On Tue, Aug 01, 2023 at 11:00:32AM +0900, Michael Paquier wrote:
> On Tue, Aug 01, 2023 at 09:28:08AM +0800, Julien Rouhaud wrote:
> > FTR we had to entirely ignore all those statements in powa years ago to try to
> > make the tool usable in such cases for some users who were using 2PC, it would
> > be nice to be able to track them back for pg16+.
>
> That would be nice for your users.

Indeed, although I will need to make that part a runtime variable based on the server version rather than a hardcoded string, since the extension framework doesn't provide a way to do that cleanly across major pg versions.

> Are there more query patterns you'd
> be interested in grouping in the backend? I had a few folks aiming
> for CallStmt and SetStmt, but both are a bit tricky without a custom
> routine.

Looking at the rest of the ignored patterns, the only remaining one would be DEALLOCATE, which AFAICS doesn't have a query_jumble_ignore tag for now.
Re: [PATCH] Add support function for containment operators
On Sat, 2023-07-08 at 08:11 +0200, Kim Johan Andersson wrote:
> On 07-07-2023 13:20, Laurenz Albe wrote:
> > I wrote:
> > > You implement both "SupportRequestIndexCondition" and "SupportRequestSimplify",
> > > but when I experimented, the former was never called. That does not
> > > surprise me, since any expression of the shape "expr <@ range constant"
> > > can be simplified. Is the "SupportRequestIndexCondition" branch dead code?
> > > If not, do you have an example that triggers it?
>
> I would think it is dead code; I came to the same conclusion. So we can
> drop SupportRequestIndexCondition, since the simplification happens to
> take care of everything.
>
> > I had an idea about this:
> > So far, you only consider constant ranges. But if we have a STABLE range
> > expression, you could use an index scan for "expr <@ range", for example
> > Index Cond (expr >= lower(range) AND expr < upper(range)).
>
> I will try to look into this. Originally that was what I was hoping for,
> but didn't see a way of going about it.
>
> Thanks for your comments, I will also look at the locale-related
> breakage you spotted.

I have marked the patch as "returned with feedback". I encourage you to submit an improved version in a future commitfest!

Yours,
Laurenz Albe
Re: Ignore 2PC transaction GIDs in query jumbling
On Tue, Aug 01, 2023 at 09:28:08AM +0800, Julien Rouhaud wrote:
> Having an application relying on 2PC leads to pg_stat_statements being
> virtually unusable on the whole instance, so +1 for the patch.

Cool, thanks for the feedback!

> FTR we had to entirely ignore all those statements in powa years ago to try to
> make the tool usable in such cases for some users who were using 2PC, it would
> be nice to be able to track them back for pg16+.

That would be nice for your users. Are there more query patterns you'd be interested in grouping in the backend? I had a few folks aiming for CallStmt and SetStmt, but both are a bit tricky without a custom routine.
--
Michael
Simplify some logical replication worker type checking
Hi hackers,

BACKGROUND

There are 3 different types of logical replication workers:
1. leader apply workers
2. parallel apply workers
3. tablesync workers

And there are a number of places where the current worker type is checked using the am_ inline functions.

PROBLEM / SOLUTION

During recent reviews, I noticed some of these conditions are a bit unusual.

==
worker.c

1. apply_worker_exit

/*
 * Reset the last-start time for this apply worker so that the launcher
 * will restart it without waiting for wal_retrieve_retry_interval if the
 * subscription is still active, and so that we won't leak that hash table
 * entry if it isn't.
 */
if (!am_tablesync_worker())
    ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);

~

In this case, it cannot be a parallel apply worker (there is a check prior), so IMO it is better to simplify the condition here to the below. This also makes the code consistent with all the subsequent suggestions in this post.

if (am_leader_apply_worker())
    ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);

~~~

2. maybe_reread_subscription

/* Ensure we remove no-longer-useful entry for worker's start time */
if (!am_tablesync_worker() && !am_parallel_apply_worker())
    ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);
proc_exit(0);

~

Should simplify the above condition to say:
if (!am_leader_apply_worker())

~~~

3. InitializeApplyWorker

/* Ensure we remove no-longer-useful entry for worker's start time */
if (!am_tablesync_worker() && !am_parallel_apply_worker())
    ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);
proc_exit(0);

~

Ditto. Should simplify the above condition to say:
if (!am_leader_apply_worker())

~~~

4. DisableSubscriptionAndExit

/* Ensure we remove no-longer-useful entry for worker's start time */
if (!am_tablesync_worker() && !am_parallel_apply_worker())
    ApplyLauncherForgetWorkerStartTime(MyLogicalRepWorker->subid);

~

Ditto.
Should simplify the above condition to say: if (!am_leader_apply_worker()) -- PSA a small patch making those above-suggested changes. The 'make check' and TAP subscription tests are all passing OK. --- Kind Regards, Peter Smith. Fujitsu Australia v1-0001-Simplify-worker-type-checks.patch Description: Binary data
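To make the equivalence explicit, the worker-type checks are inline functions; a simplified sketch is below (illustrative stand-in types and fields, not the real worker_internal.h definitions). A leader apply worker is, by definition, a worker that is neither a tablesync worker nor a parallel apply worker, so "!am_tablesync_worker() && !am_parallel_apply_worker()" and "am_leader_apply_worker()" are interchangeable:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the real LogicalRepWorker state. */
typedef struct
{
	bool		in_use;
	int			relid;			/* non-zero only for tablesync workers */
	int			leader_pid;		/* non-zero only for parallel apply workers */
} LogicalRepWorkerSketch;

static LogicalRepWorkerSketch MyLogicalRepWorker_s;

static bool
am_tablesync_worker(void)
{
	return MyLogicalRepWorker_s.relid != 0;
}

static bool
am_parallel_apply_worker(void)
{
	return MyLogicalRepWorker_s.leader_pid != 0;
}

/* A leader apply worker is "neither of the other two types". */
static bool
am_leader_apply_worker(void)
{
	return !am_tablesync_worker() && !am_parallel_apply_worker();
}
```

Given that definition, the negated form "!am_leader_apply_worker()" reads as "tablesync or parallel apply worker", which is why the positive form is the clearer choice in the call sites above.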
Re: Ignore 2PC transaction GIDs in query jumbling
Hi, On Tue, Aug 01, 2023 at 09:38:14AM +0900, Michael Paquier wrote: > > 31de7e6 has silenced savepoint names in the query jumbling, and > something similar can be done for 2PC transactions once the GID is > ignored in TransactionStmt. This leads to the following grouping in > pg_stat_statements, for instance, which is something that matters with > workloads that heavily rely on 2PC: > COMMIT PREPARED $1 > PREPARE TRANSACTION $1 > ROLLBACK PREPARED $1 Having an application relying on 2pc leads to pg_stat_statements being virtually unusable on the whole instance, so +1 for the patch. FTR we had to entirely ignore all those statements in powa years ago to try to make the tool usable in such cases for some users who were using 2pc, it would be nice to be able to track them back for pg16+. The patch LGTM.
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi,

On 2023-07-27 20:53:16 +1200, David Rowley wrote:
> To summarise, REL_15_STABLE can run this benchmark in 526.014 ms on my
> AMD 3990x machine. Today's REL_16_STABLE takes 530.344 ms. We're
> talking about another patch to speed up the pg_strtoint functions
> which gets this down to 483.790 ms. Do we need to do this for v16 or
> can we just leave this as it is already? The slowdown does not seem
> to be much above what we'd ordinarily classify as noise using this
> test on my machine.

I think we need to do something for 16 - it appears on recent-ish AMD the regression is quite a bit smaller than on Intel. You see something < 1%, I see more like 4%. I think there are also other cases where the slowdown is more substantial.

Besides Intel vs AMD, it also looks like the gcc version might make a difference. The code generated by gcc 13 is noticeably slower than gcc 12 for me...

> Benchmark setup:
>
> COPY (SELECT generate_series(1, 200) a, (random() * 10 - 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
> DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);

There's a lot of larger numbers in the file, which likely reduces the impact some. And there's the overhead of actually inserting the rows into the table, making the difference appear smaller than it is. If I avoid the actual insert into the table and use more columns, I see an about 10% regression here.

COPY (SELECT generate_series(1, 1000) a, 10 b, 20 c, 30 d, 40 e, 50 f FROM generate_series(1, 1)) TO '/tmp/lotsaints_wide.copy';

psql -c 'DROP TABLE IF EXISTS lotsaints_wide; CREATE UNLOGGED TABLE lotsaints_wide(a int, b int, c int, d int, e int, f int);' && \
pgbench -n -P1 -f <( echo "COPY lotsaints_wide FROM '/tmp/lotsaints_wide.copy' WHERE false") -t 5

15:              2992.605
HEAD:            3325.201
fastpath1.patch: 2932.606
fastpath2.patch: 2783.915

Interestingly fastpath1 is slower now, even though it wasn't with earlier patches (which still is repeatable). I do not have the foggiest as to why.
Greetings, Andres Freund
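For readers following along, the kind of fast path being benchmarked can be sketched as below. This is an illustrative reimplementation, not the actual fastpath1/fastpath2 patches: handle the common plain-decimal-digits case in a tight loop, and return false so the caller falls back to the fully general parser (whitespace, hex/octal prefixes, etc.) on anything unusual:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical fast path for parsing a signed 32-bit decimal integer.
 * Returns true and stores the value on success; returns false to tell the
 * caller to fall back to the slower, fully-general parser.
 */
static bool
fast_strtoint32(const char *s, int *result)
{
	const char *p = s;
	bool		neg = false;
	int			acc = 0;		/* accumulate negatively so INT_MIN fits */

	if (*p == '-')
	{
		neg = true;
		p++;
	}
	else if (*p == '+')
		p++;

	if (*p < '0' || *p > '9')
		return false;			/* empty or non-digit: take the slow path */

	for (; *p >= '0' && *p <= '9'; p++)
	{
		int			digit = *p - '0';

		if (acc < (INT_MIN + digit) / 10)
			return false;		/* would overflow: slow path re-checks */
		acc = acc * 10 - digit;
	}

	if (*p != '\0')
		return false;			/* trailing junk or whitespace: slow path */

	if (!neg)
	{
		if (acc == INT_MIN)
			return false;		/* +2147483648 overflows a signed int */
		acc = -acc;
	}
	*result = acc;
	return true;
}
```

The design point is that the fast path never needs to be complete, only correct: any input it cannot certify is simply handed to the general parser, so the common COPY case (small plain integers) avoids the general parser's branchy setup cost.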
PostgreSQL 16 Beta 3 release date
Hi, The release date for PostgreSQL 16 Beta 3 is August 10, 2023, alongside the regular update release[1]. Please be sure to commit any open items[2] for the Beta 3 release before August 6, 2023 0:00 AoE[3] to give them enough time to work through the buildfarm. Thanks, Jonathan [1] https://www.postgresql.org/developer/roadmap/ [2] https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items [3] https://en.wikipedia.org/wiki/Anywhere_on_Earth OpenPGP_signature Description: OpenPGP digital signature
Re: pg_upgrade fails with in-place tablespace
On Sat, Jul 29, 2023 at 11:10:22PM +0800, Rui Zhao wrote:
> 2) Only check the tablespace with an absolute path in pg_upgrade.
> There are also other solutions, such as supporting the creation of
> relative-path tablespaces in the function CreateTableSpace. But do we
> really need relative-path tablespaces? I think in-place tablespaces
> are enough by now. My solution may be more lightweight and harmless.

+	/* The path of the in-place tablespace is always pg_tblspc/. */
 	if (!S_ISLNK(st.st_mode))
-		PG_RETURN_TEXT_P(cstring_to_text(sourcepath));
+		PG_RETURN_TEXT_P(cstring_to_text(""));

I don't think that your solution is the correct move. Having pg_tablespace_location() return the physical location of the tablespace is very useful because that's the location where the physical files are, and client applications don't need to know the logic behind the way a path is built.

-	"	spcname != 'pg_global'");
+	"	spcname != 'pg_global' AND "
+	"	pg_catalog.pg_tablespace_location(oid) ~ '^/'");

That may not work on Windows, where the drive letter is appended at the beginning of the path, no? There is is_absolute_path() to do this job in a more portable way.
--
Michael
signature.asc Description: PGP signature
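The portability point is that an "absolute path" test cannot just match a leading '/': on Windows a path may begin with a drive letter such as C:\. The sketch below shows the idea (hypothetical helper names; the real is_absolute_path() in src/include/port.h is a macro with platform-specific branches, so the details differ):

```c
#include <assert.h>
#include <stdbool.h>

/* Directory separator test: Windows accepts both '/' and '\'. */
static bool
is_dir_sep(char c, bool windows)
{
	return c == '/' || (windows && c == '\\');
}

/*
 * Simplified absolute-path test.  On POSIX, absolute means a leading '/'.
 * On Windows it also covers "C:\...", "C:/..." and "\\server\share" forms.
 */
static bool
path_is_absolute(const char *path, bool windows)
{
	if (is_dir_sep(path[0], windows))
		return true;			/* "/x"; on Windows also "\x" and "\\share" */
	if (windows &&
		((path[0] >= 'A' && path[0] <= 'Z') ||
		 (path[0] >= 'a' && path[0] <= 'z')) &&
		path[1] == ':' &&
		is_dir_sep(path[2], true))
		return true;			/* drive-letter form, e.g. "C:\data" */
	return false;
}
```

This is why a SQL-side filter like pg_tablespace_location(oid) ~ '^/' would silently miss "C:\..." tablespace paths on Windows, whereas an is_absolute_path()-style check handles both conventions.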
Re: [feature]COPY FROM enable FORCE_NULL/FORCE_NOT_NULL on all columns
On Aug 1, 2023, at 03:35, Andrew Dunstan wrote:
> I was hoping to be able to get to it today but that's not happening. If you want to submit a revised patch as above that will be good. I hope to get to it later this week.

Hi Andrew,
Patch rebased and updated as above, thanks.

v4-0001-COPY-FROM-enable-FORCE_NULL-FORCE_NOT_NULL-on-all-co.patch Description: Binary data

Zhang Mingli
https://www.hashdata.xyz
Ignore 2PC transaction GIDs in query jumbling
Hi all,

31de7e6 has silenced savepoint names in the query jumbling, and something similar can be done for 2PC transactions once the GID is ignored in TransactionStmt. This leads to the following grouping in pg_stat_statements, for instance, which is something that matters with workloads that heavily rely on 2PC:

COMMIT PREPARED $1
PREPARE TRANSACTION $1
ROLLBACK PREPARED $1

Thoughts or comments?
--
Michael

diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index fe003ded50..2565348303 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3540,7 +3540,8 @@ typedef struct TransactionStmt
 	List	   *options;		/* for BEGIN/START commands */
 	/* for savepoint commands */
 	char	   *savepoint_name pg_node_attr(query_jumble_ignore);
-	char	   *gid;			/* for two-phase-commit related commands */
+	/* for two-phase-commit related commands */
+	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 15ece871a0..b3bdf947b6 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -10924,7 +10924,7 @@ TransactionStmt:
 					n->kind = TRANS_STMT_PREPARE;
 					n->gid = $3;
-					n->location = -1;
+					n->location = @3;
 					$$ = (Node *) n;
 				}
 			| COMMIT PREPARED Sconst
@@ -10933,7 +10933,7 @@ TransactionStmt:
 					n->kind = TRANS_STMT_COMMIT_PREPARED;
 					n->gid = $3;
-					n->location = -1;
+					n->location = @3;
 					$$ = (Node *) n;
 				}
 			| ROLLBACK PREPARED Sconst
@@ -10942,7 +10942,7 @@ TransactionStmt:
 					n->kind = TRANS_STMT_ROLLBACK_PREPARED;
 					n->gid = $3;
-					n->location = -1;
+					n->location = @3;
 					$$ = (Node *) n;
 				}
 		;
diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index 3d920fb5f7..93735d5d85 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -197,6 +197,29 @@ SELECT pg_stat_statements_reset();
 (1 row)
 
+-- Two-phase transactions
+BEGIN;
+PREPARE TRANSACTION 'stat_trans1';
+COMMIT PREPARED 'stat_trans1';
+BEGIN;
+PREPARE TRANSACTION 'stat_trans2';
+ROLLBACK PREPARED 'stat_trans2';
+SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+ calls | rows |               query
+-------+------+------------------------------------
+     2 |    0 | BEGIN
+     1 |    0 | COMMIT PREPARED $1
+     2 |    0 | PREPARE TRANSACTION $1
+     1 |    0 | ROLLBACK PREPARED $1
+     1 |    1 | SELECT pg_stat_statements_reset()
+(5 rows)
+
+SELECT pg_stat_statements_reset();
+ pg_stat_statements_reset
+--------------------------
+ 
+(1 row)
+
 -- Savepoints
 BEGIN;
 SAVEPOINT sp1;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.conf b/contrib/pg_stat_statements/pg_stat_statements.conf
index 13346e2807..0e900d7119 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.conf
+++ b/contrib/pg_stat_statements/pg_stat_statements.conf
@@ -1 +1,2 @@
 shared_preload_libraries = 'pg_stat_statements'
+max_prepared_transactions = 5
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index 859e57955e..87666d9135 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -115,6 +115,16 @@ COMMIT;
 SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
 SELECT pg_stat_statements_reset();
 
+-- Two-phase transactions
+BEGIN;
+PREPARE TRANSACTION 'stat_trans1';
+COMMIT PREPARED 'stat_trans1';
+BEGIN;
+PREPARE TRANSACTION 'stat_trans2';
+ROLLBACK PREPARED 'stat_trans2';
+SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset();
+
 -- Savepoints
 BEGIN;
 SAVEPOINT sp1;
signature.asc Description: PGP signature
Re: pltcl tests fail with FreeBSD 13.2
Andres Freund writes: > On 2023-07-31 19:11:38 -0400, Tom Lane wrote: >> Huh. Maybe worth reporting as a FreeBSD bug? > Yea. Hoping our local freebsd developer has a suggestion as to which component > to report it to, or even fix it :). You already have a reproducer using just tcl, so I'd suggest filing it against tcl and letting them drill down as needed. (It's possible that it actually is tcl that's misbehaving, seeing that our core regression tests are passing in the same environment.) regards, tom lane
Re: Getting rid of OverrideSearchPath in namespace.c
On Mon, Jul 17, 2023 at 05:11:46PM +0300, Aleksander Alekseev wrote: > > As a follow-up for the CVE-2023-2454 fix, I think that it makes sense to > > completely remove unsafe functions > > PushOverrideSearchPath()/PopOverrideSearchPath(), which are not used in the > > core now. > > Please look at the patch attached. > > > > [...] > > > > What do you think? > > +1 to remove dead code. > > The proposed patch however removes get_collation_oid(), apparently by > mistake. Other than that the patch looks fine and passes `make > installcheck-world`. > > I added an entry to the nearest CF [1]. > > > Besides that, maybe it's worth renaming three functions with "Override" in > > their names: GetOverrideSearchPath(), CopyOverrideSearchPath(), > > OverrideSearchPathMatchesCurrent(), and then maybe struct > > OverrideSearchPath. > > Noah Misch proposed the name GetSearchPathMatcher() for the former. > > +1 as well. I added the corresponding 0002 patch. Pushed both. Thanks.
Re: pltcl tests fail with FreeBSD 13.2
Hi, On 2023-07-31 19:11:38 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2023-07-31 18:33:37 -0400, Tom Lane wrote: > >> (And could it be that we had one in the predecessor 13.1 image?) > > > No, I checked, and it's not in there either... It looks like the difference > > is > > that 13.1 reads the UTC zoneinfo in that case, whereas 13.2 doesn't. > > Huh. Maybe worth reporting as a FreeBSD bug? Yea. Hoping our local freebsd developer has a suggestion as to which component to report it to, or even fix it :). > In any case I don't think it's *our* bug. Agreed. An explicit tzsetup UTC isn't much to carry going forward... Greetings, Andres Freund
Re: Avoid possible memory leak (src/common/rmtree.c)
On Mon, Jul 31, 2023 at 08:10:55PM -0300, Ranier Vilela wrote: > Thanks for the commit, Michael. Sorry for the lack of update here. For the sake of the archives, this is f1e9f6b. -- Michael signature.asc Description: PGP signature
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
Hi,

On 2023-07-31 21:25:06 +0000, José Neves wrote:
> Ok, if I understood you correctly, I start to see where my logic is faulty.
> Just to make sure that I got it right, taking the following example again:
>
> T-1
> INSERT LSN1-1000
> UPDATE LSN2-2000
> UPDATE LSN3-3000
> COMMIT LSN4-4000
>
> T-2
> INSERT LSN1-500
> UPDATE LSN2-1500
> UPDATE LSN3-2500
> COMMIT LSN4-5500
>
> Where data will arrive in this order:
>
> INSERT LSN1-500
> INSERT LSN1-1000
> UPDATE LSN2-1500
> UPDATE LSN2-2000
> UPDATE LSN3-2500
> UPDATE LSN3-3000
> COMMIT LSN4-4000
> COMMIT LSN4-5500

No, they won't arrive in that order. They will arrive as:

BEGIN
INSERT LSN1-1000
UPDATE LSN2-2000
UPDATE LSN3-3000
COMMIT LSN4-4000

BEGIN
INSERT LSN1-500
UPDATE LSN2-1500
UPDATE LSN3-2500
COMMIT LSN4-5500

Because T1 committed before T2. Changes are only streamed out at commit / prepare transaction (*). Within a transaction, they however *will* be ordered by LSN.

(*) Unless you use streaming mode, in which case it'll all be more complicated, as you'll also receive changes for transactions that might still abort.

> You are saying that the LSN3-3000 will never be missing, either the entire
> connection will fail at that point, or all should be received in the
> expected order (which is different from the "numeric order" of LSNs).

I'm not quite sure what you mean with the "different from the numeric order" bit...

> If the connection is down, upon restart, I will receive the entire T-1
> transaction again (well, all example data again).

Yes, unless you already acknowledged receipt up to LSN4-4000 and/or are only asking for newer transactions when reconnecting.

> In addition to that, if I commit LSN4-4000, even tho that LSN has a "bigger
> numeric value" than the ones representing INSERT and UPDATE events on T-2, I
> will be receiving the entire T-2 transaction again, as the LSN4-5500 is
> still uncommitted.

I don't quite know what you mean with "commit LSN4-4000" here.
> This makes sense to me, but just to be extra clear, I will never receive a > transaction commit before receiving all other events for that transaction. Correct. Greetings, Andres Freund
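The guarantee described above (changes arrive grouped per transaction, and a commit arrives only after all of that transaction's changes) suggests a simple consumer design: buffer changes while inside a transaction, and persist/acknowledge progress only when the COMMIT message arrives. A minimal sketch follows, with hypothetical types and names rather than the actual replication protocol structs:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified message kinds from a pgoutput-style stream. */
typedef enum { MSG_BEGIN, MSG_CHANGE, MSG_COMMIT } MsgKind;

typedef struct
{
	MsgKind		kind;
	uint64_t	lsn;			/* LSN carried by this record */
} Msg;

typedef struct
{
	int			in_txn;			/* currently inside BEGIN .. COMMIT? */
	uint64_t	flushed_commit_lsn; /* restart point safe to acknowledge */
} Consumer;

/*
 * Process one message.  Changes would be buffered (elided here) while
 * in_txn is set; only a COMMIT advances the LSN that may be acknowledged
 * upstream.  Returns the newly confirmable LSN, or 0 if unchanged.
 */
static uint64_t
consume(Consumer *c, Msg m)
{
	switch (m.kind)
	{
		case MSG_BEGIN:
			c->in_txn = 1;
			break;
		case MSG_CHANGE:
			assert(c->in_txn);	/* changes arrive only inside a transaction */
			break;
		case MSG_COMMIT:
			c->in_txn = 0;
			/* persist atomically together with the transaction's data */
			c->flushed_commit_lsn = m.lsn;
			return m.lsn;
	}
	return 0;
}
```

With this scheme the consumer never compares change LSNs across transactions at all: after T1's COMMIT at LSN 4000 it acknowledges 4000, even though T2's buffered changes carry numerically smaller LSNs, and a reconnect simply restarts from the last acknowledged commit.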
Re: Avoid possible memory leak (src/common/rmtree.c)
On Fri, Jul 28, 2023, 11:54 PM, Michael Paquier wrote: > On Tue, Jul 25, 2023 at 04:45:22PM +0200, Daniel Gustafsson wrote: > > Skimming the tree there doesn't seem to be any callers which aren't exiting or > > ereporting on failure so the real-world impact seems low. That being said, > > silencing static analyzers could be reason enough to delay allocation. > > A different reason would be out-of-core code that uses rmtree() in a > memory context where the leak would be an issue if facing a failure > continuously? Delaying the allocation after the OPENDIR() seems like > a good practice anyway. Thanks for the commit, Michael. best regards, Ranier Vilela
Re: pltcl tests fail with FreeBSD 13.2
Andres Freund writes: > On 2023-07-31 18:33:37 -0400, Tom Lane wrote: >> (And could it be that we had one in the predecessor 13.1 image?) > No, I checked, and it's not in there either... It looks like the difference is > that 13.1 reads the UTC zoneinfo in that case, whereas 13.2 doesn't. Huh. Maybe worth reporting as a FreeBSD bug? In any case I don't think it's *our* bug. regards, tom lane
Re: pltcl tests fail with FreeBSD 13.2
Hi, On 2023-07-31 18:33:37 -0400, Tom Lane wrote: > Andres Freund writes: > > I saw that CI image builds for freebsd were failing, because 13.1, used > > until > > now, is EOL. Update to 13.2, no problem, I thought. Ran a CI run against > > 13.2 > > - unfortunately that failed. In pltcl of all places. > > I tried to replicate this in a freshly-installed 13.2 VM, and could > not, which is unsurprising because the FreeBSD installation process > does not let you skip selecting a timezone --- which creates > /etc/localtime AFAICT. So I'm unconvinced that we ought to worry > about the case where that's not there. How is it that the CI image > lacks that file? I don't know why it lacks the file - the CI image is based on the google cloud image freebsd maintains / publishes ([1][2]). Which doesn't have /etc/localtime. I've now added a "tzsetup UTC" to the image generation, which fixes the test failure. > (And could it be that we had one in the predecessor 13.1 image?) No, I checked, and it's not in there either... It looks like the difference is that 13.1 reads the UTC zoneinfo in that case, whereas 13.2 doesn't. Upstream 13.1 image: truss date 2>&1|grep -E 'open|stat' ... open("/etc/localtime",O_RDONLY,0101401200) ERR#2 'No such file or directory' open("/usr/share/zoneinfo/UTC",O_RDONLY,00) = 3 (0x3) fstat(3,{ mode=-r--r--r-- ,inode=351417,size=118,blksize=32768 }) = 0 (0x0) open("/usr/share/zoneinfo/posixrules",O_RDONLY,014330460400) = 3 (0x3) fstat(3,{ mode=-r--r--r-- ,inode=356621,size=3535,blksize=32768 }) = 0 (0x0) fstat(1,{ mode=p- ,inode=696,size=0,blksize=4096 }) = 0 (0x0) Upstream 13.2 image: open("/etc/localtime",O_RDONLY,01745)ERR#2 'No such file or directory' fstat(1,{ mode=p- ,inode=658,size=0,blksize=4096 }) = 0 (0x0) Why not reading the UTC zone leads to timestamps being out of range, I do not know... Greetings, Andres Freund [1] https://wiki.freebsd.org/GoogleCloudPlatform [2] https://cloud.google.com/compute/docs/images#freebsd
Re: pltcl tests fail with FreeBSD 13.2
Andres Freund writes: > I saw that CI image builds for freebsd were failing, because 13.1, used until > now, is EOL. Update to 13.2, no problem, I thought. Ran a CI run against 13.2 > - unfortunately that failed. In pltcl of all places. I tried to replicate this in a freshly-installed 13.2 VM, and could not, which is unsurprising because the FreeBSD installation process does not let you skip selecting a timezone --- which creates /etc/localtime AFAICT. So I'm unconvinced that we ought to worry about the case where that's not there. How is it that the CI image lacks that file? (And could it be that we had one in the predecessor 13.1 image?) regards, tom lane
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On Mon, 2023-07-31 at 13:17 -0400, Joe Conway wrote:
> But the analysis of the issue needs to go one step further. Even if the
> search_path does not change from the originally intended one, a newly
> created function can shadow the intended one based on argument coercion
> rules.

There are quite a few issues going down this path:

* The set of objects in each schema can change. Argument coercion is a particularly subtle one, but there are other ways that it could find the wrong object. The temp namespace also has some subtle issues.
* Schema USAGE privileges may vary over time or from caller to caller, affecting which items in the search path are searched at all. The same goes if there's an object access hook in place.
* $user should be resolved to a specific schema (or perhaps not in some cases?)
* There are other GUCs and environment that can affect function behavior. Is it worth trying to lock those down?

I agree that each of these is some potential problem, but these are much smaller problems than allowing the caller to have complete control over the search_path.

Regards, Jeff Davis
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On Mon, 2023-07-31 at 12:53 -0400, Robert Haas wrote: > I agree. I think there are actually two interrelated problems here. > > One is that virtually all code needs to run with the originally > intended search_path rather than some search_path chosen at another > time and maybe by a different user. If not, it's going to break, or > compromise security, depending on the situation. The other is that > running arbitrary code written by somebody else as yourself is > basically instant death, from a security perspective. Good framing. The search_path is a particularly nasty problem in our system because it means that users can't even trust the code that they write themselves! A function author has no way to know how their own function will behave under a different search_path. > It's a little hard to imagine a world in which these problems don't > exist at all, but it somehow feels like the design of the system > pushes you toward doing this stuff incorrectly rather than doing it > correctly. For instance, you can imagine a system where when you run > CREATE OR REPLACE FUNCTION, the prevailing search_path is captured > and > automatically included in proconfig. Capturing the environment is not ideal either, in my opinion. It makes it easy to carelessly depend on a schema that others might not have USAGE privileges on, which would then create a runtime problem for other callers. Also, I don't think we could just depend on the raw search_path, we'd need to do some processing for $user, and there are probably a few other annoyances. It's one possibility and we don't have a lot of great options, so I don't want to rule it out though. If nothing else it could be a transition path to something better. > But you can maybe imagine a system where > all code associated with a table is run as the table owner in all > cases, regardless of SECURITY INVOKER/DEFINER, which I think would at > least close some holes. 
> > The difficulty is that it's a bit hard to imagine making these kinds > of definitional changes now, because they'd probably be breaking > changes for pretty significant numbers of users. I believe we can get close to a good model with minimal breakage. And when we make the problem small enough I believe other solutions will emerge. We will probably have to hedge with some compatibility GUCs. > But on the other > hand, if we don't start thinking about systemic changes here, it > feels > like we're just playing whack-a-mole. Exactly. If we can agree on where we're going then I think we can get there. Regards, Jeff Davis
RE: CDC/ETL system on top of logical replication with pgoutput, custom client
Hi Andres, thanks for your reply.

Ok, if I understood you correctly, I start to see where my logic is faulty. Just to make sure that I got it right, taking the following example again:

T-1
INSERT LSN1-1000
UPDATE LSN2-2000
UPDATE LSN3-3000
COMMIT LSN4-4000

T-2
INSERT LSN1-500
UPDATE LSN2-1500
UPDATE LSN3-2500
COMMIT LSN4-5500

Where data will arrive in this order:

INSERT LSN1-500
INSERT LSN1-1000
UPDATE LSN2-1500
UPDATE LSN2-2000
UPDATE LSN3-2500
UPDATE LSN3-3000
COMMIT LSN4-4000
COMMIT LSN4-5500

You are saying that the LSN3-3000 will never be missing, either the entire connection will fail at that point, or all should be received in the expected order (which is different from the "numeric order" of LSNs). If the connection is down, upon restart, I will receive the entire T-1 transaction again (well, all example data again). In addition to that, if I commit LSN4-4000, even though that LSN has a "bigger numeric value" than the ones representing INSERT and UPDATE events on T-2, I will be receiving the entire T-2 transaction again, as the LSN4-5500 is still uncommitted. This makes sense to me, but just to be extra clear, I will never receive a transaction commit before receiving all other events for that transaction. Are these statements correct?

> Are you using the 'streaming' mode / option to pgoutput?

No.

> Not sure what you mean with "unordered offsets"?

Ordered: EB53/E0D88188, EB53/E0D88189, EB53/E0D88190
Unordered: EB53/E0D88190, EB53/E0D88188, EB53/E0D88189

Extra question: When I get a begin message, I get a transaction starting at LSN-1000, and a transaction ending at LSN-2000. But as the example above shows, I can have data points from other transactions with LSNs in that interval. I have no way to identify to which transaction they belong, correct?

Thanks again.
Regards, José Neves

From: Andres Freund
Sent: July 31, 2023 21:39
To: José Neves
Cc: Amit Kapila; pgsql-hack...@postgresql.org
Subject: Re: CDC/ETL system on top of logical replication with pgoutput, custom client

Hi, On 2023-07-31 14:16:22 +, José Neves wrote: > Hi Amit, thanks for the reply. > > In our worker (custom pg replication client), we care only about INSERT, > UPDATE, and DELETE operations, which - sure - may be part of the issue. That seems likely. Postgres streams out changes in commit order, not in order of the changes having been made (that'd not work due to rollbacks etc). If you just disregard transactions entirely, you'll get something bogus after retries. You don't need to store the details for each commit in the target system, just up to which LSN you have processed *commit records*. E.g. if you have received and safely stored up to commit 0/1000, you need to remember that. Are you using the 'streaming' mode / option to pgoutput? > 1. We have no way to match LSN operations with the respective commit, as > they have unordered offsets. Not sure what you mean with "unordered offsets"? > Assuming that all of them were received in order, we would commit all data > with the commit message LSN4-4000 as other events would match the transaction > start and end LSN interval of it. Logical decoding sends out changes in a deterministic order and you won't see out of order data when using TCP (the entire connection can obviously fail though). Andres
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On Mon, 2023-07-31 at 16:06 -0400, Robert Haas wrote: > if you > include in your search_path a schema to which some other user can > write, you are pretty much agreeing to execute code provided by that > user. Agreed on all counts here. I don't think it's reasonable for us to try to make such a setup secure, and I don't think users have much need for such a setup anyway. > One thing we might be able to do to prevent that sort of thing is to > have a feature to prevent "accidental" code execution, as in the > "function trust" mechanism proposed previously. Say I trust all users > who can SET ROLE to me and/or who inherit my privileges. Additionally > I can decide to trust users who do neither of those things by some > sort of explicit declaration. If I don't trust a user then if I do > anything that would cause code supplied by that user to get executed, > it just errors out: > > ERROR: role "rhaas" should not execute arbitrary code provided by > role "jconway" > HINT: If this should be allowed, use the TRUST command to permit it. +1, though I'm not sure we need an extensive trust mechanism beyond what we already have with the SET ROLE privilege. > And > we probably also still need to find ways to control search_path in a > lot more widely than we do today. Otherwise, even if stuff is > technically secure, it may just not work. +1. Regards, Jeff Davis
Re: Correct the documentation for work_mem
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: tested, passed

Hello, I've reviewed and built the documentation for the updated patch. As it stands right now I think the documentation for this section is quite clear.

> I'm wondering about adding "and more than one of these operations may
> be in progress simultaneously". Are you talking about concurrent
> sessions running other queries which are using work_mem too?

This appears to be referring to the "sort and hash" operations mentioned prior.

> If so,
> isn't that already covered by the final sentence in the quoted text
> above? if not, what is running simultaneously?

I believe the last sentence is referring to another session that is running its own sort and hash operations. So the first section you mention is describing how sort and hash operations can be in execution at the same time for a query, while the second refers to how sessions may overlap in their execution of sort and hash operations, if I am understanding this correctly. I also agree that changing "sort or hash" to "sort and hash" is a better description.

Tristen
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
Hi, On 2023-07-31 14:16:22 +, José Neves wrote: > Hi Amit, thanks for the reply. > > In our worker (custom pg replication client), we care only about INSERT, > UPDATE, and DELETE operations, which - sure - may be part of the issue. That seems likely. Postgres streams out changes in commit order, not in order of the changes having been made (that'd not work due to rollbacks etc). If you just disregard transactions entirely, you'll get something bogus after retries. You don't need to store the details for each commit in the target system, just up to which LSN you have processed *commit records*. E.g. if you have received and safely stored up to commit 0/1000, you need to remember that. Are you using the 'streaming' mode / option to pgoutput? > 1. We have no way to match LSN operations with the respective commit, as > they have unordered offsets. Not sure what you mean with "unordered offsets"? > Assuming that all of them were received in order, we would commit all data > with the commit message LSN4-4000 as other events would match the transaction > start and end LSN interval of it. Logical decoding sends out changes in a deterministic order and you won't see out of order data when using TCP (the entire connection can obviously fail though). Andres
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On Mon, Jul 31, 2023 at 1:18 PM Joe Conway wrote: > But the analysis of the issue needs to go one step further. Even if the > search_path does not change from the originally intended one, a newly > created function can shadow the intended one based on argument coercion > rules. Yeah, this is a complicated issue. As the system works today, if you include in your search_path a schema to which some other user can write, you are pretty much agreeing to execute code provided by that user. If that user has strictly greater privileges than you, e.g. they are the super-user, then that's fine, because they can compromise your account anyway, but otherwise, you're probably doomed. Not only can they try to capture references with similarly-named objects, they can also do things like create objects whose names are common mis-spellings of the objects that are supposed to be there and hope you access the wrong one by mistake. Maybe there are other attacks as well, but even if not, I think it's already a pretty hopeless situation. I think the UNIX equivalent would be including a directory in your PATH that is world-writable and hoping your account will stay secure. Not very likely. We have already taken an important step in terms of preventing this attack in commit b073c3ccd06e4cb845e121387a43faa8c68a7b62, which removed PUBLIC CREATE from the public schema. Before that, we were shipping a configuration analogous to a UNIX system where /usr/bin was world-writable -- something which no actual UNIX system has likely done any time in the last 40 years, because it's so clearly insane. We could maybe go a step further by changing the default search_path to not even include public, to further discourage people from using that as a catch-all where everybody can just dump their objects. 
Right now, anybody can revert that change with a single GRANT statement, and if we removed public from the default search_path as well, people would have one extra step to restore that insecure configuration. I don't know whether such a change is worthwhile, though. It would still be super-easy for users to create insecure configurations: as soon as user A can write a schema and user B has it in the search_path, B is in a lot of trouble if A turns out to be malicious. One thing we might be able to do to prevent that sort of thing is to have a feature to prevent "accidental" code execution, as in the "function trust" mechanism proposed previously. Say I trust all users who can SET ROLE to me and/or who inherit my privileges. Additionally I can decide to trust users who do neither of those things by some sort of explicit declaration. If I don't trust a user then if I do anything that would cause code supplied by that user to get executed, it just errors out: ERROR: role "rhaas" should not execute arbitrary code provided by role "jconway" HINT: If this should be allowed, use the TRUST command to permit it. Even if we do this, I still think we need the kinds of fixes that I mentioned earlier. An error like this is fine if you're trying to do something to a table owned by another user and they've got a surprise trigger on there or something, but a maintenance command like VACUUM should find a way to work that is both secure and non-astonishing. And we probably also still need to find ways to control search_path in a lot more widely than we do today. Otherwise, even if stuff is technically secure, it may just not work. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pltcl tests fail with FreeBSD 13.2
Hi,

On 2023-07-31 12:15:10 -0700, Andres Freund wrote:
> It sure looks like freebsd 13.2 tcl is just busted. Notably it can't even
> parse what it generates:
>
> echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d"]
> -format "%Y/%m/%d"]'|tclsh8.6
>
> Which works on 13.1 (and other operating systems), without a problem.
>
> I used truss as a very basic way to see differences between 13.1 and 13.2 -
> the big difference was .2 failing just after
> access("/etc/localtime",F_OK) ERR#2 'No such file or
> directory'
> open("/etc/localtime",O_RDONLY,077) ERR#2 'No such file or
> directory'
>
> whereas 13.1 also saw that, but then continued to
>
> issetugid() = 0 (0x0)
> open("/usr/share/zoneinfo/UTC",O_RDONLY,00) = 3 (0x3)
> fstat(3,{ mode=-r--r--r-- ,inode=351417,size=118,blksize=32768 }) = 0 (0x0)
> ...
>
> which made me test specifying the timezone explicitly:
> echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d"
> -timezone "UTC"] -format "%Y/%m/%d" -timezone "UTC"]'|tclsh8.6
>
> Which, surprise, works.
>
> So does specifying the timezone via the TZ='UTC' environment variable.
>
> I guess there could be a libc behaviour change or such around timezones? I do
> see
> https://www.freebsd.org/releases/13.2R/relnotes/
> "tzcode has been upgraded to version 2022g with improved timezone change
> detection and reliability fixes."

One additional datapoint: If I configure the timezone with "tzsetup" (which creates /etc/localtime), the problem vanishes as well.

Greetings,

Andres Freund
Re: [feature]COPY FROM enable FORCE_NULL/FORCE_NOT_NULL on all columns
On 2023-07-26 We 03:03, Zhang Mingli wrote:

Hi, I've looked at this patch and it looks mostly fine, though I do not intend to commit it myself; perhaps Andrew will.

Hi Amit, thanks for the review.

A few minor things to improve:

+ If * is specified, it will be applied in all columns.
...
+ If * is specified, it will be applied in all columns.

Please write "it will be applied in" as "the option will be applied to".

+1

+ bool force_notnull_all; /* FORCE_NOT_NULL * */
...
+ bool force_null_all; /* FORCE_NULL * */

Like in the comment for force_quote, please add a "?" after * in the above comments.

+1

+ if (cstate->opts.force_notnull_all)
+ MemSet(cstate->opts.force_notnull_flags, true, num_phys_attrs * sizeof(bool));
...
+ if (cstate->opts.force_null_all)
+ MemSet(cstate->opts.force_null_flags, true, num_phys_attrs * sizeof(bool));

While I am not especially opposed to using this 1-line variant to set the flags array, it does mean that there are now different styles being used for similar code, because force_quote_flags uses a for loop:

if (cstate->opts.force_quote_all)
{
    int i;

    for (i = 0; i < num_phys_attrs; i++)
        cstate->opts.force_quote_flags[i] = true;
}

Perhaps we could fix the inconsistency by changing the force_quote_all code to use MemSet() too. I'll defer whether to do that to Andrew's judgement.

Sure, let's wait for Andrew and I will put everything in one pot then.

I was hoping to be able to get to it today but that's not happening. If you want to submit a revised patch as above that will be good. I hope to get to it later this week.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
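[Editor's note: for context, the patch under review extends COPY so that FORCE_NULL and FORCE_NOT_NULL accept * to mean all columns, mirroring the existing FORCE_QUOTE *. The following is a hypothetical usage sketch under that assumption; the table and file names are invented, and the exact syntax is whatever the committed patch ends up with.]

```sql
-- With the proposed syntax, every column is treated as FORCE_NULL, so a
-- quoted null string (e.g. "") in any column is read back as NULL:
COPY measurements FROM '/tmp/measurements.csv'
    WITH (FORMAT csv, FORCE_NULL *);

-- Today the same effect requires spelling out every column:
-- WITH (FORMAT csv, FORCE_NULL (col1, col2, col3));
```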
pltcl tests fail with FreeBSD 13.2
Hi,

I saw that CI image builds for freebsd were failing, because 13.1, used until now, is EOL. Update to 13.2, no problem, I thought. Ran a CI run against 13.2 - unfortunately that failed. In pltcl of all places.

https://api.cirrus-ci.com/v1/artifact/task/5275616266682368/testrun/build/testrun/pltcl/regress/regression.diffs

Notably both 13.1 and 13.2 are using tcl 8.6.13.

The code for the relevant function is this:

create function tcl_date_week(int4,int4,int4) returns text as $$
    return [clock format [clock scan "$2/$3/$1"] -format "%U"]
$$ language pltcl immutable;

select tcl_date_week(2010,1,26);

It doesn't surprise me that there are problems with the above clock scan, since it uses the MM/DD/ format without making that explicit. But why that doesn't work on freebsd 13.2, I can't explain. It looks like tcl specifies the MM/DD/ bit for "free format scans":
https://www.tcl.tk/man/tcl8.6/TclCmd/clock.html#M80

It sure looks like freebsd 13.2 tcl is just busted. Notably it can't even parse what it generates:

echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d"] -format "%Y/%m/%d"]'|tclsh8.6

Which works on 13.1 (and other operating systems), without a problem.

I used truss as a very basic way to see differences between 13.1 and 13.2 - the big difference was .2 failing just after

access("/etc/localtime",F_OK)  ERR#2 'No such file or directory'
open("/etc/localtime",O_RDONLY,077)  ERR#2 'No such file or directory'

whereas 13.1 also saw that, but then continued to

issetugid()  = 0 (0x0)
open("/usr/share/zoneinfo/UTC",O_RDONLY,00)  = 3 (0x3)
fstat(3,{ mode=-r--r--r-- ,inode=351417,size=118,blksize=32768 }) = 0 (0x0)
...

which made me test specifying the timezone explicitly:

echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d" -timezone "UTC"] -format "%Y/%m/%d" -timezone "UTC"]'|tclsh8.6

Which, surprise, works.

So does specifying the timezone via the TZ='UTC' environment variable.
I guess there could be a libc behaviour change or such around timezones? I do see
https://www.freebsd.org/releases/13.2R/relnotes/
"tzcode has been upgraded to version 2022g with improved timezone change detection and reliability fixes."

Greetings,

Andres Freund
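[Editor's note: until the underlying FreeBSD/tzcode issue is understood, the failing test function could sidestep the free-format scan entirely by making both the parse format and the timezone explicit, as Andres's workaround suggests. This is an untested sketch of such a rewrite, not a proposed regression-test change.]

```sql
create function tcl_date_week(int4,int4,int4) returns text as $$
    # parse with an explicit format and timezone instead of a free-format
    # scan, which depends on a working local timezone configuration
    return [clock format \
        [clock scan "$2/$3/$1" -format "%m/%d/%Y" -timezone "UTC"] \
        -format "%U" -timezone "UTC"]
$$ language pltcl immutable;

select tcl_date_week(2010,1,26);
```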
RE: CDC/ETL system on top of logical replication with pgoutput, custom client
Hi Euler, thank you for your reply.

Your output is exactly what mine does not look like; I don't have such an order - that is - not only under heavy load. Conditions in which this occurs make it challenging to provide detailed information, and it will also take a while to trigger. I've sent a previous email explaining what my output looks like, from a previous debug. I can gather more information if need be, but I was interested in this bit:

> Instead, Postgres provides a replication progress mechanism [1] to do it.

It's not 100% clear to me what that would look like at the code level; can you provide a high-level algorithm on how such code would work?

For reference, our implementation - to the bones - is very similar to this:
https://adam-szpilewicz.pl/cdc-replication-from-postgresql-using-go-golang

Thanks for your help.
Regards,
José Neves

From: Euler Taveira
Sent: 31 July 2023 15:27
To: José Neves; pgsql-hackers
Subject: Re: CDC/ETL system on top of logical replication with pgoutput, custom client

On Sat, Jul 29, 2023, at 8:07 PM, José Neves wrote:

I'm attempting to develop a CDC on top of Postgres, currently using 12, the last minor, with a custom client, and I'm running into issues with data loss caused by out-of-order logical replication messages.

Can you provide a test case to show this issue? Did you try in a newer version?

The problem is as follows: postgres streams A, B, D, G, K, I, P logical replication events; upon exit signal we stop consuming new events at LSN K, and we wait 30s for out-of-order events. Let's say that we only got A (and K ofc), so in the following 30s we get B, D; however, for whatever reason, G never arrived. As with pgoutput-based logical replication we have no way to calculate the next LSN, we have no idea that G was missing, so we assumed that it all arrived, committing K to the postgres slot and shutting down. In the next run, our worker will start receiving data from K forward, and G is lost forever...
Meanwhile postgres moves forward with archiving and we can't go back to check if we lost anything. And even if we could, it would be extremely inefficient.

Logical decoding provides the changes to the output plugin at commit time. You mentioned the logical replication events but didn't say which are part of the same transaction. Let's say A, B, D and K are changes from the same transaction and G, I and P are changes from another transaction. The first transaction will be available when it processes K. The second transaction will be provided when the logical decoding processes P.

You didn't say how your consumer is working. Are you sure your consumer doesn't get the second transaction? If your consumer is advancing the replication slot *after* receiving K (using pg_replication_slot_advance), it is doing it wrong. Another common problem with consumers is that they use pg_logical_slot_get_changes() but crash *before* using the data; in this case, the data is lost. It is hard to say where the problem is if you didn't provide enough information about the consumer logic and the WAL information (pg_waldump output) around the time you detect the data loss.

In sum, the issue comes from the fact that postgres will stream events with unordered LSNs on high transactional systems, and that pgoutput doesn't have access to enough information to calculate the next or last LSN, so we have no way to check if we received all the data that we are supposed to receive, risking committing an offset that we shouldn't, as we haven't yet received preceding data.

It seems very odd to me that none of the open-source CDC projects that I looked into care about this. They always assume that the next LSN received is... well, the next one, and commit that one, so upon restart they are vulnerable to the same issue. So... either I'm missing something... or we have a generalized assumption causing data loss under certain conditions all over.

Let me illustrate the current behavior.
Let's say there are 3 concurrent transactions.

Session A
==
euler=# SELECT pg_create_logical_replication_slot('repslot1', 'wal2json');
 pg_create_logical_replication_slot
------------------------------------
 (repslot1,0/369DF088)
(1 row)

euler=# create table foo (a int primary key);
CREATE TABLE
euler=# BEGIN;
BEGIN
euler=*# INSERT INTO foo (a) SELECT generate_series(1, 2);
INSERT 0 2

Session B
==
euler=# BEGIN;
BEGIN
euler=*# INSERT INTO foo (a) SELECT generate_series(11, 12);
INSERT 0 2

Session C
==
euler=# BEGIN;
BEGIN
euler=*# INSERT INTO foo (a) SELECT generate_series(21, 22);
INSERT 0 2

Session A
==
euler=*# INSERT INTO foo (a) VALUES(3);
INSERT 0 1

Session B
==
euler=*# INSERT INTO foo (a) VALUES(13);
INSERT 0 1

Session C
==
euler=*# INSERT INTO foo (a) VALUES(23);
INSERT 0 1
euler=*# COMMIT;
COMMIT

Session B
==
euler=*# COMMIT;
COMMIT

Session A
==
euler=*# COMMIT;
COMMIT

The
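[Editor's note: for a custom consumer, the "replication progress mechanism" referred to above is exposed via the replication origin functions. A high-level sketch follows; the function names are from the core API, while the origin name and the LSN value are invented for illustration. The essential point is that the consumer records, transaction by transaction, the commit LSN it has durably applied downstream, and resumes from that recorded position after a restart.]

```sql
-- One-time setup: register an origin representing this consumer.
SELECT pg_replication_origin_create('my_cdc_consumer');

-- After durably applying a whole transaction downstream, record the LSN
-- of its commit record (as reported with the COMMIT message):
SELECT pg_replication_origin_advance('my_cdc_consumer', '0/369DF0C0'::pg_lsn);

-- Inspect the recorded progress, e.g. after a crash, to know where to
-- restart the stream:
SELECT * FROM pg_replication_origin_status;
```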
Re: should frontend tools use syncfs() ?
On Mon, Jul 31, 2023 at 10:51:38AM -0700, Nathan Bossart wrote: > Here is a new version of the patch with documentation updates and a couple > other small improvements. I just realized I forgot to update the --help output for these utilities. I'll do that in the next version of the patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: add timing information to pg_upgrade
On Mon, Jul 31, 2023 at 11:34:57AM +0530, Bharath Rupireddy wrote:
> Either of "Checking for \"aclitem\" data type usage" or "Checking for
> \"aclitem\" data type in user tables" seems okay to me, however, I
> prefer the second wording.

Okay. I used the second wording for all the data type checks in v3. I also marked the timing strings for translation. I considered trying to extract psql's PrintTiming() so that it could be reused in other utilities, but the small differences would likely make translation difficult, and the logic isn't terribly long or sophisticated.

My only remaining concern is that this timing information could cause pg_upgrade's output to exceed 80 characters per line. I don't know if this is something that folks are very worried about in 2023, but it still seemed worth bringing up.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

>From c52513341fb2a315203e9ba5dd95761220744a74 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Mon, 31 Jul 2023 11:02:27 -0700
Subject: [PATCH v3 1/2] harmonize data type status messages in pg_upgrade

---
 src/bin/pg_upgrade/check.c   | 4 ++--
 src/bin/pg_upgrade/version.c | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 64024e3b9e..38a4227be0 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -1215,7 +1215,7 @@ check_for_aclitem_data_type_usage(ClusterInfo *cluster)
 {
 	char		output_path[MAXPGPATH];

-	prep_status("Checking for incompatible \"aclitem\" data type in user tables");
+	prep_status("Checking for \"aclitem\" data type in user tables");

 	snprintf(output_path, sizeof(output_path), "tables_using_aclitem.txt");
@@ -1243,7 +1243,7 @@ check_for_jsonb_9_4_usage(ClusterInfo *cluster)
 {
 	char		output_path[MAXPGPATH];

-	prep_status("Checking for incompatible \"jsonb\" data type");
+	prep_status("Checking for \"jsonb\" data type in user tables");

 	snprintf(output_path, sizeof(output_path), "%s/%s",
 			 log_opts.basedir,

diff --git a/src/bin/pg_upgrade/version.c b/src/bin/pg_upgrade/version.c
index 403a6d7cfa..9755af920f 100644
--- a/src/bin/pg_upgrade/version.c
+++ b/src/bin/pg_upgrade/version.c
@@ -181,7 +181,7 @@ old_9_3_check_for_line_data_type_usage(ClusterInfo *cluster)
 {
 	char		output_path[MAXPGPATH];

-	prep_status("Checking for incompatible \"line\" data type");
+	prep_status("Checking for \"line\" data type in user tables");

 	snprintf(output_path, sizeof(output_path), "%s/%s",
 			 log_opts.basedir,
@@ -221,7 +221,7 @@ old_9_6_check_for_unknown_data_type_usage(ClusterInfo *cluster)
 {
 	char		output_path[MAXPGPATH];

-	prep_status("Checking for invalid \"unknown\" user columns");
+	prep_status("Checking for \"unknown\" data type in user tables");

 	snprintf(output_path, sizeof(output_path), "%s/%s",
 			 log_opts.basedir,
@@ -366,7 +366,7 @@ old_11_check_for_sql_identifier_data_type_usage(ClusterInfo *cluster)
 {
 	char		output_path[MAXPGPATH];

-	prep_status("Checking for invalid \"sql_identifier\" user columns");
+	prep_status("Checking for \"sql_identifier\" data type in user tables");

 	snprintf(output_path, sizeof(output_path), "%s/%s",
 			 log_opts.basedir,
--
2.25.1

>From 99ba1903472b72480d4ace13568300f35320cbaf Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Thu, 27 Jul 2023 16:16:45 -0700
Subject: [PATCH v3 2/2] add timing information to pg_upgrade

---
 src/bin/pg_upgrade/util.c | 55 ++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_upgrade/util.c b/src/bin/pg_upgrade/util.c
index 21ba4c8f12..23fd5f87af 100644
--- a/src/bin/pg_upgrade/util.c
+++ b/src/bin/pg_upgrade/util.c
@@ -9,14 +9,19 @@
 #include "postgres_fe.h"

+#include
 #include

 #include "common/username.h"
 #include "pg_upgrade.h"
+#include "portability/instr_time.h"

 LogOpts		log_opts;

+static instr_time step_start;
+
 static void pg_log_v(eLogType type, const char *fmt, va_list ap) pg_attribute_printf(2, 0);
+static char *get_time_str(double time_ms);

 /*
@@ -137,6 +142,8 @@ prep_status(const
char *fmt,...)
 	/* trim strings */
 	pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, message);
+
+	INSTR_TIME_SET_CURRENT(step_start);
 }

 /*
@@ -170,6 +177,8 @@ prep_status_progress(const char *fmt,...)
 		pg_log(PG_REPORT, "%-*s", MESSAGE_WIDTH, message);
 	else
 		pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, message);
+
+	INSTR_TIME_SET_CURRENT(step_start);
 }

 static void
@@ -280,11 +289,55 @@ pg_fatal(const char *fmt,...)
 }

+static char *
+get_time_str(double time_ms)
+{
+	double		seconds;
+	double		minutes;
+	double		hours;
+	double		days;
+
+	if (time_ms < 1000.0)
+		return psprintf(_("%.3f ms"), time_ms);
+
+	seconds = time_ms / 1000.0;
+	minutes = floor(seconds / 60.0);
+	seconds -= 60.0 * minutes;
+	if (minutes < 60.0)
+		return psprintf(_("%.3f ms (%02d:%06.3f)"),
+						time_ms, (int) minutes, seconds);
+
+	hours =
Re: should frontend tools use syncfs() ?
On Sat, Jul 29, 2023 at 02:40:10PM -0700, Nathan Bossart wrote:
> I was about to start a new thread, but I found this one with some good
> preliminary discussion. I came to the same conclusion about introducing a
> new option instead of using syncfs() by default wherever it is available.
> The attached patch is still a work-in-progress, but it seems to behave as
> expected. I began investigating this because I noticed that the
> sync-data-directory step on pg_upgrade takes quite a while when there are
> many files, and I am looking for ways to reduce the amount of downtime
> required for pg_upgrade.
>
> The attached patch adds a new --sync-method option to the relevant frontend
> utilities, but I am not wedded to that name/approach.

Here is a new version of the patch with documentation updates and a couple other small improvements.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

>From 6b65605c90703ee3aebca0b90c771c9f14b59545 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Fri, 28 Jul 2023 15:56:24 -0700
Subject: [PATCH v2 1/1] allow syncfs in frontend utilities

---
 doc/src/sgml/ref/initdb.sgml          | 27
 doc/src/sgml/ref/pg_basebackup.sgml   | 30 +
 doc/src/sgml/ref/pg_checksums.sgml    | 27
 doc/src/sgml/ref/pg_dump.sgml         | 27
 doc/src/sgml/ref/pg_rewind.sgml       | 27
 doc/src/sgml/ref/pgupgrade.sgml       | 29 +
 src/bin/initdb/initdb.c               | 11 +++-
 src/bin/pg_basebackup/pg_basebackup.c | 10 ++-
 src/bin/pg_checksums/pg_checksums.c   | 8 ++-
 src/bin/pg_dump/pg_backup.h           | 4 +-
 src/bin/pg_dump/pg_backup_archiver.c  | 13 ++--
 src/bin/pg_dump/pg_backup_archiver.h  | 1 +
 src/bin/pg_dump/pg_backup_directory.c | 2 +-
 src/bin/pg_dump/pg_dump.c             | 9 ++-
 src/bin/pg_rewind/file_ops.c          | 2 +-
 src/bin/pg_rewind/pg_rewind.c         | 8 +++
 src/bin/pg_rewind/pg_rewind.h         | 2 +
 src/bin/pg_upgrade/option.c           | 12
 src/bin/pg_upgrade/pg_upgrade.c       | 6 +-
 src/bin/pg_upgrade/pg_upgrade.h       | 1 +
 src/common/file_utils.c               | 91 ++-
 src/fe_utils/option_utils.c           | 18 ++
 src/include/common/file_utils.h       | 17 -
 src/include/fe_utils/option_utils.h   | 3 +
 src/include/storage/fd.h              | 4 ++
 src/tools/pgindent/typedefs.list      | 1 +
 26 files changed, 369 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 22f1011781..872fef5705 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -365,6 +365,33 @@ PostgreSQL documentation

+
+ --sync-method
+
+
+When set to fsync, which is the default,
+initdb will recursively open and synchronize all
+files in the data directory. The search for files will follow symbolic
+links for the WAL directory and each configured tablespace.
+
+
+On Linux, syncfs may be used instead to ask the
+operating system to synchronize the whole file systems that contain the
+data directory, the WAL files, and each tablespace. This may be a lot
+faster than the fsync setting, because it doesn't
+need to open each file one by one. On the other hand, it may be slower
+if a file system is shared by other applications that modify a lot of
+files, since those files will also be written to disk. Furthermore, on
+versions of Linux before 5.8, I/O errors encountered while writing data
+to disk may not be reported to initdb, and relevant
+error messages may appear only in kernel logs.
+
+
+This option has no effect when --no-sync is used.
+
+
+

 -S
 --sync-only

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 79d3e657c3..af8eb43251 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -594,6 +594,36 @@ PostgreSQL documentation

+
+ --sync-method
+
+
+When set to fsync, which is the default,
+pg_basebackup will recursively open and synchronize
+all files in the backup directory. When the plain format is used, the
+search for files will follow symbolic links for the WAL directory and
+each configured tablespace.
+
+
+On Linux, syncfs may be used instead to ask the
+operating system to synchronize the whole file system that contains the
+backup directory. When the plain format is used,
+pg_basebackup will also synchronize the file systems
+that contain the WAL files and each tablespace. This may be a lot
+faster than the fsync setting, because it doesn't
+need to open each file one by one. On the other hand, it may be slower
+
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On 7/31/23 12:53, Robert Haas wrote:

On Fri, Jun 30, 2023 at 3:41 AM Jeff Davis wrote:

I'm not sure that everyone in this thread realizes just how broken it is to depend on search_path in a functional index at all. And doubly so if it depends on a schema other than pg_catalog in the search_path.

Let's also not forget that logical replication always uses search_path=pg_catalog, so if you depend on a different search_path for any function attached to the table (not just functional indexes, also functions inside expressions or trigger functions), then those are already broken in version 15. And if a superuser is executing maintenance commands, there's little reason to think they'll have the same search path as the user that created the table.

At some point in the very near future (though I realize that point may come after version 16), we need to lock down the search path in a lot of cases (not just maintenance commands), and I don't see any way around that.

I agree. I think there are actually two interrelated problems here. One is that virtually all code needs to run with the originally intended search_path rather than some search_path chosen at another time and maybe by a different user. If not, it's going to break, or compromise security, depending on the situation. The other is that running arbitrary code written by somebody else as yourself is basically instant death, from a security perspective.

I agree too. But the analysis of the issue needs to go one step further. Even if the search_path does not change from the originally intended one, a newly created function can shadow the intended one based on argument coercion rules.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Jul 31, 2023 at 12:24 PM Alena Rybakina wrote:
> I noticed that you are going to add CNF->DNF transformation at the index
> construction stage. If I understand correctly, you will rewrite
> restrictinfo node,
> change boolean "AND" expressions to "OR" expressions, but would it be
> possible to apply such a procedure earlier?

Sort of. I haven't really added any new CNF->DNF transformations. The code you're talking about is really just checking that every index path has clauses that we know that nbtree can handle. That's a big, ugly modularity violation -- many of these details are quite specific to the nbtree index AM (in theory we could have other index AMs that are amsearcharray).

At most, v1 of the patch makes greater use of an existing transformation that takes place in the nbtree index AM, as it preprocesses scan keys for these types of queries (it's not inventing new transformations at all). This is a slightly creative interpretation, too. Tom's commit 9e8da0f7 didn't actually say anything about CNF/DNF.

> Otherwise I suppose you
> could face the problem of
> incorrect selectivity of the calculation and, consequently, the
> cardinality calculation?

I can't think of any reason why that should happen as a direct result of what I have done here. Multi-column index paths + multiple SAOP clauses are not a new thing. The number of rows returned does not depend on whether we have some columns as filter quals or not.

Of course that doesn't mean that the costing has no problems. The costing definitely has several problems right now. It also isn't necessarily okay that it's "just as good as before" if it turns out that it needs to be better now. But I don't see why it would be. (Actually, my hope is that selectivity estimation might be *less* important as a practical matter with the patch.)

> I can't clearly understand at what stage it is clear that the such a
> transformation needs to be applied?

I don't know either.
I think that most of this work needs to take place in the nbtree code, during preprocessing. But it's not so simple. There is a mutual dependency between the code that generates index paths in the planner and nbtree scan key preprocessing. The planner needs to know what kinds of index paths are possible/safe up-front, so that it can choose the fastest plan (the fastest that the index AM knows how to execute correctly). But, there are lots of small annoying nbtree implementation details that might matter, and can change.

I think we need to have nbtree register a callback, so that the planner can initialize some preprocessing early. I think that we require a "two way conversation" between the planner and the index AM.

--
Peter Geoghegan
Re: Faster "SET search_path"
On Sat, Jul 29, 2023 at 11:59 AM Jeff Davis wrote: > Unfortunately, adding a "SET search_path" clause to functions slows > them down. The attached patches close the performance gap > substantially. I haven't reviewed the code but I like the concept a lot. This is badly needed. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pgsql: Fix search_path to a safe value during maintenance operations.
On Fri, Jun 30, 2023 at 3:41 AM Jeff Davis wrote:
> I'm not sure that everyone in this thread realizes just how broken it
> is to depend on search_path in a functional index at all. And doubly so
> if it depends on a schema other than pg_catalog in the search_path.
>
> Let's also not forget that logical replication always uses
> search_path=pg_catalog, so if you depend on a different search_path for
> any function attached to the table (not just functional indexes, also
> functions inside expressions or trigger functions), then those are
> already broken in version 15. And if a superuser is executing
> maintenance commands, there's little reason to think they'll have the
> same search path as the user that created the table.
>
> At some point in the very near future (though I realize that point may
> come after version 16), we need to lock down the search path in a lot
> of cases (not just maintenance commands), and I don't see any way
> around that.

I agree. I think there are actually two interrelated problems here. One is that virtually all code needs to run with the originally intended search_path rather than some search_path chosen at another time and maybe by a different user. If not, it's going to break, or compromise security, depending on the situation. The other is that running arbitrary code written by somebody else as yourself is basically instant death, from a security perspective.

It's a little hard to imagine a world in which these problems don't exist at all, but it somehow feels like the design of the system pushes you toward doing this stuff incorrectly rather than doing it correctly. For instance, you can imagine a system where when you run CREATE OR REPLACE FUNCTION, the prevailing search_path is captured and automatically included in proconfig.
Then the default behavior would be to run functions and procedures with the search_path that was in effect when they were created, rather than what we actually have currently, which is the search_path in effect at execution time.

It's a little harder to imagine something similar around all the user switching behavior, just because we have so many ways of triggering arbitrary code execution: views, triggers, event triggers, expression indexes, constraints, etc. But you can maybe imagine a system where all code associated with a table is run as the table owner in all cases, regardless of SECURITY INVOKER/DEFINER, which I think would at least close some holes.

The difficulty is that it's a bit hard to imagine making these kinds of definitional changes now, because they'd probably be breaking changes for pretty significant numbers of users. But on the other hand, if we don't start thinking about systemic changes here, it feels like we're just playing whack-a-mole.

--
Robert Haas
EDB: http://www.enterprisedb.com
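[Editor's note: a per-function version of the capture-at-creation behavior described above already exists in PostgreSQL, although it is opt-in rather than the default: CREATE FUNCTION can pin the creation-time search_path into proconfig. The function body and table name below are invented for illustration.]

```sql
-- The search_path in effect at creation time is saved with the function
-- and re-established on every call, regardless of the caller's setting:
CREATE FUNCTION fetch_config(cfg_key text) RETURNS text
    LANGUAGE sql STABLE
    SET search_path FROM CURRENT
    AS $$ SELECT value FROM app_settings WHERE name = cfg_key $$;
```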
Re: POC, WIP: OR-clause support for indexes
Hi!

I think it really helps to speed up the search with similar deep filtering compared to cluster indexes, but do we have cases where we don't use this algorithm because it takes longer than a usual index? I thought about the situation with wide indexes (with a lot of multiple columns) and having a lot of filtering predicates for them.

I think that it should be possible for the optimizer to only use multi-column SAOP index paths when there is at least likely to be some small advantage -- that's definitely my goal. Importantly, we may not really need to accurately model the costs where the new approach turns out to be much faster. The only essential thing is that we avoid cases where the new approach is much slower than the old approach. Which is possible (in at least some cases) by making the runtime behavior adaptive. The best decision that the planner can make may be no decision at all. Better to wait until runtime where at all possible, since that gives us the latest and most accurate picture of things.

But I'm not sure about this, so it seems to me that this is a problem of improper use of indexes rather.

It's hard to say how true that is. Certainly, workloads similar to the TPC-DS benchmark kinda need something like MDAM. It's just not practical to have enough indexes to support every possible query -- the benchmark is deliberately designed to have unpredictable, hard-to-optimize access paths. It seems to require having fewer, more general indexes that can support multi-dimensional access reasonably efficiently.

Of course, with OLTP it's much more likely that the workload will have predictable access patterns. That makes having exactly the right indexes much more practical. So maybe you're right there. But, I still see a lot of value in a design that is as forgiving as possible. Users really like that kind of thing in my experience.
I tend to agree with you, but a runtime estimate cannot give us an accurate picture when using indexes correctly or any other optimizations due to the unstable state of the environment in which the query is executed. I believe that a more complex analysis is needed here.

Currently, the optimizer doesn't recognize multi-column indexes with SAOPs on every column as having a valid sort order, except on the first column. It seems possible that that has consequences for your patch. (I'm really only guessing, though; don't trust anything that I say about the optimizer too much.)

Honestly, I couldn't understand your concerns very well, could you describe it in more detail?

Well, I'm not sure if there is any possible scenario where the transformation from your patch makes it possible to go from an access path that has a valid sort order (usually because there is an underlying index scan) into an access path that doesn't. In fact, the opposite situation seems more likely (which is good news) -- especially if you assume that my own patch is also present.

Going from a bitmap OR (which cannot return sorted output) to a multi-column SAOP index scan (which now can) may have significant value in some specific circumstances. Most obviously, it's really useful when it enables us to feed tuples into a GroupAggregate without a separate sort step, and without a hash aggregate (that's why I see value in combining your patch with my own patch). You just need to be careful about allowing the opposite situation to take place.

More generally, there is a need to think about strange second order effects. We want to be open to useful second order effects that make query execution much faster in some specific context, while avoiding harmful second order effects. Intuitively, I think that it should be possible to do this with the transformations performed by your patch.
In other words, "helpful serendipity" is an important advantage, while "harmful anti-serendipity" is what we really want to avoid. Ideally by making the harmful cases impossible "by construction".

I noticed only one thing there: when we have unsorted array values in a SAOP, the query takes longer than when it has a sorted array. I'll double-check it just in case and write about the results later. I am also gathering some experience with multi-column indexes using SAOPs.

--
Regards,
Alena Rybakina
Postgres Professional
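[Editor's note: for reference, the shape of query this discussion is aiming at looks roughly like the sketch below; the table, index, and values are invented. Today such multiple-SAOP quals tend to become a bitmap OR, which loses the index sort order, whereas a multi-column SAOP index scan, as proposed, could feed the GROUP BY / ORDER BY directly without a separate sort or hash aggregate.]

```sql
CREATE INDEX orders_cust_status_idx ON orders (customer_id, status);

-- Multiple SAOP clauses on successive columns of one index:
SELECT customer_id, status, count(*)
FROM orders
WHERE customer_id = ANY ('{101,205,307}')
  AND status = ANY ('{shipped,returned}')
GROUP BY customer_id, status
ORDER BY customer_id, status;
```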
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, Jul 27, 2023 at 10:00 AM Matthias van de Meent wrote: > My idea is not quite block nested loop join. It's more 'restart the > index scan at the location the previous index scan ended, if > heuristics say there's a good chance that might save us time'. I'd say > it is comparable to the fast tree descent optimization that we have > for endpoint queries, and comparable to this patch's scankey > optimization, but across AM-level rescans instead of internal rescans. Yeah, I see what you mean. Seems related, even though what you've shown in your prototype patch doesn't seem like it fits into my taxonomy very neatly. (BTW, I was a little confused by the use of the term "endpoint" at first, since there is a function that uses that term to refer to a descent of the tree that happens without any insertion scan key. This path is used whenever the best we can do in _bt_first is to descend to the rightmost or leftmost page.) > The basic design of that patch is this: We keep track of how many > times we've rescanned, and the end location of the index scan. If a > new index scan hits the same page after _bt_search as the previous > scan ended, we register that. I can see one advantage that block nested loop join would retain here: it does block-based accesses on both sides of the join. Since it "looks ahead" on both sides of the join, more repeat accesses are likely to be avoided. Not too sure how much that matters in practice, though. -- Peter Geoghegan
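The rescan idea described above - restart at the location the previous index scan ended when heuristics say it might pay off - can be sketched roughly like this (a hypothetical Python model, not the prototype patch's actual code):

```python
# Toy model: remember the leaf page the previous scan ended on. If the next
# rescan would land on the same page, reuse that position instead of doing
# a full descent from the root.

class RescanCache:
    def __init__(self):
        self.last_page = None
        self.hits = 0

    def descend(self, target_page):
        if target_page == self.last_page:
            self.hits += 1          # reuse previous position, no new descent
            return "cached"
        self.last_page = target_page
        return "full-descent"

cache = RescanCache()
cache.descend(7)   # first rescan: full descent
cache.descend(7)   # same page again: cached
```

In the real design the cache hit would also feed back into the heuristic that decides whether caching is worth continuing.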
Re: Request for comment on setting binary format output per session
Dave Cramer On Mon, 10 Jul 2023 at 03:56, Daniel Gustafsson wrote: > > On 25 Apr 2023, at 16:47, Dave Cramer wrote: > > > Patch attached with comments removed > > This patch no longer applies, please submit a rebased version on top of > HEAD. > Rebased see attached > > -- > Daniel Gustafsson > > 0001-Created-protocol.h.patch Description: Binary data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hi, all! CNF -> DNF conversion = Like many great papers, the MDAM paper takes one core idea, and finds ways to leverage it to the hilt. Here the core idea is to take predicates in conjunctive normal form (an "AND of ORs"), and convert them into disjunctive normal form (an "OR of ANDs"). DNF quals are logically equivalent to CNF quals, but ideally suited to SAOP-array style processing by an ordered B-Tree index scan -- they reduce everything to a series of non-overlapping primitive index scans, that can be processed in keyspace order. We already do this today in the case of SAOPs, in effect. The nbtree "next array keys" state machine already materializes values that can be seen as MDAM style DNF single value predicates. The state machine works by outputting the cartesian product of each array as a multi-column index is scanned, but that could be taken a lot further in the future. We can use essentially the same kind of state machine to do everything described in the paper -- ultimately, it just needs to output a list of disjuncts, like the DNF clauses that the paper shows in "Table 3". In theory, anything can be supported via a sufficiently complete CNF -> DNF conversion framework. There will likely always be the potential for unsafe/unsupported clauses and/or types in an extensible system like Postgres, though. So we will probably need to retain some notion of safety. It seems like more of a job for nbtree preprocessing (or some suitably index-AM-agnostic version of the same idea) than the optimizer, in any case. But that's not entirely true, either (that would be far too easy). The optimizer still needs to optimize. It can't very well do that without having some kind of advanced notice of what is and is not supported by the index AM. And, the index AM cannot just unilaterally decide that index quals actually should be treated as filter/qpquals, after all -- it doesn't get a veto. So there is a mutual dependency that needs to be resolved. 
I suspect that there needs to be a two way conversation between the optimizer and nbtree code to break the dependency -- a callback that does some of the preprocessing work during planning. Tom said something along the same lines in passing, when discussing the MDAM paper last year [2]. Much work remains here. Honestly, I'm just reading and delving into this thread and other topics related to it, so excuse me if I ask you a few obvious questions. I noticed that you are going to add the CNF->DNF transformation at the index construction stage. If I understand correctly, you will rewrite the restrictinfo node, changing boolean "AND" expressions to "OR" expressions, but would it be possible to apply such a procedure earlier? Otherwise, I suppose you could face the problem of an incorrect selectivity estimate and, consequently, an incorrect cardinality estimate. I also can't clearly understand at what stage it becomes clear that such a transformation needs to be applied. -- Regards, Alena Rybakina Postgres Professional
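For readers following along, the "cartesian product of each array" behavior that the CNF -> DNF discussion builds on can be sketched in a few lines of Python (a toy model only; the real nbtree state machine materializes these values incrementally rather than up front):

```python
# Hypothetical sketch of the MDAM-style expansion: CNF quals with an array
# on each column, e.g.  a IN (1, 2) AND b IN (10, 20), expand into DNF
# single-value conjuncts that an ordered scan can process as non-overlapping
# primitive index scans in keyspace order.
from itertools import product

def cnf_arrays_to_dnf(column_arrays):
    """Expand per-column SAOP arrays into sorted DNF single-value tuples."""
    sorted_arrays = [sorted(set(vals)) for vals in column_arrays]
    return list(product(*sorted_arrays))

dnf = cnf_arrays_to_dnf([[2, 1], [20, 10]])
print(dnf)  # [(1, 10), (1, 20), (2, 10), (2, 20)] -- keyspace order
```

The combinatorial blow-up visible here is exactly why the real implementation steps through the product with a state machine instead of materializing it.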
Re: proposal: psql: show current user in prompt
On Mon, 24 Jul 2023 at 21:16, Pavel Stehule wrote: > I don't understand how it can be possible to do it without. I need to > process possible errors, and then I need to read and synchronize protocol. I > didn't inject > this feature to some oher flow, so I need to implement a complete process. I think I might be missing the reason for this then. Could you explain a bit more why you didn't inject the feature into another flow? Because it seems like it would be better to insert the logic for handling the new response packet in pqParseInput3(), and then wait on the result with PQexecFinish(). This would allow sending these messages in a pipelined mode. > For example, PQsetClientEncoding does a PQexec call, which is much more > expensive. Yeah, but you'd usually only call that once for the life of the connection. But honestly it would still be good if there was a pipelined version of that function. > Unfortunately, for this feature, I cannot change some local state variables, > but I need to change the state of the server. Own message is necessary, > because we don't want to be limited by the current transaction state, and > then we cannot reuse PQexec. I guess this is your reasoning for why it needs its own state machine, but I don't think I understand the problem. Could you expand a bit more on what you mean? Note that different message types use PQexecFinish to wait for their result, e.g. PQdescribePrepared and PQclosePrepared use PQexecFinish too and those wait for a RowDescription and a Close message respectively. I added the logic for PQclosePrepared recently, that patch might be some helpful example code: https://github.com/postgres/postgres/commit/28b5726561841556dc3e00ffe26b01a8107ee654 > The API can be changed from PQlinkParameterStatus(PGconn *conn, const char > *paramName) > > to > > PQlinkParameterStatus(PGconn *conn, int nParamNames, const char *paramNames) That would definitely address the issue with the many round trips being needed. 
But it would still leave the issue of introducing a second state machine in the libpq code. So if it's possible to combine this new code into the existing state machine, then that seems a lot better.
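The "reuse the existing state machine" suggestion comes down to the FIFO pairing that pipeline mode already relies on: each queued request is matched with the next incoming response. A minimal Python sketch of that pairing (a hypothetical class, not libpq's actual API):

```python
# Hedged sketch of the pipelining idea: instead of a second state machine,
# queue each outgoing request and pair incoming responses with requests in
# FIFO order, the way libpq pairs results with queries in pipeline mode.
from collections import deque

class PipelineConn:
    def __init__(self):
        self.pending = deque()   # requests awaiting a response

    def send(self, request):
        self.pending.append(request)

    def handle_response(self, response):
        """Pair the oldest in-flight request with this response."""
        request = self.pending.popleft()
        return (request, response)

conn = PipelineConn()
conn.send("report search_path")
conn.send("report role")
print(conn.handle_response("public"))  # pairs with the first request sent
```

With this shape, several parameter-report requests can be sent back to back without a round trip per request.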
Re: pg_upgrade fails with in-place tablespace
On Mon, Jul 31, 2023 at 5:36 PM Rui Zhao wrote: > > Hello postgres hackers, > Recently I encountered an issue: pg_upgrade fails when dealing with in-place > tablespace. As we know, pg_upgrade uses pg_dumpall to dump objects and > pg_restore to restore them. The problem seems to be that pg_dumpall is > dumping in-place tablespace as relative path, which can't be restored. > > Here is the error message of pg_upgrade: > psql:/home/postgres/postgresql/src/bin/pg_upgrade/tmp_check/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230729T210058.329/dump/pg_upgrade_dump_globals.sql:36: > ERROR: tablespace location must be an absolute path > > To help reproduce the failure, I have attached a tap test. The test also > fails with tablespace regression, and it change the default value of > allow_in_place_tablespaces to true only for debug, so it may not be fit for > production. However, it is enough to reproduce this failure. > I have also identified a solution for this problem, which I have included in > the patch. The solution has two modifications: > 1) Make the function pg_tablespace_location returns path "" with in-place > tablespace, rather than relative path. Because the path of the in-place > tablespace is always 'pg_tblspc/'. > 2) Only check the tablespace with an absolute path in pg_upgrade. > > There are also other solutions, such as supporting the creation of > relative-path tablespace in function CreateTableSpace. But do we really need > relative-path tablespace? I think in-place tablespace is enough by now. My > solution may be more lightweight and harmless. > > Thank you for your attention to this matter. > > Best regards, > Rui Zhao Seems like allow_in_place_tablespaces is a developer-only GUC, and it is not meant for end-user usage. Check this commit: 7170f2159fb21b62c263acd458d781e2f3c3f8bb -- Regards Junwang Zhao
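The rule proposed in the patch description can be modeled in a few lines (hypothetical helper functions for illustration; the real change touches pg_tablespace_location and pg_upgrade's tablespace check):

```python
# Sketch of the proposed behavior: an in-place tablespace always lives under
# a relative 'pg_tblspc/<oid>' path, so its reported location would be '',
# and pg_upgrade would only validate tablespaces whose reported location is
# an absolute path.
import os

def reported_location(raw_path):
    """Return '' for an in-place (relative) tablespace, else the path."""
    return raw_path if os.path.isabs(raw_path) else ""

def needs_upgrade_check(raw_path):
    """pg_upgrade would only check tablespaces with an absolute location."""
    return reported_location(raw_path) != ""
```

Under this rule the dump no longer emits an unrestorable relative path, which is the failure shown in the error message above.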
Inaccurate comments in ReorderBufferCheckMemoryLimit()
Hi all, While reading the code, I realized that the following code comments might not be accurate:

/*
 * Pick the largest transaction (or subtransaction) and evict it from
 * memory by streaming, if possible.  Otherwise, spill to disk.
 */
if (ReorderBufferCanStartStreaming(rb) &&
	(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
{
	/* we know there has to be one, because the size is not zero */
	Assert(txn && rbtxn_is_toptxn(txn));
	Assert(txn->total_size > 0);
	Assert(rb->size >= txn->total_size);

	ReorderBufferStreamTXN(rb, txn);
}

AFAICS since ReorderBufferLargestStreamableTopTXN() returns only top-level transactions, the comment above the if statement is not right. It would not pick a subtransaction. Also, I'm not sure that the second comment "we know there has to be one, because the size is not zero" is right since there might not be top-transactions that are streamable. I think this is why we say "Otherwise, spill to disk". I've attached a patch to fix these comments. Feedback is very welcome. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com fix_comment.patch Description: Binary data
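A toy Python model of the control flow under discussion (not the actual reorderbuffer code) makes the comment fix concrete: streaming only ever picks a top-level transaction, and spilling is the fallback when no streamable top-level transaction exists:

```python
# Hypothetical model of the memory-limit logic: prefer streaming the largest
# *top-level* streamable transaction; otherwise fall back to spilling the
# largest transaction (top-level or subtransaction) to disk.

def largest_streamable_toptxn(txns):
    tops = [t for t in txns if t["top"] and t["streamable"] and t["size"] > 0]
    return max(tops, key=lambda t: t["size"], default=None)

def pick_victim(txns, can_stream):
    if can_stream:
        txn = largest_streamable_toptxn(txns)
        if txn is not None:
            return ("stream", txn)
    # No streamable top-level transaction: spill the largest one instead.
    victim = max(txns, key=lambda t: t["size"])
    return ("spill", victim)
```

Note that the streamed victim need not be the largest transaction overall, which is exactly why the "largest transaction (or subtransaction)" wording is misleading.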
RE: CDC/ETL system on top of logical replication with pgoutput, custom client
Hi Amit, thanks for the reply. In our worker (custom pg replication client), we care only about INSERT, UPDATE, and DELETE operations, which - sure - may be part of the issue. I can only replicate this with production-level load, so it's not easy to get a real example, but as I'm understanding the issue (and building upon your exposition), we are seeing the following:

T-1
INSERT LSN1-1000
UPDATE LSN2-2000
UPDATE LSN3-3000
COMMIT LSN4-4000

T-2
INSERT LSN1-500
UPDATE LSN2-1500
UPDATE LSN3-2500
COMMIT LSN4-5500

If we miss LSN3-3000 - let's say due to a bad network - and we already received all other LSNs, we will commit LSN4-5500 to Postgres before restarting, and LSN3-3000 will never be reattempted. And there are a couple of issues with this scenario: 1. We have no way to match LSN operations with the respective commit, as they have unordered offsets. Assuming that all of them were received in order, we would commit all data with the commit message LSN4-4000, as the other events would match that transaction's start and end LSN interval. 2. We still have no way to verify that we got all data for a given transaction; we would never notice that LSN3-3000 of the first transaction is missing until we look at and analyze the resulting data. So the question: how can we prevent our worker from committing LSN4-5500 without receiving LSN3-3000? Do we even have enough information out of pgoutput to do that? PS.: when I say bad network, my suspicion is that this situation may be caused by network saturation during high-QPS periods. Data will still arrive eventually, but by that time our worker is no longer listening. Thanks again. Regards, José Neves

From: Amit Kapila
Sent: 31 July 2023 14:31
To: José Neves
Cc: pgsql-hack...@postgresql.org
Subject: Re: CDC/ETL system on top of logical replication with pgoutput, custom client

On Mon, Jul 31, 2023 at 3:06 PM José Neves wrote: > > Hi there, hope to find you well. 
> > I'm attempting to develop a CDC on top of Postgres, currently using 12, the > last minor, with a custom client, and I'm running into issues with data loss > caused by out-of-order logical replication messages. > > The problem is as follows: postgres streams A, B, D, G, K, I, P logical > replication events, upon exit signal we stop consuming new events at LSN K, > and we wait 30s for out-of-order events. Let's say that we only got A, (and K > ofc) so in the following 30s, we get B, D, however, for whatever reason, G > never arrived. As with pgoutput-based logical replication we have no way to > calculate the next LSN, we have no idea that G was missing, so we assumed > that it all arrived, committing K to postgres slot and shutdown. In the next > run, our worker will start receiving data from K forward, and G is lost > forever... > Meanwhile postgres moves forward with archiving and we can't go back to check > if we lost anything. And even if we could, would be extremely inefficient. > > In sum, the issue comes from the fact that postgres will stream events with > unordered LSNs on high transactional systems, and that pgoutput doesn't have > access to enough information to calculate the next or last LSN, so we have no > way to check if we receive all the data that we are supposed to receive, > risking committing an offset that we shouldn't as we didn't receive yet > preceding data. > As per my understanding, we stream the data in the commit LSN order and for a particular transaction, all the changes are per their LSN order. Now, it is possible that for a parallel transaction, we send some changes from a prior LSN after sending the commit of another transaction. Say we have changes as follows: T-1 change1 LSN1-1000 change2 LSN2- 2000 commit LSN3- 3000 T-2 change1 LSN1-500 change2 LSN2-1500 commit LSN3-4000 In such a case, all the changes including the commit of T-1 are sent and then all the changes including the commit of T-2 are sent. 
So, one can say that some of the changes from T-2 from prior LSN arrived after T-1's commit but that shouldn't be a problem because if restart happens after we received partial T-2, we should receive the entire T-2. It is possible that you are seeing something else but if so then please try to share a more concrete example. -- With Regards, Amit Kapila.
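Since pgoutput delivers each transaction's changes grouped between Begin and Commit messages, in commit order, a consumer can sidestep the problem by acknowledging only at commit boundaries: an interrupted transaction is simply re-delivered in full after a restart. A hypothetical Python sketch of such a consumer (illustrative only, not a real replication client):

```python
# Toy consumer: buffer changes per transaction between Begin and Commit,
# and only advance the confirmed LSN to a commit's end LSN once the whole
# transaction has been applied. Nothing is acknowledged mid-transaction.

class TxnConsumer:
    def __init__(self):
        self.open_txns = {}      # xid -> buffered (lsn, change) pairs
        self.confirmed_lsn = 0   # only moves at commit boundaries

    def on_begin(self, xid):
        self.open_txns[xid] = []

    def on_change(self, xid, lsn, change):
        self.open_txns[xid].append((lsn, change))

    def on_commit(self, xid, end_lsn):
        changes = self.open_txns.pop(xid)
        # apply(changes) would go here; only then acknowledge.
        self.confirmed_lsn = max(self.confirmed_lsn, end_lsn)
        return changes

c = TxnConsumer()
c.on_begin(1)
c.on_change(1, 1000, "insert")
c.on_change(1, 3000, "update")
c.on_commit(1, 4000)   # confirmed LSN advances only here
```

With this shape, missing one change means the commit is never processed, so the slot position never skips past the gap.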
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
On Sat, Jul 29, 2023, at 8:07 PM, José Neves wrote: > I'm attempting to develop a CDC on top of Postgres, currently using 12, the > last minor, with a custom client, and I'm running into issues with data loss > caused by out-of-order logical replication messages. Can you provide a test case to show this issue? Did you try in a newer version? > The problem is as follows: postgres streams *A, B, D, G, K, I, P *logical > replication events, upon exit signal we stop consuming new events at LSN *K,* > and we wait 30s for out-of-order events. Let's say that we only got *A*, (and > *K* ofc) so in the following 30s, we get *B, D*, however, for whatever > reason, *G* never arrived. As with pgoutput-based logical replication we have > no way to calculate the next LSN, we have no idea that *G* was missing, so we > assumed that it all arrived, committing *K *to postgres slot and shutdown. In > the next run, our worker will start receiving data from *K* forward, and *G* > is lost forever... > Meanwhile postgres moves forward with archiving and we can't go back to check > if we lost anything. And even if we could, would be extremely inefficient. Logical decoding provides the changes to output plugin at commit time. You mentioned the logical replication events but didn't say which are part of the same transaction. Let's say A, B, D and K are changes from the same transaction and G, I and P are changes from another transaction. The first transaction will be available when it processes K. The second transaction will be provided when the logical decoding processes P. You didn't say how your consumer is working. Are you sure your consumer doesn't get the second transaction? If your consumer is advancing the replication slot *after* receiving K (using pg_replication_slot_advance), it is doing it wrong. Another common problem with consumer is that it uses pg_logical_slot_get_changes() but *before* using the data it crashes; in this case, the data is lost. 
It is hard to say where the problem is if you didn't provide enough information about the consumer logic and the WAL information (pg_waldump output) around the time you detect the data loss. > In sum, the issue comes from the fact that postgres will stream events with > unordered LSNs on high transactional systems, and that pgoutput doesn't have > access to enough information to calculate the next or last LSN, so we have no > way to check if we receive all the data that we are supposed to receive, > risking committing an offset that we shouldn't as we didn't receive yet > preceding data. > > It seems very either to me that none of the open-source CDC projects that I > looked into care about this. They always assume that the next LSN received > is... well the next one, and commit that one, so upon restart, they are > vulnerable to the same issue. So... either I'm missing something... or we > have a generalized assumption causing data loss under certain conditions all > over. Let me illustrate the current behavior. Let's say there are 3 concurrent transactions. 
Session A
=========
euler=# SELECT pg_create_logical_replication_slot('repslot1', 'wal2json');
 pg_create_logical_replication_slot
 (repslot1,0/369DF088)
(1 row)

euler=# create table foo (a int primary key);
CREATE TABLE
euler=# BEGIN;
BEGIN
euler=*# INSERT INTO foo (a) SELECT generate_series(1, 2);
INSERT 0 2

Session B
=========
euler=# BEGIN;
BEGIN
euler=*# INSERT INTO foo (a) SELECT generate_series(11, 12);
INSERT 0 2

Session C
=========
euler=# BEGIN;
BEGIN
euler=*# INSERT INTO foo (a) SELECT generate_series(21, 22);
INSERT 0 2

Session A
=========
euler=*# INSERT INTO foo (a) VALUES(3);
INSERT 0 1

Session B
=========
euler=*# INSERT INTO foo (a) VALUES(13);
INSERT 0 1

Session C
=========
euler=*# INSERT INTO foo (a) VALUES(23);
INSERT 0 1
euler=*# COMMIT;
COMMIT

Session B
=========
euler=*# COMMIT;
COMMIT

Session A
=========
euler=*# COMMIT;
COMMIT

The output is:

euler=# SELECT * FROM pg_logical_slot_peek_changes('repslot1', NULL, NULL, 'format-version', '2', 'include-types', '0');
    lsn     |  xid   | data
------------+--------+------
 0/369E4800 | 454539 | {"action":"B"}
 0/36A05088 | 454539 | {"action":"C"}
 0/36A05398 | 454542 | {"action":"B"}
 0/36A05398 | 454542 | {"action":"I","schema":"public","table":"foo","columns":[{"name":"a","value":21}]}
 0/36A05418 | 454542 | {"action":"I","schema":"public","table":"foo","columns":[{"name":"a","value":22}]}
 0/36A05658 | 454542 | {"action":"I","schema":"public","table":"foo","columns":[{"name":"a","value":23}]}
 0/36A057C0 | 454542 | {"action":"C"}
 0/36A05258 | 454541 | {"action":"B"}
 0/36A05258 | 454541 | {"action":"I","schema":"public","table":"foo","columns":[{"name":"a","value":11}]}
 0/36A052D8 | 454541 |
Re: Improve join_search_one_level readibilty (one line change)
Hi, On Wed, Jun 07, 2023 at 11:02:09AM +0800, 謝東霖 wrote: > Thank you, Julien, for letting me know that cfbot doesn't test txt files. > Much appreciated! Thanks for posting this v2! So unsurprisingly the cfbot is happy with this patch, since it doesn't change the behavior at all. I just have some nitpicking:

@@ -109,14 +109,14 @@ join_search_one_level(PlannerInfo *root, int level)
 	List	   *other_rels_list;
 	ListCell   *other_rels;

+	other_rels_list = joinrels[1];
+
 	if (level == 2)		/* consider remaining initial rels */
 	{
-		other_rels_list = joinrels[level - 1];
 		other_rels = lnext(other_rels_list, r);
 	}
 	else				/* consider all initial rels */
 	{
-		other_rels_list = joinrels[1];
 		other_rels = list_head(other_rels_list);
 	}

Since each branch only has a single instruction after the change, the curly braces aren't needed anymore. The only reason to keep them is if it helps readability (like if there's a big comment associated), but that's not the case here so it would be better to get rid of them. Apart from that, +1 from me for the patch; I think it helps focus attention on what actually matters here.
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
On Mon, Jul 31, 2023 at 3:06 PM José Neves wrote: > > Hi there, hope to find you well. > > I'm attempting to develop a CDC on top of Postgres, currently using 12, the > last minor, with a custom client, and I'm running into issues with data loss > caused by out-of-order logical replication messages. > > The problem is as follows: postgres streams A, B, D, G, K, I, P logical > replication events, upon exit signal we stop consuming new events at LSN K, > and we wait 30s for out-of-order events. Let's say that we only got A, (and K > ofc) so in the following 30s, we get B, D, however, for whatever reason, G > never arrived. As with pgoutput-based logical replication we have no way to > calculate the next LSN, we have no idea that G was missing, so we assumed > that it all arrived, committing K to postgres slot and shutdown. In the next > run, our worker will start receiving data from K forward, and G is lost > forever... > Meanwhile postgres moves forward with archiving and we can't go back to check > if we lost anything. And even if we could, would be extremely inefficient. > > In sum, the issue comes from the fact that postgres will stream events with > unordered LSNs on high transactional systems, and that pgoutput doesn't have > access to enough information to calculate the next or last LSN, so we have no > way to check if we receive all the data that we are supposed to receive, > risking committing an offset that we shouldn't as we didn't receive yet > preceding data. > As per my understanding, we stream the data in the commit LSN order and for a particular transaction, all the changes are per their LSN order. Now, it is possible that for a parallel transaction, we send some changes from a prior LSN after sending the commit of another transaction. 
Say we have changes as follows: T-1 change1 LSN1-1000 change2 LSN2- 2000 commit LSN3- 3000 T-2 change1 LSN1-500 change2 LSN2-1500 commit LSN3-4000 In such a case, all the changes including the commit of T-1 are sent and then all the changes including the commit of T-2 are sent. So, one can say that some of the changes from T-2 from prior LSN arrived after T-1's commit but that shouldn't be a problem because if restart happens after we received partial T-2, we should receive the entire T-2. It is possible that you are seeing something else but if so then please try to share a more concrete example. -- With Regards, Amit Kapila.
Re: Adding a LogicalRepWorker type field
On Mon, Jul 31, 2023 at 3:25 PM Bharath Rupireddy wrote: > > On Mon, Jul 31, 2023 at 7:17 AM Peter Smith wrote: > > > > PROBLEM: > > > > IMO, deducing the worker's type by examining multiple different field > > values seems a dubious way to do it. This maybe was reasonable enough > > when there were only 2 types, but as more get added it becomes > > increasingly complicated. > > +1 for being more explicit in worker types. > +1. BTW, do we need the below functions (am_tablesync_worker(), am_leader_apply_worker()) after this work?

 static inline bool
 am_tablesync_worker(void)
 {
-	return OidIsValid(MyLogicalRepWorker->relid);
+	return isTablesyncWorker(MyLogicalRepWorker);
 }

 static inline bool
 am_leader_apply_worker(void)
 {
-	return (!am_tablesync_worker() &&
-			!isParallelApplyWorker(MyLogicalRepWorker));
+	return isLeaderApplyWorker(MyLogicalRepWorker);
 }

-- With Regards, Amit Kapila.
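The shape of the refactoring - one explicit type field instead of deducing the type from several other fields - can be illustrated with a small Python analogue (hypothetical names, not the C patch itself):

```python
# Toy analogue of the worker-type refactoring: store an explicit worker type
# so checks like am_tablesync_worker() become a single comparison, instead of
# inferring the type from relid / parallel-apply fields.
from enum import Enum, auto

class WorkerType(Enum):
    TABLESYNC = auto()
    LEADER_APPLY = auto()
    PARALLEL_APPLY = auto()

class LogicalRepWorker:
    def __init__(self, wtype):
        self.type = wtype

def am_tablesync_worker(w):
    return w.type is WorkerType.TABLESYNC

def am_leader_apply_worker(w):
    # Previously: "not tablesync and not parallel apply"; now one check.
    return w.type is WorkerType.LEADER_APPLY
```

Adding a fourth worker type then means adding an enum member, not revisiting every deduction site.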
Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails
On 2023-Jul-05, Jehan-Guillaume de Rorthais wrote: > ALTER TABLE r ATTACH PARTITION r_1 FOR VALUES IN (1); > > The old sub-FKs (below 18289) created in this table to enforce the action > triggers on referenced partitions are not deleted when the table becomes a > partition. Because of this, we have additional and useless triggers on the > referenced partitions and we can not DETACH this partition on the referencing > side anymore: Oh, hm, interesting. Thanks for the report and patch. I found a couple of minor issues with it (most serious one: nkeys should be 3, not 2; also sysscan should use conrelid index), but I'll try and complete it so that it's ready for 2023-08-10's releases. Regards -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: New PostgreSQL Contributors
On Fri, 28 Jul 2023 at 17:29, Christoph Berg wrote: > > The PostgreSQL contributors team has been looking over the community > activity and, over the first half of this year, has been recognizing > new contributors to be listed on > > https://www.postgresql.org/community/contributors/ > > New PostgreSQL Contributors: > > Ajin Cherian > Alexander Kukushkin > Alexander Lakhin > Dmitry Dolgov > Hou Zhijie > Ilya Kosmodemiansky > Melanie Plageman > Michael Banck > Michael Brewer > Paul Jungwirth > Peter Smith > Vigneshwaran C > > New PostgreSQL Major Contributors: > > Julien Rouhaud > Stacey Haysler > Steve Singer > > Thank you all for contributing to PostgreSQL to make it such a great project! +1, thank you all, and congratulations! Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: stats test intermittent failure
Hi, On Tue, Jul 11, 2023 at 3:35 AM Melanie Plageman wrote: > > Hi, > > Jeff pointed out that one of the pg_stat_io tests has failed a few times > over the past months (here on morepork [1] and more recently here on > francolin [2]). > > Failing test diff for those who prefer not to scroll: >
> +++ /home/bf/bf-build/francolin/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out 2023-07-07 18:48:25.976313231 +
> @@ -1415,7 +1415,7 @@
>  :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
>  ?column? | ?column?
> ----------+----------
> - t        | t
> + t        | f
>
> My theory about the test failure is that, when there is enough demand > for shared buffers, the flapping test fails because it expects buffer > access strategy *reuses* and concurrent queries already flushed those > buffers before they could be reused. Attached is a patch which I think > will fix the test while keeping some code coverage. If we count > evictions and reuses together, those should have increased. >

Yeah, I've not reproduced this issue but it's possible. IIUC if we get the buffer from the ring, we count an I/O as "reuse" even if the buffer has already been flushed/replaced. However, if the buffer in the ring is pinned by other backends, we end up evicting a buffer from outside of the ring and adding it to the ring, which is counted as "eviction". Regarding the patch, I have a comment:

-- Test that reuse of strategy buffers and reads of blocks into these reused
--- buffers while VACUUMing are tracked in pg_stat_io.
+-- buffers while VACUUMing are tracked in pg_stat_io. If there is sufficient
+-- demand for shared buffers from concurrent queries, some blocks may be
+-- evicted from the strategy ring before they can be reused. In such cases
+-- this, the backend will evict a block from a shared buffer outside of the
+-- ring and add it to the ring. This is considered an eviction and not a reuse.

The new comment seems not to be accurate if my understanding is correct. 
How about the following? Test that reuse of strategy buffers and reads of blocks into these reused buffers while VACUUMing are tracked in pg_stat_io. If there is sufficient demand for shared buffers from concurrent queries, some buffers may be pinned by other backends before they can be reused. In such cases, the backend will evict a buffer from a shared buffer outside of the ring and add it to the ring. This is considered an eviction and not a reuse. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
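A toy model of the accounting being described (not the real bufmgr code): whether an I/O counts as a "reuse" or an "eviction" depends on whether the ring buffer could be reclaimed, which is why the more robust test checks that reuses plus evictions increased rather than reuses alone:

```python
# Hypothetical sketch: if the ring slot's buffer is unpinned it is reused
# (counted as a "reuse"); if another backend has it pinned, a buffer from
# outside the ring is evicted and added to the ring instead (counted as an
# "eviction").

def get_strategy_buffer(ring_buffer, stats):
    if ring_buffer["pins"] == 0:
        stats["reuses"] += 1
        return ring_buffer
    stats["evictions"] += 1
    return {"pins": 0}   # stand-in for a buffer evicted from outside the ring

stats = {"reuses": 0, "evictions": 0}
get_strategy_buffer({"pins": 0}, stats)   # unpinned: reuse
get_strategy_buffer({"pins": 1}, stats)   # pinned elsewhere: eviction
```

Either branch increments one of the two counters, so their sum always grows under strategy I/O even when concurrent activity steals the reuses.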
Re: Avoid undefined behavior with msvc compiler (src/include/port/pg_bitutils.h)
On Mon, Jul 31, 2023 at 5:57 PM Tom Lane wrote: > > John Naylor writes: > > Works for me, so done that way for both forward and reverse variants. Since > > the return value is no longer checked in any builds, I thought about > > removing the variable containing it, but it seems best to leave it behind > > for clarity since these are not our functions. > > Hmm, aren't you risking "variable is set but not used" warnings? > Personally I'd have made these like > > (void) _BitScanReverse(, word); I'd reasoned that such a warning would have showed up in non-assert builds already, but I neglected to consider that those could have different warning settings. For the time being drongo shows green, at least, but we'll see. -- John Naylor EDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On 7/31/23 11:25, Amit Kapila wrote: > On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra > wrote: >> >> On 7/28/23 14:44, Ashutosh Bapat wrote: >>> On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra >>> wrote: Anyway, I was thinking about this a bit more, and it seems it's not as difficult to use the page LSN to ensure sequences don't go backwards. The 0005 change does that, by: 1) adding pg_sequence_state, that returns both the sequence state and the page LSN 2) copy_sequence returns the page LSN 3) tablesync then sets this LSN as origin_startpos (which for tables is just the LSN of the replication slot) AFAICS this makes it work - we start decoding at the page LSN, so that we skip the increments that could lead to the sequence going backwards. >>> >>> I like this design very much. It makes things simpler than complex. >>> Thanks for doing this. >>> >> >> I agree it seems simpler. It'd be good to try testing / reviewing it a >> bit more, so that it doesn't misbehave in some way. >> > > Yeah, I also think this needs a review. This is a sort of new concept > where we don't use the LSN of the slot (for cases where copy returned > a larger value of LSN) or a full_snapshot created corresponding to the > sync slot by Walsender. For the case of the table, we build a full > snapshot because we use that for copying the table but why do we need > to build that for copying the sequence especially when we directly > copy it from the sequence relation without caring for any snapshot? > We need the slot to decode/apply changes during catchup. The main subscription may get ahead, and we need to ensure the WAL is not discarded or something like that. This applies even if the initial sync step does not use the slot/snapshot directly. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
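A rough sketch of the design being reviewed (hypothetical function names, not the patch's actual code): the copy step reports the page LSN of the copied sequence state, and tablesync uses that LSN as origin_startpos so decoding skips increments already reflected in the copy:

```python
# Toy model: starting decoding at the copied page's LSN means increments
# that happened at or before the copy are skipped, so the synced sequence
# can never move backwards.

def copy_sequence(seq):
    """Copy the current state and report the page LSN it was read at."""
    return dict(seq["state"]), seq["page_lsn"]

def start_tablesync(seq):
    state, page_lsn = copy_sequence(seq)
    origin_startpos = page_lsn      # decode only changes after the copy
    return state, origin_startpos

def should_apply(change_lsn, origin_startpos):
    return change_lsn > origin_startpos
```

This mirrors how table sync uses the slot's LSN as its start position, just with the sequence page's LSN standing in for it.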
Re: Avoid undefined behavior with msvc compiler (src/include/port/pg_bitutils.h)
John Naylor writes: > On Sun, Jul 30, 2023 at 9:45 PM Tom Lane wrote: >> That's basically equivalent to the existing Assert(non_zero). >> I think it'd be okay to drop that one and instead have >> the same Assert condition as other platforms, but having both >> would be redundant. > Works for me, so done that way for both forward and reverse variants. Since > the return value is no longer checked in any builds, I thought about > removing the variable containing it, but it seems best to leave it behind > for clarity since these are not our functions. Hmm, aren't you risking "variable is set but not used" warnings? Personally I'd have made these like (void) _BitScanReverse(, word); Anybody trying to understand this code is going to have to look up the man page for _BitScanReverse anyway, so I'm not sure that keeping the variable adds much readability. However, if no version of MSVC actually issues such a warning, it's moot. regards, tom lane
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 3:54 PM Michael Paquier wrote:
>
> On Mon, Jul 31, 2023 at 05:10:21PM +0900, Kyotaro Horiguchi wrote:
> > +/*
> > + * Return the name of a wait event ID for an extension.
> > + */
> > +static const char *
> > +GetWaitEventExtensionIdentifier(uint16 eventId)
> >
> > This looks inconsistent. Shouldn't it be GetWaitEventExtensionName()?
>
> This is an inspiration from GetLWLockIdentifier(), which is kind of OK
> by me. If there is a consensus in changing that, fine by me, of
> course.

+1 to GetWaitEventExtensionIdentifier for consistency with LWLock's
counterpart.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 05:10:21PM +0900, Kyotaro Horiguchi wrote:
> +/*
> + * Return the name of a wait event ID for an extension.
> + */
> +static const char *
> +GetWaitEventExtensionIdentifier(uint16 eventId)
>
> This looks inconsistent. Shouldn't it be GetWaitEventExtensionName()?

This is an inspiration from GetLWLockIdentifier(), which is kind of OK
by me. If there is a consensus in changing that, fine by me, of course.

-- Michael signature.asc Description: PGP signature
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 01:37:49PM +0530, Bharath Rupireddy wrote: > Do you think it's worth adding a note here in the docs about an > external module defining more than one custom wait event? A pseudo > code if possible or just a note? Also, how about a XXX comment atop > WaitEventExtensionNew and/or WaitEventExtensionRegisterName on the > possibility of extending the functions to support allocation of more > than one custom wait events? I am not sure that any of that is necessary. Anyway, I have applied v11 to get the basics done. Now, I agree that a WaitEventExtensionMultiple() may come in handy, particularly for postgres_fdw that uses WAIT_EVENT_EXTENSION three times. -- Michael signature.asc Description: PGP signature
Missing comments/docs about custom scan path
Hi, While working on [1], I noticed $SUBJECT: commit e7cb7ee14 failed to update comments for the CustomPath struct in pathnodes.h, and commit f49842d1e failed to update docs about custom scan path callbacks in custom-scan.sgml, IIUC. Attached are patches for updating these, which I created separately for ease of review (patch update-custom-scan-path-comments.patch for the former and patch update-custom-scan-path-docs.patch for the latter). In the second patch I used almost the same text as for the ReparameterizeForeignPathByChild callback function in fdwhandler.sgml. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAPmGK16ah9JtyVPtdqu6d%3DQGkRX%3DRAzoYQfX7%3DLZ%2BKnqwBfftg%40mail.gmail.com update-custom-scan-path-comments.patch Description: Binary data update-custom-scan-path-docs.patch Description: Binary data
Re: Adding a LogicalRepWorker type field
On Mon, Jul 31, 2023 at 7:17 AM Peter Smith wrote:
>
> PROBLEM:
>
> IMO, deducing the worker's type by examining multiple different field
> values seems a dubious way to do it. This maybe was reasonable enough
> when there were only 2 types, but as more get added it becomes
> increasingly complicated.

+1 for being more explicit in worker types. I also think that we must
move away from am_{parallel_apply, tablesync, leader_apply}_worker(void)
to Is{ParallelApply, TableSync, LeaderApply}Worker(MyLogicalRepWorker),
just to be a little more consistent and to reduce confusion around the
different logical replication worker type related functions.

> AFAIK none of the complications is necessary anyway; the worker type
> is already known at launch time, and it never changes during the life
> of the process.
>
> The attached Patch 0001 introduces a new enum 'type' field, which is
> assigned during the launch.
>
> This change not only simplifies the code, but it also permits other
> code optimizations, because now we can switch on the worker enum type,
> instead of using cascading if/else. (see Patch 0002).

Some comments:

1.
+/* Different kinds of workers */
+typedef enum LogicalRepWorkerType
+{
+    LR_WORKER_TABLESYNC,
+    LR_WORKER_APPLY,
+    LR_WORKER_APPLY_PARALLEL
+} LogicalRepWorkerType;

Can these names be more readable? How about something like
LR_TABLESYNC_WORKER, LR_APPLY_WORKER, LR_PARALLEL_APPLY_WORKER?

2.
-#define isParallelApplyWorker(worker) ((worker)->leader_pid != InvalidPid)
+#define isLeaderApplyWorker(worker) ((worker)->type == LR_WORKER_APPLY)
+#define isParallelApplyWorker(worker) ((worker)->type == LR_WORKER_APPLY_PARALLEL)
+#define isTablesyncWorker(worker) ((worker)->type == LR_WORKER_TABLESYNC)

Can the above start with "Is" instead of "is", similar to
IsLogicalWorker and IsLogicalParallelApplyWorker?

3.
+    worker->type =
+        OidIsValid(relid) ? LR_WORKER_TABLESYNC :
+        is_parallel_apply_worker ? LR_WORKER_APPLY_PARALLEL :
+        LR_WORKER_APPLY;

Perhaps, an if-else is better for readability?

if (OidIsValid(relid))
    worker->type = LR_WORKER_TABLESYNC;
else if (is_parallel_apply_worker)
    worker->type = LR_WORKER_APPLY_PARALLEL;
else
    worker->type = LR_WORKER_APPLY;

4.
+/* Different kinds of workers */
+typedef enum LogicalRepWorkerType
+{
+    LR_WORKER_TABLESYNC,
+    LR_WORKER_APPLY,
+    LR_WORKER_APPLY_PARALLEL
+} LogicalRepWorkerType;

Have a LR_WORKER_UNKNOWN = 0 and set it in logicalrep_worker_cleanup()?

5.
+    case LR_WORKER_APPLY:
+        return (rel->state == SUBREL_STATE_READY ||
+                (rel->state == SUBREL_STATE_SYNCDONE &&
+                 rel->statelsn <= remote_final_lsn));

Not necessary, but a good idea to have a default: clause and error out
saying wrong logical replication worker type?

6.
+    case LR_WORKER_APPLY_PARALLEL:
+        /*
+         * Skip for parallel apply workers because they only operate on tables
+         * that are in a READY state. See pa_can_start() and
+         * should_apply_changes_for_rel().
+         */
+        break;

I'd rather keep this if-else as-is instead of a switch case with
nothing for the parallel apply worker.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Thu, Jul 27, 2023 at 7:17 AM David Rowley wrote:
>
> It would be really good if someone with another newish Intel CPU
> could test this too.

I ran the lotsaints test from the last email on an i7-10750H (~3 years
old) and got these results (gcc 13.1, turbo off):

REL_15_STABLE:
latency average = 956.453 ms
latency stddev = 4.854 ms

REL_16_STABLE @ 695f5deb7902 (28-JUL-2023):
latency average = 999.354 ms
latency stddev = 3.611 ms

master @ 39055cb4cc (31-JUL-2023):
latency average = 995.310 ms
latency stddev = 5.176 ms

master + revert c1308ce2d (the replace-palloc0 fix):
latency average = 1080.909 ms
latency stddev = 8.707 ms

master + pg_strtoint_fastpath1.patch:
latency average = 938.146 ms
latency stddev = 9.354 ms

master + pg_strtoint_fastpath2.patch:
latency average = 902.808 ms
latency stddev = 3.957 ms

For me, PG16 seems to regress from PG15, and the second patch seems
faster than the first.

-- John Naylor EDB: http://www.enterprisedb.com
Re: pg_rewind fails with in-place tablespace
Sorry for the delay in responding to this matter, as I have been waiting
for another similar subject to be approved by a moderator.

Upon review, I am satisfied with the proposed solution and believe that
checking for an absolute path is better than hard-coding "pg_tblspc/". I
think we have successfully resolved this issue in the pg_rewind case.

However, I would like to bring your attention to another issue:
pg_upgrade fails with in-place tablespace. That other thread is still
waiting to be approved. I have tested all the tools in src/bin with
in-place tablespaces, and I believe this is the final issue.

Thank you for your understanding and assistance.

Best regards,
Rui Zhao

--
From: Michael Paquier
Sent: Monday, July 31, 2023, 06:49
To: 赵锐(惜元)
Cc: pgsql-hackers ; Thomas Munro
Subject: Re: pg_rewind fails with in-place tablespace

On Fri, Jul 28, 2023 at 04:54:56PM +0900, Michael Paquier wrote:
> I am finishing with the attached. Thoughts?

Applied this one as bf22792 on HEAD, without a backpatch as in-place
tablespaces are around for developers. If there are opinions in favor
of a backpatch, feel free of course.

-- Michael
CDC/ETL system on top of logical replication with pgoutput, custom client
Hi there, I hope this finds you well.

I'm attempting to develop a CDC on top of Postgres, currently using 12
(the latest minor release), with a custom client, and I'm running into
issues with data loss caused by out-of-order logical replication
messages.

The problem is as follows: Postgres streams logical replication events
A, B, D, G, K, I, P. Upon an exit signal, we stop consuming new events
at LSN K, and we wait 30s for out-of-order events. Let's say that we
only got A (and K, of course), so in the following 30s we get B and D;
however, for whatever reason, G never arrived. As with pgoutput-based
logical replication we have no way to calculate the next LSN, we have no
idea that G was missing, so we assumed that it all arrived, committing K
to the Postgres slot and shutting down. In the next run, our worker will
start receiving data from K forward, and G is lost forever... Meanwhile
Postgres moves forward with archiving, and we can't go back to check if
we lost anything. And even if we could, it would be extremely
inefficient.

In sum, the issue comes from the fact that Postgres will stream events
with unordered LSNs on highly transactional systems, and that pgoutput
doesn't have access to enough information to calculate the next or last
LSN, so we have no way to check whether we received all the data that we
were supposed to receive, risking committing an offset that we
shouldn't, as we haven't yet received preceding data.

It seems very odd to me that none of the open-source CDC projects that I
looked into care about this. They always assume that the next LSN
received is... well, the next one, and commit that one, so upon restart
they are vulnerable to the same issue. So... either I'm missing
something, or we have a generalized assumption causing data loss under
certain conditions all over.

Am I missing any Postgres mechanism that will allow me to at least
detect that I'm missing data?

Thanks in advance for any clues on how to deal with this. It has been
driving me nuts.

Regards, José Neves
pg_upgrade fails with in-place tablespace
Hello postgres hackers, Recently I encountered an issue: pg_upgrade fails when dealing with in-place tablespaces. As we know, pg_upgrade uses pg_dumpall to dump objects and pg_restore to restore them. The problem seems to be that pg_dumpall dumps an in-place tablespace as a relative path, which can't be restored. Here is the error message from pg_upgrade: psql:/home/postgres/postgresql/src/bin/pg_upgrade/tmp_check/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230729T210058.329/dump/pg_upgrade_dump_globals.sql:36: ERROR: tablespace location must be an absolute path To help reproduce the failure, I have attached a TAP test. The test also fails with the tablespace regression, and it changes the default value of allow_in_place_tablespaces to true only for debugging, so it may not be fit for production. However, it is enough to reproduce this failure. I have also identified a solution for this problem, which I have included in the patch. The solution has two modifications: 1) Make the function pg_tablespace_location return the path "" for an in-place tablespace, rather than a relative path, because the path of an in-place tablespace is always 'pg_tblspc/'. 2) Only check tablespaces with an absolute path in pg_upgrade. There are also other solutions, such as supporting the creation of relative-path tablespaces in the function CreateTableSpace. But do we really need relative-path tablespaces? I think in-place tablespaces are enough for now. My solution may be more lightweight and harmless. Thank you for your attention to this matter. Best regards, Rui Zhao 0001-Fix-pg_upgrade-fails-with-in-place-tablespace.patch Description: Binary data
Re: logical decoding and replication of sequences, take 2
On Sat, Jul 29, 2023 at 5:53 PM Tomas Vondra wrote: > > On 7/28/23 14:44, Ashutosh Bapat wrote: > > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > > wrote: > >> > >> Anyway, I was thinking about this a bit more, and it seems it's not as > >> difficult to use the page LSN to ensure sequences don't go backwards. > >> The 0005 change does that, by: > >> > >> 1) adding pg_sequence_state, that returns both the sequence state and > >>the page LSN > >> > >> 2) copy_sequence returns the page LSN > >> > >> 3) tablesync then sets this LSN as origin_startpos (which for tables is > >>just the LSN of the replication slot) > >> > >> AFAICS this makes it work - we start decoding at the page LSN, so that > >> we skip the increments that could lead to the sequence going backwards. > >> > > > > I like this design very much. It makes things simpler than complex. > > Thanks for doing this. > > > > I agree it seems simpler. It'd be good to try testing / reviewing it a > bit more, so that it doesn't misbehave in some way. > Yeah, I also think this needs a review. This is a sort of new concept where we don't use the LSN of the slot (for cases where copy returned a larger value of LSN) or a full_snapshot created corresponding to the sync slot by Walsender. For the case of the table, we build a full snapshot because we use that for copying the table but why do we need to build that for copying the sequence especially when we directly copy it from the sequence relation without caring for any snapshot? -- With Regards, Amit Kapila.
Re: postgres_fdw: wrong results with self join + enable_nestloop off
On Fri, Jul 28, 2023 at 4:56 PM Etsuro Fujita wrote:
> Cool! I pushed the first patch after polishing it a little bit, so
> here is a rebased version of the second patch, in which I modified the
> ForeignPath and CustomPath cases in reparameterize_path_by_child() to
> reflect the new members fdw_restrictinfo and custom_restrictinfo, for
> safety, and tweaked a comment a bit.

Hmm, it seems that ForeignPath for a foreign join does not support
parameterized paths for now, as in postgresGetForeignJoinPaths() we have
this check:

    /*
     * This code does not work for joins with lateral references, since those
     * must have parameterized paths, which we don't generate yet.
     */
    if (!bms_is_empty(joinrel->lateral_relids))
        return;

And in create_foreign_join_path() we just set the path.param_info to NULL.

    pathnode->path.param_info = NULL;   /* XXX see above */

So I doubt that it's necessary to adjust fdw_restrictinfo in
reparameterize_path_by_child, because it seems to me that
fdw_restrictinfo must be empty there. Maybe we can add an Assert there
as below:

-            ADJUST_CHILD_ATTRS(fpath->fdw_restrictinfo);
+
+            /*
+             * Parameterized foreign joins are not supported. So this
+             * ForeignPath cannot be a foreign join and fdw_restrictinfo
+             * must be empty.
+             */
+            Assert(fpath->fdw_restrictinfo == NIL);

That being said, it's also no harm to handle fdw_restrictinfo in
reparameterize_path_by_child as the patch does. So I'm OK we do that for
safety.

Thanks
Richard
Re: Support to define custom wait events for extensions
At Mon, 31 Jul 2023 16:28:16 +0900, Michael Paquier wrote in
> Attaching a v11 based on Bharath's feedback and yours, for now. I
> have also applied the addition of the two masking variables in
> wait_event.c separately with 7395a90.

+/*
+ * Return the name of a wait event ID for an extension.
+ */
+static const char *
+GetWaitEventExtensionIdentifier(uint16 eventId)

This looks inconsistent. Shouldn't it be GetWaitEventExtensionName()?

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 12:58 PM Michael Paquier wrote:
>
> Attaching a v11 based on Bharath's feedback and yours, for now. I
> have also applied the addition of the two masking variables in
> wait_event.c separately with 7395a90.

+uint32 WaitEventExtensionNew(void)
+
+ Next, each process needs to associate the wait event allocated previously
+ to a user-facing custom string, which is something done by calling:
+
+void WaitEventExtensionRegisterName(uint32 wait_event_info, const char *wait_event_name)
+
+ An example can be found in src/test/modules/worker_spi
+ in the PostgreSQL source tree.
+

Do you think it's worth adding a note here in the docs about an
external module defining more than one custom wait event? A pseudo
code if possible or just a note? Also, how about a XXX comment atop
WaitEventExtensionNew and/or WaitEventExtensionRegisterName on the
possibility of extending the functions to support allocation of more
than one custom wait events?

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Avoid undefined behavior with msvc compiler (src/include/port/pg_bitutils.h)
On Sun, Jul 30, 2023 at 9:45 PM Tom Lane wrote: > > John Naylor writes: > > It seems that we should have "Assert(word != 0);" at the top, which matches > > the other platforms anyway, so I'll add that. > > That's basically equivalent to the existing Assert(non_zero). > I think it'd be okay to drop that one and instead have > the same Assert condition as other platforms, but having both > would be redundant. Works for me, so done that way for both forward and reverse variants. Since the return value is no longer checked in any builds, I thought about removing the variable containing it, but it seems best to leave it behind for clarity since these are not our functions. -- John Naylor EDB: http://www.enterprisedb.com
Re: Support to define custom wait events for extensions
On 2023-07-31 16:28, Michael Paquier wrote:
> On Mon, Jul 31, 2023 at 03:53:27PM +0900, Masahiro Ikeda wrote:
>> /* This should only be called for user-defined wait event. */
>> if (eventId < NUM_BUILTIN_WAIT_EVENT_EXTENSION)
>>     ereport(ERROR,
>>             errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>>             errmsg("invalid wait event ID %u", eventId));
>>
>> I was just wondering if it should also check the eventId
>> that has been allocated though it needs to take the spinlock
>> and GetWaitEventExtensionIdentifier() doesn't take it into account.
>
> What kind of extra check do you have in mind? Once WAIT_EVENT_ID_MASK
> is applied, we already know that we don't have something larger than
> PG_UINT16_MAX, or perhaps you want to cross-check this number with
> what nextId holds in shared memory and that we don't have a number
> between nextId and PG_UINT16_MAX? I am not sure that we need to care
> about this much in this code path, and I'd rather avoid taking the
> spinlock an extra time just for a cross-check.

OK. I assumed to check that we don't have a number between nextId and
PG_UINT16_MAX.

Regards,

-- Masahiro Ikeda NTT DATA CORPORATION
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 03:53:27PM +0900, Masahiro Ikeda wrote:
> I think the order in which they are mentioned should be matched. I mean that
> - so an LWLock or Extension wait
> + so an Extension or LWLock wait

Makes sense.

> /* This should only be called for user-defined wait event. */
> if (eventId < NUM_BUILTIN_WAIT_EVENT_EXTENSION)
>     ereport(ERROR,
>             errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>             errmsg("invalid wait event ID %u", eventId));
>
> I was just wondering if it should also check the eventId
> that has been allocated though it needs to take the spinlock
> and GetWaitEventExtensionIdentifier() doesn't take it into account.

What kind of extra check do you have in mind? Once WAIT_EVENT_ID_MASK is
applied, we already know that we don't have something larger than
PG_UINT16_MAX, or perhaps you want to cross-check this number with what
nextId holds in shared memory and that we don't have a number between
nextId and PG_UINT16_MAX? I am not sure that we need to care about this
much in this code path, and I'd rather avoid taking the spinlock an
extra time just for a cross-check.

Attaching a v11 based on Bharath's feedback and yours, for now. I have
also applied the addition of the two masking variables in wait_event.c
separately with 7395a90.

-- Michael

From 23b6c4c36fbf23ca9c6fb4f42a3f943be44689a6 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Mon, 31 Jul 2023 16:04:20 +0900
Subject: [PATCH v11] Support custom wait events for wait event type "Extension"

Two backend routines are added to allow extensions to allocate and
define custom wait events, all of these being allocated as of type
"Extension":
* WaitEventExtensionNew(), that allocates a wait event ID computed from
  a counter in shared memory.
* WaitEventExtensionRegisterName(), to associate a custom string to the
  wait event registered.

Note that this includes an example of how to use this new facility in
worker_spi, with tests for various scenarios.
The use of these routines should be extended across more in-core
modules, but this is left as work for future patches.
---
 src/include/utils/wait_event.h                |  26 +++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../activity/generate-wait_event_types.pl     |   7 +-
 src/backend/utils/activity/wait_event.c       | 178 +-
 .../utils/activity/wait_event_names.txt       |   2 +-
 .../modules/worker_spi/t/001_worker_spi.pl    |  34 +++-
 .../modules/worker_spi/worker_spi--1.0.sql    |   5 +
 src/test/modules/worker_spi/worker_spi.c      | 104 +-
 doc/src/sgml/monitoring.sgml                  |  11 +-
 doc/src/sgml/xfunc.sgml                       |  45 +
 src/tools/pgindent/typedefs.list              |   2 +
 11 files changed, 393 insertions(+), 24 deletions(-)

diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 4517425f84..aad8bc08fa 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -38,6 +38,32 @@ extern void pgstat_reset_wait_event_storage(void);
 extern PGDLLIMPORT uint32 *my_wait_event_info;

+/* --
+ * Wait Events - Extension
+ *
+ * Use this category when the server process is waiting for some condition
+ * defined by an extension module.
+ *
+ * Extensions can define their own wait events in this category. First,
+ * they should call WaitEventExtensionNew() to get one or more wait event
+ * IDs that are allocated from a shared counter. These can be used directly
+ * with pgstat_report_wait_start() or equivalent. Next, each individual
+ * process should call WaitEventExtensionRegisterName() to associate a wait
+ * event string to the number allocated previously.
+ */
+typedef enum
+{
+	WAIT_EVENT_EXTENSION = PG_WAIT_EXTENSION,
+	WAIT_EVENT_EXTENSION_FIRST_USER_DEFINED
+} WaitEventExtension;
+
+extern void WaitEventExtensionShmemInit(void);
+extern Size WaitEventExtensionShmemSize(void);
+
+extern uint32 WaitEventExtensionNew(void);
+extern void WaitEventExtensionRegisterName(uint32 wait_event_info,
+										   const char *wait_event_name);
+
 /* --
  * pgstat_report_wait_start() -
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index cc387c00a1..5551afffc0 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -49,6 +49,7 @@
 #include "storage/spin.h"
 #include "utils/guc.h"
 #include "utils/snapmgr.h"
+#include "utils/wait_event.h"

 /* GUCs */
 int			shared_memory_type = DEFAULT_SHARED_MEMORY_TYPE;
@@ -142,6 +143,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SyncScanShmemSize());
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
+	size = add_size(size, WaitEventExtensionShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -301,6 +303,7 @@ CreateSharedMemoryAndSemaphores(void)
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 12:07:40PM +0530, Bharath Rupireddy wrote: > We're not giving up and returning an unknown state in the v10 patch > unlike v9, no? This comment needs to change. Right. Better to be consistent with lwlock.c here. >> No, it cannot because we need the custom wait event string to be >> loaded in the same connection as the one querying pg_stat_activity. >> A different thing that can be done here is to use background_psql() >> with a query_until(), though I am not sure that this is worth doing >> here. > > -1 to go to the background_psql and query_until route. However, > enhancing the comment might help "we need the custom wait event string > to be loaded in the same connection as .". Having said this, I > don't quite understand the point of having worker_spi_init() when we > say clearly how to use shmem hooks and custom wait events. If its > intention is to show loading of shared memory to a backend via a > function, do we really need it in worker_spi? I don't mind removing it > if it's not adding any significant value. It seems to initialize the state of the worker_spi, so attaching a function to this stuff makes sense to me, just for the sake of testing all that. -- Michael signature.asc Description: PGP signature
Re: Support to define custom wait events for extensions
On 2023-07-31 10:10, Michael Paquier wrote:
> Attached is a new version.

Thanks for all the improvements. I have some comments for v10.

(1)
-    Extensions can add LWLock types to the list shown in
-    . In some cases, the name
+    Extensions can add Extension and
+    LWLock types
+    to the list shown in and
+    . In some cases, the name
     assigned by an extension will not be available in all server processes;
-    so an LWLock wait event might be reported as
-    just extension rather than the
+    so an LWLock or Extension wait
+    event might be reported as just
+    extension rather than the
     extension-assigned name.

I think the order in which they are mentioned should be matched. I mean that
- so an LWLock or Extension wait
+ so an Extension or LWLock wait

(2)
/* This should only be called for user-defined wait event. */
if (eventId < NUM_BUILTIN_WAIT_EVENT_EXTENSION)
    ereport(ERROR,
            errcode(ERRCODE_INVALID_PARAMETER_VALUE),
            errmsg("invalid wait event ID %u", eventId));

I was just wondering if it should also check the eventId that has been
allocated, though it needs to take the spinlock and
GetWaitEventExtensionIdentifier() doesn't take it into account.

Regards,

-- Masahiro Ikeda NTT DATA CORPORATION
Re: Support to define custom wait events for extensions
On Mon, Jul 31, 2023 at 6:40 AM Michael Paquier wrote:
>
> You are right that I am making things inconsistent here. Having a
> behavior close to the existing LWLock and use "extension" when the
> event cannot be found makes the most sense. I have been a bit
> confused with the wording though of this part of the docs, though, as
> LWLocks don't directly use the custom wait event APIs.

+ * calling WaitEventExtensionRegisterName() in the current process, in
+ * which case give up and return an unknown state.

We're not giving up and returning an unknown state in the v10 patch
unlike v9, no? This comment needs to change.

> > 4.
> > + Add-ins can define custom wait events under the wait event type
> >
> > I see a few instances of Add-ins/add-in in xfunc.sgml. Isn't it better
> > to use the word extension given that glossary defines what an
> > extension is
> > https://www.postgresql.org/docs/current/glossary.html#GLOSSARY-EXTENSION?
>
> An extension is an Add-in, and may be loaded, but it is possible to
> have modules that just need to be LOAD'ed (with command line or just
> shared_preload_libraries) so an add-in may not always be an extension.
> I am not completely sure if add-ins is the best term, but it covers
> both, and that's consistent with the existing docs. Perhaps the same
> area of the docs should be refreshed, but that looks like a separate
> patch for me. For now, I'd rather use a consistent term, and this one
> sounds OK to me.
>
> [1]: https://www.postgresql.org/docs/devel/extend-extensions.html.

The "external module" seems the right wording here. Use of "add-ins" is
fine by me for this patch.

> Okay by me that it may be a better idea to use ereport(ERROR) in the
> long run, so changed this way. I have introduced a
> WAIT_EVENT_CLASS_MASK and a WAIT_EVENT_ID_MASK as we now use
> 0xFF000000 and 0x0000FFFF in three places of this file. This should
> just be a patch on its own.

Yeah, I don't mind these macros going along or before or after the
custom wait events feature.

> Yes, this was mentioned upthread. I am not completely sure yet how
> much we need to do for this interface, but surely it would be faster
> to have a Multiple() interface that returns an array made of N numbers
> requested (rather than a rank of them). For now, I'd rather just aim
> for simplicity for the basics.

+1 to be simple for now. If any such requests come in future, I'm sure
we can always get back to it.

> > 9.
> > # The expected result is a special pattern here with a newline coming from the
> > # first query where the shared memory state is set.
> > $result = $node->poll_query_until(
> >     'postgres',
> >     qq[SELECT worker_spi_init(); SELECT wait_event FROM
> >     pg_stat_activity WHERE backend_type ~ 'worker_spi';],
> >     qq[
> > worker_spi_main]);
> >
> > This test doesn't have to be that complex with the result being a
> > special pattern, SELECT worker_spi_init(); can just be within a
> > separate safe_psql.
>
> No, it cannot because we need the custom wait event string to be
> loaded in the same connection as the one querying pg_stat_activity.
> A different thing that can be done here is to use background_psql()
> with a query_until(), though I am not sure that this is worth doing
> here.

-1 to go to the background_psql and query_until route. However,
enhancing the comment might help: "we need the custom wait event string
to be loaded in the same connection as .". Having said this, I don't
quite understand the point of having worker_spi_init() when we say
clearly how to use shmem hooks and custom wait events. If its intention
is to show loading of shared memory to a backend via a function, do we
really need it in worker_spi? I don't mind removing it if it's not
adding any significant value.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: add timing information to pg_upgrade
On Sun, Jul 30, 2023 at 2:44 AM Nathan Bossart wrote:
>
> On Sat, Jul 29, 2023 at 12:17:40PM +0530, Bharath Rupireddy wrote:
> > While on this, I noticed a thing unrelated to your patch that there's
> > no space between the longest status message of size 60 bytes and ok -
> > 'Checking for incompatible "aclitem" data type in user tablesok
> > 23.932 ms'. I think MESSAGE_WIDTH needs to be bumped up - 64 or more.
>
> Good catch. I think I'd actually propose just removing "in user tables" or
> the word "incompatible" from these messages to keep them succinct.

Either "Checking for \"aclitem\" data type usage" or "Checking for
\"aclitem\" data type in user tables" seems okay to me; however, I
prefer the second wording.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com