pg_page_repair: a tool/extension to repair corrupted pages in postgres with streaming/physical replication

2022-06-21 Thread RKN Sai Krishna
Hi,

Problem: Today, when a data page is corrupted on a primary postgres server
with physical replication (sync or async standbys), there is no easy way to
repair it. We must either rely on PITR to recreate the server or drop the
corrupted table (which is not an option for important customer tables, though
it may be acceptable for some maintenance or temporary tables). PITR is often
costly in a production environment, as it involves building a full-blown
server from a base backup and causes downtime for customers.

Solution: Make use of the uncorrupted copy of the page on a sync or async
standby. The proposed tool/extension, pg_page_repair (as we call it), fetches
the uncorrupted page from a standby and overwrites the corrupted one on the
primary. The challenge is making sure the standby has completely replayed its
WAL and is up to date, so that a stale page is not copied across. A simple
approach (approach 1) is for pg_page_repair to wait until the standby has
replayed up to the primary's flush LSN before fetching the page. The downside
is that pg_page_repair may wait a long time, or even indefinitely, if the
replication lag is large; however, since any good postgres deployment
monitors replication lag and keeps it low, in practice pg_page_repair should
not have to wait long. Another approach (approach 2) is for pg_page_repair to
fetch a base page from the standby and apply all the WAL records pertaining
to the corrupted page on top of it to reconstruct the uncorrupted page. This
requires pulling the replay logic out of core into pg_page_repair, which
isn't easy. Hence we propose approach 1, but are open to discussing approach
2 as well. We expect the proposed solution to hold for index pages too.
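
The wait-until-caught-up loop of approach 1 could look roughly like the sketch
below. This is a hypothetical illustration: XLogRecPtr is simplified to a plain
integer, and the LSN callbacks and demo stubs are stand-ins for queries such as
pg_current_wal_flush_lsn() on the primary and pg_last_wal_replay_lsn() on the
standby; none of this is an existing pg_page_repair API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified LSN type; the real XLogRecPtr is also a 64-bit WAL position. */
typedef uint64_t XLogRecPtr;

/*
 * Poll until the standby has replayed everything the primary has flushed,
 * giving up after max_attempts polls.  A real tool would sleep between
 * polls and surface the replication lag to the user on timeout.
 */
static bool
wait_for_catchup(XLogRecPtr (*primary_flush_lsn)(void),
                 XLogRecPtr (*standby_replay_lsn)(void),
                 int max_attempts)
{
    for (int i = 0; i < max_attempts; i++)
    {
        if (standby_replay_lsn() >= primary_flush_lsn())
            return true;        /* safe to fetch the page now */
    }
    return false;               /* standby is lagging; do not copy the page */
}

/* Demo stubs: a standby that catches up after two polls, and one stuck. */
static int demo_polls = 0;
static XLogRecPtr demo_primary_flush(void) { return 100; }
static XLogRecPtr demo_standby_catching_up(void) { return (XLogRecPtr) (98 + demo_polls++); }
static XLogRecPtr demo_standby_stuck(void) { return 42; }
```

The stuck-standby case is exactly the "waits for long or rather infinitely"
downside discussed above, which is why a real tool would need a timeout.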

Implementation Choices: pg_page_repair can take the corrupted page info
(database OID, relation OID, block number, etc.); or just a relation name,
finding that relation's corrupted pages automatically (using pg_checksums,
for instance); or just a database name, finding all corrupted pages in it.
It can either repair the corrupted pages online (only the corrupted table is
inaccessible while the server continues to run) or take downtime if there
are many corrupted pages.
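
The "figure out the corrupted pages automatically" option amounts to verifying
a per-page checksum, the way pg_checksums does. The sketch below is a toy: the
page layout and the FNV-1a hash are illustrative stand-ins, not PostgreSQL's
real page format or checksum algorithm.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 8192

/* Toy page: a stored checksum followed by the payload. */
typedef struct
{
    uint32_t checksum;                     /* set when the page is written */
    uint8_t  data[TOY_PAGE_SIZE - 4];      /* page payload */
} ToyPage;

/* FNV-1a over the payload; PostgreSQL really uses its own algorithm. */
static uint32_t
toy_page_checksum(const ToyPage *page)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(page->data); i++)
        h = (h ^ page->data[i]) * 16777619u;
    return h;
}

/* Nonzero if the stored checksum no longer matches the data. */
static int
toy_page_is_corrupted(const ToyPage *page)
{
    return toy_page_checksum(page) != page->checksum;
}

/* Demo: a freshly checksummed page verifies; a flipped byte is caught. */
static int
toy_demo_detects_bit_flip(void)
{
    static ToyPage page;                    /* zero-initialized */
    page.checksum = toy_page_checksum(&page);
    if (toy_page_is_corrupted(&page))
        return 0;                           /* clean page must verify */
    page.data[100] ^= 0xFF;                 /* simulate on-disk corruption */
    return toy_page_is_corrupted(&page);
}
```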

Future Scope: pg_page_repair can be integrated into core so that postgres
repairs corrupted pages automatically, without manual intervention.

Other Solutions: We did consider an approach where the tool obtains the FPI
(full-page image) from the WAL and replays all subsequent WAL records for
that page to repair it. But there are limitations, such as the FPI and the
related WAL no longer being available on the primary or in the archive
location.

Thoughts?

Credits (cc-ed): thanks to SATYANARAYANA NARLAPURAM for initial thoughts
and thanks to Bharath Rupireddy, Chen Liang, mahendrakar s and Rohan Kumar
for internal discussions.

Thanks, RKN


Re: Support logical replication of DDLs

2022-06-21 Thread Masahiko Sawada
On Wed, Jun 15, 2022 at 1:00 PM Amit Kapila  wrote:
>
> On Wed, Jun 15, 2022 at 5:44 AM Zheng Li  wrote:
> >
> >
> > While I agree that the deparser is needed to handle the potential
> > syntax differences between
> > the pub/sub, I think it's only relevant for the use cases where only a
> > subset of tables in the database
> > are replicated. For other use cases where all tables, functions and
> > other objects need to be replicated,
> > (for example, creating a logical replica for major version upgrade)
> > there won't be any syntax difference to
> > handle and the schemas are supposed to match exactly between the
> > pub/sub. In other words the user seeks to create an identical replica
> > of the source database and the DDLs should be replicated
> > as is in this case.
> >
>
> I think even for database-level replication we can't assume that
> source and target will always have the same data in which case "Create
> Table As ..", "Alter Table .. " kind of statements can't be replicated
> as it is because that can lead to different results. The other point
> is tomorrow we can extend the database level option/syntax to exclude
> a few objects (something like [1]) as well in which case we again need
> to filter at the publisher level.

Good point.

Regarding the idea of using the parse-tree representation produced by
nodeToString(): I've not read the patch yet, but I'm not sure it's a
good idea. A node's field name can change in a major version. If a
publisher sends a parse-tree string representation to a subscriber on a
newer major version, the subscriber needs to be able to parse the old
format in order to reconstruct the DDL, which hurts maintainability
considerably. On the other hand, the format of the deparsed JSON string
would not change often.
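
To see why a keyed format ages better across versions than a positional one,
consider the toy comparison below. The ":key value" strings only loosely mimic
nodeToString() output, and both parsers are illustrative, not PostgreSQL code:
the keyed reader still finds its field after the publisher adds a new one,
while the positional reader silently picks up the wrong value.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Find ":key " in a serialized node string and return the integer that
 * follows, or -1 if the key is absent.  Tolerates reordered/added fields. */
static int
keyed_get_int(const char *serialized, const char *key)
{
    char        pattern[64];
    const char *p;

    snprintf(pattern, sizeof(pattern), ":%s ", key);
    p = strstr(serialized, pattern);
    if (p == NULL)
        return -1;
    return atoi(p + strlen(pattern));
}

/* Fragile alternative: return the n-th integer in the string, whatever
 * field it happens to belong to. */
static int
positional_get_int(const char *serialized, int n)
{
    int count = 0;

    for (const char *p = serialized; *p; p++)
    {
        /* start of a new run of digits? */
        if (*p >= '0' && *p <= '9' &&
            (p == serialized || p[-1] < '0' || p[-1] > '9'))
        {
            if (count++ == n)
                return atoi(p);
        }
    }
    return -1;
}
```

With a "v1" string {ALTERTABLE :colno 2 :stats 1000} both readers return 2 for
the first field, but once a hypothetical newer publisher emits
{ALTERTABLE :newfield 7 :colno 2 :stats 1000}, only the keyed reader still
answers 2.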

>
> >
>  So I think it's overkill to use the deparser for
> > such use cases. It also costs more space and
> > time using deparsing. For example, the following simple ALTER TABLE
> > command incurs 11 times more space
> > in the WAL record if we were to use the format from the deparser,
> > there will also be time and CPU overhead from the deparser.
> >
> ...
> >
> > So I think it's better to define DDL replication levels [1] to tailor
> > for the two different use cases. We can use different logging format
> > based on the DDL replication level. For example,
> > we can simply log the DDL query string and the search_path for
> > database level DDL replication. But for table level DDL replication we
> > need to use the deparser format in order to
> > handle the potential syntax differences and schema mapping requests.
> >
>
> I think having different logging formats is worth considering but I am
> not sure we can distinguish it for database and table level
> replication because of the reasons mentioned above. One thing which
> may need a different format is the replication of global objects like
> roles, tablespaces, etc., but we haven't analyzed those in detail.
> I feel we may also need a different syntax altogether to
> replicate such objects. Also, I think we may want to optimize the
> current format in some cases so that the WAL amount could be reduced.
>
> I feel if we think that deparsing is required for this project then
> probably at this stage it would be a good idea to explore ways to have
> independent ways to test it. One way is to do testing via the logical
> replication of DDL (aka via this patch) and the other is to write an
> independent test suite as Sawada-San seems to be speculating above
> [2]. I am not sure if there is any progress on the independent
> test suite front yet.

I've attached a WIP patch that adds regression tests for DDL deparse.
The patch applies on top of
v9-0001-Functions-to-deparse-DDL-commands.patch, which Hou recently
submitted[1]. The basic idea is to define an event trigger to deparse
DDLs, run the regression tests, load the deparsed DDLs into another
database cluster, dump both databases, and compare the dumps. Since
the patch doesn't yet support deparsing all DDLs and there is a bug[2],
the attached test runs CREATE TABLE and some ALTER TABLE commands
instead of the full regression tests.

Regards,

[1] 
https://www.postgresql.org/message-id/OS0PR01MB5716B1526C2EDA66907E733B94B39%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[2] deparsing "ALTER INDEX tbl_idx ALTER COLUMN 2 SET STATISTICS
1000;" causes an assertion failure.

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


0001-WIP-Add-regression-tests-for-DDL-deparse.patch
Description: Binary data


Re: Replica Identity check of partition table on subscriber

2022-06-21 Thread Amit Langote
Hi,

On Wed, Jun 22, 2022 at 12:02 PM houzj.f...@fujitsu.com
 wrote:
> Since the patch has been committed, I'm attaching the last patch to fix
> the memory leak.
>
> The bug exists on PG10 ~ PG15 (HEAD).
>
> For HEAD, PG14, and PG13, to fix the memory leak, I think we should use
> free_attrmap instead of pfree and release the no-longer-useful attrmap
> when rebuilding the map info.
>
> For PG12,PG11,PG10, we only need to add the code to release the
> no-longer-useful attrmap when rebuilding the map info. We can still use
> pfree() because the attrmap in back-branch is a single array like:
>
> entry->attrmap = palloc(desc->natts * sizeof(AttrNumber));

LGTM, thank you.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: [PROPOSAL] Detecting plan changes with plan_id in pg_stat_activity

2022-06-21 Thread Julien Rouhaud
Hi,

On Tue, Jun 21, 2022 at 08:04:01PM +, Imseih (AWS), Sami wrote:
>
> I separated the pg_stat_statements patch. The patch
> introduces a secondary hash that tracks the locations of
> a query (by queryid) in the external file.

I still think that's wrong.

> The hash
> remains in lockstep with the pgss_hash using a
> synchronization routine.

Can you describe how it's kept in sync, and how it makes sure that the property
is maintained over restart / gc?  I don't see any change in the code for the
2nd part so I don't see how it could work (and after a quick test it indeed
doesn't).

I also don't see any change in the heuristics for need_gc_qtext(), isn't that
going to lead to too frequent gc?

> My testing does not show any regression for workloads
> in which statements are not issued by multiple users/databases.
>
> However, it shows a good improvement, 10-15%, when
> similar statements are issued by multiple
> users/databases/tracking levels.

"no regression" and "10-15% improvement" on what?

Can you share more details on the benchmarks you did?  Did you also run
benchmark on workloads that induce entry eviction, with and without need for
gc?  Eviction in pgss is already very expensive, and making things worse just
to save a bit of disk space doesn't seem like a good compromise.




Re: pg15b1: FailedAssertion("val > base", File: "...src/include/utils/relptr.h", Line: 67, PID: 30485)

2022-06-21 Thread Thomas Munro
On Wed, Jun 22, 2022 at 4:24 PM Thomas Munro  wrote:
> On Wed, Jun 22, 2022 at 2:54 PM Tom Lane  wrote:
> > John Naylor  writes:
> > > On Wed, Jun 1, 2022 at 2:57 AM Robert Haas  wrote:
> > >> ... So we can fix this by:
> > >> 1. Using a relative pointer value other than 0 to represent a null
> > >> pointer. Andres suggested (Size) -1.
> > >> 2. Not storing the free page manager for the DSM in the main shared
> > >> memory segment at byte offset 0.
>
> For the record, the third idea proposed was to use 1 for the first
> byte, so that 0 is reserved for NULL and works with memset(0).  Here's
> an attempt at that.

... erm, though, duh, I forgot to adjust Assert(val > base).  One more time.
From ba1661745ad57c69d0d9f881913d337504b5e870 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 22 Jun 2022 16:18:53 +1200
Subject: [PATCH v2] Fix relptr's encoding of NULL.

Use 0 for NULL, and use 1-based offset for non-NULL values.  Also fix
macro hygiene in passing (ie missing/misplaced parentheses), and remove
open-coded access to the raw value from freepage.c/.h.
---
 src/backend/utils/mmgr/freepage.c |  6 +++---
 src/include/utils/freepage.h  |  4 ++--
 src/include/utils/relptr.h| 15 +--
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/src/backend/utils/mmgr/freepage.c b/src/backend/utils/mmgr/freepage.c
index b26a295a4e..dcf246faf1 100644
--- a/src/backend/utils/mmgr/freepage.c
+++ b/src/backend/utils/mmgr/freepage.c
@@ -434,7 +434,7 @@ FreePageManagerDump(FreePageManager *fpm)
 
 	/* Dump general stuff. */
 	appendStringInfo(&buf, "metadata: self %zu max contiguous pages = %zu\n",
-	 fpm->self.relptr_off, fpm->contiguous_pages);
+	 relptr_offset(fpm->self), fpm->contiguous_pages);
 
 	/* Dump btree. */
 	if (fpm->btree_depth > 0)
@@ -1269,7 +1269,7 @@ FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
 		if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
 			appendStringInfo(buf, " %zu->%zu",
 			 btp->u.internal_key[index].first_page,
-			 btp->u.internal_key[index].child.relptr_off / FPM_PAGE_SIZE);
+			 relptr_offset(btp->u.internal_key[index].child) / FPM_PAGE_SIZE);
 		else
 			appendStringInfo(buf, " %zu(%zu)",
 			 btp->u.leaf_key[index].first_page,
@@ -1859,7 +1859,7 @@ FreePagePopSpanLeader(FreePageManager *fpm, Size pageno)
 	{
 		Size		f = Min(span->npages, FPM_NUM_FREELISTS) - 1;
 
-		Assert(fpm->freelist[f].relptr_off == pageno * FPM_PAGE_SIZE);
+		Assert(relptr_offset(fpm->freelist[f]) == pageno * FPM_PAGE_SIZE);
 		relptr_copy(fpm->freelist[f], span->next);
 	}
 }
diff --git a/src/include/utils/freepage.h b/src/include/utils/freepage.h
index 08064b10f9..e69b2804ec 100644
--- a/src/include/utils/freepage.h
+++ b/src/include/utils/freepage.h
@@ -78,11 +78,11 @@ struct FreePageManager
 #define fpm_pointer_is_page_aligned(base, ptr)		\
 	(((Size) (((char *) (ptr)) - (base))) % FPM_PAGE_SIZE == 0)
 #define fpm_relptr_is_page_aligned(base, relptr)		\
-	((relptr).relptr_off % FPM_PAGE_SIZE == 0)
+	(relptr_offset(relptr) % FPM_PAGE_SIZE == 0)
 
 /* Macro to find base address of the segment containing a FreePageManager. */
 #define fpm_segment_base(fpm)	\
-	(((char *) fpm) - fpm->self.relptr_off)
+	(((char *) fpm) - relptr_offset(fpm->self))
 
 /* Macro to access a FreePageManager's largest consecutive run of pages. */
 #define fpm_largest(fpm) \
diff --git a/src/include/utils/relptr.h b/src/include/utils/relptr.h
index cc80a7200d..9364dd6694 100644
--- a/src/include/utils/relptr.h
+++ b/src/include/utils/relptr.h
@@ -42,7 +42,7 @@
 #define relptr_access(base, rp) \
 	(AssertVariableIsOfTypeMacro(base, char *), \
 	 (__typeof__((rp).relptr_type)) ((rp).relptr_off == 0 ? NULL : \
-		(base + (rp).relptr_off)))
+		(base) + (rp).relptr_off - 1))
 #else
 /*
  * If we don't have __builtin_types_compatible_p, assume we might not have
@@ -50,12 +50,15 @@
  */
 #define relptr_access(base, rp) \
 	(AssertVariableIsOfTypeMacro(base, char *), \
-	 (void *) ((rp).relptr_off == 0 ? NULL : (base + (rp).relptr_off)))
+	 (void *) ((rp).relptr_off == 0 ? NULL : (base) + (rp).relptr_off - 1))
 #endif
 
 #define relptr_is_null(rp) \
 	((rp).relptr_off == 0)
 
+#define relptr_offset(rp) \
+	((rp).relptr_off - 1)
+
 /* We use this inline to avoid double eval of "val" in relptr_store */
 static inline Size
 relptr_store_eval(char *base, char *val)
@@ -64,8 +67,8 @@ relptr_store_eval(char *base, char *val)
 		return 0;
 	else
 	{
-		Assert(val > base);
-		return val - base;
+		Assert(val >= base);
+		return val - base + 1;
 	}
 }
 
@@ -73,7 +76,7 @@ relptr_store_eval(char *base, char *val)
 #define relptr_store(base, rp, val) \
 	(AssertVariableIsOfTypeMacro(base, char *), \
 	 AssertVariableIsOfTypeMacro(val, __typeof__((rp).relptr_type)), \
-	 (rp).relptr_off = relptr_store_eval(base, (char *) (val)))
+	 (rp).relptr_off = relptr_store_eval((base), (char *) (val)))
 #else
 /*
  * If we don't have __builtin_types_compatible_p, assume we might not have
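
The encoding the patch introduces can be modeled stand-alone to see why it
works: 0 stays reserved for NULL (so memset(0) still yields all-NULL relative
pointers), while an object at the very base of the segment gets offset 1
instead of colliding with NULL. relptr_encode/relptr_decode below are
illustrative names mirroring relptr_store_eval/relptr_access, not the patched
macros themselves.

```c
#include <assert.h>
#include <stddef.h>

typedef size_t relptr_off_t;

/* Store: NULL -> 0, otherwise (val - base) + 1, as in relptr_store_eval. */
static relptr_off_t
relptr_encode(char *base, char *val)
{
    if (val == NULL)
        return 0;
    assert(val >= base);        /* the base address itself is representable */
    return (relptr_off_t) (val - base) + 1;
}

/* Access: 0 -> NULL, otherwise base + off - 1, as in relptr_access. */
static char *
relptr_decode(char *base, relptr_off_t off)
{
    return off == 0 ? NULL : base + off - 1;
}
```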

Re: Skipping logical replication transactions on subscriber side

2022-06-21 Thread Noah Misch
On Mon, Jun 13, 2022 at 10:25:24AM -0400, Robert Haas wrote:
> On Sun, Apr 17, 2022 at 11:22 PM Noah Misch  wrote:
> > > Yes, but it could be false positives in some cases. For instance, the
> > > column {oid, bool, XLogRecPtr} should be okay on ALIGNOF_DOUBLE == 4
> > > and 8 platforms but the new test fails.
> >
> > I'm happy with that, because the affected author should look for 
> > padding-free
> > layouts before settling on your example layout.  If the padding-free layouts
> > are all unacceptable, the author should update the expected sanity_check.out
> > to show the one row where the test "fails".

> Perhaps instead we ought to legislate that NAMEDATALEN must
> be a multiple of 8, or some such thing.
> 
> The other constraint, that typalign='d' fields must always fall on an
> 8 byte boundary, is probably less annoying in practice, but it's easy
> to imagine a future catalog running into trouble. Let's say we want to
> introduce a new catalog that has only an Oid column and a float8
> column. Perhaps with 0-3 bool or uint8 columns as well, or with any
> number of NameData columns as well. Well, the only way to satisfy this
> constraint is to put the float8 column first and the Oid column after
> it, which immediately makes it look different from every other catalog
> we have.

> AFAICS, we could do that by:
> 
> 1. De-supporting platforms that have this problem, or
> 2. Introducing new typalign values, as Noah proposed back on April 2, or
> 3. Somehow forcing values that are sometimes 4-byte aligned and
> sometimes 8-byte aligned to be 8-byte alignment on all platforms

On Mon, Jun 20, 2022 at 10:04:06AM -0400, Robert Haas wrote:
> On Mon, Jun 20, 2022 at 9:52 AM Peter Eisentraut
>  wrote:
> > That means changing the system's ABI, so in the extreme case you then
> > need to compile everything else to match as well.
> 
> I think we wouldn't want to do that in a minor release, but doing it
> in a new major release seems fine -- especially if only AIX is
> affected.

"Everything" isn't limited to PostgreSQL.  The Perl ABI exposes large structs
to plperl; a field of type double could require the AIX user to rebuild Perl
with the same compiler option.


Overall, this could be a textbook example of choosing between:

- Mild harm (unaesthetic column order) to many people.
- Considerable harm (dump/reload instead of pg_upgrade) to a small, unknown,
  possibly-zero quantity of people.

Here's how I rank the options, from most-preferred to least-preferred:

1. Put new eight-byte fields at the front of each catalog, when in doubt.
2. On systems where double alignment differs from int64 alignment, require
   NAMEDATALEN%8==0.  Upgrading to v16 would require dump/reload for AIX users
   changing NAMEDATALEN to conform to the new restriction.
3. Introduce new typalign values.  Upgrading to v16 would require dump/reload
   for all AIX users.
4. De-support AIX.
5. From above, "Somehow forcing values that are sometimes 4-byte aligned and
   sometimes 8-byte aligned to be 8-byte alignment on all platforms".
   Upgrading to v16 would require dump/reload for all AIX users.
6. Require -qalign=natural on AIX.  Upgrading to v16 would require dump/reload
   and possible system library rebuilds for all AIX users.

I gather (1) isn't at the top of your ranking, or you wouldn't have written
in.  What do you think of (2)?
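
The column-ordering concern can be made concrete with a toy two-column layout.
Oid here is a stand-in typedef and the struct is illustrative, not a real
catalog: when alignof(double) is 8, placing a 4-byte column first forces 4
bytes of padding before the float8 column, while on a platform whose ABI only
aligns doubles to 4 bytes (the AIX case above) the offsets, and hence the
tuple layout, come out differently.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;       /* stand-in for the catalog Oid type */

/* A 4-byte column followed by a double-aligned one. */
typedef struct
{
    Oid    oid;             /* offset 0, 4 bytes */
    double val;             /* offset decided by the platform's double alignment */
} ToyCatalogRow;

/* Bytes of padding the compiler inserted before the double column. */
static size_t
toy_padding_before_double(void)
{
    return offsetof(ToyCatalogRow, val) - sizeof(Oid);
}
```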




Re: pg15b1: FailedAssertion("val > base", File: "...src/include/utils/relptr.h", Line: 67, PID: 30485)

2022-06-21 Thread Thomas Munro
On Wed, Jun 22, 2022 at 2:54 PM Tom Lane  wrote:
> John Naylor  writes:
> > On Wed, Jun 1, 2022 at 2:57 AM Robert Haas  wrote:
> >> ... So we can fix this by:
> >> 1. Using a relative pointer value other than 0 to represent a null
> >> pointer. Andres suggested (Size) -1.
> >> 2. Not storing the free page manager for the DSM in the main shared
> >> memory segment at byte offset 0.

For the record, the third idea proposed was to use 1 for the first
byte, so that 0 is reserved for NULL and works with memset(0).  Here's
an attempt at that.
From 8b6b1e765b4dd23b968563aab1a01fa8b7c5a463 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 22 Jun 2022 16:18:53 +1200
Subject: [PATCH] Fix relptr's encoding of NULL.

Use 0 for NULL, and use 1-based offset for non-NULL values.  Also fix
macro hygiene in passing (ie missing/misplaced parentheses), and remove
open-coded access to the raw value from freepage.c/.h.
---
 src/backend/utils/mmgr/freepage.c |  6 +++---
 src/include/utils/freepage.h  |  4 ++--
 src/include/utils/relptr.h| 13 -
 3 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/mmgr/freepage.c b/src/backend/utils/mmgr/freepage.c
index b26a295a4e..dcf246faf1 100644
--- a/src/backend/utils/mmgr/freepage.c
+++ b/src/backend/utils/mmgr/freepage.c
@@ -434,7 +434,7 @@ FreePageManagerDump(FreePageManager *fpm)
 
 	/* Dump general stuff. */
 	appendStringInfo(&buf, "metadata: self %zu max contiguous pages = %zu\n",
-	 fpm->self.relptr_off, fpm->contiguous_pages);
+	 relptr_offset(fpm->self), fpm->contiguous_pages);
 
 	/* Dump btree. */
 	if (fpm->btree_depth > 0)
@@ -1269,7 +1269,7 @@ FreePageManagerDumpBtree(FreePageManager *fpm, FreePageBtree *btp,
 		if (btp->hdr.magic == FREE_PAGE_INTERNAL_MAGIC)
 			appendStringInfo(buf, " %zu->%zu",
 			 btp->u.internal_key[index].first_page,
-			 btp->u.internal_key[index].child.relptr_off / FPM_PAGE_SIZE);
+			 relptr_offset(btp->u.internal_key[index].child) / FPM_PAGE_SIZE);
 		else
 			appendStringInfo(buf, " %zu(%zu)",
 			 btp->u.leaf_key[index].first_page,
@@ -1859,7 +1859,7 @@ FreePagePopSpanLeader(FreePageManager *fpm, Size pageno)
 	{
 		Size		f = Min(span->npages, FPM_NUM_FREELISTS) - 1;
 
-		Assert(fpm->freelist[f].relptr_off == pageno * FPM_PAGE_SIZE);
+		Assert(relptr_offset(fpm->freelist[f]) == pageno * FPM_PAGE_SIZE);
 		relptr_copy(fpm->freelist[f], span->next);
 	}
 }
diff --git a/src/include/utils/freepage.h b/src/include/utils/freepage.h
index 08064b10f9..e69b2804ec 100644
--- a/src/include/utils/freepage.h
+++ b/src/include/utils/freepage.h
@@ -78,11 +78,11 @@ struct FreePageManager
 #define fpm_pointer_is_page_aligned(base, ptr)		\
 	(((Size) (((char *) (ptr)) - (base))) % FPM_PAGE_SIZE == 0)
 #define fpm_relptr_is_page_aligned(base, relptr)		\
-	((relptr).relptr_off % FPM_PAGE_SIZE == 0)
+	(relptr_offset(relptr) % FPM_PAGE_SIZE == 0)
 
 /* Macro to find base address of the segment containing a FreePageManager. */
 #define fpm_segment_base(fpm)	\
-	(((char *) fpm) - fpm->self.relptr_off)
+	(((char *) fpm) - relptr_offset(fpm->self))
 
 /* Macro to access a FreePageManager's largest consecutive run of pages. */
 #define fpm_largest(fpm) \
diff --git a/src/include/utils/relptr.h b/src/include/utils/relptr.h
index cc80a7200d..79f4f0095d 100644
--- a/src/include/utils/relptr.h
+++ b/src/include/utils/relptr.h
@@ -42,7 +42,7 @@
 #define relptr_access(base, rp) \
 	(AssertVariableIsOfTypeMacro(base, char *), \
 	 (__typeof__((rp).relptr_type)) ((rp).relptr_off == 0 ? NULL : \
-		(base + (rp).relptr_off)))
+		(base) + (rp).relptr_off - 1))
 #else
 /*
  * If we don't have __builtin_types_compatible_p, assume we might not have
@@ -50,12 +50,15 @@
  */
 #define relptr_access(base, rp) \
 	(AssertVariableIsOfTypeMacro(base, char *), \
-	 (void *) ((rp).relptr_off == 0 ? NULL : (base + (rp).relptr_off)))
+	 (void *) ((rp).relptr_off == 0 ? NULL : (base) + (rp).relptr_off - 1))
 #endif
 
 #define relptr_is_null(rp) \
 	((rp).relptr_off == 0)
 
+#define relptr_offset(rp) \
+	((rp).relptr_off - 1)
+
 /* We use this inline to avoid double eval of "val" in relptr_store */
 static inline Size
 relptr_store_eval(char *base, char *val)
@@ -65,7 +68,7 @@ relptr_store_eval(char *base, char *val)
 	else
 	{
 		Assert(val > base);
-		return val - base;
+		return val - base + 1;
 	}
 }
 
@@ -73,7 +76,7 @@ relptr_store_eval(char *base, char *val)
 #define relptr_store(base, rp, val) \
 	(AssertVariableIsOfTypeMacro(base, char *), \
 	 AssertVariableIsOfTypeMacro(val, __typeof__((rp).relptr_type)), \
-	 (rp).relptr_off = relptr_store_eval(base, (char *) (val)))
+	 (rp).relptr_off = relptr_store_eval((base), (char *) (val)))
 #else
 /*
  * If we don't have __builtin_types_compatible_p, assume we might not have
@@ -81,7 +84,7 @@ relptr_store_eval(char *base, char *val)
  */
 #define relptr_store(base, rp, val) \
 	(AssertVariableIsOfTypeMacro(base, char *), \
-	 (rp).relptr_off = relptr_store_eval(base, 

Re: tablesync copy ignores publication actions

2022-06-21 Thread Amit Kapila
On Thu, Jun 16, 2022 at 6:07 AM Peter Smith  wrote:

>
> Thank you for your review comments. Those reported mistakes are fixed
> in the attached patch v3.
>

This patch looks mostly good to me except for a few minor comments
mentioned below. It is not very clear in which branch(es) we should
commit this patch. As per my understanding, this is pre-existing
behavior but we want to document it because (a) it was not already
documented, and (b) we followed it for row filters in PG-15, so it
seems it should be explained. So, we have the following options: (a)
commit it only for PG-15, (b) commit for PG-15 and backpatch the
relevant sections, or (c) commit it when the branch opens for PG-16.
What do you or others think?

Few comments:
==
1.
>
-   particular event types.  By default, all operation types are replicated.
-   (Row filters have no effect for TRUNCATE. See
-   ).
+   particular event types. By default, all operation types are replicated.
+   These are DML operation limitations only; they do not affect the initial
+   data synchronization copy.
>

Using limitations in the above sentence can be misleading. Can we
change it to something like: "These publication specifications apply
only for DML operations; they do ... ".

2.
+ operations. The publication pub3b has a row filter.

In the Examples section, you have used row filter whereas that section
is later in the docs. So, it is better if you give reference to that
section in the above sentence (see Section ...).

3.
+ 
+  This parameter only affects DML operations. In particular, the
+  subscription initial data synchronization does not take
this parameter
+  into account when copying existing table data.
+ 

In the second sentence: "... subscription initial data synchronization
..." doesn't sound appropriate. Can we change it to something like:
"In particular, the initial data synchronization (see Section ..) in
logical replication does not take this parameter into account when
copying existing table data."?


-- 
With Regards,
Amit Kapila.




RE: Replica Identity check of partition table on subscriber

2022-06-21 Thread houzj.f...@fujitsu.com
On Tuesday, June 21, 2022 4:49 PM Amit Kapila 
> 
> On Tue, Jun 21, 2022 at 12:50 PM Amit Langote 
> wrote:
> >
> > On Tue, Jun 21, 2022 at 3:35 PM houzj.f...@fujitsu.com
> >  wrote:
> >
> > Attached a patch containing the above to consider as an alternative.
> >
> 
> Thanks, the patch looks good to me. I'll push this after doing some testing.

Since the patch has been committed, I'm attaching the last patch to fix the
memory leak.

The bug exists on PG10 ~ PG15 (HEAD).

For HEAD, PG14, and PG13, to fix the memory leak, I think we should use
free_attrmap instead of pfree and release the no-longer-useful attrmap
when rebuilding the map info.

For PG12,PG11,PG10, we only need to add the code to release the
no-longer-useful attrmap when rebuilding the map info. We can still use
pfree() because the attrmap in back-branch is a single array like:

entry->attrmap = palloc(desc->natts * sizeof(AttrNumber));

Best regards,
Hou zj
 


v12-HEAD-PG14-0001-fix-memory-leak-about-attrmap.patch
Description: v12-HEAD-PG14-0001-fix-memory-leak-about-attrmap.patch


v12-PG13-0001-fix-memory-leak-about-attrmap.patch
Description: v12-PG13-0001-fix-memory-leak-about-attrmap.patch


v12-PG10-11-12-0001-Fix-memory-leak-about-attrmap.patch
Description: v12-PG10-11-12-0001-Fix-memory-leak-about-attrmap.patch


Re: pg15b1: FailedAssertion("val > base", File: "...src/include/utils/relptr.h", Line: 67, PID: 30485)

2022-06-21 Thread Tom Lane
John Naylor  writes:
> On Wed, Jun 1, 2022 at 2:57 AM Robert Haas  wrote:
>> ... So we can fix this by:
>> 1. Using a relative pointer value other than 0 to represent a null
>> pointer. Andres suggested (Size) -1.
>> 2. Not storing the free page manager for the DSM in the main shared
>> memory segment at byte offset 0.

> For this open item, the above two ideas were discussed as a short-term
> fix, and my reading of the thread is that the other proposals are too
> invasive at this point in the cycle. Both of them have a draft patch
> in the thread. #2, i.e. wasting MAXALIGN of space, seems the simplest
> and most localized. Any thoughts on pulling the trigger on either of
> these two approaches?

I'm still of the opinion that 0 == NULL is a good property to have,
so I vote for #2.

regards, tom lane




Re: pg15b1: FailedAssertion("val > base", File: "...src/include/utils/relptr.h", Line: 67, PID: 30485)

2022-06-21 Thread John Naylor
On Wed, Jun 1, 2022 at 2:57 AM Robert Haas  wrote:

> We do use fpm_segment_base(), but that accidentally fails
> to break, because instead of using relptr_access() it drills right
> through the abstraction and doesn't have any kind of special case for
> 0. So we can fix this by:
>
> 1. Using a relative pointer value other than 0 to represent a null
> pointer. Andres suggested (Size) -1.
> 2. Not storing the free page manager for the DSM in the main shared
> memory segment at byte offset 0.

Hi all,

For this open item, the above two ideas were discussed as a short-term
fix, and my reading of the thread is that the other proposals are too
invasive at this point in the cycle. Both of them have a draft patch
in the thread. #2, i.e. wasting MAXALIGN of space, seems the simplest
and most localized. Any thoughts on pulling the trigger on either of
these two approaches?

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2022-06-21 Thread Andres Freund
Hi,

On 2022-06-21 17:22:05 +1200, Thomas Munro wrote:
> Problem: I saw 031_recovery_conflict.pl time out while waiting for a
> buffer pin conflict, but so far once only, on CI:
> 
> https://cirrus-ci.com/task/5956804860444672
> 
> timed out waiting for match: (?^:User was holding shared buffer pin
> for too long) at t/031_recovery_conflict.pl line 367.
> 
> Hrmph.  Still trying to reproduce that, which may be a bug in this
> patch, a bug in the test or a pre-existing problem.  Note that
> recovery didn't say something like:
> 
> 2022-06-21 17:05:40.931 NZST [57674] LOG:  recovery still waiting
> after 11.197 ms: recovery conflict on buffer pin
> 
> (That's what I'd expect to see in
> https://api.cirrus-ci.com/v1/artifact/task/5956804860444672/log/src/test/recovery/tmp_check/log/031_recovery_conflict_standby.log
> if the startup process had decided to send the signal).
> 
> ... so it seems like the problem in that run is upstream of the interrupt 
> stuff.

Odd. The only theory I have so far is that the manual vacuum on the primary
somehow decided to skip the page, and thus didn't trigger a conflict. Because
clearly replay progressed past the records of the VACUUM. Perhaps we should
use VACUUM VERBOSE? In contrast to pg_regress tests, that should be
unproblematic here.

Greetings,

Andres Freund




Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2022-06-21 Thread Thomas Munro
On Wed, Jun 22, 2022 at 1:04 PM Michael Paquier  wrote:
> With the patch, we should always have QueryCancelPending set to false,
> as long as there are no QueryCancelHoldoffCount.  Perhaps an extra
> assertion for QueryCancelPending could be added at the beginning of
> ProcessRecoveryConflictInterrupts(), in combination of the one already
> present for InterruptHoldoffCount.  I agree that's a minor point,
> though.

But QueryCancelPending can be set to true at any time by
StatementCancelHandler(), if we receive SIGINT.




Re: amcheck is using a wrong macro to check compressed-ness

2022-06-21 Thread Michael Paquier
On Thu, Jun 09, 2022 at 10:48:27AM +0900, Michael Paquier wrote:
> Three weeks later, ping.  Robert, could you look at this thread?

And again.  beta2 is planned to next week, and this is still an open
item.  I could look at that by myself, but I always tend to get easily
confused with all the VARATT macros when it comes to compressed blobs,
so it would take a bit of time.
--
Michael




Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-06-21 Thread Michael Paquier
On Tue, Jun 21, 2022 at 10:35:33AM +0900, Kyotaro Horiguchi wrote:
> At Mon, 20 Jun 2022 11:57:20 -0400, Robert Haas  wrote 
> in 
>> says "don't keep trying to read more WAL, just promote RIGHT HERE?". I
>> think this logic would surely be incorrect in that case. It feels to
>> me like the right thing to do is to always keep track if WAL ends in
>> an incomplete record, and then when we promote, we write an aborted
>> contrecord record if WAL ended in an incomplete record, full stop.

Agreed.  The state is maintained in the WAL reader as far as I
understand the logic behind it.

>> Now, we do need to keep in mind that, while in StandbyMode, we might
>> reach the end of WAL more than once, because we have a polling loop
>> that looks for more WAL. So it does not work to just set the values
>> once and then assume that's the whole truth forever. But why not
>> handle that by storing the abortedRecPtr and missingContrecPtr
>> unconditionally after every call to XLogPrefetcherReadRecord()? They
>> will go back and forth between XLogRecPtrIsInvalid and other values
>> many times but the last value should -- I think -- be however things
>> ended up at the point where we decided to stop reading WAL.
> 
> Unfortunately it doesn't work because we read a record already known
> to be complete again at the end of recovery.  It is the reason of
> "abortedRecPtr < xlogreader->EndRecPtr" in my PoC patch.  Without it,
> abortedRecPtr is erased when it is actually needed.  I don't like that
> expedient-looking condition, but the strict condition for resetting
> abortedRecPtr is iff "we have read a complete record at the same or
> greater LSN of abortedRecPtr"...
> 
> Come to think of this, I noticed that we can get rid of the
> file-global abortedRecPtr by letting FinishWalRecovery copy
> abortedRecPtr *before* reading the last record. This also allows us to
> remove the "expedient" condition.

Interesting point, though I am not sure I get why this is safe
compared to the existing solution of setting
missingContrecPtr/abortedRecPtr while reading a set of records when we
look for the last LSN at the end of recovery.

>> I haven't really dived into this issue much so I might be totally off
>> base, but it just doesn't feel right to me that this should depend on
>> whether we're in standby mode. That seems at best incidentally related
>> to whether an aborted contrecord record is required.
>> 
>> P.S. "aborted contrecord record" is really quite an awkward turn of phrase!
> 
> That's true!  Always open for better phrasings :(

The term that would be appropriate here is continuation record?
contrecord is a bit confusing for French-speakers, actually, as
"contre" means "against" ;)

By the way, something that itches me with the proposed patch is that
we don't actually stress the second problem reported, which is that
the use of StandbyMode is incorrect.  Isn't that a sign that we'd
better extend the tests of 026_overwrite_contrecord.pl with more
end-of-recovery scenarios?  Two things that immediately come to mind
are the use of recovery_target_lsn that would force a promotion in the
middle of a continuation record and cascading standbys to make sure
that these get the extra OVERWRITE_CONTRECORD record generated at the
end of recovery.
--
Michael




Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2022-06-21 Thread Michael Paquier
On Tue, Jun 21, 2022 at 11:02:57PM +1200, Thomas Munro wrote:
> On Tue, Jun 21, 2022 at 7:44 PM Michael Paquier  wrote:
>> The extra business with QueryCancelHoldoffCount and DoingCommandRead
>> is the only addition for the snapshot, lock and tablespace conflict
>> handling part.  I don't see a reason why that could be wrong on a
>> close look.  Anyway, why don't you check QueryCancelPending on top
>> of QueryCancelHoldoffCount?
> 
> The idea of this patch is to make ProcessRecoveryConflictInterrupt()
> throw its own ERROR, instead of setting QueryCancelPending (as an
> earlier version of the patch did).  It still has to respect
> QueryCancelHoldoffCount, though, to avoid emitting an ERROR at bad
> times for the fe/be protocol.

Yeah, I was reading through v3 and my brain questioned the
inconsistency, but I can see that v2 already did that and I have also
looked at it.  Anyway, my concern here is that the code becomes more
dependent on the ordering of ProcessRecoveryConflictInterrupt() and
the code path checking for QueryCancelPending in ProcessInterrupts().
With the patch, we should always have QueryCancelPending set to false,
as long as QueryCancelHoldoffCount is zero.  Perhaps an extra
assertion for QueryCancelPending could be added at the beginning of
ProcessRecoveryConflictInterrupts(), in combination with the one already
present for InterruptHoldoffCount.  I agree that's a minor point,
though.
--
Michael




Re: gcc -ftabstop option

2022-06-21 Thread Tom Lane
> At Tue, 21 Jun 2022 12:49:24 +0200, Peter Eisentraut 
>  wrote in 
>> One bit of trickery not addressed yet is that we might want to strip
>> out the option and not expose it through PGXS, since we don't know
>> what whitespacing rules external code uses.

This part seems like a bigger problem than the option is worth.
I agree that we can't assume third-party code prefers 4-space tabs.

Kyotaro Horiguchi  writes:
> There was a related discussion.
> https://www.postgresql.org/message-id/BDE54C55-438C-48E9-B2A3-08EB3A0CBB9F%40yesql.se
>> The -ftabstop option is (to a large extent, not entirely) to warn when tabs
>> and spaces are mixed creating misleading indentation that the author didn't
>> even notice due to tabwidth settings?  ISTM we are better off getting these
>> warnings than suppressing them.

Isn't that kind of redundant given that (a) we have git whitespace
warnings about this and (b) pgindent will take care of any such
problems in the end?

I'll grant the point about compiler warnings possibly not lining up
precisely.  But that's yet to bother me personally, so maybe I'm
underestimating its value.

regards, tom lane




Re: gcc -ftabstop option

2022-06-21 Thread Kyotaro Horiguchi
At Tue, 21 Jun 2022 12:49:24 +0200, Peter Eisentraut 
 wrote in 
> I suggest that we add the gcc (also clang) option -ftabstop=4.
> 
> This has two effects: First, it produces more accurate column numbers
> in compiler errors and correctly indents the code excerpts that the
> compiler shows with those.  Second, it enables the compiler's
> detection of confusingly indented code to work more correctly.  (But
> the latter is only a potential problem for code that does not use tabs
> for indentation, so it would be mostly a help during development with
> sloppy editor setups.)
> 
> Attached is a patch to discover the option in configure.
> 
> One bit of trickery not addressed yet is that we might want to strip
> out the option and not expose it through PGXS, since we don't know
> what whitespacing rules external code uses.
>
> Thoughts?

There's no strong reason for everyone to accept that change, but I
myself don't mind whether the option is applied or not.


There was a related discussion.

https://www.postgresql.org/message-id/BDE54C55-438C-48E9-B2A3-08EB3A0CBB9F%40yesql.se

> The -ftabstop option is (to a large extent, not entirely) to warn when tabs
> and spaces are mixed creating misleading indentation that the author didn't
> even notice due to tabwidth settings?  ISTM we are better off getting these
> warnings than suppressing them.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: PostgreSQL 15 Beta 2 release

2022-06-21 Thread Michael Paquier
On Tue, Jun 21, 2022 at 05:36:35PM -0400, Andrew Dunstan wrote:
> Not quite sure why I'm listed against the OAT hook issue, all I did was
> commit a test that exposed the long existing problem :-)

Yes, we've discussed this open item and came to the conclusion that
assigning it to you is not really fair, so don't worry :)

> It looks like Michael at least has some ideas about fixing it.

I am planning to send an update on this thread today, but that's not
going to be an amazing thing.
--
Michael




Re: Add header support to text format and matching feature

2022-06-21 Thread Michael Paquier
On Mon, Jun 20, 2022 at 09:03:23AM +0900, Michael Paquier wrote:
> On Thu, Jun 16, 2022 at 09:24:56AM +0200, Peter Eisentraut wrote:
>> I don't feel very strongly about this.  It makes sense to stay consistent
>> with the existing COPY code.
> 
> Yes, my previous argument is based on consistency with the
> surroundings.  I am not saying that this could not be made better, it
> surely can, but I would recommend tackling that in a separate patch,
> and applying it to more areas than this specific one.

Peter, beta2 is planned for next week.  Do you think that you would be
able to address this open item by the end of this week?  If not, and I
have already looked at the proposed patch, I can jump in and help.
--
Michael




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-06-21 Thread Andrew Dunstan


On 2022-06-21 Tu 17:25, Andres Freund wrote:
> Hi,
>
> On 2022-06-21 17:11:33 -0400, Andrew Dunstan wrote:
>> I and a couple of colleagues have looked it over. As far as it goes the
>> json fix looks kosher to me. I'll play with it some more.
> Cool.
>
> Any chance you could look at fixing the "structure" of the generated
> expression "program". The recursive ExecEvalExpr() calls are really not ok...
>

Yes, but I don't guarantee to have a fix in time for Beta2.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: PostgreSQL 15 Beta 2 release

2022-06-21 Thread Andrew Dunstan


On 2022-06-21 Tu 09:27, Jonathan S. Katz wrote:
> Hi,
>
> The RMT[1] with the release team has set a date of June 30, 2022 for
> the PostgreSQL 15 Beta 2 release. We encourage you to try to close as
> many open items[2] prior to the release.
>
> If you are working on patches for Beta 2, please be sure that they are
> committed no later than June 26, 2022 AoE[3].
>
> Thanks!
>
> John, Jonathan, Michael


Not quite sure why I'm listed against the OAT hook issue, all I did was
commit a test that exposed the long existing problem :-)

It looks like Michael at least has some ideas about fixing it.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-06-21 Thread Andres Freund
Hi,

On 2022-06-21 17:11:33 -0400, Andrew Dunstan wrote:
> I and a couple of colleagues have looked it over. As far as it goes the
> json fix looks kosher to me. I'll play with it some more.

Cool.

Any chance you could look at fixing the "structure" of the generated
expression "program". The recursive ExecEvalExpr() calls are really not ok...

Greetings,

Andres Freund




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-06-21 Thread Andrew Dunstan


On 2022-06-17 Fr 16:06, Andres Freund wrote:
>
>
> I also attached my heavily-WIP patches for the ExprEvalStep issues, 


Many thanks


> I
> accidentally had only included a small part of the contents of the json fix.
>

Yeah, that confused me mightily last week :-)

I and a couple of colleagues have looked it over. As far as it goes the
json fix looks kosher to me. I'll play with it some more.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Use fadvise in wal replay

2022-06-21 Thread Pavel Borisov
>
> > On 21 Jun 2022, at 16:59, Jakub Wartak  wrote:
> Oh, wow, your benchmarks show really impressive improvement.
>

FWIW I was trying to speed up long sequential file reads in Postgres using
fadvise hints. I found no detectable improvements.
Then I wrote a 1 MB - 1 GB sequential read test with both fadvise
POSIX_FADV_WILLNEED
and POSIX_FADV_SEQUENTIAL on Linux. The only improvements I found were

1. when the size of the read was around several MB and the fadvise length
was also around several MB.
2. when there was a delay between the fadvise and the first read (which was
presumably used by the OS for reading into the prefetch buffer)
3. when reading sequential blocks, a speedup only on the first ones. The
overall read speed of, say, a 1 GB file remained unchanged no matter what.

I became convinced that if I read something long, the OS does the necessary
speedups automatically (which is also in agreement with the fadvise manual
and code comments).
Could you please elaborate on how you got results with that big a
difference? (Though I'm not against fadvise usage; at worst it is expected
to be useless.)

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Use fadvise in wal replay

2022-06-21 Thread Andrey Borodin



> On 21 Jun 2022, at 16:59, Jakub Wartak  wrote:
Oh, wow, your benchmarks show really impressive improvement.

> I think that 1 additional syscall is not going to be cheap just for 
> non-standard OS configurations
Also, we can reduce the number of syscalls with something like

#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
	if ((readOff % (8 * XLOG_BLCKSZ)) == 0)
		posix_fadvise(readFile, readOff + XLOG_BLCKSZ,
					  XLOG_BLCKSZ * 8, POSIX_FADV_WILLNEED);
#endif

and maybe define/reuse a GUC to control the number of pages prefetched at
once.

Best regards, Andrey Borodin.



Re: fix crash with Python 3.11

2022-06-21 Thread Tom Lane
Is it time yet to back-patch 2e517818f ("Fix SPI's handling of errors
during transaction commit")?  We know we're going to have to do it
before Python 3.11 ships, and it's been stable in HEAD for 3.5 months
now.  Also, the Fedora guys absorbed the patch a couple weeks ago [1]
because they're already using 3.11 in rawhide, and I've not heard
complaints from that direction.

My inclination at this point is to not back-patch the second change
12d768e70 ("Don't use static storage for SaveTransactionCharacteristics").
It's not clear that the benefit would be worth even a small risk of
somebody being unhappy about the API break.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/CA%2BHKMWMY_e2otmTJDjKUAvC8Urh4rzSWOPZ%3DfszU5brkBP97ng%40mail.gmail.com




Re: doc: Bring mention of unique index forced transaction wait behavior outside of the internal section

2022-06-21 Thread David G. Johnston
On Tue, Jun 21, 2022 at 6:49 AM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi David,
>
> > It's basically a glorified cross-reference.  I didn't dislike directing
> the reader to the internals section enough to try and establish a better
> location for the main content.
>
> One problem I see is that:
>
> + [..], but as there is no pre-existing data, visibility checks are
> unnecessary.
>
> ... allows a wide variety of interpretations, most of which will be
> wrong. And all in all I find an added paragraph somewhat cryptic.


Yeah, I'd probably have to say "but since no existing record is being
modified, visibility checks are unnecessary".

Is there a specific misinterpretation that first came to mind for you that
I can consider?

>
> If the goal is to add a cross-reference I suggest keeping it short,
> something like "For additional details on various corner cases please
> see ...".
>
>
That does work, and I may end up there, but it feels unsatisfying to be so
vague/general.

David J.


Re: doc: array_length produces null instead of 0

2022-06-21 Thread David G. Johnston
On Tue, Jun 21, 2022 at 6:33 AM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi David,
>
> > Per discussion here:
> >
> >
> https://www.postgresql.org/message-id/163636931138.8076.5140809232053731248%40wrigleys.postgresql.org
> >
> > We can now easily document the array_length behavior of returning null
> instead of zero for an empty array/dimension.
> >
> > I added an example to the json_array_length function to demonstrate that
> it does return 0 as one would expect, but contrary to the SQL array
> behavior.
> >
> > I did not bother to add examples to the other half dozen or so "_length"
> functions that all produce 0 as expected.  Just the surprising case and the
> adjacent one.
>
> Good catch.
>
> +array_length(array[], 1)
> +NULL
>
> One tiny nitpick I have is that this example will not work if used
> literally, as is:
>
> ```
> =# select array_length(array[], 1);
> ERROR:  cannot determine type of empty array
> LINE 1: select array_length(array[], 1);
> ```
>
> Maybe it's worth using `array_length(array[] :: int[], 1)` instead.
>
>
I think subconsciously the cast looked ugly to me so I probably skipped
adding it.  I do agree the example should be executable though, and most of
the existing examples use integer[] (not the abbreviated form, int) so I'll
plan to go with that.

Thanks for the review!

David J.


Re: generate_series for timestamptz and time zone problem

2022-06-21 Thread Przemysław Sztoch



Przemysław Sztoch wrote on 14.06.2022 21:46:

Tom Lane wrote on 14.06.2022 15:43:

=?UTF-8?Q?Przemys=c5=82aw_Sztoch?=  writes:

Please let me know what is the convention (procedure) of adding new
functions to pg_proc. Specifically how oid is allocated.

See
https://www.postgresql.org/docs/devel/system-catalog-initial-data.html#SYSTEM-CATALOG-OID-ASSIGNMENT
(you should probably read that whole chapter for context).

Thx.

There is another patch.
It works, but one thing is done wrongly because I lack the knowledge.

Where I'm using DirectFunctionCall3 I need to pass the timezone name,
but I'm using cstring_to_text and I'm pretty sure there's a memory
leak here. I need help to fix this: please give me feedback on how to
properly store the timezone name in the generate_series function
context structure. I can't finish my work without it.


Additionally, I added a new variant of the date_trunc function that 
takes intervals as an argument.
It enables functionality similar to date_bin, but supports monthly, 
quarterly, annual, etc. periods.
In addition, it is resistant to the problems of different time zones and 
daylight saving time (DST).


--
Przemysław Sztoch | Mobile +48 509 99 00 66
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index f70f829d83..98ca7f8d53 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -69,6 +69,7 @@ typedef struct
 	TimestampTz finish;
 	Interval	step;
 	int			step_sign;
+	char		tzname[TZ_STRLEN_MAX + 1];
 } generate_series_timestamptz_fctx;
 
 
@@ -3003,83 +3004,124 @@ timestamptz_pl_interval(PG_FUNCTION_ARGS)
 {
 	TimestampTz timestamp = PG_GETARG_TIMESTAMPTZ(0);
 	Interval   *span = PG_GETARG_INTERVAL_P(1);
-	TimestampTz result;
+	pg_tz	   *attimezone = NULL;
 	int			tz;
 
 	if (TIMESTAMP_NOT_FINITE(timestamp))
-		result = timestamp;
-	else
+		PG_RETURN_TIMESTAMP(timestamp);
+
+	if (PG_NARGS() > 2)
 	{
-		if (span->month != 0)
+		text	   *zone = PG_GETARG_TEXT_PP(2);
+		char		tzname[TZ_STRLEN_MAX + 1];
+		char	   *lowzone;
+		int			type,
+					val;
+		pg_tz	   *tzp;
+		/*
+		 * Look up the requested timezone (see notes in timestamptz_zone()).
+		 */
+		text_to_cstring_buffer(zone, tzname, sizeof(tzname));
+
+		/* DecodeTimezoneAbbrev requires lowercase input */
+		lowzone = downcase_truncate_identifier(tzname,
+											   strlen(tzname),
+											   false);
+
+		type = DecodeTimezoneAbbrev(0, lowzone, &val, &tzp);
+
+		if (type == TZ || type == DTZ)
 		{
-			struct pg_tm tt,
-					   *tm = &tt;
-			fsec_t		fsec;
-
-			if (timestamp2tm(timestamp, &tz, tm, &fsec, NULL, NULL) != 0)
-				ereport(ERROR,
-						(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
-						 errmsg("timestamp out of range")));
-
-			tm->tm_mon += span->month;
-			if (tm->tm_mon > MONTHS_PER_YEAR)
-			{
-				tm->tm_year += (tm->tm_mon - 1) / MONTHS_PER_YEAR;
-				tm->tm_mon = ((tm->tm_mon - 1) % MONTHS_PER_YEAR) + 1;
-			}
-			else if (tm->tm_mon < 1)
-			{
-				tm->tm_year += tm->tm_mon / MONTHS_PER_YEAR - 1;
-				tm->tm_mon = tm->tm_mon % MONTHS_PER_YEAR + MONTHS_PER_YEAR;
-			}
-
-			/* adjust for end of month boundary problems... */
-			if (tm->tm_mday > day_tab[isleap(tm->tm_year)][tm->tm_mon - 1])
-				tm->tm_mday = (day_tab[isleap(tm->tm_year)][tm->tm_mon - 1]);
-
-			tz = DetermineTimeZoneOffset(tm, session_timezone);
-
-			if (tm2timestamp(tm, fsec, &tz, &timestamp) != 0)
-				ereport(ERROR,
-						(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
-						 errmsg("timestamp out of range")));
+			/* fixed-offset abbreviation, get a pg_tz descriptor for that */
+			tzp = pg_tzset_offset(-val);

RE: Use fadvise in wal replay

2022-06-21 Thread Jakub Wartak
> On Tue, Jun 21, 2022 at 10:33 PM Jakub Wartak 
> wrote:
> > > > Maybe the important question is why would be readahead mechanism
> > > > be
> > > disabled in the first place via /sys | blockdev ?
> > >
> > > Because database should know better than OS which data needs to be
> > > prefetched and which should not. Big OS readahead affects index scan
> > > performance.
> >
> > OK fair point, however the patch here is adding 1 syscall per XLOG_BLCKSZ
> > which is not cheap either. The code is already hot and there is an example
> > from the past where syscalls were limiting the performance [1]. Maybe it
> > could be prefetching in larger batches (128kB? 1MB? 16MB?)
> 
> I've always thought we'd want to tell it about the *next* segment file, to
> smooth the transition from one file to the next, something like the attached
> (not tested).

Hey Thomas! 

Apparently that theory is false. Redo-bench [1] results below (first number is
total recovery time in seconds; 3.1 GB pgdata, of which 2.6 GB is pg_wal / 166
files). Redo-bench was slightly hacked to always drop FS caches after copying,
so that nothing is in the FS cache (neither pgdata nor pg_wal; shared FS).
maintenance_io_concurrency is at its default (10), recovery_prefetch likewise
(try; on by default)

master, default Linux readahead (128kb):
33.979, 0.478
35.137, 0.504
34.649, 0.518

master, blockdev --setra 0 /dev/nvme0n1: 
53.151, 0.603
58.329, 0.525
52.435, 0.536

master, with your patch (readaheads disabled) -- double-checked, calls to
fadvise64(offset=0 len=0) were there
58.063, 0.593
51.369, 0.574
51.716, 0.59

master, with Kirill's original patch (readaheads disabled)
38.25, 1.134
36.563, 0.582
37.711, 0.584

I've also noted that in both cases POSIX_FADV_SEQUENTIAL is being used
instead of WILLNEED (?).
I haven't quantified the tradeoff of master vs Kirill's with readahead, but I
think that one additional syscall is not going to be cheap just for
non-standard OS configurations (?)

-J.

[1] - https://github.com/macdice/redo-bench


Re: doc: Bring mention of unique index forced transaction wait behavior outside of the internal section

2022-06-21 Thread Aleksander Alekseev
Hi David,

> It's basically a glorified cross-reference.  I didn't dislike directing the 
> reader to the internals section enough to try and establish a better location 
> for the main content.

One problem I see is that:

+ [..], but as there is no pre-existing data, visibility checks are unnecessary.

... allows a wide variety of interpretations, most of which will be
wrong. And all in all I find an added paragraph somewhat cryptic.

If the goal is to add a cross-reference I suggest keeping it short,
something like "For additional details on various corner cases please
see ...".

-- 
Best regards,
Aleksander Alekseev




Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-06-21 Thread Przemysław Sztoch


Thomas Munro wrote on 21.06.2022 02:53:

On Tue, Jun 21, 2022 at 12:11 PM Michael Paquier  wrote:

Yeah, Latin-ASCII.xml is getting it wrong here, then.  unaccent
fetches the thing from this URL currently:
https://raw.githubusercontent.com/unicode-org/cldr/release-41/common/transforms/Latin-ASCII.xml

Oh, we're using CLDR 41, which reminds me: CLDR 36 added SOUND
RECORDING COPYRIGHT[1] so we could drop it from special_cases().

Hmm, is it possible to get rid of CYRILLIC CAPITAL LETTER IO and
CYRILLIC SMALL LETTER IO by adding Cyrillic to PLAIN_LETTER_RANGES?

That'd leave just DEGREE CELSIUS and DEGREE FAHRENHEIT.  Not sure how
to kill those last two special cases -- they should be directly
replaced by their decomposition.

[1] https://unicode-org.atlassian.net/browse/CLDR-11383

In patch v3, support for Cyrillic is added.
The special-character function has been purged.
Added support for the category So (Other Symbol); this category includes
the characters from special_cases().


--
Przemysław Sztoch | Mobile +48 509 99 00 66


Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-06-21 Thread Przemysław Sztoch



Michael Paquier wrote on 21.06.2022 02:11:

On Mon, Jun 20, 2022 at 10:37:57AM +0200, Przemysław Sztoch wrote:

But ligature check is performed on combining_ids (result of translation),
not on base codepoint.
Without it, you will get assertions in get_plain_letters.

Hmm.  I am wondering if we could make the whole logic a bit more
intuitive here.  The loop that builds the set of mappings gets now
much more complicated with the addition of the categories beginning by
N for the numbers, and that's mostly the same set of checks as the
ones applied for T.

I'm sorry, but I can't correct this condition.
I have tried, but there are further exceptions and errors.



However, the current Latin-ASCII.xml suggests a conversion to x.
I found an open discussion on the internet about this and the suggestion
that the Latin-ASCII.xml file should be corrected for this letter.
But I wouldn't expect that Unicode makes the revised Latin-ASCII.xml quickly
into the official repo.

Yeah, Latin-ASCII.xml is getting it wrong here, then.  unaccent
fetches the thing from this URL currently:
https://raw.githubusercontent.com/unicode-org/cldr/release-41/common/transforms/Latin-ASCII.xml

Could it be better to handle that as an exception in
generate_unaccent_rules.py, documenting why we are doing it this way
then?  My concern is somebody re-running the script without noticing
this exception, and the set of rules would be blindly, and
incorrectly, updated.

I replaced the Python set with a Python dictionary.
That resolves the problem with the duplicated entry.
I left the conversion to "x". It was like that before and I leave it as
it was.
The conversion to "x" is probably due to the phonetic interpretation of 
this sign.

If they correct the Latin-ASCII.xml file, it will change.

--
Michael


--
Przemysław Sztoch | Mobile +48 509 99 00 66
commit 077cbb396584ac2e7bed6fd662c27e881f82d7c0
Author: Przemyslaw Sztoch 
Date:   Wed May 4 18:28:59 2022 +0200

Update unaccent rules generator.

diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index c405e231b3..bb797fc954 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -35,13 +35,16 @@ import xml.etree.ElementTree as ET
 sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
 
 # The ranges of Unicode characters that we consider to be "plain letters".
-# For now we are being conservative by including only Latin and Greek.  This
-# could be extended in future based on feedback from people with relevant
-# language knowledge.
+# For now we are being conservative by including only Latin, Greek and Cyrillic.
+# This could be extended in future based on feedback from people with
+# relevant language knowledge.
 PLAIN_LETTER_RANGES = ((ord('a'), ord('z')),  # Latin lower case
                        (ord('A'), ord('Z')),  # Latin upper case
-                       (0x03b1, 0x03c9),  # GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER OMEGA
-                       (0x0391, 0x03a9))  # GREEK CAPITAL LETTER ALPHA, GREEK CAPITAL LETTER OMEGA
+                       (ord('0'), ord('9')),  # Digits
+                       (0x0391, 0x03a9),  # Greek capital letters (ALPHA-OMEGA)
+                       (0x03b1, 0x03c9),  # Greek small letters (ALPHA-OMEGA)
+                       (0x0410, 0x044f),  # Cyrillic capital and small letters
+                       (0x00b0, 0x00b0))  # Degree sign
 
 # Combining marks follow a "base" character, and result in a composite
 # character. Example: "U&'A\0300'" produces "À". There are three types of
@@ -139,24 +142,24 @@ def get_plain_letter(codepoint, table):
         return codepoint
 
     # Should not come here
-    assert(False)
+    assert False, 'Codepoint U+%0.2X' % codepoint.id
 
 
 def is_ligature(codepoint, table):
     """Return true for letters combined with letters."""
-    return all(is_letter(table[i], table) for i in codepoint.combining_ids)
+    return all(i in table and is_letter(table[i], table) for i in codepoint.combining_ids)
 
 
 def get_plain_letters(codepoint, table):
     """Return a list of plain letters from a ligature."""
-    assert(is_ligature(codepoint, table))
+    assert is_ligature(codepoint, table), 'Codepoint U+%0.2X' % codepoint.id
     return [get_plain_letter(table[id], table) for id in codepoint.combining_ids]
 
 
 def parse_cldr_latin_ascii_transliterator(latinAsciiFilePath):
     """Parse the XML file and return a set of tuples (src, trg), where "src"
     is the original character and "trg" the substitute."""
-    charactersSet = set()
+    charactersDict = {}
 
     # RegEx to parse rules
     rulePattern = re.compile(r'^(?:(.)|(\\u[0-9a-fA-F]{4})) \u2192 (?:\'(.+)\'|(.+)) ;')
@@ -196,25 +199,19 @@ def parse_cldr_latin_ascii_transliterator(latinAsciiFilePath):
     # the parser of unaccent only accepts non-whitespace characters
     # for "src" and "trg" (see unaccent.c)

Re: doc: array_length produces null instead of 0

2022-06-21 Thread Aleksander Alekseev
Hi David,

> Per discussion here:
>
> https://www.postgresql.org/message-id/163636931138.8076.5140809232053731248%40wrigleys.postgresql.org
>
> We can now easily document the array_length behavior of returning null 
> instead of zero for an empty array/dimension.
>
> I added an example to the json_array_length function to demonstrate that it 
> does return 0 as one would expect, but contrary to the SQL array behavior.
>
> I did not bother to add examples to the other half dozen or so "_length" 
> functions that all produce 0 as expected.  Just the surprising case and the 
> adjacent one.

Good catch.

+array_length(array[], 1)
+NULL

One tiny nitpick I have is that this example will not work if used
literally, as is:

```
=# select array_length(array[], 1);
ERROR:  cannot determine type of empty array
LINE 1: select array_length(array[], 1);
```

Maybe it's worth using `array_length(array[] :: int[], 1)` instead.

-- 
Best regards,
Aleksander Alekseev




PostgreSQL 15 Beta 2 release

2022-06-21 Thread Jonathan S. Katz

Hi,

The RMT[1] with the release team has set a date of June 30, 2022 for the 
PostgreSQL 15 Beta 2 release. We encourage you to try to close as many 
open items[2] prior to the release.


If you are working on patches for Beta 2, please be sure that they are 
committed no later than June 26, 2022 AoE[3].


Thanks!

John, Jonathan, Michael

[1] https://wiki.postgresql.org/wiki/Release_Management_Team
[2] https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items
[3] https://en.wikipedia.org/wiki/Anywhere_on_Earth


OpenPGP_signature
Description: OpenPGP digital signature


Re: Support load balancing in libpq

2022-06-21 Thread Aleksander Alekseev
Hi Jelte,

> Load balancing connections across multiple read replicas is a pretty
> common way of scaling out read queries. There are two main ways of doing
> so, both with their own advantages and disadvantages:
> 1. Load balancing at the client level
> 2. Load balancing by connecting to an intermediary load balancer
>
> Option 1 has been supported by JDBC (Java) for 8 years and Npgsql (C#)
> merged support about a year ago. This patch adds the same functionality
> to libpq. The way it's implemented is the same as the implementation of
> JDBC, and contains two levels of load balancing:
> 1. The given hosts are randomly shuffled, before resolving them
> one-by-one.
> 2. Once a host's addresses get resolved, those addresses are shuffled,
> before trying to connect to them one-by-one.

Thanks for the patch.

I don't mind the feature but I believe the name is misleading. Unless
I missed something, the patch merely allows choosing a random host
from the provided list. By load balancing people generally expect
something more elaborate - e.g. sending read-only queries to replicas
and read/write queries to the primary, or perhaps using weights
proportional to the server throughput/performance.

Randomization would be a better term for what the patch does.
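For illustration, the first level of shuffling described above amounts to a Fisher-Yates shuffle over the candidate host list. Here is a minimal sketch in C; the helper name is hypothetical and libpq itself would use its own PRNG rather than rand():

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Shuffle an array of host name pointers in place using Fisher-Yates.
 * A simplified, hypothetical analogue of what the proposed patch does
 * before resolving hosts one-by-one; not actual libpq code.
 */
static void
shuffle_hosts(const char **hosts, int n)
{
	for (int i = n - 1; i > 0; i--)
	{
		int			j = rand() % (i + 1);	/* pick from the unshuffled prefix */
		const char *tmp = hosts[i];

		hosts[i] = hosts[j];
		hosts[j] = tmp;
	}
}
```

The same routine would be applied a second time to the resolved addresses of each host, giving the two-level randomization the patch describes.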

-- 
Best regards,
Aleksander Alekseev




Re: Use fadvise in wal replay

2022-06-21 Thread Amit Kapila
On Tue, Jun 21, 2022 at 5:41 PM Bharath Rupireddy
 wrote:
>
> On Tue, Jun 21, 2022 at 4:55 PM Amit Kapila  wrote:
> >
> > On Tue, Jun 21, 2022 at 3:18 PM Andrey Borodin  wrote:
> > >
> > > > On 21 Jun 2022, at 12:35, Amit Kapila  wrote:
> > > >
> > > > I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
> > > > help your case?
> > >
> > > AFAICS recovery_prefetch tries to prefetch main fork, but does not try to 
> > > prefetch WAL itself before reading it. Kirill is trying to solve the 
> > > problem of reading WAL segments that are out of the OS page cache.
> > >
> >
> > Okay, but normally the WAL written by walreceiver is read by the
> > startup process soon after it's written as indicated in code comments
> > (get_sync_bit()). So, what is causing the delay here which makes the
> > startup process perform physical reads?
>
> That's not always true. If there's a huge apply lag and/or
> restartpoint is infrequent/frequent or there are many reads on the
> standby - in all of these cases the OS cache can replace the WAL from
> it causing the startup process to hit the disk for WAL reading.
>

It is possible that due to one or more of these reasons the startup
process has to physically read the WAL. I think it is better to find
out what is going on for the OP. AFAICS, there is no mention of any
other kind of reads on the problematic standby. As per the analysis
shared in the initial email, the replication lag is due to disk reads,
but there doesn't seem to be a very clear theory as to why the OP is
seeing those disk reads.

-- 
With Regards,
Amit Kapila.




Re: Use fadvise in wal replay

2022-06-21 Thread Bharath Rupireddy
On Tue, Jun 21, 2022 at 4:55 PM Amit Kapila  wrote:
>
> On Tue, Jun 21, 2022 at 3:18 PM Andrey Borodin  wrote:
> >
> > > On 21 Jun 2022, at 12:35, Amit Kapila  wrote:
> > >
> > > I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
> > > help your case?
> >
> > AFAICS recovery_prefetch tries to prefetch main fork, but does not try to 
> > prefetch WAL itself before reading it. Kirill is trying to solve the 
> > problem of reading WAL segments that are out of the OS page cache.
> >
>
> Okay, but normally the WAL written by walreceiver is read by the
> startup process soon after it's written as indicated in code comments
> (get_sync_bit()). So, what is causing the delay here which makes the
> startup process perform physical reads?

That's not always true. If there's a huge apply lag, restartpoints are
infrequent or frequent, or there are many reads on the standby - in
all of these cases the OS cache can evict the WAL, causing the startup
process to hit the disk for WAL reading.

Regards,
Bharath Rupireddy.




Re: Use fadvise in wal replay

2022-06-21 Thread Amit Kapila
On Tue, Jun 21, 2022 at 3:18 PM Andrey Borodin  wrote:
>
> > On 21 Jun 2022, at 12:35, Amit Kapila  wrote:
> >
> > I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
> > help your case?
>
> AFAICS recovery_prefetch tries to prefetch main fork, but does not try to 
> prefetch WAL itself before reading it. Kirill is trying to solve the problem 
> of reading WAL segments that are out of the OS page cache.
>

Okay, but normally the WAL written by walreceiver is read by the
startup process soon after it's written as indicated in code comments
(get_sync_bit()). So, what is causing the delay here which makes the
startup process perform physical reads?

-- 
With Regards,
Amit Kapila.




Re: Use fadvise in wal replay

2022-06-21 Thread Bharath Rupireddy
On Tue, Jun 21, 2022 at 4:22 PM Thomas Munro  wrote:
>
> On Tue, Jun 21, 2022 at 10:33 PM Jakub Wartak  wrote:
> > > Maybe the important question is why the readahead mechanism would be
> > > disabled in the first place via /sys | blockdev?
> > >
> > > Because database should know better than OS which data needs to be
> > > prefetched and which should not. Big OS readahead affects index scan
> > > performance.
> >
> > OK fair point, however the patch here is adding 1 syscall per XLOG_BLCKSZ 
> > which is not cheap either. The code is already hot and there is example 
> > from the past where syscalls were limiting the performance [1]. Maybe it 
> > could be prefetching in larger batches (128kB? 1MB? 16MB?)  ?
>
> I've always thought we'd want to tell it about the *next* segment
> file, to smooth the transition from one file to the next, something
> like the attached (not tested).

Yes, it makes sense to prefetch the "future" WAL files that "may be"
needed for recovery (crash recovery, archive/PITR recovery, or standby
recovery), not the current WAL file. Having said that, it's not a
great idea (IMO) to make the WAL readers do the prefetching; instead,
WAL prefetching can be delegated to a new background worker, or to an
existing one such as the bgwriter or checkpointer, which gets started
during recovery.

Also, it's a good idea to measure the benefits with and without WAL
prefetching for all recovery types - crash recovery, archive/PITR
recovery, and standby recovery. For standby recovery, the WAL files
may already be in the OS cache if there isn't a huge apply lag.

Regards,
Bharath Rupireddy.




Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2022-06-21 Thread Thomas Munro
On Tue, Jun 21, 2022 at 7:44 PM Michael Paquier  wrote:
> The extra business with QueryCancelHoldoffCount and DoingCommandRead
> is the only addition for the snapshot, lock and tablespace conflict
> handling part.  I don't see a reason why that could be wrong on a
> close lookup.  Anyway, why don't you check QueryCancelPending on top
> of QueryCancelHoldoffCount?

The idea of this patch is to make ProcessRecoveryConflictInterrupt()
throw its own ERROR, instead of setting QueryCancelPending (as an
earlier version of the patch did).  It still has to respect
QueryCancelHoldoffCount, though, to avoid emitting an ERROR at bad
times for the fe/be protocol.




Re: Use fadvise in wal replay

2022-06-21 Thread Thomas Munro
On Tue, Jun 21, 2022 at 10:33 PM Jakub Wartak  wrote:
> > > Maybe the important question is why the readahead mechanism would be
> > > disabled in the first place via /sys | blockdev?
> >
> > Because database should know better than OS which data needs to be
> > prefetched and which should not. Big OS readahead affects index scan
> > performance.
>
> OK fair point, however the patch here is adding 1 syscall per XLOG_BLCKSZ 
> which is not cheap either. The code is already hot and there is example from 
> the past where syscalls were limiting the performance [1]. Maybe it could be 
> prefetching in larger batches (128kB? 1MB? 16MB?)  ?

I've always thought we'd want to tell it about the *next* segment
file, to smooth the transition from one file to the next, something
like the attached (not tested).
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6eba626420..e1dc37d3c2 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -4032,6 +4032,25 @@ rescanLatestTimeLine(TimeLineID replayTLI, XLogRecPtr replayLSN)
 }
 
 
+/*
+ * Tell the kernel to prefetch a logfile segment, if it exists.  Ignores
+ * errors, since this is only a hint.
+ */
+static void
+XLogFilePrefetch(XLogSegNo segno, TimeLineID tli)
+{
+	char		path[MAXPGPATH];
+	File		file;
+
+	XLogFilePath(path, tli, segno, wal_segment_size);
+	file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (file >= 0)
+	{
+		FilePrefetch(file, 0, 0, WAIT_EVENT_WAL_PREFETCH);
+		FileClose(file);
+	}
+}
+
 /*
  * Open a logfile segment for reading (during recovery).
  *
@@ -4106,6 +4125,13 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 		if (source != XLOG_FROM_STREAM)
 			XLogReceiptTime = GetCurrentTimestamp();
 
+		/*
+		 * Every time we open a file from pg_wal, hint to the kernel that we'll
+		 * likely soon be reading the next segment.
+		 */
+		if (readSource == XLOG_FROM_PG_WAL)
+			XLogFilePrefetch(segno + 1, tli);
+
 		return fd;
 	}
 	if (errno != ENOENT || !notfoundOk) /* unexpected failure? */
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 87c15b9c6f..f45d32ceef 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -729,6 +729,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_WAL_INIT_WRITE:
 			event_name = "WALInitWrite";
 			break;
+		case WAIT_EVENT_WAL_PREFETCH:
+			event_name = "WALPrefetch";
+			break;
 		case WAIT_EVENT_WAL_READ:
 			event_name = "WALRead";
 			break;
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index b578e2ec75..1473a4388f 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -226,6 +226,7 @@ typedef enum
 	WAIT_EVENT_WAL_COPY_WRITE,
 	WAIT_EVENT_WAL_INIT_SYNC,
 	WAIT_EVENT_WAL_INIT_WRITE,
+	WAIT_EVENT_WAL_PREFETCH,
 	WAIT_EVENT_WAL_READ,
 	WAIT_EVENT_WAL_SYNC,
 	WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,


gcc -ftabstop option

2022-06-21 Thread Peter Eisentraut

I suggest that we add the gcc (also clang) option -ftabstop=4.

This has two effects:  First, it produces more accurate column numbers 
in compiler errors and correctly indents the code excerpts that the 
compiler shows with those.  Second, it enables the compiler's detection 
of confusingly indented code to work more correctly.  (But the latter is 
only a potential problem for code that does not use tabs for 
indentation, so it would be mostly a help during development with sloppy 
editor setups.)


Attached is a patch to discover the option in configure.

One bit of trickery not addressed yet is that we might want to strip out 
the option and not expose it through PGXS, since we don't know what 
whitespacing rules external code uses.


Thoughts?

From 2095d1616b2ba8e39daa6aac8004936b8a229354 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 21 Jun 2022 12:40:59 +0200
Subject: [PATCH] Add compiler option -ftabstop=4 if available

---
 configure| 43 +++
 configure.ac |  5 +
 2 files changed, 48 insertions(+)

diff --git a/configure b/configure
index 7dec6b7bf9..b7420cceea 100755
--- a/configure
+++ b/configure
@@ -6303,6 +6303,49 @@ if test x"$pgac_cv_prog_CC_cflags__ftree_vectorize" = x"yes"; then
 fi
 
 
+  NOT_THE_CFLAGS=""
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -ftabstop=4, for NOT_THE_CFLAGS" >&5
+$as_echo_n "checking whether ${CC} supports -ftabstop=4, for NOT_THE_CFLAGS... " >&6; }
+if ${pgac_cv_prog_CC_cflags__ftabstop_4+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+pgac_save_CC=$CC
+CC=${CC}
+CFLAGS="${NOT_THE_CFLAGS} -ftabstop=4"
+ac_save_c_werror_flag=$ac_c_werror_flag
+ac_c_werror_flag=yes
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv_prog_CC_cflags__ftabstop_4=yes
+else
+  pgac_cv_prog_CC_cflags__ftabstop_4=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+ac_c_werror_flag=$ac_save_c_werror_flag
+CFLAGS="$pgac_save_CFLAGS"
+CC="$pgac_save_CC"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags__ftabstop_4" >&5
+$as_echo "$pgac_cv_prog_CC_cflags__ftabstop_4" >&6; }
+if test x"$pgac_cv_prog_CC_cflags__ftabstop_4" = x"yes"; then
+  NOT_THE_CFLAGS="${NOT_THE_CFLAGS} -ftabstop=4"
+fi
+
+
+  if test -n "$NOT_THE_CFLAGS"; then
+CPPFLAGS="$CPPFLAGS -ftabstop=4"
+  fi
   #
   # The following tests want to suppress various unhelpful warnings by adding
   # -Wno-foo switches.  But gcc won't complain about unrecognized -Wno-foo
diff --git a/configure.ac b/configure.ac
index d093fb88dd..b1d7ac9fc2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -525,6 +525,11 @@ if test "$GCC" = yes -a "$ICC" = no; then
   PGAC_PROG_CC_VAR_OPT(CFLAGS_UNROLL_LOOPS, [-funroll-loops])
   # Optimization flags for specific files that benefit from vectorization
   PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTORIZE, [-ftree-vectorize])
+  NOT_THE_CFLAGS=""
+  PGAC_PROG_CC_VAR_OPT(NOT_THE_CFLAGS, [-ftabstop=4])
+  if test -n "$NOT_THE_CFLAGS"; then
+CPPFLAGS="$CPPFLAGS -ftabstop=4"
+  fi
   #
   # The following tests want to suppress various unhelpful warnings by adding
   # -Wno-foo switches.  But gcc won't complain about unrecognized -Wno-foo
-- 
2.36.1



RE: Use fadvise in wal replay

2022-06-21 Thread Jakub Wartak
> > Maybe the important question is why the readahead mechanism would be
> > disabled in the first place via /sys | blockdev?
> 
> Because database should know better than OS which data needs to be
> prefetched and which should not. Big OS readahead affects index scan
> performance.

OK, fair point; however, the patch here is adding one syscall per XLOG_BLCKSZ, 
which is not cheap either. The code is already hot and there is an example from 
the past where syscalls were limiting the performance [1]. Maybe it could be 
prefetching in larger batches (128kB? 1MB? 16MB?)?

-J.

[1] - https://commitfest.postgresql.org/28/2606/
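The batching idea above could look roughly like the following sketch. This is a hypothetical helper — its name, signature, and the batch sizes are illustrative, not from any posted patch — that issues one POSIX_FADV_WILLNEED hint per batch rather than one per WAL block:

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>

/*
 * Hypothetical sketch: hint a file range to the kernel in fixed-size
 * batches instead of one posix_fadvise() call per 8 kB WAL block,
 * reducing the syscall count on a hot path.  Returns the number of
 * fadvise calls issued, or -1 on the first failure.
 */
static int
prefetch_range_batched(int fd, off_t offset, off_t len, off_t batch)
{
	int			ncalls = 0;

	while (len > 0)
	{
		off_t		chunk = len < batch ? len : batch;

		/* Advisory only: failures are reported but carry no data risk. */
		if (posix_fadvise(fd, offset, chunk, POSIX_FADV_WILLNEED) != 0)
			return -1;
		offset += chunk;
		len -= chunk;
		ncalls++;
	}
	return ncalls;
}
```

With a default 16 MB WAL segment, a 1 MB batch needs 16 syscalls per segment versus 2048 at one call per 8 kB block, which is the trade-off being discussed.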






Re: Use fadvise in wal replay

2022-06-21 Thread Andrey Borodin



> On 21 Jun 2022, at 13:20, Jakub Wartak  wrote:
> 
> Maybe the important question is why the readahead mechanism would be disabled 
> in the first place via /sys | blockdev?

Because database should know better than OS which data needs to be prefetched 
and which should not. Big OS readahead affects index scan performance.

Best regards, Andrey Borodin.



RE: Use fadvise in wal replay

2022-06-21 Thread Jakub Wartak
>> > On 21 Jun 2022, at 12:35, Amit Kapila  wrote:
>> >
>> > I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
>> > help your case?
>> 
>> AFAICS recovery_prefetch tries to prefetch main fork, but does not try to
>> prefetch WAL itself before reading it. Kirill is trying to solve the problem 
>> of
> >> reading WAL segments that are out of the OS page cache.

It seems that it is always set to 128 kB by default. Another thing is that 
having the default 16MB WAL segments might also hinder the readahead 
heuristics, compared to having configured a bigger WAL segment size.

Maybe the important question is why the readahead mechanism would be disabled 
in the first place via /sys | blockdev?

-J.




Re: Make COPY extendable in order to support Parquet and other formats

2022-06-21 Thread Aleksander Alekseev
Hi Ashutosh,

> An extension just for COPY to/from parquet looks limited in
> functionality. Shouldn't this be viewed as an FDW or Table AM support
> for parquet or other formats? Of course the latter is much larger in
> scope compared to the first one. But there may already be efforts
> underway
> https://www.postgresql.org/about/news/parquet-s3-fdw-01-was-newly-released-2179/

Many thanks for sharing your thoughts on this!

We are using parquet_fdw [2] but this is a read-only FDW.

What users typically need is to dump their data as fast as possible in
a given format and either to upload it to the cloud as historical data
or to transfer it to another system (Spark, etc). The data can be
accessed later if needed, in a read-only fashion.

Note that when accessing the historical data with parquet_fdw you
basically have a zero ingestion time.

Another possible use case is transferring data to PostgreSQL from
another source. Here the requirements are similar - the data should be
dumped as fast as possible from the source, transferred over the
network and imported as fast as possible.

In other words, personally I'm unaware of use cases when somebody
needs a complete read/write FDW or TableAM implementation for formats
like Parquet, ORC, etc. Also to my knowledge they are not particularly
optimized for this.

[2]: https://github.com/adjust/parquet_fdw

-- 
Best regards,
Aleksander Alekseev




Re: Use fadvise in wal replay

2022-06-21 Thread Andrey Borodin



> On 21 Jun 2022, at 12:35, Amit Kapila  wrote:
> 
> I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
> help your case?

AFAICS recovery_prefetch tries to prefetch the main fork, but does not try to 
prefetch the WAL itself before reading it. Kirill is trying to solve the problem 
of reading WAL segments that are out of the OS page cache.

Best regards, Andrey Borodin.



Re: Devel docs on website reloading

2022-06-21 Thread Magnus Hagander
On Wed, Nov 18, 2020 at 1:44 PM Magnus Hagander  wrote:

> On Wed, Nov 18, 2020 at 1:31 PM Alvaro Herrera 
> wrote:
> >
> > On 2020-Nov-18, Magnus Hagander wrote:
> >
> > > It would be trivial to change this so that it only actually updates
> > > pages if they have been changed.
> >
> > I think this means we could also check much more frequently whether a
> > rebuild is needed, right?  We could do that every 30 mins or so, since
> > most of the time it would be a no-op.
>
> No that'd be unrelated. We don't have a dedicated buildfarm animal for
> it, we just piggyback on the existing run, which runs on any changes,
> not just docs.
>

Less than 2 years later this is now actually done. That is, we are now
loading the docs from a git checkout instead of a tarball, which means the
devel docs are now stamped with the git revision that they were built from
(and they're only rebuilt when something changes the docs, but when they
do, it should normally take <2 minutes for the new docs to appear on the
website).

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Make COPY extendable in order to support Parquet and other formats

2022-06-21 Thread Ashutosh Bapat
On Mon, Jun 20, 2022 at 8:35 PM Aleksander Alekseev
 wrote:
>
> I would like to invest some time into providing a corresponding patch
> for the core and implementing "pg_copy_parquet" extension as a
> practical example, and yet another, a bit simpler, extension as an API
> usage example for the core codebase. I just wanted to double-check
> that this is still a wanted feature and no one on pgsql-hackers@
> objects the idea.

An extension just for COPY to/from parquet looks limited in
functionality. Shouldn't this be viewed as an FDW or Table AM support
for parquet or other formats? Of course the latter is much larger in
scope compared to the first one. But there may already be efforts
underway [1]
https://www.postgresql.org/about/news/parquet-s3-fdw-01-was-newly-released-2179/

I have not used it myself or worked with it.

-- 
Best Wishes,
Ashutosh Bapat




Re: Use fadvise in wal replay

2022-06-21 Thread Amit Kapila
On Tue, Jun 21, 2022 at 1:07 PM Kirill Reshke  wrote:
>
> Recently we faced a problem with one of our production clusters. We use a 
> cascade replication setup in this cluster, that is: master, standby (r1), and 
> cascade standby (r2). From time to time, the replication lag on r1 used to 
> grow, while on r2 it did not. Analysis showed that the r1 startup process was 
> spending a lot of time reading WAL from disk. Increasing 
> /sys/block/md2/queue/read_ahead_kb to 16384 (from 0) helps in this case. 
> Maybe we can add an fadvise call in PostgreSQL startup, so it would not be 
> necessary to change settings on the hypervisor?
>

I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
help your case?

[1] - 
https://www.postgresql.org/docs/devel/runtime-config-wal.html#RUNTIME-CONFIG-WAL-RECOVERY

-- 
With Regards,
Amit Kapila.




Re: Replica Identity check of partition table on subscriber

2022-06-21 Thread Amit Langote
On Tue, Jun 21, 2022 at 5:08 PM houzj.f...@fujitsu.com
 wrote:
> On Tuesday, June 21, 2022 3:21 PM Amit Langote  
> wrote:
> > Thanks for the patch.
> >
> > I agree it's an old bug.  A partition map entry's localrel may point
> > to a stale Relation pointer, because once the caller had closed the
> > relation, the relcache subsystem is free to "clear" it, like in the
> > case of a RELCACHE_FORCE_RELEASE build.
>
> Hi,
>
> Thanks for replying.
>
> > Fixing it the way patch does seems fine, though it feels like
> > localrelvalid will lose some of its meaning for the partition map
> > entries -- we will now overwrite localrel even if localrelvalid is
> > true.
>
> To me, it seems localrelvalid doesn't have the meaning that the cached 
> relation
> pointer is valid. In logicalrep_rel_open(), we also reopen and update the
> relation even if the localrelvalid is true.

Ah, right.  I guess only the localrelvalid=false case is really
interesting then.  Only in that case, we need to (re-)build other
fields that are computed using localrel.  In the localrelvalid=true
case, we don't need to worry about other fields, but still need to
make sure that localrel points to an up-to-date relcache entry of the
relation.
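The pattern being fixed — a map entry caching a caller-owned pointer that must be refreshed on every open, independently of the invalidation-driven validity flag — can be sketched with a simplified, hypothetical analogue (these are not the actual PostgreSQL structures):

```c
#include <assert.h>

/*
 * Simplified, hypothetical analogue of the partition map fix: the entry
 * caches a pointer owned by the caller, which may be freed or recycled
 * after every use, so the open path must always overwrite the cached
 * pointer with the one the caller just opened -- even when the "valid"
 * flag (which only tracks invalidation-callback state) is still set.
 */
typedef struct Rel { int oid; } Rel;

typedef struct MapEntry
{
	Rel		   *localrel;		/* caller-owned; may be stale */
	int			localrelvalid;	/* cleared only by the invalidation callback */
} MapEntry;

static MapEntry *
map_entry_open(MapEntry *entry, Rel *just_opened)
{
	/* Always refresh the pointer; validity of derived state is separate. */
	entry->localrel = just_opened;

	if (entry->localrelvalid)
		return entry;

	/* ... (re)build derived fields from just_opened here ... */
	entry->localrelvalid = 1;
	return entry;
}
```

In this analogue, skipping the unconditional assignment would leave `localrel` dangling on the second open whenever the flag is still set, which is the RELCACHE_FORCE_RELEASE crash mode described above.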

> > +   /*
> > +* Relation is opened and closed by caller, so we need to always update 
> > the
> > +* partrel in case the cached relation was closed.
> > +*/
> > +   entry->localrel = partrel;
> > +
> > +   if (entry->localrelvalid)
> > return entry;
> >
> > Maybe we should add a comment here about why it's okay to overwrite
> > localrel even if localrelvalid is true.  How about the following hunk:
> >
> > @@ -596,8 +596,20 @@ logicalrep_partition_open(LogicalRepRelMapEntry
> > *root,
> >
> > entry = _entry->relmapentry;
> >
> > +   /*
> > +* We must always overwrite entry->localrel with the latest partition
> > +* Relation pointer, because the Relation pointed to by the old value 
> > may
> > +* have been cleared after the caller would have closed the partition
> > +* relation after the last use of this entry.  Note that localrelvalid 
> > is
> > +* only updated by the relcache invalidation callback, so it may still 
> > be
> > +* true irrespective of whether the Relation pointed to by localrel has
> > +* been cleared or not.
> > +*/
> > if (found && entry->localrelvalid)
> > +   {
> > +   entry->localrel = partrel;
> > return entry;
> > +   }
> >
> > Attached a patch containing the above to consider as an alternative.
>
> This looks fine to me as well.

Thank you.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




RE: Replica Identity check of partition table on subscriber

2022-06-21 Thread shiy.f...@fujitsu.com
On Tuesday, June 21, 2022 4:49 PM Amit Kapila  wrote:
> 
> On Tue, Jun 21, 2022 at 12:50 PM Amit Langote 
> wrote:
> >
> > On Tue, Jun 21, 2022 at 3:35 PM houzj.f...@fujitsu.com
> >  wrote:
> >
> > Attached a patch containing the above to consider as an alternative.
> >
> 
> Thanks, the patch looks good to me. I'll push this after doing some testing.

The patch looks good to me as well.

I also verified that the patch can be applied cleanly on back-branches and I
confirmed that the bug exists on back branches before this patch and is fixed
after applying this patch. The regression tests also passed with and without
RELCACHE_FORCE_RELEASE option on my machine.

Regards,
Shi yu


Re: Replica Identity check of partition table on subscriber

2022-06-21 Thread Amit Kapila
On Tue, Jun 21, 2022 at 12:50 PM Amit Langote  wrote:
>
> On Tue, Jun 21, 2022 at 3:35 PM houzj.f...@fujitsu.com
>  wrote:
>
> Attached a patch containing the above to consider as an alternative.
>

Thanks, the patch looks good to me. I'll push this after doing some testing.

-- 
With Regards,
Amit Kapila.




Re: Using PQexecQuery in pipeline mode produces unexpected Close messages

2022-06-21 Thread Kyotaro Horiguchi
At Tue, 21 Jun 2022 14:56:40 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> By the way, I noticed that "libpq_pipeline uniqviol" intermittently
> fails for uncertain reasons.
> 
> > result 574/575: pipeline aborted
> > ...
> > done writing
> > 
> > libpq_pipeline:1531: got unexpected NULL

PQsendQueryPrepared() is called after the connection's state has moved
to PGASYNC_IDLE, so PQgetResult returns NULL. But actually there are
results.  So, if pqPipelineProcessQueue() doesn't move the async
state to PGASYNC_IDLE when the queue is empty, uniqviol can run till the
end. But that change breaks almost all of the other test items.

Finally, I found that the change in pqPipelineProcessQueue() as
attached fixes the uniqviol failure and doesn't break other tests.
However, I don't yet understand exactly why the change works... X(
It seems to me something's wrong in the PQ_PIPELINE_ABORTED mode..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From f0b69fbc4f708d3b844b672989dbf15f84ed0c9b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 15 Jun 2022 19:56:41 +0200
Subject: [PATCH v6] Avoid going IDLE in pipeline mode

Introduce a new PGASYNC_PIPELINE_IDLE state, which helps PQgetResult
distinguish the case of "really idle".

This fixes the problem that ReadyForQuery is sent too soon, which caused
a CloseComplete to be received when in idle state.

XXX -- this is still WIP.

Co-authored-by: Kyotaro Horiguchi 
Reported-by: Daniele Varrazzo 
Discussion: https://postgr.es/m/ca+mi_8bvd0_cw3sumgwpvwdnzxy32itog_16tdyru_1s2gv...@mail.gmail.com
---
 src/interfaces/libpq/fe-connect.c |  1 +
 src/interfaces/libpq/fe-exec.c| 56 +++
 src/interfaces/libpq/fe-protocol3.c   | 12 ---
 src/interfaces/libpq/libpq-int.h  |  3 +-
 .../modules/libpq_pipeline/libpq_pipeline.c   | 99 +++
 .../traces/simple_pipeline.trace  | 37 +++
 6 files changed, 175 insertions(+), 33 deletions(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 709ba15220..afd0bc809a 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -6751,6 +6751,7 @@ PQtransactionStatus(const PGconn *conn)
 {
 	if (!conn || conn->status != CONNECTION_OK)
 		return PQTRANS_UNKNOWN;
+	/* XXX what should we do here if status is PGASYNC_PIPELINE_IDLE? */
 	if (conn->asyncStatus != PGASYNC_IDLE)
 		return PQTRANS_ACTIVE;
 	return conn->xactStatus;
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 4180683194..3cf59e45e1 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -1279,7 +1279,8 @@ pqAppendCmdQueueEntry(PGconn *conn, PGcmdQueueEntry *entry)
 			 * itself consume commands from the queue; if we're in any other
 			 * state, we don't have to do anything.
 			 */
-			if (conn->asyncStatus == PGASYNC_IDLE)
+			if (conn->asyncStatus == PGASYNC_IDLE ||
+conn->asyncStatus == PGASYNC_PIPELINE_IDLE)
 			{
 resetPQExpBuffer(>errorMessage);
 pqPipelineProcessQueue(conn);
@@ -1642,9 +1643,9 @@ PQsendQueryStart(PGconn *conn, bool newQuery)
 		return false;
 	}
 
-	/* Can't send while already busy, either, unless enqueuing for later */
-	if (conn->asyncStatus != PGASYNC_IDLE &&
-		conn->pipelineStatus == PQ_PIPELINE_OFF)
+	/* In non-pipeline mode, we can't send anything while already busy */
+	if (conn->pipelineStatus == PQ_PIPELINE_OFF &&
+		conn->asyncStatus != PGASYNC_IDLE)
 	{
 		appendPQExpBufferStr(>errorMessage,
 			 libpq_gettext("another command is already in progress\n"));
@@ -1667,11 +1668,13 @@ PQsendQueryStart(PGconn *conn, bool newQuery)
 		switch (conn->asyncStatus)
 		{
 			case PGASYNC_IDLE:
+			case PGASYNC_PIPELINE_IDLE:
 			case PGASYNC_READY:
 			case PGASYNC_READY_MORE:
 			case PGASYNC_BUSY:
 /* ok to queue */
 break;
+
 			case PGASYNC_COPY_IN:
 			case PGASYNC_COPY_OUT:
 			case PGASYNC_COPY_BOTH:
@@ -1881,6 +1884,7 @@ PQsetSingleRowMode(PGconn *conn)
 	 */
 	if (!conn)
 		return 0;
+	/* XXX modify this? */
 	if (conn->asyncStatus != PGASYNC_BUSY)
 		return 0;
 	if (!conn->cmd_queue_head ||
@@ -2047,19 +2051,19 @@ PQgetResult(PGconn *conn)
 	{
 		case PGASYNC_IDLE:
 			res = NULL;			/* query is complete */
-			if (conn->pipelineStatus != PQ_PIPELINE_OFF)
-			{
-/*
- * We're about to return the NULL that terminates the round of
- * results from the current query; prepare to send the results
- * of the next query when we're called next.  Also, since this
- * is the start of the results of the next query, clear any
- * prior error message.
- */
-resetPQExpBuffer(>errorMessage);
-pqPipelineProcessQueue(conn);
-			}
 			break;
+		case PGASYNC_PIPELINE_IDLE:
+			Assert(conn->pipelineStatus != PQ_PIPELINE_OFF);
+
+			res = NULL;			/* query is complete */
+			/*
+			 * We're about to return the NULL that terminates the 

RE: Replica Identity check of partition table on subscriber

2022-06-21 Thread houzj.f...@fujitsu.com
On Tuesday, June 21, 2022 3:21 PM Amit Langote  wrote:
> 
> On Tue, Jun 21, 2022 at 3:35 PM houzj.f...@fujitsu.com
>  wrote:
> > On Tuesday, June 21, 2022 1:29 PM Amit Kapila :
> > > After pushing this patch, buildfarm member prion has failed.
> > >
> > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prion=HEAD
> > >
> > > It seems to me that the problem could be due to the reason that the
> > > entry returned by logicalrep_partition_open() may not have the correct
> > > value for localrel when we found the entry and localrelvalid is also
> > > true. The point is that before this commit we never use localrel value
> > > from the rel entry returned by logicalrep_partition_open. I think we
> > > need to always update the localrel value in
> > > logicalrep_partition_open().
> >
> > Agreed.
> >
> > And I have confirmed that the failure is due to a segmentation violation when
> > accessing the cached relation. I reproduced this by using the
> > -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE options, as hinted by Tom.
> >
> > Stack:
> > #0 check_relation_updatable (rel=0x1cf4548) at worker.c:1745
> > #1 0x00909cbb in apply_handle_tuple_routing (edata=0x1cbf4e8,
> remoteslot=0x1cbf908, newtup=0x0, operation=CMD_DELETE) at
> worker.c:2181
> > #2 0x009097a5 in apply_handle_delete (s=0x7ffcef7fd730) at
> worker.c:2005
> > #3 0x0090a794 in apply_dispatch (s=0x7ffcef7fd730) at
> worker.c:2503
> > #4 0x0090ad43 in LogicalRepApplyLoop
> (last_received=22299920) at worker.c:2775
> > #5 0x0090c2ab in start_apply (origin_startpos=0) at worker.c:3549
> > #6 0x0090ca8d in ApplyWorkerMain (main_arg=0) at
> worker.c:3805
> > #7 0x008c4c64 in StartBackgroundWorker () at bgworker.c:858
> > #8 0x008ceaeb in do_start_bgworker (rw=0x1c3c6b0) at
> postmaster.c:5815
> > #9 0x008cee97 in maybe_start_bgworkers () at postmaster.c:6039
> > #10 0x008cdf4e in sigusr1_handler (postgres_signal_arg=10) at
> postmaster.c:5204
> > #11 
> > #12 0x7fd8fbe0d4ab in select () from /lib64/libc.so.6
> > #13 0x008c9cfb in ServerLoop () at postmaster.c:1770
> > #14 0x008c96e4 in PostmasterMain (argc=4, argv=0x1c110a0) at
> postmaster.c:1478
> > #15 0x007c665b in main (argc=4, argv=0x1c110a0) at main.c:202
> > (gdb) p rel->localrel->rd_rel
> > $5 = (Form_pg_class) 0x7f7f7f7f7f7f7f7f
> >
> > We didn't hit this problem because we only access that relation when we plan
> > to report an error[1], and then the worker will restart and the cache will be
> > rebuilt, so everything seems OK.
> >
> > The problem seems to have already existed, and we hit it because we started
> > to access the cached relation in more places.
> >
> > I think we should try to update the relation every time, as the relation is
> > opened and closed by the caller, and here is the patch to do that.
> Thanks for the patch.
> 
> I agree it's an old bug.  A partition map entry's localrel may point
> to a stale Relation pointer, because once the caller had closed the
> relation, the relcache subsystem is free to "clear" it, like in the
> case of a RELCACHE_FORCE_RELEASE build.

Hi,

Thanks for replying.

> Fixing it the way patch does seems fine, though it feels like
> localrelvalid will lose some of its meaning for the partition map
> entries -- we will now overwrite localrel even if localrelvalid is
> true.

To me, it seems localrelvalid doesn't have the meaning that the cached relation
pointer is valid. In logicalrep_rel_open(), we also reopen and update the
relation even if the localrelvalid is true.

> +   /*
> +    * Relation is opened and closed by caller, so we need to always update
> +    * the partrel in case the cached relation was closed.
> +    */
> +   entry->localrel = partrel;
> +
> +   if (entry->localrelvalid)
>         return entry;
> 
> Maybe we should add a comment here about why it's okay to overwrite
> localrel even if localrelvalid is true.  How about the following hunk:
> 
> @@ -596,8 +596,20 @@ logicalrep_partition_open(LogicalRepRelMapEntry
> *root,
> 
> entry = _entry->relmapentry;
> 
> +   /*
> +    * We must always overwrite entry->localrel with the latest partition
> +    * Relation pointer, because the Relation pointed to by the old value may
> +    * have been cleared after the caller would have closed the partition
> +    * relation after the last use of this entry.  Note that localrelvalid is
> +    * only updated by the relcache invalidation callback, so it may still be
> +    * true irrespective of whether the Relation pointed to by localrel has
> +    * been cleared or not.
> +    */
>     if (found && entry->localrelvalid)
> +   {
> +       entry->localrel = partrel;
>         return entry;
> +   }
> 
> Attached a patch containing the above to consider as an alternative.

This looks fine to me as well.

Best regards,
Hou zj


Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2022-06-21 Thread Michael Paquier
On Tue, Jun 21, 2022 at 05:22:05PM +1200, Thomas Munro wrote:
> Here's one thing I got a bit confused about along the way, but it
> seems the comment was just wrong:
> 
> +   /*
> +    * If we can abort just the current subtransaction then we are OK
> +    * to throw an ERROR to resolve the conflict.  Otherwise drop
> +    * through to the FATAL case.
> ...
> +   if (!IsSubTransaction())
> ...
> +   ereport(ERROR,
> 
> Surely this was meant to say, "If we're not in a subtransaction then
> ...", right?  Changed.

Indeed, the code does something other than what the comment says, aka
generating an ERROR if the process is not in a subtransaction, ignoring
already-aborted transactions (aborted subtransactions go to FATAL) and
throwing a FATAL in the other cases.  So your change looks right.

> I thought of a couple more simplifications for the big switch
> statement in ProcessRecoveryConflictInterrupt().  The special case for
> DoingCommandRead can be changed to fall through, instead of needing a
> separate ereport(FATAL).

The extra business with QueryCancelHoldoffCount and DoingCommandRead
is the only addition for the snapshot, lock and tablespace conflict
handling part.  I don't see a reason why that could be wrong on a
closer look.  Anyway, why don't you check QueryCancelPending on top
of QueryCancelHoldoffCount?

> Now we're down to just one ereport(FATAL), one ereport(ERROR), and a
> couple of ways to give up without erroring.  I think this makes the
> logic a lot easier to follow?

Agreed that it looks like a gain in clarity.
--
Michael


signature.asc
Description: PGP signature


Use fadvise in wal replay

2022-06-21 Thread Kirill Reshke
Hi hackers!

Recently we faced a problem with one of our production clusters. We use a
cascade replication setup in this cluster, that is: master, standby (r1),
and cascade standby (r2). From time to time, the replication lag on r1 would
grow, while on r2 it did not. Analysis showed that the r1 startup process
was spending a lot of time reading WAL from disk. Increasing
/sys/block/md2/queue/read_ahead_kb to 16384 (from 0) helps in this case.
Maybe we can add an fadvise call in PostgreSQL startup, so it would not be
necessary to change settings on the hypervisor?


v1-0001-Use-fadvise-to-prefect-wal-in-xlogrecovery.patch
Description: Binary data


Re: Replica Identity check of partition table on subscriber

2022-06-21 Thread Amit Langote
On Tue, Jun 21, 2022 at 3:35 PM houzj.f...@fujitsu.com
 wrote:
> On Tuesday, June 21, 2022 1:29 PM Amit Kapila :
> > After pushing this patch, buildfarm member prion has failed.
> > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prion=HEAD
> >
> > It seems to me that the problem could be due to the reason that the
> > entry returned by logicalrep_partition_open() may not have the correct
> > value for localrel when we found the entry and localrelvalid is also
> > true. The point is that before this commit we never use localrel value
> > from the rel entry returned by logicalrep_partition_open. I think we
> > need to always update the localrel value in
> > logicalrep_partition_open().
>
> Agreed.
>
> And I have confirmed that the failure is due to a segmentation violation
> when accessing the cached relation. I reproduced this by building with the
> -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE options, as hinted by Tom.
>
> Stack:
> #0 check_relation_updatable (rel=0x1cf4548) at worker.c:1745
> #1 0x00909cbb in apply_handle_tuple_routing (edata=0x1cbf4e8, remoteslot=0x1cbf908, newtup=0x0, operation=CMD_DELETE) at worker.c:2181
> #2 0x009097a5 in apply_handle_delete (s=0x7ffcef7fd730) at worker.c:2005
> #3 0x0090a794 in apply_dispatch (s=0x7ffcef7fd730) at worker.c:2503
> #4 0x0090ad43 in LogicalRepApplyLoop (last_received=22299920) at worker.c:2775
> #5 0x0090c2ab in start_apply (origin_startpos=0) at worker.c:3549
> #6 0x0090ca8d in ApplyWorkerMain (main_arg=0) at worker.c:3805
> #7 0x008c4c64 in StartBackgroundWorker () at bgworker.c:858
> #8 0x008ceaeb in do_start_bgworker (rw=0x1c3c6b0) at postmaster.c:5815
> #9 0x008cee97 in maybe_start_bgworkers () at postmaster.c:6039
> #10 0x008cdf4e in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5204
> #11 <signal handler called>
> #12 0x7fd8fbe0d4ab in select () from /lib64/libc.so.6
> #13 0x008c9cfb in ServerLoop () at postmaster.c:1770
> #14 0x008c96e4 in PostmasterMain (argc=4, argv=0x1c110a0) at postmaster.c:1478
> #15 0x007c665b in main (argc=4, argv=0x1c110a0) at main.c:202
> (gdb) p rel->localrel->rd_rel
> $5 = (Form_pg_class) 0x7f7f7f7f7f7f7f7f
>
> We didn't hit this problem before because we only accessed that relation
> when we planned to report an error[1], after which the worker restarts and
> the cache is rebuilt, so everything seemed OK.
>
> The problem seems to have already existed; we hit it now because we
> started to access the cached relation in more places.
>
> I think we should update the relation every time, as the relation is
> opened and closed by the caller; here is the patch to do that.

Thanks for the patch.

I agree it's an old bug.  A partition map entry's localrel may point
to a stale Relation pointer, because once the caller has closed the
relation, the relcache subsystem is free to "clear" it, as in the
case of a RELCACHE_FORCE_RELEASE build.

Fixing it the way patch does seems fine, though it feels like
localrelvalid will lose some of its meaning for the partition map
entries -- we will now overwrite localrel even if localrelvalid is
true.

+   /*
+    * Relation is opened and closed by caller, so we need to always update
+    * the partrel in case the cached relation was closed.
+    */
+   entry->localrel = partrel;
+
+   if (entry->localrelvalid)
        return entry;

Maybe we should add a comment here about why it's okay to overwrite
localrel even if localrelvalid is true.  How about the following hunk:

@@ -596,8 +596,20 @@ logicalrep_partition_open(LogicalRepRelMapEntry *root,

entry = _entry->relmapentry;

+   /*
+    * We must always overwrite entry->localrel with the latest partition
+    * Relation pointer, because the Relation pointed to by the old value may
+    * have been cleared after the caller would have closed the partition
+    * relation after the last use of this entry.  Note that localrelvalid is
+    * only updated by the relcache invalidation callback, so it may still be
+    * true irrespective of whether the Relation pointed to by localrel has
+    * been cleared or not.
+    */
    if (found && entry->localrelvalid)
+   {
+       entry->localrel = partrel;
        return entry;
+   }

Attached a patch containing the above to consider as an alternative.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


logicalrep_partition_open-always-set-localrel.patch
Description: Binary data


RE: Replica Identity check of partition table on subscriber

2022-06-21 Thread houzj.f...@fujitsu.com
On Tuesday, June 21, 2022 1:29 PM Amit Kapila :
> 
> On Tue, Jun 21, 2022 at 8:02 AM Amit Kapila  wrote:
> >
> > On Tue, Jun 21, 2022 at 7:49 AM Amit Langote 
> wrote:
> > >
> > >
> > > I think it should spell out REPLICA IDENTITY explicitly to avoid the
> > > commit being confused to have to do with "Referential Integrity
> > > checking".
> > >
> >
> > This makes sense. I'll take care of this.
> >
> 
> After pushing this patch, buildfarm member prion has failed.
> https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prion=HEAD
> 
> It seems to me that the problem could be due to the reason that the
> entry returned by logicalrep_partition_open() may not have the correct
> value for localrel when we found the entry and localrelvalid is also
> true. The point is that before this commit we never use localrel value
> from the rel entry returned by logicalrep_partition_open. I think we
> need to always update the localrel value in
> logicalrep_partition_open().

Agreed.

And I have confirmed that the failure is due to a segmentation violation when
accessing the cached relation. I reproduced this by building with the
-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE options, as hinted by Tom.

Stack:
#0 check_relation_updatable (rel=0x1cf4548) at worker.c:1745
#1 0x00909cbb in apply_handle_tuple_routing (edata=0x1cbf4e8, remoteslot=0x1cbf908, newtup=0x0, operation=CMD_DELETE) at worker.c:2181
#2 0x009097a5 in apply_handle_delete (s=0x7ffcef7fd730) at worker.c:2005
#3 0x0090a794 in apply_dispatch (s=0x7ffcef7fd730) at worker.c:2503
#4 0x0090ad43 in LogicalRepApplyLoop (last_received=22299920) at worker.c:2775
#5 0x0090c2ab in start_apply (origin_startpos=0) at worker.c:3549
#6 0x0090ca8d in ApplyWorkerMain (main_arg=0) at worker.c:3805
#7 0x008c4c64 in StartBackgroundWorker () at bgworker.c:858
#8 0x008ceaeb in do_start_bgworker (rw=0x1c3c6b0) at postmaster.c:5815
#9 0x008cee97 in maybe_start_bgworkers () at postmaster.c:6039
#10 0x008cdf4e in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5204
#11 <signal handler called>
#12 0x7fd8fbe0d4ab in select () from /lib64/libc.so.6
#13 0x008c9cfb in ServerLoop () at postmaster.c:1770
#14 0x008c96e4 in PostmasterMain (argc=4, argv=0x1c110a0) at postmaster.c:1478
#15 0x007c665b in main (argc=4, argv=0x1c110a0) at main.c:202
(gdb) p rel->localrel->rd_rel
$5 = (Form_pg_class) 0x7f7f7f7f7f7f7f7f

We didn't hit this problem before because we only accessed that relation when
we planned to report an error[1], after which the worker restarts and the
cache is rebuilt, so everything seemed OK.

The problem seems to have already existed; we hit it now because we started
to access the cached relation in more places.

I think we should update the relation every time, as the relation is opened
and closed by the caller; here is the patch to do that.

[1]
/*
 * We are in error mode so it's fine this is somewhat slow. It's better 
to
 * give user correct error.
 */
if (OidIsValid(GetRelationIdentityOrPK(rel->localrel)))

Best regards,
Hou zj


0001-Fix-segmentation-fault.patch
Description: 0001-Fix-segmentation-fault.patch


Re: Using PQexecQuery in pipeline mode produces unexpected Close messages

2022-06-21 Thread Kyotaro Horiguchi
At Tue, 21 Jun 2022 14:59:07 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Tue, 21 Jun 2022 14:56:40 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > By the way, I noticed that "libpq_pipeline uniqviol" intermittently
> > fails for uncertain reasons.
> > 
> > > result 574/575: pipeline aborted
> > > ...
> > > done writing
> > > 
> > > libpq_pipeline:1531: got unexpected NULL
> > 
> > The "...done writing" is printed too late in the error cases.
> > 
> > This causes the TAP test to fail, but I haven't found what's happening
> > at the time.
> 
> Just to make sure, I see this with the master branch

No. It *is* caused by the fix. Sorry for the mistake.  The test module
was linked to the wrong binary.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center