Re: Typo about subxip in comments

2022-11-14 Thread Amit Kapila
On Sun, Nov 13, 2022 at 8:32 PM Japin Li  wrote:
>
> On Sat, 12 Nov 2022 at 12:12, Amit Kapila  wrote:
> >
> > I don't know if that is an improvement. I think we should stick to
> > your initial proposed change.
>
> Agreed.  Let's focus on the initial proposed change.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-11-14 Thread Amit Kapila
On Mon, Nov 14, 2022 at 6:52 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,
>
> > > > It seems to me Kuroda-San has proposed this change [1] to fix the test
> > > > but it is not clear to me why such a change is required. Why can't
> > > > CHECK_FOR_INTERRUPTS() after waiting, followed by the existing below
> > > > code [2] in LogicalRepApplyLoop() sufficient to handle parameter
> > > > updates?
>
> (I forgot to say, this change was not proposed by me. I said that it
> should be modified. I thought workers should wake up after the
> transaction was committed.)
>
> > So, why only honor the 'disable' option of the subscription? For
> > example, one can change 'min_apply_delay' and it seems
> > recoveryApplyDelay() honors a similar change in the recovery
> > parameter. Is there a way to set the latch of the worker process, so
> > that it can recheck if anything is changed?
>
> I have not considered it, but it seems reasonable. We may be able to
> call maybe_reread_subscription() if subscription parameters are changed
> and the latch is set.
>

One more thing I would like you to consider is the point I raised about
this patch's interaction with the parallel apply feature, as mentioned
in the email [1]. I am not sure the idea proposed in that email [1] is a
good one, because delaying after applying the commit may not be good,
given that we want this feature to delay the apply of the
transaction(s) on subscribers. I feel this needs more thought.

> Currently, IIUC we try to disable the subscription regardless of the state,
> but should we avoid rereading the catalog if workers are handling
> transactions, as in LogicalRepApplyLoop()?
>

IIUC, here you are referring to reading catalogs again via the
function maybe_reread_subscription(), right? If so, I think the idea
is to not invoke it frequently to avoid increasing transaction apply
time. However, when you are anyway going to wait for a delay, it may
not matter. I feel it would be better to add some comments saying that
we don't want workers to wait for a long time if users have disabled
the subscription or reduced the apply_delay time.
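
For illustration only, not the patch under discussion, the kind of wait
loop being described could look roughly like the sketch below. It uses
existing primitives (WaitLatch(), maybe_reread_subscription()); the
'delay_until' variable and the borrowed wait event name are assumptions
made up for the sketch:

	/* Sketch only: wait until delay_until, but wake up on latch set so
	 * parameter changes (DISABLE, a new min_apply_delay) take effect. */
	for (;;)
	{
		long		diffms;

		diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
												 delay_until);
		if (diffms <= 0)
			break;

		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 diffms,
						 WAIT_EVENT_RECOVERY_APPLY_DELAY);
		ResetLatch(MyLatch);

		CHECK_FOR_INTERRUPTS();

		/* Pick up subscription changes such as DISABLE or a new delay value. */
		maybe_reread_subscription();
	}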

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JRs0v9Z65HWKEZg3quWx4LiQ%3DpddTJZ_P1koXsbR3TMA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Avoid overhead open-close indexes (catalog updates)

2022-11-14 Thread Michael Paquier
On Tue, Nov 15, 2022 at 09:57:26AM +0900, Michael Paquier wrote:
> Anyway, multi-inserts are going to be a better solution than
> CatalogTupleInsertWithInfo() in some cases, because we would just
> generate one WAL record of N inserts rather than N records with one
> INSERT each.

Something that you did not consider in the initial patch is that we
may end up opening the catalog indexes even in cases where this would
not have happened on HEAD, as we may end up doing nothing when
copying the stats or updating them during an ANALYZE, and that's not
fine IMO.  However, it is easy enough to minimize the cost: just do a
CatalogOpenIndexes() when absolutely required, and close things only
if the indexes have been opened.

Then, there are the cases where it is worth switching to a
multi-insert logic as these are going to manipulate more than 2
entries all the time: enum list addition and two code paths of
tsearchcmds.c (where up to 16 entries can be lined up).  This is a
case-by-case analysis.  For example, in the case of the enums, the
number of elements is known in advance so it is possible to know the
number of slots that would be used and initialize them.  But that's
not something you would do for the first tsearch bits where the data
is built upon a scan so the slot init should be delayed.  The second
tsearch one can use a predictable approach, like the enums, based on
the number of known elements to insert.
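
Purely as an illustration of the tuple-slot pattern for the "known in
advance" case (this is not the attached patch; the helper name and the
elided tts_values filling are made up, while the catalog and executor
APIs are the existing ones):

#include "postgres.h"

#include "access/table.h"
#include "catalog/indexing.h"
#include "catalog/pg_enum.h"
#include "executor/tuptable.h"
#include "utils/rel.h"

/* Sketch only: bulk-insert nvals entries into pg_enum via tuple slots. */
static void
multi_insert_sketch(int nvals)
{
	Relation	rel = table_open(EnumRelationId, RowExclusiveLock);
	CatalogIndexState indstate = CatalogOpenIndexes(rel);
	TupleTableSlot **slot = palloc(nvals * sizeof(TupleTableSlot *));

	for (int i = 0; i < nvals; i++)
	{
		slot[i] = MakeSingleTupleTableSlot(RelationGetDescr(rel),
										   &TTSOpsHeapTuple);
		ExecClearTuple(slot[i]);
		memset(slot[i]->tts_isnull, false,
			   slot[i]->tts_tupleDescriptor->natts * sizeof(bool));
		/* ... fill slot[i]->tts_values[Anum_pg_enum_xxx - 1] here ... */
		ExecStoreVirtualTuple(slot[i]);
	}

	/* Usually a single WAL record for the whole batch. */
	CatalogTuplesMultiInsertWithInfo(rel, slot, nvals, indstate);

	for (int i = 0; i < nvals; i++)
		ExecDropSingleTupleTableSlot(slot[i]);

	CatalogCloseIndexes(indstate);
	table_close(rel, RowExclusiveLock);
}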

So I've given all that a try, and finished with the attached.  This
patch amounts to a list of bullet points, so it had better be
split into different commits, I guess.
Thoughts?
--
Michael
From 4dd7d8a7670343132d8ba6fb032eaf86ec724378 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 15 Nov 2022 15:44:45 +0900
Subject: [PATCH v2] Avoid some overhead with some catalog inserts and updates

This commit lightens a couple of code paths that do catalog inserts
and updates.

First, CatalogTupleInsert() is costly compared to CatalogTupleInsertWithInfo()
when used for the insertion or update of multiple tuples, as it needs to
open and close the indexes of the catalog being worked on multiple times.
It happens that some code paths use CatalogTupleInsert() while
CatalogTupleInsertWithInfo() would be a better choice, and the following
ones are changed in this commit:
- REINDEX CONCURRENTLY when copying statistics from one index relation
to the other.  Multi-INSERTs are avoided here, as this would begin to
show benefits on indexes with multiple expressions, for example, which
may not be the most common pattern.  This happens to be noticeable on
profiles with indexes having many expressions in a worst case scenario.
- Update of statistics on ANALYZE, that mixes inserts and updates.
In each case, the catalog indexes are opened only if at least one
insertion and/or update is required, to minimize the cost of the
operation.

Second, a few code paths are switched from CatalogTupleInsert() to
a multi-insert approach through tuple slots:
- ALTER TEXT SEARCH CONFIGURATION ADD/ALTER MAPPING when inserting new
entries.  The replacement logic switches to CatalogTupleInsertWithInfo()
for the updates.
- CREATE TEXT SEARCH CONFIGURATION.
- CREATE/ALTER TYPE AS ENUM when adding a list of enum values.
These use case-by-case logic, so that the initialization of tuple slots
is done only when necessary, to minimize the cost of the operation.

Author: Ranier Vilela, Michael Paquier
Reviewed-by: Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/CAEudQAqh0F9y6Di_Wc8xW4zkWm_5SDd-nRfVsCn=h0nm1c_...@mail.gmail.com
---
 src/backend/catalog/heap.c |  10 ++-
 src/backend/catalog/pg_enum.c  |  55 +
 src/backend/commands/analyze.c |  11 ++-
 src/backend/commands/tsearchcmds.c | 120 +++--
 4 files changed, 157 insertions(+), 39 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 5b49cc5a09..bdd413f01b 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -2856,6 +2856,7 @@ CopyStatistics(Oid fromrelid, Oid torelid)
 	SysScanDesc scan;
 	ScanKeyData key[1];
 	Relation	statrel;
+	CatalogIndexState indstate = NULL;
 
 	statrel = table_open(StatisticRelationId, RowExclusiveLock);
 
@@ -2878,13 +2879,20 @@ CopyStatistics(Oid fromrelid, Oid torelid)
 
 		/* update the copy of the tuple and insert it */
 		statform->starelid = torelid;
-		CatalogTupleInsert(statrel, tup);
+
+		/* fetch index information when we know we need it */
+		if (indstate == NULL)
+			indstate = CatalogOpenIndexes(statrel);
+
+		CatalogTupleInsertWithInfo(statrel, tup, indstate);
 
 		heap_freetuple(tup);
 	}
 
 	systable_endscan(scan);
 
+	if (indstate != NULL)
+		CatalogCloseIndexes(indstate);
 	table_close(statrel, RowExclusiveLock);
 }
 
diff --git a/src/backend/catalog/pg_enum.c b/src/backend/catalog/pg_enum.c
index 114715498d..ee11f5f6f1 100644
--- a/src/backend/catalog/pg_enum.c
+++ b/src/backend/catalog/pg_enum.c
@@ 

Re: PGDOCS - Logical replication GUCs - added some xrefs

2022-11-14 Thread Peter Smith
On Sun, Nov 13, 2022 at 11:47 AM vignesh C  wrote:
>
> On Mon, 24 Oct 2022 at 13:15, Peter Smith  wrote:
> >
> > Hi hackers.
> >
> > There is a docs Logical Replication section "31.10 Configuration
> > Settings" [1] which describes some logical replication GUCs, and
> > details on how they interact with each other and how to take that into
> > account when setting their values.
> >
> > There is another docs Server Configuration section "20.6 Replication"
> > [2] which lists the replication-related GUC parameters, and what they
> > are for.
> >
> > Currently AFAIK those two pages are unconnected, but I felt it might
> > be helpful if some of the parameters in the list [2] had xref links to
> > the additional logical replication configuration information [1]. PSA
> > a patch to do that.
> >
> > ~~
> >
> > Meanwhile, I also suspect that the main blurb at the top of [1] is not
> > entirely correct... it says "These settings control the behaviour of
> > the built-in streaming replication feature", although some of the GUCs
> > mentioned later in this section are clearly for "logical replication".
>
> The introduction mainly talks about streaming replication, and the page
> [1] subsection "Subscribers" clearly mentions that these
> configurations are for logical replication. As we already have a
> separate page [2] detailing the logical replication configuration,
> it might be better to move the "Subscribers" section from [1] to [2].
>
> [1] 20.6 Replication -
> https://www.postgresql.org/docs/current/runtime-config-replication.html
> [2] 31.10 Configuration Settings -
> https://www.postgresql.org/docs/current/logical-replication-config.html
>

Thanks, Vignesh. Your suggestion (to move that "Subscribers" section)
seemed like a good idea to me, so PSA my patch v2 to implement that.

Now, on the Streaming Replication page
- the blurb has a reference to information about logical replication config
- the "Subscribers" section was relocated to the other page

Now, on the Logical Replication "Configuration Settings" page
- there are new subsections for "Publishers", "Subscribers" (copied), "Notes"
- some wording is rearranged but the content is basically the same as before

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v2-0001-Logical-replication-GUCs-consolidated.patch
Description: Binary data


Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Dilip Kumar
On Mon, Nov 14, 2022 at 10:18 PM David G. Johnston
 wrote:

>> Do you have a view on this point?
>>
>
> NULL when overflowed seems like the opposite of the desired effect, calling 
> attention to the exceptional status.  Make it a text column and write 
> "overflow" or "###" as appropriate.  Anyone using the column is going to end 
> up wanting to special-case overflow anyway and number-to-text conversion 
> aside from overflow is simple enough if a number, and not just a display 
> label, is needed.

+1, if we are interested in adding only one column then this could be the
best way to show it.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Dilip Kumar
On Tue, Nov 15, 2022 at 2:47 AM Andres Freund  wrote:
>
> First, it's not good to have a cliff that you can't see coming - presumably
> you'd want to warn *before* you regularly reach PGPROC_MAX_CACHED_SUBXIDS
> subxids, rather than when the shit has hit the fan already.

I agree with the point that it is good to have a way to know that the
problem is about to happen.  So for that reason, we should show the
subtransaction count.  By showing the count, users can know exactly
whether some sessions could create problems in the near future, and can
take action before the problem actually happens.

> IMO the number matters a lot when analyzing why this is happening / how to
> react. A session occasionally reaching 65 subxids might be tolerable and not
> necessarily indicative of a bug. But 100k subxids is something that one just
> can't accept.

Actually, we will see the problem as soon as the count has crossed 64,
because after that any visibility check needs to look at the SLRU.  So
I feel both count and overflow are important: the count to know that we
are heading towards overflow, and the overflow flag to know that it has
already happened.
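
Purely as an illustration of how such information could be consumed (the
column names below are hypothetical, standing in for whatever this patch
ends up exposing in pg_stat_activity):

-- Hypothetical columns: subxact_count / subxact_overflowed are not (yet)
-- real pg_stat_activity columns.
SELECT pid, backend_xid, subxact_count, subxact_overflowed
FROM pg_stat_activity
WHERE subxact_overflowed
   OR subxact_count > 48;  -- warn well before PGPROC_MAX_CACHED_SUBXIDS (64)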

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: Assertion failure in SnapBuildInitialSnapshot()

2022-11-14 Thread Takamichi Osumi (Fujitsu)
Hi,


On Tuesday, November 15, 2022 10:26 AM Andres Freund  wrote:
> On 2022-11-10 16:04:40 +0530, Amit Kapila wrote:
> > I don't have any good ideas on how to proceed with this. Any thoughts
> > on this would be helpful?
> 
> One thing worth doing might be to convert the assertion path into an elog(),
> mentioning both xids (or add a framework for things like AssertLT(), but that
> seems hard). With the concrete values we could make a better guess at
> what's going wrong.
> 
> It'd probably not hurt to just perform this check independent of
> USE_ASSERT_CHECKING - compared to the cost of creating a slot it's
> negligible.
> 
> That'll obviously only help us whenever we re-encounter the issue, which will
> likely be a while...
> 
> 
> Have you tried reproducing the issue by running the test in a loop?
Just FYI, I've tried to reproduce this failure in a loop,
but I couldn't hit the same assertion failure.


I extracted only the #16643 case of 100_bugs.pl and
executed the test more than 500 times.


My environment was RHEL 7.9 with gcc 4.8, configured with
--enable-cassert --enable-debug --enable-tap-tests --with-icu CFLAGS=-O0
and a prefix.



Best Regards,
Takamichi Osumi





Re: predefined role(s) for VACUUM and ANALYZE

2022-11-14 Thread Nathan Bossart
On Mon, Nov 14, 2022 at 03:40:04PM -0800, Nathan Bossart wrote:
> Thanks for taking a look!  Here is a rebased version of the patch set.

Oops, apparently object_aclcheck() cannot be used for pg_class.  Here is
another version that uses pg_class_aclcheck() instead.  I'm not sure how I
missed this earlier.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 20df1ffd941753f20b710b14c1de5c4b52de3eca Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 7 Sep 2022 22:25:29 -0700
Subject: [PATCH v9 1/4] Change AclMode from a uint32 to a uint64.

---
 src/backend/nodes/outfuncs.c|  2 +-
 src/bin/pg_upgrade/check.c  | 35 +
 src/include/catalog/pg_type.dat |  4 ++--
 src/include/nodes/parsenodes.h  |  6 +++---
 src/include/utils/acl.h | 28 +-
 5 files changed, 55 insertions(+), 20 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f05e72f0dc..8f150e9a2e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -560,7 +560,7 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
 	WRITE_BOOL_FIELD(lateral);
 	WRITE_BOOL_FIELD(inh);
 	WRITE_BOOL_FIELD(inFromCl);
-	WRITE_UINT_FIELD(requiredPerms);
+	WRITE_UINT64_FIELD(requiredPerms);
 	WRITE_OID_FIELD(checkAsUser);
 	WRITE_BITMAPSET_FIELD(selectedCols);
 	WRITE_BITMAPSET_FIELD(insertedCols);
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index f1bc1e6886..615a53a864 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -28,6 +28,7 @@ static void check_for_incompatible_polymorphics(ClusterInfo *cluster);
 static void check_for_tables_with_oids(ClusterInfo *cluster);
 static void check_for_composite_data_type_usage(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
+static void check_for_aclitem_data_type_usage(ClusterInfo *cluster);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
 static void check_for_new_tablespace_dir(ClusterInfo *new_cluster);
@@ -107,6 +108,13 @@ check_and_dump_old_cluster(bool live_check)
 	check_for_reg_data_type_usage(&old_cluster);
 	check_for_isn_and_int8_passing_mismatch(&old_cluster);
 
+	/*
+	 * PG 16 increased the size of the 'aclitem' type, which breaks the on-disk
+	 * format for existing data.
+	 */
+	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)
+		check_for_aclitem_data_type_usage(&old_cluster);
+
 	/*
 	 * PG 14 changed the function signature of encoding conversion functions.
 	 * Conversions from older versions cannot be upgraded automatically
@@ -1319,6 +1327,33 @@ check_for_reg_data_type_usage(ClusterInfo *cluster)
 		check_ok();
 }
 
+/*
+ * check_for_aclitem_data_type_usage
+ *
+ *	aclitem changed its storage format in 16, so check for it.
+ */
+static void
+check_for_aclitem_data_type_usage(ClusterInfo *cluster)
+{
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for incompatible aclitem data type in user tables");
+
+	snprintf(output_path, sizeof(output_path), "tables_using_aclitem.txt");
+
+	if (check_for_data_type_usage(cluster, "pg_catalog.aclitem", output_path))
+	{
+		pg_log(PG_REPORT, "fatal");
+		pg_fatal("Your installation contains the \"aclitem\" data type in user tables.\n"
+ "The internal format of \"aclitem\" changed in PostgreSQL version 16\n"
+ "so this cluster cannot currently be upgraded.  You can drop the\n"
+ "problem columns and restart the upgrade.  A list of the problem\n"
+ "columns is in the file:\n"
+ "%s", output_path);
+	}
+	else
+		check_ok();
+}
 
 /*
  * check_for_jsonb_9_4_usage()
diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat
index df45879463..0763dfde39 100644
--- a/src/include/catalog/pg_type.dat
+++ b/src/include/catalog/pg_type.dat
@@ -267,9 +267,9 @@
 # OIDS 1000 - 1099
 
 { oid => '1033', array_type_oid => '1034', descr => 'access control list',
-  typname => 'aclitem', typlen => '12', typbyval => 'f', typcategory => 'U',
+  typname => 'aclitem', typlen => '16', typbyval => 'f', typcategory => 'U',
   typinput => 'aclitemin', typoutput => 'aclitemout', typreceive => '-',
-  typsend => '-', typalign => 'i' },
+  typsend => '-', typalign => 'd' },
 { oid => '1042', array_type_oid => '1014',
   descr => 'char(length), blank-padded string, fixed storage length',
   typname => 'bpchar', typlen => '-1', typbyval => 'f', typcategory => 'S',
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 7caff62af7..f4ed9bbff9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -73,12 +73,12 @@ typedef enum SetQuantifier
 
 /*
  * Grantable rights are encoded so that we can OR them together in a bitmask.
- * The present representation of AclItem limits us to 16 distinct rights,
- * even though AclMode is defined as uint32.  See utils/acl.h.
+ * The present 

Re: Schema variables - new implementation for Postgres 15

2022-11-14 Thread Pavel Stehule
On Mon, Nov 14, 2022 at 8:00 AM Sergey Shinderuk <
s.shinde...@postgrespro.ru> wrote:

> On 13.11.2022 20:59, Pavel Stehule wrote:
> > fresh rebase
>
> Hello,
>
> Sorry, I haven't been following this thread, but I'd like to report a
> memory management bug. I couldn't apply the latest patches, so I tested
> with v20221104-1-* patches applied atop of commit b0284bfb1db.
>
>
> postgres=# create variable s text default 'abc';
>
> create function f() returns text as $$
> begin
>  return g(s);
> end;
> $$ language plpgsql;
>
> create function g(t text) returns text as $$
> begin
>  let s = 'BOOM!';
>  return t;
> end;
> $$ language plpgsql;
>
> select f();
> CREATE VARIABLE
> CREATE FUNCTION
> CREATE FUNCTION
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> LOG:  server process (PID 55307) was terminated by signal 11:
> Segmentation fault
> DETAIL:  Failed process was running: select f();
>
>
> I believe it's a use-after-free error, triggered by assigning a new
> value to s in g(), thus making t a dangling pointer.
>
> After reconnecting I get a scary error:
>
> postgres=# select f();
> ERROR:  compressed pglz data is corrupt
>

I am able to reproduce it, and I have a quick fix, but I need to
investigate whether this fix will be correct.

It's a good example showing that I have to always return a copy of the value.

Regards

Pavel





>
> Best regards,
>
> --
> Sergey Shinderuk  https://postgrespro.com/
>
>


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-14 Thread Masahiko Sawada
On Mon, Nov 14, 2022 at 10:00 PM John Naylor
 wrote:
>
> On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada  wrote:
> >
> > 0004 patch is a new patch supporting a pointer tagging of the node
> > kind. Also, it introduces rt_node_ptr we discussed so that internal
> > functions use it rather than having two arguments for encoded and
> > decoded pointers. With this intermediate patch, the DSA support patch
> > became more readable and understandable. Probably we can make it
> > smaller further if we move the change of separating the control object
> > from radix_tree to the main patch (0002). The patch still needs to be
> > polished but I'd like to check if this idea is worthwhile. If we agree
> > on this direction, this patch will be merged into the main radix tree
> > implementation patch.
>
> Thanks for the new patch set. I've taken a very brief look at 0004 and I 
> think the broad outlines are okay. As you say it needs polish, but before 
> going further, I'd like to do some experiments of my own as I mentioned 
> earlier:
>
> - See how much performance we actually gain from tagging the node kind.
> - Try additional size classes while keeping the node kinds to only four.
> - Optimize node128 insert.
> - Try templating out the differences between local and shared memory. With 
> local memory, the node-pointer struct would be a union, for example. 
> Templating would also reduce branches and re-simplify some internal APIs, but 
> it's likely that would also make the TID store and/or vacuum more complex, 
> because at least some external functions would be duplicated.

Thanks! Please let me know if there is something I can help with.

In the meantime, I'd like to make some progress on the vacuum
integration and improving the test coverage.
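
For readers unfamiliar with the idea, pointer tagging of the node kind
generally looks something like the sketch below (a generic illustration,
not the actual 0004 patch; all names here are made up):

#include <assert.h>
#include <stdint.h>

/*
 * With nodes allocated at 8-byte (or larger) alignment, the low 3 bits of
 * the pointer are free to carry the node kind, so reading the kind needs
 * no extra memory dereference.
 */
typedef uintptr_t rt_ptr;

#define KIND_BITS 3
#define KIND_MASK ((((uintptr_t) 1) << KIND_BITS) - 1)

static inline rt_ptr
tag_ptr(void *node, unsigned kind)
{
	assert(((uintptr_t) node & KIND_MASK) == 0);
	assert(kind <= KIND_MASK);
	return (uintptr_t) node | kind;
}

static inline unsigned
ptr_kind(rt_ptr p)
{
	return (unsigned) (p & KIND_MASK);
}

static inline void *
ptr_node(rt_ptr p)
{
	return (void *) (p & ~KIND_MASK);
}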

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Suppressing useless wakeups in walreceiver

2022-11-14 Thread Nathan Bossart
Another option might be to just force initial reply/feedback messages when
streaming starts.  The attached patch improves src/test/recovery test
runtime just like the previous one I posted.

On Mon, Nov 14, 2022 at 11:01:27AM -0800, Nathan Bossart wrote:
> I wonder if we
> ought to do something similar in the ConfigReloadPending path in case
> hot_standby_feedback is being turned on.

Looking closer, I see that a feedback message is forced in this case
already.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8bd2ba37dd..ad383dbcaa 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -433,6 +433,10 @@ WalReceiverMain(void)
 			for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
 WalRcvComputeNextWakeup(i, now);
 
+			/* Send initial reply/feedback messages. */
+			XLogWalRcvSendReply(true, false);
+			XLogWalRcvSendHSFeedback(true);
+
 			/* Loop until end-of-streaming or error */
 			for (;;)
 			{


Re: libpq compression (part 2)

2022-11-14 Thread Andrey Borodin
On Sat, Nov 12, 2022 at 8:04 PM Andrey Borodin  wrote:
>
> While testing patch some more I observe unpleasant segfaults:
>
> #27 0x0b08fda2 in lz4_decompress (d_stream=0x18cf82a0,
> src=0x7feae4fa505d, src_size=92,
> (gdb) select-frame 27
> (gdb) info locals
> ds = 0x18cf82a0
> decPtr = 0x18cf8aec ""
> decBytes = -87
>

I've debugged this and the problem turned out to be wrong
message limits for LZ4 compression:
+#define MESSAGE_MAX_BYTES 819200
instead of
+#define MESSAGE_MAX_BYTES 8192

Other codecs can utilize continuation of the decompression stream
using ZS_DATA_PENDING, while LZ4 cannot do this. I was going to
produce a quick fix for all my LZ4 findings, but it occurred to me that the
patchset needs a heavy rebase. If no one shows up to fix it, I'll do
that on one of the coming weekends. Either way, here's a reproducer for
the coredump:
psql 'compression=lz4' -f boom.sql

Thanks!

Best regards, Andrey Borodin.


boom.sql
Description: Binary data


Re: Assertion failure in SnapBuildInitialSnapshot()

2022-11-14 Thread Thomas Munro
On Tue, Nov 15, 2022 at 3:38 PM Andres Freund  wrote:
> On 2022-11-14 17:25:31 -0800, Andres Freund wrote:
> Another theory: I dimly remember Thomas mentioning that there's some different
> behaviour of xlogreader during shutdown as part of the v15 changes. I don't
> quite remember what the scenario leading up to that was. Thomas?

Yeah.  So as mentioned in:

https://www.postgresql.org/message-id/CA%2BhUKG%2BWKsZpdoryeqM8_rk5uQPCqS2HGY92WiMGFsK2wVkcig%40mail.gmail.com

I still have on my list to remove a new "missing contrecord" error
that can show up in a couple of different scenarios that aren't
exactly error conditions depending on how you think about it, but I
haven't done that yet.  I am not currently aware of anything bad
happening because of those messages, but I could be wrong.

> It's certainly interesting that we see stuff like:
>
> 2022-11-08 00:20:23.255 GMT [2012][walsender] 
> [pg_16400_sync_16395_7163433409941550636][8/0:0] ERROR:  could not find 
> record while sending logically-decoded data: missing contrecord at 0/1D3B710
> 2022-11-08 00:20:23.255 GMT [2012][walsender] 
> [pg_16400_sync_16395_7163433409941550636][8/0:0] STATEMENT:  
> START_REPLICATION SLOT "pg_16400_sync_16395_7163433409941550636" LOGICAL 
> 0/1D2B650 (proto_version '3', origin 'any', publication_names '"testpub"')
> ERROR:  could not find record while sending logically-decoded data: missing 
> contrecord at 0/1D3B710
> 2022-11-08 00:20:23.255 GMT [248][logical replication worker] ERROR:  error 
> while shutting down streaming COPY: ERROR:  could not find record while 
> sending logically-decoded data: missing contrecord at 0/1D3B710

Right, so that might fit the case described in my email above:
logical_read_xlog_page() notices that it has been asked to shut down
when it is between reads of pages with a spanning contrecord.  Before,
it would fail silently, so XLogReadRecord() returns NULL without
setting *errmsg, but now it complains about a missing contrecord.  In
the case where it was showing up on that other thread, just a few
machines often seemed to log that error when shutting down --
peripatus for example -- I don't know why, but I assume something to
do with shutdown timing and page alignment.

> It could entirely be caused by postmaster slowly killing processes after the
> assertion failure and that that is corrupting shared memory state though. But
> it might also be related.

Hmm.




Re: Direct I/O

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-10 14:26:20 +0700, John Naylor wrote:
> On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro  wrote:
> 
> > Memory alignment patches:
> >
> > Direct I/O generally needs to be done to/from VM page-aligned
> > addresses, but only "standard" 4KB pages, even when larger VM pages
> > are in use (if there is an exotic system where that isn't true, it
> > won't work).  We need to deal with buffers on the stack, the heap and
> > in shmem.  For the stack, see patch 0001.  For the heap and shared
> > memory, see patch 0002, but David Rowley is going to propose that part
> > separately, as MemoryContext API adjustments are a specialised enough
> > topic to deserve another thread; here I include a copy as a
> > dependency.  The main direct I/O patch is 0003.
> 
> One thing to note: Currently, a request to aset above 8kB must go into a
> dedicated block. Not sure if it's a coincidence that that matches the
> default PG page size, but if allocating pages on the heap is hot enough,
> maybe we should consider raising that limit. Although then, aligned-to-4kB
> requests would result in 16kB chunks requested unless a different allocator
> was used.

With one exception, there's only a small number of places that allocate pages
dynamically and we only do it for a small number of buffers. So I don't think
we should worry too much about this for now.

The one exception to this: GetLocalBufferStorage(). But it already batches
memory allocations by increasing sizes, so I think we're good as well.

Greetings,

Andres Freund




Re: Assertion failure in SnapBuildInitialSnapshot()

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-14 17:25:31 -0800, Andres Freund wrote:
> Hm, also, shouldn't the patch adding CRS_USE_SNAPSHOT have copied more of
> SnapBuildExportSnapshot()? Why aren't the error checks for
> SnapBuildExportSnapshot() needed? Why don't we need to set XactReadOnly? Which
> transactions are we even in when we import the snapshot (cf.
> SnapBuildExportSnapshot() doing a StartTransactionCommand()).

Most of the checks for that are in CreateReplicationSlot() - but not all;
e.g. XactReadOnly isn't set, nor do we enforce in an obvious place that we
don't already hold a snapshot.

I first thought this might directly explain the problem, due to the
MyProc->xmin assignment in SnapBuildInitialSnapshot() overwriting a value that
could influence the return value for GetOldestSafeDecodingTransactionId(). But
that happens later, and we check that MyProc->xmin is invalid at the start.

But it still seems suspicious. This will e.g. influence whether
StartupDecodingContext() sets PROC_IN_LOGICAL_DECODING. Which probably is
fine, but...


Another theory: I dimly remember Thomas mentioning that there's some different
behaviour of xlogreader during shutdown as part of the v15 changes. I don't
quite remember what the scenario leading up to that was. Thomas?


It's certainly interesting that we see stuff like:

2022-11-08 00:20:23.255 GMT [2012][walsender] 
[pg_16400_sync_16395_7163433409941550636][8/0:0] ERROR:  could not find record 
while sending logically-decoded data: missing contrecord at 0/1D3B710
2022-11-08 00:20:23.255 GMT [2012][walsender] 
[pg_16400_sync_16395_7163433409941550636][8/0:0] STATEMENT:  START_REPLICATION 
SLOT "pg_16400_sync_16395_7163433409941550636" LOGICAL 0/1D2B650 (proto_version 
'3', origin 'any', publication_names '"testpub"')
ERROR:  could not find record while sending logically-decoded data: missing 
contrecord at 0/1D3B710
2022-11-08 00:20:23.255 GMT [248][logical replication worker] ERROR:  error 
while shutting down streaming COPY: ERROR:  could not find record while sending 
logically-decoded data: missing contrecord at 0/1D3B710

It could entirely be caused by postmaster slowly killing processes after the
assertion failure and that that is corrupting shared memory state though. But
it might also be related.


Greetings,

Andres Freund




Re: List of Bitmapset (was Re: ExecRTCheckPerms() and many prunable partitions)

2022-11-14 Thread Amit Langote
 On Mon, Nov 14, 2022 at 11:57 PM Tom Lane  wrote:
> Amit Langote  writes:
> > On Sat, Nov 12, 2022 at 1:46 AM Tom Lane  wrote:
> >> The main thing I was wondering about in connection with that
> >> was whether to assume that there could be other future applications
> >> of the logic to perform multi-bitmapset union, intersection,
> >> etc.  If so, then I'd be inclined to choose different naming and
> >> put those functions in or near to bitmapset.c.  It doesn't look
> >> like Amit's code needs anything like that, but maybe somebody
> >> has an idea about other applications?
>
> > Yes, simple storage of multiple Bitmapsets in a List somewhere in a
> > parse/plan tree sounded like it would have wide enough use to add
> > proper node support for.  Assuming you mean trying to generalize
> > VarAttnoSet in your patch 0004 posted at [2], I wonder if you want to
> > somehow make its indexability by varno / RT index a part of the
> > interface of the generic code you're thinking of for it?
>
> For discussion's sake, here's my current version of that 0004 patch,
> rewritten to use list-of-bitmapset as the data structure.  (This
> could actually be pulled out of the outer-join-vars patchset and
> committed independently, just as a minor performance improvement.
> It doesn't quite apply cleanly to HEAD, but pretty close.)
>
> As it stands, the new functions are still in util/clauses.c, but
> if we think they could be of general use it'd make sense to move them
> either to nodes/bitmapset.c or to some new file under backend/nodes.

These multi_bms_* functions sound generic enough to me, so +1 to put
them in nodes/bitmapset.c.  Or even a new file if the API should
involve a dedicated struct enveloping the List as you write below.

> Some other thoughts:
>
> * The multi_bms prefix is a bit wordy, so I was thinking of shortening
> the function names to mbms_xxx.  Maybe that's too brief.

FWIW, multi_bms_* naming sounds fine to me.

> * This is a pretty short list of functions so far.  I'm not eager
> to build out a bunch of dead code though.  Is it OK to leave it
> with just this much functionality until someone needs more?

+1

> * I'm a little hesitant about whether the API actually should be
> List-of-Bitmapset, or some dedicated struct as I had in the previous
> version of 0004.  This version is way less invasive in prepjointree.c
> than that was, but the reason is there's ambiguity about what the
> forced_null_vars Lists actually contain, which feels error-prone.

Are you thinking of something like a MultiBitmapset that wraps the
multi_bms List?  That sounds fine to me.  Another option is to make
the generic API be List-of-Bitmapset but keep VarAttnoSet in
prepjointree.c and put the List in it.  IMHO, VarAttnoSet is
definitely more self-documenting for that patch's purposes.

+ * The new member is identified by the zero-based index of the List
+ * element it should go into, and the bit number to be set therein.
+ */
+List *
+multi_bms_add_member(List *mbms, int index1, int index2)

The comment sounds a bit ambiguous, especially the ", and the bit
number to be set therein." part.  If you meant to describe the
arguments, how about mentioning their names too, as in:

The new member is identified by 'index1', the zero-based index of the
List element it should go into, and 'index2' specifies the bit number
to be set therein.

+   /* Add empty elements to a, as needed */
+   while (list_length(a) < list_length(b))
+   a = lappend(a, NULL);
+   /* forboth will stop at the end of the shorter list, which is fine */

Isn't this comment unnecessary given that the while loop makes both
lists be the same length?
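
For readers following along, an add-member helper matching the quoted
comment could look roughly like this (a sketch only, not the actual patch;
the padding-with-NULL behaviour just mirrors the quoted fragments):

/*
 * Sketch: add a member to a List of Bitmapsets, where 'index1' picks the
 * list element and 'index2' is the bit to set in that element.
 */
List *
multi_bms_add_member(List *mbms, int index1, int index2)
{
	ListCell   *lc;

	/* Grow the list with empty (NULL) elements until index1 is valid. */
	while (list_length(mbms) <= index1)
		mbms = lappend(mbms, NULL);

	lc = list_nth_cell(mbms, index1);
	lfirst(lc) = bms_add_member((Bitmapset *) lfirst(lc), index2);

	return mbms;
}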

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: [BUG] Logical replica crash if there was an error in a function.

2022-11-14 Thread Tom Lane
"Anton A. Melnikov"  writes:
> On 02.11.2022 21:02, Tom Lane wrote:
>> I don't think the cost of that test case is justified by the tiny
>> probability that it'd ever catch anything.

> Seems it is possible to do a test that addresses these remarks. The attached
> test uses existing nodes and checks for the specific error message.

My opinion remains unchanged: this would be a very poor use of test
cycles.

> Additionally,
> I've tried to reduce the overall number of nodes previously
> used in this test in a similar way.

Optimizing existing tests isn't an answer to that.  We could
install those optimizations without adding a new test case.

regards, tom lane




Re: [BUG] Logical replica crash if there was an error in a function.

2022-11-14 Thread Anton A. Melnikov

Hello!

On 02.11.2022 21:02, Tom Lane wrote:

So I'm now good with the idea of just not failing.  I don't like
the patch as presented though.  First, the cfbot is quite rightly
complaining about the "uninitialized variable" warning it draws.
Second, I don't see a good reason to tie the change to logical
replication in any way.  Let's just change the Assert to an if(),
as attached.


Fully agreed, that is the optimal solution for this case. Thanks!
Surely it's a very rare one, but there was a real segfault on a production server.
Someone came up with the idea of modifying a function like public.test_selector()
in repcmd.sql (see above) on the fly, adding to it a :last_modification:
field from the current time and some other parameters with the help of replace()
inside the creation of the rebuild_test() procedure.

On 03.11.2022 18:29, Tom Lane wrote:

Amit Kapila  writes:

LGTM. I don't know if it is a good idea to omit the test case for this
scenario. If required, we can reuse the test case from Sawada-San's
patch in the email [1].


I don't think the cost of that test case is justified by the tiny
probability that it'd ever catch anything.  If we were just adding a
query or two to an existing scenario, that could be okay; but spinning
up and syncing a whole new primary and standby database is *expensive*
when you multiply it by the number of times developers and buildfarm
animals are going to run the tests in the future.

There's also the little issue that I'm not sure it would actually
detect a problem if we had one.  The case is going to fail, and
what we want to know is just how messily it fails, and I think the
TAP infrastructure isn't very sensitive to that ... especially
if the test isn't even checking for specific error messages.

Seems it is possible to do a test that addresses these remarks. The attached
test uses existing nodes and checks for the specific error message. Additionally,
I've tried to reduce the overall number of nodes previously
used in this test in a similar way.

Would be glad for comments and remarks.

With best wishes,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 722fa6d8c629eb83b3d11ea88b49bab1f700b48d
Author: Anton A. Melnikov 
Date:   Mon Nov 14 08:30:26 2022 +0300

Add test for a syntax error in a function in a logical replication
worker and combine some test nodes.

diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 6247aa7730..76a6c12cae 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -69,6 +69,9 @@ $node_publisher->wait_for_catchup('sub1');
 
 pass('index predicates do not cause crash');
 
+$node_publisher->safe_psql('postgres',
+	"CHECKPOINT");
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
@@ -81,9 +84,8 @@ $node_subscriber->stop('fast');
 # identity set before accepting updates.  If it did not it would cause
 # an error when an update was attempted.
 
-$node_publisher = PostgreSQL::Test::Cluster->new('publisher2');
-$node_publisher->init(allows_streaming => 'logical');
-$node_publisher->start;
+$node_publisher->rotate_logfile();
+$node_publisher->start();
 
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub FOR ALL TABLES");
@@ -102,6 +104,9 @@ is( $node_publisher->psql(
 	'update to unlogged table without replica identity with FOR ALL TABLES publication'
 );
 
+$node_publisher->safe_psql('postgres',
+	"CHECKPOINT");
+
 $node_publisher->stop('fast');
 
 # Bug #16643 - https://postgr.es/m/16643-eaadeb2a1a58d...@postgresql.org
@@ -226,13 +231,13 @@ $node_sub->stop('fast');
 # target table's relcache was not being invalidated. This leads to skipping
 # UPDATE/DELETE operations during apply on the subscriber side as the columns
 # required to search corresponding rows won't get logged.
-$node_publisher = PostgreSQL::Test::Cluster->new('publisher3');
-$node_publisher->init(allows_streaming => 'logical');
-$node_publisher->start;
 
-$node_subscriber = PostgreSQL::Test::Cluster->new('subscriber3');
-$node_subscriber->init(allows_streaming => 'logical');
-$node_subscriber->start;
+$node_publisher->rotate_logfile();
+$node_publisher->start();
+
+$node_subscriber->rotate_logfile();
+$node_subscriber->rotate_logfile(); # call twice to synchronize lognames between master and replica
+$node_subscriber->start();
 
 $node_publisher->safe_psql('postgres',
 	"CREATE TABLE tab_replidentity_index(a int not null, b int not null)");
@@ -296,7 +301,89 @@ is( $node_subscriber->safe_psql(
 	qq(-1|1),
 	"update works with REPLICA IDENTITY");
 
+$node_publisher->safe_psql('postgres',
+	"CHECKPOINT");
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
+# https://www.postgresql.org/message-id/flat/adf0452f-8c6b-7def-d35e-ab516c80088e%40inbox.ru
+
+# The bug was that when a syntax error occurred in a SQL-language or PL/pgSQL-language
+# CREATE FUNCTION or DO command executed in a logical replication 

Re: Assertion failure in SnapBuildInitialSnapshot()

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-10 16:04:40 +0530, Amit Kapila wrote:
> I don't have any good ideas on how to proceed with this. Any thoughts
> on this would be helpful?

One thing worth doing might be to convert the assertion path into an elog(),
mentioning both xids (or add a framework for things like AssertLT(), but that
seems hard). With the concrete values we could make a better guess at what's
going wrong.
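
Concretely, that could look roughly like the following sketch against the
current assertion in SnapBuildInitialSnapshot() (a sketch only; the exact
message and condition would need checking):

	safeXid = GetOldestSafeDecodingTransactionId(false);

	/* Sketch: report both xids instead of just asserting. */
	if (TransactionIdFollows(safeXid, snap->xmin))
		elog(ERROR, "cannot build an initial slot snapshot as oldest safe xid %u follows snapshot's xmin %u",
			 safeXid, snap->xmin);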

It'd probably not hurt to just perform this check independent of
USE_ASSERT_CHECKING - compared to the cost of creating a slot it's negligible.

That'll obviously only help us whenever we re-encounter the issue, which will
likely be a while...


Have you tried reproducing the issue by running the test in a loop?


One thing I noticed just now is that we don't assert
builder->building_full_snapshot==true. I think we should? That didn't use to
be an option, but now it is... It doesn't look to me like that's the issue,
but ...

Hm, also, shouldn't the patch adding CRS_USE_SNAPSHOT have copied more of
SnapBuildExportSnapshot()? Why aren't the error checks for
SnapBuildExportSnapshot() needed? Why don't we need to set XactReadOnly? Which
transactions are we even in when we import the snapshot (cf.
SnapBuildExportSnapshot() doing a StartTransactionCommand()).

I'm also somewhat suspicious of calling RestoreTransactionSnapshot() with
source=MyProc. Looks like it works, but it'd be pretty easy to screw up, and
there's no comments in SetTransactionSnapshot() or
ProcArrayInstallRestoredXmin() warning that that might be the case.

Greetings,

Andres Freund




Re: A new strategy for pull-up correlated ANY_SUBLINK

2022-11-14 Thread Richard Guo
On Sun, Nov 13, 2022 at 6:45 AM Tom Lane  wrote:

> Looking again at that contain_vars_of_level restriction, I think the
> reason for it was just to avoid making a FROM subquery that has outer
> references, and the reason we needed to avoid that was merely that we
> didn't have LATERAL at the time.  So I experimented with the attached.
> It seems to work, in that we don't get wrong answers from any of the
> small number of places that are affected.  (I wonder though whether
> those test cases still test what they were intended to, particularly
> the postgres_fdw one.  We might have to try to hack them some more
> to not get affected by this optimization.)  Could do with more test
> cases, no doubt.


Hmm, it seems there were discussions about this change before, such as
in [1].


> One thing I'm not at all clear about is whether we need to restrict
> the optimization so that it doesn't occur if the subquery contains
> outer references falling outside available_rels.  I think that that
> case is covered by is_simple_subquery() deciding later to not pull up
> the subquery based on LATERAL restrictions, but maybe that misses
> something.


I think we need to do this, otherwise we'd encounter the problem
described in [2].  In short, the problem is that the constraints imposed
by LATERAL references may make us fail to find any legal join order.  As
an example, consider

explain select * from A where exists
(select * from B where A.i in (select C.i from C where C.j = B.j));
ERROR:  failed to build any 3-way joins

[1]
https://www.postgresql.org/message-id/flat/CAN_9JTx7N%2BCxEQLnu_uHxx%2BEscSgxLLuNgaZT6Sjvdpt7toy3w%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/CAMbWs49cvkF9akbomz_fCCKS=D5TY=4kgheqcfhpzcxs1gv...@mail.gmail.com

Thanks
Richard


Re: Avoid overhead open-close indexes (catalog updates)

2022-11-14 Thread Michael Paquier
On Sat, Nov 12, 2022 at 11:03:46AM -0300, Ranier Vilela wrote:
> I think complexity doesn't pay off.
> For example, CopyStatistics not knowing how many tuples will be processed.
> IMHO, this step is right now.
> CatalogTupleInsertWithInfo offers considerable improvement without
> introducing bugs and maintenance issues.

Considerable may be a bit of an overstatement?  I can see a difference in
profiles when switching from one to the other in some extreme cases,
but for the REINDEX CONCURRENTLY case most of the runtime is going to
be eaten in the wait phases, the index build and its validation.

Anyway, multi-inserts are going to be a better solution than
CatalogTupleInsertWithInfo() in some cases, because we would just
generate one WAL record of N inserts rather than N records with one
INSERT each.

Looking closely, EnumValuesCreate() is a DDL path but I'd like to
think that at least two enum values are present at creation in most
cases.  AddRoleMems() becomes relevant when using more than one role,
which is a less common pattern, so I'd be fine with switching to a
single index-opening approach with CatalogTupleUpdateWithInfo() as you
suggest without the tuple slot management.  CopyStatistics() does not
know in advance the number of tuples it would insert, and it would be
a gain when there are more than 2 expressions with entries in
pg_statistic as of HEAD.  Perhaps you're right with your simple
suggestion to stick with CatalogTupleUpdateWithInfo() in this case.
Maybe there is some external code calling this routine for tables, who
knows.

update_attstats() is actually an area that cannot be changed now that
I look at it, as we may end up updating only some entries, so the slot
approach will not be relevant, but using CatalogTupleUpdateWithInfo()
is.  (As a matter of fact, the regression test suite reports that
update_attstats() is called for one attribute 10% of the time; I did not
check the insert/update rate though.)

Would you like to give a try with the tuple slot management in
EnumValuesCreate()?
--
Michael


signature.asc
Description: PGP signature


Re: Collation version tracking for macOS

2022-11-14 Thread Jeff Davis
I looked at v6.

  * We'll need some clearer instructions on how to build/install extra
ICU versions that might not be provided by the distribution packaging.
For instance, I got a cryptic error until I used --enable-rpath, which
might not be obvious to all users.
  * Can we have a better error when the library was built with --
disable-renaming? We can just search for the plain (no suffix) symbol.
  * We should use dlerror() instead of %m to report dlopen() errors.
  * It seems like the collation version is just there to issue WARNINGs
when a user is using the non-versioned locale syntax and the library
changes underneath them (or if there is collation version change within
a single ICU major version)?
  * How are you testing this?
  * In my tests (sort, hacked so abbreviate is always false), I see a
~3% regression for ICU+UTF8. That's fine with me. I assume it's due to
the indirect function call, but that's not obvious to me from the
profile. If it's a major problem we could have a special case of
varstrfastcmp_locale() that works on the compile-time ICU version.

I realize your patch is experimental, but when there is a better
consensus on the approach, we should consider adding declarative syntax
such as:

   CREATE COLLATION (or LOCALE?) PROVIDER icu67
     TYPE icu VERSION '67' AS '/path/to/icui18n.so.67';

It will offer more opportunities to catch errors early and offer better
error messages. It would also enable it to function if the library is
built with --disable-renaming (though we'd have to trust the user).

On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:
> Problem 1:  Suppose you're ready to start using (say) v72.  I guess
> you'd use the REFRESH command, which would open the main linked ICU's
> collversion and stamp that into the catalogue, at which point new
> sessions would start using that, and then you'd have to rebuild all
> your indexes (with no help from PG to tell you how to find everything
> that needs to be rebuilt, as belaboured in previous reverted work).
> Aside from the possibility of getting the rebuilding job wrong (as
> belaboured elsewhere), it's not great, because there is still a
> transitional period where you can be using the wrong version for your
> data.  So this requires some careful planning and understanding from
> the administrator.

How is this related to the search-by-collversion design? It seems like
it's hard no matter what.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Meson doesn't define HAVE_LOCALE_T for mscv

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-10 10:59:41 +0100, Juan José Santamaría Flecha wrote:
> Meson doesn't see the redefinition of locale_t done
> in src/include/port/win32_port.h, so it is not defining
> HAVE_LOCALE_T, HAVE_WCSTOMBS_L nor HAVE_MBSTOWCS_L as the
> current src/tools/msvc/build.pl script does.
> 
> Please find attached a patch for this.

Hm. Is it right that the changes are only done for msvc? win32_port.h defines
the types for mingw as well afaict.

Greetings,

Andres Freund




Re: Multitable insert syntax support on Postgres?

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-14 21:06:09 -0300, Alexandre hadjinlian guerra wrote:
> Hello
> Are there any plans to incorporate formal syntax for multitable/conditional
> INSERT, similar to the syntax below? Snowflake does have the same feature.
> 
> https://oracle-base.com/articles/9i/multitable-inserts
> 
> Today, I'm resorting to a function that receives the necessary parameters
> from the attribute definition/selection area in an SQL SELECT query, called
> for each tuple retrieved. A proper syntax would be really cool.

I only skimmed that link, but afaict most of this can be done today in
postgres, with slightly different syntax, using CTEs. Postgres implements
this as an extension to the standard CTE syntax.

WITH data_src AS (SELECT * FROM source_tbl),
  insert_a AS (INSERT INTO a SELECT * FROM data_src WHERE d < 5),
  insert_b AS (INSERT INTO b SELECT * FROM data_src WHERE d >= 5)
INSERT INTO c SELECT * FROM data_src WHERE d < 5

It's a bit annoying that the last "arm" of the insert looks a bit
different. OTOH, it's a much more general syntax.
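
For completeness, here is a self-contained version of the same pattern
(table names are made up; the routing conditions just mirror the example
above):

CREATE TABLE source_tbl (id int, d int);
CREATE TABLE a (LIKE source_tbl);
CREATE TABLE b (LIKE source_tbl);
CREATE TABLE c (LIKE source_tbl);

INSERT INTO source_tbl SELECT g, g FROM generate_series(1, 10) AS g;

WITH data_src AS (SELECT * FROM source_tbl),
  insert_a AS (INSERT INTO a SELECT * FROM data_src WHERE d < 5),
  insert_b AS (INSERT INTO b SELECT * FROM data_src WHERE d >= 5)
INSERT INTO c SELECT * FROM data_src WHERE d < 5;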

Greetings,

Andres Freund




Re: Error on missing Python module in Meson setup

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-14 14:23:02 +0100, Daniel Gustafsson wrote:
> When setting up a postgres tree with Meson on an almost empty Debian 11 VM I
> hit an error on "meson setup -Ddebug=true build ." like this:
> 
> Program python3 found: YES (/usr/bin/python3)
> meson.build:987:2: ERROR: Unknown method "dependency" in object.
> 
> The error in itself isn't terribly self-explanatory.  According to the log the
> error was a missing Python package:

> Traceback (most recent call last):
> File "<string>", line 20, in <module>
> File "<string>", line 8, in links_against_libpython
> ModuleNotFoundError: No module named 'distutils.core'
> 
> Installing the distutils package fixes it, but it seems harsh to fail setup on
> a missing package. Would something like the attached make sense?

The error is a bit better in newer versions of meson:
meson.build:986: WARNING: <ExternalProgram 'python3' -> ['/usr/bin/python3']> is not a valid python or it is missing distutils
but we do still error out weirdly afterwards.

We probably should report this to the meson folks regardless of us working
around it or not.


> diff --git a/meson.build b/meson.build
> index 058382046e..1a7e301fc9 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -984,8 +984,12 @@ pyopt = get_option('plpython')
>  if not pyopt.disabled()
>pm = import('python')
>python3_inst = pm.find_installation(required: pyopt.enabled())
> -  python3_dep = python3_inst.dependency(embed: true, required: pyopt.enabled())
> -  if not cc.check_header('Python.h', dependencies: python3_dep, required: pyopt.enabled())
> +  if python3_inst.found()
> +    python3_dep = python3_inst.dependency(embed: true, required: pyopt.enabled())
> +    if not cc.check_header('Python.h', dependencies: python3_dep, required: pyopt.enabled())
> +  python3_dep = not_found_dep
> +endif
> +  else
>  python3_dep = not_found_dep
>endif
>  else

Perhaps worth simplifying a bit. What do you think about:


pyopt = get_option('plpython')
python3_dep = not_found_dep
if not pyopt.disabled()
  pm = import('python')
  python3_inst = pm.find_installation(required: pyopt.enabled())
  if python3_inst.found()
    python3_dep_int = python3_inst.dependency(embed: true, required: pyopt.enabled())
    if cc.check_header('Python.h', dependencies: python3_dep_int, required: pyopt.enabled())
      python3_dep = python3_dep_int
    endif
  endif
endif


Greetings,

Andres Freund




Multitable insert syntax support on Postgres?

2022-11-14 Thread Alexandre hadjinlian guerra
Hello
Are there any plans to incorporate formal syntax for multitable/conditional
INSERT, similar to the syntax below? Snowflake does have the same feature.

https://oracle-base.com/articles/9i/multitable-inserts

Today, I'm resorting to a function that receives the necessary parameters
from the attribute definition/selection area in an SQL SELECT query, called
for each tuple retrieved. A proper syntax would be really cool.

Thanks!


Re: pg_basebackup's --gzip switch misbehaves

2022-11-14 Thread Michael Paquier
On Mon, Nov 14, 2022 at 03:27:14PM +0100, Daniel Gustafsson wrote:
> Ugh, yes, that's what it should say.

A split sounds fine by me.  On top of what Tom has mentioned, I have
spotted two small-ish things.

-This module is available from CPAN or an operating system package.
+This module is available from
+https://metacpan.org/release/IPC-Run;>CPAN
+or an operating system package.

It looks like there is a second one in install-windows.sgml.

+Many operations in the test suites use a 180 second timeout, which on slow
Nit: s/180 second/180-second/?
--
Michael


signature.asc
Description: PGP signature


Re: Support logical replication of DDLs

2022-11-14 Thread Peter Smith
Here are some review comments for v32-0002

==

1. Commit message

Comment says:
While creating a publication, we register a command end
trigger that deparses the DDL as a JSON blob, and WAL logs it. The event
trigger is automatically removed at the time of drop publication.

SUGGESTION (uppercase the SQL)
During CREATE PUBLICATION we register a command end trigger that
deparses the DDL as a JSON blob, and WAL logs it. The event
trigger is automatically removed at the time of DROP PUBLICATION.

~~~

2.

Comment says:
This is a POC patch to show how using event triggers and DDL deparsing
facilities we can implement DDL replication. So, the implementation is
restricted to CREATE TABLE/ALTER TABLE/DROP TABLE commands.

~

Still correct or old comment gone stale?

~~~

3.

Comment says:
Note that the replication for ALTER INDEX command is still under
progress.

~

Still correct or old comment gone stale?

==

4. GENERAL - Patch order.

Somehow, I feel this v32-0002 patch and the v32-0001 patch should be
swapped. IIUC this one seems to me to be the "core" framework for the
DDL message replication but the other 0001 was more like just the
implements of all the supported different *kinds* of DDL JSON blobs.
So actually this patch seems more like the mandatory one and the other
one can just evolve as it gets more supported JSON.

~~~

5. GENERAL - naming

The DDL suffix 'msg' or 'message' seemed sometimes unnecessary because
there is no ambiguity that this is a message for DDL replication, so
the shorter name conveys the same amount of information, doesn't it?

e.g. Maybe reconsider some of these ones (probably there are others)...

src/include/replication/decode.h
logicalddlmsg_decode -> Why not call this function logicalddl_decode?

src/include/replication/logicalproto.h:
LOGICAL_REP_MSG_DDLMESSAGE -> Why not call it 'LOGICAL_REP_MSG_DDL'?
logicalrep_write_ddlmessage -> Why not call this function logicalrep_write_ddl?
logicalrep_read_ddlmessage -> Why not call this function logicalrep_read_ddl?

src/include/replication/output_plugin.h:
'ddlmessage_cb' -> Why not call it 'ddl_cb'?
'stream_ddlmessage_cb' -> Why not call it 'stream_ddl_cb'?

src/include/replication/reorderbuffer.h:
- 'REORDER_BUFFER_CHANGE_DDLMESSAGE' --> Why not call it 'REORDER_BUFFER_CHANGE_DDL'?
- 'ddlmsg' -> Why not call it 'ddl'?
- 'ddlmessage' -> Why not call it 'ddl'?
- 'stream_ddlmessage' -> Why not call it 'stream_ddl'?

==

src/backend/access/rmgrdesc/Makefile

6.

@@ -19,6 +19,7 @@ OBJS = \
  hashdesc.o \
  heapdesc.o \
  logicalmsgdesc.o \
+ logicalddlmsgdesc.o \

Change should be in alphabetical order.

==

src/backend/access/rmgrdesc/logicalddlmsgdesc.c

7. logicalddlmsg_identify

+const char *
+logicalddlmsg_identify(uint8 info)
+{
+ if ((info & ~XLR_INFO_MASK) == XLOG_LOGICAL_DDL_MESSAGE)
+ return "DDL MESSAGE";
+
+ return NULL;
+}

The logicalrep_message_type (see below) said "DDL", so maybe this
should also just say "DDL" instead of "DDL MESSAGE"

@@ -1218,6 +1264,8 @@ logicalrep_message_type(LogicalRepMsgType action)
  return "TYPE";
  case LOGICAL_REP_MSG_MESSAGE:
  return "MESSAGE";
+ case LOGICAL_REP_MSG_DDLMESSAGE:
+ return "DDL";

==

src/backend/commands/event_trigger.c

8. start/end

+/*
+ * publication_deparse_ddl_command_start
+ *
+ * Deparse the ddl command and log it.
+ */
+Datum
+publication_deparse_ddl_command_start(PG_FUNCTION_ARGS)
...
+/*
+ * publication_deparse_ddl_command_end
+ *
+ * Deparse the ddl command and log it.
+ */
+Datum
+publication_deparse_ddl_command_end(PG_FUNCTION_ARGS)

The start/end function comments are the same -- there should be some
more explanation to say what they are for.

~~~

9. publication_deparse_ddl_command_start

+ char*command = psprintf("Drop table command start");

Huh? So this function is only for this specific case of DROP TABLE? If
correct, then I think that should be commented on or asserted
somewhere.

~

10.

+ /* extract the relid from the parse tree */
+ foreach(cell1, stmt->objects)

Uppercase comment

~

11.

+ if (relpersist == RELPERSISTENCE_TEMP)
+ {
+ table_close(relation, NoLock);
+ continue;
+ }
+
+ LogLogicalDDLMessage("deparse", address.objectId, DCT_TableDropStart,
+ command, strlen(command) + 1);
+
+ if (relation)
+ table_close(relation, NoLock);

This code looks overly complex. Can't it just be like below?

SUGGESTION

if (relpersist != RELPERSISTENCE_TEMP)
LogLogicalDDLMessage("deparse", address.objectId, DCT_TableDropStart,
command, strlen(command) + 1);

if (relation)
table_close(relation, NoLock);

~~~

12. publication_deparse_table_rewrite

+ if (relpersist == RELPERSISTENCE_TEMP)
+ return PointerGetDatum(NULL);
+
+ /* Deparse the DDL command and WAL log it to allow decoding of the same. */
+ json_string = deparse_utility_command(cmd, false);
+
+ if (json_string != NULL)
+ LogLogicalDDLMessage("deparse", cmd->d.alterTable.objectId, DCT_TableAlter,
+ json_string, strlen(json_string) + 1);
+
+ return PointerGetDatum(NULL);

Similar to 

Re: [RFC] building postgres with meson - v12

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-14 17:16:46 -0600, Justin Pryzby wrote:
> On Sun, Aug 28, 2022 at 01:37:41PM -0700, Andres Freund wrote:
> > > You're running tap tests via a python script.  There's no problem with
> > > that, but it's different from what's done by the existing makefiles.
> > > I was able to remove the python indirection - maybe that's better to
> > > talk about on the CI thread?  That moves some setup for TAP tests
> > > (TESTDIR, PATH, cd) from Makefile into the existing perl, which means
> > > less duplication.
> > 
> > I'm doubtful it's worth removing. You'd need to move removing the files from
> > the last run into both pg_regress and the tap test infrastructure. And I do
> > think it's nice to afterwards have markers which tests failed, so we can 
> > only
> > collect their logs.
> 
> Are you planning on putting something in place to remove (or allow
> removing) logs for successful tests ?  Is that primarily for cirrus, or
> buildfarm or ??

What I'd like to do is to add a 'collect-logs-for-failed-tests' script and/or
target that moves those logs into a different folder. By default we'd then
collect all the files from that different folder in CI.  I think that's better
than removing logs for successful tests.

I'd like to use the same script for the BF as well - we've had too many cases
where we had to adjust things in multiple places / code-bases.

Perhaps we could also use that script to print the list of relevant logfiles at
the end of a "local" testrun?


> It is wasteful to upload thousands of logfiles to show a single
> failure.  That would make our cirrus tasks faster - compressing and
> uploading the logs takes over a minute.
>
> It's also a lot friendlier to show fewer than 8 pages of test folders to
> search through to find the one that failed.

Indeed.

Greetings,

Andres Freund




Re: meson oddities

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-15 08:22:59 +0900, Michael Paquier wrote:
> I pass down some custom CFLAGS to the meson command as well, and these find
> their way to LDFLAGS on top of CFLAGS for the user-defined entries.  I would
> not have expected that, either.

We effectively do that with autoconf as well, except that we don't mention
that in pg_config --ldflags. Our linking rules include CFLAGS, see e.g.:

%: %.o
$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)

postgres: $(OBJS)
$(CC) $(CFLAGS) $(call expand_subsys,$^) $(LDFLAGS) $(LDFLAGS_EX) 
$(export_dynamic) $(LIBS) -o $@

ifdef PROGRAM
$(PROGRAM): $(OBJS)
$(CC) $(CFLAGS) $(OBJS) $(PG_LIBS_INTERNAL) $(LDFLAGS) $(LDFLAGS_EX) 
$(PG_LIBS) $(LIBS) -o $@$(X)
endif

# Rule for building a shared library from a single .o file
%.so: %.o
$(CC) $(CFLAGS) $< $(LDFLAGS) $(LDFLAGS_SL) -shared -o $@


Should we do the same in pg_config in the meson build as well?


Meson automatically includes compiler flags during linking because a)
apparently many dependencies (.pc files etc) specify linker flags in CFLAGS b)
at least some kinds of LTO requires compiler flags being present during
"linking".

Greetings,

Andres Freund




Re: Allow file inclusion in pg_hba and pg_ident files

2022-11-14 Thread Michael Paquier
On Mon, Nov 14, 2022 at 03:47:27PM +0800, Julien Rouhaud wrote:
> On Mon, Nov 14, 2022 at 02:40:37PM +0900, Michael Paquier wrote:
>> If the caller passes NULL for *linectx as the initial line context,
>> just create it as we do now.  If *linectx is not NULL, just reuse it.
>> That may be cleaner than returning the created MemoryContext as
>> returned result from tokenize_auth_file().
> 
> I originally used two functions as it's a common pattern (e.g. ReadBuffer /
> ReadBufferExtended or all the *_internal versions of functions) and to avoid
> unnecessary branch, but I agree that here having an extra branch is unlikely 
> to
> make any measurable difference.  It would only matter on -DEXEC_BACKEND /
> win32, and in that case the extra overhead (over an already expensive new
> backend mechanism) is way more than that, so +1 to keep things simple here.

Oh, Okay.  So that was your intention here.

>> This aggregates all the error messages for all the files included in a
>> given repository.  As the patch stands, it seems to me that we would
>> get errors related to an include_dir clause for two cases:
>> - The specified path does not exist, in which case we have only one
>> err_msg to consume and report back.
>> - Multiple failures in opening and/or reading included files.
>> In the second case, aggregating the reports would provide a full set
>> of information, but that's not something a user would be able to act
>> on directly as this is system-related.  Or there is a case to know a
>> full list of files in the case of multiple files that cannot be read
>> because of permission issues?  We may be fine with just the first
>> system error here.  Note that in the case where files can be read and
>> opened, these would have their own TokenizedAuthLines for each line
>> parsed, meaning one line in the SQL views once translated to an
>> HbaLine or an IdentLine.
> 
> If you have an include_dir directive and multiple files have wrong permission
> (or maybe broken symlink or something like that), you will get multiple errors
> when trying to process that single directive.  I think it's friendlier to
> report as much detail as we can, so users can make sure they fix everything
> rather than iterating over the first error.  That's especially helpful if the
> fix is done in some external tooling (puppet or whatever) rather than directly
> on the server.

Hmm, Okay.  Well, this would include only errors on I/O or permission
problems for existing files.  I have not seen deployments that have
dozens of sub-files for GUCs with an included dir, but perhaps I lack
experience with such user stories.

> I think that the problem is that we have the same interface for processing the
> files on a startup/reload and for filling the views, so if we want the views 
> to
> be helpful and report all errors we have to also allow a bogus "include" to
> continue in the reload case too.  The same problem doesn't exists for GUCs, so
> a slightly different behavior there might be acceptable.

Well, there is as well the point that we don't have yet a view for
GUCs that does the equivalent of HBA and ident, but that does not
mean, it seems to me, that there should be an inconsistency in the way
we process those commands because one has implemented a feature but
not the other.  On the contrary, I'd rather try to make them
consistent.  As things are in the patch, the only difference between
"include_if_exists" and "include" is that the latter would report some
information if the file goes missing, while the former generates a LOG entry
about the file being skipped without recording anything in the system view.
Now, wouldn't it be useful for the end-user to report that a file is
skipped as an effect of "include_if_exists" in the system views?  If
not, then what's the point of having this clause to begin with?  My
opinion is that both clauses are useful, still on the ground of
consistency both clauses should work the same as for GUCs.  Both
should report something in err_msg even if that's just about a file
being skipped, though I agree that this could be considered confusing
as well for an if_exists clause (does not look like a big deal to me
based on the debuggability gain).
--
Michael


signature.asc
Description: PGP signature


Re: predefined role(s) for VACUUM and ANALYZE

2022-11-14 Thread Nathan Bossart
On Fri, Oct 14, 2022 at 07:37:38PM -0400, Corey Huinker wrote:
> Patch applies.
> Passes make check and make check-world.
> Test coverage seems adequate.
> 
> Coding is very clear and very much in the style of the existing code. Any
> quibbles I have with the coding style are ones I have with the overall
> pg-style, and this isn't the forum for that.
> 
> I haven't done any benchmarking yet, but it seems that the main question
> will be the impact on ordinary DML statements.
> 
> I have no opinion about the design debate earlier in this thread, but I do
> think that this patch is ready and adds something concrete to the ongoing
> discussion.

Thanks for taking a look!  Here is a rebased version of the patch set.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From eb6d4f8bb2bb055ec161767bbec741e375314dcb Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 7 Sep 2022 22:25:29 -0700
Subject: [PATCH v8 1/4] Change AclMode from a uint32 to a uint64.

---
 src/backend/nodes/outfuncs.c|  2 +-
 src/bin/pg_upgrade/check.c  | 35 +
 src/include/catalog/pg_type.dat |  4 ++--
 src/include/nodes/parsenodes.h  |  6 +++---
 src/include/utils/acl.h | 28 +-
 5 files changed, 55 insertions(+), 20 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f05e72f0dc..8f150e9a2e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -560,7 +560,7 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
 	WRITE_BOOL_FIELD(lateral);
 	WRITE_BOOL_FIELD(inh);
 	WRITE_BOOL_FIELD(inFromCl);
-	WRITE_UINT_FIELD(requiredPerms);
+	WRITE_UINT64_FIELD(requiredPerms);
 	WRITE_OID_FIELD(checkAsUser);
 	WRITE_BITMAPSET_FIELD(selectedCols);
 	WRITE_BITMAPSET_FIELD(insertedCols);
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index f1bc1e6886..615a53a864 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -28,6 +28,7 @@ static void check_for_incompatible_polymorphics(ClusterInfo *cluster);
 static void check_for_tables_with_oids(ClusterInfo *cluster);
 static void check_for_composite_data_type_usage(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
+static void check_for_aclitem_data_type_usage(ClusterInfo *cluster);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
 static void check_for_new_tablespace_dir(ClusterInfo *new_cluster);
@@ -107,6 +108,13 @@ check_and_dump_old_cluster(bool live_check)
 	check_for_reg_data_type_usage(&old_cluster);
 	check_for_isn_and_int8_passing_mismatch(&old_cluster);
 
+	/*
+	 * PG 16 increased the size of the 'aclitem' type, which breaks the on-disk
+	 * format for existing data.
+	 */
+	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)
+		check_for_aclitem_data_type_usage(&old_cluster);
+
 	/*
 	 * PG 14 changed the function signature of encoding conversion functions.
 	 * Conversions from older versions cannot be upgraded automatically
@@ -1319,6 +1327,33 @@ check_for_reg_data_type_usage(ClusterInfo *cluster)
 		check_ok();
 }
 
+/*
+ * check_for_aclitem_data_type_usage
+ *
+ *	aclitem changed its storage format in 16, so check for it.
+ */
+static void
+check_for_aclitem_data_type_usage(ClusterInfo *cluster)
+{
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for incompatible aclitem data type in user tables");
+
+	snprintf(output_path, sizeof(output_path), "tables_using_aclitem.txt");
+
+	if (check_for_data_type_usage(cluster, "pg_catalog.aclitem", output_path))
+	{
+		pg_log(PG_REPORT, "fatal");
+		pg_fatal("Your installation contains the \"aclitem\" data type in user tables.\n"
+ "The internal format of \"aclitem\" changed in PostgreSQL version 16\n"
+ "so this cluster cannot currently be upgraded.  You can drop the\n"
+ "problem columns and restart the upgrade.  A list of the problem\n"
+ "columns is in the file:\n"
+ "%s", output_path);
+	}
+	else
+		check_ok();
+}
 
 /*
  * check_for_jsonb_9_4_usage()
diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat
index df45879463..0763dfde39 100644
--- a/src/include/catalog/pg_type.dat
+++ b/src/include/catalog/pg_type.dat
@@ -267,9 +267,9 @@
 # OIDS 1000 - 1099
 
 { oid => '1033', array_type_oid => '1034', descr => 'access control list',
-  typname => 'aclitem', typlen => '12', typbyval => 'f', typcategory => 'U',
+  typname => 'aclitem', typlen => '16', typbyval => 'f', typcategory => 'U',
   typinput => 'aclitemin', typoutput => 'aclitemout', typreceive => '-',
-  typsend => '-', typalign => 'i' },
+  typsend => '-', typalign => 'd' },
 { oid => '1042', array_type_oid => '1014',
   descr => 'char(length), blank-padded string, fixed storage length',
   typname => 'bpchar', typlen => '-1', typbyval => 'f', typcategory => 'S',
diff --git a/src/include/nodes/parsenodes.h 

Re: Suppressing useless wakeups in walreceiver

2022-11-14 Thread Thomas Munro
On Tue, Nov 15, 2022 at 12:14 PM Nathan Bossart
 wrote:
> On Tue, Nov 15, 2022 at 09:42:26AM +1300, Thomas Munro wrote:
> > That works for 020_pg_receivewal.  I wonder if there are also tests
> > that stream a bit of WAL first and then do wait_for_catchup that were
> > previously benefiting from the 100ms-after-startup message by
> > scheduling luck (as in, that was usually enough for replay)?  I might
> > go and teach Cluster.pm to log how much time is wasted in
> > wait_for_catchup to get some observability, and then try to figure out
> > how to optimise it properly.  We could perhaps put the 100ms duct tape
> > back temporarily though, if necessary.
>
> Oh, I see.  Since we don't check the apply position when determining
> whether to send a reply, tests may need to wait a full
> wal_receiver_status_interval.  FWIW with my patch, the runtime of the
> src/test/recovery tests seems to be back to what it was on my machine, but
> I certainly wouldn't rule out scheduling luck.

Yeah, what you have looks good in my experiments.  Here are some
statistics on waits on my system.  Times in seconds, first row is
before the new time logic went in, your patch is the last row.

 name  | count |   sum   |  avg  | stddev |  min  |  max
---+---+-+---++---+
 before|   718 |  37.375 | 0.052 |  0.135 | 0.006 |  1.110
 master|   718 | 173.100 | 0.241 |  1.374 | 0.006 | 10.004
 initial-100ms |   718 |  37.169 | 0.052 |  0.126 | 0.006 |  0.676
 initial-0ms   |   718 |  35.276 | 0.049 |  0.123 | 0.006 |  0.661

The difference on master is explained by these 14 outliers:

   name   |  time
--+
 testrun/recovery/019_replslot_limit  | 10.004
 testrun/recovery/033_replay_tsp_drops|  9.917
 testrun/recovery/033_replay_tsp_drops|  9.957
 testrun/pg_basebackup/020_pg_receivewal  |  9.899
 testrun/pg_rewind/003_extrafiles |  9.961
 testrun/pg_rewind/003_extrafiles |  9.916
 testrun/pg_rewind/008_min_recovery_point |  9.929
 recovery/019_replslot_limit  | 10.004
 recovery/033_replay_tsp_drops|  9.917
 recovery/033_replay_tsp_drops|  9.957
 pg_basebackup/020_pg_receivewal  |  9.899
 pg_rewind/003_extrafiles |  9.961
 pg_rewind/003_extrafiles |  9.916
 pg_rewind/008_min_recovery_point |  9.929
(14 rows)

Now that I can see what is going on I'm going to try to see how much I
can squeeze out of this...




Re: meson oddities

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-14 17:41:54 -0500, Andrew Dunstan wrote:
> Here's a couple of things I've noticed.
> 
> 
> andrew@ub22:HEAD $ inst.meson/bin/pg_config --libdir --ldflags
> /home/andrew/pgl/pg_head/root/HEAD/inst.meson/lib/x86_64-linux-gnu
> -fuse-ld=lld -DCOPY_PARSE_PLAN_TREES -DRAW_EXPRESSION_COVERAGE_TEST
> -DWRITE_READ_PARSE_PLAN_TREES
> 
> 
> Are we really intending to add a new subdirectory to the default layout?
> Why is that x86_64-linux-gnu there?

It's the platform default on, at least, debian derived distros - that's how
you can install 32bit/64bit libraries and libraries with different ABIs
(e.g. linking against glibc vs linking with musl) in parallel.

We could override meson inferring that from the system if we want to, but it
doesn't seem like a good idea?


> Also, why have the CPPFLAGS made their way into the LDFLAGS? That seems
> wrong.

Because these days meson treats CPPFLAGS as part of CFLAGS as it apparently
repeatedly confused build system writers and users when e.g. header-presence
checks would only use CPPFLAGS. Some compiler options aren't entirely clearly
delineated, consider e.g. -isystem (influencing warning behaviour as well as
preprocessor paths).  Not sure if that's the best choice, but it's imo
defensible.

Greetings,

Andres Freund




Re: meson oddities

2022-11-14 Thread Michael Paquier
On Mon, Nov 14, 2022 at 05:41:54PM -0500, Andrew Dunstan wrote:
> Also, why have the CPPFLAGS made their way into the LDFLAGS? That seems
> wrong.

Not only CPPFLAGS.  I pass down some custom CFLAGS to the meson
command as well, and these find their way to LDFLAGS on top of
CFLAGS for the user-defined entries.  I would not have expected that,
either.
--
Michael


signature.asc
Description: PGP signature


Re: [RFC] building postgres with meson - v12

2022-11-14 Thread Justin Pryzby
On Sun, Aug 28, 2022 at 01:37:41PM -0700, Andres Freund wrote:
> > You're running tap tests via a python script.  There's no problem with
> > that, but it's different from what's done by the existing makefiles.
> > I was able to remove the python indirection - maybe that's better to
> > talk about on the CI thread?  That moves some setup for TAP tests
> > (TESTDIR, PATH, cd) from Makefile into the existing perl, which means
> > less duplication.
> 
> I'm doubtful it's worth removing. You'd need to move removing the files from
> the last run into both pg_regress and the tap test infrastructure. And I do
> think it's nice to afterwards have markers which tests failed, so we can only
> collect their logs.

Are you planning on putting something in place to remove (or allow
removing) logs for successful tests ?  Is that primarily for cirrus, or
buildfarm or ??

It is wasteful to upload thousdands of logfiles to show a single
failure.  That would make our cirrus tasks faster - compressing and
uploading the logs takes over a minute.

It's also a lot friendlier to show fewer than 8 pages of test folders to
search through to find the one that failed.

-- 
Justin




Re: Suppressing useless wakeups in walreceiver

2022-11-14 Thread Nathan Bossart
On Tue, Nov 15, 2022 at 09:42:26AM +1300, Thomas Munro wrote:
> That works for 020_pg_receivewal.  I wonder if there are also tests
> that stream a bit of WAL first and then do wait_for_catchup that were
> previously benefiting from the 100ms-after-startup message by
> scheduling luck (as in, that was usually enough for replay)?  I might
> go and teach Cluster.pm to log how much time is wasted in
> wait_for_catchup to get some observability, and then try to figure out
> how to optimise it properly.  We could perhaps put the 100ms duct tape
> back temporarily though, if necessary.

Oh, I see.  Since we don't check the apply position when determining
whether to send a reply, tests may need to wait a full
wal_receiver_status_interval.  FWIW with my patch, the runtime of the
src/test/recovery tests seems to be back to what it was on my machine, but
I certainly wouldn't rule out scheduling luck.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: HOT chain validation in verify_heapam()

2022-11-14 Thread Peter Geoghegan
On Mon, Nov 14, 2022 at 2:58 PM Andres Freund  wrote:
> On 2022-11-14 14:42:16 -0800, Peter Geoghegan wrote:
> > What does this have to tell us, if anything, about the implications
> > for code on HEAD?
>
> Nothing really; with the test I sent (*) I wanted to advance the discussion
> about the patch being wrong as-is in a concrete way.

Got it.

> This logic was one of my main complaints in
> https://postgr.es/m/20221109220803.t25sosmfvkeglhy4%40awork3.anarazel.de
> and you went in a very different direction in your reply. Hence a test
> showcasing the issue.

I guess I was also confused by the fact that you called it
"skewer.diff", which is a terminology I invented to describe the scary
HOT chain freezing bug. You probably started out writing an isolation
test to do something like that, but then repurposed it to show a bug
in the patch. Anyway, never mind, I understand you now.

> Note that neither of my complaints around FrozenTransactionId in that email
> actually require that HOT is involved. The code in the patch doesn't
> differentiate between hot and not-hot until later.

I understand. I mentioned HOT only because it's more obviously not
okay with HOT -- you can point to the precise code that is broken
quite easily (index scans break when HOT chain traversal gives wrong
answers in the problem scenario of freezing HOT chains in the wrong
place).

I'd really like to know if the scary HOT chain freezing scenario is
possible, for the very obvious reason. Have you tried to write a test
case for that?

-- 
Peter Geoghegan




Re: HOT chain validation in verify_heapam()

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-14 14:42:16 -0800, Peter Geoghegan wrote:
> What does this have to tell us, if anything, about the implications
> for code on HEAD?

Nothing really; with the test I sent (*) I wanted to advance the discussion
about the patch being wrong as-is in a concrete way.

This logic was one of my main complaints in
https://postgr.es/m/20221109220803.t25sosmfvkeglhy4%40awork3.anarazel.de
and you went in a very different direction in your reply. Hence a test
showcasing the issue.

Note that neither of my complaints around FrozenTransactionId in that email
actually require that HOT is involved. The code in the patch doesn't
differentiate between hot and not-hot until later.


> I don't see any connection between this problem and the possibility of a
> live bug on HEAD involving freezing later tuple versions in a HOT chain,
> leaving earlier non-frozen versions behind to break HOT chain traversal
> code. Should I have noticed such a connection?

No.

Greetings,

Andres Freund

(*) the commented-out test perhaps is an argument for expanding the comment
around "We will advance past RECENTLY_DEAD tuples just in case there's a DEAD
one after them;" in heap_prune_chain()




Re: HOT chain validation in verify_heapam()

2022-11-14 Thread Peter Geoghegan
On Mon, Nov 14, 2022 at 2:33 PM Andres Freund  wrote:
> > Why don't you think that there is corruption?
>
> I looked at the state after the test and the complaint is bogus. It's caused
> by the patch ignoring the cur->xmax == next->xmin condition if next->xmin is
> FrozenTransactionId. The isolationtester test creates a situation where that
> leads to verify_heapam() considering tuples to be part of the same chain even
> though they aren't.

Having looked at your isolation test in more detail, it seems like you
were complaining about a fairly specific and uncontroversial
shortcoming in the patch itself: it complains about a newly inserted
tuple that gets frozen. It thinks that the inserted tuple is part of
the same HOT chain (or at least the same update chain) as other tuples
on the same heap page, when in fact it's just some wholly unrelated
tuple/logical row. It seems as if the new amcheck code doesn't get all
the details of validating HOT chains right, and so jumps the gun here,
reporting corruption based on a faulty assumption that the frozen-xmin
tuple is in any way related to the chain.

I was confused about whether we were talking about this patch, bugs in
HEAD, or both.

> > Maybe you're right about the proposed new functionality getting things wrong
> > with your adversarial isolation test, but I seem to have missed the
> > underlying argument. Are you just talking about regular update chains here,
> > not HOT chains? Something else?
>
> As I noted, it happens regardless of HOT being used or not. The tuples aren't
> part of the same chain, but the patch treats them as if they were.  The reason
> the patch considers them to be part of the same chain is precisely the
> FrozenTransactionId condition I was worried about. Just because t_ctid points
> to a tuple on the same page and the next tuple has xmin ==
> FrozenTransactionId, doesn't mean they're part of the same chain. Once you
> encounter a tuple with a frozen xmin you simply cannot assume it's part of the
> chain you've been following.

Got it.

That seems relatively straightforward and uncontroversial to me,
because it's just how code like heap_hot_search_buffer (HOT chain
traversal code) works already. The patch got some of those details
wrong, and should be revised.

What does this have to tell us, if anything, about the implications
for code on HEAD? I don't see any connection between this problem and
the possibility of a live bug on HEAD involving freezing later tuple
versions in a HOT chain, leaving earlier non-frozen versions behind to
break HOT chain traversal code. Should I have noticed such a
connection?

-- 
Peter Geoghegan




meson oddities

2022-11-14 Thread Andrew Dunstan


Here's a couple of things I've noticed.


andrew@ub22:HEAD $ inst.meson/bin/pg_config --libdir --ldflags
/home/andrew/pgl/pg_head/root/HEAD/inst.meson/lib/x86_64-linux-gnu
-fuse-ld=lld -DCOPY_PARSE_PLAN_TREES -DRAW_EXPRESSION_COVERAGE_TEST
-DWRITE_READ_PARSE_PLAN_TREES


Are we really intending to add a new subdirectory to the default layout?
Why is that x86_64-linux-gnu there?

Also, why have the CPPFLAGS made their way into the LDFLAGS? That seems
wrong.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: HOT chain validation in verify_heapam()

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-14 14:13:10 -0800, Peter Geoghegan wrote:
> > I think the problem partially is that the proposed verify_heapam() code is 
> > too
> > "aggressive" considering things to be part of the same hot chain - which 
> > then
> > means we have to be very careful about erroring out.
> >
> > The attached isolationtester test triggers:
> > "unfrozen tuple was updated to produce a tuple at offset %u which is frozen"
> > "updated version at offset 3 is also the updated version of tuple at offset 
> > %u"
> >
> > Despite there afaict not being any corruption. Worth noting that this 
> > happens
> > regardless of hot/non-hot updates being used (uncomment s3ci to see).
> 
> Why don't you think that there is corruption?

I looked at the state after the test and the complaint is bogus. It's caused
by the patch ignoring the cur->xmax == next->xmin condition if next->xmin is
FrozenTransactionId. The isolationtester test creates a situation where that
leads to verify_heapam() considering tuples to be part of the same chain even
though they aren't.


> Because I feel like I'm repeating myself more than I should, but: why isn't
> it as simple as "HOT chain traversal logic is broken by frozen xmin in the
> obvious way, therefore all bets are off"?

Because that's irrelevant for the testcase and a good number of my concerns.


> Maybe you're right about the proposed new functionality getting things wrong
> with your adversarial isolation test, but I seem to have missed the
> underlying argument. Are you just talking about regular update chains here,
> not HOT chains? Something else?

As I noted, it happens regardless of HOT being used or not. The tuples aren't
part of the same chain, but the patch treats them as if they were.  The reason
the patch considers them to be part of the same chain is precisely the
FrozenTransactionId condition I was worried about. Just because t_ctid points
to a tuple on the same page and the next tuple has xmin ==
FrozenTransactionId, doesn't mean they're part of the same chain. Once you
encounter a tuple with a frozen xmin you simply cannot assume it's part of the
chain you've been following.
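
As a concrete illustration of that rule -- a minimal hedged sketch, not code
from the patch; the helper name is made up, and it assumes the usual
access/htup_details.h and access/transam.h macros:

static bool
ctid_link_is_usable(HeapTupleHeader cur, HeapTupleHeader next)
{
	TransactionId next_xmin = HeapTupleHeaderGetXmin(next);

	/*
	 * A frozen or otherwise non-normal xmin carries no usable ordering
	 * information, so stop following the chain here rather than risk
	 * treating an unrelated tuple as part of it.
	 */
	if (!TransactionIdIsNormal(next_xmin))
		return false;

	/* The inserter of the next version must be the updater of this one. */
	return TransactionIdEquals(HeapTupleHeaderGetUpdateXid(cur), next_xmin);
}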

Greetings,

Andres Freund




Re: HOT chain validation in verify_heapam()

2022-11-14 Thread Peter Geoghegan
On Mon, Nov 14, 2022 at 12:58 PM Andres Freund  wrote:
> I wonder if we ought to add an error check to heap_prepare_freeze_tuple()
> against this scenario. We're working towards being more aggressive around
> freezing, which will make it more likely to hit corner cases around this.

In theory my work on freezing doesn't change the basic rules about how
freezing works, and doesn't know anything about HOT, so it shouldn't
introduce any new risk. Even still, I agree that this seems like
something to do in the scope of the same work, just in case. Plus it's
just important.

It would be possible to have exhaustive heap_prepare_freeze_tuple
checks in assert-only builds -- we can exhaustively check the final
array of prepared freeze plans that we collected for a given heap
page, and check it against the page exhaustively right before freezing
is executed. That's not perfect, but it would be a big improvement.

Right now I am not entirely sure what I would need to check in such a
mechanism. I am legitimately unsure of what the rules are in light of
this new information.

> Which'd still be fine if the whole chain were already "fully dead". One way I
> think this can happen is <= PG 13's HEAPTUPLE_DEAD handling in
> lazy_scan_heap().

You mean the tupgone thing? Perhaps it would have avoided this
particular problem, or one like it. But it had so many other problems
that I don't see why it matters now.

> At least as long as correctness requires not ending up in endless loops -
> indeed. We end up with lazy_scan_prune() endlessly retrying. Without a chance
> to interrupt. Shouldn't there at least be a CFI somewhere?

Probably, but that wouldn't change the fact that it's a bug when this
happens. Obviously it's more important to avoid such a bug than it is
to ameliorate it.

> I think the problem partially is that the proposed verify_heapam() code is too
> "aggressive" considering things to be part of the same hot chain - which then
> means we have to be very careful about erroring out.
>
> The attached isolationtester test triggers:
> "unfrozen tuple was updated to produce a tuple at offset %u which is frozen"
> "updated version at offset 3 is also the updated version of tuple at offset 
> %u"
>
> Despite there afaict not being any corruption. Worth noting that this happens
> regardless of hot/non-hot updates being used (uncomment s3ci to see).

Why don't you think that there is corruption?

The terminology here is tricky. It's possible that the amcheck patch
makes a very good point here, even without necessarily complaining
about a state that leads to obviously wrong behavior. It's also
possible that there really is wrong behavior, at least in my mind -- I
don't know what your remarks about no corruption are really based on.

I feel like I'm repeating myself more than I should, but: why isn't it
as simple as "HOT chain traversal logic is broken by frozen xmin in
the obvious way, therefore all bets are off"? Maybe you're right about
the proposed new functionality getting things wrong with your
adversarial isolation test, but I seem to have missed the underlying
argument. Are you just talking about regular update chains here, not
HOT chains? Something else?

> One important protection right now is that vacuumlazy.c uses a more
> pessimistic horizon than pruneheap.c. Even if visibility determinations within
> pruning recompute the horizon, vacuumlazy.c won't freeze based on the advanced
> horizon.  I don't quite know where we'd best put a comment with a warning
> about this fact.

We already have comments discussing the relationship between
OldestXmin and vistest (as well as rel_pages) in heap_vacuum_rel().
That seems like the obvious place to put something like this, at least
to me.

-- 
Peter Geoghegan




Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-08 14:57:35 +1300, David Rowley wrote:
> On Tue, 8 Nov 2022 at 05:24, Andres Freund  wrote:
> > Should we handle the case where we get a suitably aligned pointer from
> > MemoryContextAllocExtended() differently?
> 
> Maybe it would be worth the extra check. I'm trying to imagine future
> use cases.  Maybe if someone wanted to ensure that we're aligned to
> CPU cache line boundaries then the chances of the pointer already
> being aligned to 64 bytes is decent enough.  The problem is that
> it's too late to save any memory, it just saves a bit of boxing and
> unboxing of the redirect headers.

Couldn't we reduce the amount of over-allocation by a small amount by special
casing the already-aligned case? That's not going to be relevant for page size
aligned allocations, but for smaller alignment values it could matter.
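
For illustration, a minimal sketch of what such an already-aligned fast path
could look like -- this is not the patch's implementation, just the idea,
assuming alignto is a power of 2; the real code also has to make the redirect
discoverable by pfree() via its chunk header, which this sketch glosses over:

void *
palloc_aligned_sketch(Size size, Size alignto)
{
	char	   *raw = palloc(size);
	char	   *aligned;

	/* Fast path: nothing wasted, no redirect header to box/unbox. */
	if (((uintptr_t) raw & (alignto - 1)) == 0)
		return raw;

	/*
	 * Slow path: over-allocate, round the address up to the requested
	 * alignment, and stash the real chunk start just below the aligned
	 * address so it can be recovered later.
	 */
	pfree(raw);
	raw = palloc(size + alignto + sizeof(void *));
	aligned = (char *) TYPEALIGN(alignto, raw + sizeof(void *));
	((void **) aligned)[-1] = raw;
	return aligned;
}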

Greetings,

Andres Freund




Re: HOT chain validation in verify_heapam()

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-14 14:27:54 -0500, Robert Haas wrote:
> On Wed, Nov 9, 2022 at 5:08 PM Andres Freund  wrote:
> > I don't really understand this logic - why can't we populate the predecessor
> > array, if we can construct a successor entry?
> 
> This whole thing was my idea, so let me try to explain. I think the
> naming and comments need work, but I believe the fundamental idea may
> be sound.
> 
> successor[x] = y means that when we looked at line pointer x, we saw
> that it was either a redirect to line pointer y, or else it had
> storage and the associated tuple's CTID pointed to line pointer y.

> At this point, we do not have any idea whether y is at all sane, nor do
> we know anything about which of x and y is larger.

What do you mean with "larger" here?


> Furthermore, it is
> possible that successor[x] = successor[x'] since the page might be corrupted
> and we haven't checked otherwise.
> 
> predecessor[y] = x means that successor[x] = y but in addition we've
> checked that y is sane, and that x.xmax=y.xmin. If there are multiple
> tuples for which these conditions hold, we've issued complaints about
> all but one and entered the last into the predecessor array.

As shown by the isolationtester test I just posted, this doesn't quite work
right now. Probably fixable.

I don't think we can follow non-HOT ctid chains if they're older than the xmin
horizon, including all cases of xmin being frozen. There's just nothing
guaranteeing that the tuples are actually "related".


It seems like we should do a bit more validation within a chain of
tuples. E.g. that no live tuple can follow an !DidCommit xmin?



> > I'm doubtful it's a good idea to try to validate the 9.4 case. The 
> > likelihood
> > of getting that right seems low and I don't see us gaining much by even 
> > trying.
> 
> I agree with Peter. We have to try to get that case right. If we can
> eventually eliminate it as a valid case by some mechanism, hooray. But
> in the meantime we have to deal with it as best we can.

I now think that the 9.4 specific reasoning is bogus in the first place. The
patch says:

 * Add a line pointer offset to the predecessor array if xmax is
 * matching with xmin of next tuple (reaching via its t_ctid).
 * Prior to PostgreSQL 9.4, we actually changed the xmin to
 * FrozenTransactionId so we must add offset to predecessor
 * array(irrespective of xmax-xmin matching) if updated tuple xmin
 * is frozen, so that we can later do validation related to frozen
 * xmin. Raise corruption if we have two tuples having the same
 * predecessor.

but it's simply not correct to iterate through xmin=FrozenTransactionId - as
shown in the isolationtester test. And that's unrelated to 9.4, because we
couldn't rely on the raw xmin value either, because even if they match, they
could be from different epochs.

Greetings,

Andres Freund




Re: HOT chain validation in verify_heapam()

2022-11-14 Thread Peter Geoghegan
On Mon, Nov 14, 2022 at 11:28 AM Robert Haas  wrote:
> Part of the motivation here is also driven by trying to figure out how
> to word the complaints. We have a dedicated field in the amcheck that
> can hold one tuple offset or the other, but if we're checking the
> relationships between tuples, what do we put there? I feel it will be
> easiest to understand if we put the offset of the older tuple in that
> field and then phrase the complaint as the patch does, e.g.:

That makes a lot of sense to me, and reminds me of how things work in
verify_nbtree.c.

At a high level verify_nbtree.c works by doing a breadth-first
traversal of the tree. The search makes each distinct page the "target
page" exactly once. The target page is the clear focal point for
everything -- almost every complaint about corruption frames the
problem as a problem in the target page. We consistently describe
things in terms of their relationship with the target page, so under
this scheme everybody is...on the same page (ahem).

Being very deliberate about that probably had some small downsides.
Maybe it would have made a little more sense to word certain
particular corruption report messages in a way that placed blame on
"ancillary" pages like sibling/child pages (not the target page) as
problems in the ancillary page itself, not the target page. This still
seems like the right trade-off -- the control flow can be broken up
into understandable parts once you understand that the target page is
the thing that we use to describe every other page.

> > I'm doubtful it's a good idea to try to validate the 9.4 case. The 
> > likelihood
> > of getting that right seems low and I don't see us gaining much by even 
> > trying.
>
> I agree with Peter. We have to try to get that case right. If we can
> eventually eliminate it as a valid case by some mechanism, hooray. But
> in the meantime we have to deal with it as best we can.

Practiced intellectual humility seems like the way to go here. On some
level I suspect that we'll have problems in exactly the places that we
don't look for them.

-- 
Peter Geoghegan




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-14 13:43:41 -0500, Robert Haas wrote:
> On Mon, Nov 14, 2022 at 12:47 PM Andres Freund  wrote:
> > I'd go the other way. It's pretty unimportant whether it overflowed, it's
> > important how many subtxns there are. The cases where overflowing causes 
> > real
> > problems are when there's many thousand subtxns - which one can't judge just
> > from suboverflowed alone. Nor can monitoring a boolean tell you whether 
> > you're
> > creeping closer to the danger zone.
> 
> This is the opposite of what I believe to be true. I thought the
> problem is that once a single backend overflows the subxid array, all
> snapshots have to be created suboverflowed, and this makes visibility
> checking more expensive. It's my impression that for some users this
> creates an extremely steep performance cliff: the difference between
> no backends overflowing and 1 backend overflowing is large, but
> whether you are close to the limit makes no difference as long as you
> don't reach it, and once you've passed it it makes little difference
> how far past it you go.

First, it's not good to have a cliff that you can't see coming - presumably
you'd want to warn *before* you regularly reach PGPROC_MAX_CACHED_SUBXIDS
subxids, rather than when the shit has hit the fan already.

IMO the number matters a lot when analyzing why this is happening / how to
react. A session occasionally reaching 65 subxids might be tolerable and not
necessarily indicative of a bug. But 100k subxids is something that one just
can't accept.


Perhaps this would better be tackled by a new "visibility" view. It could show
- number of sessions with a snapshot
- max age of backend xmin
- pid with max backend xmin
- number of sessions that suboverflowed
- pid of the session with the most subxids
- age of the oldest prepared xact
- age of the oldest slot
- age of the oldest walsender
- ...

Perhaps implemented in SQL, with new functions for accessing the properties we
don't expose today.  That'd address the pg_stat_activity width, while still
allowing very granular access when necessary. And provide insight into
something that's way too hard to query right now.


> > I don't buy the argument that the ship of pg_stat_activity width has 
> > entirely
> > sailed. A session still fits onto a reasonably sized terminal in \x output -
> > but not much longer.
> 
> I guess it depends on what you mean by reasonable. For me, without \x,
> it wraps across five times on an idle system with the 24x80 window
> that I normally use, and even if I full screen my terminal window, it
> still wraps around. With \x, sure, it fits, but only if the query is
> shorter than the width of my window minus ~25 characters, which isn't
> that likely to be the case IME because users write long queries.
>
> I don't even try to use \x most of the time because the queries are likely
> to be long enough to destroy any benefit, but it all depends on how big your
> terminal is and how long your queries are.

I pretty much always use less with -S/--chop-long-lines (via $LESS), otherwise
I find psql to be pretty hard to use.

Greetings,

Andres Freund




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Tom Lane
Robert Haas  writes:
> On Mon, Nov 14, 2022 at 12:47 PM Andres Freund  wrote:
>> I'd go the other way. It's pretty unimportant whether it overflowed, it's
>> important how many subtxns there are. The cases where overflowing causes real
>> problems are when there's many thousand subtxns - which one can't judge just
>> from suboverflowed alone. Nor can monitoring a boolean tell you whether 
>> you're
>> creeping closer to the danger zone.

> This is the opposite of what I believe to be true. I thought the
> problem is that once a single backend overflows the subxid array, all
> snapshots have to be created suboverflowed, and this makes visibility
> checking more expensive.

Yeah, that's what I thought too.  Andres, please enlarge ...

regards, tom lane




Re: HOT chain validation in verify_heapam()

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-14 09:50:48 -0800, Peter Geoghegan wrote:
> On Mon, Nov 14, 2022 at 9:38 AM Andres Freund  wrote:
> > Anyway, I played a bit around with this. It's hard to hit, not because we
> > somehow won't choose such a horizon, but because we'll commonly prune the
> > earlier tuple version away due to xmax being old enough.
>
> That must be a bug, then. Since, as I said, I can't see how it could
> possibly be okay to freeze an xmin of tuple in a HOT chain without
> also making sure that it has no earlier versions left behind.

Hard to imagine us having bugs in this code. Ahem.

I really wish I knew of a reasonably complex way to utilize coverage guided
fuzzing on heap pruning / vacuuming.


I wonder if we ought to add an error check to heap_prepare_freeze_tuple()
against this scenario. We're working towards being more aggressive around
freezing, which will make it more likely to hit corner cases around this.



> If there are earlier versions that we have to go through to get to the
> frozen-xmin tuple (not just an LP_REDIRECT), we're going to break the HOT
> chain traversal logic in code like heap_hot_search_buffer in a rather
> obvious way.
>
> HOT chain traversal logic code will interpret the frozen xmin from the
> tuple as FrozenTransactionId (not as its raw xmin). So traversal is
> just broken in this scenario.
>

Which'd still be fine if the whole chain were already "fully dead". One way I
think this can happen is <= PG 13's HEAPTUPLE_DEAD handling in
lazy_scan_heap().

I now suspect that the seemingly-odd "We will advance past RECENTLY_DEAD
tuples just in case there's a DEAD one after them;" logic in
heap_prune_chain() might be required for correctness.  Which IIRC we'd been
talking about getting rid of elsewhere?



At least as long as correctness requires not ending up in endless loops -
indeed. We end up with lazy_scan_prune() endlessly retrying. Without a chance
to interrupt. Shouldn't there at least be a CFI somewhere?  The attached
isolationtester spec has a commented out test for this.


I think the problem partially is that the proposed verify_heapam() code is too
"aggressive" considering things to be part of the same hot chain - which then
means we have to be very careful about erroring out.

The attached isolationtester test triggers:
"unfrozen tuple was updated to produce a tuple at offset %u which is frozen"
"updated version at offset 3 is also the updated version of tuple at offset %u"

Despite there afaict not being any corruption. Worth noting that this happens
regardless of hot/non-hot updates being used (uncomment s3ci to see).


> > It *is* possible to
> > hit, if the horizon increases between the two tuple version checks (e.g. if
> > there's another tuple inbetween that we check the visibility of).
>
> I suppose that that's the detail that "protects" us, then -- that
> would explain the apparent lack of problems in the field. Your
> sequence requires 3 sessions, not just 2.

One important protection right now is that vacuumlazy.c uses a more
pessimistic horizon than pruneheap.c. Even if visibility determinations within
pruning recompute the horizon, vacuumlazy.c won't freeze based on the advanced
horizon.  I don't quite know where we'd best put a comment with a warning
about this fact.

Greetings,

Andres Freund
diff --git c/src/test/isolation/expected/skewer.out i/src/test/isolation/expected/skewer.out
new file mode 100644
index 000..e69de29bb2d
diff --git c/src/test/isolation/specs/skewer.spec i/src/test/isolation/specs/skewer.spec
new file mode 100644
index 000..f9fa6ea8aa1
--- /dev/null
+++ i/src/test/isolation/specs/skewer.spec
@@ -0,0 +1,74 @@
+setup
+{
+DROP TABLE IF EXISTS t;
+DROP EXTENSION IF EXISTS amcheck;
+DROP EXTENSION IF EXISTS pageinspect;
+
+CREATE EXTENSION amcheck;
+CREATE EXTENSION pageinspect;
+CREATE TABLE t(key int, d int);
+}
+
+session s1
+step s1b {BEGIN; SELECT txid_current();}
+step s1u1 {UPDATE t SET d = d + 1 WHERE key = 1; }
+step s1i3 {INSERT INTO t VALUES (3, 1);}
+step s1c {COMMIT;}
+
+session s2
+step s2b {BEGIN; SELECT txid_current();}
+step s2c {COMMIT;}
+
+session s3
+step s3b {BEGIN; SELECT txid_current();}
+step s3i1 {INSERT INTO t VALUES (1, 1);}
+step s3i2 {INSERT INTO t VALUES (2, 1);}
+step s3c {COMMIT;}
+step s3a {ABORT;}
+step s3u1 {UPDATE t SET d = d + 1 WHERE key = 1; }
+step s3u2 {UPDATE t SET d = d + 1 WHERE key = 2; }
+
+step s3ci { CREATE INDEX ON t(d)}
+
+session s4
+step s4show { SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, mask.raw_flags, mask.combined_flags, t_hoff, t_bits  FROM heap_page_items(get_raw_page('t', 0)), heap_tuple_infomask_flags(t_infomask, t_infomask2) AS mask;}
+step s4verify { SELECT * FROM verify_heapam('t'); SELECT pg_backend_pid();}
+step s4vac { VACUUM (VERBOSE) t; }
+step s4vacfreeze { VACUUM (VERBOSE, FREEZE) t; }
+
+session s5
+step s5pin { BEGIN; DECLARE foo CURSOR FOR SELECT * FROM t; FETCH FROM foo;}
+

Re: Suppressing useless wakeups in walreceiver

2022-11-14 Thread Thomas Munro
On Tue, Nov 15, 2022 at 8:01 AM Nathan Bossart  wrote:
> Is there any reason we should wait for 100ms before sending the initial
> reply?  ISTM the previous behavior essentially caused the first reply (and
> feedback message) to be sent at the first opportunity after streaming
> begins.  My instinct is to do something like the attached.  I wonder if we
> ought to do something similar in the ConfigReloadPending path in case
> hot_standby_feedback is being turned on.

That works for 020_pg_receivewal.  I wonder if there are also tests
that stream a bit of WAL first and then do wait_for_catchup that were
previously benefiting from the 100ms-after-startup message by
scheduling luck (as in, that was usually enough for replay)?  I might
go and teach Cluster.pm to log how much time is wasted in
wait_for_catchup to get some observability, and then try to figure out
how to optimise it properly.  We could perhaps put the 100ms duct tape
back temporarily though, if necessary.




Change error to warning and increase thresholds of tsearch

2022-11-14 Thread Ankit Kumar Pandey

Hi,

I was looking at the Todo item "Consider changing error to warning for strings
larger than one megabyte", and after going through the existing mails and
suggestions, I would like to propose a patch for tsearch that changes the error
into a warning for strings larger than one megabyte and also increases the word
and position limits.


I've checked select/insert/index operations, which worked fine without any
error (except for the warning, as intended).

Thoughts: I am not really sure why it was proposed in the earlier mail to
decrease len/MAXSTRLEN:
You could decrease len in WordEntry to 9 (512 characters) and increase 
pos to 22 (4 Mb). Don't forget to update MAXSTRLEN and MAXSTRPOS 
accordingly.



I'm attaching a patch here, and I would be glad to get some feedback on it.


Thanks,
Ankit
From 6eb6db71bd54c23ebfed545e730806229e67210e Mon Sep 17 00:00:00 2001
From: Ankit Kumar Pandey 
Date: Tue, 15 Nov 2022 01:09:11 +0530
Subject: [PATCH] change error to warn for tsearch and increase its limits

---
 src/backend/utils/adt/tsvector.c | 8 ++--
 src/include/tsearch/ts_type.h| 8 
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/adt/tsvector.c b/src/backend/utils/adt/tsvector.c
index 04c6f33537..bea5c54414 100644
--- a/src/backend/utils/adt/tsvector.c
+++ b/src/backend/utils/adt/tsvector.c
@@ -192,6 +192,7 @@ tsvectorin(PG_FUNCTION_ARGS)
 	int			poslen;
 	char	   *strbuf;
 	int			stroff;
+	bool		overflow_warn = false;
 
 	/*
 	 * Tokens are appended to tmpbuf, cur is a pointer to the end of used
@@ -216,11 +217,14 @@ tsvectorin(PG_FUNCTION_ARGS)
 			(long) toklen,
 			(long) (MAXSTRLEN - 1))));
 
-		if (cur - tmpbuf > MAXSTRPOS)
-			ereport(ERROR,
+		if (!overflow_warn && (cur - tmpbuf > MAXSTRPOS))
+		{
+			ereport(WARNING,
 	(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 	 errmsg("string is too long for tsvector (%ld bytes, max %ld bytes)",
 			(long) (cur - tmpbuf), (long) MAXSTRPOS)));
+			overflow_warn = true;
+		}
 
 		/*
 		 * Enlarge buffers if needed
diff --git a/src/include/tsearch/ts_type.h b/src/include/tsearch/ts_type.h
index f1ec84702d..5daf604dcf 100644
--- a/src/include/tsearch/ts_type.h
+++ b/src/include/tsearch/ts_type.h
@@ -42,12 +42,12 @@ typedef struct
 {
 	uint32
 haspos:1,
-len:11,			/* MAX 2Kb */
-pos:20;			/* MAX 1Mb */
+len:9,			/* MAX 512 bytes */
+pos:22;			/* MAX 4Mb */
 } WordEntry;
 
-#define MAXSTRLEN ( (1<<11) - 1)
-#define MAXSTRPOS ( (1<<20) - 1)
+#define MAXSTRLEN ( (1<<9) - 1)
+#define MAXSTRPOS ( (1<<22) - 1)
 
 extern int	compareWordEntryPos(const void *a, const void *b);
 
-- 
2.37.2



Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-14 Thread David Christensen
Enclosed is v8, which uses the replication slot method to retain WAL
as well as fsync'ing the output directory when everything is done.


v8-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch
Description: Binary data


Re: Allow single table VACUUM in transaction block

2022-11-14 Thread Simon Riggs
On Tue, 8 Nov 2022 at 03:10, Simon Riggs  wrote:
>
> On Mon, 7 Nov 2022 at 08:20, Simon Riggs  wrote:
>
> > Temp tables are actually easier, since we don't need any of the
> > concurrency features we get with lazy vacuum.

> Thoughts?

New patch, which does this, when in a xact block

1. For temp tables, only VACUUM FULL is allowed
2. For persistent tables, an AV task is created to perform the vacuum,
which eventually performs a vacuum

The patch works, but there are various aspects of the design that need
input. Thoughts?

-- 
Simon Riggshttp://www.EnterpriseDB.com/


single_table_vacuum_in_xact.v3.patch
Description: Binary data


Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Robert Haas
On Mon, Nov 14, 2022 at 2:17 PM David G. Johnston
 wrote:
> Assuming getting an actual count value to print is fairly cheap, or even a 
> sunk cost if you are going to report overflow, I don't see why we wouldn't 
> want to provide the more detailed data.
>
> My concern, through ignorance, with reporting a number is that it would have 
> no context in the query result itself.  If I have two rows with numbers, one 
> with 10 and one with 1,000, is the two orders of magnitude of the second 
> number important or does overflow happen at, say, 65,000 and so both numbers 
> are exceedingly small and thus not worth worrying about?  That can be handled 
> by documentation just fine, so long as the reference number in question isn't 
> a per-session variable.  Otherwise, showing some kind of "percent of max" 
> computation seems warranted.  In which case maybe the two presentation 
> outputs would be:
>
> 1,000 (13%)
> Overflowed

I think the idea of cramming a bunch of stuff into a text field is
dead on arrival. Data types are a wonderful invention because they let
people write queries, say looking for backends where overflowed =
true, or backends where subxids > 64. That gets much harder if the
query has to try to make sense of some random text representation.

If both values are separately important, then we need to report them
both, and the only question is whether to do that in pg_stat_activity
or via a side mechanism. What I don't yet understand is why that's
true. I think the important question is whether there are overflowed
backends, and Andres thinks it's how many subtransaction XIDs there
are, so there is a reasonable chance that both things actually matter
in separate scenarios. But I only know the scenario in which
overflowed matters, not the one in which subtransaction XID count
matters.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: HOT chain validation in verify_heapam()

2022-11-14 Thread Robert Haas
On Wed, Nov 9, 2022 at 5:08 PM Andres Freund  wrote:
> I don't really understand this logic - why can't we populate the predecessor
> array, if we can construct a successor entry?

This whole thing was my idea, so let me try to explain. I think the
naming and comments need work, but I believe the fundamental idea may
be sound.

successor[x] = y means that when we looked at line pointer x, we saw
that it was either a redirect to line pointer y, or else it had
storage and the associated tuple's CTID pointed to line pointer y. At
this point, we do not have any idea whether y is at all sane, nor do
we know anything about which of x and y is larger. Furthermore, it
is possible that successor[x] = successor[x'] since the page might be
corrupted and we haven't checked otherwise.

predecessor[y] = x means that successor[x] = y but in addition we've
checked that y is sane, and that x.xmax=y.xmin. If there are multiple
tuples for which these conditions hold, we've issued complaints about
all but one and entered the last into the predecessor array.

An earlier version of the algorithm had only a predecessor[] array but
the code to try to populate in a single pass was complex and looked
ugly and error-prone. To set a predecessor entry in one step, we had
to sanity-check each of x and y but only if that hadn't yet been done,
which was quite awkward. For example, imagine line pointers 1 and 2
both point to 3, and line pointer 3 points backward to line pointer 1
(because of corruption, since it shouldn't ever be circular). We can't
reason about the relationship between 1 and 3 without first making
sure that each one is sane in isolation. But if we do that when we're
at line pointer 1, then when we get to 2, we need to check 2 but don't
need to recheck 3, and when we get to 3 we need to recheck neither 3
nor 1. This algorithm lets us run through and do all the basic sanity
checks first, while populating the successor array, and then check
relationships in later stages.
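
To make the two-pass shape concrete, here is a rough sketch of the idea
(an editor's illustration only, not the actual patch; the checks are
heavily simplified and the variable names are made up):

/* Pass 1: record raw links, doing only per-item sanity checks. */
OffsetNumber *successor = palloc0(sizeof(OffsetNumber) * (maxoff + 1));
OffsetNumber *predecessor = palloc0(sizeof(OffsetNumber) * (maxoff + 1));

for (OffsetNumber off = FirstOffsetNumber; off <= maxoff; off++)
{
    ItemId      itemid = PageGetItemId(page, off);

    if (ItemIdIsRedirected(itemid))
        successor[off] = ItemIdGetRedirect(itemid);
    else if (ItemIdIsNormal(itemid))
    {
        HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, itemid);
        OffsetNumber    next = ItemPointerGetOffsetNumber(&htup->t_ctid);

        /*
         * A tuple whose ctid points at itself has no successor; the check
         * that the ctid stays on this block is omitted here for brevity.
         */
        if (next != off)
            successor[off] = next;
    }
}

/* Pass 2: only now reason about pairs of line pointers. */
for (OffsetNumber off = FirstOffsetNumber; off <= maxoff; off++)
{
    OffsetNumber next = successor[off];

    if (next == InvalidOffsetNumber || next > maxoff)
        continue;

    /*
     * Sanity-check 'next', verify that xmax(off) matches xmin(next), and
     * complain if predecessor[next] is already set (two chains converging
     * on one tuple); only then:
     */
    predecessor[next] = off;
}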

Part of the motivation here is also driven by trying to figure out how
to word the complaints. We have a dedicated field in the amcheck output
that can hold one tuple offset or the other, but if we're checking the
relationships between tuples, what do we put there? I feel it will be
easiest to understand if we put the offset of the older tuple in that
field and then phrase the complaint as the patch does, e.g.:

tuple with uncommitted xmin %u was updated to produce a tuple at
offset %u with differing xmin %u

We could flip that around and put the newer tuple offset in the field
and then phrase the complaint the other way around, but it seems a bit
awkward, e.g.:

tuple with uncommitted xmin %u at offset %u was updated to produce this
tuple with differing xmin %u

I think if we did do it that way around (and figured out how to phrase
the messages) we might not need both arrays any more (though I'm not
positive about that). It seems hard to avoid needing at least one,
else you can't explicitly notice two converging HOT chains, which
seems like a case we probably ought to notice. But the "to produce
this tuple" phrasing is just confusing, I think, and removing "this"
doesn't help. You need to somehow get people to understand that the
offset they probably saw in another field is the second tuple, not the
first one. Maybe:

xmin %u does not match xmax %u of prior tuple at offset %u

Hmm.

Anyway, whether it was the right idea or not, the desire to have the
earlier tuple be the focus of the error messages was part of the
motivation here.

> I'm doubtful it's a good idea to try to validate the 9.4 case. The likelihood
> of getting that right seems low and I don't see us gaining much by even 
> trying.

I agree with Peter. We have to try to get that case right. If we can
eventually eliminate it as a valid case by some mechanism, hooray. But
in the meantime we have to deal with it as best we can.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread David G. Johnston
On Mon, Nov 14, 2022 at 11:43 AM Robert Haas  wrote:

> On Mon, Nov 14, 2022 at 12:47 PM Andres Freund  wrote:
> > I'd go the other way. It's pretty unimportant whether it overflowed, it's
> > important how many subtxns there are. The cases where overflowing causes
> real
> > problems are when there's many thousand subtxns - which one can't judge
> just
> > from suboverflowed alone. Nor can monitoring a boolean tell you whether
> you're
> > creeping closer to the danger zone.
>
> This is the opposite of what I believe to be true. I thought the
> problem is that once a single backend overflows the subxid array, all
> snapshots have to be created suboverflowed, and this makes visibility
> checking more expensive. It's my impression that for some users this
> creates an extremely steep performance cliff: the difference between
> no backends overflowing and 1 backend overflowing is large, but
> whether you are close to the limit makes no difference as long as you
> don't reach it, and once you've passed it it makes little difference
> how far past it you go.
>
>
Assuming getting an actual count value to print is fairly cheap, or even a
sunk cost if you are going to report overflow, I don't see why we wouldn't
want to provide the more detailed data.

My concern, through ignorance, with reporting a number is that it would
have no context in the query result itself.  If I have two rows with
numbers, one with 10 and one with 1,000, is the two orders of magnitude of
the second number important or does overflow happen at, say, 65,000 and so
both numbers are exceedingly small and thus not worth worrying about?  That
can be handled by documentation just fine, so long as the reference number
in question isn't a per-session variable.  Otherwise, showing some kind of
"percent of max" computation seems warranted.  In which case maybe the two
presentation outputs would be:

1,000 (13%)
Overflowed

David J.


Re: Suppressing useless wakeups in walreceiver

2022-11-14 Thread Nathan Bossart
On Mon, Nov 14, 2022 at 02:47:14PM +1300, Thomas Munro wrote:
> Ahh, I think I might have it.  In the old coding, sendTime starts out
> as 0, which I think caused it to send its first reply message after
> the first 100ms sleep, and only after that fall into a cadence of
> wal_receiver_status_interval (10s) while idle.  Our new coding won't
> send its first reply until start time + wal_receiver_status_interval.
> If I have that right, think we can get back to the previous behaviour
> by explicitly setting the first message time, like:

Good find!

> @@ -433,6 +433,9 @@ WalReceiverMain(void)
> for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
> WalRcvComputeNextWakeup(i, now);
> 
> +   /* XXX start with a reply after 100ms */
> +   wakeup[WALRCV_WAKEUP_REPLY] = now + 10;
> +
> /* Loop until end-of-streaming or error */
> 
> Obviously that's bogus and racy (it races with wait_for_catchup, and
> it's slow, actually both sides are not great and really should be
> event-driven, in later work)...

Is there any reason we should wait for 100ms before sending the initial
reply?  ISTM the previous behavior essentially caused the first reply (and
feedback message) to be sent at the first opportunity after streaming
begins.  My instinct is to do something like the attached.  I wonder if we
ought to do something similar in the ConfigReloadPending path in case
hot_standby_feedback is being turned on.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8bd2ba37dd..4de0b9c387 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -433,6 +433,10 @@ WalReceiverMain(void)
 			for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
 WalRcvComputeNextWakeup(i, now);
 
+			/* send initial reply/feedback at the next opportunity */
+			wakeup[WALRCV_WAKEUP_REPLY] = now;
+			wakeup[WALRCV_WAKEUP_HSFEEDBACK] = now;
+
 			/* Loop until end-of-streaming or error */
 			for (;;)
 			{


Re: [PoC] Let libpq reject unexpected authentication requests

2022-11-14 Thread Jacob Champion
On 11/11/22 22:57, Aleksander Alekseev wrote:
> I did a little more research and I think you are right. What happens
> according to the C standard:

Thanks for confirming! (I personally prefer -1 to a *MAX macro, because
it works regardless of the length of the type.)

--Jacob




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Robert Haas
On Mon, Nov 14, 2022 at 12:47 PM Andres Freund  wrote:
> I'd go the other way. It's pretty unimportant whether it overflowed, it's
> important how many subtxns there are. The cases where overflowing causes real
> problems are when there's many thousand subtxns - which one can't judge just
> from suboverflowed alone. Nor can monitoring a boolean tell you whether you're
> creeping closer to the danger zone.

This is the opposite of what I believe to be true. I thought the
problem is that once a single backend overflows the subxid array, all
snapshots have to be created suboverflowed, and this makes visibility
checking more expensive. It's my impression that for some users this
creates an extremely steep performance cliff: the difference between
no backends overflowing and 1 backend overflowing is large, but
whether you are close to the limit makes no difference as long as you
don't reach it, and once you've passed it it makes little difference
how far past it you go.

> But TBH, to me this still is something that'd be better addressed with a
> tracepoint.

I think that makes it far, far less accessible to the typical user.

> I don't buy the argument that the ship of pg_stat_activity width has entirely
> sailed. A session still fits onto a reasonably sized terminal in \x output -
> but not much longer.

I guess it depends on what you mean by reasonable. For me, without \x,
it wraps across five times on an idle system with the 24x80 window
that I normally use, and even if I full screen my terminal window, it
still wraps around. With \x, sure, it fits, but only if the query is
shorter than the width of my window minus ~25 characters, which isn't
that likely to be the case IME because users write long queries. I
don't even try to use \x most of the time because the queries are
likely to be long enough to destroy any benefit, but it all depends on
how big your terminal is and how long your queries are.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-14 Thread David Christensen
On Fri, Nov 11, 2022 at 4:57 AM Bharath Rupireddy
 wrote:
> > Well if it's not the same output then I guess you're right and there's
> > not a use for the `--fixup` mode.  By the same token, I'd say
> > calculating/setting the checksum also wouldn't need to be done, we
> > should just include the page as included in the WAL stream.
>
> Let's hear from others, we may be missing something here. I recommend
> keeping the --fixup patch as 0002, in case if we decide to discard
> it's easier, however I'll leave that to you.

I've whacked it in git for now; I can resurrect it if people consider it useful.

> > > > > 5.
> > > > > +if (!RestoreBlockImage(record, block_id, page))
> > > > > +continue;
> > > > > +
> > > > > +/* we have our extracted FPI, let's save it now */
> > > > > After extracting the page from the WAL record, do we need to perform a
> > > > > checksum on it?
> > >
> > > I think you just need to do the following, this will ensure the
> > > authenticity of the page that pg_waldump returns.
> > > if ((PageHeader) page)->pd_checksum != pg_checksum_page((char *) page, 
> > > blk))
> > > {
> > > pg_fatal("page checksum failed");
> > > }
> >
> > The WAL already has a checksum, so not certain this makes sense on its
> > own. Also I'm inclined to make it a warning if it doesn't match rather
> > than a fatal. (I'd also have to verify that the checksum is properly
> > set on the page prior to copying the FPI into WAL, which I'm pretty
> > sure it is but not certain.)
>
> How about having it as an Assert()?

Based on empirical testing, the checksums don't match, so
asserting/alerting on each extracted block seems next to useless; I'm
going to just remove that.

> > > 5.
> > > +fsync(fileno(OPF));
> > > +fclose(OPF);
> > > I think just the fsync() isn't enough, you still need fsync_fname()
> > > and/or fsync_parent_path(), perhaps after for (block_id = 0; block_id
> > > <= XLogRecMaxBlockId(record); block_id++) loop.
> >
> > I'm not sure I get the value of the fsyncs here; if you are using this
> > tool at this capacity you're by definition doing some sort of
> > transient investigative steps.  Since the WAL was fsync'd, you could
> > always rerun/recreate as needed in the unlikely event of an OS crash
> > in the middle of this investigation.  Since this is outside the
> > purview of the database operations proper (unlike, say, initdb) seems
> > like it's unnecessary (or definitely shouldn't need to be selectable).
> > My thoughts are that if we're going to fsync, just do the fsyncs
> > unconditionally rather than complicate the interface further.
>
> -1 for fsync() per file created as it can create a lot of sync load on
> production servers impacting performance. How about just syncing the
> directory at the end assuming it doesn't cost as much as fsync() per
> FPI file created would?

I can fsync the dir if that's a useful compromise.
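
As a sketch of that compromise (illustrative only; it assumes a Unix-like
platform where fsync() on a directory fd is meaningful, and uses the
frontend pg_fatal() that pg_waldump already has available):

/* Called once, after all FPI files have been written. */
static void
sync_output_directory(const char *path)
{
    int         fd;

    fd = open(path, O_RDONLY | PG_BINARY, 0);
    if (fd < 0)
        pg_fatal("could not open directory \"%s\": %m", path);
    if (fsync(fd) != 0)
        pg_fatal("could not fsync directory \"%s\": %m", path);
    close(fd);
}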

> > > 4.
> > > +$primary->append_conf('postgresql.conf', "wal_level='replica'");
> > > +$primary->append_conf('postgresql.conf', 'archive_mode=on');
> > > +$primary->append_conf('postgresql.conf', "archive_command='/bin/false'");
> > > Why do you need to set wal_level to replica, out of the box your
> > > cluster comes with replica only no?
> > > And why do you need archive_mode on and set the command to do nothing?
> > > Why archiving is needed for testing your feature firstly?
> >
> > I think it had shown "minimal" in my testing; I was purposefully
> > failing archives so the WAL would stick around.  Maybe a custom
> > archive command that just copied a single WAL file into a known
> > location so I could use that instead of the current approach would
> > work, though not sure how Windows support would work with that.  Open
> > to other ideas to more cleanly get a single WAL file that isn't the
> > last one.  (Earlier versions of this test were using /all/ of the
> > generated WAL files rather than a single one, so maybe I am
> > overcomplicating things for a single WAL file case.)
>
> Typically we create a physical replication slot at the beginning so
> that the server keeps the WAL required for you in pg_wal itself, for
> instance, please see pg_walinspect:
>
> -- Make sure checkpoints don't interfere with the test.
> SELECT 'init' FROM
> pg_create_physical_replication_slot('regress_pg_walinspect_slot',
> true, false);

Will see if I can get something like this to work; I'm currently
stopping the server before running the file-based tests, but I suppose
there's no reason to do so, so a temporary slot that keeps the WAL around
until the test is complete is probably fine.

David




Re: Documentation for building with meson

2022-11-14 Thread samay sharma
Hi,

On Thu, Nov 10, 2022 at 4:46 AM Nazir Bilal Yavuz 
wrote:

> Hi,
>
> I did some tests on windows. I used 'ninja' as a backend.
> On 11/8/2022 9:23 PM, samay sharma wrote:
>
> Hi,
>
> On Sat, Nov 5, 2022 at 2:39 PM Andres Freund  wrote:
>
>> Hi,
>>
>> On 2022-10-30 20:51:59 -0700, samay sharma wrote:
>> > +# setup and enter build directory (done only first time)
>> > +meson setup build src --prefix=$PWD/install
>>
>> This command won't work on windows, I think.
>>
>
> Yes, $PWD isn't recognized on Windows; %CD% could be an alternative.
>
Added.

>
> > +  
>> > +
>>  
>> --sysconfdir=DIRECTORY
>> > +   
>> > +
>> > + Sets the directory for various configuration files,
>> > + PREFIX/etc by
>> default.
>> > +
>> > +   
>> > +  
>>
>> Need to check what windows does here.
>>
>
> It is the same on Windows: 'PREFIX/etc'.
>
> I also checked other dirs(bindir, sysconfdir, libdir, includedir, datadir,
> localedir, mandir), default path is correct for all of them.
>

Thanks. I've made a few windows specific fixes in the latest version.
Attached v5.

Regards,
Samay


>
> Regards,
> Nazir Bilal Yavuz
>
>
>
>


v5-0001-Add-docs-for-building-with-meson.patch
Description: Binary data


Re: libpq support for NegotiateProtocolVersion

2022-11-14 Thread Jacob Champion
On 11/13/22 01:21, Peter Eisentraut wrote:
> On 11.11.22 23:28, Jacob Champion wrote:
>> Put another way, why do we loop around and poll for more data when we
>> hit the end of the connection buffer, if we've already checked at this
>> point that we should have the entire message buffered locally?
> 
> Isn't that the same behavior for other message types?  I don't see 
> anything in the handling of the early 'E' and 'R' messages that would 
> handle this.

I agree for the 'E' case. For 'R', I see the msgLength being passed down
to pg_fe_sendauth().

> If we want to address this, maybe this should be handled 
> in the polling loop before we pass off the input buffer to the 
> per-message-type handlers.

I thought it was supposed to be handled by this code:

>   /*
>* Can't process if message body isn't all here yet.
>*/
>   msgLength -= 4;
>   avail = conn->inEnd - conn->inCursor;
>   if (avail < msgLength)
>   {
>   /*
>* Before returning, try to enlarge the input buffer if
>* needed to hold the whole message; see notes in
>* pqParseInput3.
>*/
>   if (pqCheckInBufferSpace(conn->inCursor + (size_t) msgLength,
>conn))
>   goto error_return;
>   /* We'll come back when there is more data */
>   return PGRES_POLLING_READING;
>   }

But after this block, we still treat EOF as if we need to get more data.
If we know that the message was supposed to be fully buffered, can we
just avoid the return to the polling loop altogether and error out
whenever we see EOF?
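
As a sketch of that idea (illustrative only, not existing libpq code):
once the length pre-check above has passed, a short read while parsing
the NegotiateProtocolVersion body could be treated as a malformed
message rather than a reason to wait for more input, e.g.:

/* 'num' is a local int; the body is known to be fully buffered here */
if (pqGetInt(&num, 4, conn) != 0)
{
    appendPQExpBufferStr(&conn->errorMessage,
                         libpq_gettext("received invalid protocol negotiation message\n"));
    goto error_return;
}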

Thanks,
--Jacob




Re: Add 64-bit XIDs into PostgreSQL 15

2022-11-14 Thread Aleksander Alekseev
Hi hackers,

> Thanks, done!

Dilip Kumar asked a good question in the thread about the 0001..0003
subset [1]. I would like to duplicate it here to make sure it was not
missed by mistake:

"""
Have we measured the WAL overhead because of this patch set? Maybe
these particular patches will not have an impact, but IIUC this is
groundwork for making xids 64-bit. So each XLOG record's size will
increase by at least 4 bytes because the XLogRecord contains the xid.
"""

Do we have an estimate on this?

[1]: 
https://www.postgresql.org/message-id/CAFiTN-uudj2PY8GsUzFtLYFpBoq_rKegW3On_8ZHdxB1mVv3-A%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev




Re: HOT chain validation in verify_heapam()

2022-11-14 Thread Peter Geoghegan
On Mon, Nov 14, 2022 at 9:38 AM Andres Freund  wrote:
> Anyway, I played a bit around with this. It's hard to hit, not because we
> somehow won't choose such a horizon, but because we'll commonly prune the
> earlier tuple version away due to xmax being old enough.

That must be a bug, then. Since, as I said, I can't see how it could
possibly be okay to freeze an xmin of tuple in a HOT chain without
also making sure that it has no earlier versions left behind. If there
are earlier versions that we have to go through to get to the
frozen-xmin tuple (not just an LP_REDIRECT), we're going to break the
HOT chain traversal logic in code like heap_hot_search_buffer in a
rather obvious way.

The HOT chain traversal code will interpret the frozen xmin from the
tuple as FrozenTransactionId (not as its raw xmin). So traversal is
just broken in this scenario.
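
The chain-following check in heap_hot_search_buffer() looks roughly like
this (paraphrased here for illustration):

/*
 * The xmin of each chain member must match the xmax of the tuple that
 * produced it, else the chain is considered broken.
 */
if (TransactionIdIsValid(prev_xmax) &&
    !TransactionIdEquals(prev_xmax,
                         HeapTupleHeaderGetXmin(heapTuple->t_data)))
    break;
...
prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple->t_data);

Since HeapTupleHeaderGetXmin() reports FrozenTransactionId for a frozen
tuple instead of its raw xmin, a frozen member in the middle of a chain
can never satisfy that equality, and the walk stops early.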

> It *is* possible to
> hit, if the horizon increases between the two tuple version checks (e.g. if
> there's another tuple inbetween that we check the visibility of).

I suppose that that's the detail that "protects" us, then -- that
would explain the apparent lack of problems in the field. Your
sequence requires 3 sessions, not just 2.

-- 
Peter Geoghegan




Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock

2022-11-14 Thread Andres Freund
On 2022-11-14 16:06:27 +0530, Amit Kapila wrote:
> Pushed.

Thanks.




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-14 12:29:58 -0500, Tom Lane wrote:
> I'd vote for just overflowed true/false.  Why do people need to know
> the exact number of subtransactions?  (If there is a use-case, that
> would definitely be material for an auxiliary function instead of a
> view column.)

I'd go the other way. It's pretty unimportant whether it overflowed, it's
important how many subtxns there are. The cases where overflowing causes real
problems are when there's many thousand subtxns - which one can't judge just
from suboverflowed alone. Nor can monitoring a boolean tell you whether you're
creeping closer to the danger zone.

Monitoring the number also has the advantage that we'd not embed an
implementation detail ("suboverflowed") in a view. The number of
subtransactions is far less prone to changing than the way we implement
subtransactions in the procarray.

But TBH, to me this still is something that'd be better addressed with a
tracepoint.

I don't buy the argument that the ship of pg_stat_activity width has entirely
sailed. A session still fits onto a reasonably sized terminal in \x output -
but not much longer.

Greetings,

Andres Freund




Re: HOT chain validation in verify_heapam()

2022-11-14 Thread Andres Freund
Hi,

On 2022-11-09 18:35:12 -0800, Peter Geoghegan wrote:
> On Wed, Nov 9, 2022 at 6:10 PM Andres Freund  wrote:
> > And thinking about it, it'd be quite bad if the horizon worked that way. 
> > You can easily construct a workload where every single xid would "skewer" 
> > some chain, never allowing the horizon to be raised.
> 
> Your whole scenario is one involving a insert of a tuple by XID 10,
> which is then updated by XID 5 -- a lower XID. Obviously that's
> possible, but it's relatively rare. I have to imagine that the vast
> majority of updates affect tuples inserted by an XID before the XID of
> the updater.

> My use of the term "skewer" was limited to updates that look like
> that. So I don't know what you mean about never allowing the horizon
> to be raised.

You don't need it to happen all the time; it's enough when it happens
occasionally, since that'd "block" the whole range of xids in between. So
you'd just need occasional transactions to prevent the horizon from
increasing.


Anyway, I played a bit around with this. It's hard to hit, not because we
somehow won't choose such a horizon, but because we'll commonly prune the
earlier tuple version away due to xmax being old enough. It *is* possible to
hit, if the horizon increases between the two tuple version checks (e.g. if
there's another tuple inbetween that we check the visibility of).

I think there's another way it can happen in older clusters, but I don't
want to spend the time to verify it.

Either way, we can't error out in this situation - there's nothing invalid
about it.

Greetings,

Andres Freund




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Tom Lane
"David G. Johnston"  writes:
> On Mon, Nov 14, 2022 at 9:41 AM Robert Haas  wrote:
>> The overhead of fetching the information is not large, but Justin is
>> concerned about the effect on the display width. I feel that's kind of
>> a lost cause because it's so wide already anyway, but I don't see a
>> reason why we need *two* new columns. Can't we get by with just one?
>> It could be overflowed true/false, or it could be the number of
>> subtransaction XIDs but with NULL instead if overflowed.

> NULL when overflowed seems like the opposite of the desired effect, calling
> attention to the exceptional status.  Make it a text column and write
> "overflow" or "###" as appropriate.  Anyone using the column is going to
> end up wanting to special-case overflow anyway and number-to-text
> conversion aside from overflow is simple enough if a number, and not just a
> display label, is needed.

I'd vote for just overflowed true/false.  Why do people need to know
the exact number of subtransactions?  (If there is a use-case, that
would definitely be material for an auxiliary function instead of a
view column.)

regards, tom lane




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread David G. Johnston
On Mon, Nov 14, 2022 at 9:41 AM Robert Haas  wrote:

> On Mon, Nov 14, 2022 at 11:35 AM Amit Singh 
> wrote:
> > Making the information available in pg_stat_activity makes it a lot
> easier to identify the pid which has caused the subtran overflow. Debugging
> through the app code can be an endless exercise and logging every statement
> in postgresql logs is not practical either. If the overhead of fetching the
> information isn't too big, I think we should consider the
> subtransaction_count and is_overflowed field as potential candidates for
> the enhancement of pg_stat_activity.
>
> The overhead of fetching the information is not large, but Justin is
> concerned about the effect on the display width. I feel that's kind of
> a lost cause because it's so wide already anyway, but I don't see a
> reason why we need *two* new columns. Can't we get by with just one?
> It could be overflowed true/false, or it could be the number of
> subtransaction XIDs but with NULL instead if overflowed.
>
> Do you have a view on this point?
>
>
NULL when overflowed seems like the opposite of the desired effect, calling
attention to the exceptional status.  Make it a text column and write
"overflow" or "###" as appropriate.  Anyone using the column is going to
end up wanting to special-case overflow anyway and number-to-text
conversion aside from overflow is simple enough if a number, and not just a
display label, is needed.

David J.


Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Robert Haas
On Mon, Nov 14, 2022 at 11:35 AM Amit Singh  wrote:
> Making the information available in pg_stat_activity makes it a lot easier to 
> identify the pid which has caused the subtran overflow. Debugging through the 
> app code can be an endless exercise and logging every statement in postgresql 
> logs is not practical either. If the overhead of fetching the information 
> isn't too big, I think we should consider the subtransaction_count and 
> is_overflowed field as potential candidates for the enhancement of 
> pg_stat_activity.

The overhead of fetching the information is not large, but Justin is
concerned about the effect on the display width. I feel that's kind of
a lost cause because it's so wide already anyway, but I don't see a
reason why we need *two* new columns. Can't we get by with just one?
It could be overflowed true/false, or it could be the number of
subtransaction XIDs but with NULL instead if overflowed.

Do you have a view on this point?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Amit Singh
Making the information available in pg_stat_activity makes it a lot easier
to identify the pid which has caused the subtran overflow. Debugging
through the app code can be an endless exercise and logging every statement
in postgresql logs is not practical either. If the overhead of fetching the
information isn't too big, I think we should consider the
subtransaction_count and is_overflowed field as potential candidates for
the enhancement of pg_stat_activity.


Regards
Amit Singh

On Mon, Nov 14, 2022 at 11:10 PM Robert Haas  wrote:

> On Mon, Mar 21, 2022 at 7:45 PM Andres Freund  wrote:
> > > It feels to me like far too much effort is being invested in
> fundamentally
> > > the wrong direction here.  If the subxact overflow business is causing
> > > real-world performance problems, let's find a way to fix that, not put
> > > effort into monitoring tools that do little to actually alleviate
> anyone's
> > > pain.
> >
> > There seems to be some agreement on this (I certainly do agree). Thus it
> seems
> > we should mark the CF entry as rejected?
>
> I don't think I agree with this outcome, for two reasons.
>
> First, we're just talking about an extra couple of columns in
> pg_stat_activity here, which does not seem like a heavy price to pay.
> I'm not even sure we need two columns; I think we could get down to
> one pretty easily. Rough idea: number of cached subtransaction XIDs if
> not overflowed, else NULL. Or if that's likely to create 0/NULL
> confusion, then maybe just a Boolean, overflowed or not.
>
> Second, the problem seems pretty fundamental to me. Shared memory is
> fixed size, so we cannot use it to store an unbounded number of
> subtransaction IDs. We could perhaps rejigger things to be more
> memory-efficient in some way, but no matter how many subtransaction
> XIDs you can keep in shared memory, the user can always consume that
> number plus one --  unless you allow for ~2^31 in shared memory, which
> seems unrealistic. To me, that means that overflowed snapshots are not
> going away. We could make them less painful by rewriting the SLRU
> stuff to be more efficient, and I bet that's possible, but I think
> it's probably hard, or someone would have gotten it done by now. This
> has been sucking for a long time and I see no evidence that progress
> is imminent. Even if it happens, it is unlikely that it will be a full
> solution. If it were possible to make SLRU lookups fast enough not to
> matter, we wouldn't need to have hint bits, but in reality we do have
> them and attempts to get rid of them have not gone well up until now,
> and in my opinion probably never will.
>
> The way that I view this problem is that it is relatively rare but
> hard for some users to troubleshoot. I think I've seen it come up
> multiple times, and judging from the earlier responses on this thread,
> several other people here have, too. In my experience, the problem is
> inevitably that someone has a DML statement inside a plpgsql EXCEPTION
> block inside a plpgsql loop. Concurrently with that, they are running
> a lot of queries that look at recently modified data, so that the
> overflowed snapshots trigger SLRU lookups often enough to matter. How
> is a user supposed to identify which backend is causing the problem,
> as things stand today? I have generally given people the advice to go
> find the DML inside of a plpgsql EXCEPTION block inside of a loop, but
> some users have trouble doing that. The DBA who is observing the
> performance problem is not necessarily the developer who wrote all of
> the PL code, and the PL code may be large and badly formatted and
> there could be a bunch of EXCEPTION blocks and it might not be clear
> which one is the problem. The exception block could be calling another
> function or procedure that does the actual DML rather than doing it
> directly, and the loop surrounding it might not be in the same
> function or procedure but in some other one that calls it, or it could
> be called repeatedly from the SQL level.
>
> I think I fundamentally disagree with the idea that we should refuse
> to expose instrumentation data because some day the internals might
> change. If we accepted that argument categorically, we wouldn't have
> things like backend_xmin or backend_xid in pg_stat_activity, or wait
> events either, but we do have those things and users find them useful.
> They suck in the sense that you need to know quite a bit about how the
> internals work in order to use them to find problems, but people who
> want to support production PostgreSQL instances have to learn about
> how those internals work one way or the other because they
> demonstrably matter. It is absolutely stellar when we can say "hey, we
> don't need to have a way for users to see what's going on here
> internally because they don't ever need to care," but once it is
> established that they do need to care, we should let them see directly
> the data they need to care about rather than forcing them to
> 

Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Robert Haas
On Mon, Nov 14, 2022 at 11:18 AM David G. Johnston
 wrote:
>> I guess that's OK. I don't particularly favor that approach here but I
>> can live with it. I agree that too-wide views are annoying, but as far
>> as pg_stat_activity goes, that ship has pretty much sailed already,
>> and the same is true for a lot of other views. Inventing a one-off
>> solution for this particular case doesn't seem particularly warranted
>> to me but, again, I can live with it.
>
> I can see putting counts that people would want to use for statistics 
> elsewhere but IIUC the whole purpose of "overflowed" is to inform someone 
> that their session presently has degraded performance because it has created 
> too many subtransactions.  Just because the "degraded" condition itself is 
> rare doesn't mean the field "is my session degraded" is going to be seldom 
> consulted.  In fact, I would rather think it is always briefly consulted to 
> confirm it has the expected value of "false" (blank, IMO, don't show anything 
> in that column unless it is exceptional) and the presence of a value there 
> would draw attention to the desired fact that something is wrong and warrants 
> further investigation.  The pg_stat_activity view seems like the perfect 
> place to at least display that exception flag.

OK, thanks for voting. I take that as +1 for putting it in
pg_stat_activity proper, which is also my preferred approach.

However, a slight correction: it doesn't inform you that your session
has degraded performance. It informs you that your session may be
degrading everyone else's performance.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread David G. Johnston
On Mon, Nov 14, 2022 at 9:04 AM Robert Haas  wrote:

> On Mon, Nov 14, 2022 at 10:57 AM Justin Pryzby 
> wrote:
> > > First, we're just talking about an extra couple of columns in
> > > pg_stat_activity here, which does not seem like a heavy price to pay.
> >
> > The most recent patch adds a separate function rather than adding more
> > columns to pg_stat_activity.  I think the complaint about making that
> > view wider for infrequently-used columns is entirely valid.
>
> I guess that's OK. I don't particularly favor that approach here but I
> can live with it. I agree that too-wide views are annoying, but as far
> as pg_stat_activity goes, that ship has pretty much sailed already,
> and the same is true for a lot of other views. Inventing a one-off
> solution for this particular case doesn't seem particularly warranted
> to me but, again, I can live with it.
>
>
I can see putting counts that people would want to use for statistics
elsewhere but IIUC the whole purpose of "overflowed" is to inform someone
that their session presently has degraded performance because it has
created too many subtransactions.  Just because the "degraded" condition
itself is rare doesn't mean the field "is my session degraded" is going to
be seldom consulted.  In fact, I would rather think it is always briefly
consulted to confirm it has the expected value of "false" (blank, IMO,
don't show anything in that column unless it is exceptional) and the
presence of a value there would draw attention to the desired fact that
something is wrong and warrants further investigation.  The
pg_stat_activity view seems like the perfect place to at least display that
exception flag.

David J.


Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Robert Haas
On Mon, Nov 14, 2022 at 10:57 AM Justin Pryzby  wrote:
> > First, we're just talking about an extra couple of columns in
> > pg_stat_activity here, which does not seem like a heavy price to pay.
>
> The most recent patch adds a separate function rather than adding more
> columns to pg_stat_activity.  I think the complaint about making that
> view wider for infrequently-used columns is entirely valid.

I guess that's OK. I don't particularly favor that approach here but I
can live with it. I agree that too-wide views are annoying, but as far
as pg_stat_activity goes, that ship has pretty much sailed already,
and the same is true for a lot of other views. Inventing a one-off
solution for this particular case doesn't seem particularly warranted
to me but, again, I can live with it.

> Why would pg_upgrade fail due to new/removed columns in
> pg_stat_activity?  Do you mean if a user creates a view on top of it?

Yes, that is a thing that some people do, and I think it is the most
likely way for any changes to the view definition to cause
compatibility problems. I could be wrong, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: when the startup process doesn't (logging startup delays)

2022-11-14 Thread Robert Haas
On Mon, Nov 14, 2022 at 7:37 AM Simon Riggs
 wrote:
> > While at it, I noticed that we report redo progress for PITR, but we
> > don't report when standby enters archive recovery mode, say due to a
> > failure in the connection to primary or after the promote signal is
> > found. Isn't it useful to report in this case as well to know the
> > recovery progress?
>
> I think your patch disables progress too early, effectively turning
> off the standby progress feature. The purpose was to report on things
> that take long periods during recovery, not just prior to recovery.
>
> I would advocate that we disable progress only while waiting, as I've done 
> here:
> https://www.postgresql.org/message-id/CANbhV-GcWjZ2cmj0uCbZDWQUHnneMi_4EfY3dVWq0-yD5o7Ccg%40mail.gmail.com

Maybe I'm confused here, but I think that, on a standby, startup
progress messages are only printed until the main redo loop is
reached. Otherwise, we would print a message on a standby every 10s
forever, which seems like a thing that most users would not like. So I
think that Bharath has the right idea here.

I don't think that his patch is right in detail, though. I don't think
the call to disable_timeout() needs to be conditional, and I don't
think the Assert is correct. Also, I think that your patch has the
right idea in encapsulating the disable_timeout() call inside a new
function disable_startup_progress_timeout(), rather than having the
details known directly by xlogrecovery.c.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Justin Pryzby
On Mon, Nov 14, 2022 at 10:09:57AM -0500, Robert Haas wrote:
> On Mon, Mar 21, 2022 at 7:45 PM Andres Freund  wrote:
> > > It feels to me like far too much effort is being invested in fundamentally
> > > the wrong direction here.  If the subxact overflow business is causing
> > > real-world performance problems, let's find a way to fix that, not put
> > > effort into monitoring tools that do little to actually alleviate anyone's
> > > pain.
> >
> > There seems to be some agreement on this (I certainly do agree). Thus it 
> > seems
> > we should mark the CF entry as rejected?
> 
> I don't think I agree with this outcome, for two reasons.
> 
> First, we're just talking about an extra couple of columns in
> pg_stat_activity here, which does not seem like a heavy price to pay.

The most recent patch adds a separate function rather than adding more
columns to pg_stat_activity.  I think the complaint about making that
view wider for infrequently-used columns is entirely valid.

> If and when it happens that a field like backend_xmin or the new ones
> proposed here are no longer relevant, we can just remove them from the
> monitoring views. Yeah, that's a backward compatibility break, and
> there's some pain associated with that. But we have demonstrated that
> we are perfectly willing to incur the pain associated with adding new
> columns when there is new and valuable information to display, and
> that is equally a compatibility break, in the sense that it has about
> the same chance of making pg_upgrade fail.

Why would pg_upgrade fail due to new/removed columns in
pg_stat_activity?  Do you mean if a user creates a view on top of it?

-- 
Justin




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Robert Haas
On Mon, Nov 14, 2022 at 10:41 AM Tom Lane  wrote:
> Maybe the original patch took an hour to write, but it's sure been
> bikeshedded to death :-(.  I was complaining about the total amount
> of attention spent more than the patch itself.

Unfortunately, that problem is not unique to this patch, and even more
unfortunately, despite all the bikeshedding, we still often get it
wrong. Catching up from my week off I see that you've fixed not one
but two bugs in a patch I thought I'd reviewed half to death. :-(

> The patch of record seems to be v4 from 2022-01-13, which was failing
> in cfbot at last report but presumably could be fixed easily.  The
> proposed documentation's grammar is pretty shaky, but I don't see
> much else wrong in a quick eyeball scan.

I can take a crack at improving the documentation. Do you have a view
on the best way to cut this down to a single new column, or the
desirability of doing so?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Tom Lane
Robert Haas  writes:
> In short, I think this is a good idea, and if somebody thinks that we
> should solve the underlying problem instead, I'd like to hear what
> people think a realistic solution might be. Because to me, it looks
> like we're refusing to commit a patch that probably took an hour to
> write because with 10 years of engineering effort we could *maybe* fix
> the root cause.

Maybe the original patch took an hour to write, but it's sure been
bikeshedded to death :-(.  I was complaining about the total amount
of attention spent more than the patch itself.

The patch of record seems to be v4 from 2022-01-13, which was failing
in cfbot at last report but presumably could be fixed easily.  The
proposed documentation's grammar is pretty shaky, but I don't see
much else wrong in a quick eyeball scan.

regards, tom lane




Re: Small TAP improvements

2022-11-14 Thread Andrew Dunstan


On 2022-11-09 We 05:35, Andrew Dunstan wrote:
> On 2022-11-06 Su 08:51, Álvaro Herrera wrote:
>> On 2022-Jun-14, Andrew Dunstan wrote:
>>
>>> OK, here's a more principled couple of patches. For config_data, if you
>>> give multiple options it gives you back the list of values. If you don't
>>> specify any, in scalar context it just gives you back all of pg_config's
>>> output, but in array context it gives you a map, so you should be able
>>> to say things like:
>>>
>>> my %node_config = $node->config_data;
>> Hi, it looks to me like these were forgotten?
>>
> Yeah, will get to it this week.
>
>
> Thanks for the reminder.
>
>

Pushed now.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-14 Thread Robert Haas
On Mon, Mar 21, 2022 at 7:45 PM Andres Freund  wrote:
> > It feels to me like far too much effort is being invested in fundamentally
> > the wrong direction here.  If the subxact overflow business is causing
> > real-world performance problems, let's find a way to fix that, not put
> > effort into monitoring tools that do little to actually alleviate anyone's
> > pain.
>
> There seems to be some agreement on this (I certainly do agree). Thus it seems
> we should mark the CF entry as rejected?

I don't think I agree with this outcome, for two reasons.

First, we're just talking about an extra couple of columns in
pg_stat_activity here, which does not seem like a heavy price to pay.
I'm not even sure we need two columns; I think we could get down to
one pretty easily. Rough idea: number of cached subtransaction XIDs if
not overflowed, else NULL. Or if that's likely to create 0/NULL
confusion, then maybe just a Boolean, overflowed or not.

Second, the problem seems pretty fundamental to me. Shared memory is
fixed size, so we cannot use it to store an unbounded number of
subtransaction IDs. We could perhaps rejigger things to be more
memory-efficient in some way, but no matter how many subtransaction
XIDs you can keep in shared memory, the user can always consume that
number plus one --  unless you allow for ~2^31 in shared memory, which
seems unrealistic. To me, that means that overflowed snapshots are not
going away. We could make them less painful by rewriting the SLRU
stuff to be more efficient, and I bet that's possible, but I think
it's probably hard, or someone would have gotten it done by now. This
has been sucking for a long time and I see no evidence that progress
is imminent. Even if it happens, it is unlikely that it will be a full
solution. If it were possible to make SLRU lookups fast enough not to
matter, we wouldn't need to have hint bits, but in reality we do have
them and attempts to get rid of them have not gone well up until now,
and in my opinion probably never will.

The way that I view this problem is that it is relatively rare but
hard for some users to troubleshoot. I think I've seen it come up
multiple times, and judging from the earlier responses on this thread,
several other people here have, too. In my experience, the problem is
inevitably that someone has a DML statement inside a plpgsql EXCEPTION
block inside a plpgsql loop. Concurrently with that, they are running
a lot of queries that look at recently modified data, so that the
overflowed snapshots trigger SLRU lookups often enough to matter. How
is a user supposed to identify which backend is causing the problem,
as things stand today? I have generally given people the advice to go
find the DML inside of a plpgsql EXCEPTION block inside of a loop, but
some users have trouble doing that. The DBA who is observing the
performance problem is not necessarily the developer who wrote all of
the PL code, and the PL code may be large and badly formatted and
there could be a bunch of EXCEPTION blocks and it might not be clear
which one is the problem. The exception block could be calling another
function or procedure that does the actual DML rather than doing it
directly, and the loop surrounding it might not be in the same
function or procedure but in some other one that calls it, or it could
be called repeatedly from the SQL level.

I think I fundamentally disagree with the idea that we should refuse
to expose instrumentation data because some day the internals might
change. If we accepted that argument categorically, we wouldn't have
things like backend_xmin or backend_xid in pg_stat_activity, or wait
events either, but we do have those things and users find them useful.
They suck in the sense that you need to know quite a bit about how the
internals work in order to use them to find problems, but people who
want to support production PostgreSQL instances have to learn about
how those internals work one way or the other because they
demonstrably matter. It is absolutely stellar when we can say "hey, we
don't need to have a way for users to see what's going on here
internally because they don't ever need to care," but once it is
established that they do need to care, we should let them see directly
the data they need to care about rather than forcing them to
troubleshoot the problem in some more roundabout way like auditing all
of the code and guessing which part is the problem, or writing custom
dtrace scripts to run on their production instances.

If and when it happens that a field like backend_xmin or the new ones
proposed here are no longer relevant, we can just remove them from the
monitoring views. Yeah, that's a backward compatibility break, and
there's some pain associated with that. But we have demonstrated that
we are perfectly willing to incur the pain associated with adding new
columns when there is new and valuable information to display, and
that is equally a compatibility break, in the sense that it has about
the same 

List of Bitmapset (was Re: ExecRTCheckPerms() and many prunable partitions)

2022-11-14 Thread Tom Lane
[ I'm intentionally forking this off as a new thread, so as to
not confuse the cfbot about what's the live patchset on the
ExecRTCheckPerms thread. ]

Amit Langote  writes:
> On Sat, Nov 12, 2022 at 1:46 AM Tom Lane  wrote:
>> The main thing I was wondering about in connection with that
>> was whether to assume that there could be other future applications
>> of the logic to perform multi-bitmapset union, intersection,
>> etc.  If so, then I'd be inclined to choose different naming and
>> put those functions in or near to bitmapset.c.  It doesn't look
>> like Amit's code needs anything like that, but maybe somebody
>> has an idea about other applications?

> Yes, simple storage of multiple Bitmapsets in a List somewhere in a
> parse/plan tree sounded like it would have wide enough use to add
> proper node support for.   Assuming you mean trying to generalize
> VarAttnoSet in your patch 0004 posted at [2], I wonder if you want to
> somehow make its indexability by varno / RT index a part of the
> interface of the generic code you're thinking for it?

For discussion's sake, here's my current version of that 0004 patch,
rewritten to use list-of-bitmapset as the data structure.  (This
could actually be pulled out of the outer-join-vars patchset and
committed independently, just as a minor performance improvement.
It doesn't quite apply cleanly to HEAD, but pretty close.)

As it stands, the new functions are still in util/clauses.c, but
if we think they could be of general use it'd make sense to move them
either to nodes/bitmapset.c or to some new file under backend/nodes.
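
For context, the core primitive is essentially a positional union over a
List of Bitmapsets, with list position serving as the outer index (here,
the range table index).  A simplified sketch, as a guess at the simplest
possible body rather than the patch's actual implementation:

static List *
multi_bms_add_members(List *a, List *b)
{
    int         i = 0;
    ListCell   *lc;

    foreach(lc, b)
    {
        Bitmapset  *bset = (Bitmapset *) lfirst(lc);

        if (i < list_length(a))
        {
            /* union into the existing bitmapset at this position */
            ListCell   *acell = list_nth_cell(a, i);

            lfirst(acell) = bms_add_members((Bitmapset *) lfirst(acell),
                                            bset);
        }
        else
            a = lappend(a, bms_copy(bset));
        i++;
    }
    return a;
}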

Some other thoughts:

* The multi_bms prefix is a bit wordy, so I was thinking of shortening
the function names to mbms_xxx.  Maybe that's too brief.

* This is a pretty short list of functions so far.  I'm not eager
to build out a bunch of dead code though.  Is it OK to leave it
with just this much functionality until someone needs more?

* I'm a little hesitant about whether the API actually should be
List-of-Bitmapset, or some dedicated struct as I had in the previous
version of 0004.  This version is way less invasive in prepjointree.c
than that was, but the reason is there's ambiguity about what the
forced_null_vars Lists actually contain, which feels error-prone.

Comments?

regards, tom lane

commit 998723891b3ecea420ed77ff9de475f7b6c54cda
Author: Tom Lane 
Date:   Sun Nov 13 14:44:28 2022 -0500

Rewrite reduce_outer_joins' matching of Vars.

Adding varnullingrels breaks the logic in reduce_outer_joins_pass2
that detects antijoins by matching upper-level "Var IS NULL" tests to
strict join quals.  The problem of course is that the upper Var is
no longer equal() to the one in the join qual, since the former will
now be marked as being nulled by the outer join.

Now, this logic was always pretty brute-force: doing list_intersect
on a list full of Vars isn't especially cheap.  So let's fix it by
creating a better-suited data structure, namely a list of per-RTE
bitmaps of relevant Vars' attnos.

(I wonder if there aren't other applications for a list-of-bitmaps data
structure, which might lead to wanting this to be a general-purpose
data structure on the same level as Bitmapset.  But for now I just
settled for writing enough primitives for the immediate problem.)

Discussion: https://postgr.es/m/CAMbWs4-mvPPCJ1W6iK6dD5HiNwoJdi6mZp=-7me8n9sh+cd...@mail.gmail.com

diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index 4821b7a71f..1e22ddd100 100644
--- a/src/backend/optimizer/prep/prepjointree.c
+++ b/src/backend/optimizer/prep/prepjointree.c
@@ -2753,7 +2753,7 @@ reduce_outer_joins_pass1(Node *jtnode)
  *	state2: where to accumulate info about successfully-reduced joins
  *	root: toplevel planner state
  *	nonnullable_rels: set of base relids forced non-null by upper quals
- *	forced_null_vars: list of Vars forced null by upper quals
+ *	forced_null_vars: multi-bitmapset of Var attnos forced null by upper quals
  *
  * Returns info in state2 about outer joins that were successfully simplified.
  * Joins that were fully reduced to inner joins are all added to
@@ -2790,8 +2790,8 @@ reduce_outer_joins_pass2(Node *jtnode,
 		pass_nonnullable_rels = bms_add_members(pass_nonnullable_rels,
 nonnullable_rels);
 		pass_forced_null_vars = find_forced_null_vars(f->quals);
-		pass_forced_null_vars = list_concat(pass_forced_null_vars,
-			forced_null_vars);
+		pass_forced_null_vars = multi_bms_add_members(pass_forced_null_vars,
+	  forced_null_vars);
 		/* And recurse --- but only into interesting subtrees */
 		Assert(list_length(f->fromlist) == list_length(state1->sub_states));
 		forboth(l, f->fromlist, s, state1->sub_states)
@@ -2899,7 +2899,7 @@ reduce_outer_joins_pass2(Node *jtnode,
 		if (jointype == JOIN_LEFT)
 		{
 			List	   *nonnullable_vars;
-			List	   

Re: pg_basebackup's --gzip switch misbehaves

2022-11-14 Thread Daniel Gustafsson
> On 14 Nov 2022, at 15:23, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> How about this version?
> 
> This isn't correct shell syntax is it?
> 
> +PG_TEST_NOCLEAN make -C src/bin/pg_dump check
> 
> I think you meant
> 
> +PG_TEST_NOCLEAN=1 make -C src/bin/pg_dump check
> 
> or the like.

Ugh, yes, that's what it should say.

--
Daniel Gustafsson   https://vmware.com/





Re: pg_basebackup's --gzip switch misbehaves

2022-11-14 Thread Tom Lane
Daniel Gustafsson  writes:
> How about this version?

This isn't correct shell syntax is it?

+PG_TEST_NOCLEAN make -C src/bin/pg_dump check

I think you meant

+PG_TEST_NOCLEAN=1 make -C src/bin/pg_dump check

or the like.

regards, tom lane




Re: psql: Add command to use extended query protocol

2022-11-14 Thread Daniel Verite
Peter Eisentraut wrote:

> > I assume that we may sometimes want to use the
> > extended protocol on all queries of a script, like
> > pgbench does with --protocol=extended.
> 
> But is there an actual use case for this in psql?  In pgbench, there are 
> scenarios where you want to test aspects of prepared statements, plan 
> caching, and so on.  Is there something like that for psql?

If we set aside "exercising the protocol" as not an interesting use case
for psql, then no, I can't think of any benefit.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Commit fest 2022-11

2022-11-14 Thread Ian Lawrence Barwick
On Mon, Nov 14, 2022 at 22:23 James Coleman wrote:
>
> On Mon, Nov 14, 2022 at 7:08 AM Ian Lawrence Barwick  
> wrote:
> >
> > On Wed, Nov 9, 2022 at 8:12 Justin Pryzby wrote:
> 
> > > If my script is not wrong, these patches add TAP tests, but don't update
> > > the requisite ./meson.build file.  It seems like it'd be reasonable to
> > > set them all as WOA until that's done.
> > >
> > > $ for a in `git branch -a |sort |grep commitfest/40`; do : echo "$a..."; 
> > > x=`git log -1 --compact-summary "$a"`; echo "$x" |grep '/t/.*pl.*new' 
> > > >/dev/null || continue; echo "$x" |grep -Fw meson >/dev/null && continue; 
> > > git log -1 --oneline "$a"; done
> > > ... [CF 40/3558] Allow file inclusion in pg_hba and pg_ident files
> > > ... [CF 40/3628] Teach pg_waldump to extract FPIs from the WAL stream
> > > ... [CF 40/3646] Skip replicating the tables specified in except table 
> > > option
> > > ... [CF 40/3663] Switching XLog source from archive to streaming when 
> > > primary available
> > > ... [CF 40/3670] pg_rewind: warn when checkpoint hasn't happened after 
> > > promotion
> > > ... [CF 40/3729] Testing autovacuum wraparound
> > > ... [CF 40/3877] vacuumlo: add test to vacuumlo for test coverage
> > > ... [CF 40/3985] TDE key management patches
> >
> > Looks like your script is correct, will update accordingly.
> >
>
> It's possible this has been discussed before, but it seems less than
> ideal to have notifications about moving to WOA be sent only in a bulk
> email hanging off the "current CF" email chain as opposed to being
> sent to the individual threads. Perhaps that's something the app
> should do for us in this situation. Without that though the patch
> authors are left to wade through unrelated discussion, and, probably
> more importantly, the patch discussion thread doesn't show the current
> state (I think bumping there is more likely to prompt activity as
> well).

FWIW I've been manually "bumping" the respective threads, which is somewhat
time-consuming but seems to have been quite productive in terms of getting
patches updated.

Will do the same for the above once I've confirmed what is being requested
(which I presume is adding the new tests to the 'tests' array in the respective
"meson.build" file; just taking the opportunity to familiarize myself with it).

Regards

Ian Barwick




Re: Commit fest 2022-11

2022-11-14 Thread James Coleman
On Mon, Nov 14, 2022 at 7:08 AM Ian Lawrence Barwick  wrote:
>
> On Wed, Nov 9, 2022 at 8:12 Justin Pryzby wrote:

> > If my script is not wrong, these patches add TAP tests, but don't update
> > the requisite ./meson.build file.  It seems like it'd be reasonable to
> > set them all as WOA until that's done.
> >
> > $ for a in `git branch -a |sort |grep commitfest/40`; do : echo "$a..."; 
> > x=`git log -1 --compact-summary "$a"`; echo "$x" |grep '/t/.*pl.*new' 
> > >/dev/null || continue; echo "$x" |grep -Fw meson >/dev/null && continue; 
> > git log -1 --oneline "$a"; done
> > ... [CF 40/3558] Allow file inclusion in pg_hba and pg_ident files
> > ... [CF 40/3628] Teach pg_waldump to extract FPIs from the WAL stream
> > ... [CF 40/3646] Skip replicating the tables specified in except table 
> > option
> > ... [CF 40/3663] Switching XLog source from archive to streaming when 
> > primary available
> > ... [CF 40/3670] pg_rewind: warn when checkpoint hasn't happened after 
> > promotion
> > ... [CF 40/3729] Testing autovacuum wraparound
> > ... [CF 40/3877] vacuumlo: add test to vacuumlo for test coverage
> > ... [CF 40/3985] TDE key management patches
>
> Looks like your script is correct, will update accordingly.
>
> Do we have a FAQ/checklist of meson things to consider for patches anywhere?

It's possible this has been discussed before, but it seems less than
ideal to have notifications about moving to WOA be sent only in a bulk
email hanging off the "current CF" email chain as opposed to being
sent to the individual threads. Perhaps that's something the app
should do for us in this situation. Without that though the patch
authors are left to wade through unrelated discussion, and, probably
more importantly, the patch discussion thread doesn't show the current
state (I think bumping there is more likely to prompt activity as
well).

James Coleman




Error on missing Python module in Meson setup

2022-11-14 Thread Daniel Gustafsson
When setting up a postgres tree with Meson on an almost empty Debian 11 VM I
hit an error on "meson setup -Ddebug=true build ." like this:

Program python3 found: YES (/usr/bin/python3)
meson.build:987:2: ERROR: Unknown method "dependency" in object.

The error in itself isn't terribly self-explanatory.  According to the log the
error was a missing Python package:

Traceback (most recent call last):
File "", line 20, in 
File "", line 8, in links_against_libpython
ModuleNotFoundError: No module named ‘distutils.core'

Installing the distutils package fixes it, but it seems harsh to fail setup on
a missing package. Would something like the attached make sense?

--
Daniel Gustafsson   https://vmware.com/



meson_python.diff
Description: Binary data


RE: Time delayed LR (WAS Re: logical replication restrictions)

2022-11-14 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> > > It seems to me Kuroda-San has proposed this change [1] to fix the test
> > > but it is not clear to me why such a change is required. Why can't
> > > CHECK_FOR_INTERRUPTS() after waiting, followed by the existing below
> > > code [2] in LogicalRepApplyLoop() sufficient to handle parameter
> > > updates?

(I forgot to say: this change was not proposed by me. I only said that it
should be modified. I thought workers should wake up after the transaction
was committed.)

> So, why only honor the 'disable' option of the subscription? For
> example, one can change 'min_apply_delay' and it seems
> recoveryApplyDelay() honors a similar change in the recovery
> parameter. Is there a way to set the latch of the worker process, so
> that it can recheck if anything is changed?

I had not considered that, but it seems reasonable. We may be able to
call maybe_reread_subscription() if subscription parameters are changed
and the latch is set.
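
To make that concrete, the kind of loop being discussed could look roughly
like the sketch below (illustrative only, not the actual patch: "delay_until"
would be computed from the commit timestamp plus min_apply_delay, and the
wait event name is just reused here as a placeholder):

while (true)
{
    long        diffms;

    diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
                                             delay_until);
    if (diffms <= 0)
        break;

    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     diffms,
                     WAIT_EVENT_RECOVERY_APPLY_DELAY);
    ResetLatch(MyLatch);

    CHECK_FOR_INTERRUPTS();

    /*
     * The latch can be set by ALTER SUBSCRIPTION, so re-read the
     * subscription here and react to DISABLE or a changed
     * min_apply_delay instead of sleeping through the whole delay.
     */
    maybe_reread_subscription();
}

With something like this, ALTER SUBSCRIPTION would only need to set the
worker's latch and the worker re-evaluates its parameters on the next
iteration.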

Currently, IIUC, we honor disabling the subscription regardless of the worker's
state, but should we avoid rereading the catalog while workers are handling
transactions, as LogicalRepApplyLoop() does?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-14 Thread John Naylor
On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada 
wrote:
>
> 0004 patch is a new patch supporting a pointer tagging of the node
> kind. Also, it introduces rt_node_ptr we discussed so that internal
> functions use it rather than having two arguments for encoded and
> decoded pointers. With this intermediate patch, the DSA support patch
> became more readable and understandable. Probably we can make it
> smaller further if we move the change of separating the control object
> from radix_tree to the main patch (0002). The patch still needs to be
> polished but I'd like to check if this idea is worthwhile. If we agree
> on this direction, this patch will be merged into the main radix tree
> implementation patch.

Thanks for the new patch set. I've taken a very brief look at 0004 and I
think the broad outlines are okay. As you say it needs polish, but before
going further, I'd like to do some experiments of my own as I mentioned
earlier:

- See how much performance we actually gain from tagging the node kind.
- Try additional size classes while keeping the node kinds to only four.
- Optimize node128 insert.
- Try templating out the differences between local and shared memory. With
local memory, the node-pointer struct would be a union, for example.
Templating would also reduce branches and re-simplify some internal APIs,
but it's likely that would also make the TID store and/or vacuum more
complex, because at least some external functions would be duplicated.
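
For the first experiment (tagging the node kind), the hope is that the kind
becomes readable without first dereferencing the node, which could help on
the descent path. A minimal standalone sketch of the technique, with names
invented here rather than taken from the patch:

#include <assert.h>
#include <stdint.h>

#define RT_NODE_KIND_BITS   2   /* enough for four node kinds */
#define RT_NODE_KIND_MASK   ((uintptr_t) ((1 << RT_NODE_KIND_BITS) - 1))

typedef uintptr_t rt_node_ptr;  /* node pointer with the kind in the low bits */

static inline rt_node_ptr
rt_node_ptr_encode(void *node, int kind)
{
    /* relies on node allocations being aligned to at least 4 bytes */
    assert(((uintptr_t) node & RT_NODE_KIND_MASK) == 0);
    return (uintptr_t) node | (uintptr_t) kind;
}

static inline int
rt_node_ptr_kind(rt_node_ptr ptr)
{
    return (int) (ptr & RT_NODE_KIND_MASK);
}

static inline void *
rt_node_ptr_decode(rt_node_ptr ptr)
{
    return (void *) (ptr & ~RT_NODE_KIND_MASK);
}

Whether avoiding that extra memory access is actually visible in benchmarks
is exactly what the first experiment should tell us.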

I'll set the patch to "waiting on author", but in this case the author is
me.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: when the startup process doesn't (logging startup delays)

2022-11-14 Thread Simon Riggs
On Tue, 8 Nov 2022 at 12:33, Bharath Rupireddy
 wrote:
>
> On Tue, Nov 8, 2022 at 4:35 PM Thomas Munro  wrote:
> >
> > On Sat, Oct 30, 2021 at 7:44 AM Robert Haas  wrote:
> > > Committed.
> >
> > Is it expected that an otherwise idle standby's recovery process
> > receives SIGALRM every N seconds, or should the timer be canceled at
> > that point, as there is no further progress to report?
>
> Nice catch. Yeah, that seems unnecessary, see the below standby logs.
> I think we need to disable_timeout(STARTUP_PROGRESS_TIMEOUT, false);,
> something like the attached? I think there'll be no issue with the
> patch since the StandbyMode gets reset only at the end of recovery (in
> FinishWalRecovery()) but it can very well be set during recovery (in
> ReadRecord()). Note that I also added an assertion in
> has_startup_progress_timeout_expired(), just in case.
>
> 2022-11-08 11:28:23.563 UTC [980909] LOG:  SIGALRM handle_sig_alarm received
> 2022-11-08 11:28:23.563 UTC [980909] LOG:
> startup_progress_timeout_handler called
> 2022-11-08 11:28:33.563 UTC [980909] LOG:  SIGALRM handle_sig_alarm received
> 2022-11-08 11:28:33.563 UTC [980909] LOG:
> startup_progress_timeout_handler called
> 2022-11-08 11:28:43.563 UTC [980909] LOG:  SIGALRM handle_sig_alarm received
> 2022-11-08 11:28:43.563 UTC [980909] LOG:
> startup_progress_timeout_handler called
> 2022-11-08 11:28:53.563 UTC [980909] LOG:  SIGALRM handle_sig_alarm received
> 2022-11-08 11:28:53.563 UTC [980909] LOG:
> startup_progress_timeout_handler called
>
> While at it, I noticed that we report redo progress for PITR, but we
> don't report when standby enters archive recovery mode, say due to a
> failure in the connection to primary or after the promote signal is
> found. Isn't it useful to report in this case as well to know the
> recovery progress?

I think your patch disables progress too early, effectively turning
off the standby progress feature. The purpose was to report on things
that take long periods during recovery, not just prior to recovery.

I would advocate that we disable progress only while waiting, as I've done here:
https://www.postgresql.org/message-id/CANbhV-GcWjZ2cmj0uCbZDWQUHnneMi_4EfY3dVWq0-yD5o7Ccg%40mail.gmail.com
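
To be explicit about what I mean, the change would be to bracket only the
wait for new WAL, something like the sketch below (illustrative only, not
the patch from that thread; disable_timeout(), begin_startup_progress_phase()
and STARTUP_PROGRESS_TIMEOUT exist today, the two wrapper names are made up):

static void
startup_progress_wait_for_wal_start(void)
{
    /* standby has nothing to apply: stop the per-interval SIGALRM */
    disable_timeout(STARTUP_PROGRESS_TIMEOUT, false);
}

static void
startup_progress_wait_for_wal_end(void)
{
    /* WAL is available again: restart the progress-reporting phase */
    begin_startup_progress_phase();
}

That way a standby that is busy replaying still reports progress, and only
a truly idle one stops receiving SIGALRM.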

--
Simon Riggs        http://www.EnterpriseDB.com/




Re: pg_basebackup's --gzip switch misbehaves

2022-11-14 Thread Daniel Gustafsson
> On 3 Nov 2022, at 12:49, Michael Paquier  wrote:
> 
> On Wed, Nov 02, 2022 at 09:42:12PM +0100, Daniel Gustafsson wrote:
>> I think that's a good idea, not everyone running tests will read the 
>> internals
>> documentation (or even know abou it even).  How about the attached?
> 
> Thanks for the patch.  Perhaps this should be mentioned additionally
> in install-windows.sgml?  I have not tested, but as long as these
> variables are configured with a "set" command in a command prompt,
> they would be passed down to the processes triggered by vcregress.pl
> (see for example TESTLOGDIR and TESTDATADIR).

That's probably a good idea; I've amended the patch with that and, in passing,
also made the CPAN mention of IPC::Run into a ulink, like it is in the Windows
section.  To avoid duplicating the info in the docs I made it into a sect2
which can be linked to.  How about this version?

--
Daniel Gustafsson   https://vmware.com/



regress_tap-v2.diff
Description: Binary data


Re: Commit fest 2022-11

2022-11-14 Thread Ian Lawrence Barwick
2022年11月9日(水) 8:12 Justin Pryzby :
>
> On Thu, Nov 03, 2022 at 07:43:03PM -0500, Justin Pryzby wrote:
> > On Thu, Nov 03, 2022 at 11:33:57AM +0900, Ian Lawrence Barwick wrote:
> > > 2022年11月2日(水) 19:10 Greg Stark :
> > > > On Tue, 1 Nov 2022 at 06:56, Michael Paquier  
> > > > wrote:
> > > >
> > > > > Two people showing up to help is really great, thanks!  I'll be around
> > > > > as well this month, so I'll do my share of patches, as usual.
> > > >
> > > > Fwiw I can help as well -- starting next week. I can't do much this 
> > > > week though.
> > > >
> > > > I would suggest starting with the cfbot to mark anything that isn't
> > > > applying cleanly and passing tests (and looking for more than design
> > > > feedback) as Waiting on Author and reminding the author that it's
> > > > commitfest time and a good time to bring the patch into a clean state.
> > >
> > > Sounds like a plan; I'll make a start on that today/tomorrow as I have
> > > some time.
> >
> > If I'm not wrong, Jacob used the CF app to bulk-mail people about
> > patches not applying and similar things.  That seemed to work well, and
> > doesn't require sending mails to dozens of threads.
>
> If my script is not wrong, these patches add TAP tests, but don't update
> the requisite ./meson.build file.  It seems like it'd be reasonable to
> set them all as WOA until that's done.
>
> $ for a in `git branch -a |sort |grep commitfest/40`; do : echo "$a..."; 
> x=`git log -1 --compact-summary "$a"`; echo "$x" |grep '/t/.*pl.*new' 
> >/dev/null || continue; echo "$x" |grep -Fw meson >/dev/null && continue; git 
> log -1 --oneline "$a"; done
> ... [CF 40/3558] Allow file inclusion in pg_hba and pg_ident files
> ... [CF 40/3628] Teach pg_waldump to extract FPIs from the WAL stream
> ... [CF 40/3646] Skip replicating the tables specified in except table option
> ... [CF 40/3663] Switching XLog source from archive to streaming when primary 
> available
> ... [CF 40/3670] pg_rewind: warn when checkpoint hasn't happened after 
> promotion
> ... [CF 40/3729] Testing autovacuum wraparound
> ... [CF 40/3877] vacuumlo: add test to vacuumlo for test coverage
> ... [CF 40/3985] TDE key management patches

Looks like your script is correct, will update accordingly.

Do we have a FAQ/checklist of meson things to consider for patches anywhere?

Regards

Ian Barwick




Re: Allow logical replication to copy tables in binary format

2022-11-14 Thread Melih Mutlu
Hello,

osumi.takami...@fujitsu.com wrote on Wed, 12 Oct 2022 at 04:36:

> >   I agree with the direction to support binary copy for v16 and
> later.
> >
> >   IIUC, the binary format replication with different data types
> fails even
> > during apply phase on HEAD.
> >   I thought that means, the upgrade concern only applies to a
> scenario
> > that the user executes
> >   only initial table synchronizations between the publisher and
> subscriber
> >   and doesn't replicate any data at apply phase after that. I would
> say
> >   this isn't a valid scenario and your proposal makes sense.
> >
> > No, logical replication in binary does not fail on apply phase if data
> types are
> > different.
> With HEAD, I observe in some case we fail at apply phase because of
> different data types like
> integer vs. bigint as written scenario in [1]. In short, I think we can
> slightly
> adjust your documentation and make it more general so that the description
> applies to
> both table sync phase and apply phase.
>

Yes, you're right. I somehow had the impression that HEAD supports
replication between different types in binary.
But as shown in the scenario you mentioned, it does not work.

> I'll suggest a below change for your sentence of logical-replication.sgml.
> FROM:
> In binary case, it is not allowed to replicate data between different
> types due to restrictions inherited from COPY.
> TO:
> Binary format is type specific and does not allow to replicate data
> between different types according to its
> restrictions.
>

In this case, this change makes sense since this patch does actually not
introduce this issue. It already exists in HEAD too.


> If my idea above is correct, then I feel we can remove all the fixes for
> create_subscription.sgml.
> I'm not sure if I should pursue this perspective of the document
> improvement
> any further after this email, since this isn't essentially because of this
> patch.
>

I'm only keeping the following change in create_subscription.sgml to indicate
that the binary option now copies the initial data in binary format as well.

> -  Specifies whether the subscription will request the publisher to
> -  send the data in binary format (as opposed to text).
> +  Specifies whether the subscription will copy the initial data to
> +  synchronize relations in binary format and also request the
> publisher
> +  to send the data in binary format too (as opposed to text).




> > The concern with upgrade (if data types are not the same) would be not
> being
> > able to create a new subscription with binary enabled or replicate new
> tables
> > added into publication.
> > Replication of tables from existing subscriptions would not be affected
> by this
> > change since they will already be in the apply phase, not tablesync.
> > Do you think this would still be an issue?
> Okay, thanks for explaining this. I understand that
> the upgrade concern applies to the table sync that is executed
> between text format (before the patch) and binary format (after the patch).
>

I was thinking apply would work with different types in binary format.
Since apply does not work either, the scenario that I tried to explain
earlier is no longer a concern.


Attached is the updated version of the patch.

Thanks,
Melih


v4-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Add 64-bit XIDs into PostgreSQL 15

2022-11-14 Thread Thom Brown
On Thu, 3 Nov 2022 at 08:12, Maxim Orlov  wrote:
>
> Hi!
>
>> I attach an additional V48-0009 patch as they are just comments, apply it if 
>> you want to.
>
> Big thank you for your review. I've applied your addition in the recent patch 
> set below.
>
> Besides, mentioned above, next changes are made:
> - rename HeapTupleCopyBaseFromPage to HeapTupleCopyXidsFromPage, since this 
> old name came from the time when the "t_xid_base" was stored in tuple,
>   and not correspond to recent state of the code;
> - replace ToastTupleHeader* calls with HeapHeader* with the "is_toast" 
> argument. This reduces diff and make the code more readable;
> - put HeapTupleSetZeroXids calls in several places for the sake of redundancy;
> - in heap_tuple_would_freeze add case to reset xmax without reading clog;
> - rename SeqTupleHeaderSetXmax/Xmin to SeqTupleSetXmax/min and refactoring of 
> the function; Now it will set HeapTuple and HeapTupleHeader xmax;
> - add case of int64 values in check_GUC_init;
> - massive refactoring in htup_details.h to use inline functions with type 
> control over macro;
> - reorder code in htup_details.h to reduce overall diff.
>
> As always, reviews and opinions are very welcome!

0008 needs a rebase.  heapam.h and catversion.h are failing.

Regards

Thom




Re: Unit tests for SLRU

2022-11-14 Thread Michael Paquier
On Fri, Nov 11, 2022 at 02:11:08PM +0900, Michael Paquier wrote:
> Is there a reason why you need a TAP test here?  It is by design more
> expensive than pg_regress and it does not require --enable-tap-tests.
> See for example what we do for snapshot_too_old, commit_ts,
> worker_spi, etc., where each module uses a custom configuration file.

I have put my hands on that, and I found that the tests were a bit
overengineered.  First, calling SimpleLruDoesPhysicalPageExist() before and
after each operation, like truncation or deletion, is not really necessary,
as the previous test cases were already doing equivalent checks.  The
hardcoded page numbers also lack a bit of flexibility and readability IMO,
especially when combined with the number of pages per segment.

I have reworked that as per the attached, which provides basically the
same coverage, going through a SQL interface for the whole thing.
Like all the other tests of its kind, this does not use a TAP test,
relying on a custom configuration file instead.  This still needs some
polishing, but the basics are here.

What do you think?
--
Michael
From 021cff01608b7defc52419168f58584d037e6edf Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 14 Nov 2022 20:16:36 +0900
Subject: [PATCH v6] Add unit tests for SLRU

SimpleLruInit() checks whether it is called under postmaster. For this reason
the tests can't be implemented in regress.c without changing the interface.
We wanted to avoid this. As a result the tests were implemented as an extension
in src/test/modules/ directory.

Aleksander Alekseev, reviewed by Michael Paquier, Daniel Gustafsson,
Noah Misch, Pavel Borisov, Maxim Orlov.
Discussion: https://postgr.es/m/CAJ7c6TOFoWcHOW4BVe3BG_uikCrO9B91ayx9d6rh5JZr_tPESg%40mail.gmail.com
---
 src/test/modules/Makefile |   1 +
 src/test/modules/meson.build  |   1 +
 src/test/modules/test_slru/.gitignore |   4 +
 src/test/modules/test_slru/Makefile   |  27 ++
 .../modules/test_slru/expected/test_slru.out  | 135 ++
 src/test/modules/test_slru/meson.build|  35 +++
 src/test/modules/test_slru/sql/test_slru.sql  |  38 +++
 src/test/modules/test_slru/test_slru--1.0.sql |  21 ++
 src/test/modules/test_slru/test_slru.c| 249 ++
 src/test/modules/test_slru/test_slru.conf |   2 +
 src/test/modules/test_slru/test_slru.control  |   4 +
 11 files changed, 517 insertions(+)
 create mode 100644 src/test/modules/test_slru/.gitignore
 create mode 100644 src/test/modules/test_slru/Makefile
 create mode 100644 src/test/modules/test_slru/expected/test_slru.out
 create mode 100644 src/test/modules/test_slru/meson.build
 create mode 100644 src/test/modules/test_slru/sql/test_slru.sql
 create mode 100644 src/test/modules/test_slru/test_slru--1.0.sql
 create mode 100644 src/test/modules/test_slru/test_slru.c
 create mode 100644 src/test/modules/test_slru/test_slru.conf
 create mode 100644 src/test/modules/test_slru/test_slru.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 7b3f292965..af3eef19b0 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -30,6 +30,7 @@ SUBDIRS = \
 		  test_regex \
 		  test_rls_hooks \
 		  test_shm_mq \
+		  test_slru \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index c2e5f5ffd5..0b2dd3134a 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,5 +24,6 @@ subdir('test_rbtree')
 subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
+subdir('test_slru')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_slru/.gitignore b/src/test/modules/test_slru/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_slru/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_slru/Makefile b/src/test/modules/test_slru/Makefile
new file mode 100644
index 00..936886753b
--- /dev/null
+++ b/src/test/modules/test_slru/Makefile
@@ -0,0 +1,27 @@
+# src/test/modules/test_slru/Makefile
+
+MODULE_big = test_slru
+OBJS = \
+	$(WIN32RES) \
+	test_slru.o
+PGFILEDESC = "test_slru - test module for SLRUs"
+
+EXTENSION = test_slru
+DATA = test_slru--1.0.sql
+
+REGRESS_OPTS = --temp-config $(top_srcdir)/src/test/modules/test_slru/test_slru.conf
+REGRESS = test_slru
+# Disabled because these tests require "shared_preload_libraries=test_slru",
+# which typical installcheck users do not have (e.g. buildfarm clients).
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_slru
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_slru/expected/test_slru.out 

Re: Privileges on PUBLICATION

2022-11-14 Thread Antonin Houska
Peter Eisentraut  wrote:

> On 04.11.22 08:28, Antonin Houska wrote:
> > I thought about the whole concept a bit more and I doubt if the PUBLICATION
> > privilege is the best approach. In particular, the user specified in CREATE
> > SUBSCRIPTION ... CONNECTION ... (say "subscription user") needs to have 
> > SELECT
> > privilege on the tables replicated. So if the DBA excludes some columns from
> > the publication's column list and sets the (publication) privileges in such 
> > a
> > way that the user cannot get the column values via other publications, the
> > user still can connect to the database directly and get values of the 
> > excluded
> > columns.
> 
> Why are the SELECT privileges needed?  Maybe that's something to think about
> and maybe change.

I haven't noticed an explanation in comments nor did I search in the mailing
list archives, but the question makes sense: the REPLICATION attribute of a
role is sufficient for streaming replication, so why should the logical
replication require additional privileges?

Technically the SELECT privilege is needed because the sync worker does
actually execute SELECT query on the published tables. However, I realize now
that it's not checked by the output plugin. Thus if SELECT is revoked from the
"subscription user" after the table has been synchronized, the replication
continues to work. So the necessity for the SELECT privilege might be an
omission rather than a design choice. (Even the documentation says that the
SELECT privilege is needed only for the initial synchronization [1], however
it does not tell why.)

> > As an alternative to the publication privileges, I think that the CREATE
> > SUBSCRIPTION command could grant ACL_SELECT automatically to the 
> > subscription
> > user on the individual columns contained in the publication column list, and
> > DROP SUBSCRIPTION would revoke that privilege.
> 
> I think that approach is weird and unusual.  Privileges and object creation
> should be separate operations.

ok. Another approach would be to skip the check for the SELECT privilege (as
well as the check for the USAGE privilege on the corresponding schema) if
a given column is being accessed via a publication which has it on its column
list and if the subscription user has the USAGE privilege on that publication.

So far I wasn't sure if we can do that because, if pg_upgrade grants the USAGE
privilege on all publications to the "public" role, the DBAs who relied on the
SELECT privileges might not notice that any role having the REPLICATION
attribute can access all the published tables after the upgrade. (pg_upgrade
can hardly do anything else because it has no information on the "subscription
users", so it cannot convert the SELECT privilege on tables to the USAGE
privileges on publications.)

But now that I see that the logical replication doesn't check the SELECT
privilege properly anyway, I think we can get rid of it.


[1] https://www.postgresql.org/docs/current/logical-replication-security.html

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Fix order of checking ICU options in initdb and create database

2022-11-14 Thread Marina Polyakova

On 2022-11-12 22:43, Jose Arthur Benetasso Villanova wrote:

Hello Marina.

I just reviewed your patch.

It applied cleanly at my current master (commit
d6a3dbe14f98d867b2fc3faeb99d2d3c2a48ca67).

Also, it worked as described in email. Since it's a clarification in
an error message, I think the documentation is fine.

I played a bit with "make check", creating a database in my native
language (pt_BR), testing with some data and everything worked as
expected.


Hello!

Thank you!

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Split index and table statistics into different types of stats

2022-11-14 Thread Drouvot, Bertrand

Hi,

On 11/4/22 9:51 PM, Melanie Plageman wrote:

Hi Bertrand,

I'm glad you are working on this.

I had a few thoughts/ideas



Thanks for looking at it!


It seems better to have all of the counts in the various stats structs
not be prefixed with n_, i_, t_

typedef struct PgStat_StatDBEntry
{
...
 PgStat_Counter n_blocks_fetched;
 PgStat_Counter n_blocks_hit;
 PgStat_Counter n_tuples_returned;
 PgStat_Counter n_tuples_fetched;
...

I've attached a patch (0002) to change this in case you are interested
in making such a change


I did not rename the fields/columns initially, to ease the review.

Indeed, I think we should go further than removing the n_, i_ and t_ 
prefixes so that the fields actually match the view's columns.


For example, currently pg_stat_all_indexes.idx_tup_read is linked to
"tuples_returned", so it would make sense to rename "tuples_returned" to
"tuples_read" or even "tup_read" in the index counters.
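
Just to visualise where that renaming could land (the field names below are
only a proposal at this stage, nothing is renamed in the current patch; this
assumes the usual pgstat.h context for PgStat_Counter), the index counters
could end up looking like:

typedef struct PgStat_IndexCounts
{
    PgStat_Counter numscans;        /* pg_stat_all_indexes.idx_scan */
    PgStat_Counter tuples_read;     /* idx_tup_read, was "tuples_returned" */
    PgStat_Counter tuples_fetched;  /* idx_tup_fetch */
    PgStat_Counter blocks_fetched;  /* idx_blks_read is fetched - hit */
    PgStat_Counter blocks_hit;      /* pg_statio_all_indexes.idx_blks_hit */
} PgStat_IndexCounts;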


That's why I had in mind to do this field/column renaming in a
separate patch (once this one is committed), so that the current one
focuses only on splitting the stats: what do you think?



(I've attached all of my suggestions in patches
along with your original patch so that cfbot still passes).

On Wed, Nov 2, 2022 at 5:00 AM Drouvot, Bertrand
 wrote:

On 11/1/22 1:30 AM, Andres Freund wrote:

On 2022-10-31 14:14:15 +0100, Drouvot, Bertrand wrote:

@@ -240,9 +293,23 @@ pg_stat_get_blocks_fetched(PG_FUNCTION_ARGS)
  PG_RETURN_INT64(result);
   }

+Datum
+pg_stat_get_index_blocks_fetched(PG_FUNCTION_ARGS)
+{
+Oid relid = PG_GETARG_OID(0);
+int64   result;
+PgStat_StatIndEntry *indentry;
+
+if ((indentry = pgstat_fetch_stat_indentry(relid)) == NULL)
+result = 0;
+else
+result = (int64) (indentry->blocks_fetched);
+
+PG_RETURN_INT64(result);
+}


We have so many copies of this by now - perhaps we first should deduplicate
them somehow? Even if it's just a macro or such.



Yeah good point, a new macro has been defined for the "int64" return
case in V3 attached.


I looked for other opportunities to de-duplicate, but most of the functions
that were added that are identical except the return type and
PgStat_Kind are short enough that it doesn't make sense to make wrappers
or macros.



Yeah, agree.


I do think it makes sense to reorder the members of the two structs
PgStat_IndexCounts and PgStat_TableCounts so that they have a common
header. I've done that in the attached patch (0003).



That's a good idea, thanks! But for that we would need to have the same
field names, which means:


- Remove the prefixes (as you've done in 0002)
- And probably reduce the number of fields in the new 
PgStat_RelationCounts that 003 is introducing (for example 
tuples_returned should be excluded if we're going to rename it later on 
to "tuples_read" for the indexes to map the 
pg_stat_all_indexes.idx_tup_read column).


ISTM that we should do it in the "renaming" effort, after this patch is 
committed.



In the flush functions, I was also thinking it might be nice to use the
same pattern as is used in [1] and [2] to add the counts together. It
makes the lines a bit easier to read, IMO. If we remove the prefixes
from the count fields, this works for many of the fields. I've attached
a patch (0004) that does something like this, in case you wanted to go
in this direction.


I like it too but same remarks as previously. I think it should be part 
of the "renaming" effort.




Since you have made new parallel functions for indexes and tables for
many of the functions in pgstat_relation.c, perhaps it makes sense to
split it into pgstat_table.c and pgstat_index.c?


Good point, thanks, I'll work on it.



One question I had about the original code (not related to your
refactor) is why the pending stats aren't memset in the flush functions
after aggregating them into the shared stats.


Not sure I'm getting your point; do you think something is not right
with the flush functions?


Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock

2022-11-14 Thread Amit Kapila
On Tue, Nov 8, 2022 at 3:07 PM Amit Kapila  wrote:
>
> On Mon, Nov 7, 2022 at 11:12 PM Robert Haas  wrote:
> >
> > On Mon, Oct 31, 2022 at 11:49 PM Amit Kapila  
> > wrote:
> > > I am fine with any of those. Would you like to commit or do you prefer
> > > me to take care of this?
> >
> > Sorry for not responding to this sooner. I think it's too late to do
> > anything about this for the current round of releases at this point,
> > but I am fine if you want to take care of it after that.
> >
>
> Okay, I'll take care of this either later this week after the release
> work is finished or early next week.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-11-14 Thread Amit Kapila
On Mon, Nov 14, 2022 at 2:28 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > I don't understand the reason for the below change in the patch:
> >
> > + /*
> > + * If this subscription has been disabled and it has an apply
> > + * delay set, wake up the logical replication worker to finish
> > + * it as soon as possible.
> > + */
> > + if (!opts.enabled && sub->applydelay > 0)
> > + logicalrep_worker_wakeup(sub->oid, InvalidOid);
> > +
> >
> > It seems to me Kuroda-San has proposed this change [1] to fix the test
> > but it is not clear to me why such a change is required. Why can't
> > CHECK_FOR_INTERRUPTS() after waiting, followed by the existing below
> > code [2] in LogicalRepApplyLoop() sufficient to handle parameter
> > updates?
> >
> > [2]
> > if (!in_remote_transaction && !in_streamed_transaction)
> > {
> > /*
> > * If we didn't get any transactions for a while there might be
> > * unconsumed invalidation messages in the queue, consume them
> > * now.
> > */
> > AcceptInvalidationMessages();
> > maybe_reread_subscription();
> > ...
>
> I mentioned the case with a long min_apply_delay configuration.
>
> The worker will exit normally if apply_delay() has been ended and then it can 
> reach
> LogicalRepApplyLoop(). It works well if the delay is short and workers can 
> wake up
> immediately. But if workers have long min_apply_delay, they cannot go out the
> while-loop, so worker processes remain for a long time. According to test 
> code,
> it is determined that worker should die immediately and we have a
> test-case that we try to kill the worker with  min_apply_delay = 1 day.
>

So, why only honor the 'disable' option of the subscription? For
example, one can change 'min_apply_delay' and it seems
recoveryApplyDelay() honors a similar change in the recovery
parameter. Is there a way to set the latch of the worker process, so
that it can recheck if anything is changed?

-- 
With Regards,
Amit Kapila.



