date:20240410

Re: Incorrect handling of IS [NOT] NULL quals on inheritance parents

2024-04-10 Thread Richard Guo

On Wed, Apr 10, 2024 at 1:13 PM David Rowley  wrote:

> I looked at the patch and I don't think it's a good idea to skip
> recording NOT NULL constraints to fix based on the fact that it
> happens to result in this particular optimisation working correctly.
> It seems that just makes this work in favour of possibly being wrong
> for some future optimisation where we have something else that looks
> at the RelOptInfo.notnullattnums and makes some decision that assumes
> the lack of corresponding notnullattnums member means the column is
> NULLable.

Hmm, I have thought about your point, but I may have a different
perspective.  I think the oversight discussed here occurred because we
mistakenly recorded NOT NULL columns that are actually nullable for
traditional inheritance parents.  Take the query from my first email as
an example.  There are three RTEs: p(inh), p(non-inh) and c(non-inh).
And we've added a NOT NULL constraint on the column a of 'p' but not of
'c'.  So it seems to me that while we can mark column a of p(non-inh) as
non-nullable, we cannot mark column a of p(inh) as non-nullable, because
there might be NULL values in 'c' and that makes column a of p(inh)
nullable.

And I think recording NOT NULL columns for traditional inheritance
parents can be error-prone for some future optimization where we look
at an inheritance parent's notnullattnums and make decisions based on
the assumption that the included columns are non-nullable.  The issue
discussed here serves as an example of this potential problem.

Thanks
Richard

Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-10 Thread Peter Eisentraut


On 06.04.24 19:47, Daniel Gustafsson wrote:

In bumping we want to move to 1.1.1 since that's the first version with the
rewritten RNG which is fork-safe by design, something PostgreSQL clearly
benefits from.


I think it might be better to separate this into two steps:

1. Move to 1.1.0.  This is an API update.  Change OPENSSL_API_COMPAT, 
and remove a bunch of code that no longer needs to be conditional.  We 
could check for a representative function like OPENSSL_init_ssl() in 
configure/meson, or we could just let the compilation fail with older 
versions.


2. Move to 1.1.1.  I understand this has to do with the fork-safety of 
pg_strong_random(), and it's not an API change but a behavior change. 
Let's make this association clearer in the code.  For example, add a 
version check or assertion about this into pg_strong_random() itself.


I don't know how LibreSSL interacts with either of these two points. 
That's something that could be clearer.


Some more detailed review on the v6 patch:

* doc/src/sgml/libpq.sgml

This small documentation patch could be committed forthwith.

* src/backend/libpq/be-secure-openssl.c

+#include 

This patch doesn't appear to add anything, so why does it need a new 
include?


Could the additions of SSL_OP_NO_CLIENT_RENEGOTIATION and 
SSL_R_VERSION_TOO_LOW be separate patches?


* src/common/hmac_openssl.c

There appears to be some unrelated refactoring happening here?

* src/include/common/openssl.h

Is the comment no longer applicable to OpenSSL, only to LibreSSL?

* src/port/pg_strong_random.c

I would prefer to remove pg_strong_random_init() if it's no longer 
useful.  I mean, if we leave it as is, and we are not removing any 
callers, then we are effectively continuing to support OpenSSL <1.1.1, 
right?

some confusion about parallel insert select in postgres parallel dml develop

2024-04-10 Thread jiye

Dears,


this is a developer of postgresql and currently engaged in research of  
parallel dml. I read lot of mail list about postgres parallel insert select or 
other related developing jobs, 
but i could not understand why should we determine parallel-safety of partition 
relations in parallel dml even just allow parallel query in insert select 
statement , 
i consider that modify job do in leader process and partition exprs check also 
in leader node, isn't that safty enough. in the other words, if we only build a 
parallel query planner in insert select and forbit parallel insert jobs, should 
we skip any parallel safty check about target relation and build parallel query 
plan directly ?


could anyone give me some suggestion about this confusions.


Best Regards

Tomas Ji












| |
jiye
|
|
jiye...@126.com
|

Re: Is this a problem in GenericXLogFinish()?

2024-04-10 Thread Michael Paquier

On Tue, Apr 09, 2024 at 10:25:36AM +, Hayato Kuroda (Fujitsu) wrote:
>> On Fri, Apr 05, 2024 at 06:22:58AM +, Hayato Kuroda (Fujitsu) wrote:
>> I'm slightly annoyed by the fact that there is no check on
>> is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
>> show the symmetry between the insert and replay paths.  Shouldn't
>> there be at least an assert for that in the branch when there are no
>> tuples (imagine an else to cover xldata->ntups == 0)?  I mean between
>> just before updating the hash page's opaque area when
>> is_prev_bucket_same_wrt.
> 
> Indeed, added an Assert() in else part. Was it same as your expectation?

Yep, close enough.  Thanks to the insert path we know that there will
be no tuples if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt),
and the replay path where the assertion is added.

>> I've been thinking about ways to make such cases detectable in an
>> automated fashion.  The best choice would be
>> verifyBackupPageConsistency(), just after RestoreBlockImage() on the
>> copy of the block image stored in WAL before applying the page masking
>> which would mask the LSN.  A problem, unfortunately, is that we would
>> need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
>> flag so we would be able to track if the block rebuilt from the record
>> has the *same* LSN as the copy used for the consistency check.  So
>> this edge consistency case would come at a cost, I am afraid, and the
>> 8 bits of bimg_info are precious :/
> 
> I could not decide that the change has more benefit than its cost, so I did
> nothing for it.

It does not prevent doing an implementation that can be used for some
test validation in the context of this thread.  Using this idea, I
have done the attached 0002.  This is not something to merge into the
tree, of course, just a toy to:
- Validate the fix for the problem we know.
- More braodly, check if there are other areas covered by the main
regression test suite if there are other problems.

And from what I can see, applying 0002 without 0001 causes the
following test to fail as the standby chokes on a inconsistent LSN
with a FATAL because the LSN of the apply page coming from the primary
and the LSN saved in the page replayed don't match.  Here is a command
to reproduce the problem:
cd src/test/recovery/ && \
  PG_TEST_EXTRA=wal_consistency_checking \
  PROVE_TESTS=t/027_stream_regress.pl make check

And then in the logs you'd see stuff like:
2024-04-10 16:52:21.844 JST [2437] FATAL:  inconsistent page LSN
replayed 0/A7E5CD18 primary 0/A7E56348
2024-04-10 16:52:21.844 JST [2437] CONTEXT:  WAL redo at 0/A7E59F68
for Hash/SQUEEZE_PAGE: prevblkno 28, nextblkno 4294967295, ntups 0,
is_primary T; blkref #1: rel 1663/16384/25434, blk 7 FPW; blkref #2:
rel 1663/16384/25434, blk 29 FPW; blkref #3: rel 1663/16384/25434, blk
28 FPW; blkref #5: rel 1663/16384/25434, blk 9 FPW; blkref #6: rel
1663/16384/25434, blk 0 FPW

I don't see other areas with a similar problem, at the extent of the
core regression tests, that is to say.  That's my way to say that your
patch looks good to me and that I'm planning to apply it to fix the
issue.

This shows me a more interesting issue unrelated to this thread:
027_stream_regress.pl would be stuck if the standby finds an
inconsistent page under wal_consistency_checking.  This needs to be
fixed before I'm able to create a buildfarm animal with this setup.
I'll spawn a new thread about that tomorrow.
--
Michael
From 4c6597f2ce57439696aecdbda08b374857ab715d Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 10 Apr 2024 16:47:29 +0900
Subject: [PATCH v3 1/2] Fix inconsistency with replay of hash squeeze record
 for clean buffers

Blah.
---
 src/backend/access/hash/hash_xlog.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index cb1a63cfee..d7ba74579b 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -666,6 +666,7 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		char	   *data;
 		Size		datalen;
 		uint16		ninserted = 0;
+		bool		mod_wbuf = false;
 
 		data = begin = XLogRecGetBlockData(record, 1, &datalen);
 
@@ -695,6 +696,17 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 
 ninserted++;
 			}
+
+			mod_wbuf = true;
+		}
+		else
+		{
+			/*
+			 * See _hash_freeovflpage() which has a similar assertion when
+			 * there are no tuples.
+			 */
+			Assert(xldata->is_prim_bucket_same_wrt ||
+   xldata->is_prev_bucket_same_wrt);
 		}
 
 		/*
@@ -711,10 +723,15 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 			HashPageOpaque writeopaque = HashPageGetOpaque(writepage);
 
 			writeopaque->hasho_nextblkno = xldata->nextblkno;
+			mod_wbuf = true;
 		}
 
-		PageSetLSN(writepage, lsn);
-		MarkBufferDirty(writebuf);
+		/* Set LSN and mark writebuf dirty iff it is modified */
+		if (mod_wbuf)
+		{
+			PageSetLSN(writep

Re: Make COPY format extendable: Extract COPY TO format implementations

2024-04-10 Thread Sutou Kouhei

Hi Andres,

Could you take a look at this? I think that you don't want
to touch the current text/csv/binary implementations. The
v17 patch approach doesn't touch the current text/csv/binary
implementations. What do you think about this approach?


Thanks,
-- 
kou

In <20240320.232732.488684985873786799@clear-code.com>
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Wed, 20 Mar 2024 23:27:32 +0900 (JST),
  Sutou Kouhei  wrote:

> Hi,
> 
> Could someone review the v17 patch to proceed this?
> 
> The v17 patch:
> https://www.postgresql.org/message-id/flat/20240305.171808.667980402249336456.kou%40clear-code.com#d2ee079b75ebcf00c410300ecc4a357a
> 
> Some profiles by Michael:
> https://www.postgresql.org/message-id/flat/ZelfYatRdVZN3FbE%40paquier.xyz#eccfd1a0131af93c48026d691cc247f4
> 
> Thanks,
> -- 
> kou
> 
> In <20240308.092254.359611633589181574@clear-code.com>
>   "Re: Make COPY format extendable: Extract COPY TO format implementations" 
> on Fri, 08 Mar 2024 09:22:54 +0900 (JST),
>   Sutou Kouhei  wrote:
> 
>> Hi,
>> 
>> In 
>>   "Re: Make COPY format extendable: Extract COPY TO format implementations" 
>> on Thu, 7 Mar 2024 15:32:01 +0900,
>>   Michael Paquier  wrote:
>> 
>>> While on it, here are some profiles based on HEAD and v17 with the
>>> previous tests (COPY TO /dev/null, COPY FROM data sent to the void).
>>> 
>> ...
>>> 
>>> So, in short, and that's not really a surprise, there is no effect
>>> once we use the dispatching with the routines only when a format would
>>> want to plug-in with the APIs, but a custom format would still have a
>>> penalty of a few percents for both if bottlenecked on CPU.
>> 
>> Thanks for sharing these profiles!
>> I agree with you.
>> 
>> This shows that the v17 approach doesn't affect the current
>> text/csv/binary implementations. (The v17 approach just adds
>> 2 new structs, Copy{From,To}Rountine, without changing the
>> current text/csv/binary implementations.)
>> 
>> Can we push the v17 patch and proceed following
>> implementations? Could someone (especially a PostgreSQL
>> committer) take a look at this for double-check?
>> 
>> 
>> Thanks,
>> -- 
>> kou
>> 
>> 
> 
>

Re: sql/json remaining issue

2024-04-10 Thread Amit Langote

On Tue, Apr 9, 2024 at 8:37 PM Amit Langote  wrote:
> On Tue, Apr 9, 2024 at 4:47 PM jian he  wrote:
> > the last query error message is:
> > `
> > ERROR:  no SQL/JSON item
> > `
> >
> > we are in ExecEvalJsonExprPath, can we output it to be:
> > `
> > ERROR:  after applying json_path "5s", no SQL/JSON item found
> > `
> > in a json_table query, we can have multiple path_expressions, like the
> > above query.
> > it's not easy to know applying which path_expression failed.
>
> Hmm, I'm not so sure about mentioning the details of the path because
> path names are optional and printing path expression itself is not a
> good idea.  Perhaps, we could mention the column name which would
> always be there, but we'd then need to add a new field column_name
> that's optionally set to JsonFuncExpr and JsonExpr, that is, when they
> are being set up for JSON_TABLE() columns.  As shown in the attached.
> With the patch you'll get:
>
> ERROR:  no SQL/JSON item found for column "b"

Attached is a bit more polished version of that, which also addresses
the error messages in JsonPathQuery() and JsonPathValue().  I noticed
that there was comment I had written at one point during JSON_TABLE()
hacking that said that we should be doing this.

I've also added an open item for this.

-- 
Thanks, Amit Langote


v2-0001-JSON_TABLE-mention-column-name-in-some-error-mess.patch
Description: Binary data

Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-10 Thread Alexander Lakhin


Hello Alexander and Dmitry,

10.04.2024 02:03, Alexander Korotkov wrote:

On Mon, Apr 8, 2024 at 11:43 PM Dmitry Koval  wrote:

Attached fix for the problems found by Alexander Lakhin.

About grammar errors.
Unfortunately, I don't know English well.
Therefore, I plan (in the coming days) to show the text to specialists
who perform technical translation of documentation.

Thank you.  I've pushed this fix with minor corrections from me.


Thank you for fixing that defect!

Please look at an error message emitted for foreign tables:
CREATE TABLE t (i int) PARTITION BY RANGE (i);
CREATE FOREIGN TABLE ftp_0_1 PARTITION OF t
  FOR VALUES FROM (0) TO (1)
  SERVER loopback OPTIONS (table_name 'lt_0_1');
CREATE FOREIGN TABLE ftp_1_2 PARTITION OF t
  FOR VALUES FROM (1) TO (2)
  SERVER loopback OPTIONS (table_name 'lt_1_2');
ALTER TABLE t MERGE PARTITIONS (ftp_0_1, ftp_1_2) INTO ftp_0_2;
ERROR:  "ftp_0_1" is not a table

Shouldn't it be more correct/precise?

Best regards,
Alexander

Re: Support a wildcard in backtrace_functions

2024-04-10 Thread Jelte Fennema-Nio

On Fri, 22 Mar 2024 at 11:09, Jelte Fennema-Nio  wrote:
> On Thu, 21 Mar 2024 at 15:41, Jelte Fennema-Nio  wrote:
> > Attached is a patch that implements this. Since the more I think about
> > it, the more I like this approach.
>
> I now added a 3rd patch to this patchset which changes the
> log_backtrace default to "internal", because it seems quite useful to
> me if user error reports of internal errors included a backtrace.

I think patch 0002 should be considered an Open Item for PG17. Since
it's proposing changing the name of the newly introduced
backtrace_on_internal_error GUC. If we want to change it in this way,
we should do it before the release and preferably before the beta.

I added it to the Open Items list[1] so we don't forget to at least
decide on this.

[1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items

Re: Speed up clean meson builds by ~25%

2024-04-10 Thread Thomas Munro

On Wed, Apr 10, 2024 at 5:03 PM Tom Lane  wrote:
> I don't doubt that there are other clang versions where the problem
> bites a lot harder.  What result do you get from the test I tried
> (turning mm_strdup into a no-op macro)?

#define mm_strdup(x) (x) does this:

Apple clang 15: master: 14s -> 9s
MacPorts clang 16, master: 170s -> 10s

Re: Can't find not null constraint, but \d+ shows that

2024-04-10 Thread jian he

another related bug, in master.

drop table if exists notnull_tbl1;
CREATE TABLE notnull_tbl1 (c0 int not null, c1 int);
ALTER TABLE notnull_tbl1 ADD CONSTRAINT Q PRIMARY KEY(c0, c1);
\d+ notnull_tbl1
ALTER TABLE notnull_tbl1 ALTER c0 DROP NOT NULL;
ALTER TABLE notnull_tbl1 ALTER c1 DROP NOT NULL;

"ALTER TABLE notnull_tbl1 ALTER c0 DROP NOT NULL;"
should fail?

I didn't investigate deep enough.

Re: Is this a problem in GenericXLogFinish()?

2024-04-10 Thread Amit Kapila

On Wed, Apr 10, 2024 at 1:27 PM Michael Paquier  wrote:
>
> On Tue, Apr 09, 2024 at 10:25:36AM +, Hayato Kuroda (Fujitsu) wrote:
> >> On Fri, Apr 05, 2024 at 06:22:58AM +, Hayato Kuroda (Fujitsu) wrote:
> >> I'm slightly annoyed by the fact that there is no check on
> >> is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
> >> show the symmetry between the insert and replay paths.  Shouldn't
> >> there be at least an assert for that in the branch when there are no
> >> tuples (imagine an else to cover xldata->ntups == 0)?  I mean between
> >> just before updating the hash page's opaque area when
> >> is_prev_bucket_same_wrt.
> >
> > Indeed, added an Assert() in else part. Was it same as your expectation?
>
> Yep, close enough.  Thanks to the insert path we know that there will
> be no tuples if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt),
> and the replay path where the assertion is added.
>

It is fine to have an assertion for this path.

+ else
+ {
+ /*
+ * See _hash_freeovflpage() which has a similar assertion when
+ * there are no tuples.
+ */
+ Assert(xldata->is_prim_bucket_same_wrt ||
+xldata->is_prev_bucket_same_wrt);

I can understand this comment as I am aware of this code but not sure
it would be equally easy for the people first time looking at this
code. One may try to find the equivalent assertion in
_hash_freeovflpage(). The alternative could be: "Ensure that the
required flags are set when there are no tuples. See
_hash_freeovflpage().". I am also fine if you prefer to go with your
proposed comment.

Otherwise, the patch looks good to me.

-- 
With Regards,
Amit Kapila.

Re: Detoasting optionally to make Explain-Analyze less misleading

2024-04-10 Thread Matthias van de Meent

On Wed, 3 Apr 2024 at 23:50, Tom Lane  wrote:
> I've pushed this after a good deal of cosmetic polishing -- for
> example, I spent some effort on making serializeAnalyzeReceive
> look as much like printtup as possible, in hopes of making it
> easier to keep the two functions in sync in future.

Upthread at [0], Stepan mentioned that we should default to SERIALIZE
when ANALYZE is enabled. I suspect a patch in that direction would
primarily contain updates in the test plan outputs, but I've not yet
worked on that.

Does anyone else have a strong opinion for or against adding SERIALIZE
to the default set of explain features enabled with ANALYZE?

I'll add this to "Decisions to Recheck Mid-Beta"-section if nobody has
a clear objection.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://postgr.es/m/ea885631-21f1-425a-97ed-c4bfb8cf9c63%40gmx.de

Re: Can't find not null constraint, but \d+ shows that

2024-04-10 Thread Tender Wang

jian he  于2024年4月10日周三 17:34写道：

>
> another related bug, in master.
>
> drop table if exists notnull_tbl1;
> CREATE TABLE notnull_tbl1 (c0 int not null, c1 int);
> ALTER TABLE notnull_tbl1 ADD CONSTRAINT Q PRIMARY KEY(c0, c1);
> \d+ notnull_tbl1
> ALTER TABLE notnull_tbl1 ALTER c0 DROP NOT NULL;
> ALTER TABLE notnull_tbl1 ALTER c1 DROP NOT NULL;
>
> "ALTER TABLE notnull_tbl1 ALTER c0 DROP NOT NULL;"
> should fail?
>

Yeah, it should fail as before, because c0 is primary key.
In master, although c0's pg_attribute.attnotnull is still true, but its
not-null constraint has been deleted
in dropconstraint_internal().

If we drop column c1 after dropping c0 not null, the primary key will be
dropped indirectly.
And now you can see c0 is still not-null if you do \d+ notnull_tbl1. But it
will report error "not found not-null"
if you alter c0 drop not null.

postgres=# ALTER TABLE notnull_tbl1 ALTER c0 DROP NOT NULL;
ALTER TABLE
postgres=# \d+ notnull_tbl1
  Table "public.notnull_tbl1"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression
| Stats target | Description
+-+---+--+-+-+-+--+-
 c0 | integer |   | not null | | plain   |
|  |
 c1 | integer |   | not null | | plain   |
|  |
Indexes:
"q" PRIMARY KEY, btree (c0, c1)
Access method: heap

postgres=# alter table notnull_tbl1 drop c1;
ALTER TABLE
postgres=# \d notnull_tbl1
Table "public.notnull_tbl1"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 c0 | integer |   | not null |

postgres=# alter table notnull_tbl1 alter c0 drop not null;
ERROR:  could not find not-null constraint on column "c0", relation
"notnull_tbl1"


-- 
Tender Wang
OpenPie:  https://en.openpie.com/

Re: Potential stack overflow in incremental base backup

2024-04-10 Thread Thomas Munro

On Fri, Mar 8, 2024 at 6:53 AM Robert Haas  wrote:
> ... We could
> avoid transposing relative block numbers to absolute block numbers
> whenever start_blkno is 0,  ...

Could we just write the blocks directly into the output array, and
then transpose them directly in place if start_blkno > 0?  See
attached.  I may be missing something, but the only downside I can
think of is that the output array is still clobbered even if we decide
to return BACK_UP_FILE_FULLY because of the 90% rule, but that just
requires a warning in the comment at the top.

v2-0001-Fix-potential-stack-overflow-in-incremental-baseb.patch
Description: Binary data

Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-04-10 Thread Peter Eisentraut


On 05.04.24 14:55, Robert Haas wrote:

I also wonder how the protocol negotiation for column encryption is
actually going to work. What are the actual wire protocol changes that
are needed? What does the server need to know from the client, or the
client from the server, about what is supported?


I have just posted an updated patch for that: [0]

The protocol changes can be inspected in the diffs for

doc/src/sgml/protocol.sgml
src/backend/access/common/printtup.c
src/interfaces/libpq/fe-protocol3.c

There are various changes, including new messages, additional fields in 
existing messages, and some more flag bits in existing fields.


It all works, so I don't have any requests or anything in this thread, 
but it would be good to get some feedback if I'm using this wrong. 
AFAICT, that patch was the first public one that ever tried to make use 
of the protocol extension facility, so I was mainly guessing about the 
intended way to use this.



[0]: 
https://www.postgresql.org/message-id/f63fe170-cef2-4914-be00-ef9222456505%40eisentraut.org

Re: Can't find not null constraint, but \d+ shows that

2024-04-10 Thread Alvaro Herrera

On 2024-Apr-10, Tender Wang wrote:

> Yeah, it should fail as before, because c0 is primary key.
> In master, although c0's pg_attribute.attnotnull is still true, but its
> not-null constraint has been deleted
> in dropconstraint_internal().

Yeah, the problem here is that we need to do the checks whenever the
constraints are dropped, either directly or indirectly ... but we cannot
do them in RemoveConstraintById, because if you elog(ERROR) there, it
won't let you use DROP TABLE (which will also arrive at
RemoveConstraintById):

55490 17devel 2220048=# drop table notnull_tbl2;
ERROR:  column "c0" of relation "notnull_tbl2" is an identity column

... which is of course kinda ridiculous, so this is not a viable
alternative.  The problem is that RemoveConstraintById doesn't have
sufficient context about the whole operation.  Worse: it cannot feed
its operations back into the alter table state.

I had a thought yesterday about making the resets of attnotnull and the
tests for replica identity and PKs to a separate ALTER TABLE pass,
independent of RemoveConstraintById (which would continue to be
responsible only for dropping the catalog row, as currently).
This may make the whole thing simpler.  I'm on it.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Hay que recordar que la existencia en el cosmos, y particularmente la
elaboración de civilizaciones dentro de él no son, por desgracia,
nada idílicas" (Ijon Tichy)

Re: Can't find not null constraint, but \d+ shows that

2024-04-10 Thread Alvaro Herrera

On 2024-Apr-10, jian he wrote:

> another related bug, in master.
> 
> drop table if exists notnull_tbl1;
> CREATE TABLE notnull_tbl1 (c0 int not null, c1 int);
> ALTER TABLE notnull_tbl1 ADD CONSTRAINT Q PRIMARY KEY(c0, c1);
> \d+ notnull_tbl1
> ALTER TABLE notnull_tbl1 ALTER c0 DROP NOT NULL;
> ALTER TABLE notnull_tbl1 ALTER c1 DROP NOT NULL;
> 
> "ALTER TABLE notnull_tbl1 ALTER c0 DROP NOT NULL;"
> should fail?

No, this should not fail, and it is working correctly in master.  You
can drop the not-null constraint, but the column will still be
non-nullable, because the primary key still exists.  If you drop the
primary key later, then the column becomes nullable.  This is by design.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"El miedo atento y previsor es la madre de la seguridad" (E. Burke)

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-10 Thread Amit Kapila

On Fri, Apr 5, 2024 at 4:59 PM Ajin Cherian  wrote:
>
> On Thu, Apr 4, 2024 at 4:38 PM Amit Kapila  wrote:
>>
>>
>> I think this would probably be better than the current situation but
>> can we think of a solution to allow toggling the value of two_phase
>> even when prepared transactions are present? Can you please summarize
>> the reason for the problems in doing that and the solutions, if any?
>>
>
>
> Updated the patch, as it wasn't addressing updating of two-phase in the 
> remote slot.
>

Vitaly, does the minimal solution provided by the proposed patch
(Allow to alter two_phase option of a subscriber provided no
uncommitted
prepared transactions are pending on that subscription.) address your use case?

>  Currently the main issue that needs to be handled is the handling of pending 
> prepared transactions while the two_phase is altered. I see 3 issues with the 
> current approach.
>
> 1. Uncommitted prepared transactions when toggling two_phase from true to 
> false
>   When two_phase was true, prepared transactions were decoded at PREPARE time 
> and send to the subscriber, which is then prepared on the subscriber with a 
> new gid. Once the two_phase is toggled to false, then the COMMIT PREPARED on 
> the publisher is converted to commit and the entire transaction is decoded 
> and sent to the subscriber. This will   leave the previously prepared 
> transaction pending.
>
> 2. Uncommitted prepared transactions when toggling two_phase form false to 
> true
>   When two_phase was false, prepared transactions were ignored and not 
> decoded at PREPARE time on the publisher. Once the two_phase is toggled to 
> true, the apply worker and the walsender are restarted and a replication is 
> restarted from a new "start_decoding_at" LSN. Now, this new 
> "start_decoding_at" could be past the LSN of the PREPARE record and if so, 
> the PREPARE record is skipped and not send to the subscriber. Look at 
> comments in DecodeTXNNeedSkip() for detail.  Later when the user issues 
> COMMIT PREPARED, this is decoded and sent to the subscriber. but there is no 
> prepared transaction on the subscriber, and this fails because the  
> corresponding gid of the transaction couldn't be found.
>
> 3. While altering the two_phase of the subscription, it is required to also 
> alter the two_phase field of the slot on the primary. The subscription cannot 
> remotely alter the two_phase option of the slot when the subscription is  
> enabled, as the slot is owned by the walsender on the publisher side.
>

Thanks for summarizing the reasons for not allowing altering the
two_pc property for a subscription.

> Possible solutions for the 3 problems:
>
> 1. While toggling two_phase from true to false, we could probably get a list 
> of prepared transactions for this subscriber id and rollback/abort the 
> prepared transactions. This will allow the transactions to be re-applied like 
> a normal transaction when the commit comes. Alternatively, if this isn't 
> appropriate doing it in the ALTER SUBSCRIPTION context, we could store the 
> xids of all prepared transactions of this subscription in a list and when the 
> corresponding xid is being committed by the apply worker, prior to commit, we 
> make sure the previously prepared transaction is rolled back. But this would 
> add the overhead of checking this list every time a transaction is committed 
> by the apply worker.
>

In the second solution, if you check at the time of commit whether
there exists a prior prepared transaction then won't we end up
applying the changes twice? I think we can first try to achieve it at
the time of Alter Subscription because the other solution can add
overhead at each commit?

> 2. No solution yet.
>

One naive idea is that on the publisher we can remember whether the
prepare has been sent and if so then only send commit_prepared,
otherwise send the entire transaction. On the subscriber-side, we
somehow, need to ensure before applying the first change whether the
corresponding transaction is already prepared and if so then skip the
changes and just perform the commit prepared. One drawback of this
approach is that after restart, the prepare flag wouldn't be saved in
the memory and we end up sending the entire transaction again. One way
to avoid this overhead is that the publisher before sending the entire
transaction checks with subscriber whether it has a prepared
transaction corresponding to the current commit. I understand that
this is not a good idea even if it works but I don't have any better
ideas. What do you think?

> 3. We could mandate that the altering of two_phase state only be done after 
> disabling the subscription, just like how it is handled for failover option.
>

makes sense.

-- 
With Regards,
Amit Kapila.

Re: postgres_fdw fails because GMT != UTC

2024-04-10 Thread Etsuro Fujita

On Thu, Apr 4, 2024 at 3:49 PM Laurenz Albe  wrote:
> On Thu, 2024-04-04 at 02:19 -0400, Tom Lane wrote:
> > > ERROR:  invalid value for parameter "TimeZone": "UTC"
> >
> > I am not quite clear on how broken an installation needs to be to
> > reject "UTC" as a time zone setting, except that the breakage cannot
> > be subtle.  However, I notice that our code in pgtz.c and other
> > places treats "GMT" as a hard-wired special case ... but not "UTC".
> > I wonder if we ought to modify those places to force "UTC" down the
> > same hard-wired paths.  If we acted like that, this would have worked
> > no matter how misconfigured the installation was.
> >
> > An alternative answer could be to change postgres_fdw to send "GMT"
> > not "UTC".  That's ugly from a standards-compliance viewpoint, but
> > it would fix this problem even with a non-updated remote server,
> > and I think postgres_fdw is generally intended to work with even
> > very old remote servers.
> >
> > Or we could do both.
>
> I think the first is desirable for reasons of general sanity, and the
> second for best compatibility with old versions.
>
> So I vote for "both".

+1 for both (assuming that the latter does not make the postgres_fdw
code complicated).

Best regards,
Etsuro Fujita

Revise some error messages in split partition code

2024-04-10 Thread Richard Guo

I noticed some error messages in the split partition code that are not
up to par.  Such as:

"new partitions not have value %s but split partition has"

how about we revise it to:

"new partitions do not have value %s but split partition does"

Another one is:

"any partition in the list should be DEFAULT because split partition is
DEFAULT"

how about we revise it to:

"all partitions in the list should be DEFAULT because split partition is
DEFAULT"

Another problem I noticed is that in the test files partition_split.sql
and partition_merge.sql, there are comments specifying the expected
error messages for certain test queries.  However, in some cases, the
error message mentioned in the comment does not match the error message
actually generated by the query.  Such as:

-- ERROR:  invalid partitions order, partition "sales_mar2022" can not be
merged
-- (space between sections sales_jan2022 and sales_mar2022)
ALTER TABLE sales_range MERGE PARTITIONS (sales_jan2022, sales_mar2022)
INTO sales_jan_mar2022;
ERROR:  lower bound of partition "sales_mar2022" conflicts with upper bound
of previous partition "sales_jan2022"

I'm not sure if it's a good practice to specify the expected error
message in the comment.  But if we choose to do so, I think we at least
need to ensure that the specified error message in the comment remains
consistent with the error message produced by the query.

Also there are some comments containing grammatical issues.  Such as:

-- no error: bounds of sales_noerror equals to lower and upper bounds of
sales_dec2022 and sales_feb2022

Attached is a patch to fix the issues I've observed.  I suspect there
may be more to be found.

Thanks
Richard


v1-0001-Revise-some-error-messages-in-split-partition-code.patch
Description: Binary data

Re: WIP Incremental JSON Parser

2024-04-10 Thread Andrew Dunstan



On 2024-04-09 Tu 15:42, Andrew Dunstan wrote:


On 2024-04-09 Tu 07:54, Andrew Dunstan wrote:



On 2024-04-09 Tu 01:23, Michael Paquier wrote:

On Tue, Apr 09, 2024 at 09:48:18AM +0900, Michael Paquier wrote:

There is no direct check on test_json_parser_perf.c, either, only a
custom rule in the Makefile without specifying something for meson.
So it looks like you could do short execution check in a TAP test, at
least.



Not adding a test for that was deliberate - any sane test takes a 
while, and I didn't want to spend that much time on it every time 
someone runs "make check-world" or equivalent. However, adding a test 
to run it with a trivial number of iterations seems reasonable, so 
I'll add that. I'll also add a meson target for the binary.




While reading the code, I've noticed a few things, as well:

+   /* max delicious line length is less than this */
+   char    buff[6001];

Delicious applies to the contents, nothing to do with this code even
if I'd like to think that these lines of code are edible and good.
Using a hardcoded limit for some JSON input sounds like a bad idea to
me even if this is a test module.  Comment that applies to both the
perf and the incremental tools.  You could use a #define'd buffer size
for readability rather than assuming this value in many places.



The comment is a remnant of when I hadn't yet added support for 
incomplete tokens, and so I had to parse the input line by line. I 
agree it can go, and we can use a manifest constant for the buffer size.




+++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c
+ * This progam tests incremental parsing of json. The input is fed 
into
+ * full range of incement handling, especially in the lexer, is 
exercised.

+++ b/src/test/modules/test_json_parser/test_json_parser_perf.c
+ *    Performancet est program for both flavors of the JSON parser
+ * This progam tests either the standard (recursive descent) JSON 
parser

+++ b/src/test/modules/test_json_parser/README
+  reads in a file and pases it in very small chunks (60 bytes at a 
time) to


Collection of typos across various files.



Will fix. (The older I get the more typos I seem to make and the 
harder it is to notice them. It's very annoying.)




+ appendStringInfoString(&json, "1+23 trailing junk");
What's the purpose here?  Perhaps the intention should be documented
in a comment?



The purpose is to ensure that if there is not a trailing '\0' on the 
json chunk the parser will still do the right thing. I'll add a 
comment to that effect.




At the end, having a way to generate JSON blobs randomly to test this
stuff would be more appealing than what you have currently, with
perhaps a few factors like:
- Array and simple object density.
- Max Level of nesting.
- Limitation to ASCII characters that can be escaped.
- Perhaps more things I cannot think about?



No, I disagree. Maybe we need those things as well, but we do need a 
static test where we can test the output against known results. I 
have no objection to changing the input and output files.


It's worth noting that t/002_inline.pl does generate some input and 
test e.g., the maximum nesting levels among other errors. Perhaps you 
missed that. If you think we need more tests there adding them would 
be extremely simple.




So the current state of things is kind of disappointing, and the size
of the data set added to the tree is not that portable either if you
want to test various scenarios depending on the data set.  It seems to
me that this has been committed too hastily and that this is not ready
for integration, even if that's just a test module.  Tom also has
shared some concerns upthread, as far as I can see.



I have posted a patch already that addresses the issue Tom raised.





Here's a consolidated set of cleanup patches, including the memory 
leak patch and Jacob's shrink-tiny patch.



Here's v2 of the cleanup patch 4, that fixes some more typos kindly 
pointed out to me by Alexander Lakhin.



cheers


andrew




--
Andrew Dunstan
EDB: https://www.enterprisedb.com
From 6b12a71b6b43354c0f897ac5feb1e53419a2c15f Mon Sep 17 00:00:00 2001
From: Andrew Dunstan 
Date: Tue, 9 Apr 2024 15:30:48 -0400
Subject: [PATCH v2] Assorted minor cleanups in the test_json_parser module

Per gripes from Michael Paquier

Discussion: https://postgr.es/m/zhtq6_w1vwohq...@paquier.xyz

Along the way, also clean up a handful of typos in 3311ea86ed and
ea7b4e9a2a, found by Alexander Lakhin.
---
 src/common/jsonapi.c  |  6 ++---
 src/common/parse_manifest.c   |  2 +-
 src/test/modules/test_json_parser/README  | 19 +++--
 .../test_json_parser_incremental.c| 27 ++-
 .../test_json_parser/test_json_parser_perf.c  | 11 
 5 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 9dfbc397c0..12fabcaccf 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jso

Comment about handling of asynchronous requests in postgres_fdw.c

2024-04-10 Thread Etsuro Fujita

Hi,

We updated $SUBJECT in back branches to make it clear (see commit
f6f61a4bd), so I would like to propose to do so in HEAD as well for
consistency.  Attached is a patch for that.

Best regards,
Etsuro Fujita


postgres-fdw-comment.patch
Description: Binary data

RE: Synchronizing slots from primary to standby

2024-04-10 Thread Zhijie Hou (Fujitsu)


On Thursday, April 4, 2024 5:37 PM Amit Kapila  wrote:
> 
> BTW, while thinking on this one, I
> noticed that in the function LogicalConfirmReceivedLocation(), we first update
> the disk copy, see comment [1] and then in-memory whereas the same is not
> true in
> update_local_synced_slot() for the case when snapshot exists. Now, do we have
> the same risk here in case of standby? Because I think we will use these xmins
> while sending the feedback message (in XLogWalRcvSendHSFeedback()).
>
> * We have to write the changed xmin to disk *before* we change
> * the in-memory value, otherwise after a crash we wouldn't know
> * that some catalog tuples might have been removed already.

Yes, I think we have the risk on the standby, I can reproduce the case that if
the server crashes after updating the in-memory value and before saving them to
disk, the synced slot could be invalidated after restarting from crash, because
the necessary rows have been removed on the primary. The steps can be found in
[1].

I think we'd better fix the order in update_local_synced_slot() as well. I
tried to make the fix in 0002, 0001 is Shveta's patch to fix another issue in 
this thread. Since
they are touching the same function, so attach them together for review.

[1]
-- Primary:
SELECT 'init' FROM pg_create_logical_replication_slot('logicalslot', 
'test_decoding', false, false, true);

-- Standby:
SELECT 'init' FROM pg_create_logical_replication_slot('standbylogicalslot', 
'test_decoding', false, false, false);
SELECT pg_sync_replication_slots();

-- Primary:
CREATE TABLE test (a int);
INSERT INTO test VALUES(1);
DROP TABLE test;

SELECT txid_current();
SELECT txid_current();
SELECT txid_current();
SELECT pg_log_standby_snapshot();

SELECT pg_replication_slot_advance('logicalslot', pg_current_wal_lsn());

-- Standby:
- wait for standby to replay all the changes on the primary.

- this is to serialize snapshots.
SELECT pg_replication_slot_advance('standbylogicalslot', 
pg_last_wal_replay_lsn());

- Use gdb to stop at the place after calling ReplicationSlotsComputexx()
  functions and before calling ReplicationSlotSave().
SELECT pg_sync_replication_slots();

-- Primary:

- First, wait for the primary slot(the physical slot)'s catalog xmin to be
  updated to the same as the failover slot.

VACUUM FULL;

- Wait for VACUMM FULL to be replayed on standby.

-- Standby:

- For the process which is blocked by gdb, let the process crash (elog(PANIC,
  ...)).

After restarting the standby from crash, we can see the synced slot is 
invalidated.

LOG:  invalidating obsolete replication slot "logicalslot"
DETAIL:  The slot conflicted with xid horizon 741.
CONTEXT:  WAL redo at 0/3059B90 for Heap2/PRUNE_ON_ACCESS: 
snapshotConflictHorizon: 741, isCatalogRel: T, nplans: 0, nredirected: 0, 
ndead: 7, nunused: 0, dead: [22, 23, 24, 25, 26, 27, 28]; blkref #0: rel 
1663/5/1249, blk 16


Best Regards,
Hou zj


v4-0002-write-the-changed-xmin-to-disk-before-computing-g.patch
Description:  v4-0002-write-the-changed-xmin-to-disk-before-computing-g.patch


v4-0001-Correct-sanity-check-to-compare-confirmed_lsn.patch
Description:  v4-0001-Correct-sanity-check-to-compare-confirmed_lsn.patch

Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-10 Thread Alexander Lakhin


10.04.2024 12:00, Alexander Lakhin wrote:

Hello Alexander and Dmitry,

10.04.2024 02:03, Alexander Korotkov wrote:

Thank you.  I've pushed this fix with minor corrections from me.




Please look at another anomaly with MERGE.

CREATE TEMP TABLE t (i int) PARTITION BY RANGE (i);
CREATE TABLE tp_0_2 PARTITION OF t
  FOR VALUES FROM (0) TO (2);
fails with
ERROR:  cannot create a permanent relation as partition of temporary relation 
"t"

But
CREATE TEMP TABLE t (i int) PARTITION BY RANGE (i);
CREATE TEMP TABLE tp_0_1 PARTITION OF t
  FOR VALUES FROM (0) TO (1);
CREATE TEMP TABLE tp_1_2 PARTITION OF t
  FOR VALUES FROM (1) TO (2);
ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2;
succeeds and we get:
regression=# \d+ t*
    Partitioned table "pg_temp_1.t"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | 
Stats target | Description
+-+---+--+-+-+-+--+-
 i  | integer |   |  | | plain | |  
    |
Partition key: RANGE (i)
Partitions: tp_0_2 FOR VALUES FROM (0) TO (2)

 Table "public.tp_0_2"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | 
Stats target | Description
+-+---+--+-+-+-+--+-
 i  | integer |   |  | | plain | |  
    |
Partition of: t FOR VALUES FROM (0) TO (2)
Partition constraint: ((i IS NOT NULL) AND (i >= 0) AND (i < 2))

Best regards,
Alexander

Re: Potential stack overflow in incremental base backup

2024-04-10 Thread Robert Haas

On Wed, Apr 10, 2024 at 6:21 AM Thomas Munro  wrote:
> Could we just write the blocks directly into the output array, and
> then transpose them directly in place if start_blkno > 0?  See
> attached.  I may be missing something, but the only downside I can
> think of is that the output array is still clobbered even if we decide
> to return BACK_UP_FILE_FULLY because of the 90% rule, but that just
> requires a warning in the comment at the top.

Yeah. This approach makes the name "relative_block_numbers" a bit
confusing, but not running out of memory is nice, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Table AM Interface Enhancements

2024-04-10 Thread Alexander Korotkov

On Mon, Apr 8, 2024 at 9:54 PM Robert Haas  wrote:
> On Mon, Apr 8, 2024 at 12:33 PM Alexander Korotkov  
> wrote:
> > Yes, it was my mistake. I got rushing trying to fit this to FF, even doing 
> > significant changes just before commit.
> > I'll revert this later today.

The patch to revert is attached.  Given that revert touches the work
done in 041b96802e, I think it needs some feedback before push.

> Alexander,
>
> Exactly how much is getting reverted here? I see these, all since March 23rd:
>
> dd1f6b0c17 Provide a way block-level table AMs could re-use
> acquire_sample_rows()
> 9bd99f4c26 Custom reloptions for table AM
> 97ce821e3e Fix the parameters order for
> TableAmRoutine.relation_copy_for_cluster()
> 867cc7b6dd Revert "Custom reloptions for table AM"
> b1484a3f19 Let table AM insertion methods control index insertion
> c95c25f9af Custom reloptions for table AM
> 27bc1772fc Generalize relation analyze in table AM interface
> 87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()
> c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot
> 02eb07ea89 Allow table AM to store complex data structures in rd_amcache
>
> I'm not really feeling very good about all of this, because:
>
> - 87985cc925 was previously committed as 11470f544e on March 23, 2023,
> and almost immediately reverted. Now you tried again on March 26,
> 2024. I know there was a bunch of rework in the middle, but there are
> times in the year that things can be committed other than right before
> the feature freeze. Like, don't wait a whole year for the next attempt
> and then again do it right before the cutoff.

I agree with the facts.  But I have a different interpretation on
this.  The patch was committed as 11470f544e on March 23, 2023, then
reverted on April 3.  I've proposed the revised version, but Andres
complained that this is the new API design days before FF.  Then the
patch with this design was published in the thread for the year with
periodical rebases.  So, I think I expressed my intention with that
design before 2023 FF, nobody prevented me from expressing objections
or other feedback during the year.  Then I realized that 2024 FF is
approaching and decided to give this another try for pg18.

But I don't yet see it's wrong with this patch.  I waited a year for
feedback.  I waited 2 days after saying "I will push this if no
objections". Given your feedback now, I get that it would be better to
do another attempt to commit this earlier.

I admit my mistake with dd1f6b0c17.  I get rushed trying to fix the
things actually making things worse.  I apologise for this.  But if
I'm forced to revert 87985cc925 without even hearing any reasonable
critics besides imperfection of timing, I feel like this is the
punishment for my mistake with dd1f6b0c17.  Pretty unreasonable
punishment in my view.

> - The Discussion links in the commit messages do not seem to stand for
> the proposition that these particular patches ought to be committed in
> this form. Some of them are just links to the messages where the patch
> was originally posted, which is probably not against policy or
> anything, but it'd be nicer to see links to versions of the patch with
> which people are, in nearby emails, agreeing. Even worse, some of
> these are links to emails where somebody said, "hey, some earlier
> commit does not look good." In particular,
> dd1f6b0c172a643a73d6b71259fa2d10378b39eb has a discussion link where
> Andres complains about 27bc1772fc814946918a5ac8ccb9b5c5ad0380aa, but
> it's not clear how that justifies the new commit.

I have to repeat again, that I admit my mistake with dd1f6b0c17,
apologize for that, and make my own conclusions to not repeat this.
But dd1f6b0c17 seems to be the only one that has a link to the message
with complains.  I went through the list of commits above, it seems
that others have just linked to the first message of the thread.
Probably, there is a lack of consensus for some of them.  But I never
heard about a policy to link not just the discussion start, but also
exact messages expressing agreeing.  And I didn't see others doing
that.

> - The commit message for 867cc7b6dd says "This reverts commit
> c95c25f9af4bc77f2f66a587735c50da08c12b37 due to multiple design issues
> spotted after commit." That's not a very good justification for then
> trying again 6 days later with 9bd99f4c26, and it's *definitely* not a
> good justification for there being no meaningful discussion links in
> the commit message for 9bd99f4c26. They're just the same links you had
> in the previous attempt, so it's pretty hard for anybody to understand
> what got fixed and whether all of the concerns were really addressed.
> Just looking over the commit, it's pretty hard to understand what is
> being changed and why: there's not a lot of comment updates, there's
> no documentation changes, and there's not a lot of explanation in the
> commit message either. Even if this feature is great and all

Re: [PoC/RFC] Multiple passwords, interval expirations

2024-04-10 Thread Kirill Reshke

Hi!

I'm interested in this feature presence in the PostgreSQL core. Will
try to provide valuable review/comments/suggestions and other help.

On Tue, 10 Oct 2023 at 16:17, Gurjeet Singh  wrote:
>
> > On Mon, Oct 9, 2023 at 2:31 AM Gurjeet Singh  wrote:
> > >
> > > Next steps:
> > > - Break the patch into a series of smaller patches.
> > > - Add TAP tests (test the ability to actually login with these passwords)
> > > - Add/update documentation
> > > - Add more regression tests
>
> Please see attached the v4 of the patchset that introduces the notion
> of named passwords slots, namely 'first' and 'second' passwords, and
> allows users to address each of these passwords separately for the
> purposes of adding, dropping, or assigning expiration times.
>
> Apart from the changes described by each patch's commit title, one
> significant change since v3 is that now (included in v4-0002...patch)
> it is not allowed for a role to have a mix of a types of passwords.
> When adding a password, the patch ensures that the password being
> added uses the same hashing algorithm (md5 or scram-sha-256) as the
> existing password, if any.  Having all passwords of the same type
> helps the server pick the corresponding authentication method during
> connection attempt.
>
> The v3 patch also had a few bugs that were exposed by cfbot's
> automatic run. All those bugs have now been fixed, and the latest run
> on the v4 branch [1] on my private Git repo shows a clean run [1].
>
> The list of patches, and their commit titles are as follows:
>
> > v4-0001-...patch Add new columns to pg_authid
> > v4-0002-...patch Update password verification infrastructure to handle two 
> > passwords
> > v4-0003-...patch Added SQL support for ALTER ROLE to manage two passwords
> > v4-0004-...patch Updated pg_dumpall to support exporting a role's second 
> > password
> > v4-0005-...patch Update system views pg_roles and pg_shadow
> > v4-0006-...patch Updated pg_authid catalog documentation
> > v4-0007-...patch Updated psql's describe-roles meta-command
> > v4-0008-...patch Added documentation for ALTER ROLE command
> > v4-0009-...patch Added TAP tests to prove that a role can use two passwords 
> > to login
> > v4-0010-...patch pgindent run
> > v4-0011-...patch Run pgperltidy on files changed by this patchset
>
> Running pgperltidy updated many perl files unrelated to this patch, so
> in the last patch I chose to include only the one perl file that is
> affected by this patchset.
>
> [1]: password_rollover_v4 (910f81be54)
> https://github.com/gurjeet/postgres/commits/password_rollover_v4
>
> [2]: https://cirrus-ci.com/build/4675613999497216
>
> Best regards,
> Gurjeet
> http://Gurje.et


Latest attachment does not apply to HEAD anymore.  I have rebased
them. While rebasing, a couple of minor changes were done:

1) Little correction in the `plain_crypt_verify` comment. IMO this
sounds a little better and comprehensible, is it?

> - * 'shadow_pass' is the user's correct password hash, as stored in
> - * pg_authid's rolpassword or rolsecondpassword.
> + * 'shadow_pass' is one of the user's correct password hashes, as stored in
> + * pg_authid's.

2) in v4-0004:

>/* note: rolconfig is dumped later */
> -   if (server_version >= 90600)
> +   if (server_version >= 17)
>printfPQExpBuffer(buf,
>  "SELECT oid, rolname, 
> rolsuper, rolinherit, "
>  "rolcreaterole, rolcreatedb, 
> "
> - "rolcanlogin, rolconnlimit, 
> rolpassword, "
> - "rolvaliduntil, 
> rolreplication, rolbypassrls, "
> + "rolcanlogin, rolconnlimit, 
> "
> + "rolpassword, 
> rolvaliduntil, "
> + "rolsecondpassword, 
> rolsecondvaliduntil, "
> + "rolreplication, 
> rolbypassrls, "
> + 
> "pg_catalog.shobj_description(oid, '%s') as rolcomment, "
> + "rolname = current_user AS 
> is_current_user "
> + "FROM %s "
> + "WHERE rolname !~ '^pg_' "
> + "ORDER BY 2", role_catalog, 
> role_catalog);
> +   else if (server_version >= 90600)
> +   printfPQExpBuffer(buf,
> + "SELECT oid, rolname, 
> rolsuper, rolinherit, "
> + "rolcreaterole, 
> rolcreatedb, "
> + "rolcanlogin, rolconnlimit, 
> "
> + "rolpassword, 
> rolvaliduntil, "
> +

Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-10 Thread Jacob Champion

On Wed, Apr 10, 2024 at 12:31 AM Peter Eisentraut  wrote:
> * src/backend/libpq/be-secure-openssl.c
>
> +#include 
>
> This patch doesn't appear to add anything, so why does it need a new
> include?

This one was mine -- it was an indirect header dependency that was
effectively removed in 1.1.0 and later, due to the bump to
OPENSSL_API_COMPAT [1]. We have to depend on it directly now.

--Jacob

[1] 
https://github.com/openssl/openssl/blob/b372b1f764/include/openssl/dh.h#L20-L22

Re: Table AM Interface Enhancements

2024-04-10 Thread Robert Haas

On Wed, Apr 10, 2024 at 8:20 AM Alexander Korotkov  wrote:
> I agree with the facts.  But I have a different interpretation on
> this.  The patch was committed as 11470f544e on March 23, 2023, then
> reverted on April 3.  I've proposed the revised version, but Andres
> complained that this is the new API design days before FF.

Well, his first complaint that your committed patch was full of bugs:

https://www.postgresql.org/message-id/20230323003003.plgaxjqahjgkuxrk%40awork3.anarazel.de

When you commit a patch and another committer writes a post-commit
review saying that your patch has so many serious problems that he
gave up on reviewing before enumerating all of them, that's a really
bad sign. That should be a strong signal to you to step back and take
a close look at whether you really understand the area of the code
that you're touching well enough to be doing whatever it is that
you're doing. If I got a review like that, I would have reverted the
patch instantly, given up for the release cycle, possibly given up on
the patch permanently, and most definitely not tried again to commit
unless I was absolutely certain that I'd learned a lot in the meantime
*and* had the agreement of the committer who wrote that review (or
maybe some other committer who was acknowledged as an expert in that
area of the code).

What you did instead is try to do a bunch of post-commit fixup in a
desperate rush right before feature freeze, to which Andres
understandably objected. But that was your second mistake, not your
first one.

> Then the
> patch with this design was published in the thread for the year with
> periodical rebases.  So, I think I expressed my intention with that
> design before 2023 FF, nobody prevented me from expressing objections
> or other feedback during the year.  Then I realized that 2024 FF is
> approaching and decided to give this another try for pg18.

This doesn't seem to match the facts as I understand them. It appears
to me that there was no activity on the thread from April until
November. The message in November was not written by you. Your first
post to the thread after April of 2023 was on March 19, 2024. Five
days later you said you wanted to commit. That doesn't look to me like
you worked diligently on the patch set throughout the year and other
people had reasonable notice that you planned to get the work done
this cycle. It looks like you ignored the patch for 11 months and then
committed it without any real further feedback from anyone. True,
Pavel did post and say that he thought the patches were in good shape.
But you could hardly take that as evidence that Andres was now content
that the problems he'd raised earlier had been fixed, because (a)
Pavel had also been involved beforehand and had not raised the
concerns that Andres later raised and (b) Pavel wrote nothing in his
email specifically about why he thought your changes or his had
resolved those concerns. I certainly agree that Andres doesn't always
give as much review feedback as I'd like to have from him in, and it's
also true that he doesn't always give that feedback as quickly as I'd
like to have it ... but you know what?

It's not Andres's job to make sure my patches are not broken. It's my
job. That applies to the patches I write, and the patches written by
other people that I commit. If I commit something and it turns out
that it is broken, that's my bad. If I commit something and it turns
out that it does not have consensus, that is also my bad. It is not
the fault of the other people for not helping me get my patches to a
state where they are up to project standard. It is my fault, and my
fault alone, for committing something that was not ready. Now that
does not mean that it isn't frustrating when I can't get the help I
need. It is extremely frustrating. But the solution is not to commit
anyway and then blame the other people for not providing feedback.

I mean, committing without explicit agreement from someone else is OK
if you're pretty sure that you've got everything sorted out correctly.
But I don't think that the paper trail here supports the narrative
that you worked on this diligently throughout the year and had every
reason to believe it would be acceptable to the community. If I'd
looked at this thread, I would have concluded that you'd abandoned the
project. I would have expected that, when you picked it up again,
there would be a series of emails over a period of time carefully
working through the various issues that had been raised, inviting
specific commentary on specific discussion points, and generally
refining the work, and then maybe a suggestion of a commit at the end.
I would not have expected an email or two basically saying "well,
seems like it's all fixed now," followed by a commit.

> Do you propose a ban from March 1 to the end of any year?  I think the
> first doesn't make sense, because it leaves only 2 months a year for
> the work.  That would create a potential rush during these 2 month

Re: Revise some error messages in split partition code

2024-04-10 Thread Tender Wang

Richard Guo  于2024年4月10日周三 19:32写道：

> I noticed some error messages in the split partition code that are not
> up to par.  Such as:
>
> "new partitions not have value %s but split partition has"
>
> how about we revise it to:
>
> "new partitions do not have value %s but split partition does"
>
> Another one is:
>
> "any partition in the list should be DEFAULT because split partition is
> DEFAULT"
>
> how about we revise it to:
>
> "all partitions in the list should be DEFAULT because split partition is
> DEFAULT"
>
> Another problem I noticed is that in the test files partition_split.sql
> and partition_merge.sql, there are comments specifying the expected
> error messages for certain test queries.  However, in some cases, the
> error message mentioned in the comment does not match the error message
> actually generated by the query.  Such as:
>
> -- ERROR:  invalid partitions order, partition "sales_mar2022" can not be
> merged
> -- (space between sections sales_jan2022 and sales_mar2022)
> ALTER TABLE sales_range MERGE PARTITIONS (sales_jan2022, sales_mar2022)
> INTO sales_jan_mar2022;
> ERROR:  lower bound of partition "sales_mar2022" conflicts with upper
> bound of previous partition "sales_jan2022"
>
> I'm not sure if it's a good practice to specify the expected error
> message in the comment.  But if we choose to do so, I think we at least
> need to ensure that the specified error message in the comment remains
> consistent with the error message produced by the query.
>
> Also there are some comments containing grammatical issues.  Such as:
>
> -- no error: bounds of sales_noerror equals to lower and upper bounds of
> sales_dec2022 and sales_feb2022
>
> Attached is a patch to fix the issues I've observed.  I suspect there
> may be more to be found.
>

Yeah. The day before yesterday I found some grammer errors from error
messages and code comments [1] .
Except  those issues, @Alexander Lakhin   has found
some bugs [2]
I have some concerns that whether this patch is ready to commit.


[1]
https://www.postgresql.org/message-id/CAHewXNkGMPU50QG7V6Q60JGFORfo8LfYO1_GCkCa0VWbmB-fEw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/dbc8b96c-3cf0-d1ee-860d-0e491da20485%40gmail.com

-- 
Tender Wang
OpenPie:  https://en.openpie.com/

Re: Table AM Interface Enhancements

2024-04-10 Thread Pavel Borisov

Hi, Alexander!

On Wed, 10 Apr 2024 at 16:20, Alexander Korotkov 
wrote:

> On Mon, Apr 8, 2024 at 9:54 PM Robert Haas  wrote:
> > On Mon, Apr 8, 2024 at 12:33 PM Alexander Korotkov 
> wrote:
> > > Yes, it was my mistake. I got rushing trying to fit this to FF, even
> doing significant changes just before commit.
> > > I'll revert this later today.
>
> The patch to revert is attached.  Given that revert touches the work
> done in 041b96802e, I think it needs some feedback before push.
>
> > Alexander,
> >
> > Exactly how much is getting reverted here? I see these, all since March
> 23rd:
> >
> > dd1f6b0c17 Provide a way block-level table AMs could re-use
> > acquire_sample_rows()
> > 9bd99f4c26 Custom reloptions for table AM
> > 97ce821e3e Fix the parameters order for
> > TableAmRoutine.relation_copy_for_cluster()
> > 867cc7b6dd Revert "Custom reloptions for table AM"
> > b1484a3f19 Let table AM insertion methods control index insertion
> > c95c25f9af Custom reloptions for table AM
> > 27bc1772fc Generalize relation analyze in table AM interface
> > 87985cc925 Allow locking updated tuples in tuple_update() and
> tuple_delete()
> > c35a3fb5e0 Allow table AM tuple_insert() method to return the different
> slot
> > 02eb07ea89 Allow table AM to store complex data structures in rd_amcache
> >
> > I'm not really feeling very good about all of this, because:
> >
> > - 87985cc925 was previously committed as 11470f544e on March 23, 2023,
> > and almost immediately reverted. Now you tried again on March 26,
> > 2024. I know there was a bunch of rework in the middle, but there are
> > times in the year that things can be committed other than right before
> > the feature freeze. Like, don't wait a whole year for the next attempt
> > and then again do it right before the cutoff.
>
> I agree with the facts.  But I have a different interpretation on
> this.  The patch was committed as 11470f544e on March 23, 2023, then
> reverted on April 3.  I've proposed the revised version, but Andres
> complained that this is the new API design days before FF.  Then the
> patch with this design was published in the thread for the year with
> periodical rebases.  So, I think I expressed my intention with that
> design before 2023 FF, nobody prevented me from expressing objections
> or other feedback during the year.  Then I realized that 2024 FF is
> approaching and decided to give this another try for pg18.
>
> But I don't yet see it's wrong with this patch.  I waited a year for
> feedback.  I waited 2 days after saying "I will push this if no
> objections". Given your feedback now, I get that it would be better to
> do another attempt to commit this earlier.
>
> I admit my mistake with dd1f6b0c17.  I get rushed trying to fix the
> things actually making things worse.  I apologise for this.  But if
> I'm forced to revert 87985cc925 without even hearing any reasonable
> critics besides imperfection of timing, I feel like this is the
> punishment for my mistake with dd1f6b0c17.  Pretty unreasonable
> punishment in my view.
>
> > - The Discussion links in the commit messages do not seem to stand for
> > the proposition that these particular patches ought to be committed in
> > this form. Some of them are just links to the messages where the patch
> > was originally posted, which is probably not against policy or
> > anything, but it'd be nicer to see links to versions of the patch with
> > which people are, in nearby emails, agreeing. Even worse, some of
> > these are links to emails where somebody said, "hey, some earlier
> > commit does not look good." In particular,
> > dd1f6b0c172a643a73d6b71259fa2d10378b39eb has a discussion link where
> > Andres complains about 27bc1772fc814946918a5ac8ccb9b5c5ad0380aa, but
> > it's not clear how that justifies the new commit.
>
> I have to repeat again, that I admit my mistake with dd1f6b0c17,
> apologize for that, and make my own conclusions to not repeat this.
> But dd1f6b0c17 seems to be the only one that has a link to the message
> with complains.  I went through the list of commits above, it seems
> that others have just linked to the first message of the thread.
> Probably, there is a lack of consensus for some of them.  But I never
> heard about a policy to link not just the discussion start, but also
> exact messages expressing agreeing.  And I didn't see others doing
> that.
>
> > - The commit message for 867cc7b6dd says "This reverts commit
> > c95c25f9af4bc77f2f66a587735c50da08c12b37 due to multiple design issues
> > spotted after commit." That's not a very good justification for then
> > trying again 6 days later with 9bd99f4c26, and it's *definitely* not a
> > good justification for there being no meaningful discussion links in
> > the commit message for 9bd99f4c26. They're just the same links you had
> > in the previous attempt, so it's pretty hard for anybody to understand
> > what got fixed and whether all of the concerns were really addressed.
> > Just looking over the commit, it'

Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-10 Thread Robert Haas

On Tue, Apr 9, 2024 at 5:16 PM Bruce Momjian  wrote:
> Committing code is a hard job, no question.  However, committers have to
> give up the idea that they should wait for brilliant ideas before
> finalizing patches.  If you come up with a better idea later, great, but
> don't wait to finalize patches.

Completely agreed. If your ideas are too bad, you should just give up.
But if they're basically fine and you're just waiting for inspiration
to strike from above, you might as well get on with it. If it turns
out that you've broken something, well, that sucks, but it's still
better to get that out of the way in January ... or November ... or
August ... rather than waiting until March or April.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Table AM Interface Enhancements

2024-04-10 Thread Joe Conway


On 4/10/24 09:19, Robert Haas wrote:

When you commit a patch and another committer writes a post-commit
review saying that your patch has so many serious problems that he
gave up on reviewing before enumerating all of them, that's a really
bad sign. That should be a strong signal to you to step back and take
a close look at whether you really understand the area of the code
that you're touching well enough to be doing whatever it is that
you're doing. If I got a review like that, I would have reverted the
patch instantly, given up for the release cycle, possibly given up on
the patch permanently, and most definitely not tried again to commit
unless I was absolutely certain that I'd learned a lot in the meantime
*and* had the agreement of the committer who wrote that review (or
maybe some other committer who was acknowledged as an expert in that
area of the code).





It's not Andres's job to make sure my patches are not broken. It's my
job. That applies to the patches I write, and the patches written by
other people that I commit. If I commit something and it turns out
that it is broken, that's my bad. If I commit something and it turns
out that it does not have consensus, that is also my bad. It is not
the fault of the other people for not helping me get my patches to a
state where they are up to project standard. It is my fault, and my
fault alone, for committing something that was not ready. Now that
does not mean that it isn't frustrating when I can't get the help I
need. It is extremely frustrating. But the solution is not to commit
anyway and then blame the other people for not providing feedback.


+many

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Detoasting optionally to make Explain-Analyze less misleading

2024-04-10 Thread Tom Lane

Matthias van de Meent  writes:
> Upthread at [0], Stepan mentioned that we should default to SERIALIZE
> when ANALYZE is enabled. I suspect a patch in that direction would
> primarily contain updates in the test plan outputs, but I've not yet
> worked on that.

> Does anyone else have a strong opinion for or against adding SERIALIZE
> to the default set of explain features enabled with ANALYZE?

I am 100% dead set against that, because it would silently render
EXPLAIN outputs from different versions quite non-comparable.

regards, tom lane

Re: Can't find not null constraint, but \d+ shows that

2024-04-10 Thread Alvaro Herrera

It turns out that trying to close all holes that lead to columns marked
not-null without a pg_constraint row is not possible within the ALTER
TABLE framework, because it can happen outside it also.  Consider this

CREATE DOMAIN dom1 AS integer;
CREATE TABLE notnull_tbl (a dom1, b int, PRIMARY KEY (a, b));
DROP DOMAIN dom1 CASCADE;

In this case you'll end up with b having attnotnull=true and no
constraint; and no amount of messing with tablecmds.c will fix it.

So I propose to instead allow those constraints, and treat them as
second-class citizens.  We allow dropping them with ALTER TABLE DROP NOT
NULL, and we allow to create a backing full-fledged constraint with SET
NOT NULL or ADD CONSTRAINT.  So here's a partial crude initial patch to
do that.


One thing missing here is pg_dump support.  If you just dump this table,
it'll end up with no constraint at all.  That's obviously bad, so I
propose we have pg_dump add a regular NOT NULL constraint for those, to
avoid perpetuating the weird situation further.

Another thing I wonder if whether I should use the existing
set_attnotnull() instead of adding drop_orphaned_notnull().  Or we could
just inline the code in ATExecDropNotNull, since it's small and
self-contained.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Postgres is bloatware by design: it was built to house
 PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)
>From 3a00358847f1792c01e608909dc8a4a0edb8739c Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 10 Apr 2024 13:02:35 +0200
Subject: [PATCH] Handle ALTER .. DROP NOT NULL when no pg_constraint row
 exists

---
 src/backend/commands/tablecmds.c  | 67 +++
 src/test/regress/expected/constraints.out | 55 +++
 src/test/regress/sql/constraints.sql  | 29 ++
 3 files changed, 139 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 865c6331c1..f692001e55 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -448,6 +448,7 @@ static ObjectAddress ATExecDropNotNull(Relation rel, const char *colName, bool r
 	   LOCKMODE lockmode);
 static bool set_attnotnull(List **wqueue, Relation rel,
 		   AttrNumber attnum, bool recurse, LOCKMODE lockmode);
+static void drop_orphaned_notnull(Oid relationOid, AttrNumber attnum);
 static ObjectAddress ATExecSetNotNull(List **wqueue, Relation rel,
 	  char *constrname, char *colName,
 	  bool recurse, bool recursing,
@@ -7678,17 +7679,23 @@ ATExecDropNotNull(Relation rel, const char *colName, bool recurse,
 	}
 
 	/*
-	 * Find the constraint that makes this column NOT NULL.
+	 * Find the constraint that makes this column NOT NULL, and drop it if we
+	 * see one.  dropconstraint_internal() will do necessary consistency
+	 * checking.  If there isn't one, there are two possibilities: either the
+	 * column is marked attnotnull because it's part of the primary key, and
+	 * then we just throw an appropriate error; or it's a leftover marking that
+	 * we can remove.  However, before doing the latter, to avoid breaking
+	 * consistency any further, prevent this if the column is part of the
+	 * replica identity.
 	 */
 	conTup = findNotNullConstraint(RelationGetRelid(rel), colName);
 	if (conTup == NULL)
 	{
 		Bitmapset  *pkcols;
+		Bitmapset  *ircols;
 
 		/*
-		 * There's no not-null constraint, so throw an error.  If the column
-		 * is in a primary key, we can throw a specific error.  Otherwise,
-		 * this is unexpected.
+		 * If the column is in a primary key, throw a specific error message.
 		 */
 		pkcols = RelationGetIndexAttrBitmap(rel, INDEX_ATTR_BITMAP_PRIMARY_KEY);
 		if (bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber,
@@ -7697,16 +7704,25 @@ ATExecDropNotNull(Relation rel, const char *colName, bool recurse,
 	errcode(ERRCODE_INVALID_TABLE_DEFINITION),
 	errmsg("column \"%s\" is in a primary key", colName));
 
-		/* this shouldn't happen */
-		elog(ERROR, "could not find not-null constraint on column \"%s\", relation \"%s\"",
-			 colName, RelationGetRelationName(rel));
+		/* Also throw an error if the column is in the replica identity */
+		ircols = RelationGetIndexAttrBitmap(rel, INDEX_ATTR_BITMAP_IDENTITY_KEY);
+		if (bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber, ircols))
+			ereport(ERROR,
+	errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+	errmsg("column \"%s\" is in index used as replica identity",
+		   get_attname(RelationGetRelid(rel), attnum, false)));
+
+		/* Otherwise, just remove the attnotnull marking and do nothing else. */
+		drop_orphaned_notnull(RelationGetRelid(rel), attnum);
 	}
+	else
+	{
+		readyRels = NIL;
+		dropconstraint_internal(rel, conTup, DROP_RESTRICT, recurse, false,
+false, &readyRels, lockmode);
 
-	readyRels = NIL;
-	dropconstraint_internal(rel, conTup, DROP_RESTRICT, recurse, false,
-			fa

Re: Transparent column encryption

2024-04-10 Thread Jelte Fennema-Nio

On Wed, 10 Apr 2024 at 12:13, Peter Eisentraut  wrote:
>
> To kick some things off for PG18, here is an updated version of the
> patch for automatic client-side column-level encryption.

I only read the docs and none of the code, but here is my feedback on
the current design:

> (The CEK can't be rotated easily, since
> that would require reading out all the data from a table/column and
> reencrypting it.  We could/should add some custom tooling for that,
> but it wouldn't be a routine operation.)

This seems like something that requires some more thought because CEK
rotation seems just as important as CMK rotation (often both would be
compromised at the same time). As far as I can tell the only way to
rotate a CEK is by re-encrypting the column for all rows in a single
go at the client side, thus taking a write-lock on all rows of the
table. That seems quite problematic, because that makes key rotation
an operation that requires application downtime. Allowing online
rotation is important, otherwise almost no-one will do it preventative
at a regular interval.

One way to allow online CEK rotation is by allowing a column to be
encrypted by one of several keys and/or allow a key to have multiple
versions. And then for each row we would store which key/version it
was encrypted with. That way for new insertions/updates clients would
use the newest version. But clients would still be able to decrypt
both old rows with the old key and new rows encrypted with the new
key, because the server would give them both keys and tell which row
was encrypted with which. Then the old rows can be rewritten by a
client in small batches, so that writes to the table can keep working
while this operation takes place.

This could even be used to allow encrypting previously unencrypted
columns using something like "ALTER COLUMN mycol ENCRYPTION KEY cek1".
Then unencrypted rows could be indicated by e.g. returning something
like NULL for the CEK.

+The plaintext inside
+the ciphertext is always in text format, but this is invisible to the
+protocol.

It seems like it would be useful to have a way of storing the
plaintext in binary form too. I'm not saying this should be part of
the initial version, but it would be good to keep that in mind with
the design.

+ The session-specific identifier of the key.

Is it necessary for this identifier to be session-specific? Why not
use a global identifier like an oid? Anything session specific makes
the job of transaction poolers quite a bit harder. If this identifier
would be global, then the message can be forwarded as is to the client
instead of re-mapping identifiers between clients and servers (like is
needed for prepared statements).

+   Additional algorithms may be added to this protocol specification without a
+   change in the protocol version number.

What's the reason for not requiring a version bump for this?

+  If the protocol extension _pq_.column_encryption is
+  enabled (see ), then
+  there is also the following for each parameter:

It seems a little bit wasteful to include these for all columns, even
the ones that don't require encryption. How about only adding these
fields when format code is 0x11

Finally, I'm trying to figure out if this really needs to be a
protocol extension or if a protocol version bump would work as well
without introducing a lot of work for clients/poolers that don't care
about it (possibly with some modifications to the proposed protocol
changes). What makes this a bit difficult for me is that there's not
much written in the documentation on what is supposed to happen for
encrypted columns when the protocol extension is not enabled. Is the
content just returned/written like it would be with a bytea? Or is
writing disallowed because the format code would never be set to 0x11.

A related question to this is that currently libpq throws an error if
e.g. a master key realm is not defined but another one is. Is that
really what we want? Is not having one of the realms really that
different from not providing any realms at all?

But no-matter these behavioural details, I think it would be fairly
easy to add minimal "non-support" for this feature while supporting
the new protocol messages. All they would need to do is understand
what the new protocol messages/fields mean and either ignore them or
throw a clear error. For poolers it's a different story however. For
transaction pooling there's quite a bit of work to be done. I already
mentioned the session-specific ID being a problem, but even assuming
we change that to a global ID there's still difficulties. Key
information is only sent by the server if it wasn't sent before in the
session[1], so a pooler would need to keep it's own cache and send
keys to clients that haven't received them yet.

So yeah, I think it would make sense to put this behind a protocol
extension feature flag, since it's fairly niche and would require
significant work at the pooler side to support.

Re: post-freeze damage control

2024-04-10 Thread Robert Haas

On Tue, Apr 9, 2024 at 1:59 AM Peter Eisentraut  wrote:
> On 09.04.24 00:58, Michael Paquier wrote:
> > That's more linked to the fact that I was going silent without a
> > laptop for a few weeks before the end of the release cycle, and a way
> > to say to not count on me, while I was trying to keep my room clean to
> > avoid noise for others who would rush patches.  It is a vacation
> > period for schools in Japan as the fiscal year finishes at the end of
> > March, while the rest of the world still studies/works, so that makes
> > trips much easier with areas being less busy when going abroad.  If
> > you want to limit commit activity during this period, the answer is
> > simple then: require that all the committers live in Japan.
>
> Well, due to the Easter holiday being earlier this year, I adopted a
> similar approach: Go on vacation the last week of March and watch the
> rest from the pool. :)  So far I feel this was better for my well-being.

I usually aim to have my major work for the release code complete by
approximately September and committed by December or January. Then if
it slips, I still have a chance of finishing before the freeze, and if
it doesn't, then I don't have to deal with the mad flurry of activity
at the end, and perhaps there's even time for some follow-up work
afterwards (as in the case of IB), or just time to review some other
patches.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-10 Thread Давыдов Виталий


Hi Amit, Ajin, All

Thank you for the patch and the responses. I apologize for my delayed answer 
due to some curcumstances.
On Wednesday, April 10, 2024 14:18 MSK, Amit Kapila  
wrote:

Vitaly, does the minimal solution provided by the proposed patch (Allow to 
alter two_phase option of a subscriber provided no uncommitted prepared 
transactions are pending on that subscription.) address your use case?In 
general, the idea behind the patch seems to be suitable for my case. 
Furthermore, the case of two_phase switch from false to true with uncommitted 
pending prepared transactions probably never happens in my case. The switch 
from false to true means that the replica completes the catchup from the master 
and switches to the normal mode when it participates in the multi-node 
configuration. There should be no uncommitted pending prepared transactions at 
the moment of the switch to the normal mode.

I'm going to try this patch. Give me please some time to investigate the patch. 
I will come with some feedback a little bit later.

Thank you for your help!

With best regards,
Vitaly Davydov

Re: post-freeze damage control

2024-04-10 Thread Robert Haas

On Mon, Apr 8, 2024 at 10:12 PM Tom Lane  wrote:
> I have another one that I'm not terribly happy about:
>
> Author: Alexander Korotkov 
> Branch: master [72bd38cc9] 2024-04-08 01:27:52 +0300
>
> Transform OR clauses to ANY expression

I realize that this has been reverted now, but what's really
frustrating about this case is that I reviewed this patch before and
gave feedback similar to some of the feedback you gave, and it just
didn't matter, and the patch was committed anyway.

> I don't know that I'd call it scary exactly, but I do think it
> was premature.  A week ago there was no consensus that it was
> ready to commit, but Alexander pushed it (or half of it, anyway)
> despite that.  A few concrete concerns:
>
> * Yet another planner GUC.  Do we really need or want that?

IMHO, no, and I said so in
https://www.postgresql.org/message-id/CA%2BTgmob%3DebuCHFSw327b55DJzE3JtOuZ5owxob%2BMgErb4me_Ag%40mail.gmail.com

> * What the medical community would call off-label usage of
> query jumbling.  I'm not sure this is even correct as-used,
> and for sure it's using that code for something never intended.
> Nor is the added code adequately (as in, at all) documented.

And I raised this point here:
https://www.postgresql.org/message-id/CA%2BTgmoZCgP6FrBQEusn4yaWm02XU8OPeoEMk91q7PRBgwaAkFw%40mail.gmail.com

> * Patch refuses to group anything but Consts into the SAOP
> transformation.  I realize that if you want to produce an
> array Const you need Const inputs, but I wonder why it
> wasn't considered to produce an ARRAY[] construct if there
> are available clauses with pseudo-constant (eg Param)
> comparison values.
>
> * I really, really dislike jamming this logic into prepqual.c,
> where it has no business being.  I note that it was shoved
> into process_duplicate_ors without even the courtesy of
> expanding the header comment:
>
>  * process_duplicate_ors
>  *Given a list of exprs which are ORed together, try to apply
>  *the inverse OR distributive law.
>
> Another reason to think this wasn't a very well chosen place is
> that the file's list of #include's went from 4 entries to 11.
> Somebody should have twigged to the idea that this was off-topic
> for prepqual.c.

All of this seems like it might be related to my comments in the above
email about the transformation being done too early.

> * OrClauseGroupKey is not a Node type, so why does it have
> a NodeTag?  I wonder what value will appear in that field,
> and what will happen if the struct is passed to any code
> that expects real Nodes.

I don't think I raised this issue.

> I could probably find some other nits if I spent more time
> on it, but I think these are sufficient to show that this
> was not commit-ready.

Just imagine if someone had taken time to give similar feedback before
the commit.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: ❓ JSON Path Dot Precedence

2024-04-10 Thread Peter Eisentraut


On 07.04.24 18:13, David E. Wheeler wrote:

Hello Hackers,

A question about the behavior of the JSON Path parser. The docs[1] have this to 
say about numbers:


  Numeric literals in SQL/JSON path expressions follow JavaScript rules, which 
are different from both SQL and JSON in some minor details. For example, 
SQL/JSON path allows .1 and 1., which are invalid in JSON.


In other words, this is valid:

david=# select '2.'::jsonpath;
  jsonpath
--
  2

But this feature creates a bit of a conflict with the use of a dot for path 
expressions. Consider `0x2.p10`. How should that be parsed? As an invalid decimal 
expression ("trailing junk after numeric literal”), or as a valid integer 2 
followed by the path segment “p10”? Here’s the parser’s answer:

david=# select '0x2.p10'::jsonpath;
  jsonpath
---
  (2)."p10"

So it would seem that, other things being equal, a path key expression (`.foo`) 
is slightly higher precedence than a decimal expression. Is that 
intentional/correct?


I think the derivation would be like this:

(I'm not sure what the top-level element would be, so let's start 
somewhere in the middle ...)


 ::= 

 ::=  

 ::= 

 ::= 

 ::=  

So the whole thing is

  

The syntax of  and  is then 
punted to ECMAScript 5.1.


0x2 is a HexIntegerLiteral.  (There can be no dots in that.)

p10 is an Identifier.

So I think this is all correct.

Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-10 Thread Ants Aasma

On Mon, 8 Apr 2024 at 16:26, Robert Haas  wrote:

> And maybe we need to think of a way to further mitigate this crush of
> last minute commits. e.g. In the last week, you can't have more
> feature commits, or more lines of insertions in your commits, than you
> did in the prior 3 weeks combined. I don't know. I think this mad rush
> of last-minute commits is bad for the project.
>

I think some part of this rush of commits could also be explained as a form
of entrainment[1]. Only patches reasonably close to commit will get picked
up with extra attention to get them ready before the deadline. After the
release hammer drops, the pool of remaining patches will have few patches
close to commit remaining. And to make matters worse the attention of
working on them will be spread thinner. When repeated, this pattern can be
self reinforcing.

If this hypothesis is true, maybe some forces could be introduced to
counteract this natural tendency. I don't have any bright ideas on how
exactly yet.

Ants

[1] Emergent synchronization of interacting oscillators, see:
https://en.wikipedia.org/wiki/Injection_locking#Entrainment
https://en.wikipedia.org/wiki/Entrainment_(biomusicology)

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Robert Haas

On Tue, Apr 9, 2024 at 5:05 PM Andres Freund  wrote:
> ISTM that the fix here is to not use a spinlock for whatever the contention is
> on, rather than improve the RNG.

I'm not convinced that we should try to improve the RNG, but surely we
need to put parentheses around pg_prng_double(&pg_global_prng_state) +
0.5. IIUC, the current logic is making us multiply the spin delay by a
value between 0 and 1 when what was intended was that it should be
multiplied by a value between 0.5 and 1.5.

If I'm reading this correctly, this was introduced here:

commit 59bb147353ba274e0836d06f429176d4be47452c
Author: Bruce Momjian 
Date:   Fri Feb 3 12:45:47 2006 +

Update random() usage so ranges are inclusive/exclusive as required.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Parag Paul

Hi Andres,
This is a little bit more complex than that. The spinlocks are taken in the
LWLock(Mutex) code, when the lock is not available right away.
The spinlock is taken to attach the current backend to the wait list of the
LWLock. This means, that this cannot be controlled.
The repro when it happens, it affects any mutex or LWLock code path, since
the low hamming index can cause problems by removing fairness from the
system.

Also, I believe the rounding off error still remains within the RNG. I will
send a patch today.

Thanks for the response.
-Parag

On Tue, Apr 9, 2024 at 2:05 PM Andres Freund  wrote:

> Hi,
>
> On 2024-04-08 22:52:09 -0700, Parag Paul wrote:
> >  We have an interesting problem, where PG went to PANIC due to stuck
> > spinlock case.
> > On careful analysis and hours of trying to reproduce this(something that
> > showed up in production after almost 2 weeks of stress run), I did some
> > statistical analysis on the RNG generator that PG uses to create the
> > backoff for the spin locks.
>
> ISTM that the fix here is to not use a spinlock for whatever the
> contention is
> on, rather than improve the RNG.
>
> Greetings,
>
> Andres Freund
>

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Parag Paul

Thank you Robert.
I am in the process of patching this.
-Parag

On Wed, Apr 10, 2024 at 7:43 AM Robert Haas  wrote:

> On Tue, Apr 9, 2024 at 5:05 PM Andres Freund  wrote:
> > ISTM that the fix here is to not use a spinlock for whatever the
> contention is
> > on, rather than improve the RNG.
>
> I'm not convinced that we should try to improve the RNG, but surely we
> need to put parentheses around pg_prng_double(&pg_global_prng_state) +
> 0.5. IIUC, the current logic is making us multiply the spin delay by a
> value between 0 and 1 when what was intended was that it should be
> multiplied by a value between 0.5 and 1.5.
>
> If I'm reading this correctly, this was introduced here:
>
> commit 59bb147353ba274e0836d06f429176d4be47452c
> Author: Bruce Momjian 
> Date:   Fri Feb 3 12:45:47 2006 +
>
> Update random() usage so ranges are inclusive/exclusive as required.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>

Re: Allow non-superuser to cancel superuser tasks.

2024-04-10 Thread Nathan Bossart

On Wed, Apr 10, 2024 at 07:58:39AM +0900, Michael Paquier wrote:
> On Wed, Apr 10, 2024 at 12:52:19AM +0300, Kirill Reshke wrote:
>> On Tue, 9 Apr 2024 at 08:53, Michael Paquier  wrote:
>>> The thing is that you cannot rely on a lookup of the backend type for
>>> the error information, or you open yourself to letting the caller of
>>> pg_cancel_backend or pg_terminate_backend know if a backend is
>>> controlled by a superuser or if a backend is an autovacuum worker.
>> 
>> Good catch. Thanks.  I think we need to update the error message to not
>> leak backend type info.
> 
> Yep, that's necessary I am afraid.

Isn't it relatively easy to discover this same information today via
pg_stat_progress_vacuum?  That has the following code:

/* Value available to all callers */
values[0] = Int32GetDatum(beentry->st_procpid);
values[1] = ObjectIdGetDatum(beentry->st_databaseid);

I guess I'm not quite following why we are worried about leaking whether a
backend is an autovacuum worker.

>>> The choice of pg_signal_autovacuum is a bit inconsistent, as well,
>>> because autovacuum workers operate like regular backends.  This name
>>> can also be confused with the autovacuum launcher.
>> 
>> Ok. What would be a good choice? Is `pg_signal_autovacuum_worker` good
>> enough?
> 
> Sounds fine to me.  Perhaps others have an opinion about that?

WFM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Can't find not null constraint, but \d+ shows that

2024-04-10 Thread jian he

On Wed, Apr 10, 2024 at 7:01 PM Alvaro Herrera  wrote:
>
> On 2024-Apr-10, jian he wrote:
>
> > another related bug, in master.
> >
> > drop table if exists notnull_tbl1;
> > CREATE TABLE notnull_tbl1 (c0 int not null, c1 int);
> > ALTER TABLE notnull_tbl1 ADD CONSTRAINT Q PRIMARY KEY(c0, c1);
> > \d+ notnull_tbl1
> > ALTER TABLE notnull_tbl1 ALTER c0 DROP NOT NULL;
> > ALTER TABLE notnull_tbl1 ALTER c1 DROP NOT NULL;
> >
> > "ALTER TABLE notnull_tbl1 ALTER c0 DROP NOT NULL;"
> > should fail?
>
> No, this should not fail, and it is working correctly in master.  You
> can drop the not-null constraint, but the column will still be
> non-nullable, because the primary key still exists.  If you drop the
> primary key later, then the column becomes nullable.  This is by design.
>

now I got it. the second time, it will fail.
it should be the expected behavior.

per commit:
https://git.postgresql.org/cgit/postgresql.git/commit/?id=14dd0f27d7cd56ffae9ecdbe324965073d01a9ff
In the function dropconstraint_internal, I changed "foreach" to
"foreach_int" in some places,
and other minor cosmetic changes within the function
dropconstraint_internal only.

Since I saw your changes in dropconstraint_internal, I posted here.
I will review your latest patch later.
From dd41bf8d1f2dea5725909150609f8e565e11efcf Mon Sep 17 00:00:00 2001
From: jian he 
Date: Wed, 10 Apr 2024 20:49:04 +0800
Subject: [PATCH v1 1/1] minor coesmetuic refactor in dropconstraint_internal

mainly change `foreach` to `foreach_int` while iterating through list unconstrained_cols.
---
 src/backend/commands/tablecmds.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8a02c5b0..8dc623e8 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -12968,12 +12968,11 @@ dropconstraint_internal(Relation rel, HeapTuple constraintTup, DropBehavior beha
 	 * have had pg_attribute.attnotnull set.  See if we need to reset it, and
 	 * do so.
 	 */
-	if (unconstrained_cols)
+	if (unconstrained_cols != NIL)
 	{
 		Relation	attrel;
 		Bitmapset  *pkcols;
 		Bitmapset  *ircols;
-		ListCell   *lc;
 
 		/* Make the above deletion visible */
 		CommandCounterIncrement();
@@ -12989,9 +12988,8 @@ dropconstraint_internal(Relation rel, HeapTuple constraintTup, DropBehavior beha
 	   INDEX_ATTR_BITMAP_PRIMARY_KEY);
 		ircols = RelationGetIndexAttrBitmap(rel, INDEX_ATTR_BITMAP_IDENTITY_KEY);
 
-		foreach(lc, unconstrained_cols)
+		foreach_int(attnum, unconstrained_cols)
 		{
-			AttrNumber	attnum = lfirst_int(lc);
 			HeapTuple	atttup;
 			HeapTuple	contup;
 			Form_pg_attribute attForm;
@@ -13039,11 +13037,11 @@ dropconstraint_internal(Relation rel, HeapTuple constraintTup, DropBehavior beha
 			 * It's not valid to drop the not-null constraint for a column in
 			 * the replica identity index, either. (FULL is not affected.)
 			 */
-			if (bms_is_member(lfirst_int(lc) - FirstLowInvalidHeapAttributeNumber, ircols))
+			if (bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber, ircols))
 ereport(ERROR,
 		errcode(ERRCODE_INVALID_TABLE_DEFINITION),
 		errmsg("column \"%s\" is in index used as replica identity",
-			   get_attname(RelationGetRelid(rel), lfirst_int(lc), false)));
+			   get_attname(RelationGetRelid(rel), attnum, false)));
 
 			/* Reset attnotnull */
 			if (attForm->attnotnull)
@@ -13233,10 +13231,10 @@ dropconstraint_internal(Relation rel, HeapTuple constraintTup, DropBehavior beha
 		 * Find out the list of column names to process.  Fortunately, we
 		 * already have the list of column numbers.
 		 */
-		foreach(lc, unconstrained_cols)
+		foreach_int(attnum, unconstrained_cols)
 		{
 			colnames = lappend(colnames, get_attname(RelationGetRelid(rel),
-	 lfirst_int(lc), false));
+	 attnum, false));
 		}
 
 		foreach(child, children)
-- 
2.34.1

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Tom Lane

Robert Haas  writes:
> I'm not convinced that we should try to improve the RNG, but surely we
> need to put parentheses around pg_prng_double(&pg_global_prng_state) +
> 0.5. IIUC, the current logic is making us multiply the spin delay by a
> value between 0 and 1 when what was intended was that it should be
> multiplied by a value between 0.5 and 1.5.

No, I think you are misreading it, because the assignment is += not =.
The present coding is

/* increase delay by a random fraction between 1X and 2X */
status->cur_delay += (int) (status->cur_delay *
pg_prng_double(&pg_global_prng_state) + 
0.5);

which looks fine to me.  The +0.5 is so that the conversion to integer
rounds rather than truncating.

In any case, I concur with Andres: if this behavior is anywhere near
critical then the right fix is to not be using spinlocks.

regards, tom lane

Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-10 Thread Kartyshov Ivan


I did some experiments over synchronous replications and
got that cascade replication can`t be synchronous. And 
pg_wal_replay_wait() allows us to read your writes
consistency on cascade replication.
Beyond that, I added more tests on multi-standby replication
and cascade replications.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.comdiff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
index bbd64aa67b..867c3dd94a 100644
--- a/src/test/recovery/t/043_wal_replay_wait.pl
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -19,17 +19,17 @@ my $backup_name = 'my_backup';
 $node_primary->backup($backup_name);
 
 # Create a streaming standby with a 1 second delay from the backup
-my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $node_standby1 = PostgreSQL::Test::Cluster->new('standby');
 my $delay = 1;
-$node_standby->init_from_backup($node_primary, $backup_name,
+$node_standby1->init_from_backup($node_primary, $backup_name,
 	has_streaming => 1);
-$node_standby->append_conf(
+$node_standby1->append_conf(
 	'postgresql.conf', qq[
 	recovery_min_apply_delay = '${delay}s'
 ]);
-$node_standby->start;
-
+$node_standby1->start;
 
+# I
 # Make sure that pg_wal_replay_wait() works: add new content to
 # primary and memorize primary's insert LSN, then wait for that LSN to be
 # replayed on standby.
@@ -37,7 +37,7 @@ $node_primary->safe_psql('postgres',
 	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
 my $lsn1 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $output = $node_standby->safe_psql(
+my $output = $node_standby1->safe_psql(
 	'postgres', qq[
 	CALL pg_wal_replay_wait('${lsn1}', 100);
 	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
@@ -48,12 +48,13 @@ my $output = $node_standby->safe_psql(
 ok($output >= 0,
 	"standby reached the same LSN as primary after pg_wal_replay_wait()");
 
+# II
 # Check that new data is visible after calling pg_wal_replay_wait()
 $node_primary->safe_psql('postgres',
 	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
 my $lsn2 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-$output = $node_standby->safe_psql(
+$output = $node_standby1->safe_psql(
 	'postgres', qq[
 	CALL pg_wal_replay_wait('${lsn2}');
 	SELECT count(*) FROM wait_test;
@@ -62,6 +63,82 @@ $output = $node_standby->safe_psql(
 # Make sure the current LSN on standby and is the same as primary's LSN
 ok($output eq 30, "standby reached the same LSN as primary");
 
+# III
+# Check two standby waiting LSN
+# Create a streaming second standby with a 1 second delay from the backup
+my $node_standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby2->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby2->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby2->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+my $output2 = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 40, "standby1 reached the same LSN as primary");
+ok($output2 eq 40, "standby2 reached the same LSN as primary");
+
+# IV
+# Create a cascading standby waiting LSN
+$backup_name = 'cas_backup';
+$node_standby1->backup($backup_name);
+
+my $cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+$cascading_standby->init_from_backup($node_standby1, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+my $cascading_connstr = $node_standby1->connstr;
+$cascading_standby->append_conf(
+	'postgresql.conf', qq(
+	hot_standby_feedback = on
+	recovery_min_apply_delay = '${delay}s'
+));
+
+$cascading_standby->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+$output2 = $cascading_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 50, "standby1 reached the same LSN as primary");
+ok($output2 eq 50, "cascading_standby reached the same LSN as primary");

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Parag Paul

hi Tom,
 First of all thanks for you response. I did not misread it. The 0.5 is
added to the result of the multiplication which then uses C integer
casting, which does not round off, but just drops the decimal portion.


status->cur_delay += (int) (status->cur_delay *
pg_prng_double(&pg_global_prng_state) +
0.5);


So, if RNG generated 0.001 and cur_delay =1000.
Result will be
1000 + int(1000*0.01 + 5) = (int)(1000 + (0.1+.5)) = (int)1000.6 = 1000
<--  back to the same value
and there is no change after that starts happening, if the RNG is in the
low hamming index state. If avalanche happens soon, then it will correct it
self, but in the mean time, we have a stuck_spin_lock PANIC that could
bring down a production server.
-Parag

On Wed, Apr 10, 2024 at 8:09 AM Tom Lane  wrote:

> Robert Haas  writes:
> > I'm not convinced that we should try to improve the RNG, but surely we
> > need to put parentheses around pg_prng_double(&pg_global_prng_state) +
> > 0.5. IIUC, the current logic is making us multiply the spin delay by a
> > value between 0 and 1 when what was intended was that it should be
> > multiplied by a value between 0.5 and 1.5.
>
> No, I think you are misreading it, because the assignment is += not =.
> The present coding is
>
> /* increase delay by a random fraction between 1X and 2X */
> status->cur_delay += (int) (status->cur_delay *
> pg_prng_double(&pg_global_prng_state)
> + 0.5);
>
> which looks fine to me.  The +0.5 is so that the conversion to integer
> rounds rather than truncating.
>
> In any case, I concur with Andres: if this behavior is anywhere near
> critical then the right fix is to not be using spinlocks.
>
> regards, tom lane
>

Re: Speed up clean meson builds by ~25%

2024-04-10 Thread Tom Lane

Thomas Munro  writes:
> On Wed, Apr 10, 2024 at 5:03 PM Tom Lane  wrote:
>> I don't doubt that there are other clang versions where the problem
>> bites a lot harder.  What result do you get from the test I tried
>> (turning mm_strdup into a no-op macro)?

> #define mm_strdup(x) (x) does this:
> Apple clang 15: master: 14s -> 9s
> MacPorts clang 16, master: 170s -> 10s

Wow.  So (a) it's definitely worth doing something about this;
but (b) anything that would move the needle would probably require
significant refactoring of ecpg's string handling, and I doubt we
want to consider that post-feature-freeze.  The earliest we could
land such a fix would be ~ July, if we follow past schedules.

The immediate question then is do we want to take Jelte's patch
as a way to ameliorate the pain meanwhile.  I'm kind of down on
it, because AFAICS what would happen if you break the core
grammar is that (most likely) the failure would be reported
against ecpg first.  That seems pretty confusing.  However, maybe
it's tolerable as a short-term band-aid that we plan to revert later.
I would not commit the makefile side of the patch though, as
that creates the same problem for makefile users while providing
little benefit.

As for the longer-term fix, I'd be willing to have a go at
implementing the sketch I wrote last night; but I'm also happy
to defer to anyone else who's hot to work on this.

regards, tom lane

psql: Greatly speed up "\d tablename" when not using regexes

2024-04-10 Thread Jelte Fennema-Nio

Running "\d tablename" from psql could take multiple seconds when
running on a system with 100k+ tables. The reason for this was that
a sequence scan on pg_class takes place, due to regex matching being
used.

Regex matching is obviously unnecessary when we're looking for an exact
match. This checks for this (common) case and starts using plain
equality in that case.


v1-0001-psql-Greatly-speed-up-d-tablename-when-not-using-.patch
Description: Binary data

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Robert Haas

On Wed, Apr 10, 2024 at 11:09 AM Tom Lane  wrote:
> No, I think you are misreading it, because the assignment is += not =.
> The present coding is
>
> /* increase delay by a random fraction between 1X and 2X */
> status->cur_delay += (int) (status->cur_delay *
> pg_prng_double(&pg_global_prng_state) + 
> 0.5);
>
> which looks fine to me.  The +0.5 is so that the conversion to integer
> rounds rather than truncating.

Oh, yeah ... right. But then why does the comment say that it's
increasing the delay between a random fraction between 1X and 2X?
Isn't this a random fraction between 0X and 1X? Or am I still full of
garbage?

> In any case, I concur with Andres: if this behavior is anywhere near
> critical then the right fix is to not be using spinlocks.

It would certainly be interesting to know which spinlocks were at
issue, here. But it also seems to me that it's reasonable to be
unhappy about the possibility of this increasing cur_delay by exactly
0. Storms of spinlock contention can and do happen on real-world
production servers, and trying to guess which spinlocks might be
affected before it happens is difficult. Until we fully eliminate all
spinlocks from our code base, the theoretical possibility of such
storms remains, and if making sure that the increment here is >0
mitigates that, then I think we should. Mind you, I don't think that
the original poster has necessarily provided convincing or complete
evidence to justify a change here, but I find it odd that you and
Andres seem to want to dismiss it out of hand.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Tom Lane

Parag Paul  writes:
> So, if RNG generated 0.001 and cur_delay =1000.
> Result will be
> 1000 + int(1000*0.01 + 5) = (int)(1000 + (0.1+.5)) = (int)1000.6 = 1000
> <--  back to the same value

Yes, with a sufficiently small RNG result, the sleep delay will not
increase that time through the loop.  So what?  It's supposed to be
a random amount of backoff, and I don't see why "random" shouldn't
occasionally include "zero".  Why would requiring the delay to
increase improve matters at all?

Now, if the xoroshiro RNG is so bad that there's a strong likelihood
that multiple backends will sit on cur_delay = MIN_DELAY_USEC for
a long time, then yeah we could have a thundering-herd problem with
contention for the spinlock.  You've provided no reason to think
that the probability of that is more than microscopic, though.
(If it is more than microscopic, then we are going to have
nonrandomness problems in a lot of other places, so I'd lean
towards revising the RNG not band-aiding s_lock.  We've seen
nothing to suggest such problems, though.)

regards, tom lane

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Tom Lane

Robert Haas  writes:
> Oh, yeah ... right. But then why does the comment say that it's
> increasing the delay between a random fraction between 1X and 2X?

I think the comment is meant to say that the new delay value will be
1X to 2X the old value.  If you want to suggest different phrasing,
feel free.

> It would certainly be interesting to know which spinlocks were at
> issue, here. But it also seems to me that it's reasonable to be
> unhappy about the possibility of this increasing cur_delay by exactly
> 0.

As I said to Parag, I see exactly no reason to believe that that's a
problem, unless it happens *a lot*, like hundreds of times in a row.
If it does, that's an RNG problem not s_lock's fault.  Now, I'm not
going to say that xoroshiro can't potentially do that, because I've
not studied it.  But if it is likely to do that, I'd think we'd
have noticed the nonrandomness in other ways.

regards, tom lane

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Andres Freund

Hi,

On 2024-04-10 07:55:16 -0700, Parag Paul wrote:
> This is a little bit more complex than that. The spinlocks are taken in the
> LWLock(Mutex) code, when the lock is not available right away.
> The spinlock is taken to attach the current backend to the wait list of the
> LWLock. This means, that this cannot be controlled.
> The repro when it happens, it affects any mutex or LWLock code path, since
> the low hamming index can cause problems by removing fairness from the
> system.

Please provide a profile of a problematic workload.  I'm fairly suspicious
of the claim that RNG unfairness is a real part of the problem here.

Greetings,

Andres Freund

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Tom Lane

Actually ... Parag mentioned that this was specifically about
lwlock.c's usage of spinlocks.  It doesn't really use a spinlock,
but it does use s_lock.c's delay logic, and I think it's got the
usage pattern wrong:

while (true)
{
/* always try once to acquire lock directly */
old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
if (!(old_state & LW_FLAG_LOCKED))
break;/* got lock */

/* and then spin without atomic operations until lock is released */
{
SpinDelayStatus delayStatus;

init_local_spin_delay(&delayStatus);

while (old_state & LW_FLAG_LOCKED)
{
perform_spin_delay(&delayStatus);
old_state = pg_atomic_read_u32(&lock->state);
}
#ifdef LWLOCK_STATS
delays += delayStatus.delays;
#endif
finish_spin_delay(&delayStatus);
}

/*
 * Retry. The lock might obviously already be re-acquired by the time
 * we're attempting to get it again.
 */
}

I don't think it's correct to re-initialize the SpinDelayStatus each
time around the outer loop.  That state should persist through the
entire acquire operation, as it does in a regular spinlock acquire.
As this stands, it resets the delay to minimum each time around the
outer loop, and I bet it is that behavior not the RNG that's to blame
for what he's seeing.

(One should still wonder what is the LWLock usage pattern that is
causing this spot to become so heavily contended.)

regards, tom lane

Re: Table AM Interface Enhancements

2024-04-10 Thread Alexander Korotkov

On Wed, Apr 10, 2024 at 4:19 PM Robert Haas  wrote:
>
> On Wed, Apr 10, 2024 at 8:20 AM Alexander Korotkov  
> wrote:
> > I agree with the facts.  But I have a different interpretation on
> > this.  The patch was committed as 11470f544e on March 23, 2023, then
> > reverted on April 3.  I've proposed the revised version, but Andres
> > complained that this is the new API design days before FF.
>
> Well, his first complaint that your committed patch was full of bugs:
>
> https://www.postgresql.org/message-id/20230323003003.plgaxjqahjgkuxrk%40awork3.anarazel.de
>
> When you commit a patch and another committer writes a post-commit
> review saying that your patch has so many serious problems that he
> gave up on reviewing before enumerating all of them, that's a really
> bad sign. That should be a strong signal to you to step back and take
> a close look at whether you really understand the area of the code
> that you're touching well enough to be doing whatever it is that
> you're doing. If I got a review like that, I would have reverted the
> patch instantly, given up for the release cycle, possibly given up on
> the patch permanently, and most definitely not tried again to commit
> unless I was absolutely certain that I'd learned a lot in the meantime
> *and* had the agreement of the committer who wrote that review (or
> maybe some other committer who was acknowledged as an expert in that
> area of the code).
>
> What you did instead is try to do a bunch of post-commit fixup in a
> desperate rush right before feature freeze, to which Andres
> understandably objected. But that was your second mistake, not your
> first one.
>
> > Then the
> > patch with this design was published in the thread for the year with
> > periodical rebases.  So, I think I expressed my intention with that
> > design before 2023 FF, nobody prevented me from expressing objections
> > or other feedback during the year.  Then I realized that 2024 FF is
> > approaching and decided to give this another try for pg18.
>
> This doesn't seem to match the facts as I understand them. It appears
> to me that there was no activity on the thread from April until
> November. The message in November was not written by you. Your first
> post to the thread after April of 2023 was on March 19, 2024. Five
> days later you said you wanted to commit. That doesn't look to me like
> you worked diligently on the patch set throughout the year and other
> people had reasonable notice that you planned to get the work done
> this cycle. It looks like you ignored the patch for 11 months and then
> committed it without any real further feedback from anyone. True,
> Pavel did post and say that he thought the patches were in good shape.
> But you could hardly take that as evidence that Andres was now content
> that the problems he'd raised earlier had been fixed, because (a)
> Pavel had also been involved beforehand and had not raised the
> concerns that Andres later raised and (b) Pavel wrote nothing in his
> email specifically about why he thought your changes or his had
> resolved those concerns. I certainly agree that Andres doesn't always
> give as much review feedback as I'd like to have from him in, and it's
> also true that he doesn't always give that feedback as quickly as I'd
> like to have it ... but you know what?
>
> It's not Andres's job to make sure my patches are not broken. It's my
> job. That applies to the patches I write, and the patches written by
> other people that I commit. If I commit something and it turns out
> that it is broken, that's my bad. If I commit something and it turns
> out that it does not have consensus, that is also my bad. It is not
> the fault of the other people for not helping me get my patches to a
> state where they are up to project standard. It is my fault, and my
> fault alone, for committing something that was not ready. Now that
> does not mean that it isn't frustrating when I can't get the help I
> need. It is extremely frustrating. But the solution is not to commit
> anyway and then blame the other people for not providing feedback.
>
> I mean, committing without explicit agreement from someone else is OK
> if you're pretty sure that you've got everything sorted out correctly.
> But I don't think that the paper trail here supports the narrative
> that you worked on this diligently throughout the year and had every
> reason to believe it would be acceptable to the community. If I'd
> looked at this thread, I would have concluded that you'd abandoned the
> project. I would have expected that, when you picked it up again,
> there would be a series of emails over a period of time carefully
> working through the various issues that had been raised, inviting
> specific commentary on specific discussion points, and generally
> refining the work, and then maybe a suggestion of a commit at the end.
> I would not have expected an email or two basically saying "well,
> seems like it's all fixed now," followed by a commit.

R

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Robert Haas

On Wed, Apr 10, 2024 at 12:02 PM Tom Lane  wrote:
> As I said to Parag, I see exactly no reason to believe that that's a
> problem, unless it happens *a lot*, like hundreds of times in a row.
> If it does, that's an RNG problem not s_lock's fault.  Now, I'm not
> going to say that xoroshiro can't potentially do that, because I've
> not studied it.  But if it is likely to do that, I'd think we'd
> have noticed the nonrandomness in other ways.

The blog post to which Parag linked includes this histogram as an
example of a low-Hamming-weight situation:

Reps | Value
-+
  84 | 0x00
  10 | 0x2d
   6 | 0xa0
   6 | 0x68
   6 | 0x40
   6 | 0x02
   5 | 0x0b
   4 | 0x05
   4 | 0x01
   3 | 0xe1
   3 | 0x5a
   3 | 0x41
   3 | 0x20
   3 | 0x16

That's certainly a lot more zeroes than you'd like to see in 192 bytes
of output, but it's not hundreds in a row either.

Also, the blog post says "On the one hand, from a practical
perspective, having vastly, vastly more close repeats than it should
isn't likely to be an issue that users will ever detect in practice.
Xoshiro256's large state space means it is too big to fail any
statistical tests that look for close repeats." So your theory that we
have a bug elsewhere in the code seems like it might be the real
answer.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Parag Paul

hi Robert,
We are using xoroshiro128 and not moved to the next state of art.
We did see a lot of low values as put in my last message.
-Parag

On Wed, Apr 10, 2024 at 9:37 AM Robert Haas  wrote:

> On Wed, Apr 10, 2024 at 12:02 PM Tom Lane  wrote:
> > As I said to Parag, I see exactly no reason to believe that that's a
> > problem, unless it happens *a lot*, like hundreds of times in a row.
> > If it does, that's an RNG problem not s_lock's fault.  Now, I'm not
> > going to say that xoroshiro can't potentially do that, because I've
> > not studied it.  But if it is likely to do that, I'd think we'd
> > have noticed the nonrandomness in other ways.
>
> The blog post to which Parag linked includes this histogram as an
> example of a low-Hamming-weight situation:
>
> Reps | Value
> -+
>   84 | 0x00
>   10 | 0x2d
>6 | 0xa0
>6 | 0x68
>6 | 0x40
>6 | 0x02
>5 | 0x0b
>4 | 0x05
>4 | 0x01
>3 | 0xe1
>3 | 0x5a
>3 | 0x41
>3 | 0x20
>3 | 0x16
>
> That's certainly a lot more zeroes than you'd like to see in 192 bytes
> of output, but it's not hundreds in a row either.
>
> Also, the blog post says "On the one hand, from a practical
> perspective, having vastly, vastly more close repeats than it should
> isn't likely to be an issue that users will ever detect in practice.
> Xoshiro256's large state space means it is too big to fail any
> statistical tests that look for close repeats." So your theory that we
> have a bug elsewhere in the code seems like it might be the real
> answer.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Robert Haas

On Wed, Apr 10, 2024 at 12:40 PM Parag Paul  wrote:
> The reason why this could be a problem is a flaw in the RNG with the enlarged 
> Hamming belt.
> I attached an image here, with the RNG outputs from 2 backends. I ran our 
> code for weeks, and collected ther
> values generated by the RNG over many backends. The one in Green (say backend 
> id 600), stopped flapping values and
> only produced low (near 0 ) values for half an hour, whereas the Blue(say 
> backend 700), kept generating good values and had
> a range between [0-1)
> During this period, the backed 600 suffered and ended up with spinlock stuck 
> condition.

This is a very vague description of a test procedure. If you provide a
reproducible series of steps that causes a stuck spinlock, I imagine
everyone will be on board with doing something about it. But this
graph is not going to convince anyone of anything.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Parag Paul

Yes, the probability of this happening is astronomical, but in production
with 128 core servers with 7000 max_connections, with petabyte scale data,
this did repro 2 times in the last month. We had to move to a local
approach to manager our ratelimiting counters.
This is not reproducible very easily. I feel that we should at least shield
ourselves with the following change, so that we at least increase the delay
by 1000us every time. We will follow a linear back off, but better than no
backoff.
  status->cur_delay += max(1000, (int) (status->cur_delay *
pg_prng_double(&pg_global_prng_state) +
0.5));

On Wed, Apr 10, 2024 at 9:43 AM Robert Haas  wrote:

> On Wed, Apr 10, 2024 at 12:40 PM Parag Paul  wrote:
> > The reason why this could be a problem is a flaw in the RNG with the
> enlarged Hamming belt.
> > I attached an image here, with the RNG outputs from 2 backends. I ran
> our code for weeks, and collected ther
> > values generated by the RNG over many backends. The one in Green (say
> backend id 600), stopped flapping values and
> > only produced low (near 0 ) values for half an hour, whereas the
> Blue(say backend 700), kept generating good values and had
> > a range between [0-1)
> > During this period, the backed 600 suffered and ended up with spinlock
> stuck condition.
>
> This is a very vague description of a test procedure. If you provide a
> reproducible series of steps that causes a stuck spinlock, I imagine
> everyone will be on board with doing something about it. But this
> graph is not going to convince anyone of anything.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>

Re: Table AM Interface Enhancements

2024-04-10 Thread Andres Freund

Hi,

On 2024-04-08 14:54:46 -0400, Robert Haas wrote:
> Exactly how much is getting reverted here? I see these, all since March 23rd:

IMO:

> dd1f6b0c17 Provide a way block-level table AMs could re-use
> acquire_sample_rows()

Should be reverted.

> 9bd99f4c26 Custom reloptions for table AM

Hm. There are some oddities here:

- It doesn't seem great that relcache.c now needs to know about the default
  values for all kinds of reloptions.

- why is there table_reloptions() and tableam_reloptions()?

- Why does extractRelOptions() need a TableAmRoutine parameter, extracted by a
  caller, instead of doing that work itself?

> 97ce821e3e Fix the parameters order for
> TableAmRoutine.relation_copy_for_cluster()

Shouldn't be, this is a clear fix.

> b1484a3f19 Let table AM insertion methods control index insertion

I'm not sure. I'm not convinced this is right, nor the opposite. If the
tableam takes control of index insertion, shouldn't nodeModifyTuple know this
earlier, so it doesn't prepare a bunch of index insertion state?  Also,
there's pretty much no motivating explanation in the commit.

> 27bc1772fc Generalize relation analyze in table AM interface

Should be reverted.

> 87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()

Strongly suspect this should be reverted. The last time this was committed it
was far from ready. It's very easy to cause corruption due to subtle bugs in
this area.

> c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot

If the AM returns a different slot, who is responsible for cleaning it up? And
how is creating a new slot for every insert not going to be a measurable
overhead?

> 02eb07ea89 Allow table AM to store complex data structures in rd_amcache

I am doubtful this is right.  Is it really sufficient to have a callback for
freeing? What happens when relcache entries are swapped as part of a rebuild?
That works for "flat" caches, but I don't immediately see how it works for
more complicated datastructures.  At least from the commit message it's hard
to evaluate how this actually intended to be used.

Greetings,

Andres Freund

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Tom Lane

I wrote:
> I don't think it's correct to re-initialize the SpinDelayStatus each
> time around the outer loop.  That state should persist through the
> entire acquire operation, as it does in a regular spinlock acquire.
> As this stands, it resets the delay to minimum each time around the
> outer loop, and I bet it is that behavior not the RNG that's to blame
> for what he's seeing.

After thinking about this some more, it is fairly clear that that *is*
a mistake that can cause a thundering-herd problem.  Assume we have
two or more backends waiting in perform_spin_delay, and for whatever
reason the scheduler wakes them up simultaneously.  They see the
LW_FLAG_LOCKED bit clear (because whoever had the lock when they went
to sleep is long gone) and iterate the outer loop.  One gets the lock;
the rest go back to sleep.  And they will all sleep exactly
MIN_DELAY_USEC, because they all reset their SpinDelayStatus.
Lather, rinse, repeat.  If the s_lock code were being used as
intended, they would soon acquire different cur_delay settings;
but that never happens.  That is not the RNG's fault.

So I think we need something like the attached.

regards, tom lane

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index b1e388dc7c..54c84b9d67 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -857,46 +857,49 @@ static void
 LWLockWaitListLock(LWLock *lock)
 {
 	uint32		old_state;
-#ifdef LWLOCK_STATS
-	lwlock_stats *lwstats;
-	uint32		delays = 0;
 
-	lwstats = get_lwlock_stats_entry(lock);
-#endif
+	/*
+	 * Try once to acquire the lock, before we go to the trouble of setting up
+	 * state for the spin loop.
+	 */
+	old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
+	if (!(old_state & LW_FLAG_LOCKED))
+		return;	/* got lock */
 
-	while (true)
 	{
-		/* always try once to acquire lock directly */
-		old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
-		if (!(old_state & LW_FLAG_LOCKED))
-			break;/* got lock */
+		SpinDelayStatus delayStatus;
+#ifdef LWLOCK_STATS
+		lwlock_stats *lwstats = get_lwlock_stats_entry(lock);
+#endif
 
-		/* and then spin without atomic operations until lock is released */
-		{
-			SpinDelayStatus delayStatus;
+		init_local_spin_delay(&delayStatus);
 
-			init_local_spin_delay(&delayStatus);
+		while (true)
+		{
+			/* try again to acquire lock directly */
+			old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
+			if (!(old_state & LW_FLAG_LOCKED))
+break;			/* got lock */
 
+			/* spin without atomic operations until lock is released */
 			while (old_state & LW_FLAG_LOCKED)
 			{
 perform_spin_delay(&delayStatus);
 old_state = pg_atomic_read_u32(&lock->state);
 			}
-#ifdef LWLOCK_STATS
-			delays += delayStatus.delays;
-#endif
-			finish_spin_delay(&delayStatus);
+
+			/*
+			 * Retry. The lock might obviously already be re-acquired by the
+			 * time we're attempting to get it again.
+			 */
 		}
 
-		/*
-		 * Retry. The lock might obviously already be re-acquired by the time
-		 * we're attempting to get it again.
-		 */
-	}
+		finish_spin_delay(&delayStatus);
 
 #ifdef LWLOCK_STATS
-	lwstats->spin_delay_count += delays;
+		lwstats->spin_delay_count += delayStatus.delays;
 #endif
+	}
 }
 
 /*

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Andres Freund

Hi,

On 2024-04-10 09:48:42 -0700, Parag Paul wrote:
> Yes, the probability of this happening is astronomical, but in production
> with 128 core servers with 7000 max_connections, with petabyte scale data,
> this did repro 2 times in the last month. We had to move to a local
> approach to manager our ratelimiting counters.

What version of PG was this?  I think it's much more likely that you're
hitting a bug that caused a lot more contention inside lwlocks. That was fixed
for 16+ in a4adc31f690 on 2022-11-20, but only backpatched to 12-15 on
2024-01-18.

Greetings,

Andres Freund

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Andres Freund

Hi,

On 2024-04-10 12:28:10 -0400, Tom Lane wrote:
> Actually ... Parag mentioned that this was specifically about
> lwlock.c's usage of spinlocks.  It doesn't really use a spinlock,
> but it does use s_lock.c's delay logic, and I think it's got the
> usage pattern wrong:
> 
> while (true)
> {
> /* always try once to acquire lock directly */
> old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
> if (!(old_state & LW_FLAG_LOCKED))
> break;/* got lock */
> 
> /* and then spin without atomic operations until lock is released */
> {
> SpinDelayStatus delayStatus;
> 
> init_local_spin_delay(&delayStatus);
> 
> while (old_state & LW_FLAG_LOCKED)
> {
> perform_spin_delay(&delayStatus);
> old_state = pg_atomic_read_u32(&lock->state);
> }
> #ifdef LWLOCK_STATS
> delays += delayStatus.delays;
> #endif
> finish_spin_delay(&delayStatus);
> }
> 
> /*
>  * Retry. The lock might obviously already be re-acquired by the time
>  * we're attempting to get it again.
>  */
> }
> 
> I don't think it's correct to re-initialize the SpinDelayStatus each
> time around the outer loop.  That state should persist through the
> entire acquire operation, as it does in a regular spinlock acquire.
> As this stands, it resets the delay to minimum each time around the
> outer loop, and I bet it is that behavior not the RNG that's to blame
> for what he's seeing.

Hm, yea, that's not right. Normally this shouldn't be heavily contended enough
to matter.  I don't think just pulling out the spin delay would be the right
thing though, because that'd mean we'd initialize it even in the much much
more common case of there not being any contention.  I think we'll have to add
a separate fetch_or before the outer loop.


> (One should still wonder what is the LWLock usage pattern that is
> causing this spot to become so heavily contended.)

My suspicion is that it's a4adc31f690 which we only recently backpatched to
< 16.

Greetings,

Andres Freund

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Tom Lane

Parag Paul  writes:
> Yes, the probability of this happening is astronomical, but in production
> with 128 core servers with 7000 max_connections, with petabyte scale data,
> this did repro 2 times in the last month. We had to move to a local
> approach to manager our ratelimiting counters.
> This is not reproducible very easily. I feel that we should at least shield
> ourselves with the following change, so that we at least increase the delay
> by 1000us every time. We will follow a linear back off, but better than no
> backoff.

I still say you are proposing to band-aid the wrong thing.  Moreover:

* the proposed patch will cause the first few cur_delay values to grow
much faster than before, with direct performance impact to everyone,
whether they are on 128-core servers or not;

* if we are in a regime where xoroshiro repeatedly returns zero
across multiple backends, your patch doesn't improve the situation
AFAICS, because the backends will still choose the same series
of cur_delay values and thus continue to exhibit thundering-herd
behavior.  Indeed, as coded I think the patch makes it *more*
likely that the same series of cur_delay values would be chosen
by multiple backends.

regards, tom lane

Re: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Jeff Davis

On Wed, 2024-04-10 at 08:30 +0300, Heikki Linnakangas wrote:
> My #1 choice would be to write a patch to switch the pairing heap, 
> performance test that, and revert the binary heap changes.

Sounds good to me. I would expect it to perform better than the extra
hash table, if anything.

It also has the advantage that we don't change the API for binaryheap
in 17.

Regards,
Jeff Davis

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Tom Lane

Andres Freund  writes:
> Hi,
> On 2024-04-10 12:28:10 -0400, Tom Lane wrote:
>> I don't think it's correct to re-initialize the SpinDelayStatus each
>> time around the outer loop.  That state should persist through the
>> entire acquire operation, as it does in a regular spinlock acquire.
>> As this stands, it resets the delay to minimum each time around the
>> outer loop, and I bet it is that behavior not the RNG that's to blame
>> for what he's seeing.

> Hm, yea, that's not right. Normally this shouldn't be heavily contended enough
> to matter.  I don't think just pulling out the spin delay would be the right
> thing though, because that'd mean we'd initialize it even in the much much
> more common case of there not being any contention.  I think we'll have to add
> a separate fetch_or before the outer loop.

Agreed, and I did that in my draft patch.  AFAICS we can also avoid
the LWLOCK_STATS overhead if the initial attempt succeeds, not that
that is something you'd really be worried about.

>> (One should still wonder what is the LWLock usage pattern that is
>> causing this spot to become so heavily contended.)

> My suspicion is that it's a4adc31f690 which we only recently backpatched to
> < 16.

Seems like a plausible theory.

regards, tom lane

Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-10 Thread Dmitry Koval


Hi!

Alexander Korotkov, thanks for the commit of previous fix.
Alexander Lakhin, thanks for the problem you found.

There are two corrections attached to the letter:

1) v1-0001-Fix-for-SPLIT-MERGE-partitions-of-temporary-table.patch - fix 
for the problem [1].


2) v1-0002-Fixes-for-english-text.patch - fixes for English text 
(comments, error messages etc.).


Links:
[1] 
https://www.postgresql.org/message-id/dbc8b96c-3cf0-d1ee-860d-0e491da20485%40gmail.com


--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.comFrom 3c443a57c334c74e9218fd4e2f1ced45e6d4141d Mon Sep 17 00:00:00 2001
From: Koval Dmitry 
Date: Wed, 10 Apr 2024 18:54:05 +0300
Subject: [PATCH v1 1/2] Fix for SPLIT/MERGE partitions of temporary table

---
 src/backend/commands/tablecmds.c  | 11 ---
 src/backend/parser/parse_utilcmd.c|  2 +-
 src/test/regress/expected/partition_merge.out | 32 +--
 src/test/regress/expected/partition_split.out | 29 +
 src/test/regress/sql/partition_merge.sql  | 24 +-
 src/test/regress/sql/partition_split.sql  | 23 +
 6 files changed, 113 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8a98a0af48..3da9f6389d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -21145,17 +21145,19 @@ moveSplitTableRows(Relation rel, Relation splitRel, 
List *partlist, List *newPar
  * createPartitionTable: create table for a new partition with given name
  * (newPartName) like table (modelRelName)
  *
- * Emulates command: CREATE TABLE  (LIKE 
+ * Emulates command: CREATE [TEMP] TABLE  (LIKE 
  * INCLUDING ALL EXCLUDING INDEXES EXCLUDING IDENTITY)
  */
 static void
-createPartitionTable(RangeVar *newPartName, RangeVar *modelRelName,
+createPartitionTable(Relation rel, RangeVar *newPartName, RangeVar 
*modelRelName,
 AlterTableUtilityContext *context)
 {
CreateStmt *createStmt;
TableLikeClause *tlc;
PlannedStmt *wrapper;
 
+   newPartName->relpersistence = rel->rd_rel->relpersistence;
+
createStmt = makeNode(CreateStmt);
createStmt->relation = newPartName;
createStmt->tableElts = NIL;
@@ -21291,7 +21293,7 @@ ATExecSplitPartition(List **wqueue, AlteredTableInfo 
*tab, Relation rel,
SinglePartitionSpec *sps = (SinglePartitionSpec *) 
lfirst(listptr);
RelationnewPartRel;
 
-   createPartitionTable(sps->name, parentName, context);
+   createPartitionTable(rel, sps->name, parentName, context);
 
/* Open the new partition and acquire exclusive lock on it. */
newPartRel = table_openrv(sps->name, AccessExclusiveLock);
@@ -21491,7 +21493,8 @@ ATExecMergePartitions(List **wqueue, AlteredTableInfo 
*tab, Relation rel,
mergePartName = 
makeRangeVar(get_namespace_name(RelationGetNamespace(rel)),
 
tmpRelName, -1);
}
-   createPartitionTable(mergePartName,
+   createPartitionTable(rel,
+mergePartName,
 
makeRangeVar(get_namespace_name(RelationGetNamespace(rel)),
  
RelationGetRelationName(rel), -1),
 context);
diff --git a/src/backend/parser/parse_utilcmd.c 
b/src/backend/parser/parse_utilcmd.c
index ceba069905..ef1a2a97c0 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -3430,7 +3430,7 @@ checkPartition(Relation rel, Oid partRelOid)
if (partRel->rd_rel->relkind != RELKIND_RELATION)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-errmsg("\"%s\" is not a table",
+errmsg("\"%s\" is not a ordinary table",

RelationGetRelationName(partRel;
 
if (!partRel->rd_rel->relispartition)
diff --git a/src/test/regress/expected/partition_merge.out 
b/src/test/regress/expected/partition_merge.out
index 60eacf6bf3..c69a717aaa 100644
--- a/src/test/regress/expected/partition_merge.out
+++ b/src/test/regress/expected/partition_merge.out
@@ -25,9 +25,9 @@ ALTER TABLE sales_range MERGE PARTITIONS (sales_feb2022, 
sales_mar2022, sales_fe
 ERROR:  partition with name "sales_feb2022" already used
 LINE 1: ...e MERGE PARTITIONS (sales_feb2022, sales_mar2022, sales_feb2...
  ^
--- ERROR:  "sales_apr2022" is not a table
+-- ERROR:  "sales_apr2022" is not a ordinary table
 ALTER TABLE sales_range MERGE PARTITIONS (sales_feb2022, sales_mar2022, 
sales_apr2022) INTO sales_

Re: Can't find not null constraint, but \d+ shows that

2024-04-10 Thread Alvaro Herrera

On 2024-Apr-10, Alvaro Herrera wrote:

> One thing missing here is pg_dump support.  If you just dump this table,
> it'll end up with no constraint at all.  That's obviously bad, so I
> propose we have pg_dump add a regular NOT NULL constraint for those, to
> avoid perpetuating the weird situation further.

Here's another crude patchset, this time including the pg_dump aspect.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"On the other flipper, one wrong move and we're Fatal Exceptions"
(T.U.X.: Term Unit X  - http://www.thelinuxreview.com/TUX/)
>From 3a00358847f1792c01e608909dc8a4a0edb8739c Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 10 Apr 2024 13:02:35 +0200
Subject: [PATCH v2 1/2] Handle ALTER .. DROP NOT NULL when no pg_constraint
 row exists

---
 src/backend/commands/tablecmds.c  | 67 +++
 src/test/regress/expected/constraints.out | 55 +++
 src/test/regress/sql/constraints.sql  | 29 ++
 3 files changed, 139 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 865c6331c1..f692001e55 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -448,6 +448,7 @@ static ObjectAddress ATExecDropNotNull(Relation rel, const char *colName, bool r
 	   LOCKMODE lockmode);
 static bool set_attnotnull(List **wqueue, Relation rel,
 		   AttrNumber attnum, bool recurse, LOCKMODE lockmode);
+static void drop_orphaned_notnull(Oid relationOid, AttrNumber attnum);
 static ObjectAddress ATExecSetNotNull(List **wqueue, Relation rel,
 	  char *constrname, char *colName,
 	  bool recurse, bool recursing,
@@ -7678,17 +7679,23 @@ ATExecDropNotNull(Relation rel, const char *colName, bool recurse,
 	}
 
 	/*
-	 * Find the constraint that makes this column NOT NULL.
+	 * Find the constraint that makes this column NOT NULL, and drop it if we
+	 * see one.  dropconstraint_internal() will do necessary consistency
+	 * checking.  If there isn't one, there are two possibilities: either the
+	 * column is marked attnotnull because it's part of the primary key, and
+	 * then we just throw an appropriate error; or it's a leftover marking that
+	 * we can remove.  However, before doing the latter, to avoid breaking
+	 * consistency any further, prevent this if the column is part of the
+	 * replica identity.
 	 */
 	conTup = findNotNullConstraint(RelationGetRelid(rel), colName);
 	if (conTup == NULL)
 	{
 		Bitmapset  *pkcols;
+		Bitmapset  *ircols;
 
 		/*
-		 * There's no not-null constraint, so throw an error.  If the column
-		 * is in a primary key, we can throw a specific error.  Otherwise,
-		 * this is unexpected.
+		 * If the column is in a primary key, throw a specific error message.
 		 */
 		pkcols = RelationGetIndexAttrBitmap(rel, INDEX_ATTR_BITMAP_PRIMARY_KEY);
 		if (bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber,
@@ -7697,16 +7704,25 @@ ATExecDropNotNull(Relation rel, const char *colName, bool recurse,
 	errcode(ERRCODE_INVALID_TABLE_DEFINITION),
 	errmsg("column \"%s\" is in a primary key", colName));
 
-		/* this shouldn't happen */
-		elog(ERROR, "could not find not-null constraint on column \"%s\", relation \"%s\"",
-			 colName, RelationGetRelationName(rel));
+		/* Also throw an error if the column is in the replica identity */
+		ircols = RelationGetIndexAttrBitmap(rel, INDEX_ATTR_BITMAP_IDENTITY_KEY);
+		if (bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber, ircols))
+			ereport(ERROR,
+	errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+	errmsg("column \"%s\" is in index used as replica identity",
+		   get_attname(RelationGetRelid(rel), attnum, false)));
+
+		/* Otherwise, just remove the attnotnull marking and do nothing else. */
+		drop_orphaned_notnull(RelationGetRelid(rel), attnum);
 	}
+	else
+	{
+		readyRels = NIL;
+		dropconstraint_internal(rel, conTup, DROP_RESTRICT, recurse, false,
+false, &readyRels, lockmode);
 
-	readyRels = NIL;
-	dropconstraint_internal(rel, conTup, DROP_RESTRICT, recurse, false,
-			false, &readyRels, lockmode);
-
-	heap_freetuple(conTup);
+		heap_freetuple(conTup);
+	}
 
 	InvokeObjectPostAlterHook(RelationRelationId,
 			  RelationGetRelid(rel), attnum);
@@ -7796,6 +7812,33 @@ set_attnotnull(List **wqueue, Relation rel, AttrNumber attnum, bool recurse,
 	return retval;
 }
 
+/*
+ * In rare cases, a column can end up with an "orphaned" attnotnull marking
+ * with no corresponding pg_constraint row.  If the user then does ALTER TABLE
+ * DROP NOT NULL, this takes care of resetting that.
+ */
+static void
+drop_orphaned_notnull(Oid relationOid, AttrNumber attnum)
+{
+	Relation	attr_rel;
+	HeapTuple	tuple;
+	Form_pg_attribute attForm;
+
+	attr_rel = table_open(AttributeRelationId, RowExclusiveLock);
+
+	tuple = SearchSysCacheCopyAttNum(relationOid, attnum);
+	if (!HeapTupleIsValid(tuple))
+		elog(ERROR, "cache

Re: Table AM Interface Enhancements

2024-04-10 Thread Robert Haas

On Wed, Apr 10, 2024 at 12:36 PM Alexander Korotkov
 wrote:
> But I have to mention that even that I've committed table AM stuff
> close to the FF, there has been quite amount of depended work
> committed.  So, revert of these patches is promising to be not
> something immediate and easy, which requires just the decision.  It
> would touch others work.  And and revert patches might also need
> review.  I get the point that patches got lack of consensus.  But in
> terms of efforts (not my efforts) it's probably makes sense to get
> them some post-commit review.

That is somewhat fair, but it is also a lot of work. There are
multiple people asking for you to revert things on multiple threads,
and figuring out all of the revert requests and trying to come to some
consensus about what should be done in each case is going to take an
enormous amount of time. I know you've done lots of good work on
PostgreSQL in the past and I respect that, but I think you also have
to realize that you're asking other people to spend a LOT of time
figuring out what to do about the current situation. I see Andres has
posted more specifically about what he thinks should happen to each of
the table AM patches and I am willing to defer to his opinion, but we
need to make some quick decisions here to either keep things or take
them out. Extensive reworks after feature freeze should not be an
option that is on the table; that's what makes it a freeze.

I also do not think I really believe that there's been so much stuff
committed that a blanket revert would be all that hard to carry off,
if that were the option that the community ended up preferring.

> Robert, look.  Last year I went through the arrest for expressing my
> opinion.  I that was not what normal arrest should look like, but a
> period of survival.  My family went through a period of fear, struggle
> and uncertainty.  Now, we're healthy and safe, but there is still
> uncertainty given asylum seeker status.  During all this period, I
> have to just obey, agree with everything, lie that I apologize about
> things I don't apologize.  I had to do this, because the price of
> expressing myself was not just my life, but also health, freedom and
> well-being of my family.
>
> I owe you great respect for all your work for PostgreSQL, and
> especially for your efforts on getting things organized.  But it
> wouldn't work the way you increase my potential punishment and I just
> say that I'm obey and you're right about everything.  You may even
> initiate the procedure of my exclusion from committers (no idea what
> the procedure is), ban from the list etc.  I see you express many
> valuable points, but my view is not exactly same as yours.  And like a
> conclusion to some as result of discussion not threats.
>
> I feel the sense of blame and fear in latest discussions, and I don't
> like it.  That's OK to place the blame from time to time.  But I would
> like to add here more joy and respect (and I'm sorry I personally
> didn't do enough in this matter).  It's important get things right
> etc.  But in long term relationships may mean more.

I am not sure how to respond to this. On a personal level, I am sorry
to hear that you were arrested and, if I can be of some help to you,
we can discuss that off-list. However, if you're suggesting that there
is some kind of equivalence between me criticizing your decisions
about what to commit and someone in a position of authority putting
you in jail, well, I don't think it's remotely fair to compare those
things.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: psql: Greatly speed up "\d tablename" when not using regexes

2024-04-10 Thread Greg Sabino Mullane

Patch looks good to me. Great idea overall, that forced regex has always
bugged me.

+   char   *regexChars = "|*+?()[]{}.^$\\";


One super minor optimization is that we technically do not need to scan for
')' and ']'. If they appear without their partner, the query will fail
anyway. :)
Cheers,
Greg

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Andres Freund

Hi,

On 2024-04-10 13:03:05 -0400, Tom Lane wrote:
> After thinking about this some more, it is fairly clear that that *is*
> a mistake that can cause a thundering-herd problem.

> Assume we have two or more backends waiting in perform_spin_delay, and for
> whatever reason the scheduler wakes them up simultaneously.

That's not really possible, at least not repeatably. Multiple processes
obviously can't be scheduled concurrently on one CPU and scheduling something
on another core entails interrupting that CPU with an inter processor
interrupt or that other CPU scheduling on its own, without coordination.

That obviously isn't a reason to not fix the delay logic in lwlock.c.

Looks like the wrong logic was introduced by me in

commit 008608b9d51061b1f598c197477b3dc7be9c4a64
Author: Andres Freund 
Date:   2016-04-10 20:12:32 -0700

Avoid the use of a separate spinlock to protect a LWLock's wait queue.

Likely because I was trying to avoid the overhead of init_local_spin_delay(),
without duplicating the few lines to acquire the "spinlock".

> So I think we need something like the attached.

LGTM.

I think it might be worth breaking LWLockWaitListLock() into two pieces, a
fastpath to be inlined into a caller, and a slowpath, but that's separate work
from a bugfix.

I looked around and the other uses of init_local_spin_delay() look correct
from this angle. However LockBufHdr() is more expensive than it needs to be,
because it always initializes SpinDelayStatus. IIRC I've seen that show up in
profiles before, but never got around to writing a nice-enough patch.  But
that's also something separate from a bugfix.

Greetings,

Andres Freund

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Tom Lane

Andres Freund  writes:
> On 2024-04-10 13:03:05 -0400, Tom Lane wrote:
>> So I think we need something like the attached.

> LGTM.

On third thought ... while I still think this is a misuse of
perform_spin_delay and we should change it, I'm not sure it'll do
anything to address Parag's problem, because IIUC he's seeing actual
"stuck spinlock" reports.  That implies that the inner loop of
LWLockWaitListLock slept NUM_DELAYS times without ever seeing
LW_FLAG_LOCKED clear.  What I'm suggesting would change the triggering
condition to "NUM_DELAYS sleeps without acquiring the lock", which is
strictly more likely to happen, so it's not going to help him.  It's
certainly still well out in we-shouldn't-get-there territory, though.

Also, fooling around with the cur_delay adjustment doesn't affect
this at all: "stuck spinlock" is still going to be raised after
NUM_DELAYS failures to observe the lock clear or obtain the lock.
Increasing cur_delay won't change that, it'll just spread the
fixed number of attempts over a longer period; and there's no
reason to believe that does anything except make it take longer
to fail.  Per the header comment for s_lock.c:

 * We time out and declare error after NUM_DELAYS delays (thus, exactly
 * that many tries).  With the given settings, this will usually take 2 or
 * so minutes.  It seems better to fix the total number of tries (and thus
 * the probability of unintended failure) than to fix the total time
 * spent.

If you believe that waiting processes can be awakened close enough to
simultaneously to hit the behavior I posited earlier, then encouraging
them to have different cur_delay values will help; but Andres doesn't
believe it and I concede it seems like a stretch.

So I think fooling with the details in s_lock.c is pretty much beside
the point.  The most likely bet is that Parag's getting bit by the
bug fixed in a4adc31f690.  It's possible he's seeing the effect of
some different issue that causes lwlock.c to hold that lock a long
time at scale, but that's where I'd look first.

regards, tom lane

Re: psql: Greatly speed up "\d tablename" when not using regexes

2024-04-10 Thread Tom Lane

Jelte Fennema-Nio  writes:
> Running "\d tablename" from psql could take multiple seconds when
> running on a system with 100k+ tables. The reason for this was that
> a sequence scan on pg_class takes place, due to regex matching being
> used.

> Regex matching is obviously unnecessary when we're looking for an exact
> match. This checks for this (common) case and starts using plain
> equality in that case.

Really?  ISTM this argument is ignoring an optimization the backend
has understood for a long time.

regression=# explain select * from pg_class where relname ~ '^foo$';
 QUERY PLAN 
 

-
 Index Scan using pg_class_relname_nsp_index on pg_class  (cost=0.28..8.30 rows=
1 width=739)
   Index Cond: (relname = 'foo'::text)
   Filter: (relname ~ '^foo$'::text)
(3 rows)

regards, tom lane

Fix possible dereference null pointer (src/backend/replication/logical/reorderbuffer.c)

2024-04-10 Thread Ranier Vilela

Hi,

Per Coverity.

The function ReorderBufferTXNByXid,
can return NULL when the parameter *create* is false.

In the functions ReorderBufferSetBaseSnapshot
and ReorderBufferXidHasBaseSnapshot,
the second call to ReorderBufferTXNByXid,
pass false to *create* argument.

In the function ReorderBufferSetBaseSnapshot,
fixed passing true as argument to always return
a valid ReorderBufferTXN pointer.

In the function ReorderBufferXidHasBaseSnapshot,
fixed by checking if the pointer is NULL.

best regards,
Ranier Vilela


fix-possible-dereference-null-pointer-reorderbuffer.patch
Description: Binary data

Re: Table AM Interface Enhancements

2024-04-10 Thread Peter Geoghegan

On Wed, Apr 10, 2024 at 1:25 PM Robert Haas  wrote:
> That is somewhat fair, but it is also a lot of work. There are
> multiple people asking for you to revert things on multiple threads,
> and figuring out all of the revert requests and trying to come to some
> consensus about what should be done in each case is going to take an
> enormous amount of time. I know you've done lots of good work on
> PostgreSQL in the past and I respect that, but I think you also have
> to realize that you're asking other people to spend a LOT of time
> figuring out what to do about the current situation. I see Andres has
> posted more specifically about what he thinks should happen to each of
> the table AM patches and I am willing to defer to his opinion, but we
> need to make some quick decisions here to either keep things or take
> them out. Extensive reworks after feature freeze should not be an
> option that is on the table; that's what makes it a freeze.

Alexander has been sharply criticized for acting in haste, pushing
work in multiple areas when it was clearly not ready. And that seems
proportionate to me. I agree that he showed poor judgement in the past
few months, and especially in the past few weeks. Not just on one
occasion, but on several. That must have consequences.

> I also do not think I really believe that there's been so much stuff
> committed that a blanket revert would be all that hard to carry off,
> if that were the option that the community ended up preferring.

It seems to me that emotions are running high right now. I think that
it would be a mistake to act in haste when determining next steps.
It's very important, but it's not very urgent.

I've known Alexander for about 15 years. I think that he deserves some
consideration here. Say a week or two, to work through some of the
more complicated issues -- and to take a breather. I just don't see
any upside to rushing through this process, given where we are now.

-- 
Peter Geoghegan

Re: Detoasting optionally to make Explain-Analyze less misleading

2024-04-10 Thread stepan rutz


First of all thanks for bringing this Feature to PostgreSQL. From a
regular-user perspective (not everyone is a Pro) it is very misleading
that ANALYZE doesn't do what it suggests it does. To run the query into
some kind of /dev/null type of destination is feasible and that is what
people end up doing after they have fallen into the "de-toasting" trap.

Having SERIALIZE is a great improvement for certain. When I said that
SERIALIZE should be the default, then this came mostly out of very
surprising subjective experiences in the past. Turning it on certainly
alters some existing benchmarks and timings. That is destructive in a
way and would destroy some existing work and measures. I lack the
overall understanding of the consequences, so please don't follow this
(emotional) advice.

So thanks again! and this will really help a lot of people. The people
actually bothering with EXPLAIN options are likely to explore the
documentation and now have a hint about this pitfall. The EXPLAIN part
of PostgreSQL "feels" a lot better now.

I appreciate all of your work on this issue, which came up without being
on some kind of plan and of course for the overall work on PostgreSQL.

/Stepan

On 4/10/24 15:57, Tom Lane wrote:


Matthias van de Meent  writes:

Upthread at [0], Stepan mentioned that we should default to SERIALIZE
when ANALYZE is enabled. I suspect a patch in that direction would
primarily contain updates in the test plan outputs, but I've not yet
worked on that.
Does anyone else have a strong opinion for or against adding SERIALIZE
to the default set of explain features enabled with ANALYZE?

I am 100% dead set against that, because it would silently render
EXPLAIN outputs from different versions quite non-comparable.

regards, tom lane

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Tom Lane

Robert Haas  writes:
> The blog post to which Parag linked includes this histogram as an
> example of a low-Hamming-weight situation:

That's an interesting post indeed, but I'm not sure how relevant
it is to us, because it is about Xoshiro not Xoroshiro, and the
latter is what we use.  The last sentence of the blog post is

In this case, a mix of practice and theory has shown that the
structure of Xoshiro's state space is poorer than that of many
competing generation schemes, including Blackman's gjrand and
perhaps even Vigna and Blackman's earlier Xoroshiro scheme (which
has smaller zeroland expanses and does not appear to have similar
close-repeat problems), and its output functions are unable to
completely conceal these issues.

So while pg_prng.c might have the issue posited here, this blog
post is not evidence for that, and indeed might be evidence
against it.  Someone would have to do similar analysis on the
code we *actually* use to convince me that we need to worry.

regards, tom lane

Re: psql: Greatly speed up "\d tablename" when not using regexes

2024-04-10 Thread Kirill Reshke

Hi


> Regex matching is obviously unnecessary when we're looking for an exact
> match. This checks for this (common) case and starts using plain
> equality in that case.

+1

> + appendPQExpBuffer(buf, "(%s OPERATOR(pg_catalog.=) ", namevar);
> + appendStringLiteralConn(buf, &namebuf.data[2], conn);
> + appendPQExpBuffer(buf, "\nOR %s OPERATOR(pg_catalog.=) ",
> +  altnamevar);
> + appendStringLiteralConn(buf, &namebuf.data[2], conn);
> + appendPQExpBufferStr(buf, ")\n");

Do we need to force Collaction here like in other branches?
if (PQserverVersion(conn) >= 12)
   appendPQExpBufferStr(buf, " COLLATE pg_catalog.default");

Re: Fix resource leak (src/backend/libpq/be-secure-common.c)

2024-04-10 Thread Ranier Vilela

Em ter., 2 de abr. de 2024 às 15:31, Daniel Gustafsson 
escreveu:

> > On 2 Apr 2024, at 20:13, Ranier Vilela  wrote:
>
> > Fix by freeing the pointer, like pclose_check (src/common/exec.c)
> similar case.
>
> Off the cuff, seems reasonable when loglevel is LOG.
>

Per Coverity.

Another case of resource leak, when loglevel is LOG.
In the function shell_archive_file (src/backend/archive/shell_archive.c)
The pointer *xlogarchcmd*  is not freed.

best regards,
Ranier Vilela

fix-resource-leak-shell_archive.patch
Description: Binary data

Re: psql: Greatly speed up "\d tablename" when not using regexes

2024-04-10 Thread Jelte Fennema-Nio

On Wed, 10 Apr 2024 at 20:06, Tom Lane  wrote:
> Really?  ISTM this argument is ignoring an optimization the backend
> has understood for a long time.

Interesting. I didn't know about that optimization. I can't check
right now, but probably the COLLATE breaks that optimization.

Re: Fix resource leak (src/backend/libpq/be-secure-common.c)

2024-04-10 Thread Daniel Gustafsson

> On 10 Apr 2024, at 20:31, Ranier Vilela  wrote:
> 
> Em ter., 2 de abr. de 2024 às 15:31, Daniel Gustafsson  > escreveu:
>> > On 2 Apr 2024, at 20:13, Ranier Vilela > > > wrote:
>> 
>> > Fix by freeing the pointer, like pclose_check (src/common/exec.c) similar 
>> > case.
>> 
>> Off the cuff, seems reasonable when loglevel is LOG.
> 
> Per Coverity.
> 
> Another case of resource leak, when loglevel is LOG.
> In the function shell_archive_file (src/backend/archive/shell_archive.c)
> The pointer *xlogarchcmd*  is not freed.

Thanks, I'll have a look.  I've left this for post-freeze on purpose to not
cause unnecessary rebasing.  Will take a look over the next few days unless
beaten to it.

--
Daniel Gustafsson

Re: psql: Greatly speed up "\d tablename" when not using regexes

2024-04-10 Thread Jelte Fennema-Nio

On Wed, 10 Apr 2024 at 20:21, Kirill Reshke  wrote:
> Do we need to force Collaction here like in other branches?
> if (PQserverVersion(conn) >= 12)
>appendPQExpBufferStr(buf, " COLLATE pg_catalog.default");

According to the commit and codecomment that introduced the COLLATE,
it was specifically added for correct regex matching (e.g. \w). So I
don't think it's necessary, and I'm pretty sure adding it will cause
the index scan not to be used anymore.

Re: psql: Greatly speed up "\d tablename" when not using regexes

2024-04-10 Thread Kirill Reshke

On Wed, 10 Apr 2024, 23:37 Jelte Fennema-Nio,  wrote:

> On Wed, 10 Apr 2024 at 20:21, Kirill Reshke 
> wrote:
> > Do we need to force Collaction here like in other branches?
> > if (PQserverVersion(conn) >= 12)
> >appendPQExpBufferStr(buf, " COLLATE pg_catalog.default");
>
> According to the commit and codecomment that introduced the COLLATE,
> it was specifically added for correct regex matching (e.g. \w). So I
> don't think it's necessary, and I'm pretty sure adding it will cause
> the index scan not to be used anymore.
>

Ok, thanks for the clarification. If all of this is actually true, and
patch is really does speedup, maybe we need to state this in the comments?

>

Re: broken JIT support on Fedora 40

2024-04-10 Thread Pavel Stehule

st 10. 4. 2024 v 2:44 odesílatel Thomas Munro 
napsal:

> On Tue, Apr 9, 2024 at 10:05 PM Dmitry Dolgov <9erthali...@gmail.com>
> wrote:
> > +   /* In assertion builds, run the LLVM verify pass. */
> > +#ifdef USE_ASSERT_CHECKING
> > +   LLVMPassBuilderOptionsSetVerifyEach(options, true);
> > +#endif
>
> Thanks, that seems nicer.  I think the question is whether it will
> slow down build farm/CI/local meson test runs to a degree that exceeds
> its value.  Another option would be to have some other opt-in macro,
> like the existing #ifdef LLVM_PASS_DEBUG, for people who maintain
> JIT-related stuff to turn on.
>
> Supposing we go with USE_ASSERT_CHECKING, I have another question:
>
> -   const char *nm = "llvm.lifetime.end.p0i8";
> +   const char *nm = "llvm.lifetime.end.p0";
>
> Was that a mistake, or did the mangling rules change in some version?
> I don't currently feel inclined to go and test this on the ancient
> versions we claim to support in back-branches.  Perhaps we should just
> do this in master, and then it'd be limited to worrying about LLVM
> versions 10-18 (see 820b5af7), which have the distinct advantage of
> being available in package repositories for testing.  Or I suppose we
> could back-patch, but only do it if LLVM_VERSION_MAJOR >= 10.  Or we
> could do it unconditionally, and wait for ancient-LLVM build farm
> animals to break if they're going to.
>
> I pushed the illegal attribute fix though.  Thanks for the detective work!
>
> (It crossed my mind that perhaps deform functions should have their
> own template function, but if someone figures out that that's a good
> idea, I think we'll *still* need that change just pushed.)
>

all tests passed on fc 40 without problems

Thank you

Pavel

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Andres Freund

Hi,

On 2024-04-10 14:02:20 -0400, Tom Lane wrote:
> On third thought ... while I still think this is a misuse of
> perform_spin_delay and we should change it, I'm not sure it'll do
> anything to address Parag's problem, because IIUC he's seeing actual
> "stuck spinlock" reports.  That implies that the inner loop of
> LWLockWaitListLock slept NUM_DELAYS times without ever seeing
> LW_FLAG_LOCKED clear.  What I'm suggesting would change the triggering
> condition to "NUM_DELAYS sleeps without acquiring the lock", which is
> strictly more likely to happen, so it's not going to help him.  It's
> certainly still well out in we-shouldn't-get-there territory, though.

I think it could exascerbate the issue. Parag reported ~7k connections on a
128 core machine. The buffer replacement logic in < 16 tries to lock the old
and new lock partitions at once. That can lead to quite bad "chains" of
dependent lwlocks, occasionally putting all the pressure on a single lwlock.
With 7k waiters on a single spinlock, higher frequency of wakeups will make it
much more likely that the process holding the spinlock will be put to sleep.

This is greatly exacerbated by the issue fixed in a4adc31f690, once the
waitqueue is long, the spinlock will be held for an extended amount of time.

Greetings,

Andres Freund

Re: Table AM Interface Enhancements

2024-04-10 Thread Bruce Momjian

On Wed, Apr 10, 2024 at 05:42:51PM +0400, Pavel Borisov wrote:
> Hi, Alexander!
> In my view, the actual list of what has raised discussion is:
> dd1f6b0c17 Provide a way block-level table AMs could re-use 
> acquire_sample_rows
> ()
> 27bc1772fc Generalize relation analyze in table AM interface
> 
> Proposals to revert the other patches in a wholesale way look to me like an
> ill-performed continuation of a discussion [1]. I can't believe that "Let's

For reference this disussion was:

I don't dispute that we could do better, and this is just a
simplistic look based on "number of commits per day", but the
attached does put it in perspective to some extent.

> select which commits close to FF looks worse than the others" based on
> whereabouts, not patch contents is a good and productive way for the community
> to use.

I don't know how you can say these patches are being questioned just
because they are near the feature freeze (FF).  There are clear
concerns, and post-feature freeze is not the time to be evaluating which
patches which were pushed in near feature freeze need help.

What is the huge rush for these patches, and if they were so important,
why was this not done earlier?  This can all wait until PG 18.  If
Supabase or someone else needs these patches for PG 17, they will need
to create a patched verison of PG 17 with these patches.

> At the same time if Andres, who is the most experienced person in the scope of
> access methods is willing to give his post-commit re-review of any of the
> committed patches and will recommend some of them reverted, it would be a good
> sensible input to act accordingly.
> patch 

So the patches were rushed, have problems, and now we are requiring
Andres to stop what he is doing to give immediate feedback --- that is
not fair to him.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.

Re: Table AM Interface Enhancements

2024-04-10 Thread Andres Freund

Hi,

On 2024-04-10 15:19:47 +0300, Alexander Korotkov wrote:
> On Mon, Apr 8, 2024 at 9:54 PM Robert Haas  wrote:
> > On Mon, Apr 8, 2024 at 12:33 PM Alexander Korotkov  
> > wrote:
> > > Yes, it was my mistake. I got rushing trying to fit this to FF, even 
> > > doing significant changes just before commit.
> > > I'll revert this later today.
>
> The patch to revert is attached.  Given that revert touches the work
> done in 041b96802e, I think it needs some feedback before push.

Hm. It's a bit annoying to revert it, you're right. I think on its own the
revert looks reasonable from what I've seen so far, will continue looking for
a bit.

I think we'll need to do some cleanup of 041b96802e separately afterwards -
possibly in 17, possibly in 18.  Particularly post-27bc1772fc8
acquire_sample_rows() was tied hard to heapam, so it made sense for 041b96802e
to create the stream in acquire_sample_rows() and have
block_sampling_read_stream_next() be in analyze.c. But eventually that should
be in access/heap/. Compared to 16, the state post the revert does tie
analyze.c a bit closer to the internals of the AM than before, but I'm not
sure the increase matters.

Greetings,

Andres Freund

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Tom Lane

Andres Freund  writes:
> I think it could exascerbate the issue. Parag reported ~7k connections on a
> 128 core machine. The buffer replacement logic in < 16 tries to lock the old
> and new lock partitions at once. That can lead to quite bad "chains" of
> dependent lwlocks, occasionally putting all the pressure on a single lwlock.
> With 7k waiters on a single spinlock, higher frequency of wakeups will make it
> much more likely that the process holding the spinlock will be put to sleep.
> This is greatly exacerbated by the issue fixed in a4adc31f690, once the
> waitqueue is long, the spinlock will be held for an extended amount of time.

Yeah.  So what's the conclusion?  Leave it alone?  Commit to
HEAD only?

regards, tom lane

Re: psql: Greatly speed up "\d tablename" when not using regexes

2024-04-10 Thread Tom Lane

Jelte Fennema-Nio  writes:
> On Wed, 10 Apr 2024 at 20:06, Tom Lane  wrote:
>> Really?  ISTM this argument is ignoring an optimization the backend
>> has understood for a long time.

> Interesting. I didn't know about that optimization. I can't check
> right now, but probably the COLLATE breaks that optimization.

Not for me.

# explain select * from pg_class where relname ~ '^(foo)$' collate "en_US";
 QUERY PLAN 
 
-
 Index Scan using pg_class_relname_nsp_index on pg_class  (cost=0.27..8.29 
rows=1 width=263)
   Index Cond: (relname = 'foo'::text)
   Filter: (relname ~ '^(foo)$'::text COLLATE "en_US")
(3 rows)

Also, using -E:

# \d foo
/ QUERY */
SELECT c.oid,
  n.nspname,
  c.relname
FROM pg_catalog.pg_class c
 LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname OPERATOR(pg_catalog.~) '^(foo)$' COLLATE pg_catalog.default
  AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 2, 3;
//

# explain SELECT c.oid,
  n.nspname,
  c.relname
FROM pg_catalog.pg_class c
 LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname OPERATOR(pg_catalog.~) '^(foo)$' COLLATE pg_catalog.default
  AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 2, 3;
QUERY PLAN  
  
--
 Sort  (cost=9.42..9.42 rows=1 width=132)
   Sort Key: n.nspname, c.relname
   ->  Nested Loop Left Join  (cost=0.27..9.41 rows=1 width=132)
 Join Filter: (n.oid = c.relnamespace)
 ->  Index Scan using pg_class_relname_nsp_index on pg_class c  
(cost=0.27..8.32 rows=1 width=72)
   Index Cond: (relname = 'foo'::text)
   Filter: ((relname ~ '^(foo)$'::text) AND 
pg_table_is_visible(oid))
 ->  Seq Scan on pg_namespace n  (cost=0.00..1.04 rows=4 width=68)
(8 rows)


There may be an argument for psql to do what you suggest,
but so far it seems like duplicative complication.

If there's a case you can demonstrate where "\d foo" doesn't optimize
into an indexscan, we should look into exactly why that's happening,
because I think the cause must be more subtle than this.

regards, tom lane

Re: broken JIT support on Fedora 40

2024-04-10 Thread Dmitry Dolgov

> On Wed, Apr 10, 2024 at 12:43:23PM +1200, Thomas Munro wrote:
> On Tue, Apr 9, 2024 at 10:05 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > +   /* In assertion builds, run the LLVM verify pass. */
> > +#ifdef USE_ASSERT_CHECKING
> > +   LLVMPassBuilderOptionsSetVerifyEach(options, true);
> > +#endif
>
> Thanks, that seems nicer.  I think the question is whether it will
> slow down build farm/CI/local meson test runs to a degree that exceeds
> its value.  Another option would be to have some other opt-in macro,
> like the existing #ifdef LLVM_PASS_DEBUG, for people who maintain
> JIT-related stuff to turn on.

Oh, I see. I'm afraid I don't have enough knowledge of the CI pipeline,
but at least locally for me installcheck became only few percent slower
with the verify pass.

> Supposing we go with USE_ASSERT_CHECKING, I have another question:
>
> -   const char *nm = "llvm.lifetime.end.p0i8";
> +   const char *nm = "llvm.lifetime.end.p0";
>
> Was that a mistake, or did the mangling rules change in some version?

I'm not sure, but inclined to think it's the same as with noundef -- a
mistake, which was revealed in some recent version of LLVM. From what I
understand the suffix i8 indicates an overloaded argument of that type,
which is probably not needed. At the same time I can't get this error
from the verify pass with LLVM-12 or LLVM-15 (I have those at hand by
accident). To make it even more confusing I've found a few similar
examples in other projects, where this was really triggered by an issue
in LLVM [1].

[1]: https://github.com/rust-lang/rust/issues/102738

Re: broken JIT support on Fedora 40

2024-04-10 Thread Andres Freund

Hi,

On 2024-04-10 22:15:27 +0200, Dmitry Dolgov wrote:
> > On Wed, Apr 10, 2024 at 12:43:23PM +1200, Thomas Munro wrote:
> > On Tue, Apr 9, 2024 at 10:05 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > > +   /* In assertion builds, run the LLVM verify pass. */
> > > +#ifdef USE_ASSERT_CHECKING
> > > +   LLVMPassBuilderOptionsSetVerifyEach(options, true);
> > > +#endif
> >
> > Thanks, that seems nicer.  I think the question is whether it will
> > slow down build farm/CI/local meson test runs to a degree that exceeds
> > its value.  Another option would be to have some other opt-in macro,
> > like the existing #ifdef LLVM_PASS_DEBUG, for people who maintain
> > JIT-related stuff to turn on.
> 
> Oh, I see. I'm afraid I don't have enough knowledge of the CI pipeline,
> but at least locally for me installcheck became only few percent slower
> with the verify pass.

I think it's worthwhile to add. It makes some problems so much easier to
find. And if you're building with debug enabled llvm, performance is so
atrocious anyway, that this isn't going to change the big picture...


> > Supposing we go with USE_ASSERT_CHECKING, I have another question:
> >
> > -   const char *nm = "llvm.lifetime.end.p0i8";
> > +   const char *nm = "llvm.lifetime.end.p0";
> >
> > Was that a mistake, or did the mangling rules change in some version?
> 
> I'm not sure, but inclined to think it's the same as with noundef -- a
> mistake, which was revealed in some recent version of LLVM. From what I
> understand the suffix i8 indicates an overloaded argument of that type,
> which is probably not needed. At the same time I can't get this error
> from the verify pass with LLVM-12 or LLVM-15 (I have those at hand by
> accident). To make it even more confusing I've found a few similar
> examples in other projects, where this was really triggered by an issue
> in LLVM [1].

I'm afraid that it actually has changed over time, I'm fairly sure that it was
required at some point.

Greetings,

Andres Freund

Re: Table AM Interface Enhancements

2024-04-10 Thread Melanie Plageman

On Wed, Apr 10, 2024 at 4:03 PM Andres Freund  wrote:
>
> Hi,
>
> On 2024-04-10 15:19:47 +0300, Alexander Korotkov wrote:
> > On Mon, Apr 8, 2024 at 9:54 PM Robert Haas  wrote:
> > > On Mon, Apr 8, 2024 at 12:33 PM Alexander Korotkov  
> > > wrote:
> > > > Yes, it was my mistake. I got rushing trying to fit this to FF, even 
> > > > doing significant changes just before commit.
> > > > I'll revert this later today.
> >
> > The patch to revert is attached.  Given that revert touches the work
> > done in 041b96802e, I think it needs some feedback before push.
>
> Hm. It's a bit annoying to revert it, you're right. I think on its own the
> revert looks reasonable from what I've seen so far, will continue looking for
> a bit.
>
> I think we'll need to do some cleanup of 041b96802e separately afterwards -
> possibly in 17, possibly in 18.  Particularly post-27bc1772fc8
> acquire_sample_rows() was tied hard to heapam, so it made sense for 041b96802e
> to create the stream in acquire_sample_rows() and have
> block_sampling_read_stream_next() be in analyze.c. But eventually that should
> be in access/heap/. Compared to 16, the state post the revert does tie
> analyze.c a bit closer to the internals of the AM than before, but I'm not
> sure the increase matters.

Yes in an earlier version of 041b96802e, I gave the review feedback
that the read stream should be pushed down into heap-specific code,
but then after 27bc1772fc8, Bilal took the approach of putting the
read stream code in acquire_sample_rows() since that was no longer
table AM-agnostic.

This thread has been moving pretty fast, so could someone point out
which version of the patch has the modifications to
acquire_sample_rows() that would be relevant for Bilal (and others
involved in analyze streaming read) to review? Is it
v1-0001-revert-Generalize-relation-analyze-in-table-AM-in.patch?

- Melanie

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Andres Freund

Hi,

On 2024-04-10 16:05:21 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I think it could exascerbate the issue. Parag reported ~7k connections on a
> > 128 core machine. The buffer replacement logic in < 16 tries to lock the old
> > and new lock partitions at once. That can lead to quite bad "chains" of
> > dependent lwlocks, occasionally putting all the pressure on a single lwlock.
> > With 7k waiters on a single spinlock, higher frequency of wakeups will make 
> > it
> > much more likely that the process holding the spinlock will be put to sleep.
> > This is greatly exacerbated by the issue fixed in a4adc31f690, once the
> > waitqueue is long, the spinlock will be held for an extended amount of time.
> 
> Yeah.  So what's the conclusion?  Leave it alone?  Commit to
> HEAD only?

I think we should certainly fix it. I don't really have an opinion about
backpatching, it's just on the line between the two for me.

Hm. The next set of releases is still a bit away, and this is one of the
period where HEAD is hopefully going to be more tested than usual, so I'd
perhaps very softly lean towards backpatching. There'd have to be some very
odd compiler behaviour to make it slower than before anyway.

Greetings,

Andres Freund

Re: Table AM Interface Enhancements

2024-04-10 Thread Andres Freund

Hi,

On 2024-04-10 16:24:40 -0400, Melanie Plageman wrote:
> This thread has been moving pretty fast, so could someone point out
> which version of the patch has the modifications to
> acquire_sample_rows() that would be relevant for Bilal (and others
> involved in analyze streaming read) to review? Is it
> v1-0001-revert-Generalize-relation-analyze-in-table-AM-in.patch?

I think so. It's at least what I've been looking at.

Greetings,

Andres Freund

Re: Issue with the PRNG used by Postgres

2024-04-10 Thread Tom Lane

Andres Freund  writes:
> On 2024-04-10 16:05:21 -0400, Tom Lane wrote:
>> Yeah.  So what's the conclusion?  Leave it alone?  Commit to
>> HEAD only?

> I think we should certainly fix it. I don't really have an opinion about
> backpatching, it's just on the line between the two for me.
> Hm. The next set of releases is still a bit away, and this is one of the
> period where HEAD is hopefully going to be more tested than usual, so I'd
> perhaps very softly lean towards backpatching. There'd have to be some very
> odd compiler behaviour to make it slower than before anyway.

I'm not worried about it being slower, but about whether it could
report "stuck spinlock" in cases where the existing code succeeds.
While that seems at least theoretically possible, it seems like
if you hit it you have got problems that need to be fixed anyway.
Nonetheless, I'm kind of leaning to not back-patching.  I do agree
on getting it into HEAD sooner not later though.

regards, tom lane

Re: ❓ JSON Path Dot Precedence

2024-04-10 Thread David E. Wheeler

On Apr 10, 2024, at 10:29, Peter Eisentraut  wrote:

> So the whole thing is
> 
>   
> 
> The syntax of  and  is then punted to 
> ECMAScript 5.1.
> 
> 0x2 is a HexIntegerLiteral.  (There can be no dots in that.)
> 
> p10 is an Identifier.
> 
> So I think this is all correct.

That makes sense, thanks. It’s just a little odd to me that the resulting path 
isn’t a query at all. To Erik’s point: what path can `'0x2.p10` even select?

Best,

David

Re: Table AM Interface Enhancements

2024-04-10 Thread Melanie Plageman

On Wed, Apr 10, 2024 at 4:33 PM Andres Freund  wrote:
>
> Hi,
>
> On 2024-04-10 16:24:40 -0400, Melanie Plageman wrote:
> > This thread has been moving pretty fast, so could someone point out
> > which version of the patch has the modifications to
> > acquire_sample_rows() that would be relevant for Bilal (and others
> > involved in analyze streaming read) to review? Is it
> > v1-0001-revert-Generalize-relation-analyze-in-table-AM-in.patch?
>
> I think so. It's at least what I've been looking at.

I took a look at this patch, and you're right we will need to do
follow-on work with streaming ANALYZE. The streaming read code will
have to be moved now that acquire_sample_rows() is table-AM agnostic
again.

I don't think there was ever a version that Bilal wrote
where the streaming read code was outside of acquire_sample_rows(). By
the time he got that review feedback, 27bc1772fc8 had gone in.

This brings up a question about the prefetching. We never had to have
this discussion for sequential scan streaming read because it didn't
(and still doesn't) do prefetching. But, if we push the streaming read
code down into the heap AM layer, it will be doing the prefetching.
So, do we remove the prefetching from acquire_sample_rows() and expect
other table AMs to implement it themselves or use the streaming read
API?

- Melanie

Re: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Heikki Linnakangas


On 10/04/2024 08:31, Amit Kapila wrote:

On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas  wrote:


On 10/04/2024 07:45, Michael Paquier wrote:

On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:

On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:

Wouldn't the best way forward be to revert
5bec1d6bc5e3 and revisit the whole in v18?


Also consider commits b840508644 and bcb14f4abc.


Indeed.  These are also linked.


I don't feel the urge to revert this:

- It's not broken as such, we're just discussing better ways to
implement it. We could also do nothing, and revisit this in v18. The
only must-fix issue is some compiler warnings IIUC.

- It's a pretty localized change in reorderbuffer.c, so it's not in the
way of other patches or reverts. Nothing else depends on the binaryheap
changes yet either.

- It seems straightforward to repeat the performance tests with whatever
alternative implementations we want to consider.

My #1 choice would be to write a patch to switch the pairing heap,
performance test that, and revert the binary heap changes.



+1.


To move this forward, here's a patch to switch to a pairing heap. In my 
very quick testing, with the performance test cases posted earlier in 
this thread [1] [2], I'm seeing no meaningful performance difference 
between this and what's in master currently.


Sawada-san, what do you think of this? To be sure, if you could also 
repeat the performance tests you performed earlier, that'd be great. If 
you agree with this direction, and you're happy with this patch, feel 
free take it from here and commit this, and also revert commits 
b840508644 and bcb14f4abc.


--
Heikki Linnakangas
Neon (https://neon.tech)
From c883d03bd221341b0bbb315d376d9e125b84329a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 10 Apr 2024 08:09:38 +0300
Subject: [PATCH 1/1] Replace binaryheap + index with pairingheap

A pairing heap can perform the same operations as the binary heap +
index, with as good or better algorithmic complexity, and that's an
existing data structure so that we don't need to invent anything new
compared to v16. This commit makes the new binaryheap functionality
that was added in commits b840508644 and bcb14f4abc unnecessarily, but
they will be reverted separately.

Remove the optimization to only build and maintain the heap when the
amount of memory used is close to the limit, becuase the bookkeeping
overhead with the pairing heap seems to be small enough that it
doesn't matter in practice.
---
 .../replication/logical/reorderbuffer.c   | 186 +++---
 src/include/replication/reorderbuffer.h   |   8 +-
 2 files changed, 29 insertions(+), 165 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5cf28d4df42..ab2ddabfadd 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -68,19 +68,9 @@
  *	  memory gets actually freed.
  *
  *	  We use a max-heap with transaction size as the key to efficiently find
- *	  the largest transaction. While the max-heap is empty, we don't update
- *	  the max-heap when updating the memory counter. Therefore, we can get
- *	  the largest transaction in O(N) time, where N is the number of
- *	  transactions including top-level transactions and subtransactions.
- *
- *	  We build the max-heap just before selecting the largest transactions
- *	  if the number of transactions being decoded is higher than the threshold,
- *	  MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap, we also
- *	  update the max-heap when updating the memory counter. The intention is
- *	  to efficiently find the largest transaction in O(1) time instead of
- *	  incurring the cost of memory counter updates (O(log N)). Once the number
- *	  of transactions got lower than the threshold, we reset the max-heap
- *	  (refer to ReorderBufferMaybeResetMaxHeap() for details).
+ *	  the largest transaction. We update the max-heap whenever the memory
+ *	  counter is updated; however transactions with size 0 are not stored in
+ *	  the heap, because they have no changes to evict.
  *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
@@ -122,23 +112,6 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
 
-/*
- * Threshold of the total number of top-level and sub transactions that
- * controls whether we use the max-heap for tracking their sizes. Although
- * using the max-heap to select the largest transaction is effective when
- * there are many transactions being decoded, maintaining the max-heap while
- * updating the memory statistics can be costly. Therefore, we use
- * MaxConnections as the threshold so that we use the max-heap only when
- * using subtransactions.
- */
-#define MAX_HEAP_TXN_COUNT_THRESHOLD	MaxConnections
-
-/*
- * A macro to check if the

1 2 >

1 - 100 of 139 matches

Mail list logo