Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Michael Paquier
On Thu, Aug 10, 2023 at 02:47:51PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 10 Aug 2023 13:33:48 +0900, Michael Paquier  
> wrote in 
>> On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
>> > At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi 
>> >  wrote in 
>> >> We have treated every kind of broken data as end-of-recovery, like
>> >> incorrect rm_id or prev link including excessively large record length
>> >> due to corruption. This patch is going to change the behavior only for
>> >> the last one. If you think there can't be non-zero broken data, we
>> >> should inhibit proceeding recovery after all non-zero incorrect
>> >> data. This seems to be a quite big change in our recovery policy.
>> 
>> Well, that is precisely the report that led to this thread.  We may lose
>> data and finish with corrupted pages.  I was planning to reply to the other
>> thread [1] and the patch of Noah anyway, because we have to fix the
>> detection of OOM vs corrupted records in the allocation path anyway.
> 
> Does this mean we will address the distinction between an OOM and a
> corrupt total record length later on? If that's the case, should we
> modify that behavior right now?

My apologies if I sounded unclear here.  It seems to me that we should
wrap up the patch on [1] first, and get it backpatched.  At least that
makes for fewer conflicts when 0001 gets merged for HEAD once we are
able to set a proper error code.  (I was looking at it, actually.)

>> >> 4) Wrap up recovery then continue to normal operation.
>> >> 
>> >> This is the traditional behavior for corrupt WAL data.
>> 
>> Yeah, we've been doing a pretty bad job in classifying the errors that
>> can happen during WAL replay in crash recovery, because we assume that
>> all the WAL still in pg_wal/ is correct.  That's not an easy problem,
> 
> I'm not quite sure what "correct" means here. I believe xlogreader
> runs various checks precisely because the data may be incorrect. Given
> that the data can break for various reasons, during crash recovery we
> continue as long as the incoming WAL records remain consistently valid.
> The problem raised here is that we can't reliably distinguish an OOM
> from a corrupted total record length field. One reason is that we check
> the data only after a part of it is loaded, but in such cases we can't
> load all the bytes of the record into memory.

Yep.  Correct means that we end recovery in a consistent state, not
earlier than we should.

>> because the record CRC works for all the contents of the record, and
>> we look at the record header before that.  Another idea may be to have
>> an extra CRC only for the header itself, that can be used in isolation
>> as one of the checks in XLogReaderValidatePageHeader().
> 
> Sounds reasonable. By using CRC to protect the header part and
> allocating a fixed-length buffer for it, I believe we're adopting a
> standard approach and can identify OOM and other kinds of header errors
> with a good degree of certainty.

Not something that I'd like to cover in this patch set, though..  This
is a problem on its own.
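
For the archives, the idea being deferred here would look roughly like the
following.  This is only a sketch: the xl_header_crc field it validates is
hypothetical (XLogRecord has no such field today), while the CRC-32C macros
are the existing ones from port/pg_crc32c.h.

#include "postgres.h"
#include "access/xlogrecord.h"
#include "port/pg_crc32c.h"

/*
 * Validate a CRC computed over the fixed-size record header alone, so that
 * a corrupted xl_tot_len could be told apart from a genuine OOM before the
 * whole record has to be buffered.
 */
static bool
header_crc_is_valid(const XLogRecord *rec, pg_crc32c stored_header_crc)
{
    pg_crc32c   crc;

    INIT_CRC32C(crc);
    COMP_CRC32C(crc, rec, offsetof(XLogRecord, xl_crc));
    FIN_CRC32C(crc);

    return EQ_CRC32C(crc, stored_header_crc);
}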

>> >> I'm not entirely certain, but if you were to ask me which is more
>> >> probable during recovery - encountering a correct record that's too
>> >> lengthy for the server to buffer or stumbling upon a corrupt byte
>> >> sequence - I'd bet on the latter.
>> 
>> I don't really believe in chance when it comes to computer science,
>> facts and a correct detection of such facts are better :)
> 
> Even now, it seems like we're balancing the risk of potential data
> loss against the potential inability to start the server. I meant
> that.. if I had to choose, I'd lean slightly towards avoiding the
> latter, in other words, keeping the current behavior.

On OOM, this means data loss and silent corruption.  A failure at least
has the merit of telling someone that something is wrong, and that
they'd better look at it rather than hope for the best.

>>> ... of course this refers to crash recovery. For replication, we
>>> should keep retrying the current record until the operator commands
>>> promotion.
>> 
>> Are you referring to a retry if there is a standby.signal?  I am a
>> bit confused by this sentence, because we could do a crash recovery,
>> then switch to archive recovery.  So, I guess that you mean that on
>> OOM we should retry to retrieve WAL from the local pg_wal/ even in the
>> case where we are in the crash recovery phase, *before* switching to
> 
> Apologies for the confusion. What I was thinking is that an OOM is
> more likely to occur in replication downstreams than during server
> startup. I also felt for the latter case that such a challenging
> environment probably wouldn't let the server enter stable normal
> operation.

It depends on what the user does with the host running the cluster.
Both could be impacted.

>> >> I'm not sure how often users encounter corrupt WAL data, but I believe
>> >> they should have the option to terminate recovery and then switch to
>> >> normal operation.

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Kyotaro Horiguchi
At Thu, 10 Aug 2023 13:33:48 +0900, Michael Paquier  wrote 
in 
> On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in 
> >> We have treated every kind of broken data as end-of-recovery, like
> >> incorrect rm_id or prev link including excessively large record length
> >> due to corruption. This patch is going to change the behavior only for
> >> the last one. If you think there can't be non-zero broken data, we
> >> should inhibit proceeding recovery after all non-zero incorrect
> >> data. This seems to be a quite big change in our recovery policy.
> 
> Well, that is precisely the report that led to this thread.  We may lose
> data and finish with corrupted pages.  I was planning to reply to the other
> thread [1] and the patch of Noah anyway, because we have to fix the
> detection of OOM vs corrupted records in the allocation path anyway.

Does this mean we will address the distinction between an OOM and a
corrupt total record length later on? If that's the case, should we
modify that behavior right now?

> The infra introduced by 0001 and something like 0002 that allows the
> startup process to take a different path depending on the type of
> error are still needed to avoid a too early end-of-recovery, though.

Agreed.

> >> 4) Wrap up recovery then continue to normal operation.
> >> 
> >> This is the traditional behavior for corrupt WAL data.
> 
> Yeah, we've been doing a pretty bad job in classifying the errors that
> can happen during WAL replay in crash recovery, because we assume that
> all the WAL still in pg_wal/ is correct.  That's not an easy problem,

I'm not quite sure what "correct" means here. I believe xlogreader
runs various checks precisely because the data may be incorrect. Given
that the data can break for various reasons, during crash recovery we
continue as long as the incoming WAL records remain consistently valid.
The problem raised here is that we can't reliably distinguish an OOM
from a corrupted total record length field. One reason is that we check
the data only after a part of it is loaded, but in such cases we can't
load all the bytes of the record into memory.
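
To make the distinction concrete, the allocation path could be shaped
roughly like this (a sketch only, not the proposed patch; the three-way
result type is made up and stands in for whatever error state 0001
introduces):

#include "postgres.h"
#include "access/xlogrecord.h"
#include "utils/memutils.h"

typedef enum { ALLOC_OK, ALLOC_CORRUPT, ALLOC_OOM } AllocVerdict;

static AllocVerdict
allocate_recordbuf_checked(uint32 total_len, char **buf)
{
    /* An implausible length is evidence of corruption, not memory pressure. */
    if (total_len < SizeOfXLogRecord || total_len > MaxAllocSize)
        return ALLOC_CORRUPT;

    /* The header looked sane, so a failed allocation really is an OOM. */
    *buf = palloc_extended(total_len, MCXT_ALLOC_NO_OOM);
    return (*buf == NULL) ? ALLOC_OOM : ALLOC_OK;
}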

> because the record CRC works for all the contents of the record, and
> we look at the record header before that.  Another idea may be to have
> an extra CRC only for the header itself, that can be used in isolation
> as one of the checks in XLogReaderValidatePageHeader().

Sounds reasonable. By using CRC to protect the header part and
allocating a fixed-length buffer for it, I believe we're adopting a
standard approach and can identify OOM and other kinds of header errors
with a good degree of certainty.

> >> I'm not entirely certain, but if you were to ask me which is more
> >> probable during recovery - encountering a correct record that's too
> >> lengthy for the server to buffer or stumbling upon a corrupt byte
> >> sequence - I'd bet on the latter.
> 
> I don't really believe in chance when it comes to computer science,
> facts and a correct detection of such facts are better :)

Even now, it seems like we're balancing the risk of potential data
loss against the potential inability to start the server. I meant
that.. if I had to choose, I'd lean slightly towards avoiding the
latter, in other words, keeping the current behavior.

> > ... of course this refers to crash recovery. For replication, we
> > should keep retrying the current record until the operator commands
> > promotion.
> 
> Are you referring to a retry if there is a standby.signal?  I am a
> bit confused by this sentence, because we could do a crash recovery,
> then switch to archive recovery.  So, I guess that you mean that on
> OOM we should retry to retrieve WAL from the local pg_wal/ even in the
> case where we are in the crash recovery phase, *before* switching to

Apologies for the confusion. What I was thinking is that an OOM is
more likely to occur in replication downstreams than during server
startup. I also felt for the latter case that such a challenging
environment probably wouldn't let the server enter stable normal
operation.

> archive recovery and a different source, right?  I think that this
> argument can go two ways, because it could be more helpful for some to
> see a FATAL when we are still in crash recovery, even if there is a
> standby.signal.  It does not seem to me that we have a clear
> definition about what to do in which case, either.  Now we just fail
> and hope for the best when doing crash recovery.

Agreed.

> >> I'm not sure how often users encounter corrupt WAL data, but I believe
> >> they should have the option to terminate recovery and then switch to
> >> normal operation.
> >> 
> >> What if we introduced an option to increase the timeline whenever
> >> recovery hits a data error? If that option is disabled, the server stops
> >> when recovery detects incorrect data, except in the case of an
> >> OOM. An OOM causes a record retry.
> 
> I guess that it depends on

Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-08-09 Thread Peter Smith
Hi Melih,

FYI -- The same testing was repeated, but this time PG was configured
with synchronous_commit=on. Other factors and scripts were the same
as before --- busy apply, 5 runs, 4 workers, 1000 inserts/tx, 100
empty tables, etc.

There are still more xlog records seen for the v26 patch, but now the
v26 performance was better than HEAD.

RESULTS (synchronous_commit=on)
---

Xlog Counts

HEAD
postgres=# select avg(counttime) "avg", median(counttime) "median",
min(counttime) "min", max(counttime) "max", logtype from test_head
group by logtype;
          avg           | median | min | max  |             logtype
------------------------+--------+-----+------+---------------------------------
 1253.7509433962264151  | 1393.  |   1 | 2012 | FIND_DECODING_XLOG_RECORD_COUNT
(1 row)


HEAD+v26-0001
postgres=# select avg(counttime) "avg", median(counttime) "median",
min(counttime) "min", max(counttime) "max", logtype from test_v26
group by logtype;
          avg           |  median   | min | max  |             logtype
------------------------+-----------+-----+------+---------------------------------
 1278.4075471698113208  | 1423.5000 |   1 | 2015 | FIND_DECODING_XLOG_RECORD_COUNT
(1 row)

~~

Performance

HEAD
[peter@localhost res_0809_vignesh_timing_sync_head]$ cat *.dat_SUB |
grep RESULT | grep -v duration | awk '{print $3}'
4014.266
3892.089
4195.318
3571.862
4312.183


HEAD+v26-0001
[peter@localhost res_0809_vignesh_timing_sync_v260001]$ cat *.dat_SUB
| grep RESULT | grep -v duration | awk '{print $3}'
3326.627
3213.028
3433.611
3299.803
3258.821

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-09 Thread Amit Kapila
On Mon, Aug 7, 2023 at 3:46 PM Amit Kapila  wrote:
>
> On Mon, Aug 7, 2023 at 1:06 PM Julien Rouhaud  wrote:
> >
> > On Mon, Aug 07, 2023 at 12:42:33PM +0530, Amit Kapila wrote:
> > > On Mon, Aug 7, 2023 at 11:29 AM Julien Rouhaud  wrote:
> > > >
> > > > Unless I'm missing something I don't see what prevents something to 
> > > > connect
> > > > using the replication protocol and issue any query or even create new
> > > > replication slots?
> > > >
> > >
> > > I think the point is that if we have any slots where we have not
> > > consumed the pending WAL (other than the expected like
> > > SHUTDOWN_CHECKPOINT) or if there are invalid slots then the upgrade
> > > won't proceed and we will request user to remove such slots or ensure
> > > that WAL is consumed by slots. So, I think in the case you mentioned,
> > > the upgrade won't succeed.
> >
> > What if new slots are added while the old instance is started in the middle 
> > of
> > pg_upgrade, *after* the various checks are done?
> >
>
> They won't be copied but I think that won't be any different than
> other objects like tables. Anyway, I have another idea which is to not
> allow creating slots during binary upgrade unless one specifically
> requests it by having an API like binary_upgrade_allow_slot_create()
> similar to existing APIs binary_upgrade_*.
>

Sawada-San, Julien, and others, do you have any thoughts on the above point?

-- 
With Regards,
Amit Kapila.




Re: Fix pg_stat_reset_single_table_counters function

2023-08-09 Thread Masahiro Ikeda

On 2023-08-01 15:23, Masahiro Ikeda wrote:

Hi,

My colleague, Mitsuru Hinata (in CC), found the following issue.

The documentation of pg_stat_reset_single_table_counters() says

pg_stat_reset_single_table_counters ( oid ) → void
Resets statistics for a single table or index in the current database 
or shared across all databases in the cluster to zero.
This function is restricted to superusers by default, but other users 
can be granted EXECUTE to run the function.

https://www.postgresql.org/docs/devel/monitoring-stats.html

But, the stats will not be reset if the table shared across all
databases is specified. IIUC, 5891c7a8e seemed to have mistakenly
removed the feature implemented in e04267844. What do you think?


I fixed an issue with the v1 patch reported by cfbot.

The test case didn't account for the fact that the number of databases
can change in the pg_upgrade test.
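
For reference, the shared-catalog case can be exercised like this once the
patch is applied (this is essentially what the added regression test does):

SELECT count(*) >= 0 FROM pg_database;   -- bump stats on a shared catalog
SELECT pg_stat_reset_single_table_counters('pg_database'::regclass);
SELECT pg_stat_reset_single_table_counters('pg_database_oid_index'::regclass);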

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From 182c676380aa91a356814d8e387e5e5fe7e2c3c0 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda 
Date: Thu, 10 Aug 2023 13:46:09 +0900
Subject: [PATCH] Fix pg_stat_reset_single_table_counters function.

This commit revives the feature to reset statistics for a single
relation shared across all databases in the cluster to zero, which
was implemented by the following commit.
* Enhance pg_stat_reset_single_table_counters function(e04267844)

The following commit accidentally deleted the feature.
* pgstat: store statistics in shared memory(5891c7a8e)

Bump catalog version.

Need to backpatch from 15.

Reported-by: Mitsuru Hinata
---
 src/backend/utils/adt/pgstatfuncs.c |  9 -
 src/include/catalog/catversion.h|  2 +-
 src/test/regress/expected/stats.out | 60 +
 src/test/regress/sql/stats.sql  | 18 +
 4 files changed, 86 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 2a4c8ef87f..2b9742ad21 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -17,6 +17,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "access/xlogprefetcher.h"
+#include "catalog/catalog.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -1776,13 +1777,17 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
-/* Reset a single counter in the current database */
+/*
+ * Reset statistics for a single object, which may belong to the current
+ * database or be shared across all databases in the cluster.
+ */
 Datum
 pg_stat_reset_single_table_counters(PG_FUNCTION_ARGS)
 {
 	Oid			taboid = PG_GETARG_OID(0);
+	Oid			dboid = (IsSharedRelation(taboid) ? InvalidOid : MyDatabaseId);
 
-	pgstat_reset(PGSTAT_KIND_RELATION, MyDatabaseId, taboid);
+	pgstat_reset(PGSTAT_KIND_RELATION, dboid, taboid);
 
 	PG_RETURN_VOID();
 }
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index f507b49bb2..7d20c58b1d 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*			mmddN */
-#define CATALOG_VERSION_NO	202307261
+#define CATALOG_VERSION_NO	202308101
 
 #endif
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 319164a5e9..248e072395 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -592,6 +592,66 @@ SELECT seq_scan, idx_scan FROM pg_stat_all_tables WHERE relid = 'test_last_scan'
 0 |0
 (1 row)
 
+-- ensure that statistics are reset for a table and an index shared across all databases
+BEGIN;
+SET LOCAL enable_seqscan TO on;
+SET LOCAL enable_indexscan TO off;
+SET LOCAL enable_indexonlyscan TO off;
+EXPLAIN (COSTS off) SELECT count(*) >= 0 FROM pg_database;
+  QUERY PLAN   
+---
+ Aggregate
+   ->  Seq Scan on pg_database
+(2 rows)
+
+SELECT count(*) >= 0 FROM pg_database; -- increment stats for the table
+ ?column? 
+--
+ t
+(1 row)
+
+SET LOCAL enable_seqscan TO off;
+SET LOCAL enable_indexscan TO on;
+SET LOCAL enable_indexonlyscan TO off;
+EXPLAIN (COSTS off) SELECT count(*) >= 0 FROM pg_database WHERE oid = 1;
+ QUERY PLAN  
+-
+ Aggregate
+   ->  Index Scan using pg_database_oid_index on pg_database
+ Index Cond: (oid = '1'::oid)
+(3 rows)
+
+SELECT count(*) >= 0 FROM pg_database WHERE oid = 1; -- increment stats for the index
+ ?column? 
+--
+ t
+(1 row)
+
+COMMIT;
+SELECT pg_stat_reset_single_table_counters('pg_database'::regclass);
+ pg_stat_reset_single_table_counters 
+-
+ 
+(1 row)
+
+SELECT pg_stat_reset_single_table_counters('pg_database_oid_index'::regclass);
+ pg_stat_reset_single_table_counters 
+-
+ 
+(1 row)
+
+SELECT pg_stat_reset_single_table_counters('pg_database_datname_index'::regc

Re: [PATCH] Add loongarch native checksum implementation.

2023-08-09 Thread John Naylor
On Tue, Aug 8, 2023 at 2:22 PM YANG Xudong  wrote:
>
> On 2023/8/8 14:38, John Naylor wrote:
>
> > v4 0001 is the same as v3, but with a draft commit message. I will
> > squash and commit this week, unless there is additional feedback.
>
> Looks good to me. Thanks for the additional patch.

I pushed this with another small comment change. Unfortunately, I didn't
glance at the buildfarm beforehand -- it seems many members are failing an
isolation check added by commit fa2e87494, including both loongarch64
members. I'll check back periodically.

https://buildfarm.postgresql.org/cgi-bin/show_status.pl

--
John Naylor
EDB: http://www.enterprisedb.com


Re: PG 16 draft release notes ready

2023-08-09 Thread Pavel Luzanov

On 09.08.2023 21:06, Bruce Momjian wrote:

On Sun, Jul 23, 2023 at 02:09:17PM +0300, Pavel Luzanov wrote:

Please consider to add item to the psql section:

Add psql \drg command to display role grants and remove the "Member of"
column from \du & \dg altogether (d65ddaca)

The release notes are only current as of 2023-06-26 and I will consider
this when I update them next week, thanks.


This item is a part of Beta 3 scheduled for August 10, 2023 (today). [1]
It might be worth updating the release notes before the release.
But I'm not familiar with the release process in detail, so I could be 
wrong.


1. 
https://www.postgresql.org/message-id/93c00ac3-08b3-33ba-5d77-6ceb6ab20254%40postgresql.org


--
Pavel Luzanov
Postgres Professional: https://postgrespro.com





Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-08-09 Thread Amit Kapila
On Wed, Aug 9, 2023 at 8:28 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Thursday, August 3, 2023 7:30 PM Melih Mutlu   
> wrote:
>
> > Right. I attached the v26 as you asked.
>
> Thanks for posting the patches.
>
> While reviewing the patch, I noticed one rare case where it's possible
> that there are two table sync workers for the same table at the same
> time.
>
> The patch relies on LogicalRepWorkerLock to prevent concurrent access, but the
> apply worker will start a new worker after releasing the lock. So, at the 
> point[1]
> where the lock is released and the new table sync worker has not been started,
> it seems possible that another old table sync worker will be reused for the
> same table.
>
> /* Now safe to release the LWLock */
> LWLockRelease(LogicalRepWorkerLock);
> *[1]
> /*
>  * If there are free sync worker slot(s), 
> start a new sync
>  * worker for the table.
>  */
> if (nsyncworkers < 
> max_sync_workers_per_subscription)
> ...
> 
> logicalrep_worker_launch(MyLogicalRepWorker->dbid,
>

Yeah, this is a problem. I think one idea to solve this is by
extending the lock duration till we launch the tablesync worker but we
should also consider changing this locking scheme such that there is a
better way to indicate that for a particular rel, tablesync is in
progress. Currently, the code in TablesyncWorkerMain() also acquires
the lock in exclusive mode even though the tablesync for a rel is in
progress, which I guess could easily hurt us for larger values of
max_logical_replication_workers. So, that could be another motivation
to think of a different locking scheme.

-- 
With Regards,
Amit Kapila.




Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Michael Paquier
On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote:
> At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
>> At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier  
>> wrote in 
>>>> While it's a kind of bug in total, we encountered a case where an
>>>> excessively large xl_tot_len actually came from a corrupted
>>>> record. [1]
>>> 
>>> Right, I remember this one.  I think that Thomas was pretty much right
>>> that this could be caused because of a lack of zeroing in the WAL
>>> pages.
>> 
>> We have treated every kind of broken data as end-of-recovery, like
>> incorrect rm_id or prev link including excessively large record length
>> due to corruption. This patch is going to change the behavior only for
>> the last one. If you think there can't be non-zero broken data, we
>> should inhibit proceeding recovery after all non-zero incorrect
>> data. This seems to be a quite big change in our recovery policy.

Well, that is precisely the report that led to this thread.  We may lose
data and finish with corrupted pages.  I was planning to reply to the other
thread [1] and the patch of Noah anyway, because we have to fix the
detection of OOM vs corrupted records in the allocation path anyway.
The infra introduced by 0001 and something like 0002 that allows the
startup process to take a different path depending on the type of
error are still needed to avoid a too early end-of-recovery, though.

>>> There are a few options on the table, only doable once the WAL reader
>>> provider the error state to the startup process:
>>> 1) Retry a few times and FATAL.
>>> 2) Just FATAL immediately and don't wait.
>>> 3) Retry and hope for the best that the host calms down.
>> 
>> 4) Wrap up recovery then continue to normal operation.
>> 
>> This is the traditional behavior for corrupt WAL data.

Yeah, we've been doing a pretty bad job in classifying the errors that
can happen during WAL replay in crash recovery, because we assume that
all the WAL still in pg_wal/ is correct.  That's not an easy problem,
because the record CRC works for all the contents of the record, and
we look at the record header before that.  Another idea may be to have
an extra CRC only for the header itself, that can be used in isolation
as one of the checks in XLogReaderValidatePageHeader().

>>> I have not seen this issue being much of an issue in the field, so
>>> perhaps option 2 with the structure of 0002 and a FATAL when we catch
>>> XLOG_READER_OOM in the switch would be enough.  At least that's enough
>>> for the cases we've seen.  I'll think a bit more about it, as well.
>>> 
>>> Yeah, agreed.  That's orthogonal to the issue reported by Ethan,
>>> unfortunately, where he was able to trigger the issue of this thread
>>> by manipulating the sizing of a host after producing a record larger
>>> than what the host could afford after the resizing :/
>> 
>> I'm not entirely certain, but if you were to ask me which is more
>> probable during recovery - encountering a correct record that's too
>> lengthy for the server to buffer or stumbling upon a corrupt byte
>> sequence - I'd bet on the latter.

I don't really believe in chance when it comes to computer science,
facts and a correct detection of such facts are better :)

> ... of course this refers to crash recovery. For replication, we
> should keep retrying the current record until the operator commands
> promotion.

Are you referring to a retry if there is a standby.signal?  I am a
bit confused by this sentence, because we could do a crash recovery,
then switch to archive recovery.  So, I guess that you mean that on
OOM we should retry to retrieve WAL from the local pg_wal/ even in the
case where we are in the crash recovery phase, *before* switching to
archive recovery and a different source, right?  I think that this
argument can go two ways, because it could be more helpful for some to
see a FATAL when we are still in crash recovery, even if there is a
standby.signal.  It does not seem to me that we have a clear
definition about what to do in which case, either.  Now we just fail
and hope for the best when doing crash recovery.

>> I'm not sure how often users encounter corrupt WAL data, but I believe
>> they should have the option to terminate recovery and then switch to
>> normal operation.
>> 
>> What if we introduced an option to increase the timeline whenever
>> recovery hits a data error? If that option is disabled, the server stops
>> when recovery detects incorrect data, except in the case of an
>> OOM. An OOM causes a record retry.

I guess that it depends on how much responsiveness one may want.
Forcing a failure on OOM is at least something that users would be
immediately able to act on when we don't run a standby but just
recover from a crash, while a standby would do what it is designed to
do, aka continue to replay what it can see.  One issue with the
wait-and-continue is that a standby may loop continuously on OOM,
which could be also bad if there's a replica

Re: Support to define custom wait events for extensions

2023-08-09 Thread Masahiro Ikeda

Hi,

Thanks for your comments about the v2 patches. I updated to v3 patches.

The main changes are:
* remove the AddinShmemInitLock assertion
* add the new lock (WaitEventExtensionLock) to wait_event_names.txt
* change "static const int wee_hash_XXX_size" to "#define XXX"
* simplify worker_spi. I removed the code related to shared memory and
  try to allocate the new wait event before waiting, per background
  worker.

* change to elog from ereport because the errors are for developers.
* revise comments as advised.
* fix the request size for shared memory correctly
* simplify dblink.c
* fix the process ordering of WaitEventExtensionNew() as advised, to
  avoid leading to an illegal state.

In addition, I change the followings:
* update about custom wait events in sgml. we don't need to use
  shmem_startup_hook.
* change the hash names for readability.
 (ex. WaitEventExtensionNameHash -> WaitEventExtensionHashById)
* fix a bug where already-defined events could not be looked up by name
  because HASH_BLOBS was specified. HASH_STRINGS is right since
  the key is a C string.
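
For anyone skimming the thread, the intended usage from an extension ends up
quite small with v3 (a sketch only; the wait event name and function are
made up for the example):

#include "postgres.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "utils/wait_event.h"

/*
 * Looked up or allocated lazily; the shared hash makes the same name map
 * to the same ID in every backend.
 */
static uint32 my_wait_event = 0;

static void
wait_a_bit(void)
{
    if (my_wait_event == 0)
        my_wait_event = WaitEventExtensionNew("MyExtensionMainLoop");

    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     1000L,
                     my_wait_event);
    ResetLatch(MyLatch);
}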

I create a new entry in commitfest for CI testing.
(https://commitfest.postgresql.org/44/4494/)

Thanks for closing the former entry.
(https://commitfest.postgresql.org/43/4368/)

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From 234cc7d852aebf78285ccde383f27ea35a27dad2 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda 
Date: Thu, 10 Aug 2023 10:44:41 +0900
Subject: [PATCH 2/2] poc: custom wait event for dblink

---
 contrib/dblink/dblink.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 41e1f6c91d..b7158ecd4f 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -129,6 +129,7 @@ static void restoreLocalGucs(int nestlevel);
 /* Global */
 static remoteConn *pconn = NULL;
 static HTAB *remoteConnHash = NULL;
+static uint32 wait_event_info = 0;
 
 /*
  *	Following is list that holds multiple remote connections.
@@ -203,7 +204,9 @@ dblink_get_conn(char *conname_or_str,
 		dblink_connstr_check(connstr);
 
 		/* OK to make connection */
-		conn = libpqsrv_connect(connstr, WAIT_EVENT_EXTENSION);
+		if (wait_event_info == 0)
+			wait_event_info = WaitEventExtensionNew("dblink_get_con");
+		conn = libpqsrv_connect(connstr, wait_event_info);
 
 		if (PQstatus(conn) == CONNECTION_BAD)
 		{
-- 
2.25.1

From 07bbff06a0f58f08d313c02cbb4104ad1db2a0e9 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda 
Date: Thu, 10 Aug 2023 10:43:34 +0900
Subject: [PATCH 1/2] Change to manage custom wait events in shared hash

Currently, names of custom wait events must be registered
per backend. This patch relaxes the constraint by storing the
wait events and their names in hash tables in shared memory.

So, all backends can look up the wait event names in the
pg_stat_activity view without additional processing by extensions.

In addition, extensions which do not otherwise use shared memory
can define custom wait events because WaitEventExtensionNew()
will never allocate a new one if the wait event associated with the
name is already allocated. It uses a hash table to ensure uniqueness.
---
 doc/src/sgml/monitoring.sgml  |   7 +-
 doc/src/sgml/xfunc.sgml   |  26 +--
 src/backend/storage/lmgr/lwlocknames.txt  |   1 +
 src/backend/utils/activity/wait_event.c   | 219 +++---
 .../utils/activity/wait_event_names.txt   |   1 +
 src/include/utils/wait_event.h|  17 +-
 .../modules/worker_spi/t/001_worker_spi.pl|  18 +-
 .../modules/worker_spi/worker_spi--1.0.sql|   5 -
 src/test/modules/worker_spi/worker_spi.c  | 109 +
 9 files changed, 165 insertions(+), 238 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index f4fc5d814f..19181832d7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1121,10 +1121,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
  LWLock types
  to the list shown in  and
  . In some cases, the name
- assigned by an extension will not be available in all server processes;
- so an Extension or LWLock wait
- event might be reported as just
- extension rather than the
+ of LWLock assigned by an extension will not be
+ available in all server processes; so a wait event might be reported
+ as just extension rather than the
  extension-assigned name.
 

diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index d6345a775b..281c178b0e 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3454,33 +3454,15 @@ if (!ptr)

 

-Shared Memory and Custom Wait Events
+Custom Wait Events
 
 
  Add-ins can define custom wait events under the wait event type
- Extension. The add-in's shared library must be
- preloaded by specifying it in shared_preload_libraries,
- and register a shmem_request_hook and a
-  

Make psql's query canceling test simple by using signal() routine of IPC::Run

2023-08-09 Thread Yugo NAGATA
Hello,

Currently, psql's query cancelling test (src/bin/psql/t/020_cancel.pl)
gets the PPID of a running psql by using the "\!" meta command, and sends
SIGINT to the process by using "kill". However, IPC::Run provides a signal()
routine that sends a signal to a running process, so I think we can rewrite
the test in a simpler fashion using this routine, as in the attached patch.

What do you think about it?


Use of signal() is suggested by Fabien COELHO during the discussion of
query cancelling of pgbench [1].

[1] 
https://www.postgresql.org/message-id/7691ade8-78-dd4-e26-135ccfbf0e9%40mines-paristech.fr

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
diff --git a/src/bin/psql/t/020_cancel.pl b/src/bin/psql/t/020_cancel.pl
index 0765d82b92..081a1468d9 100644
--- a/src/bin/psql/t/020_cancel.pl
+++ b/src/bin/psql/t/020_cancel.pl
@@ -29,35 +29,10 @@ SKIP:
 
 	my ($stdin, $stdout, $stderr);
 
-	# Test whether shell supports $PPID.  It's part of POSIX, but some
-	# pre-/non-POSIX shells don't support it (e.g., NetBSD).
-	$stdin = "\\! echo \$PPID";
-	IPC::Run::run([ 'psql', '-X', '-v', 'ON_ERROR_STOP=1' ],
-		'<', \$stdin, '>', \$stdout, '2>', \$stderr);
-	$stdout =~ /^\d+$/ or skip "shell apparently does not support \$PPID", 2;
-
 	# Now start the real test
 	my $h = IPC::Run::start([ 'psql', '-X', '-v', 'ON_ERROR_STOP=1' ],
 		\$stdin, \$stdout, \$stderr);
 
-	# Get the PID
-	$stdout = '';
-	$stderr = '';
-	$stdin = "\\! echo \$PPID >$tempdir/psql.pid\n";
-	pump $h while length $stdin;
-	my $count;
-	my $psql_pid;
-	until (
-		-s "$tempdir/psql.pid"
-		  and ($psql_pid =
-			PostgreSQL::Test::Utils::slurp_file("$tempdir/psql.pid")) =~
-		  /^\d+\n/s)
-	{
-		($count++ < 100 * $PostgreSQL::Test::Utils::timeout_default)
-		  or die "pid file did not appear";
-		usleep(10_000);
-	}
-
 	# Send sleep command and wait until the server has registered it
 	$stdin = "select pg_sleep($PostgreSQL::Test::Utils::timeout_default);\n";
 	pump $h while length $stdin;
@@ -66,7 +41,7 @@ SKIP:
 	) or die "timed out";
 
 	# Send cancel request
-	kill 'INT', $psql_pid;
+	$h->signal('INT');
 
 	my $result = finish $h;
 


Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-09 Thread Amit Kapila
On Thu, Aug 10, 2023 at 6:46 AM Masahiko Sawada  wrote:
>
> On Wed, Aug 9, 2023 at 1:15 PM Amit Kapila  wrote:
> >
> > On Wed, Aug 9, 2023 at 8:01 AM Masahiko Sawada  
> > wrote:
> >
> > I feel it would be a good idea to provide such a tool for users to
> > avoid getting errors during upgrade but I think the upgrade code still
> > needs to ensure that there are no WAL records between
> > confirmed_flush_lsn and SHUTDOWN_CHECKPOINT other than required. Or, do you
> > want to say that we don't do any verification check during the upgrade
> > and let the data loss happens if the user didn't ensure that by
> > running such a tool?
>
> I meant that if we can check the slot state file while the old cluster
> stops, we can ensure there are no WAL records between slot's
> confirmed_flush_lsn (in the state file) and the latest checkpoint (in
> the control file).
>

Are you suggesting doing this before we start the old cluster or after
we stop the old cluster? I was thinking about the pros and cons of
doing this check when the server is 'on' (along with other upgrade
checks something like the patch is doing now) versus when the server
is 'off'. I think the advantage of doing it when the server is 'off'
(after check_and_dump_old_cluster()) is that it will be ensured that
there is no extra WAL that could be generated during the upgrade and
has not been verified against confirmed_flush_lsn location. But OTOH,
to retrieve slot information when the server is 'off', we need a
separate utility or probably a functionality for the same in
pg_upgrade and also some WAL reading stuff which sounds to me like a
larger change that may not be warranted here. I think anyway the extra
WAL (if any got generated during the upgrade) won't be required after
the upgrade, so I am not convinced we should make such a check while the
server is 'off'. Are there reasons which make it better to do this while the old
cluster is 'off'?
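
FWIW, while the server is still 'on', that condition can at least be
approximated from SQL (illustration only, not part of the patch, and of
course the shutdown checkpoint itself is only written once the server
stops):

SELECT s.slot_name,
       s.confirmed_flush_lsn,
       c.checkpoint_lsn,
       s.confirmed_flush_lsn >= c.checkpoint_lsn AS caught_up
FROM pg_replication_slots s,
     pg_control_checkpoint() c
WHERE s.slot_type = 'logical';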

> >
> > We can do that if we think so. We have two ways to make this check
> > optional (a) have a switch like --include-logical-replication-slots as
> > the proposed patch has which means by default we won't try to upgrade
> > slots; (b) have a switch like --exclude-logical-replication-slots as
> > Jonathan proposed which means we will exclude slots only if specified
> > by user. Now, one thing to note is that we don't seem to have any
> > include/exclude switch in the upgrade which I think indicates users by
> > default prefer to upgrade everything. Now, even if we decide not to
> > give any switch initially but do it only if there is a user demand for
> > it then also users will have a way to proceed with an upgrade which is
> > by dropping such slots. Do you have any preference?
>
> TBH I'm not sure if there is a use case where the user wants to
> exclude replication slots during the upgrade. Including replication
> slots by default seems to be better to me, at least for now. I
> initially thought asking for users to drop replication slots that
> possibly have not consumed all WAL records would not be a good idea,
> but since we already do such things in check.c I now think it would
> not be a problem. I guess it would be great if we can check WAL
> records between slots' confirmed_flush_lsn and the latest LSN, and if
> there are no meaningful WAL records there we can upgrade the
> replication slots.
>

Agreed.

-- 
With Regards,
Amit Kapila.




Re: pgbench: allow to cancel queries during benchmark

2023-08-09 Thread Yugo NAGATA
On Wed, 9 Aug 2023 11:18:43 +0200 (CEST)
Fabien COELHO  wrote:

> 
> I forgot, about the test:
> 
> I think that it should be integrated in the existing 
> "001_pgbench_with_server.pl" script, because a TAP script is pretty 
> expensive as it creates a cluster, starts it… before running the test.

Ok. I'll integrate the test into 001.
 
> I'm surprised that IPC::Run does not give any access to the process number. 
> Anyway, its object interface seems to allow sending signal:
> 
>   $h->signal("...")
> 
> So the code could be simplified to use that after a small delay.

Thank you for your information.

I didn't know about $h->signal(), and I mimicked the way
src/bin/psql/t/020_cancel.pl sends SIGINT to a running program. I don't
know why the psql test doesn't use that interface; I'll investigate whether
it can be used for our purpose, anyway.

Regards,
Yugo Nagata

> 
> -- 
> Fabien.


-- 
Yugo NAGATA 




Re: pgbench: allow to cancel queries during benchmark

2023-08-09 Thread Yugo NAGATA
Hello Fabien,

On Wed, 9 Aug 2023 11:06:24 +0200 (CEST)
Fabien COELHO  wrote:

> 
> Hello Yugo-san,
> 
> Some feedback about v2.
> 
> There is some dead code (&& false) which should be removed.

I forgot to remove the debug code. I'll remove it.

> >>> In multi-threaded mode, only one thread can catch the signal and other
> >>> threads continue to run.
> 
> Yep. This is why I see plenty of uncontrolled race conditions if thread 0 
> cancels clients which are managed in parallel by other threads and may be 
> active. I'm not really motivated to delve into libpq internals to check 
> whether there are possibly bad issues or not, but if two threads write 
> message at the same time in the same socket, I assume that this can be 
> bad if you are unlucky.
> 
> ISTM that the rational convention should be that each thread cancels its 
> own clients, which ensures that there is no bad interaction between 
> threads.

Actually, thread #0 and the other threads never write a message to the
same socket at the same time. When thread #0 sends cancel requests, they
are not sent to sockets that other threads are reading or writing.
Rather, another new socket for the cancel is created for each client, and
the backend PID and cancel request are sent over that socket. PostgreSQL
establishes a new connection for the cancel request and sends a cancel
signal to the specified backend.

Therefore, thread #0 and the other threads never access the same resources
at the same time, except for CancelRequested. Is there still any concern
about a race condition?
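
For reference, this is roughly what issuing a cancel over its own connection
looks like with plain libpq (a generic sketch, not code taken from the
patch):

#include <stdio.h>
#include "libpq-fe.h"

static void
cancel_running_query(PGconn *conn)
{
    char        errbuf[256];
    PGcancel   *cancel = PQgetCancel(conn); /* snapshots host/port/PID/key */

    if (cancel == NULL)
        return;

    /* Opens a separate socket and sends a CancelRequest message on it. */
    if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
        fprintf(stderr, "could not send cancel request: %s\n", errbuf);

    PQfreeCancel(cancel);
}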

> >>> Therefore, if Ctrl+C is pressed while threads are waiting
> >>> responses from the backend in wait_on_socket_set, only one thread can be
> >>> interrupted and return, but other threads will continue to wait and cannot
> >>> check CancelRequested.
> 
> >>> So, for implementing your suggestion, we need any hack
> >>> to make all threads return from wait_on_socket_set when the event occurs, 
> >>> but
> >>> I don't have idea to do this in simpler way.
> 
> >>> In my patch, all threads can return from wait_on_socket_set at Ctrl+C
> >>> because when thread #0 cancels all connections, the following error is
> >>> sent to all sessions:
> >>>
> >>>  ERROR:  canceling statement due to user request
> >>>
> >>> and all threads will receive the response from the backend.
> 
> Hmmm.
> 
> I understand that the underlying issue you are raising is that other 
> threads may be stuck while waiting on socket events and that with your 
> approach they will be cleared somehow by socket 0.
> 
> I'll say that (1) this point does not address potential race condition 
> issues with several thread interacting directly on the same client ;
> (2) thread 0 may also be stuck waiting for events so the cancel is only 
> taken into account when it is woken up.

I answered (1), the concern about a race condition, above.

And, as to (2), the SIGINT signal is handled only in thread #0 because it
is blocked in the other threads. So, when SIGINT is delivered, thread #0
will be interrupted and woken up immediately from waiting on the socket,
returning EINTR. Therefore, thread #0 would not be stuck.
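
The blocking part is the usual pthread pattern, roughly like this (a sketch
only, not the patch itself):

#include <pthread.h>
#include <signal.h>

/*
 * Called at the start of every thread except thread #0, so that only
 * thread #0 is ever interrupted out of its wait by SIGINT.
 */
static void
block_sigint_in_this_thread(void)
{
    sigset_t    set;

    sigemptyset(&set);
    sigaddset(&set, SIGINT);
    pthread_sigmask(SIG_BLOCK, &set, NULL);
}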
 
Regards,
Yugo Nagata

> If we accept that each thread cancels its clients when it is woken up, 
> which may imply some (usually small) delay, then it is not so different 
> from the current version because the code must wait for 0 to wake up 
> anyway, and it solves (1). The current version does not address potential 
> thread interactions.
> 
> -- 
> Fabien.


-- 
Yugo NAGATA 




Re: Adding a LogicalRepWorker type field

2023-08-09 Thread Peter Smith
On Wed, Aug 9, 2023 at 4:18 PM Amit Kapila  wrote:
>
> On Tue, Aug 8, 2023 at 1:39 PM Peter Smith  wrote:
> >
> > On Fri, Aug 4, 2023 at 5:45 PM Peter Smith  wrote:
> >
> > v4-0001 uses only 3 simple inline functions. Callers always pass
> > parameters as Bharath had suggested.
> >
>
> *
> - Assert(am_leader_apply_worker());
> + Assert(is_leader_apply_worker(MyLogicalRepWorker));
> ...
> - if (am_leader_apply_worker())
> + if (is_leader_apply_worker(MyLogicalRepWorker))
>
> Passing everywhere MyLogicalRepWorker not only increased the code
> change footprint but doesn't appear any better to me. Instead, let
> am_parallel_apply_worker() keep calling
> isParallelApplyWorker(MyLogicalRepWorker) as it is doing now. I feel
> even if you or others feel that is a better idea, we can debate it
> separately after the main patch is done because as far as I understand
> that is not the core idea of this proposal.

Right, those changes were not really core. Reverted as suggested. PSA v5.
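
For readers who don't open the attachments, the rough shape of the change is
something like this (names follow the patch titles; the details in v5 may
differ, and LogicalRepWorker/MyLogicalRepWorker come from
replication/worker_internal.h):

typedef enum LogicalRepWorkerType
{
    WORKERTYPE_UNKNOWN = 0,
    WORKERTYPE_TABLESYNC,
    WORKERTYPE_APPLY,
    WORKERTYPE_PARALLEL_APPLY,
} LogicalRepWorkerType;

/* generic check, usable against any worker slot */
static inline bool
is_tablesync_worker(LogicalRepWorker *worker)
{
    return worker->type == WORKERTYPE_TABLESYNC;
}

/* the am_*() wrappers keep operating on MyLogicalRepWorker, as before */
static inline bool
am_tablesync_worker(void)
{
    return is_tablesync_worker(MyLogicalRepWorker);
}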

>
> * If you do the above then there won't be a need to change the
> variable name is_parallel_apply_worker in logicalrep_worker_launch.
>

Done.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v5-0002-Switch-on-worker-type.patch
Description: Binary data


v5-0001-Add-LogicalRepWorkerType-enum.patch
Description: Binary data


Re: PG 16 draft release notes ready

2023-08-09 Thread Bruce Momjian
FYI, the current PG 16 release notes are available at:

https://momjian.us/pgsql_docs/release-16.html

I plan to add markup next week.  I am sorry I was away most of July so
could not update this until now.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: 2023-08-10 release announcement draft

2023-08-09 Thread Jonathan S. Katz

On 8/8/23 11:13 PM, Robert Treat wrote:


"Users who have skipped one or more update releases may need to run
additional, post-update steps; "

The comma should be removed.

"please see the release notes for earlier versions for details."

Use of 'for' twice is grammatically incorrect; I am partial to "please
see the release notes from earlier versions for details." but could
also see "please see the release notes for earlier versions to get
details."


Interestingly, I think this language has been unmodified for a while. 
Upon reading it, I agree, and took your suggestions.


Thanks,

Jonathan



OpenPGP_signature
Description: OpenPGP digital signature


Re: 2023-08-10 release announcement draft

2023-08-09 Thread Jonathan S. Katz

On 8/9/23 1:04 AM, Noah Misch wrote:

On Mon, Aug 07, 2023 at 10:03:44PM -0400, Jonathan S. Katz wrote:

Fixes in PostgreSQL 16 Beta 3
-

The following includes fixes included in PostgreSQL 16 Beta 3:


With both "includes" and "included" there, this line reads awkwardly to me.
I'd just delete the line, since the heading has the same information.


I took your suggestion. Thanks!

Jonathan



OpenPGP_signature
Description: OpenPGP digital signature


Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-09 Thread Masahiko Sawada
On Wed, Aug 9, 2023 at 1:15 PM Amit Kapila  wrote:
>
> On Wed, Aug 9, 2023 at 8:01 AM Masahiko Sawada  wrote:
> >
> > On Mon, Aug 7, 2023 at 6:02 PM Amit Kapila  wrote:
> > >
> > > On Mon, Aug 7, 2023 at 2:02 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > WAL records for hint bit updates could be generated even in upgrading 
> > > > mode?
> > > >
> > >
> > > Do you mean these records can be generated during reading catalog tables?
> >
> > Yes.
> >
>
> BTW, Kuroda-San has verified and found that three types of records
> (including XLOG_FPI_FOR_HINT) can be generated by the system during
> the upgrade. See email [1].
>
> > >
> > > > > I feel if there is a chance of any WAL activity during the
> > > > > upgrade, we need to either change the check to ensure such WAL records
> > > > > are expected or document the same in some way.
> > > >
> > > > Yes, but how does it work with the above idea of limiting the number
> > > > of WAL records to read? If XLOG_FPI_FOR_HINT can still be generated in
> > > > the upgrade mode, we cannot predict how many such records are
> > > > generated after the latest CHECKPOINT_SHUTDOWN.
> > > >
> > >
> > > Right, as said earlier, in that case, we need to rely on the type of 
> > > records.
> >
> > Another idea would be that before starting the old cluster we check if
> > the slot's confirmed_flush_lsn in the slot state file matches the
> > latest checkpoint LSN got by pg_controlfile. We need another tool to
> > dump the slot state file, though.
> >
>
> I feel it would be a good idea to provide such a tool for users to
> avoid getting errors during upgrade but I think the upgrade code still
> needs to ensure that there are no WAL records between
> > confirmed_flush_lsn and SHUTDOWN_CHECKPOINT other than required. Or, do you
> want to say that we don't do any verification check during the upgrade
> and let the data loss happens if the user didn't ensure that by
> running such a tool?

I meant that if we can check the slot state file while the old cluster
stops, we can ensure there are no WAL records between slot's
confirmed_flush_lsn (in the state file) and the latest checkpoint (in
the control file).

>
> > >
> > > > I'm not really sure we should always perform the slot's
> > > > confirmed_flush_lsn check by default in the first place. With this
> > > > check, the upgrade won't be able to proceed if there is any logical
> > > > slot that is not used by logical replication (or something streaming
> > > > the changes using walsender), right? For example, if a user uses a
> > > > program that periodically consumes the changes from the logical slot,
> > > > the slot would not be able to pass the check even if the user executed
> > > > pg_logical_slot_get_changes() just before shutdown. The backend
> > > > process who consumes the changes is always terminated before the
> > > > shutdown checkpoint. On the other hand, I think there are cases where
> > > > the user can ensure that no meaningful WAL records are generated after
> > > > the last pg_logical_slot_get_changes(). I'm concerned that this check
> > > > might make upgrading such cases cumbersome unnecessarily.
> > > >
> > >
> > > You are right and I have mentioned the same case today in my response
> > > to Jonathan but do you have better ideas to deal with such slots than
> > > to give an ERROR?
> >
> > It makes sense to me to give an ERROR for such slots but does it also
> > make sense to make the check optional?
> >
>
> We can do that if we think so. We have two ways to make this check
> optional (a) have a switch like --include-logical-replication-slots as
> the proposed patch has which means by default we won't try to upgrade
> slots; (b) have a switch like --exclude-logical-replication-slots as
> Jonathan proposed which means we will exclude slots only if specified
> by user. Now, one thing to note is that we don't seem to have any
> include/exclude switch in the upgrade which I think indicates users by
> default prefer to upgrade everything. Now, even if we decide not to
> give any switch initially but do it only if there is a user demand for
> it then also users will have a way to proceed with an upgrade which is
> by dropping such slots. Do you have any preference?

TBH I'm not sure if there is a use case where the user wants to
exclude replication slots during the upgrade. Including replication
slots by default seems to be better to me, at least for now. I
initially thought asking for users to drop replication slots that
possibly have not consumed all WAL records would not be a good idea,
but since we already do such things in check.c I now think it would
not be a problem. I guess it would be great if we can check WAL
records between slots' confirmed_flush_lsn and the latest LSN, and if
there are no meaningful WAL records there we can upgrade the
replication slots.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Kyotaro Horiguchi
At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier  
> wrote in 
> > > While it's a kind of bug in total, we encountered a case where an
> > > excessively large xl_tot_len actually came from a corrupted
> > > record. [1]
> > 
> > Right, I remember this one.  I think that Thomas was pretty much right
> > that this could be caused because of a lack of zeroing in the WAL
> > pages.
> 
> We have treated every kind of broken data as end-of-recovery, like
> incorrect rm_id or prev link including excessively large record length
> due to corruption. This patch is going to change the behavior only for
> the last one. If you think there can't be non-zero broken data, we
> should inhibit proceeding recovery after all non-zero incorrect
> data. This seems to be a quite big change in our recovery policy.
> 
> > There are a few options on the table, only doable once the WAL reader
> > provider the error state to the startup process:
> > 1) Retry a few times and FATAL.
> > 2) Just FATAL immediately and don't wait.
> > 3) Retry and hope for the best that the host calms down.
> 
> 4) Wrap up recovery then continue to normal operation.
> 
> This is the traditional behavior for corrupt WAL data.
> 
> > I have not seen this issue being much of an issue in the field, so
> > perhaps option 2 with the structure of 0002 and a FATAL when we catch
> > XLOG_READER_OOM in the switch would be enough.  At least that's enough
> > for the cases we've seen.  I'll think a bit more about it, as well.
> > 
> > Yeah, agreed.  That's orthogonal to the issue reported by Ethan,
> > unfortunately, where he was able to trigger the issue of this thread
> > by manipulating the sizing of a host after producing a record larger
> > than what the host could afford after the resizing :/
> 
> I'm not entirely certain, but if you were to ask me which is more
> probable during recovery - encountering a correct record that's too
> lengthy for the server to buffer or stumbling upon a corrupt byte
> sequence - I'd bet on the latter.

... of course this refers to crash recovery. For replication, we
should keep retrying the current record until the operator commands
promotion.

> I'm not sure how often users encounter corrupt WAL data, but I believe
> they should have the option to terminate recovery and then switch to
> normal operation.
> 
> What if we introduced an option to increase the timeline whenever
> recovery hits a data error? If that option is disabled, the server stops
> when recovery detects incorrect data, except in the case of an
> OOM. An OOM causes a record retry.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Kyotaro Horiguchi
At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier  wrote 
in 
> > While it's a kind of bug in total, we encountered a case where an
> > excessively large xl_tot_len actually came from a corrupted
> > record. [1]
> 
> Right, I remember this one.  I think that Thomas was pretty much right
> that this could be caused because of a lack of zeroing in the WAL
> pages.

We have treated every kind of broken data as end-of-recovery, like
incorrect rm_id or prev link including excessively large record length
due to corruption. This patch is going to change the behavior only for
the last one. If you think there can't be non-zero broken data, we
should inhibit proceeding recovery after all non-zero incorrect
data. This seems to be a quite big change in our recovery policy.

> There are a few options on the table, only doable once the WAL reader
> provider the error state to the startup process:
> 1) Retry a few times and FATAL.
> 2) Just FATAL immediately and don't wait.
> 3) Retry and hope for the best that the host calms down.

4) Wrap up recovery then continue to normal operation.

This is the traditional behavior for corrupt WAL data.

> I have not seen this issue being much of an issue in the field, so
> perhaps option 2 with the structure of 0002 and a FATAL when we catch
> XLOG_READER_OOM in the switch would be enough.  At least that's enough
> for the cases we've seen.  I'll think a bit more about it, as well.
> 
> Yeah, agreed.  That's orthogonal to the issue reported by Ethan,
> unfortunately, where he was able to trigger the issue of this thread
> by manipulating the sizing of a host after producing a record larger
> than what the host could afford after the resizing :/

I'm not entirely certain, but if you were to ask me which is more
probable during recovery - encountering a correct record that's too
lengthy for the server to buffer or stumbling upon a corrupt byte
sequence - I'd bet on the latter.

I'm not sure how often users encounter corrupt WAL data, but I believe
they should have the option to terminate recovery and then switch to
normal operation.

What if we introduced an option to increase the timeline whenever
recovery hits a data error? If that option is disabled, the server stops
when recovery detects incorrect data, except in the case of an
OOM. An OOM causes a record retry.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Fix last unitialized memory warning

2023-08-09 Thread Julien Rouhaud
On Wed, Aug 09, 2023 at 10:29:56AM -0500, Tristan Partin wrote:
> On Wed Aug 9, 2023 at 10:02 AM CDT, Peter Eisentraut wrote:
> >
> > This patch has apparently upset one buildfarm member with a very old
> > compiler:
> > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=lapwing&br=HEAD
> >
> > Any thoughts?
>
> Best I could find is SO question[0] which links out to[1]. Try this patch.
> Otherwise, a memset() would probably do too.

Yes, it's a buggy warning that came up in the past a few times as I recall, for
which we previously used the {{...}} approach to silence it.

As there have been previous complaints about it, I removed the -Werror from
lapwing and forced a new run to make it green again.
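
For anyone hitting the same warning elsewhere, the two workarounds mentioned
above look roughly like this (illustration with a made-up struct):

#include <string.h>

typedef struct Inner { int a; int b; } Inner;
typedef struct Outer { Inner in; int c; } Outer;

void
zero_init_example(void)
{
    /* Extra braces keep old GCC versions from warning about "= {0}". */
    Outer       o1 = {{0}};

    /* memset() is the equally valid alternative. */
    Outer       o2;

    memset(&o2, 0, sizeof(o2));
    (void) o1;
    (void) o2;
}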




Re: PG 16 draft release notes ready

2023-08-09 Thread Bruce Momjian
On Sat, Aug  5, 2023 at 04:08:47PM -0700, Noah Misch wrote:
> > Remove libpq support for SCM credential authentication (Michael Paquier)
> 
> Since the point of removing it is the deep unlikelihood of anyone using it, I
> wouldn't list this in "incompatibilities".

I moved this to the Source Code section.

> > Deprecate createuser option --role (Nathan Bossart)
> 
> This is indeed a deprecation, not a removal.  By the definition of
> "deprecate", it's not an incompatibility.

I moved this to the Server Applications section.

Thanks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: PG 16 draft release notes ready

2023-08-09 Thread Bruce Momjian
On Sat, Aug  5, 2023 at 04:08:47PM -0700, Noah Misch wrote:
> > Author: Robert Haas 
> > 2022-08-25 [e3ce2de09] Allow grant-level control of role inheritance 
> > behavior.
> > -->
> > 
> > 
> > 
> > Allow GRANT to control role inheritance behavior (Robert Haas)
> > 
> > 
> > 
> > By default, role inheritance is controlled by the inheritance status of the 
> > member role.  The new GRANT clauses WITH INHERIT and WITH ADMIN can now 
> > override this.
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Allow roles that create other roles to automatically inherit the new role's 
> > rights or SET ROLE to the new role (Robert Haas, Shi Yu)
> > 
> > 
> > 
> > This is controlled by server variable createrole_self_grant.
> > 
> > 
> 
> Similarly, v16 radically changes the CREATE ROLE ... WITH INHERIT clause.  The
> clause used to "change the behavior of already-existing grants."  Let's merge
> these two and move the combination to the incompatibilities section.

I need help with this.  I don't understand how they can be combined, and
I don't understand the incompatibility text in commit e3ce2de09d:

If a GRANT does not specify WITH INHERIT, the behavior is based on
whether the member role is marked INHERIT or NOINHERIT. This means
that if all roles are marked INHERIT or NOINHERIT before any role
grants are performed, the behavior is identical to what we had before;
otherwise, it's different, because ALTER ROLE [NO]INHERIT now only
changes the default behavior of future grants, and has no effect on
existing ones.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Use of additional index columns in rows filtering

2023-08-09 Thread Peter Geoghegan
On Wed, Aug 9, 2023 at 11:15 AM Tomas Vondra
 wrote:
> Cool. I'll try to build my own set of examples that I find interesting
> either because it's what the patch aims to help with, or because I
> expect it to be problematic for some reason. And then we can compare.

That would be great. I definitely want to make this a collaborative thing.

> Yup, I agree with that principle. The AM can evaluate the expression in
> a smarter way, without the visibility checks.

Attached text document (which I guess might be my first draft) is an
attempt to put the discussion up until this point on a more formal
footing.

The format here tries to reduce the principles to a handful of bullet
points. For example, one line reads:

+ Index quals are better than equivalent index filters because bitmap
index scans can only use index quals

I'm pretty sure that these are all points that you and I both agree on
already. But you should confirm that. And make your own revisions, as
you see fit.

It's definitely possible that I overlooked an interesting and durable
advantage that index filters have. If there is some other way in which
index filters are likely to remain the best and only viable approach,
then we should note that. I just went with the really obvious case of
an expression that definitely needs visibility checks to avoid ever
throwing a division-by-zero error, related to some invisible tuple.
It's an extreme case in that it focuses on requirements that seem just
about unavoidable in any future world. (Come to think of it, the
biggest and most durable advantage for index filters is probably just
how general they are, which I do mention.)

-- 
Peter Geoghegan
General principles for index quals
==================================

Tomas Vondra's "evaluate filters on the index tuple" patch works by turning
expressions that didn't become index quals into index filters (not table
filters) where possible.  The general idea is that the planner should recognize
a kind of "qual hierarchy", where it is always preferable to make each index
path use true index quals where possible, but, failing that, it is still
preferable to make the predicate into an index filter rather than into a table
filter.

In planner:
Index Quals > Index Filter > Table Filter
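
As a concrete sketch (not output captured from the patch -- just the expected
shape, using the regression suite's tenk1 table and its existing index on
(thousand, tenthous)):

  -- "thousand = 42" can become a true index qual (an Index Cond);
  -- "tenthous::text LIKE '%7'" only references indexed columns, so the patch
  --   can evaluate it as an index filter on the index tuple;
  -- "stringu1 = 'AAAAxx'" references a non-indexed column and has to stay a
  --   table filter.
  EXPLAIN (COSTS OFF)
  SELECT ctid FROM tenk1
  WHERE thousand = 42
    AND tenthous::text LIKE '%7'
    AND stringu1 = 'AAAAxx';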

Although this general design makes sense, it nevertheless can lead to
suboptimal outcomes: situations where the planner chooses a plan that is
obviously not as good as some hypothetical other plan that manages to use true
index quals through some (currently unimplemented) other means.

This isn't really a problem with the patch itself; it is more a case of the
planner lacking complementary optimizations that make predicates into true
index quals before the mechanism added by the patch is even reached.  The
"catch-all" nature of the patch isn't a problem to be fixed -- its very general
nature is a key strength of the approach (it uses full expression evaluation,
which is much more general than any approach that ultimately reduces predicates
to indexable operators from an opclass).

This document lists certain specific scenarios where the patch doesn't quite
do as well as one would hope.  That puts things on a good footing for adding
other complementary techniques later on.  Theoretically this will be
independent work, that could have happened at any point in the past.  But the
patch has brought these other related issues into sharp relief for the first
time -- so it doesn't necessarily feel all that independent.

A working "taxonomy" for all of these techniques seems useful, even one such
as this, that's generally acknowledged to be incomplete and far from perfect.

Why index filters are better than equivalent table filters
-----------------------------------------------------------

This part is very simple.

+ They're better because they have a decent chance of avoiding significant
amounts of heap accesses for expression evaluation.

This extends some of the benefits that index-only scans have to plain index
scans.

Why index quals are better than equivalent index filters
---------------------------------------------------------

The advantages that index quals have over index filters can be much less clear
than one would hope because index filters are confusingly similar to index
quals.  But index quals are significantly better, overall.  This is guaranteed
to be true, assuming we're comparing "equivalent" index quals and index filters
-- logically interchangeable forms of quals with different physical/runtime
characteristics.

(An index filter might be better than an index qual when it happens to be much
more selective, because it filters tuples in some fundamentally different way,
but that's not relevant to this discussion.  We're focussed on clear-cut cases,
where we can speak with the most confidence, since that's a useful basis for
further improvements later on.)

We may even be missing out on the advantages of true index quals in cases

Re: PG 16 draft release notes ready

2023-08-09 Thread Michael Paquier
On Wed, Aug 09, 2023 at 05:45:27PM -0400, Bruce Momjian wrote:
> That is very helpful.  I added this to the release notes Server
> Configuration section:
> 
>   
>   
>   Allow huge pages to work on newer versions of Windows 10 (Thomas Munro)
>   
>   
>   
>   This adds the special handling required to enable huge pages on newer
>   versions of Windows 10.
>   
>   

Looks good to me, thanks for updating the notes!
--
Michael


signature.asc
Description: PGP signature


Should the archiver process always make sure that the timeline history files exist in the archive?

2023-08-09 Thread Jimmy Yih
Hello pgsql-hackers,

While testing out some WAL archiving and PITR scenarios, it was observed that
enabling WAL archiving for the first time on a primary that was on a timeline
higher than 1 would not initially archive the timeline history file for the
timeline it was currently on. While this might be okay for most use cases, there
are scenarios where this leads to unexpected failures that seem to expose some
flaws in the logic.

Scenario 1:
Take a backup of a primary on timeline 2 with `pg_basebackup -Xnone`. Create a
standby from that backup that continuously restores from the WAL archives; the
standby will not contain the timeline 2 history file. The standby will operate
normally, but if you try to create a cascading standby off it using streaming
replication, the cascading standby's WAL receiver will continuously FATAL while
trying to request the timeline 2 history file that the main standby does not
have.
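
A quick way to observe that state using only core functions (the comments show
what I'd expect in this scenario, not captured output):

  SELECT timeline_id FROM pg_control_checkpoint();   -- reports 2 on the standby
  SELECT name FROM pg_ls_waldir()
    WHERE name LIKE '%.history';                     -- returns no rows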

Scenario 2:
Take a backup of a primary on timeline 2 with `pg_basebackup -Xnone`. Then try
to create a new node by doing PITR with recovery_target_timeline set to
'current' or 'latest' which will succeed. However, doing PITR with
recovery_target_timeline = '2' will fail since it is unable to find the timeline
2 history file in the WAL archives. This may be a bit contradicting since we
allow 'current' and 'latest' to recover but explicitly setting the
recovery_target_timeline to the control file's timeline id ends up with failure.

Attached is a patch containing two TAP tests that demonstrate the scenarios.

My questions are:
1. Why doesn't the archiver process try to archive timeline history files when
   WAL archiving is first configured and/or continually check (maybe when the
   archiver process gets started before the main loop)?
2. Why does explicitly setting the recovery_target_timeline to the control
   file's timeline id not follow the same logic as recovery_target_timeline set
   to 'current'?
3. Why does a cascaded standby require the timeline history file of its control
   file's timeline id (startTLI) when the main replica is able to operate fine
   without the timeline history file?

Note that my initial observations came from testing with pgBackRest (copying
pg_wal/ during backup is disabled by default) but using `pg_basebackup -Xnone`
reproduced the issues similarly and is what I present in the TAP tests. At the
moment, the only workaround I can think of is to manually run the
archive_command on the missing timeline history file(s).

Are these valid issues that should be looked into, or are they expected? Scenario
2 seems like it could be fixed easily if we check whether the
recovery_target_timeline numeric value equals the control file's timeline
id (compare rtli and recoveryTargetTLI in validateRecoveryParameters()?), but I
wasn't sure if the opposite approach would be better, where we make 'current'
and 'latest' also require retrieving the timeline history files, to help prevent
Scenario 1.

Regards,
Jimmy Yih

0001-TAP-tests-to-show-missing-timeline-history-issues.patch
Description:  0001-TAP-tests-to-show-missing-timeline-history-issues.patch


Re: Support to define custom wait events for extensions

2023-08-09 Thread Michael Paquier
On Wed, Aug 09, 2023 at 08:10:42PM +0900, Masahiro Ikeda wrote:
> * create second hash table to find a event id from a name to
>   identify uniquness. It enable extensions which don't use share
>   memory for their use to define custom wait events because
>   WaitEventExtensionNew() will not allocate duplicate wait events.

Okay, a second hash table to check if events are registered works for
me.

> * create a PoC patch to show that extensions that don't use shared
>   memory can define custom wait events.
>  (v2-0002-poc-custom-wait-event-for-dblink.patch)
> 
> I'm worried about:
> * Is 512 (wee_hash_max_size) a sufficient maximum number of custom wait
>   events?

Thanks for sending a patch!

I'm OK to start with that.  This could always be revisited later, but
even for a server loaded with a bunch of extensions that looks more
than enough to me.

> * Is there any way to avoid forcing extensions that don't use shared
>   memory, like dblink, to acquire AddinShmemInitLock?

Yes, they don't need it at all as the dynahashes are protected with
their own LWLocks.

+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock     44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock   46
 NotifyQueueTailLock    47
+WaitEventExtensionLock  48

This new LWLock needs to be added to wait_event_names.txt, or it won't
be reported to pg_stat_activity and it would not be documented when
the sgml docs are generated from the txt data.

-extern uint32 WaitEventExtensionNew(void);
-extern void WaitEventExtensionRegisterName(uint32 wait_event_info,
-  const char *wait_event_name);
+extern uint32 WaitEventExtensionNew(const char *wait_event_name);
Looks about right, and the docs are refreshed.

+static const int wee_hash_init_size = 128;
+static const int wee_hash_max_size = 512;
I would use a few #defines with upper-case characters here instead as
these are constants for us.

Now that it is possible to rely on LWLocks for the hash tables, more
cleanup is possible in worker_spi, with the removal of
worker_spi_state, the shmem hooks and their routines.  The only thing
that should be needed is something like that at the start of
worker_spi_main() (same position as worker_spi_shmem_init now):
+static uint32 wait_event = 0;
[...]
+   if (wait_event == 0)
+   wait_event = WaitEventExtensionNew("worker_spi_main");
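
For reference, a hedged sketch of how that would then surface to users (the
result row below is made up):

  SELECT pid, wait_event_type, wait_event
  FROM pg_stat_activity
  WHERE wait_event_type = 'Extension';
  --  pid  | wait_event_type |   wait_event
  -- ------+-----------------+-----------------
  --  1234 | Extension       | worker_spi_main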

The updates in 001_worker_spi.pl look OK.

+* The entry must be stored because it's registered in
+* WaitEventExtensionNew().
 */
-   eventId -= NUM_BUILTIN_WAIT_EVENT_EXTENSION;
+   if (!entry)
+   ereport(ERROR,
+   errmsg("could not find the name for custom wait event ID %u", 
eventId));
Yeah, I think that's better than just falling back to "extension".  An
ID reported in pg_stat_activity should always have an entry, or we
have race conditions.  This should be an elog(ERROR), as in
this-error-shall-never-happen.  No need to translate the error string,
either (the docs have been updated with this change, thanks!).

Additionally, LWLockHeldByMeInMode(AddinShmemInitLock) in
WaitEventExtensionNew() should not be needed, thanks to
WaitEventExtensionLock.

+ * WaitEventExtensionNameHash is used to find the name from a event id.
+ * It enables all backends look up them without additional processing
+ * per backend like LWLockRegisterTranche().

It does not seem necessary to mention LWLockRegisterTranche().

+ * WaitEventExtensionIdHash is used to find the event id from a name.
+ * Since it can identify uniquness by the names, extensions that do not
+ * use shared memory also be able to define custom wait events without
+ * defining duplicate wait events.

Perhaps this could just say that this table is necessary to ensure
that we don't have duplicated entries when registering new strings
with their IDs?  s/uniquness/uniqueness/.  The second part of the
sentence about shared memory does not seem necessary now.

+   sz = add_size(sz, hash_estimate_size(wee_hash_init_size,
+   sizeof(WaitEventExtensionNameEntry)));
+   sz = add_size(sz, hash_estimate_size(wee_hash_init_size,
+   sizeof(WaitEventExtensionIdEntry)));

Err, this should use the max size, and not the init size for the size
estimation, no?

+   if (strlen(wait_event_name) >= NAMEDATALEN)
+   ereport(ERROR,
+   errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+   errmsg("wait event name is too long"));
This could just be an elog(ERROR), I assume, as that could only be
reached by developers.  The string needs to be rewritten, like "cannot
use custom wait event string longer than %u characters", or something
like that.

+   if (wait_event_info == NULL)
+   {
+   wait_event_info = (uint32 *) MemoryContextAlloc(TopMemoryContext, 
sizeof(uint32));
+   

Re: pg_dump needs SELECT privileges on irrelevant extension table

2023-08-09 Thread Jacob Champion
Hi all,

v3 fixes a doc comment I forgot to fill in; there are no other code
changes. To try to further reduce the activation energy, I've also
attached an attempt at a backport to 11. The main difference is the
absence of catalogIdHash, which showed up in 15, so we don't get the
benefit of that deduplication.

Thanks,
--Jacob
1:  6373939500 = 1:  6373939500 Add failing test for undumped extension table
2:  41db5f9c75 ! 2:  3e997da147 pg_dump: skip lock for extension tables without 
policies
@@ src/bin/pg_dump/pg_dump.c: dumpLOs(Archive *fout, const void *arg)
  }
  
 +/*
-+ * getTablesWithPolicies TODO
++ * getTablesWithPolicies
++ *   retrieve the IDs of all tables with pg_policy entries
 + */
 +void
 +getTablesWithPolicies(Archive *fout)
From 637393950006c3752c6db3ca64f84b18fcaa4896 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Thu, 22 Jun 2023 16:21:41 -0700
Subject: [PATCH v3 1/2] Add failing test for undumped extension table

Currently, SELECT permission is required for extension tables even
if they're internal (i.e. undumpable) and have no RLS policies. Add a
failing test for this situation.
---
 src/test/modules/test_pg_dump/t/001_base.pl   | 27 +++
 .../test_pg_dump/test_pg_dump--1.0.sql|  2 ++
 2 files changed, 29 insertions(+)

diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index d00c3544e9..68a767d2f5 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -175,6 +175,19 @@ my %pgdump_runs = (
 			'postgres',
 		],
 	},
+
+	# regress_dump_login_role shouldn't need SELECT rights on internal
+	# (undumped) extension tables
+	privileged_internals => {
+		dump_cmd => [
+			'pg_dump', '--no-sync', "--file=$tempdir/privileged_internals.sql",
+			# these two tables are irrelevant to the test case
+			'--exclude-table=regress_pg_dump_schema.external_tab',
+			'--exclude-table=regress_pg_dump_schema.extdependtab',
+			'--username=regress_dump_login_role', 'postgres',
+		],
+	},
+
 	schema_only => {
 		dump_cmd => [
 			'pg_dump', '--no-sync', "--file=$tempdir/schema_only.sql",
@@ -284,6 +297,7 @@ my %full_runs = (
 	exclude_table => 1,
 	no_privs => 1,
 	no_owner => 1,
+	privileged_internals => 1,
 	with_extension => 1,
 	without_extension => 1);
 
@@ -321,6 +335,16 @@ my %tests = (
 		like => { pg_dumpall_globals => 1, },
 	},
 
+	'CREATE ROLE regress_dump_login_role' => {
+		create_order => 1,
+		create_sql => 'CREATE ROLE regress_dump_login_role LOGIN;',
+		regexp => qr/^
+			\QCREATE ROLE regress_dump_login_role;\E
+			\n\QALTER ROLE regress_dump_login_role WITH \E.*\Q LOGIN \E.*;
+			\n/xm,
+		like => { pg_dumpall_globals => 1, },
+	},
+
 	'GRANT ALTER SYSTEM ON PARAMETER full_page_writes TO regress_dump_test_role'
 	  => {
 		create_order => 2,
@@ -704,6 +728,7 @@ my %tests = (
 			data_only => 1,
 			extension_schema => 1,
 			pg_dumpall_globals => 1,
+			privileged_internals => 1,
 			section_data => 1,
 			section_pre_data => 1,
 			# Excludes this schema as extension is not listed.
@@ -720,6 +745,7 @@ my %tests = (
 			data_only => 1,
 			extension_schema => 1,
 			pg_dumpall_globals => 1,
+			privileged_internals => 1,
 			section_data => 1,
 			section_pre_data => 1,
 			# Excludes this schema as extension is not listed.
@@ -743,6 +769,7 @@ my %tests = (
 			# Excludes the extension and keeps the schema's data.
 			without_extension_internal_schema => 1,
 		},
+		unlike => { privileged_internals => 1 },
 	},);
 
 #
diff --git a/src/test/modules/test_pg_dump/test_pg_dump--1.0.sql b/src/test/modules/test_pg_dump/test_pg_dump--1.0.sql
index 110f7eef66..1c68e146d9 100644
--- a/src/test/modules/test_pg_dump/test_pg_dump--1.0.sql
+++ b/src/test/modules/test_pg_dump/test_pg_dump--1.0.sql
@@ -12,11 +12,13 @@ CREATE SEQUENCE regress_pg_dump_seq;
 
 CREATE SEQUENCE regress_seq_dumpable;
 SELECT pg_catalog.pg_extension_config_dump('regress_seq_dumpable', '');
+GRANT SELECT ON SEQUENCE regress_seq_dumpable TO public;
 
 CREATE TABLE regress_table_dumpable (
 	col1 int check (col1 > 0)
 );
 SELECT pg_catalog.pg_extension_config_dump('regress_table_dumpable', '');
+GRANT SELECT ON regress_table_dumpable TO public;
 
 CREATE SCHEMA regress_pg_dump_schema;
 
-- 
2.25.1

From 3e997da1472d4c7bc067a2178227c346bee91bfb Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Thu, 16 Mar 2023 11:46:08 -0700
Subject: [PATCH v3 2/2] pg_dump: skip lock for extension tables without
 policies

If a user without SELECT permissions on an internal extension table
tries to dump the extension, the dump will fail while trying to lock the
table with ACCESS SHARE, even though the user doesn't want or need to
dump the table in question. (The lock is taken to allow later
pg_get_expr() calls on pg_policy to remain consistent in the face of
concurrent schema changes.)

It'd be ideal not to require SELECT permissions

Re: PG 16 draft release notes ready

2023-08-09 Thread Bruce Momjian
On Sat, Aug  5, 2023 at 04:08:47PM -0700, Noah Misch wrote:
> On Thu, May 18, 2023 at 04:49:47PM -0400, Bruce Momjian wrote:
> > https://momjian.us/pgsql_docs/release-16.html
> 
> > 
> > 
> > 
> > 
> > Restrict the privileges of CREATEROLE roles (Robert Haas)
> > 
> > 
> > 
> > Previously roles with CREATEROLE privileges could change many aspects of 
> > any non-superuser role.  Such changes, including adding members, now 
> > require the role requesting the change to have ADMIN OPTION
> > permission.
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Improve logic of CREATEROLE roles ability to control other roles (Robert 
> > Haas)
> > 
> > 
> > 
> > For example, they can change the CREATEDB, REPLICATION, and BYPASSRLS 
> > properties only if they also have those permissions.
> > 
> > 
> 
> CREATEROLE is a radically different feature in v16.  In v15-, it was an
> almost-superuser.  In v16, informally speaking, it can create and administer
> its own collection of roles, but it can't administer roles outside its
> collection or grant memberships or permissions not offered to itself.  Hence,
> let's move these two into the incompatibilities section.  Let's also merge
> them, since f1358ca52 is just doing to clauses like CREATEDB what cf5eb37c5
> did to role memberships.

Good point. I have adjusted this item with the attached patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/release-16.sgml b/doc/src/sgml/release-16.sgml
index 1213f876f4..cccdc01d11 100644
--- a/doc/src/sgml/release-16.sgml
+++ b/doc/src/sgml/release-16.sgml
@@ -244,6 +244,24 @@ Collations and locales can vary between databases so having them as read-only se
 
 
 
+
+
+
+
+Restrict the privileges of CREATEROLE and its ability to modify other roles (Robert Haas)
+
+
+
+Previously roles with CREATEROLE privileges could change many aspects of any non-superuser role.  Such changes, including adding members, now require the role requesting the change to have ADMIN OPTION
+permission.  For example, they can now change the CREATEDB, REPLICATION, and BYPASSRLS properties only if they also have those permissions.
+
+
+
 
-
-
-
-Restrict the privileges of CREATEROLE roles (Robert Haas)
-
-
-
-Previously roles with CREATEROLE privileges could change many aspects of any non-superuser role.  Such changes, including adding members, now require the role requesting the change to have ADMIN OPTION
-permission.
-
-
-
-
-
-
-
-Improve logic of CREATEROLE roles ability to control other roles (Robert Haas)
-
-
-
-For example, they can change the CREATEDB, REPLICATION, and BYPASSRLS properties only if they also have those permissions.
-
-
-
 

Re: PG 16 draft release notes ready

2023-08-09 Thread Bruce Momjian
On Sun, Jul 23, 2023 at 08:19:55PM +0900, Michael Paquier wrote:
> On Tue, Jul 04, 2023 at 05:32:07PM -0400, Bruce Momjian wrote:
> > On Tue, Jul  4, 2023 at 03:31:05PM +0900, Michael Paquier wrote:
> >> On Thu, May 18, 2023 at 04:49:47PM -0400, Bruce Momjian wrote:
> >> Sawada-san has mentioned on twitter that fdd8937 is not mentioned in
> >> the release notes, and it seems to me that he is right.  This is
> >> described as a bug in the commit log, but it did not get backpatched
> >> because of the lack of complaints.  Also, because we've removed
> >> support for anything older than Windows 10 in PG16, this change was very
> >> easy to do.
> > 
> > I did review this and wasn't sure exactly what I would describe.  It is
> > saying huge pages will now work on some versions of Windows 10 but
> > didn't before?
> 
> Windows 10 has always used a forced automated rolling upgrade process,
> so there are not many versions older than 1703, I suppose.  I don't
> know if large pages were working before 1703 where
> FILE_MAP_LARGE_PAGES has been introduced, and I have never been able
> to test that.  Honestly, I don't think that we need to be picky about
> the version mentioned, as per the forced upgrade process done by
> Microsoft.
> 
> So, my preference would be to keep it simple and add an item like "Fix
> huge pages on Windows 10 and newer versions", with as potential
> subnote "The backend sets a flag named FILE_MAP_LARGE_PAGES to allow
> huge pages", though this is not really mandatory to go down to this
> level of internals, either.

That is very helpful.  I added this to the release notes Server
Configuration section:





Allow huge pages to work on newer versions of Windows 10 (Thomas Munro)



This adds the special handling required to enable huge pages on newer
versions of Windows 10.



-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Adding a pg_servername() function

2023-08-09 Thread Christoph Moench-Tegeder
## GF (phab...@gmail.com):

> In the past I needed such a pg_servername() function because in a project I
> was engaged on they needed to distribute "requests" directed to a in-house
> management service running on network servers, and each one had a DB, so we
> went with pglogical to be used as a (transactional) messaging middleware.

That sounds like an "if all you have is a hammer" kind of architecture.
Or "solutioning by fandom". Short of using an actual addressable
message bus, you could set up the node name as a configuration
parameter, or use cluster_name, or if you really must do some
COPY FROM PROGRAM magic (there's your access control) and just store
the value somewhere.
The more examples of use cases I see, the more I think that actually
beneficial use cases are really rare.
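
For completeness, a rough sketch of the COPY FROM PROGRAM route mentioned above
(it needs superuser or pg_execute_server_program, and the command is only an
example):

  CREATE TEMP TABLE hostinfo (name text);
  COPY hostinfo FROM PROGRAM 'hostname';
  SELECT name FROM hostinfo;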

And now that I checked it: I do have systems with gethostname()
returning an FQDN, and other systems return the (short) hostname
only. And it gets worse when you're talking "container" and
"automatic image deployment". So I believe it's a good thing when
a database does not expose too much of the OS below it...

Regards,
Christoph

-- 
Spare Space




Re: Show WAL write and fsync stats in pg_stat_io

2023-08-09 Thread Melanie Plageman
On Thu, Aug 03, 2023 at 04:38:41PM +0300, Nazir Bilal Yavuz wrote:
> Current status of the patch is:
> - 'WAL read' stats in xlogrecovery.c are added to pg_stat_io.
> - IOCONTEXT_INIT is added to count 'WAL init'. 'WAL init' stats are
> added to pg_stat_io.
> - pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
> - Working on which 'BackendType / IOContext / IOOp' should be banned
> in pg_stat_io.
> - Working on adding 'WAL read' to the xlogreader.c and walsender.c.
> - PendingWalStats.wal_sync and
> PendingWalStats.wal_write_time/PendingWalStats.wal_sync_time are moved
> to pgstat_count_io_op_n()/pgstat_count_io_op_time() respectively.

Cool! Thanks for the summary and for continuing to work on this.

> TODOs:
> - Documentation.
> - Thinking about how to interpret the different IOOps within the two
> IOContexts and discussing what would be useful to count.
> - Decide which 'BackendType / IOContext / IOOp' should not be tracked.
> - Adding 'WAL read' to the xlogreader.c and walsender.c. (This could
> be an another patch)

Yes, I would be explicit that you are not including WAL IO done exclusively in
the context of replication.

> - Adding WAIT_EVENT_WAL_COPY_* operations to pg_stat_io if needed.
> (This could be an another patch)

Yes, I think it makes sense as another patch.

> 
> On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
>  wrote:
> > I think it would be good to count WAL reads even though they are not
> > currently represented in pg_stat_wal. Here is a thread discussing this
> > [1].
> 
> I used the same implementation in the thread link [1]. I added 'WAL
> read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
> xlogreader.c and walsender.c because they cause some failures on:
> '!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
> these. Also, I added Bharath to CC. I have a question about 'WAL
> read':
> 1. There are two places where 'WAL read' happens.
> a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
> time count is equal to XLOG_BLCKSZ but there are some cases it is not.
> For example
> - in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
> - in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
> These nbytes and count variables could be different from XLOG_BLCKSZ.
> 
> b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ bytes:
> pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
> 
> So, what should op_bytes be set to for 'WAL read' operations?

If there is any combination of BackendType and IOContext which will
always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
op_bytes. For other cases, we may have to consider using op_bytes 1 and
tracking read and write IOOps in number of bytes (instead of number of
pages). I don't actually know if there is a clear separation by
BackendType for these different cases.

The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
treat op_bytes * number of reads as an approximation of the number of
bytes read. I don't actually know what makes more sense. I don't think I
would like having a number for bytes that is not accurate.
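
For what it's worth, I'd expect the new rows to be read with something like
this once they exist (the 'wal' object is what the patch adds; the columns are
existing pg_stat_io ones):

  SELECT backend_type, context, reads, writes, fsyncs, op_bytes
  FROM pg_stat_io
  WHERE object = 'wal'
  ORDER BY backend_type, context;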

> > I think we will also want to add an IOContext for WAL initialization.
> > Then we can track how long is spent doing 'WAL init' (including filling
> > the WAL file with zeroes). XLogFileInitInternal() is likely where we
> > would want to add it. And op_bytes for this would likely be
> > wal_segment_size. I thought I heard about someone proposing adding WAL
> > init to pg_stat_wal, but I can't find the thread.
> 
> Done. I created a new IOCONTEXT_INIT IOContext for the 'WAL init'. I
> have a question there:
> 1. Some of the WAL processing happens at initdb (standalone backend
> IOCONTEXT_NORMAL/(IOOP_READ & IOOP_WRITE) and
> IOCONTEXT_INIT/(IOOP_WRITE & IOOP_FSYNC)). Since this happens at
> initdb, AFAIK there is no way to set the 'track_wal_io_timing' and
> 'track_io_timing' variables there. So their timings appear as 0.
> Should I use IsBootstrapProcessingMode() to enable WAL IO timings at
> initdb, or are they not that important?

I don't have an opinion about this. I can see an argument for doing it
either way. We do track other IO during initdb in pg_stat_io.

> > Any that we decide not to count for now should be "banned" in
> > pgstat_tracks_io_op() for clarity. For example, if we create a separate
> > IOContext for WAL file init, I'm not sure what would count as an
> > IOOP_EXTEND in IOCONTEXT_NORMAL for IOOBJECT_WAL.
> >
> > Also, I think there are some backend types which will not generate WAL
> > and we should determine which those are and skip those rows in
> > pgstat_tracks_io_object().
> 
> I agree, I am working on this. I have a couple of questions:
> 1. Can client backend and background worker do IOCONTEXT_NORMAL/IOOP_READ?

I don't know the answer to this.

> 2. Is there an easy way to check if 'BackendType / IOOBJECT_WAL' does
> specific IOOp operations?

I don

Re: Use of additional index columns in rows filtering

2023-08-09 Thread Tomas Vondra
On 8/9/23 19:53, Peter Geoghegan wrote:
> On Wed, Aug 9, 2023 at 10:00 AM Tomas Vondra
>  wrote:
>> Anyway, I find this discussion rather abstract and I'll probably forget
>> half the important cases by next week. Maybe it'd be good to build a set
>> of examples demonstrating the interesting cases? We've already used a
>> couple tenk1 queries for that purpose ...
> 
> That seems wise. I'll try to come up with a list of general principles
> with specific and easy to run examples later on today.
> 

Cool. I'll try to build my own set of examples that I find interesting
either because it's what the patch aims to help with, or because I
expect it to be problematic for some reason. And then we can compare.

>> I'm trying to make the patch to not dependent on such change. In a way,
>> once a clause gets recognized as index qual, it becomes irrelevant for
>> my patch. But the patch also doesn't get any simpler, because it still
>> needs to do the same thing for the remaining quals.
> 
> Practically speaking, I don't see any reason why you won't be able to
> sign off on the set of principles that I'll lay out for work in this
> area, while at the same time continuing with this patch more or less
> as originally planned.
> 
> At one point I thought that your patch might be obligated to
> compensate for its tendency to push down OR-heavy clauses as
> expressions, leading to "risky" plans. While I still have a concern
> about that now, I'm not going to try to make it your problem. I'm now
> inclined to think of this as an existing problem, that your patch will
> increase the prevalence of -- but not to the extent that it makes
> sense to hold it up. To some extent it is up to me to put my money
> where my mouth is.
> 
> I'm optimistic about the prospect of significantly ameliorating (if
> not fixing) the "risky OR expression plan" problem in the scope of my
> work on 17. But even if that doesn't quite happen (say we don't end up
> normalizing to CNF in the way that I've suggested for 17), we should
> at least reach agreement on the best way forward. If we could just
> agree that evaluating complicated OR expressions in index filters is
> much worse than finding a way to pass them down as an equivalent index
> qual (an SAOP), then I could live with it for another release or two.
> 

Yup, I agree with that principle. The AM can evaluate the expression in
a smarter way, without the visibility checks.

> As I said, I mostly just care about having the right general
> principles in place at this point.
> 
>> OTOH if there was some facility to decide if a qual is "safe" to be
>> executed on the index tuple, that'd be nice. But as I already said, I
>> see it more as an additional optimization, as it only applies to a
>> subset of cases.
> 
> I'm happy to go with that.
> 


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PG 16 draft release notes ready

2023-08-09 Thread Bruce Momjian
On Sun, Jul 23, 2023 at 02:09:17PM +0300, Pavel Luzanov wrote:
> Please consider to add item to the psql section:
> 
> Add psql \drg command to display role grants and remove the "Member of"
> column from \du & \dg altogether (d65ddaca)

The release notes are only current as of 2023-06-26 and I will consider
this when I update them next week, thanks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: PG 16 draft release notes ready

2023-08-09 Thread Bruce Momjian
On Sun, Jul 23, 2023 at 12:45:17PM +0800, jian he wrote:
> >   https://momjian.us/pgsql_docs/release-16.html
> >
> >  I will adjust it to the feedback I receive;  that URL will quickly show
> >  all updates.
> 
> >  Create a predefined role and grantable privilege with permission to 
> > perform maintenance operations (Nathan Bossart)
> > The predefined role is called pg_maintain.
> 
> this feature was also reverted.
> https://git.postgresql.org/cgit/postgresql.git/commit/?id=151c22deee66a3390ca9a1c3675e29de54ae73fc

Thanks, fixed based on an earlier report by Laurenz Albe.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: PG 16 draft release notes ready

2023-08-09 Thread Bruce Momjian
On Fri, Jul 14, 2023 at 09:29:42PM +0200, Erik Rijkers wrote:
> On 7/4/23 at 23:32, Bruce Momjian wrote:
> > > > https://momjian.us/pgsql_docs/release-16.html
> 
> I noticed these:
> 
> 'new new RULES'  should be
> 'new RULES'
> 
> 'Perform apply of large transactions'  should be
> 'Performs apply of large transactions'
> (I think)
> 
> 'SQL JSON paths'  should be
> 'SQL/JSON paths'

Fixed with the attached patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/release-16.sgml b/doc/src/sgml/release-16.sgml
index d8e1369844..c84c0f0eda 100644
--- a/doc/src/sgml/release-16.sgml
+++ b/doc/src/sgml/release-16.sgml
@@ -1370,7 +1370,7 @@ Allow custom ICU collation rules to be created (Peter Eisentraut)
 
 
 
-This is done using CREATE COLLATION's new new RULES clause, as well as new options for CREATE DATABASE, createdb, and initdb.
+This is done using CREATE COLLATION's new RULES clause, as well as new options for CREATE DATABASE, createdb, and initdb.
 
 
 
@@ -1471,7 +1471,7 @@ Allow parallel application of logical replication (Hou Zhijie, Wang Wei, Amit Ka
 
 
 
-The CREATE SUBSCRIPTION "streaming" option now supports "parallel" to enable parallel application. Perform apply of large transactions by parallel workers.  The number of parallel workers is controlled by
+The CREATE SUBSCRIPTION "streaming" option now supports "parallel" to enable application of large transactions by parallel workers.  The number of parallel workers is controlled by
 the new server variable max_parallel_apply_workers_per_subscription. Wait events LogicalParallelApplyMain, LogicalParallelApplyStateChange, and LogicalApplySendData were also added.  Column leader_pid was
 added to system view pg_stat_subscription to track parallel activity.
 
@@ -1803,7 +1803,7 @@ Author: Peter Eisentraut 
 
 
 
-Add support for enhanced numeric literals in SQL JSON paths (Peter Eisentraut)
+Add support for enhanced numeric literals in SQL/JSON paths (Peter Eisentraut)
 
 
 



Re: Use of additional index columns in rows filtering

2023-08-09 Thread Peter Geoghegan
On Wed, Aug 9, 2023 at 10:00 AM Tomas Vondra
 wrote:
> Anyway, I find this discussion rather abstract and I'll probably forget
> half the important cases by next week. Maybe it'd be good to build a set
> of examples demonstrating the interesting cases? We've already used a
> couple tenk1 queries for that purpose ...

That seems wise. I'll try to come up with a list of general principles
with specific and easy to run examples later on today.

> I'm trying to make the patch to not dependent on such change. In a way,
> once a clause gets recognized as index qual, it becomes irrelevant for
> my patch. But the patch also doesn't get any simpler, because it still
> needs to do the same thing for the remaining quals.

Practically speaking, I don't see any reason why you won't be able to
sign off on the set of principles that I'll lay out for work in this
area, while at the same time continuing with this patch more or less
as originally planned.

At one point I thought that your patch might be obligated to
compensate for its tendency to push down OR-heavy clauses as
expressions, leading to "risky" plans. While I still have a concern
about that now, I'm not going to try to make it your problem. I'm now
inclined to think of this as an existing problem, that your patch will
increase the prevalence of -- but not to the extent that it makes
sense to hold it up. To some extent it is up to me to put my money
where my mouth is.

I'm optimistic about the prospect of significantly ameliorating (if
not fixing) the "risky OR expression plan" problem in the scope of my
work on 17. But even if that doesn't quite happen (say we don't end up
normalizing to CNF in the way that I've suggested for 17), we should
at least reach agreement on the best way forward. If we could just
agree that evaluating complicated OR expressions in index filters is
much worse than finding a way to pass them down as an equivalent index
qual (an SAOP), then I could live with it for another release or two.

As I said, I mostly just care about having the right general
principles in place at this point.

> OTOH if there was some facility to decide if a qual is "safe" to be
> executed on the index tuple, that'd be nice. But as I already said, I
> see it more as an additional optimization, as it only applies to a
> subset of cases.

I'm happy to go with that.

-- 
Peter Geoghegan




Re: PG 16 draft release notes ready

2023-08-09 Thread Bruce Momjian
On Fri, Jul 14, 2023 at 08:20:59PM +0200, Laurenz Albe wrote:
> On Thu, 2023-05-18 at 16:49 -0400, Bruce Momjian wrote:
> > I have completed the first draft of the PG 16 release notes.
> 
> The release notes still have:
> 
> - Have initdb use ICU by default if ICU is enabled in the binary (Jeff Davis)
> 
>   Option --locale-provider=libc can be used to disable ICU.
> 
> 
> But this was reverted in 2535c74b1a6190cc42e13f6b6b55d94bff4b7dd6.

FYI, this was corrected in this commit:

commit c729642bd7
Author: Bruce Momjian 
Date:   Fri Jun 30 17:35:47 2023 -0400

doc: PG 16 relnotes, remove "Have initdb use ICU by default"

Item reverted.

Backpatch-through: 16 only

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: PG 16 draft release notes ready

2023-08-09 Thread Bruce Momjian
On Fri, Jul 14, 2023 at 08:16:38PM +0200, Laurenz Albe wrote:
> On Thu, 2023-05-18 at 16:49 -0400, Bruce Momjian wrote:
> > I have completed the first draft of the PG 16 release notes.
> 
> The release notes say:
> 
> - Prevent \df+ from showing function source code (Isaac Morland)
> 
>   Function bodies are more easily viewed with \ev and \ef.
> 
> 
> That should be \sf, not \ev or \ef, right?

Agreed, fixed.  I am not sure why I put \ev and \ef there.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-09 Thread Alvaro Herrera
Hello

So pginfra had a little chat about this.  Firstly, there's consensus
that it makes sense for pginfra to help out with some persistent workers
in our existing VM system; however there are some aspects that need
some further discussion, to avoid destabilizing the rest of the
infrastructure.  We're looking into it and we'll let you know.

Hosting a couple of Mac Minis is definitely a possibility, if some
entity like SPI buys them.  Let's take this off-list to arrange the
details.

Regards

-- 
Álvaro Herrera




Re: Using defines for protocol characters

2023-08-09 Thread Alvaro Herrera
On 2023-Aug-09, Nathan Bossart wrote:

> On Wed, Aug 09, 2023 at 10:44:42AM -0600, Dave Cramer wrote:
> > On Wed, 9 Aug 2023 at 10:34, Tom Lane  wrote:
> >> I agree with Peter: let's use the names in the protocol document
> >> with a single prefix.  I've got mixed feelings about whether that prefix
> >> should have an underscore, though.
> > 
> > Well, we're getting closer :)
> 
> I'm +0.5 for the underscore.

We use CamelCase_With_UnderScores in other places (PgStat_Counter,
AtEOXact_PgStat_Database).  It's not pretty, and at times it's not
comfortable to type, but I think it will be less ugly here than in those
other places (particularly because there'll be a single underscore), and
I also agree that readability is better with the underscore than
without.

So, I'm also +0.5 on having them, much as it displeases Robert.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: Use of additional index columns in rows filtering

2023-08-09 Thread Tomas Vondra
On 8/8/23 22:54, Peter Geoghegan wrote:
> On Tue, Aug 8, 2023 at 1:24 PM Tomas Vondra
>  wrote:
>>> Assuming that that happens, then it immediately gives index scans a
>>> huge advantage over bitmap index scans. At that point it seems
>>> important to describe (in high level terms) where it is that the
>>> advantage is innate, and where it's just because we haven't done the
>>> required work for bitmap index scans. I became confused on this point
>>> myself yesterday. Admittedly I should have been able to figure it out
>>> on my own -- but it is confusing.
>>>
>>
>> Yeah, I agree that might help a lot, particularly for tables that have a
>> significant fraction of not-all-visible pages.
> 
> It also has the potential to make the costing a lot easier in certain
> important cases. Accurately deriving just how many heap accesses can
> be avoided via the VM from the statistics that are available to the
> planner is likely always going to be very difficult. Finding a way to
> make that just not matter at all (in these important cases) can also
> make it safe to bias the costing, such that the planner tends to favor
> index scans (and index-only scans) over bitmap index scans that cannot
> possibly eliminate any heap page accesses via an index filter qual.
> 

Yes, if there's a way to safely skip the visibility check for some
conditions, that would probably make the costing simpler.

Anyway, I find this discussion rather abstract and I'll probably forget
half the important cases by next week. Maybe it'd be good to build a set
of examples demonstrating the interesting cases? We've already used a
couple tenk1 queries for that purpose ...

>> Right, and I'm not against improving that, but I see it more like an
>> independent task. I don't think it needs (or should) to be part of this
>> patch - skipping visibility checks would apply to IOS, while this is
>> aimed only at plain index scans.
> 
> I'm certainly not going to insist on it. Worth considering if putting
> it in scope could make certain aspects of this patch (like the
> costing) easier, though.
> 
> I think that it wouldn't be terribly difficult to make simple
> inequalities into true index quals. I think I'd like to have a go at
> it myself. To some degree I'm trying to get a sense of how much that'd
> help you.
> 

I'm trying to make the patch to not dependent on such change. In a way,
once a clause gets recognized as index qual, it becomes irrelevant for
my patch. But the patch also doesn't get any simpler, because it still
needs to do the same thing for the remaining quals.

OTOH if there was some facility to decide if a qual is "safe" to be
executed on the index tuple, that'd be nice. But as I already said, I
see it more as an additional optimization, as it only applies to a
subset of cases.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use of additional index columns in rows filtering

2023-08-09 Thread Peter Geoghegan
On Wed, Aug 9, 2023 at 9:05 AM Tomas Vondra
 wrote:
> But in the example you shared yesterday, the problem is not really about
> visibility checks. In fact, the index scan costing completely ignores
> the VM checks - it didn't matter before, and the patch did not change
> this. It's about the number of rows the index scan is expected to
> produce - and those will always do a random I/O, we can't skip those.

I wasn't really talking about that example here.

As I see it, the problem from my example is that plain index scans had
an "unnatural" advantage over bitmap index scans. There was no actual
reason why the system couldn't just deal with the inequality on the
second column uniformly, so that index scans and bitmap index scans
both filtered out all non-matches inexpensively, without heap access.
Then the costing could have been quite off, and it really wouldn't
have mattered at runtime, because the index scan and bitmap index scan
would do approximately the same thing in any case.
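
Not the actual test case from the thread, just a sketch of the general shape
(assumes a two-column index on (a, b)):

  CREATE TABLE t (a int, b int);
  CREATE INDEX t_a_b_idx ON t (a, b);
  -- "b <> 5" cannot be a true index qual (there is no btree strategy for <>),
  -- so with the patch a plain index scan can still check it against the index
  -- tuple, while a bitmap index scan currently has no equivalent and must
  -- apply it at the heap -- the "unnatural" advantage described above.
  EXPLAIN SELECT * FROM t WHERE a = 1 AND b <> 5;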

As I've said, it's obviously also true that there are many other cases
where there really will be a "natural" advantage for index scans, that
bitmap index scans just cannot hope to offer. These are the cases
where the mechanism from your patch is best placed to be the thing
that avoids heap accesses, or maybe even avoid all visibility checks
(despite not using true index quals).

> Understood. I think this whole discussion is about figuring out these
> trade offs and also how to divide the various improvements into "minimum
> viable" changes.

That's exactly how I see it myself.

Obviously, there is still plenty of gray area here -- cases where it's
not at all clear whether or not we should rely on the mechanism from
your patch, or whether we should provide some alternative, more
specialized mechanism. For example, I've made a lot out of simple !=
inequalities recently, but it's natural to wonder what that might mean
for NOT IN ( ... ) SAOP inequalities. Am I also going to add
specialized code that passes those down to the index AM? Where do you
draw the line?

I totally accept that there is a significant amount of gray area, and
that that's likely to remain true for the foreseeable future. But I
also believe that there is a small but important number of things that
are either exactly black or exactly white. If we can actually firmly
agree on what these other things are in days or weeks (which seems
quite doable), then we'll have the right framework for figuring
everything else out over time (possibly over multiple releases). We'll
at least have the right shared vocabulary for discussing the problems,
which is a very good start. I want to have a general structure that
has the right general concepts in place from the start -- that's all.

I also suspect that we'll discover that the large amount of gray area
clauses/items are those that tend to be far less important than
"exactly black" and "exactly white" items. So even if we can only
agree that a small handful of things are in either category, that
small handful will likely be very overrepresented in real world
queries. For example, simple inequalities are very common -- it's
surprising that nbtree can't already handle them directly. I should
have thought of this myself, long ago, but it took your patch to force
me to think about it.

The problem with simple inequalities was "hiding in plain sight" for a
very long time. Could there be anything else like that?

-- 
Peter Geoghegan




Re: Using defines for protocol characters

2023-08-09 Thread Nathan Bossart
On Wed, Aug 09, 2023 at 10:44:42AM -0600, Dave Cramer wrote:
> On Wed, 9 Aug 2023 at 10:34, Tom Lane  wrote:
>> I agree with Peter: let's use the names in the protocol document
>> with a single prefix.  I've got mixed feelings about whether that prefix
>> should have an underscore, though.
> 
> Well, we're getting closer :)

I'm +0.5 for the underscore.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Using defines for protocol characters

2023-08-09 Thread Dave Cramer
On Wed, 9 Aug 2023 at 10:34, Tom Lane  wrote:

> Dave Cramer  writes:
> > On Wed, 9 Aug 2023 at 09:19, Peter Eisentraut 
> wrote:
> >> 3. IMO, the names of the protocol messages in protocol.sgml are
> >> canonical.  Your patch appends "Request" and "Response" in cases where
> >> that is not part of the actual name.  Also, some messages are documented
> >> to go both ways, so this separation doesn't make sense strictly
> >> speaking.  Please use the names as in protocol.sgml without augmenting
> >> them.
>
> > I've changed this a number of times. I do not mind changing it again, but
> > can we reach a consensus ?
>
> I agree with Peter: let's use the names in the protocol document
> with a single prefix.  I've got mixed feelings about whether that prefix
> should have an underscore, though.
>

Well, we're getting closer :)

Dave


Re: [RFC] Clang plugin for catching suspicious typedef casting

2023-08-09 Thread Dmitry Dolgov
> On Wed, Aug 09, 2023 at 03:23:32PM +0200, Peter Eisentraut wrote:
> On 03.08.23 18:56, Dmitry Dolgov wrote:
> > I would like to get your opinion, folks. Does it sound interesting
> > enough for the community, is it worth it to pursue the idea?
>
> I think it's interesting, and doesn't look too invasive.
>
> Maybe we can come up with three or four interesting use cases and try to
> implement them.  BlockNumber vs. Buffer type checking is obviously a bit
> marginal to get anyone super-excited, but it's a reasonable demo.
>
> Also, how stable is the plugin API?  Would we need to keep updating these
> plugins for each new clang version?

At first glance the API itself seems to be stable -- I didn't find
many breaking changes for plugins in the release notes over the last
five releases. The git history for such definitions as ASTConsumer or
RecursiveASTVisitor also doesn't seem to be very convoluted.

But the LibreOffice example shows that some updating would be necessary --
they've got about a dozen places in the code that depend on the clang
version. From what I see it's usually not directly related to the plugin
API, and looks more like our opaque pointers issue.




Re: Using defines for protocol characters

2023-08-09 Thread Tom Lane
Dave Cramer  writes:
> On Wed, 9 Aug 2023 at 09:19, Peter Eisentraut  wrote:
>> 3. IMO, the names of the protocol messages in protocol.sgml are
>> canonical.  Your patch appends "Request" and "Response" in cases where
>> that is not part of the actual name.  Also, some messages are documented
>> to go both ways, so this separation doesn't make sense strictly
>> speaking.  Please use the names as in protocol.sgml without augmenting
>> them.

> I've changed this a number of times. I do not mind changing it again, but
> can we reach a consensus ?

I agree with Peter: let's use the names in the protocol document
with a single prefix.  I've got mixed feelings about whether that prefix
should have an underscore, though.

regards, tom lane




Re: Adding a pg_servername() function

2023-08-09 Thread Tom Lane
GF  writes:
> On Wed, 9 Aug 2023 at 16:05, Tom Lane  wrote:
>> I actually do object to this, because I think the concept of "server
>> name" is extremely ill-defined

> But the gethostname() function is well defined, both in Linux and in
> Windows.

Sure, its call convention is standardized.  But I see nothing in POSIX
saying whether it returns a FQDN or just some random name.  In any
case, the bigger issue is that I don't really want us to expose a
function defined as "whatever gethostname() says".  I think there will
be portability issues on some platforms, and I am dubious that that
definition is what people would want.

One concrete reason why I am doubtful about this is the case of
multiple PG servers running on the same machine.  gethostname()
will be unable to distinguish them.

regards, tom lane




Re: Using defines for protocol characters

2023-08-09 Thread Dave Cramer
On Wed, 9 Aug 2023 at 09:19, Peter Eisentraut  wrote:

> 1. As was discussed, these definitions should go into
> src/include/libpq/pqcomm.h, not a new file.
>

I'm ambivalent, this is very easy to do.

>
> 2. I would prefer an underscore after PgMsg, like PqMsg_DescribeRequest,
> so it's easier to visually locate the start of the actual message name.
>
> 3. IMO, the names of the protocol messages in protocol.sgml are
> canonical.  Your patch appends "Request" and "Response" in cases where
> that is not part of the actual name.  Also, some messages are documented
> to go both ways, so this separation doesn't make sense strictly
> speaking.  Please use the names as in protocol.sgml without augmenting
> them.
>
>
I've changed this a number of times. I do not mind changing it again, but
can we reach a consensus ?

Dave


Re: Use of additional index columns in rows filtering

2023-08-09 Thread Tomas Vondra



On 8/8/23 23:03, Peter Geoghegan wrote:
> On Tue, Aug 8, 2023 at 1:49 PM Tomas Vondra
>  wrote:
>> So we expect 1250 rows. If that was accurate, the index scan would have
>> to do 1250 heap fetches. It's just luck the index scan doesn't need to
>> do that. I don't this there's a chance to improve this costing - if the
>> inputs are this off, it can't do anything.
> 
> Well, that depends. If we can find a way to make the bitmap index scan
> capable of doing something like the same trick through other means, in
> some other patch, then this particular problem (involving a simple
> inequality) just goes away. There may be other cases that look a
> little similar, with a more complicated expression, where it just
> isn't reasonable to expect a bitmap index scan to compete. Ideally,
> bitmap index scans will only be at a huge disadvantage when it just
> makes sense, due to the particulars of the expression.
> 
> I'm not trying to make this your problem. I'm just trying to establish
> the general nature of the problem.
> 
>> Also, I think this is related to the earlier discussion about maybe
>> costing it according to the worst case - i.e. as if we still needed
>> fetch the same number of heap tuples as before. Which will inevitably
>> lead to similar issues, with worse plans looking cheaper.
> 
> Not in those cases where it just doesn't come up, because we can
> totally avoid visibility checks. As I said, securing that guarantee
> has the potential to make the costing a lot more reliable/easier to
> implement.
> 

But in the example you shared yesterday, the problem is not really about
visibility checks. In fact, the index scan costing completely ignores
the VM checks - it didn't matter before, and the patch did not change
this. It's about the number of rows the index scan is expected to
produce - and each of those will incur a random I/O; we can't skip them.

>> That is certainly true - I'm trying to keep the scope somewhat close to
>> the original goal. Obviously, there may be additional things the patch
>> really needs to consider, but I'm not sure this is one of those cases
>> (perhaps I just don't understand what the issue is - the example seems
>> like a run-of-the-mill case of poor estimate / costing).
> 
> I'm not trying to impose any particular interpretation here. It's
> early in the cycle, and my questions are mostly exploratory. I'm still
> trying to develop my own understanding of the trade-offs in this area.
> 

Understood. I think this whole discussion is about figuring out these
trade-offs and also how to divide the various improvements into "minimum
viable" changes.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pgsql: Ignore BRIN indexes when checking for HOT udpates

2023-08-09 Thread Tomas Vondra
On 8/9/23 11:11, Alvaro Herrera wrote:
> On 2021-Nov-30, Tomas Vondra wrote:
> 
>> Ignore BRIN indexes when checking for HOT udpates
> 
> I was trying to use RelationGetIndexAttrBitmap for something and
> realized that its header comment does not really explain things very
> well.  That was already the case before this commit, but it (this
> commit) did add new possible values without mentioning them.  I propose
> the attached comment-only patch.
> 

+1

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg15: reltuples stuck at -1 after pg_upgrade and VACUUM

2023-08-09 Thread Peter Geoghegan
On Wed, Aug 9, 2023 at 6:18 AM Tom Lane  wrote:
> Perhaps, though, there's a case for forcing all pages to be visited
> if we start with reltuples == -1?  I'm not sure it matters much
> though given that you also need an ANALYZE run to be really in a
> good place after pg_upgrade.  The ANALYZE should update this.

Right. VACUUM is sometimes much less efficient than just using ANALYZE
to establish an initial reltuples. Other times it is much less
accurate. I can't see any argument for opting to use VACUUM instead of
ANALYZE after an upgrade.

-- 
Peter Geoghegan




Re: Adding a pg_servername() function

2023-08-09 Thread GF
On Wed, 9 Aug 2023 at 16:05, Tom Lane  wrote:

> Peter Eisentraut  writes:
> > On 09.08.23 08:42, Laetitia Avrot wrote:
> >> My question is very simple: Do you oppose having this feature in
> Postgres?
>
> > I think this is pretty harmless(*) and can be useful, so it seems
> > reasonable to pursue.
>
> I actually do object to this, because I think the concept of "server
> name" is extremely ill-defined
>

Tom,
But the gethostname() function is well defined, both in Linux and in
Windows.
Maybe the proposed name pg_servername() is unfortunate in that it may
suggest that DNS or something else is involved, but in Laetitia's patch the
call is to gethostname().
Would it be less problematic if the function were called pg_gethostname()?
@Laetitia: Why did you propose that name? Maybe to avoid clashes with some
extension out there?

Best,
Giovanni


Re: Fix last unitialized memory warning

2023-08-09 Thread Tristan Partin

On Wed Aug 9, 2023 at 10:02 AM CDT, Peter Eisentraut wrote:

On 09.08.23 10:07, Peter Eisentraut wrote:
> On 08.08.23 17:14, Tristan Partin wrote:
>>> I was able to reproduce the warning now on Fedora.  I agree with the 
>>> patch

>>>
>>> -   PgBenchValue vargs[MAX_FARGS];
>>> +   PgBenchValue vargs[MAX_FARGS] = { 0 };
>>>
>>> I suggest to also do
>>>
>>>   typedef enum
>>>   {
>>> -   PGBT_NO_VALUE,
>>> +   PGBT_NO_VALUE = 0,
>>>
>>> to make clear that the initialization value is meant to be invalid.
>>>
>>> I also got the plpgsql warning that you showed earlier, but I 
>>> couldn't think of a reasonable way to fix that.

>>
>> Applied in v2.
> 
> committed


This patch has apparently upset one buildfarm member with a very old 
compiler: 
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=lapwing&br=HEAD


Any thoughts?


The best I could find is SO question [0], which links out to [1]. Try this
patch. Otherwise, a memset() would probably do too.


[0]: 
https://stackoverflow.com/questions/13746033/how-to-repair-warning-missing-braces-around-initializer
[1]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119
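
For reference, here is a minimal standalone reproducer (hypothetical, not
taken from the buildfarm) showing the spurious warning and the double-brace
workaround; compile with an old gcc (~4.7) and -Wall -Werror:

/* hypothetical minimal reproducer, not pgbench code */
typedef struct
{
	int			type;
	long		value;
} Item;

int
main(void)
{
	Item		a[4] = {0};			/* gcc 4.7: "missing braces around initializer" */
	Item		b[4] = {{0}};		/* double braces silence the bogus warning */

	return a[0].type + b[0].type;
}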

--
Tristan Partin
Neon (https://neon.tech)
From ecb342a88e5fb2a88f0086dcf46fdcadb6d930ba Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Wed, 9 Aug 2023 10:12:47 -0500
Subject: [PATCH v1] Fix erroneous -Werror=missing-braces on old GCC

The buildfarm reports that this is an error on gcc (Debian 4.7.2-5)
4.7.2, 32-bit. The bug seems to be GCC bug 53119, which has obviously
been fixed for years.
---
 src/bin/pgbench/pgbench.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 2ba3e367c4..d2b7fc87e4 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -2239,10 +2239,15 @@ evalStandardFunc(CState *st,
 {
 	/* evaluate all function arguments */
 	int			nargs = 0;
-	PgBenchValue vargs[MAX_FARGS] = { 0 };
 	PgBenchExprLink *l = args;
 	bool		has_null = false;
 
+	/*
+	 * This value is double braced to workaround GCC bug 53119, which seems to
+	 * exist at least on gcc (Debian 4.7.2-5) 4.7.2, 32-bit.
+	 */
+	PgBenchValue vargs[MAX_FARGS] = { { 0 } };
+
 	for (nargs = 0; nargs < MAX_FARGS && l != NULL; nargs++, l = l->next)
 	{
 		if (!evaluateExpr(st, l->expr, &vargs[nargs]))
-- 
Tristan Partin
Neon (https://neon.tech)



Re: Report planning memory in EXPLAIN ANALYZE

2023-08-09 Thread David Rowley
On Thu, 10 Aug 2023 at 03:12, Ashutosh Bapat
 wrote:
> Thinking more about it, I think memory used is the only right metric.
> It's an optimization in MemoryContext implementation that malloc'ed
> memory is not freed when it is returned using pfree().

I guess it depends on the problem you're trying to solve. I had
thought you were trying to do some work to reduce the memory used by
the planner, so I imagined you wanted this so you could track your
progress and also to help ensure we don't make too many mistakes in
the future which makes memory consumption worse again.  For that, I
imagined you'd want to know how much memory is held to ransom by the
context with malloc(), not palloc(). Is it really useful to reduce the
palloc'd memory by the end of planning if it does not reduce the
malloc'd memory?

Another way we might go about reducing planner memory is to make
changes to the allocators themselves.  For example, aset rounds each
request up to the next power of 2.  If we decided to do something like
double the number of freelists so that there was a mid-way point
freelist between each power of 2, then we might find it reduces the
planner memory by, say, 12.5%.  If we just tracked what was consumed by
palloc(), that would appear to save us nothing, but it might actually
save us several malloc'ed blocks.
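
To make the arithmetic concrete, here is a standalone C sketch
(illustrative only; the midpoint size class is hypothetical and this is
not aset.c code):

#include <stdio.h>

/* round up to the next power of 2, as aset effectively does */
static size_t
pow2_round(size_t sz)
{
	size_t		r = 8;

	while (r < sz)
		r <<= 1;
	return r;
}

/* same, but with a hypothetical extra freelist halfway between powers of 2 */
static size_t
mid_round(size_t sz)
{
	size_t		hi = pow2_round(sz);
	size_t		mid = (hi >> 1) + (hi >> 2);

	return (sz <= mid) ? mid : hi;
}

int
main(void)
{
	/* a 1100-byte request: 2048 bytes today, 1536 with a midpoint class */
	printf("%zu vs %zu\n", pow2_round(1100), mid_round(1100));
	return 0;
}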

David




Re: Using defines for protocol characters

2023-08-09 Thread Peter Eisentraut
1. As was discussed, these definitions should go into 
src/include/libpq/pqcomm.h, not a new file.


2. I would prefer an underscore after PgMsg, like PqMsg_DescribeRequest, 
so it's easier to visually locate the start of the actual message name.


3. IMO, the names of the protocol messages in protocol.sgml are 
canonical.  Your patch appends "Request" and "Response" in cases where 
that is not part of the actual name.  Also, some messages are documented 
to go both ways, so this separation doesn't make sense strictly 
speaking.  Please use the names as in protocol.sgml without augmenting them.






Re: Report planning memory in EXPLAIN ANALYZE

2023-08-09 Thread Ashutosh Bapat
On Wed, Aug 9, 2023 at 8:09 PM Ashutosh Bapat
 wrote:
>
> Hi David,
> Replying to your comments on this thread.
>
> > On Tue, Aug 8, 2023 at 11:40 AM Ashutosh Bapat 
> >  wrote:
> >>
> >> Hi All,
> >> I have been looking at memory consumed when planning a partitionwise join 
> >> [1], [2] and [3]. I am using the attached patch to measure the memory 
> >> consumption. The patch has been useful to detect an increased memory 
> >> consumption in other patches e.g. [4] and [5]. The patch looks useful by 
> >> itself. So I propose this enhancement in EXPLAIN ANALYZE.
> >>
>
> On Wed, Aug 9, 2023 at 10:12 AM David Rowley  wrote:
> >
> > I see you're recording the difference in the CurrentMemoryContext of
> > palloc'd memory before and after planning.  That won't really alert us
> > to problems if the planner briefly allocates, say 12GBs of memory
> > during, say the join search then quickly pfree's it again.  unless
> > it's an oversized chunk, aset.c won't free() any memory until
> > MemoryContextReset(). Chunks just go onto a freelist for reuse later.
> > So at the end of planning, the context may still have that 12GBs
> > malloc'd, yet your new EXPLAIN ANALYZE property might end up just
> > reporting a tiny fraction of that.
> >
> > I wonder if it would be more useful to just have ExplainOneQuery()
> > create a new memory context and switch to that and just report the
> > context's mem_allocated at the end.
>
> The memory allocated but not used is still available for use in the rest
> of the query processing stages. The memory context which is
> CurrentMemoryContext at the time of planning is also
> CurrentMemoryContext at the time of its execution if it's not
> PREPAREd. So it's not exactly "consumed" by planning. But your point is
> valid, that it indicates how much was allocated. Just reporting
> allocated memory won't be enough since it might have been allocated
> before planning began. How about reporting both used and net allocated
> memory? If we use a separate memory context as you suggest, the context's
> mem_allocated would be net allocated.

Thinking more about it, I think memory used is the only right metric.
It's an optimization in the MemoryContext implementation that malloc'ed
memory is not freed when it is returned using pfree().

If we have to report allocated memory, maybe we should report it as two
properties, Planning Memory used and Planning Memory allocated,
respectively. But again, the latter is an implementation detail which
may not be relevant.
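
If we went with your suggestion, I imagine it would look roughly like this
inside ExplainOneQuery() (an untested sketch, not a patch):

	MemoryContext planner_cxt;
	MemoryContext saved_cxt;
	PlannedStmt *plan;
	Size		planning_mem;

	/* plan inside a dedicated context so we can see what it malloc'ed */
	planner_cxt = AllocSetContextCreate(CurrentMemoryContext,
										"explain planner context",
										ALLOCSET_DEFAULT_SIZES);
	saved_cxt = MemoryContextSwitchTo(planner_cxt);

	plan = pg_plan_query(query, queryString, cursorOptions, params);

	MemoryContextSwitchTo(saved_cxt);

	/* net bytes obtained from malloc() for this context and its children */
	planning_mem = MemoryContextMemAllocated(planner_cxt, true);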

-- 
Best Wishes,
Ashutosh Bapat




Re: [BackendXidGetPid] only access allProcs when xid matches

2023-08-09 Thread Junwang Zhao
On Wed, Aug 9, 2023 at 10:46 PM Ashutosh Bapat
 wrote:
>
> On Wed, Aug 9, 2023 at 9:30 AM Junwang Zhao  wrote:
> >
> > In function `BackendXidGetPid`, when looping over every proc's
> > TransactionId, there is no need to access its PGPROC (a shared memory
> > access via `arrayP->pgprocnos[index]`) unless the xid matches.
> >
> > Though the compiler can optimize this kind of inefficiency, I
> > believe we should ship with better code.
> >
>
> Looks good to me. However, I would just move the variable declarations
> with their assignments inside the if () rather than combining the
> expressions. It is more readable that way.

Yeah, makes sense. I also checked that the original style is used
elsewhere; the attached file keeps that style, thanks ;)

>
> --
> Best Wishes,
> Ashutosh Bapat



-- 
Regards
Junwang Zhao


v2-0001-BackendXidGetPid-only-access-allProcs-when-xid-ma.patch
Description: Binary data


Re: Fix last unitialized memory warning

2023-08-09 Thread Peter Eisentraut

On 09.08.23 10:07, Peter Eisentraut wrote:

On 08.08.23 17:14, Tristan Partin wrote:
I was able to reproduce the warning now on Fedora.  I agree with the 
patch


-   PgBenchValue vargs[MAX_FARGS];
+   PgBenchValue vargs[MAX_FARGS] = { 0 };

I suggest to also do

  typedef enum
  {
-   PGBT_NO_VALUE,
+   PGBT_NO_VALUE = 0,

to make clear that the initialization value is meant to be invalid.

I also got the plpgsql warning that you showed earlier, but I 
couldn't think of a reasonable way to fix that.


Applied in v2.


committed


This patch has apparently upset one buildfarm member with a very old 
compiler: 
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=lapwing&br=HEAD


Any thoughts?





Re: [BackendXidGetPid] only access allProcs when xid matches

2023-08-09 Thread Ashutosh Bapat
On Wed, Aug 9, 2023 at 9:30 AM Junwang Zhao  wrote:
>
> In function `BackendXidGetPid`, when looping over every proc's
> TransactionId, there is no need to access its PGPROC (a shared memory
> access via `arrayP->pgprocnos[index]`) unless the xid matches.
>
> Though the compiler can optimize this kind of inefficiency, I
> believe we should ship with better code.
>

Looks good to me. However, I would just move the variable declarations
with their assignments inside the if () rather than combining the
expressions. It is more readable that way.
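
Something like this is the shape I had in mind (a sketch of just the loop,
not the attached patch):

	for (index = 0; index < arrayP->numProcs; index++)
	{
		if (other_xids[index] == xid)
		{
			int			pgprocno = arrayP->pgprocnos[index];
			PGPROC	   *proc = &allProcs[pgprocno];

			result = proc->pid;
			break;
		}
	}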

-- 
Best Wishes,
Ashutosh Bapat




Re: Support to define custom wait events for extensions

2023-08-09 Thread Andres Freund
Hi,

On 2023-08-09 20:10:42 +0900, Masahiro Ikeda wrote:
> * Is there any way to not force extensions that don't use shared
>   memory themselves (like dblink) to acquire AddinShmemInitLock?

I think the caller shouldn't need to deal with AddinShmemInitLock at all.

Greetings,

Andres Freund




Re: Report planning memory in EXPLAIN ANALYZE

2023-08-09 Thread Ashutosh Bapat
Hi David,
Replying to your comments on this thread.

> On Tue, Aug 8, 2023 at 11:40 AM Ashutosh Bapat  
> wrote:
>>
>> Hi All,
>> I have been looking at memory consumed when planning a partitionwise join 
>> [1], [2] and [3]. I am using the attached patch to measure the memory 
>> consumption. The patch has been useful to detect an increased memory 
>> consumption in other patches e.g. [4] and [5]. The patch looks useful by 
>> itself. So I propose this enhancement in EXPLAIN ANALYZE.
>>

On Wed, Aug 9, 2023 at 10:12 AM David Rowley  wrote:
>
> I see you're recording the difference in the CurrentMemoryContext of
> palloc'd memory before and after planning.  That won't really alert us
> to problems if the planner briefly allocates, say 12GBs of memory
> during, say the join search then quickly pfree's it again.  unless
> it's an oversized chunk, aset.c won't free() any memory until
> MemoryContextReset(). Chunks just go onto a freelist for reuse later.
> So at the end of planning, the context may still have that 12GBs
> malloc'd, yet your new EXPLAIN ANALYZE property might end up just
> reporting a tiny fraction of that.
>
> I wonder if it would be more useful to just have ExplainOneQuery()
> create a new memory context and switch to that and just report the
> context's mem_allocated at the end.

The memory allocated but not used is still available for use in the rest
of the query processing stages. The memory context which is
CurrentMemoryContext at the time of planning is also
CurrentMemoryContext at the time of its execution if it's not
PREPAREd. So it's not exactly "consumed" by planning. But your point is
valid, that it indicates how much was allocated. Just reporting
allocated memory won't be enough since it might have been allocated
before planning began. How about reporting both used and net allocated
memory? If we use a separate memory context as you suggest, the context's
mem_allocated would be net allocated.

>
> It's also slightly annoying that these planner-related summary outputs
> are linked to EXPLAIN ANALYZE. We could be showing them in EXPLAIN
> without ANALYZE.  If we were to change that now, it might be a bit
> annoying for the regression tests as we'd need to go and add SUMMARY
> OFF in a load of places...

We do report planning time when SUMMARY ON without ANALYZE. Am I
missing something?

#select version();
 version
--
 PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
(1 row)
#explain (summary on) select * from pg_class;
 QUERY PLAN
-
 Seq Scan on pg_class  (cost=0.00..18.13 rows=413 width=273)
 Planning Time: 0.282 ms
(2 rows)
#explain (summary off) select * from pg_class;
 QUERY PLAN
-
 Seq Scan on pg_class  (cost=0.00..18.13 rows=413 width=273)
(1 row)

I need to just make sure that the Planning Memory is reported with SUMMARY ON.


-- 
Best Wishes,
Ashutosh Bapat




Re: Remove distprep

2023-08-09 Thread Andres Freund
Hi,

Thanks for sending the -packagers email Peter!

On 2023-08-09 16:25:28 +0200, Christoph Berg wrote:
> Re: Tom Lane
> > Meh ... the fact that it works fine for you doesn't mean it will work
> > fine elsewhere.  Since we're trying to get out from under maintaining
> > the autoconf build system, I don't think it makes sense to open
> > ourselves up to having to do more work on it.  A policy of benign
> > neglect seems appropriate to me.
> 
> Understood, I was just pointing out there are more types of generated
> files in there.

The situation for configure is somewhat different, due to being maintained in
the repository, rather than just being included in the tarball...

Greetings,

Andres Freund




Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning

2023-08-09 Thread Ashutosh Bapat
On Wed, Aug 9, 2023 at 10:12 AM David Rowley  wrote:
>
> On Fri, 28 Jul 2023 at 02:06, Ashutosh Bapat
>  wrote:
> > 0001 - to measure memory consumption during planning. This is the same
> > one as attached to [1].
>

I have started a separate thread to discuss this patch. I am taking
this discussion to that thread.

-- 
Best Wishes,
Ashutosh Bapat




Re: Remove distprep

2023-08-09 Thread Christoph Berg
Re: Tom Lane
> Meh ... the fact that it works fine for you doesn't mean it will work
> fine elsewhere.  Since we're trying to get out from under maintaining
> the autoconf build system, I don't think it makes sense to open
> ourselves up to having to do more work on it.  A policy of benign
> neglect seems appropriate to me.

Understood, I was just pointing out there are more types of generated
files in there.

Christoph




Re: Oversight in reparameterize_path_by_child leading to executor crash

2023-08-09 Thread Ashutosh Bapat
On Wed, Aug 9, 2023 at 8:14 AM Richard Guo  wrote:
>
> With this patch, the reparameterize_path_by_child work is postponed
> until createplan time, so in create_nestloop_path() the inner path is
> still parameterized by top parent.  So we have to check against the top
> parent of outer rel.
>

Your changes in create_nestloop_path() only affect the check for
parameterized path not the unparameterized paths. So I think we are
good there. My worry was misplaced.

-- 
Best Wishes,
Ashutosh Bapat




Re: Separate memory contexts for relcache and catcache

2023-08-09 Thread David Rowley
On Thu, 10 Aug 2023 at 01:23, Alvaro Herrera  wrote:
>
> On 2023-Aug-09, Melih Mutlu wrote:
>
> > --Patch
> >  name  | used_bytes | free_bytes | total_bytes
> > ---+++-
> >  RelCacheMemoryContext |4706464 |3682144 | 8388608
> >  CatCacheMemoryContext |3489384 | 770712 | 4260096
> >  index info|2102160 | 113776 | 2215936
> >  CacheMemoryContext|   2336 |   5856 |8192
> >  relation rules|   4416 |   3776 |8192
> > (5 rows)
>
> Hmm, is this saying that there's too much fragmentation in the relcache
> context?

free_bytes is just the space in the blocks that is not being used by
any allocated chunks or chunks on the freelist.

It looks like RelCacheMemoryContext has 10 blocks including the 8kb
initial block:

postgres=# select 8192 + sum(8192*power(2,x)) as total_bytes from
generate_series(0,9) x;
 total_bytes
-
 8388608

The first 2 blocks are 8KB as we only start doubling after we malloc
the first 8kb block after the keeper block.

If there were 1 fewer block then total_bytes would be 4194304, which is
less than the used_bytes for that context, so those 10 blocks look
needed.

> Maybe it would improve things to make it a SlabContext instead
> of AllocSet.  Or, more precisely, a bunch of SlabContexts, each with the
> appropriate chunkSize for the object being stored.

It would at least save us from having to do the power-of-2 rounding that
aset does. However, on a quick glance, it seems not all the size
requests in relcache.c are fixed.  I see a datumCopy() in
RelationBuildTupleDesc() for the attmissingval stuff, so we couldn't
SlabAlloc that.

It could be worth looking at the size classes of the fixed-sized
allocations to estimate how much memory we might save by using slab to
avoid the power-of-2 rounding that aset.c does. However, if there are too
many contexts then we may end up using more memory with all the
mostly-empty contexts for backends that only query a tiny number of
tables.  That might not be good.  Slab also does not do block doubling
like aset does, so it might be hard to choose a good block size.
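
For illustration, a slab context dedicated to one fixed-size entry type
might be created like this (a sketch only; the context name is made up,
and whether this pays off is exactly the question):

	MemoryContext relcache_entry_cxt =
		SlabContextCreate(CacheMemoryContext,
						  "relcache entry slab",
						  SLAB_DEFAULT_BLOCK_SIZE,
						  sizeof(RelationData));

	/* every request to a slab context must be exactly chunkSize bytes */
	Relation	rel = (Relation) MemoryContextAlloc(relcache_entry_cxt,
													sizeof(RelationData));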

> (I don't say this
> because I know for a fact that Slab is better for these purposes; it's
> just that I happened to read its comments yesterday and they stated that
> it behaves better in terms of fragmentation.  Maybe Andres or Tomas have
> an opinion on this.)

I'm not sure of the exact comment, but I was in there recently and
there's a chance that I wrote that comment.  Slab prioritizes putting
new chunks on fuller blocks and may free() blocks once they become
empty of any chunks.  Aset does no free()ing of blocks unless a block
was malloc()ed especially for a chunk above allocChunkLimit.  That
means aset might hold a lot of malloc'ed memory for chunks that just
sit on freelists which might never be used ever again; meanwhile,
other request sizes may have to malloc new blocks.

David




Re: Remove distprep

2023-08-09 Thread Tom Lane
Christoph Berg  writes:
> Most notably, we are also rebuilding "configure" using autoconf 2.71
> without issues. Perhaps we can get rid of the 2.69 hardcoding there?

Meh ... the fact that it works fine for you doesn't mean it will work
fine elsewhere.  Since we're trying to get out from under maintaining
the autoconf build system, I don't think it makes sense to open
ourselves up to having to do more work on it.  A policy of benign
neglect seems appropriate to me.

regards, tom lane




Re: Adding a pg_servername() function

2023-08-09 Thread Tom Lane
Peter Eisentraut  writes:
> On 09.08.23 08:42, Laetitia Avrot wrote:
>> My question is very simple: Do you oppose having this feature in Postgres?

> I think this is pretty harmless(*) and can be useful, so it seems 
> reasonable to pursue.

I actually do object to this, because I think the concept of "server
name" is extremely ill-defined and if we try to support it, we will
forever be chasing requests for alternative behaviors.  Just to start
with, is a prospective user expecting a fully-qualified domain name
or just the base name?  If the machine has several names (perhaps
associated with different IP addresses), what do you do about that?
I wouldn't be too surprised if users would expect to get the name
associated with the IP address that the current connection came
through.  Or at least they might tell you they want that, until
they discover they're getting "localhost.localdomain" on loopback
connections and come right back to bitch about that.
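
To illustrate just the first ambiguity, here is a trivial C sketch
(assuming a POSIX libc; nothing to do with the patch itself).  The two
answers can easily differ, and neither is tied to the address the client
actually connected through:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>

int
main(void)
{
	char		hostname[256];
	struct addrinfo hints;
	struct addrinfo *info = NULL;

	if (gethostname(hostname, sizeof(hostname)) != 0)
		return 1;
	printf("gethostname():  %s\n", hostname);	/* short name or FQDN? unspecified */

	memset(&hints, 0, sizeof(hints));
	hints.ai_flags = AI_CANONNAME;
	if (getaddrinfo(hostname, NULL, &hints, &info) == 0)
	{
		printf("canonical name: %s\n",
			   info->ai_canonname ? info->ai_canonname : "(none)");
		freeaddrinfo(info);
	}
	return 0;
}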

Windows likely adds a whole 'nother set of issues to "what's the
machine name", but I don't know enough about it to say what.

I think the upthread suggestion to use cluster_name is going to
be a superior solution for most people, not least because they
can use it today and it will work the same regardless of platform.

> (*) But we should think about access control for this.  If you're in a 
> DBaaS environment, providers might not like that you can read out their 
> internal host names.

There's that, too.

regards, tom lane




Re: [RFC] Clang plugin for catching suspicious typedef casting

2023-08-09 Thread Peter Eisentraut

On 03.08.23 18:56, Dmitry Dolgov wrote:

I would like to get your opinion, folks. Does it sound interesting
enough for the community, is it worth it to pursue the idea?


I think it's interesting, and doesn't look too invasive.

Maybe we can come up with three or four interesting use cases and try to 
implement them.  BlockNumber vs. Buffer type checking is obviously a bit 
marginal to get anyone super-excited, but it's a reasonable demo.


Also, how stable is the plugin API?  Would we need to keep updating 
these plugins for each new clang version?





Re: Separate memory contexts for relcache and catcache

2023-08-09 Thread Alvaro Herrera
On 2023-Aug-09, Melih Mutlu wrote:

> --Patch
>  name  | used_bytes | free_bytes | total_bytes
> ---+++-
>  RelCacheMemoryContext |4706464 |3682144 | 8388608
>  CatCacheMemoryContext |3489384 | 770712 | 4260096
>  index info|2102160 | 113776 | 2215936
>  CacheMemoryContext|   2336 |   5856 |8192
>  relation rules|   4416 |   3776 |8192
> (5 rows)

Hmm, is this saying that there's too much fragmentation in the relcache
context?  Maybe it would improve things to make it a SlabContext instead
of AllocSet.  Or, more precisely, a bunch of SlabContexts, each with the
appropriate chunkSize for the object being stored.  (I don't say this
because I know for a fact that Slab is better for these purposes; it's
just that I happened to read its comments yesterday and they stated that
it behaves better in terms of fragmentation.  Maybe Andres or Tomas have
an opinion on this.)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"I love the Postgres community. It's all about doing things _properly_. :-)"
(David Garamond)




Re: pg15: reltuples stuck at -1 after pg_upgrade and VACUUM

2023-08-09 Thread Tom Lane
Peter Geoghegan  writes:
> On Tue, Aug 8, 2023 at 8:43 PM Justin Pryzby  wrote:
>> The problem isn't that reltuples == -1 after the upgrade (which is
>> normal).  The issue is that if VACUUM skips all the pages, it can leave
>> reltuples -1.  My expectation is that after running "vacuum", no tables
>> are left in the "never have been vacuumed" state.

> But -1 isn't the "never have been vacuumed" state, exactly. At best it
> is the "never been vacuumed or analyzed" state.

Yeah.  -1 effectively pleads ignorance about the number of live tuples.
If VACUUM has skipped every page, then it is still ignorant about
the true number of live tuples, so setting the value to something
else would be inappropriate.

Perhaps, though, there's a case for forcing all pages to be visited
if we start with reltuples == -1?  I'm not sure it matters much
though given that you also need an ANALYZE run to be really in a
good place after pg_upgrade.  The ANALYZE should update this.

regards, tom lane




Re: initial pruning in parallel append

2023-08-09 Thread Amit Langote
On Wed, Aug 9, 2023 at 9:48 PM Robert Haas  wrote:
> On Wed, Aug 9, 2023 at 6:22 AM Amit Langote  wrote:
> > > > I'm assuming it's not
> > > > too ugly if ExecInitAppend() uses IsParallelWorker() to decide whether
> > > > it should be writing to EState.es_part_prune_results or reading from
> > > > it -- the former if in the leader and the latter in a worker.
> > >
> > > I don't think that's too ugly. I mean you have to have an if statement
> > > someplace.
> >
> > Yes, that makes sense.
> >
> > It's just that I thought maybe I haven't thought hard enough about
> > options before adding a new IsParallelWorker(), because I don't find
> > too many instances of IsParallelWorker() in the generic executor code.
> > I think that's because most parallel worker-specific logic lives in
> > execParallel.c or in Exec*Worker() functions outside that file, so the
> > generic code remains parallel query agnostic as much as possible.
>
> Oh, actually, there is an issue here. IsParallelWorker() is not the
> right test. Imagine that there's a parallel query which launches some
> workers, and one of those calls a user-defined function which again
> uses parallelism, launching more workers. This may not be possible
> today, I don't really remember, but the test should be "am I a
> parallel worker with respect to this plan?" not "am I a parallel
> worker at all?".Not quite sure what the best way to code that is. If
> we could just test whether we have a ParallelWorkerContext, it would
> be easy...

I checked enough to be sure that IsParallelWorker() is reliable at the
time of ExecutorStart() / ExecInitNode() in ParallelQueryMain() in a
worker.   However, ParallelWorkerContext is not available at that
point.  Here's the relevant part of ParallelQueryMain():

/* Start up the executor */
queryDesc->plannedstmt->jitFlags = fpes->jit_flags;
ExecutorStart(queryDesc, fpes->eflags);

/* Special executor initialization steps for parallel workers */
queryDesc->planstate->state->es_query_dsa = area;
if (DsaPointerIsValid(fpes->param_exec))
{
char   *paramexec_space;

paramexec_space = dsa_get_address(area, fpes->param_exec);
RestoreParamExecParams(paramexec_space, queryDesc->estate);
}
pwcxt.toc = toc;
pwcxt.seg = seg;
ExecParallelInitializeWorker(queryDesc->planstate, &pwcxt);

BTW, we do also use IsParallelWorker() in ExecGetRangeTableRelation()
which also probably only runs during ExecInitNode(), same as
ExecInitPartitionPruning() that this patch adds it to.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: initial pruning in parallel append

2023-08-09 Thread Robert Haas
On Wed, Aug 9, 2023 at 6:22 AM Amit Langote  wrote:
> > > I'm assuming it's not
> > > too ugly if ExecInitAppend() uses IsParallelWorker() to decide whether
> > > it should be writing to EState.es_part_prune_results or reading from
> > > it -- the former if in the leader and the latter in a worker.
> >
> > I don't think that's too ugly. I mean you have to have an if statement
> > someplace.
>
> Yes, that makes sense.
>
> It's just that I thought maybe I haven't thought hard enough about
> options before adding a new IsParallelWorker(), because I don't find
> too many instances of IsParallelWorker() in the generic executor code.
> I think that's because most parallel worker-specific logic lives in
> execParallel.c or in Exec*Worker() functions outside that file, so the
> generic code remains parallel query agnostic as much as possible.

Oh, actually, there is an issue here. IsParallelWorker() is not the
right test. Imagine that there's a parallel query which launches some
workers, and one of those calls a user-defined function which again
uses parallelism, launching more workers. This may not be possible
today, I don't really remember, but the test should be "am I a
parallel worker with respect to this plan?" not "am I a parallel
worker at all?".Not quite sure what the best way to code that is. If
we could just test whether we have a ParallelWorkerContext, it would
be easy...

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Separate memory contexts for relcache and catcache

2023-08-09 Thread Andy Fan
>
> Most catcache and relcache entries (other than index info etc.) currently
> go straight into CacheMemoryContext. And I believe these two caches can be
> the ones with the largest contribution to the memory usage of
> CacheMemoryContext most of the time. For example, in cases where we have
> lots of database objects accessed in a long-lived connection,
> CacheMemoryContext tends to increase significantly.
>
> While I've been working on another patch for pg_backend_memory_contexts
> view, we thought that it would also be better to see the memory usages of
> different kinds of caches broken down into their own contexts. The attached
> patch implements this and aims to easily keep track of the memory used by
> relcache and catcache
>
>
+1 for the idea; this would be pretty useful as proof of which
context is consuming most of the memory, and it doesn't cost
much.  It would be handier than estimating that with something
like select count(*) from pg_class.

I think, for example, if we find relcache using too much memory,
it is a signal that the user may be using too many partitioned tables.


-- 
Best Regards
Andy Fan


Separate memory contexts for relcache and catcache

2023-08-09 Thread Melih Mutlu
Hi hackers,

Most catcache and relcache entries (other than index info etc.) currently
go straight into CacheMemoryContext. And I believe these two caches can be
the ones with the largest contribution to the memory usage of
CacheMemoryContext most of the time. For example, in cases where we have
lots of database objects accessed in a long-lived connection,
CacheMemoryContext tends to increase significantly.

While I've been working on another patch for pg_backend_memory_contexts
view, we thought that it would also be better to see the memory usages of
different kinds of caches broken down into their own contexts. The attached
patch implements this and aims to easily keep track of the memory used by
relcache and catcache.
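
The core of the change is just hanging two child contexts off
CacheMemoryContext and allocating the respective cache entries there;
roughly (a simplified sketch with a made-up function name, not the full
patch):

static MemoryContext RelCacheMemoryContext = NULL;
static MemoryContext CatCacheMemoryContext = NULL;

/* called once CacheMemoryContext itself exists */
static void
CreateCacheChildContexts(void)
{
	RelCacheMemoryContext = AllocSetContextCreate(CacheMemoryContext,
												  "RelCacheMemoryContext",
												  ALLOCSET_DEFAULT_SIZES);
	CatCacheMemoryContext = AllocSetContextCreate(CacheMemoryContext,
												  "CatCacheMemoryContext",
												  ALLOCSET_DEFAULT_SIZES);
}

Relcache and catcache allocations then switch to these contexts instead of
CacheMemoryContext.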

To quickly show what pg_backend_memory_contexts would look like, I did the
following:

-Create some tables:
SELECT 'BEGIN;' UNION ALL SELECT format('CREATE TABLE %1$s(id serial
primary key, data text not null unique)', 'test_'||g.i) FROM
generate_series(0, 1000) g(i) UNION ALL SELECT 'COMMIT;';\gexec

-Open a new connection and query pg_backend_memory_contexts [1]:
This is what you'll see before and after the patch.
-- HEAD:
name| used_bytes | free_bytes | total_bytes
+++-
 CacheMemoryContext | 467656 |  56632 |  524288
 index info | 111760 |  46960 |  158720
 relation rules |   4416 |   3776 |8192
(3 rows)

-- Patch:
 name  | used_bytes | free_bytes | total_bytes
---+++-
 CatCacheMemoryContext | 217696 |  8 |  262144
 RelCacheMemoryContext | 248264 |  13880 |  262144
 index info| 111760 |  46960 |  158720
 CacheMemoryContext|   2336 |   5856 |8192
 relation rules|   4416 |   3776 |8192
(5 rows)

- Run select on all tables
SELECT format('SELECT count(*) FROM %1$s', 'test_'||g.i) FROM
generate_series(0, 1000) g(i);\gexec

- Then check pg_backend_memory_contexts [1] again:
--HEAD
name| used_bytes | free_bytes | total_bytes
+++-
 CacheMemoryContext |8197344 | 257056 | 8454400
 index info |2102160 | 113776 | 2215936
 relation rules |   4416 |   3776 |8192
(3 rows)

--Patch
 name  | used_bytes | free_bytes | total_bytes
---+++-
 RelCacheMemoryContext |4706464 |3682144 | 8388608
 CatCacheMemoryContext |3489384 | 770712 | 4260096
 index info|2102160 | 113776 | 2215936
 CacheMemoryContext|   2336 |   5856 |8192
 relation rules|   4416 |   3776 |8192
(5 rows)

You can see that CacheMemoryContext does not use much memory without
catcache and relcache (at least in cases similar to above), and it's easy
to bloat catcache and relcache. That's why I think it would be useful to
see their usage separately.

Any feedback would be appreciated.

[1]
SELECT
    name,
    sum(used_bytes) AS used_bytes,
    sum(free_bytes) AS free_bytes,
    sum(total_bytes) AS total_bytes
FROM pg_backend_memory_contexts
WHERE name LIKE '%CacheMemoryContext%' OR parent LIKE '%CacheMemoryContext%'
GROUP BY name
ORDER BY total_bytes DESC;


Thanks,
-- 
Melih Mutlu
Microsoft


0001-Separate-memory-contexts-for-relcache-and-catcache.patch
Description: Binary data


Re: [PoC] Reducing planning time when tables have many partitions

2023-08-09 Thread David Rowley
On Wed, 9 Aug 2023 at 20:15, Yuya Watari  wrote:
> I agree with your opinion that my patch lacks some explanations, so I
> will consider adding more comments. However, I received the following
> message from David in March.
>
> On Thu, Mar 9, 2023 at 6:23 AM David Rowley  wrote:
> > For the main patch, I've been starting to wonder if it should work
> > completely differently.  Instead of adding members for partitioned and
> > inheritance children, we could just translate the Vars from child to
> > top-level parent and find the member that way. I wondered if this
> > method might be even faster as it would forego
> > add_child_rel_equivalences(). I think we'd still need em_is_child for
> > UNION ALL children.  So far, I've not looked into this in detail. I
> > was hoping to find an idea that would allow some means to have the
> > planner realise that a LIST partition which allows a single Datum
> > could skip pushing base quals which are constantly true. i.e:
> >
> > create table lp (a int) partition by list(a);
> > create table lp1 partition of lp for values in(1);
> > explain select * from lp where a = 1;
> >
> >  Seq Scan on lp1 lp  (cost=0.00..41.88 rows=13 width=4)
> >Filter: (a = 1)
>
> I am concerned that fixing the current patch will conflict with
> David's idea. Of course, I am now trying to experiment with the above
> idea, but I should avoid the conflict if he is working on this. David,
> what do you think about this? Is it OK to post a new patch to address
> the review comments? I am looking forward to your reply.

So, I have three concerns with this patch.

1) I really dislike the way eclass_member_iterator_next() has to check
bms_overlap() to filter out unwanted EMs.  This is required because of
how add_child_rel_equivalences() does not pass the "relids" parameter
in add_eq_member() as equivalent to pull_varnos(expr).  See this code
in master:

/*
* Transform em_relids to match.  Note we do *not* do
* pull_varnos(child_expr) here, as for example the
* transformation might have substituted a constant, but we
* don't want the child member to be marked as constant.
*/
new_relids = bms_difference(cur_em->em_relids,
top_parent_relids);
new_relids = bms_add_members(new_relids, child_relids);


I understand this is done to support Consts in UNION ALL parents, e.g
the following query prunes the n=2 UNION ALL branch

postgres=# explain select * from (select 1 AS n,* from pg_Class c1
union all select 2 AS n,* from pg_Class c2) where n=1;
   QUERY PLAN

 Seq Scan on pg_class c1  (cost=0.00..18.13 rows=413 width=277)
(1 row)

... but the following (existing) comment is just a lie:

Relids em_relids; /* all relids appearing in em_expr */

This means that there's some weirdness on which RelOptInfos we set
eclass_member_indexes.  Do we just set the EM in the RelOptInfos
mentioned in the em_expr, or should it be the ones in em_relids?

You can see the following code I wrote in the 0001 patch which tries
to work around this problem:

+ /*
+ * We must determine the exact set of relids in the expr for child
+ * EquivalenceMembers as what is given to us in 'relids' may differ from
+ * the relids mentioned in the expression.  See add_child_rel_equivalences
+ */
+ if (parent != NULL)
+ expr_relids = pull_varnos(root, (Node *) expr);
+ else
+ {
+ expr_relids = relids;
+ /* We expect the relids to match for non-child members */
+ Assert(bms_equal(pull_varnos(root, (Node *) expr), relids));
+ }

So, you can see we go with the relids from the em_expr rather than
what's mentioned in em_relids.  I believe this means we need the
following line:

+ /*
+ * Don't return members which have no common rels with with_relids
+ */
+ if (!bms_overlap(em->em_relids, iter->with_relids))
+ continue;

I don't quite recall if the em_expr can mention relids that are not in
em_relids, or if em_expr's relids are always a subset of
em_relids.

I'm just concerned this adds complexity and the risk of mixing up the
meaning (even more than it is already in master). I'm not sure I'm
confident that all this is correct, and I wrote the 0001 patch.

Maybe this can be fixed by changing master so that em_relids always
matches pull_varnos(em_expr)? I'm unsure if there are any other
complexities other than having to ensure we don't set em_is_const for
child members.

2) The 2nd reason is what I hinted at that you quoted in the email I
sent you in March.  I think if it wasn't for UNION ALL and perhaps
table inheritance and we only needed child EMs for partitions of
partitioned tables, then I think we might be able to get away with
just translating Exprs child -> parent before looking up the EM and
likewise when asked to get join quals for child rels, we'd translate
the child relids to their top level parents, find the quals then
translate those back to child form again. EquivalenceClasses would
then only contain a few members and there likely wouldn't be a g

Re: A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-09 Thread Christoph Berg
Re: To Thomas Munro
> 603 iterations later it hit again, but didn't log anything. (I believe
> I did run "make" in the right directory.)

Since that didn't seem right I'm running the tests again. There are
XXX lines in the output, but it hasn't hit yet.

Christoph




Re: POC, WIP: OR-clause support for indexes

2023-08-09 Thread Alena Rybakina
Hi! Thank you for your research; I'm sure it will help me fix the
problem of calculating selectivity faster)
I'm sorry I didn't answer right away; to be honest, I had a diary full
of urgent matters at work. For this reason, I didn't have enough time to
work on this patch properly.



The optimizer will itself do a limited form of "normalizing to CNF".
Are you familiar with extract_restriction_or_clauses(), from
orclauses.c? Comments above the function have an example of how this
can work:

  * Although a join clause must reference multiple relations overall,
  * an OR of ANDs clause might contain sub-clauses that reference just one
  * relation and can be used to build a restriction clause for that rel.
  * For example consider
  *  WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45));
  * We can transform this into
  *  WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45))
  *  AND (a.x = 42 OR a.x = 44)
  *  AND (b.y = 43 OR b.z = 45);
  * which allows the latter clauses to be applied during the scans of a and b,
  * perhaps as index qualifications, and in any case reducing the number of
  * rows arriving at the join.  In essence this is a partial transformation to
  * CNF (AND of ORs format).  It is not complete, however, because we do not
  * unravel the original OR --- doing so would usually bloat the qualification
  * expression to little gain.
This is an interesting feature. I didn't notice this function before; I
have studied consider_new_or_clause(), which is called there, many times.
As far as I know, there is a selectivity calculation going on there, but
as far as I remember, I called it earlier after my conversion, and
unfortunately it didn't solve my problem with calculating selectivity.
I'll reconsider it again; maybe I can find something I missed.

Of course this immediately makes me wonder: shouldn't your patch be
able to perform an additional transformation here? You know, by
transforming "a.x = 42 OR a.x = 44" into "a IN (42, 44)"? Although I
haven't checked for myself, I assume that this doesn't happen right
now, since your patch currently performs all of its transformations
during parsing.

I also noticed that the same comment block goes on to say something
about "clauselist_selectivity's inability to recognize redundant
conditions". Perhaps that is relevant to the problems you were having
with selectivity estimation, back when the code was in
preprocess_qual_conditions() instead? I have no reason to believe that
there should be any redundancy left behind by your transformation, so
this is just one possibility to consider.
Separately, the commit message of commit 25a9e54d2d says something
about how the planner builds RestrictInfos, which seems
possibly-relevant. That commit enhanced extended statistics for OR
clauses, so the relevant paragraph describes a limitation of extended
statistics with OR clauses specifically. I'm just guessing, but it
still seems like it might be relevant to the problem you ran into with
selectivity estimation. Another possibility to consider.


I understood what is said about AND clauses in this comment. It seems to
me that AND clauses are saved as (BoolExpr *) expr->args->(RestrictInfo *)
clauseA->(RestrictInfo *) clauseB lists and OR clauses are saved as
(BoolExpr *) expr->orclause->(RestrictInfo *) clauseA->(RestrictInfo *)
clauseB lists.


As I understand it, selectivity is calculated for each expression. But
I'll explore it more deeply, because I think this place may contain the
answer to the question of what's wrong with the selectivity calculation
in my patch.



BTW, I sometimes use RR to help improve my understanding of the planner:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Recording_Postgres_using_rr_Record_and_Replay_Framework
The planner has particularly complicated control flow, which has
unique challenges -- just knowing where to begin can be difficult
(unlike most other areas). I find that setting watchpoints to see when
and where the planner modifies state using RR is far more useful than
it would be with regular GDB. Once I record a query, I find that I can
"map out" what happens in the planner relatively easily.
Thank you for sharing this source! I didn't know about this before, and
it will definitely make it easier for me to understand the optimizer.


I understand what you mean, and I researched the optimizer in a similar
way through gdb and looked at the comments and code in PostgreSQL. This is
a complicated approach, and I didn't always understand correctly what a
given variable was doing in a given place, which created some difficulties
for me.


So, thank you for the link!


Many interesting cases won't get SAOP transformation from the patch,
simply because of the or_transform_limit GUC's default of 500. I don't
think that that design makes too much sense. It made more sense back
when the focus was on expression evaluation overhead. But that's only
one of th

Re: Support to define custom wait events for extensions

2023-08-09 Thread Masahiro Ikeda

Hi,

Thanks for your comments to v1 patch.

I made a v2 patch. The main changes are
* change to NAMEDATALEN
* change to use dynahash instead of dshash
* remove worker_spi_init()
* create a second hash table to find an event id from a name to
  ensure uniqueness. It enables extensions which don't use shared
  memory themselves to define custom wait events because
  WaitEventExtensionNew() will not allocate duplicate wait events.
* create a PoC patch to show that extensions which don't use shared
  memory themselves can define custom wait events.
 (v2-0002-poc-custom-wait-event-for-dblink.patch)

I'm worried about:
* Is 512 (wee_hash_max_size) sufficient as the maximum number of custom
  wait events?
* Is there any way to not force extensions that don't use shared
  memory themselves (like dblink) to acquire AddinShmemInitLock?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATIONFrom ec7d2b0ae47be9bedabb4a4127c914bfb8c361b5 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda 
Date: Wed, 9 Aug 2023 19:19:58 +0900
Subject: [PATCH 1/2] Change to manage custom wait events in shared hash

Currently, names of custom wait events must be registered
per backend. This patch relaxes the constraint by storing the
wait events and their names in hash tables in shared memory.

So, all backends can look up the wait event names in the
pg_stat_activity view without additional processing by extensions.

In addition, extensions which do not use shared memory themselves
can define custom wait events because WaitEventExtensionNew()
will never allocate a new one if the wait event associated with the
name is already allocated. It uses a hash table to ensure uniqueness.
---
 doc/src/sgml/monitoring.sgml  |   7 +-
 doc/src/sgml/xfunc.sgml   |  10 +-
 src/backend/storage/lmgr/lwlocknames.txt  |   1 +
 src/backend/utils/activity/wait_event.c   | 198 +++---
 src/include/utils/wait_event.h|  17 +-
 .../modules/worker_spi/t/001_worker_spi.pl|  18 +-
 .../modules/worker_spi/worker_spi--1.0.sql|   5 -
 src/test/modules/worker_spi/worker_spi.c  |  22 +-
 8 files changed, 137 insertions(+), 141 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index f4fc5d814f..19181832d7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1121,10 +1121,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
  LWLock types
  to the list shown in  and
  . In some cases, the name
- assigned by an extension will not be available in all server processes;
- so an Extension or LWLock wait
- event might be reported as just
- extension rather than the
+ of LWLock assigned by an extension will not be
+ available in all server processes; so an wait event might be reported
+ as just extension rather than the
  extension-assigned name.
 

diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index d6345a775b..7fec034db4 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3470,17 +3470,13 @@ void RequestAddinShmemSpace(int size)
 
 
 
- shmem_startup_hook can allocate in shared memory
+ shmem_startup_hook can allocate in dynamic shared memory
  custom wait events by calling while holding the LWLock
  AddinShmemInitLock to avoid any race conditions:
 
-uint32 WaitEventExtensionNew(void)
-
- Next, each process needs to associate the wait event allocated previously
- to a user-facing custom string, which is something done by calling:
-
-void WaitEventExtensionRegisterName(uint32 wait_event_info, const char *wait_event_name)
+uint32 WaitEventExtensionNew(const char *wait_event_name)
 
+ The wait event is associated to a user-facing custom string.
  An example can be found in src/test/modules/worker_spi
  in the PostgreSQL source tree.
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index b34b6afecd..7af3e0ba1a 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock	44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock46
 NotifyQueueTailLock	47
+WaitEventExtensionLock  48
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index b3596ece80..15b2dd06cc 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -45,6 +45,42 @@ uint32	   *my_wait_event_info = &local_my_wait_event_info;
 #define WAIT_EVENT_CLASS_MASK	0xFF00
 #define WAIT_EVENT_ID_MASK		0x
 
+/*
+ * Hash tables for storing custom wait event ids and their names in
+ * shared memory.
+ *
+ * WaitEventExtensionNameHash is used to find the name from a event id.
+ * It enables all backends look up them without additional processing
+ * per backend like L

Re: cataloguing NOT NULL constraints

2023-08-09 Thread Alvaro Herrera
On 2023-Aug-09, Peter Eisentraut wrote:

> I wonder whether the root of these problems is that we mix together primary
> key constraints and not-null constraints.  I understand that right now, with
> the proposed patch, when a table inherits from a parent table with a primary
> key constraint, we generate not-null constraints on the child, in order to
> enforce the not-nullness.  What if we did something like this instead: In
> the child table, we don't generate a not-null constraint, but instead a
> primary key constraint entry.  But we mark the primary key constraint
> somehow to say, this is just for the purpose of inheritance, don't enforce
> uniqueness, but enforce not-nullness.  Would that work?

Hmm.  One table can have many parents, and many of them can have primary
keys.  If we tried to model it the way you suggest, the child table
would need to have several primary keys.  I don't think this would work.

But I think I just need to stare at the dependency graph a little while
longer.  Maybe I just need to add some extra edges to make it work
correctly.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-09 Thread Christoph Berg
Re: Thomas Munro
> On Wed, Aug 9, 2023 at 2:01 AM Christoph Berg  wrote:
> > Putting that patch on top of v8 made it pass 294 times before exiting
> > like this:
> >
> > [08:52:34.134](0.032s) ok 1 - buffer pin conflict: cursor with conflicting 
> > pin established
> > Waiting for replication conn standby's replay_lsn to pass 0/343 on 
> > primary
> > done
> > timed out waiting for match: (?^:User was holding shared buffer pin for too 
> > long) at t/031_recovery_conflict.pl line 318.
> 
> Can you reproduce that with logging like this added on top?

603 iterations later it hit again, but didn't log anything. (I believe
I did run "make" in the right directory.)

[22:20:24.714](3.145s) # issuing query via background psql:
# BEGIN;
# DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM 
test_recovery_conflict_table1;
# FETCH FORWARD FROM test_recovery_conflict_cursor;
[22:20:24.745](0.031s) ok 1 - buffer pin conflict: cursor with conflicting pin 
established
Waiting for replication conn standby's replay_lsn to pass 0/343 on primary
done
timed out waiting for match: (?^:User was holding shared buffer pin for too 
long) at t/031_recovery_conflict.pl line 318.

Perhaps this can simply be attributed to the machine being too busy.

Christoph
2023-08-08 22:20:22.732 UTC [778626] LOG:  starting PostgreSQL 17devel (Debian 
17~~devel-1) on s390x-ibm-linux-gnu, compiled by gcc (Debian 13.2.0-1) 13.2.0, 
64-bit
2023-08-08 22:20:22.732 UTC [778626] LOG:  listening on Unix socket 
"/tmp/S8kif7vda0/.s.PGSQL.59145"
2023-08-08 22:20:22.733 UTC [778630] LOG:  database system was shut down at 
2023-08-08 22:20:22 UTC
2023-08-08 22:20:22.755 UTC [778626] LOG:  database system is ready to accept 
connections
2023-08-08 22:20:22.832 UTC [778639] 031_recovery_conflict.pl LOG:  statement: 
CREATE TABLESPACE test_recovery_conflict_tblspc LOCATION ''
2023-08-08 22:20:22.964 UTC [778644] 031_recovery_conflict.pl LOG:  received 
replication command: SHOW data_directory_mode
2023-08-08 22:20:22.964 UTC [778644] 031_recovery_conflict.pl STATEMENT:  SHOW 
data_directory_mode
2023-08-08 22:20:22.969 UTC [778644] 031_recovery_conflict.pl LOG:  received 
replication command: SHOW wal_segment_size
2023-08-08 22:20:22.969 UTC [778644] 031_recovery_conflict.pl STATEMENT:  SHOW 
wal_segment_size
2023-08-08 22:20:22.975 UTC [778644] 031_recovery_conflict.pl LOG:  received 
replication command: IDENTIFY_SYSTEM
2023-08-08 22:20:22.975 UTC [778644] 031_recovery_conflict.pl STATEMENT:  
IDENTIFY_SYSTEM
2023-08-08 22:20:22.991 UTC [778644] 031_recovery_conflict.pl LOG:  received 
replication command: BASE_BACKUP ( LABEL 'pg_basebackup base backup',  
PROGRESS,  CHECKPOINT 'fast',  WAIT 0,  MANIFEST 'yes',  TARGET 'client')
2023-08-08 22:20:22.991 UTC [778644] 031_recovery_conflict.pl STATEMENT:  
BASE_BACKUP ( LABEL 'pg_basebackup base backup',  PROGRESS,  CHECKPOINT 'fast', 
 WAIT 0,  MANIFEST 'yes',  TARGET 'client')
2023-08-08 22:20:22.998 UTC [778628] LOG:  checkpoint starting: immediate force 
wait
2023-08-08 22:20:23.006 UTC [778628] LOG:  checkpoint complete: wrote 7 buffers 
(5.5%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.001 s, sync=0.001 
s, total=0.008 s; sync files=0, longest=0.000 s, average=0.000 s; 
distance=11350 kB, estimate=11350 kB; lsn=0/260, redo lsn=0/228
2023-08-08 22:20:23.027 UTC [778654] 031_recovery_conflict.pl LOG:  received 
replication command: SHOW data_directory_mode
2023-08-08 22:20:23.027 UTC [778654] 031_recovery_conflict.pl STATEMENT:  SHOW 
data_directory_mode
2023-08-08 22:20:23.035 UTC [778654] 031_recovery_conflict.pl LOG:  received 
replication command: CREATE_REPLICATION_SLOT "pg_basebackup_778654" TEMPORARY 
PHYSICAL ( RESERVE_WAL)
2023-08-08 22:20:23.035 UTC [778654] 031_recovery_conflict.pl STATEMENT:  
CREATE_REPLICATION_SLOT "pg_basebackup_778654" TEMPORARY PHYSICAL ( RESERVE_WAL)
2023-08-08 22:20:23.073 UTC [778654] 031_recovery_conflict.pl LOG:  received 
replication command: IDENTIFY_SYSTEM
2023-08-08 22:20:23.073 UTC [778654] 031_recovery_conflict.pl STATEMENT:  
IDENTIFY_SYSTEM
2023-08-08 22:20:23.140 UTC [778654] 031_recovery_conflict.pl LOG:  received 
replication command: START_REPLICATION SLOT "pg_basebackup_778654" 0/200 
TIMELINE 1
2023-08-08 22:20:23.140 UTC [778654] 031_recovery_conflict.pl STATEMENT:  
START_REPLICATION SLOT "pg_basebackup_778654" 0/200 TIMELINE 1
2023-08-08 22:20:23.679 UTC [778644] 031_recovery_conflict.pl LOG:  temporary 
file: path "base/pgsql_tmp/pgsql_tmp778644.0", size 137324
2023-08-08 22:20:23.679 UTC [778644] 031_recovery_conflict.pl STATEMENT:  
BASE_BACKUP ( LABEL 'pg_basebackup base backup',  PROGRESS,  CHECKPOINT 'fast', 
 WAIT 0,  MANIFEST 'yes',  TARGET 'client')
2023-08-08 22:20:24.216 UTC [778709] standby LOG:  received replication 
command: IDENTIFY_SYSTEM
2023-08-08 22:20:24.216 UTC [778709] standby STATEMENT:  IDENTIFY_SYSTEM
2023-08-08 22:20:24.216 UTC [778709] standby LOG:  received replication 

Re: Adding a pg_servername() function

2023-08-09 Thread GF
Hi everybody,

On 09.08.23 08:42, Laetitia Avrot wrote:
> I agree that the feature I'm suggesting could be done with a few tricks.
> I meant to simplify the life of the user by providing a simple new
> feature. (Also, I might have trust issues with DNS due to several past
> production disasters.)

Just my contribution on why this function could be useful, since we had a
use case that fits exactly this.
In the past I needed such a pg_servername() function because in a project I
was engaged on, they needed to distribute "requests" directed to an in-house
management service running on network servers, and each one had a DB, so we
went with pglogical to be used as a (transactional) messaging middleware.
The requests were targeted at specific hostnames, thus on the replication
downstream we wanted to filter with a before-insert trigger on the hostname.
At that point, we had to do some contortions to get the hostname the DB was
running on, sometimes really nasty ones: on Linux servers we could go with
a Python function; but on Windows ones (where there was no Python
available) we had to resort to a function that read the OS "hostname"
command into a temporary table via the pg "copy", which is both tricky and
inefficient.

So, from me +1 to Laetitia's proposal.

On Wed, 9 Aug 2023 at 10:26, Peter Eisentraut  wrote:

> (*) But we should think about access control for this.


And +1 also to Peter's note on enforcing the access control. BTW that's
exactly what we needed with plpythonu, and also with "copy from program".

Best,
Giovanni


Re: [PoC] Reducing planning time when tables have many partitions

2023-08-09 Thread David Rowley
On Wed, 9 Aug 2023 at 22:28, David Rowley  wrote:
> i.e:
>
> + Bitmapset *matching_ems = NULL;
> + memcpy(&idx_iter, iter, sizeof(EquivalenceMemberIterator));
> + memcpy(&noidx_iter, iter, sizeof(EquivalenceMemberIterator));
> +
> + idx_iter.use_index = true;
> + noidx_iter.use_index = false;
> +
> + while ((em = eclass_member_iterator_strict_next(&noidx_iter)) != NULL)
> + matching_ems = bms_add_member(matching_ems, em->em_index);
> +
> + Assert(bms_equal(matching_ems, iter->matching_ems));

Slight correction, you could just get rid of idx_iter completely. I
only added that copy since the Assert code needed to iterate and I
didn't want to change the position of the iterator that's actually
being used.  Since the updated code wouldn't be interesting over
"iter", you could just use "iter" directly like I have in the
Assert(bms_equals... code above.

David




Re: [PoC] Reducing planning time when tables have many partitions

2023-08-09 Thread David Rowley
On Wed, 5 Jul 2023 at 21:58, Yuya Watari  wrote:
>
> Hello,
>
> On Fri, Mar 10, 2023 at 5:38 PM Yuya Watari  wrote:
> > Thank you for pointing it out. I have attached the rebased version to
> > this email.
>
> Recent commits, such as a8c09daa8b [1], have caused conflicts and
> compilation errors in these patches. I have attached the fixed version
> to this email.
>
> The v19-0004 adds an 'em_index' field representing the index within
> root->eq_members of the EquivalenceMember. This field is needed to
> delete EquivalenceMembers when iterating them using the ec_members
> list instead of the ec_member_indexes.

If 0004 is adding an em_index to mark the index into
PlannerInfo->eq_members, can't you use that in
setup_eclass_member[_strict]_iterator to loop to verify that the two
methods yield the same result?

i.e:

+ Bitmapset *matching_ems = NULL;
+ memcpy(&idx_iter, iter, sizeof(EquivalenceMemberIterator));
+ memcpy(&noidx_iter, iter, sizeof(EquivalenceMemberIterator));
+
+ idx_iter.use_index = true;
+ noidx_iter.use_index = false;
+
+ while ((em = eclass_member_iterator_strict_next(&noidx_iter)) != NULL)
+ matching_ems = bms_add_member(matching_ems, em->em_index);
+
+ Assert(bms_equal(matching_ems, iter->matching_ems));

That should void the complaint that the Assert checking is too slow.
You can also delete the list_ptr_cmp function too (also noticed a
complaint about that).

For the 0003 patch.  Can you explain why you think these fields should
be in RangeTblEntry rather than RelOptInfo? I can only guess you might
have done this for memory usage so that we don't have to carry those
fields for join rels?  I think RelOptInfo is the correct place to
store fields that are only used in the planner.  If you put them in
RangeTblEntry they'll end up in pg_rewrite and be stored for all
views.  Seems very space inefficient and scary as it limits the scope
for fixing bugs in back branches due to RangeTblEntries being
serialized into the catalogues in various places.

David




Re: initial pruning in parallel append

2023-08-09 Thread Amit Langote
On Tue, Aug 8, 2023 at 11:16 PM Robert Haas  wrote:
> On Tue, Aug 8, 2023 at 2:58 AM Amit Langote  wrote:
> > Or we could consider something like the patch I mentioned in my 1st
> > email.  The idea there was to pass the pruning result via a separate
> > channel, not the DSM chunk linked into the PlanState tree.  To wit, on
> > the leader side, ExecInitParallelPlan() puts the serialized
> > List-of-Bitmapset into the shm_toc with a dedicated PARALLEL_KEY,
> > alongside PlannedStmt, ParamListInfo, etc.  The List-of-Bitmpaset is
> > initialized during the leader's ExecInitNode().  On the worker side,
> > ExecParallelGetQueryDesc() reads the List-of-Bitmapset string and puts
> > the resulting node into the QueryDesc, that ParallelQueryMain() then
> > uses to do ExecutorStart() which copies the pointer to
> > EState.es_part_prune_results.  ExecInitAppend() consults
> > EState.es_part_prune_results and uses the Bitmapset from there, if
> > present, instead of performing initial pruning.
>
> This also seems reasonable.
>
> > I'm assuming it's not
> > too ugly if ExecInitAppend() uses IsParallelWorker() to decide whether
> > it should be writing to EState.es_part_prune_results or reading from
> > it -- the former if in the leader and the latter in a worker.
>
> I don't think that's too ugly. I mean you have to have an if statement
> someplace.

Yes, that makes sense.

It's just that I thought maybe I haven't thought hard enough about
options before adding a new IsParallelWorker(), because I don't find
too many instances of IsParallelWorker() in the generic executor code.
I think that's because most parallel worker-specific logic lives in
execParallel.c or in Exec*Worker() functions outside that file, so the
generic code remains parallel query agnostic as much as possible.

> > If we
> > are to go with this approach we will need to un-revert ec386948948c,
> > which moved PartitionPruneInfo nodes out of Append/MergeAppend nodes
> > to a List in PlannedStmt (copied into EState.es_part_prune_infos),
> > such that es_part_prune_results mirrors es_part_prune_infos.
>
> The comment for the revert (which was
> 5472743d9e8583638a897b47558066167cc14583) points to
> https://www.postgresql.org/message-id/4191508.1674157...@sss.pgh.pa.us
> as the reason, but it's not very clear to me why that email led to
> this being reverted. In any event, I agree that if we go with your
> idea to pass this via a separate PARALLEL_KEY, unreverting that patch
> seems to make sense, because otherwise I think we don't have a fast
> way to find the nodes that contain the state that we care about.

OK, I've attached the unreverted patch that adds
EState.es_part_prune_infos as 0001.

0002 adds EState.es_part_prune_results.   Parallel query leader stores
the bitmapset of initially valid subplans by performing initial
pruning steps contained in a given PartitionPruneInfo into that list
at the same index as the PartitionPruneInfo's index in
es_part_prune_infos.  ExecInitParallelPlan() serializes
es_part_prune_results and stores it in the DSM.  A worker initializes
es_part_prune_results in its own EState by reading the leader's value
from the DSM and for each PartitionPruneInfo in its own copy of
EState.es_part_prune_infos, gets the set of initially valid subplans
by referring to es_part_prune_results in lieu of performing initial
pruning again.

Should workers, as Tom says, instead do the pruning and cross-check
the result to give an error if it doesn't match the leader's?  The
error message can't specifically point out which, though a user would
at least know that they have functions in their database with wrong
volatility markings.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v1-0001-Move-PartitioPruneInfo-out-of-plan-nodes-into-Pla.patch
Description: Binary data


v1-0002-Share-initial-pruning-result-between-parallel-que.patch
Description: Binary data


Re: logical decoding issue with concurrent ALTER TYPE

2023-08-09 Thread Amit Kapila
On Sun, Aug 6, 2023 at 11:06 AM Masahiko Sawada  wrote:
>
> A colleague Drew Callahan (in CC) has discovered that the logical
> decoding doesn't handle syscache invalidation messages properly that
> are generated by other transactions. Here is example (I've attached a
> patch for isolation test),
>
> -- Setup
> CREATE TYPE comp AS (f1 int, f2 text);
> CREATE TABLE tbl3(val1 int, val2 comp);
> SELECT pg_create_logical_replication_slot('s', 'test_decoding');
>
> -- Session 1
> BEGIN;
> INSERT INTO tbl3 (val1, val2) VALUES (1, (1, '2'));
>
> -- Session 2
> ALTER TYPE comp ADD ATTRIBUTE f3 int;
>
> INSERT INTO tbl3 (val1, val2) VALUES (1, (1, '2', 3));
> COMMIT;
>
> pg_logical_slot_get_changes() returns:
>
> BEGIN
> table public.tbl3: INSERT: val1[integer]:1 val2[comp]:'(1,2)'
> table public.tbl3: INSERT: val1[integer]:1 val2[comp]:'(1,2)'
> COMMIT
>
> However, the logical decoding should reflect the result of ALTER TYPE
> and the val2 of the second INSERT output should be '(1,2,3)'.
>
> The root cause of this behavior is that while ALTER TYPE can be run
> concurrently to INSERT, the logical decoding doesn't handle cache
> invalidation properly, and it got a cache hit of stale data (of
> typecache in this case). Unlike snapshots that are stored in the
> transaction’s reorder buffer changes, the invalidation messages of
> other transactions are not distributed. As a result, the snapshot
> becomes moot when we get a cache hit of stale data due to not
> processing the invalidation messages again. This is not an issue for
> ALTER TABLE and the like due to 2 phase locking and taking an
> AccessExclusive lock. The issue with DMLs and ALTER TYPE has been
> discovered but there might be other similar cases.
>
> As far as I tested, this issue has existed since v9.4, where the
> logical decoding was introduced, so it seems to be a long-standing
> issue.
>
> The simplest fix would be to invalidate all caches when setting a new
> historical snapshot (i.e. applying CHANGE_INTERNAL_SNAPSHOT) but we
> end up invalidating other caches unnecessarily too.
>

I think this could be quite onerous for workloads having a mix of DDL/DMLs.

> A better fix would be that when decoding the commit of a catalog
> changing transaction, we distribute invalidation messages to other
> concurrent transactions, like we do for snapshots. But we might not
> need to distribute all types of invalidation messages, though.
>

I would prefer the solution on these lines. It would be good if we
distribute only the required type of invalidations.

-- 
With Regards,
Amit Kapila.




Re: Remove distprep

2023-08-09 Thread Christoph Berg
Re: Tom Lane
> Have we yet run this concept past the packagers list?  I'm still
> wondering if they will raise any objection to getting rid of all
> the prebuilt files.

No problem for Debian, we are building snapshot releases from plain
git already without issues. In fact, there are already some explicit
rm/clean rules in the packages to force rebuilding some (most?) of the
pre-built files.

Most notably, we are also rebuilding "configure" using autoconf 2.71
without issues. Perhaps we can get rid of the 2.69 hardcoding there?

Thanks for the heads-up,
Christoph




Re: pgsql: Ignore BRIN indexes when checking for HOT udpates

2023-08-09 Thread Alvaro Herrera
On 2021-Nov-30, Tomas Vondra wrote:

> Ignore BRIN indexes when checking for HOT udpates

I was trying to use RelationGetIndexAttrBitmap for something and
realized that its header comment does not really explain things very
well.  That was already the case before this commit, but it (this
commit) did add new possible values without mentioning them.  I propose
the attached comment-only patch.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Al principio era UNIX, y UNIX habló y dijo: "Hello world\n".
No dijo "Hello New Jersey\n", ni "Hello USA\n".
>From db4c89d15d121bf0e15c4160cb2f0770bc98195a Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 12 Jul 2023 14:19:23 +0200
Subject: [PATCH] Document RelationGetIndexAttrBitmap better

Commit 19d8e2308bc5 changed the list of set-of-columns that can be
returned by RelationGetIndexAttrBitmap, but didn't update its
"documentation".  That was pretty hard to read already, so rewrite to
make it more comprehensible, adding the missing values while at it.

Backpatch to 16, like that commit.
---
 src/backend/utils/cache/relcache.c | 12 +---
 src/include/utils/relcache.h   |  3 +++
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8e28335915..8e08ca1c68 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5150,9 +5150,15 @@ RelationGetIndexPredicate(Relation relation)
  * simple index keys, but attributes used in expressions and partial-index
  * predicates.)
  *
- * Depending on attrKind, a bitmap covering the attnums for all index columns,
- * for all potential foreign key columns, or for all columns in the configured
- * replica identity index is returned.
+ * Depending on attrKind, a bitmap covering attnums for certain columns is
+ * returned:
+ *	INDEX_ATTR_BITMAP_KEY			Columns in non-partial unique indexes not
+ *	in expressions (i.e., usable for FKs)
+ *	INDEX_ATTR_BITMAP_PRIMARY_KEY	Columns in the table's primary key
+ *	INDEX_ATTR_BITMAP_IDENTITY_KEY	Columns in the table's replica identity
+ *	index (empty if FULL)
+ *	INDEX_ATTR_BITMAP_HOT_BLOCKING	Columns that block updates from being HOT
+ *	INDEX_ATTR_BITMAP_SUMMARIZED	Columns included in summarizing indexes
  *
  * Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that
  * we can include system attributes (e.g., OID) in the bitmap representation.
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index beeb28b83c..38524641f4 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -54,6 +54,9 @@ extern List *RelationGetIndexPredicate(Relation relation);
 extern Datum *RelationGetIndexRawAttOptions(Relation indexrel);
 extern bytea **RelationGetIndexAttOptions(Relation relation, bool copy);
 
+/*
+ * Which set of columns to return by RelationGetIndexAttrBitmap.
+ */
 typedef enum IndexAttrBitmapKind
 {
 	INDEX_ATTR_BITMAP_KEY,
-- 
2.39.2



Re: pgbnech: allow to cancel queries during benchmark

2023-08-09 Thread Fabien COELHO



Hello Yugo-san,

Some feedback about v2.

There is some dead code (&& false) which should be removed.


Maybe it should check that cancel is not NULL before calling PQcancel?


I think this is already checked as below, but am I missing something?

+   if (all_state[i].cancel != NULL)
+   (void) PQcancel(all_state[i].cancel, errbuf, sizeof(errbuf));


After looking at the code, I'm very unclear whether they may be some
underlying race conditions, or not, depending on when the cancel is
triggered. I think that some race conditions are still likely given the
current thread 0 implementation, and dealing with them with a barrier or
whatever is not desirable at all.

In order to work around this issue, ISTM that we should go away from the
simple and straightforward thread 0 approach, and the only clean way is
that the cancelation should be managed by each thread for its own client.

I'd suggest to have the advanceState to call PQcancel when CancelRequested
is set and switch to CSTATE_ABORTED to end properly. This means that there
would be no need for the global client states pointer, so the patch should
be smaller and simpler. Possibly there would be some shortcuts added here
and there to avoid lingering after the control-C, in threadRun.


I am not sure this approach is simpler than mine.


My argument is more about latent race conditions and inter-thread 
interference than code simplicity.



In multi-threads, only one thread can catches the signal and other threads
continue to run.


Yep. This why I see plenty uncontrolled race conditions if thread 0 
cancels clients which are managed in parallel by other threads and may be 
active. I'm not really motivated to delve into libpq internals to check 
whether there are possibly bad issues or not, but if two threads write 
message at the same time in the same socket, I assume that this can be 
bad if you are unlucky.


ISTM that the rational convention should be that each thread cancels its 
own clients, which ensures that there is no bad interaction between 
threads.



Therefore, if Ctrl+C is pressed while threads are waiting
responses from the backend in wait_on_socket_set, only one thread can be
interrupted and return, but other threads will continue to wait and cannot
check CancelRequested.



So, for implementing your suggestion, we need any hack
to make all threads return from wait_on_socket_set when the event occurs, but
I don't have idea to do this in simpler way.



In my patch, all threads can return from wait_on_socket_set at Ctrl+C
because when thread #0 cancels all connections, the following error is
sent to all sessions:

 ERROR:  canceling statement due to user request

and all threads will receive the response from the backend.


Hmmm.

I understand that the underlying issue you are raising is that other 
threads may be stuck while waiting on socket events and that with your 
approach they will be cleared somehow by socket 0.


I'll say that (1) this point does not address potential race condition 
issues with several thread interacting directly on the same client ;
(2) thread 0 may also be stuck waiting for events so the cancel is only 
taken into account when it is woken up.


If we accept that each thread cancels its clients when it is woken up, 
which may imply some (usually small) delay, then it is not so different 
from the current version because the code must wait for 0 to wake up 
anyway, and it solves (1). The current version does not address potential 
thread interactions.


--
Fabien.




Re: Avoid stack frame setup in performance critical routines using tail calls

2023-08-09 Thread David Rowley
On Fri, 21 Jul 2023 at 14:03, David Rowley  wrote:
> I'll reply back with a more detailed review next week.

Here's a review of v2-0001:

1.

/*
* XXX: Should this also be moved into alloc()? We could possibly avoid
* zeroing in some cases (e.g. if we used mmap() ourselves.
*/
MemSetAligned(ret, 0, size);

Maybe this should be moved to the alloc function.  It would allow us
to get rid of this:

#define palloc0fast(sz) \
( MemSetTest(0, sz) ? \
MemoryContextAllocZeroAligned(CurrentMemoryContext, sz) : \
MemoryContextAllocZero(CurrentMemoryContext, sz) )

If we do the zeroing inside the alloc function then it can always use
the MemoryContextAllocZeroAligned version providing we zero before
setting the sentinel byte.

It would allow the tail call in the palloc0() case, but the drawback
would be having to check for the MCXT_ALLOC_ZERO flag in the alloc
function. I wonder if that branch would be predictable in most cases,
e.g. the parser will be making lots of nodes and want to zero all
allocations, but the executor won't be doing much of that. There will
be a mix of zeroing and not zeroing in the planner, mostly not, I
think.

2. Why do you need to add the NULL check here?

 #ifdef USE_VALGRIND
- if (method != MCTX_ALIGNED_REDIRECT_ID)
+ if (ret != NULL && method != MCTX_ALIGNED_REDIRECT_ID)
  VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size);
 #endif

I know it's just valgrind code and performance does not matter, but
the realloc flags are being passed as 0, so allocation failures won't
return.

3.

/*
* XXX: Probably no need to check for huge allocations, we only support
* one size? Which could theoretically be huge, but that'd not make
* sense...
*/

They can't be huge per Assert(fullChunkSize <= MEMORYCHUNK_MAX_VALUE)
in SlabContextCreate().

4. It would be good to see some API documentation in the
MemoryContextMethods struct.  This adds a lot of responsibility onto
the context implementation without any extra documentation to explain
what, for example, palloc is responsible for and what the alloc
function needs to do itself.

David




Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Michael Paquier
On Wed, Aug 09, 2023 at 05:00:49PM +0900, Kyotaro Horiguchi wrote:
> Looks fine.

Okay, I've updated the patch in consequence.  I'll look at 0001 again
at the beginning of next week.

> While it's a kind of bug in total, we encountered a case where an
> excessively large xl_tot_len actually came from a corrupted
> record. [1]

Right, I remember this one.  I think that Thomas was pretty much right
that this could be caused because of a lack of zeroing in the WAL
pages.

> I'm glad to see this infrastructure comes in, and I'm on board with
> retrying due to an OOM. However, I think we really need official steps
> to wrap up recovery when there is a truly broken, oversized
> xl_tot_len.

There are a few options on the table, only doable once the WAL reader
provider the error state to the startup process:
1) Retry a few times and FATAL.
2) Just FATAL immediately and don't wait.
3) Retry and hope for the best that the host calms down.
I have not seeing this issue being much of an issue in the field, so
perhaps option 2 with the structure of 0002 and a FATAL when we catch
XLOG_READER_OOM in the switch would be enough.  At least that's enough
for the cases we've seen.  I'll think a bit more about it, as well.

Yeah, agreed.  That's orthogonal to the issue reported by Ethan,
unfortunately, where he was able to trigger the issue of this thread
by manipulating the sizing of a host after producing a record larger
than what the host could afford after the resizing :/
--
Michael
From 1274883c4d06dec8876ec60ad983749b7fae3946 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 9 Aug 2023 17:39:52 +0900
Subject: [PATCH v3 1/3] Add infrastructure to report error codes in WAL reader

This commits moves the error state coming from WAL readers into a new
structure, that includes the existing pointer to the error message
buffer, but it also gains an error code that fed back to the callers of
the following routines:
XLogPrefetcherReadRecord()
XLogReadRecord()
XLogNextRecord()
DecodeXLogRecord()

This will help in improving the decisions to take during recovery
depending on the failure more reported.
---
 src/include/access/xlogprefetcher.h   |   2 +-
 src/include/access/xlogreader.h   |  33 +++-
 src/backend/access/transam/twophase.c |   8 +-
 src/backend/access/transam/xlog.c |   6 +-
 src/backend/access/transam/xlogprefetcher.c   |   4 +-
 src/backend/access/transam/xlogreader.c   | 167 --
 src/backend/access/transam/xlogrecovery.c |  14 +-
 src/backend/access/transam/xlogutils.c|   2 +-
 src/backend/replication/logical/logical.c |   9 +-
 .../replication/logical/logicalfuncs.c|   9 +-
 src/backend/replication/slotfuncs.c   |   8 +-
 src/backend/replication/walsender.c   |   8 +-
 src/bin/pg_rewind/parsexlog.c |  24 +--
 src/bin/pg_waldump/pg_waldump.c   |  10 +-
 contrib/pg_walinspect/pg_walinspect.c |  11 +-
 src/tools/pgindent/typedefs.list  |   2 +
 16 files changed, 200 insertions(+), 117 deletions(-)

diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
index 7dd7f20ad0..5563ad1a67 100644
--- a/src/include/access/xlogprefetcher.h
+++ b/src/include/access/xlogprefetcher.h
@@ -48,7 +48,7 @@ extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
 	XLogRecPtr recPtr);
 
 extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
-			char **errmsg);
+			XLogReaderError *errordata);
 
 extern void XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index da32c7db77..06664dc6fb 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -58,6 +58,24 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
+/* Values for XLogReaderError.errorcode */
+typedef enum XLogReaderErrorCode
+{
+	XLOG_READER_NO_ERROR = 0,
+	XLOG_READER_OOM,			/* out-of-memory */
+	XLOG_READER_INVALID_DATA,	/* record data */
+} XLogReaderErrorCode;
+
+/* Error status generated by a WAL reader on failure */
+typedef struct XLogReaderError
+{
+	/* Buffer to hold error message */
+	char	   *message;
+	/* Error code when filling *message */
+	XLogReaderErrorCode code;
+} XLogReaderError;
+
+
 /* Function type definitions for various xlogreader interactions */
 typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 			   XLogRecPtr targetPagePtr,
@@ -307,9 +325,9 @@ struct XLogReaderState
 	char	   *readRecordBuf;
 	uint32		readRecordBufSize;
 
-	/* Buffer to hold error message */
-	char	   *errormsg_buf;
-	bool		errormsg_deferred;
+	/* Error state data */
+	XLogReaderError errordata;
+	bool		errordata_deferred;
 
 	/*
 	 * Flag to indicate to XLogPageReadCB that it should not block waiting for
@@ -324,7 +342,8 @@ struct XLogReaderState
 static inline bool
 XLogReaderHasQueuedRe

Re: Row pattern recognition

2023-08-09 Thread Tatsuo Ishii
Attached is the v4 patch. Differences from previous patch include:

> - PERMUTE is still misspelled as PREMUTE

Fixed.

> - PATTERN variables do not have to exist in the DEFINE clause.  They are
> - considered TRUE if not present.

Fixed. Moreover new regression test case is added.

- It was possible that tle nodes in DEFINE clause do not appear in the
  plan's target list. This makes impossible to evaluate expressions in
  the DEFINE because it does not appear in the outer plan's target
  list. To fix this, call findTargetlistEntrySQL99 (with resjunk is
  true) so that the missing TargetEntry is added to the outer plan
  later on.

- I eliminated some hacks in handling the Var node in DEFINE
  clause. Previously I replaced varattno of Var node in a plan tree by
  hand so that it refers to correct varattno in the outer plan
  node. In this patch I modified set_upper_references so that it calls
  fix_upper_expr for those Var nodes in the DEFINE clause. See v4-0003
  patch for more details.

- I found a bug with pattern matching code. It creates a string for
  subsequent regular expression matching. It uses the initial letter
  of each define variable name. For example, if the varname is "foo",
  then "f" is used. Obviously this makes trouble if we have two or
  more variables starting with same "f" (e.g. "food"). To fix this, I
  assign [a-z] to each variable instead of its initial letter. However
  this way limits us not to have more than 26 variables. I hope 26 is
  enough for most use cases.

Best reagards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
>From 5c6430e171ab9f9a75df92fac25949cb96fe1da2 Mon Sep 17 00:00:00 2001
From: Tatsuo Ishii 
Date: Wed, 9 Aug 2023 16:56:17 +0900
Subject: [PATCH v4 1/7] Row pattern recognition patch for raw parser.

---
 src/backend/parser/gram.y   | 216 +---
 src/include/nodes/parsenodes.h  |  56 +
 src/include/parser/kwlist.h |   8 ++
 src/include/parser/parse_node.h |   1 +
 4 files changed, 267 insertions(+), 14 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 15ece871a0..62c1919538 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	DefElem	   *defelt;
 	SortBy	   *sortby;
 	WindowDef  *windef;
+	RPCommonSyntax	*rpcom;
+	RPSubsetItem	*rpsubset;
 	JoinExpr   *jexpr;
 	IndexElem  *ielem;
 	StatsElem  *selem;
@@ -453,8 +455,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 TriggerTransitions TriggerReferencing
 vacuum_relation_list opt_vacuum_relation_list
 drop_option_list pub_obj_list
-
-%type 	opt_routine_body
+row_pattern_measure_list row_pattern_definition_list
+opt_row_pattern_subset_clause
+row_pattern_subset_list row_pattern_subset_rhs
+row_pattern
+%type 	 row_pattern_subset_item
+%type 	opt_routine_body row_pattern_term
 %type  group_clause
 %type 	group_by_list
 %type 	group_by_item empty_grouping_set rollup_clause cube_clause
@@ -551,6 +557,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	relation_expr_opt_alias
 %type 	tablesample_clause opt_repeatable_clause
 %type 	target_el set_target insert_column_item
+row_pattern_measure_item row_pattern_definition
 
 %type 		generic_option_name
 %type 	generic_option_arg
@@ -633,6 +640,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	window_clause window_definition_list opt_partition_clause
 %type 	window_definition over_clause window_specification
 opt_frame_clause frame_extent frame_bound
+%type 	opt_row_pattern_common_syntax opt_row_pattern_skip_to
+%type 	opt_row_pattern_initial_or_seek
+%type 	opt_row_pattern_measures
 %type 	opt_window_exclusion_clause
 %type 		opt_existing_window_name
 %type  opt_if_not_exists
@@ -659,7 +669,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 json_object_constructor_null_clause_opt
 json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -702,7 +711,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE
 
 	DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS
-	DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC
+	DEFERRABLE DEFERRED DEFINE DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC
 	DETACH DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P
 	DOUBLE_P DROP
 
@@ -718,7 +727,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	HANDLER HAVING HEADER_P HOLD HOUR_P
 
 	IDENTITY_P IF_P ILIKE 

Re: Adding a pg_servername() function

2023-08-09 Thread Peter Eisentraut

On 09.08.23 08:42, Laetitia Avrot wrote:
I agree that the feature I'm suggesting could be done with a few tricks. 
I meant to simplify the life of the user by providing a simple new 
feature. (Also, I might have trust issues with DNS due to several past 
production disasters.)


My question is very simple: Do you oppose having this feature in Postgres?


I think this is pretty harmless(*) and can be useful, so it seems 
reasonable to pursue.


(*) But we should think about access control for this.  If you're in a 
DBaaS environment, providers might not like that you can read out their 
internal host names.  I'm not sure if there is an existing permission 
role that this could be attached to or if we need a new one.







Re: [PoC] Reducing planning time when tables have many partitions

2023-08-09 Thread Yuya Watari
Hello Andrey, Ashutosh, and David,

Thank you for your reply and for reviewing the patch.

On Mon, Aug 7, 2023 at 5:51 PM Andrey Lepikhov
 wrote:
> One more thing: I think, you should add comments to
> add_child_rel_equivalences() and add_child_join_rel_equivalences()
> on replacing of:
>
> if (bms_is_subset(cur_em->em_relids, top_parent_relids) &&
> !bms_is_empty(cur_em->em_relids))
> and
> if (bms_overlap(cur_em->em_relids, top_parent_relids))
>
> with different logic. What was changed? It will be better to help future
> developers realize this part of the code more easily by adding some
> comments.

The following change in add_child_join_rel_equivalences():

-/* Does this member reference child's topmost parent rel? */
-if (bms_overlap(cur_em->em_relids, top_parent_relids))

is correct because EquivalenceMemberIterator guarantees that these two
Relids always overlap for the iterated results. The following code
does this iteration. As seen from the below code, the iteration
eliminates not overlapping Relids, so we do not need to check
bms_overlap() for the iterated results.

=
/*
 * eclass_member_iterator_next
 * Fetch the next EquivalenceMember from an EquivalenceMemberIterator
 * which was set up by setup_eclass_member_iterator().  Returns NULL when
 * there are no more matching EquivalenceMembers.
 */
EquivalenceMember *
eclass_member_iterator_next(EquivalenceMemberIterator *iter)
{
...
ListCell   *lc;

for_each_from(lc, iter->eclass->ec_members, iter->current_index + 1)
{
EquivalenceMember *em = lfirst_node(EquivalenceMember, lc);
...
/*
 * Don't return members which have no common rels with with_relids
 */
if (!bms_overlap(em->em_relids, iter->with_relids))
continue;

return em;
}
return NULL;
...
}
=

I agree with your opinion that my patch lacks some explanations, so I
will consider adding more comments. However, I received the following
message from David in March.

On Thu, Mar 9, 2023 at 6:23 AM David Rowley  wrote:
> For the main patch, I've been starting to wonder if it should work
> completely differently.  Instead of adding members for partitioned and
> inheritance children, we could just translate the Vars from child to
> top-level parent and find the member that way. I wondered if this
> method might be even faster as it would forego
> add_child_rel_equivalences(). I think we'd still need em_is_child for
> UNION ALL children.  So far, I've not looked into this in detail. I
> was hoping to find an idea that would allow some means to have the
> planner realise that a LIST partition which allows a single Datum
> could skip pushing base quals which are constantly true. i.e:
>
> create table lp (a int) partition by list(a);
> create table lp1 partition of lp for values in(1);
> explain select * from lp where a = 1;
>
>  Seq Scan on lp1 lp  (cost=0.00..41.88 rows=13 width=4)
>Filter: (a = 1)

I am concerned that fixing the current patch will conflict with
David's idea. Of course, I am now trying to experiment with the above
idea, but I should avoid the conflict if he is working on this. David,
what do you think about this? Is it OK to post a new patch to address
the review comments? I am looking forward to your reply.

-- 
Best regards,
Yuya Watari




Re: Fix last unitialized memory warning

2023-08-09 Thread Peter Eisentraut

On 08.08.23 17:14, Tristan Partin wrote:
I was able to reproduce the warning now on Fedora.  I agree with the 
patch


-   PgBenchValue vargs[MAX_FARGS];
+   PgBenchValue vargs[MAX_FARGS] = { 0 };

I suggest to also do

  typedef enum
  {
-   PGBT_NO_VALUE,
+   PGBT_NO_VALUE = 0,

to make clear that the initialization value is meant to be invalid.

I also got the plpgsql warning that you showed earlier, but I couldn't 
think of a reasonable way to fix that.


Applied in v2.


committed





Re: WIP: new system catalog pg_wait_event

2023-08-09 Thread Michael Paquier
On Tue, Aug 08, 2023 at 10:16:37AM +0200, Drouvot, Bertrand wrote:
> Please find attached v3 adding the wait event types.

+-- There will surely be at least 9 wait event types, 240 wait events and at
+-- least 27 related to WAL
+select count(distinct(wait_event_type)) > 8 as ok_type,
+   count(*) > 239 as ok,
+   count(*) FILTER (WHERE description like '%WAL%') > 26 AS ok_wal_desc
+from pg_wait_event;
The point is to check the execution of this function, so this could be
simpler, like that or a GROUP BY clause with the event type:
SELECT count(*) > 0 FROM pg_wait_event;
SELECT wait_event_type, count(*) > 0 AS has_data FROM pg_wait_event
  GROUP BY wait_event_type ORDER BY wait_event_type;

+   printf $ic "\tmemset(values, 0, sizeof(values));\n";
+   printf $ic "\tmemset(nulls, 0, sizeof(nulls));\n\n";
+   printf $ic "\tvalues[0] = CStringGetTextDatum(\"%s\");\n", $last;
+   printf $ic "\tvalues[1] = CStringGetTextDatum(\"%s\");\n", 
$wev->[1];
+   printf $ic "\tvalues[2] = CStringGetTextDatum(\"%s\");\n\n", 
$new_desc;

That's overcomplicated for some code generated.  Wouldn't it be
simpler to generate a list of elements, with the code inserting the
tuples materialized looping over it?

+   my $new_desc = substr $wev->[2], 1, -2;
+   $new_desc =~ s/'/\\'/g;
+   $new_desc =~ s/<.*>(.*?)<.*>/$1/g;
+   $new_desc =~ s//$1/g;
+   $new_desc =~ s/; see.*$//;
Better to document what this does, the contents produced look good.

+   rename($ictmp, "$output_path/pg_wait_event_insert.c")
+ || die "rename: $ictmp to $output_path/pg_wait_event_insert.c: $!";

 # seems nicer to not add that as an include path for the whole backend.
 waitevent_sources = files(
   'wait_event.c',
+  'pg_wait_event.c',
 )

This could use a name referring to SQL functions, say
wait_event_funcs.c, with a wait_event_data.c or a
wait_event_funcs_data.c?

+   # Don't generate .c (except pg_wait_event_insert.c) and .h files for
+   # Extension, LWLock and Lock, these are handled independently.
+   my $is_exception = $waitclass eq 'WaitEventExtension' ||
+  $waitclass eq 'WaitEventLWLock' ||
+  $waitclass eq 'WaitEventLock';
Perhaps it would be cleaner to use a separate loop?
--
Michael


signature.asc
Description: PGP signature


Re: cataloguing NOT NULL constraints

2023-08-09 Thread Peter Eisentraut

On 05.08.23 21:50, Dean Rasheed wrote:

Anyway, I was at the same time fixing the other problem you reported
with inheritance (namely, adding a PK ends up with the child column
being marked NOT NULL but no corresponding constraint).

At some point I wondered if the easy way out wouldn't be to give up on
the idea that creating a PK causes the child columns to be marked
not-nullable.  However, IIRC I decided against that because it breaks
restoring of old dumps, so it wouldn't be acceptable.

To make matters worse: pg_dump creates the PK as

   ALTER TABLE ONLY parent ADD PRIMARY KEY ( ... )

note the ONLY there.  It seems I'm forced to cause the PK to affect
children even though ONLY is given.  This is undesirable but I don't see
a way out of that.

It is all a bit of a rat's nest.



I wonder if that could be made to work in the same way as inherited
CHECK constraints -- dump the child's inherited NOT NULL constraints,
and then manually update conislocal in pg_constraint.


I wonder whether the root of these problems is that we mix together 
primary key constraints and not-null constraints.  I understand that 
right now, with the proposed patch, when a table inherits from a parent 
table with a primary key constraint, we generate not-null constraints on 
the child, in order to enforce the not-nullness.  What if we did 
something like this instead: In the child table, we don't generate a 
not-null constraint, but instead a primary key constraint entry.  But we 
mark the primary key constraint somehow to say, this is just for the 
purpose of inheritance, don't enforce uniqueness, but enforce 
not-nullness.  Would that work?






Re: Extract numeric [field] in JSONB more effectively

2023-08-09 Thread jian he
On Wed, Aug 9, 2023 at 4:30 AM Chapman Flack  wrote:
>
> Hi,
>
> Looking at the most recent patch, so far I have a minor
> spelling point, and a question (which I have not personally
> explored).
>
> The minor spelling point, the word 'field' has been spelled
> 'filed' throughout this comment (just as in the email subject):
>
> +   /*
> +* Simplify cast(jsonb_object_filed(jsonb, filedName) as
type)
> +* to jsonb_object_field_type(jsonb, filedName,
targetTypeOid);
> +*/
>
> The question: the simplification is currently being applied
> when the underlying operation uses F_JSONB_OBJECT_FIELD.
> Are there opportunities for a similar benefit if applied
> over F_JSONB_ARRAY_ELEMENT and/or F_JSONB_EXTRACT_PATH?
>
> Regards,
> -Chap


Based on most recent patch
in jsonb_object_field_type function, I made some changes, need to
include  . just created a C function, but didn't rebuild. then
compare it with the "numeric"(jsonb) function. overall it's fast.

some changes I made in jsonb_object_field_type.

uint32 i;
char *endptr;
if (JB_ROOT_IS_OBJECT(jb))
v = getKeyJsonValueFromContainer(&jb->root,
VARDATA_ANY(key),
VARSIZE_ANY_EXHDR(key),
&vbuf);
else if (JB_ROOT_IS_ARRAY(jb) && !JB_ROOT_IS_SCALAR(jb)) /* scalar element
is  pseudo-array */
{
errno = 0;
char *src = text_to_cstring(key);
i = (uint32) strtoul(src, &endptr, 10);

if (endptr == src || *endptr != '\0' || errno != 0)
{
elog(ERROR,"invalid input syntax when convert to integer:");
}
// i boundary index checked inside.
v = getIthJsonbValueFromContainer(&jb->root,i);
}
else if (JB_ROOT_IS_SCALAR(jb))
{
if (!JsonbExtractScalar(&jb->root, &vbuf) || vbuf.type != jbvNumeric)
cannotCastJsonbValue(vbuf.type, "numeric");
v = &vbuf;
}
else
PG_RETURN_NULL();
---
The following query will return zero rows. but jsonb_object_field_type will
be faster.
select jsonb_object_field_type('[1.1,2.2]'::jsonb,'1', 1700),
jsonb_object_field_type('{"1":10.2}'::jsonb,'1', 1700),
jsonb_object_field_type('10.2'::jsonb,'1', 1700)
except
select "numeric"(('[1.1,2.2]'::jsonb)[1]),
"numeric"('{"1":10.2}'::jsonb->'1'),
"numeric"('10.2'::jsonb);

how to glue it as a support function, or make it more generic needs extra
thinking.


  1   2   >