Re: Make pgbench exit on SIGINT more reliably

2023-06-21 Thread Yugo NAGATA
On Mon, 19 Jun 2023 16:49:05 -0700
"Tristan Partin"  wrote:

> On Mon Jun 19, 2023 at 6:39 AM PDT, Yugo NAGATA wrote:
> > On Wed, 24 May 2023 08:58:46 -0500
> > "Tristan Partin"  wrote:
> >
> > > On Tue May 23, 2023 at 7:31 PM CDT, Michael Paquier wrote:
> > > > On Mon, May 22, 2023 at 10:02:02AM -0500, Tristan Partin wrote:
> > > > > The way that pgbench handled SIGINT changed in
> > > > > 1d468b9ad81b9139b4a0b16b416c3597925af4b0. Unfortunately this had a
> > > > > couple of unintended consequences, at least from what I can tell[1].
> > > > > 
> > > > > - CTRL-C no longer stops the program unless the right point in pgbench
> > > > >   execution is hit 
> > > > > - pgbench no longer exits with a non-zero exit code
> > > > > 
> > > > > An easy reproduction of these problems is to run with a large scale
> > > > > factor like: pgbench -i -s 50. Then try to CTRL-C the program.
> > > >
> > > > This comes from the code path where the data is generated client-side,
> > > > and where the current CancelRequested may not be that responsive,
> > > > isn't it?
> > > 
> > > It comes from the fact that CancelRequested is only checked within the
> > > generation of the pgbench_accounts table, but with a large enough scale
> > > factor or enough latency, say cross-continent communication, generating
> > > the other tables can be time consuming too. But, yes you are more likely
> > > to run into this issue when generating client-side data.
> >
> > If I understand correctly, your patch allows pgbench to exit immediately
> > during client-side data generation even while the tables other than
> > pgbench_accounts are processed. It can be useful when the scale factor
> > is large. However, the same purpose seems to be achieved by your other
> > patch [1]. Is the latter your latest proposal that replaces the patch in
> > this thread?
> >
> > [1] 
> > https://www.postgresql.org/message-id/flat/CSTU5P82ONZ1.19XFUGHMXHBRY%40c3po
> 
> The other patch does not replace this one. They are entirely separate.

After applying the other patch, CancelRequested would be checked frequently
enough even during initialization of pgbench_branches and pgbench_tellers,
allowing the initialization step to be cancelled immediately even if the
scale factor is high. So, is it unnecessary to modify setup_cancel_handler
anyway?
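
(To illustrate what "frequently enough" means here, the per-table generation
loops would contain a check along these lines; this is only a sketch of the
idea, not the actual patch:)

```c
	for (i = 0; i < ntellers * scale; i++)
	{
		/* sketch: honor Ctrl-C promptly, as is already done for pgbench_accounts */
		if (CancelRequested)
			break;

		/* ... generate and send the pgbench_tellers row ... */
	}
```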

I think it would still be nice to introduce a new exit code for query cancel
separately.

> 
> > Also, your proposal includes changing the exit code when pgbench is
> > cancelled by SIGINT. I think it is nice because this will make it easy to
> > understand and clarify the result of the initialization.
> >
> > Currently, pgbench returns 0 when the initialization is cancelled while
> > populating pgbench_branches or pgbench_tellers, but the resultant
> > pgbench_accounts has only one row, which is an inconsistent result.
> > Returning a specific value for SIGINT cancel can let the user know it
> > explicitly. In addition, it seems better if an error message informing
> > the user were output.
> >
> > For the above purpose, the patch moved the exit codes of psql to fe_utils
> > to be shared. However, I am not sure this is a good way. Each frontend
> > tool may have its own exit codes; for example, "2" means "bad connection"
> > in psql [2], but "errors during the run" in pgbench [3]. So, I think it
> > is better to define them separately.
> >
> > [2] https://www.postgresql.org/docs/current/app-psql.html#id-1.9.4.20.7
> > [3] https://www.postgresql.org/docs/current/pgbench.html#id=id-1.9.4.11.7
> 
> I can change it. I just figured that sharing exit code definitions
> would've been preferable. I will post a v3 some time soon which removes
> that patch.

Great!

> 
> Thanks for your review :).
> 
> -- 
> Tristan Partin
> Neon (https://neon.tech)


-- 
Yugo NAGATA 




Re: Do we want a hashset type?

2023-06-21 Thread Joel Jacobson
On Tue, Jun 20, 2023, at 18:25, Tomas Vondra wrote:
> On 6/20/23 16:56, Joel Jacobson wrote:
>> The reference to consistency with what we do elsewhere might not be entirely
>> applicable in this context, since the set feature we're designing is a new 
>> beast
>> in the SQL landscape.
>
> I don't see how it's new, considering relational algebra is pretty much
> based on (multi)sets, and the three-valued logic with NULL values is
> pretty well established part of that.

What I meant was that the SET feature is new, since it doesn't exist in
either PostgreSQL or SQL.

>> I think adhering to the theoretical purity of sets by excluding NULLs aligns 
>> us
>> with set theory, simplifies our code, and parallels set implementations in 
>> other
>> languages.
>
> I don't see how that would be more theoretically pure, really. The
> three-valued logic is a well established part of relational algebra, so
> not respecting that is more a violation of the purity.

Hmm, I think they are pure in different ways:
set theory is well established and based on two-valued logic,
but at the same time SQL's three-valued logic is also well established.

>> I think we have an opportunity here to innovate and potentially influence a
>> future set concept in the SQL standard.
>
> I doubt this is going to influence what the SQL standard says, especially
> because it already defined the behavior for MULTISETS (of which the sets
> are a special case, pretty much). So this has 0% chance of success.

OK. 0% is 1% too low for me to work with. :)

>> However, I see how one could argue against this reasoning, on the basis that
>> PostgreSQL users might be more familiar with and expect NULLs can exist
>> everywhere in all data structures.
>
> Right, it's what we already do for similar cases, and if you have NULLS
> in the data, you better be aware of the behavior. Granted, some people
> are surprised by three-valued logic, but using a different behavior for
> some new features would just increase the confusion.

Good point.

>> I've been trying hard, but I can't find compelling use-cases where a NULL 
>> element
>> in a set would offer a more natural SQL query than handling NULLs within SQL 
>> and
>> keeping the set NULL-free.
>
> IMO if you have NULL values in the data, you better be aware of it and
> handle the case accordingly (e.g. by filtering them out when building
> the set). If you don't have NULLs in the data, there's no issue.

As long as the data model and queries ensure there can never be
any NULLs, fine, then there is no issue.

> And in the graph case, I don't see why you'd have any NULLs, considering
> we're dealing with adjacent nodes, and if there's adjacent node, it's ID
> is not NULL.

Me neither, can't see the need for any NULLs there.

>> Does anyone else have a strong realistic example where including NULLs in the
>> set would simplify the SQL query?
>
> I'm sure there are cases where you have NULLs in the data and need to
> filter them out, but that's just a natural consequence of having NULLs. If
> you have them you better know what NULLs do ...

What I tried to find was an example of when you wouldn't want to
filter out the NULLs, i.e. when you would want to include the NULL
in the set.

If we could just find one such realistic use-case, that would be very
helpful, since it would then completely kill my argument that we couldn't
do without storing a NULL in the set.

> It's too early to make any strong statements, but it's going to be hard
> to convince me we should handle NULLs differently from what we already
> do elsewhere.

I think it's a trade-off, and I don't have any strong preference for the
simplicity of a classical two-valued set-theoretic system vs a three-valued
multiset-based one. I was 51/49 but given your feedback I'm now 49/51.

I think the next step is to think about how the hashset type should work
with three-valued logic, and then implement it to get a feeling for it.

For instance, how should hashset_count() work?

Given the query,

SELECT hashset_count('{1,2,3,null}'::int4hashset);

Should we,

a) treat NULL as a distinct value and return 4?

b) ignore NULL and return 3?

c) return NULL? (since the presence of NULL can be thought to render the
entire count indeterminate)

I think my personal preference is (b) since it is then consistent with how
COUNT() works.
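
For reference, a quick illustration of the COUNT() behavior that (b) would
mirror (the VALUES list is just an ad-hoc example):

```sql
-- count(x) skips NULLs, count(*) does not
SELECT count(x) AS count_x, count(*) AS count_all
FROM (VALUES (1), (2), (3), (NULL::int)) AS t(x);
-- => count_x = 3, count_all = 4
```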

/Joel




RE: Partial aggregates pushdown

2023-06-21 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr. Momjian, Mr. Pyhalov, hackers.

> From: Bruce Momjian 
> Sent: Thursday, June 22, 2023 12:44 AM
> On Tue, Jun 20, 2023 at 09:59:11AM +0300, Alexander Pyhalov wrote:
> > > Therefore, it seems like it would be near-zero cost to just call
> > > conn =
> > > GetConnection() and then PQserverVersion(conn), and ReleaseConnection().
> > > You can then use the return value of PQserverVersion() to determine
> > > if you can push down partial aggregates.
> >
> > Hi.
> > Currently we don't get remote connection while planning if
> > use_remote_estimate is not set.
> > Such change would require to get remote connection in planner, not in
> > executor.
> > This can lead to change of behavior (like errors in explain when user
> > mapping is wrong - e.g. bad password is specified).
> > Also this potentially can lead to establishing connections even when
> > plan node is not actually used (like extreme example - select
> > sum(score) from t limit 0).
> > I'm not saying we shouldn't do it - just hint at possible consequences.
> 
> Agreed.  I noticed it was doing FDW connections during optimization, but 
> didn't see the postgres_fdw option that would
> turn it off.
> Interestingly, it is disabled by default.
> 
> After considering the options, I think we should have a postgres_fdw option 
> called "planner_version_check", and default
> that false.  When false, a remote server version check will not be performed 
> during planning and partial aggregates will be
> always be considered.  When true, a version check will be performed during 
> planning and partial aggregate pushdown
> disabled for pre-PG 17 foreign servers during the query.
> 
> If we want to be more specific, we can call it 
> "check_partial_aggregate_support".
Thank you both for your advice.
We will address the compatibility issue as follows.

Approach 1-3:
I will add a postgres_fdw option "check_partial_aggregate_support".
This option is false by default.
Only if this option is true does postgres_fdw connect to the remote server
and get its version. If the version of the remote server is less than PG17,
partial aggregate pushdown to the remote server is disabled.
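
As a rough illustration, the proposed option would presumably be attached to a
foreign server like any other postgres_fdw option (the server name below is
just a placeholder):

```sql
ALTER SERVER remote_pg16 OPTIONS (ADD check_partial_aggregate_support 'true');
```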

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center, Mitsubishi Electric Corporation





Re: vac_truncate_clog()'s bogus check leads to bogusness

2023-06-21 Thread Noah Misch
On Wed, Jun 21, 2023 at 03:12:08PM -0700, Andres Freund wrote:
> When vac_truncate_clog() returns early
...
> we haven't released the lwlock that we acquired earlier

> Until there's some cause for the session to call LWLockReleaseAll(), the lock
> is held. Until then neither the process holding the lock, nor any other
> process, can finish vacuuming.  We don't even have an assert against a
> self-deadlock with an already held lock, oddly enough.

I agree with this finding.  Would you like to add the lwlock releases, or
would you like me to?

The bug has been in all released versions for 2.5 years, yet it escaped
notice.  That tells us something.  Bogus values have gotten rare?  The
affected session tends to get lucky and call LWLockReleaseAll() soon?

On Wed, Jun 21, 2023 at 05:46:37PM -0700, Andres Freund wrote:
> On 2023-06-21 15:12:08 -0700, Andres Freund wrote:
> > Separately, I think it's quite bad that we *silently* return from
> > vac_truncate_clog() when finding a bogus xid. That's a quite severe 
> > condition,
> > we should at least tell the user about it.
> 
> A related issue is that as far as I can tell the determination of what is
> bogus is bogus.
> 
> The relevant cutoffs are determined in vac_update_datfrozenxid() using:
> 
>   /*
>* Identify the latest relfrozenxid and relminmxid values that we could
>* validly see during the scan.  These are conservative values, but it's
>* not really worth trying to be more exact.
>*/
>   lastSaneFrozenXid = ReadNextTransactionId();
>   lastSaneMinMulti = ReadNextMultiXactId();
> 
> but doing checks based on those is bogus, because:
> 
> a) a concurrent create table / truncate / vacuum can update
>pg_class.relfrozenxid of some relation in the current database to a newer
>value, after lastSaneFrozenXid already has been determined. If that
>happens, we skip updating pg_database.datfrozenxid.
> 
> b) A concurrent vacuum in another database, ending up in vac_truncate_clog(),
>can compute a newer datfrozenxid. In that case the vac_truncate_clog() with
>the outdated lastSaneFrozenXid will not truncate the clog (and also forget
>to release WrapLimitsVacuumLock currently, as reported upthread) and not
>call SetTransactionIdLimit(). The latter is particularly bad, because that
>means we might not come out of "database is not accepting commands" land.

> I guess we could just recompute the boundaries before actually believing the
> catalog values are bogus?

That's how I'd do it.




Re: Making empty Bitmapsets always be NULL

2023-06-21 Thread David Rowley
On Thu, 22 Jun 2023 at 00:16, Ranier Vilela  wrote:
> 2. Only compute BITNUM when necessary.

I doubt this will help.  The % 64 done by BITNUM will be transformed
into an AND operation by the compiler, which is likely to have
single-instruction latency on most CPUs, probably amounting to it being
"free".  There's maybe a bit of reading for you in [1] and [2] if
you're wondering how any operation could be free.

(The compiler is able to transform the % into what is effectively &
because 64 is a power of 2.  uintvar % 64 is the same as uintvar & 63.
Play around with [3] to see what I mean)
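
A standalone illustration of that equivalence, outside of the Bitmapset code
(both functions compile to the same single AND instruction for unsigned
operands):

```c
#include <stdint.h>

/* x % 64 and x & 63 are interchangeable for unsigned x because 64 is a
 * power of two; the compiler emits the same AND instruction for both. */
uint64_t bitnum_mod(uint64_t bitno) { return bitno % 64; }
uint64_t bitnum_and(uint64_t bitno) { return bitno & 63; }
```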

> 3. Avoid enlargement when nwords is equal wordnum.
> Can save cycles when in corner cases?

No, you're just introducing a bug here.  Arrays in C are zero-based,
so "wordnum >=  a->nwords" is exactly the correct way to check if
wordnum falls outside the bounds of the existing allocated memory. By
changing that to "wordnum > a->nwords" we'll fail to enlarge the words
array when it needs to be enlarged by 1 element.

It looks like you've introduced a bunch of random white space and
changed around a load of other random things in the patch too. I'm not
sure why you think that's a good idea.

FWIW, we normally only write "if (somevar)" as a shortcut when somevar
is boolean and we want to know that it's true.   The word value is not
a boolean type, so although "if (someint)" and "if (someint != 0)"
will compile to the same machine code, we don't normally write our C
code that way in PostgreSQL.  We also tend to write "if (someptr !=
NULL)" rather than "if (someptr)". The compiler will produce the same
code for each, but we write the former to assist people reading the
code so they know we're checking for NULL rather than checking if some
boolean variable is true.

Overall, I'm not really interested in sneaking in any additional changes
that are unrelated to adjusting Bitmapsets so that they don't carry
trailing zero words. If you have other optimisations you think are
worthwhile, please include them in another thread along with
benchmarks to show the performance increase.  For learning, I'd
encourage you to do some micro benchmarks outside of PostgreSQL and
mock up some Bitmapset code in a single .c file and try it out with and
without your changes, calling the function in a tight loop to see
if you can measure any performance gains. Just remember you'll never
see any gains in performance when your change compiles into the exact
same code as without your change.  Between [1] and [2], you still
might not see performance changes even when the compiled code is
changed (I'm thinking of your #2 change here).

David

[1] https://en.wikipedia.org/wiki/Speculative_execution
[2] https://en.wikipedia.org/wiki/Out-of-order_execution
[3] https://godbolt.org/z/9vbbnMKEE




Re: Assert while autovacuum was executing

2023-06-21 Thread Amit Kapila
On Wed, Jun 21, 2023 at 11:53 AM Peter Geoghegan  wrote:
>
> On Tue, Jun 20, 2023 at 10:27 PM Andres Freund  wrote:
> > As far as I can tell 72e78d831a as-is is just bogus. Unfortunately that 
> > likely
> > also means 3ba59ccc89 is not right.
>
> Quite possibly. But I maintain that ginInsertCleanup() is probably
> also bogus in a way that's directly relevant.
>
> Did you know that ginInsertCleanup() is the only code that uses
> heavyweight page locks these days? Though only on the index metapage!
>
> Isn't this the kind of thing that VACUUM's relation level lock is
> supposed to take care of?
>

Yeah, I also can't see why that shouldn't be sufficient for VACUUM.
Assuming your observation is correct, what do you suggest doing in
this regard?

-- 
With Regards,
Amit Kapila.




Re: Support logical replication of DDLs

2023-06-21 Thread shveta malik
On Mon, Jun 12, 2023 at 7:17 AM Wei Wang (Fujitsu)
 wrote:
>
> On Thur, Jun 8, 2023 20:02 PM shveta malik  wrote:
> > Thank You Vignesh for handling (a), Ajin for handling (b), Shi-san and
> > Hou-san for contributing in (c).
> >
> > The new changes are in patch 0001, 0002, 0005 and 0008.
>
> Thanks for updating the patch set.
>
> Here are some comments:
> ===
> For 0002 patch.
> 1. The typo atop the function EventTriggerTableInitWrite.
> ```
> +/*
> + * Fire table_init_rewrite triggers.
> + */
> +void
> +EventTriggerTableInitWrite(Node *real_create, ObjectAddress address)
> ```
> s/table_init_rewrite/table_init_write
>
> ~~~
>
> 2. The new process for "SCT_CreateTableAs" in the function 
> pg_event_trigger_ddl_commands.
> With the event trigger created in
> test_ddl_deparse_regress/sql/test_ddl_deparse.sql, when creating the table 
> that
> already exists with `CreateTableAs` command, an error is raised like below:
> ```
> postgres=# CREATE TABLE IF NOT EXISTS as_select1 AS SELECT * FROM pg_class 
> WHERE relkind = 'r';
> postgres=# CREATE TABLE IF NOT EXISTS as_select1 AS SELECT * FROM pg_class 
> WHERE relkind = 'r';
> NOTICE:  relation "as_select1" already exists, skipping
> ERROR:  unrecognized object class: 0
> CONTEXT:  PL/pgSQL function test_ddl_deparse() line 6 at FOR over SELECT rows
> ```
> It seems that we could check cmd->d.ctas.real_create in the function
> pg_event_trigger_ddl_commands and return NULL in this case.
>
> ===
> For 0004 patch.
> 3. The command tags that are not supported for deparsing in the tests.
> ```
> FOR r IN SELECT * FROM pg_event_trigger_ddl_commands()
> -- Some TABLE commands generate sequence-related commands, 
> also deparse them.
> WHERE command_tag in ('ALTER FOREIGN TABLE', 'ALTER TABLE',
>   'CREATE FOREIGN 
> TABLE', 'CREATE TABLE',
>   'CREATE TABLE AS', 
> 'DROP FOREIGN TABLE',
>   'DROP TABLE', 
> 'ALTER SEQUENCE',
>   'CREATE SEQUENCE', 
> 'DROP SEQUENCE')
> ```
> Since foreign table is not supported yet in the current patch set, it seems 
> that
> we need to remove "FOREIGN TABLE" related command tag. If so, I think the
> following three files need to be modified:
> - test_ddl_deparse_regress/sql/test_ddl_deparse.sql
> - test_ddl_deparse_regress/t/001_compare_dumped_results.pl
> - test_ddl_deparse_regress/t/002_regress_tests.pl
>
> ~~~
>
> 4. The different test items between meson and Makefile.
> It seems that we should keep the same SQL files and the same order of SQL 
> files
> in test_ddl_deparse_regress/meson.build and test_ddl_deparse_regress/Makefile.
>
> ===
> For 0004 && 0008 patches.
> 5. The test cases in the test file 
> test_ddl_deparse_regress/t/001_compare_dumped_results.pl.
> ```
> # load test cases from the regression tests
> -my @regress_tests = split /\s+/, $ENV{REGRESS};
> +#my @regress_tests = split /\s+/, $ENV{REGRESS};
> +my @regress_tests = ("create_type", "create_schema", "create_rule", 
> "create_index");
> ```
> I think @regress_tests should include all SQL files, instead of just four. 
> BTW,
> the old way (using `split /\s+/, $ENV{REGRESS}`) doesn't work in meson.
>

Wang-san, thank you for your feedback. In the latest version, we have
pulled out CTAS and the test_ddl_deparse_regress module. I will
revisit your comments once we plan to put these modules back.

thanks
Shveta




Re: Preventing non-superusers from altering session authorization

2023-06-21 Thread Nathan Bossart
On Wed, Jun 21, 2023 at 04:28:43PM -0400, Joseph Koshakow wrote:
> + roleTup = SearchSysCache1(AUTHOID, 
> ObjectIdGetDatum(AuthenticatedUserId));
> + if (!HeapTupleIsValid(roleTup))
> + ereport(FATAL,
> + 
> (errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
> + errmsg("role with OID %u does 
> not exist", AuthenticatedUserId)));
> + rform = (Form_pg_authid) GETSTRUCT(roleTup);

I think "superuser_arg(AuthenticatedUserId)" would work here.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Assert while autovacuum was executing

2023-06-21 Thread Amit Kapila
On Wed, Jun 21, 2023 at 10:57 AM Andres Freund  wrote:
>
> As far as I can tell 72e78d831a as-is is just bogus. Unfortunately that likely
> also means 3ba59ccc89 is not right.
>

Indeed. I was thinking of a fix but couldn't find one yet. One idea I
am considering is to allow catalog table locks after page lock, but I
think apart from being hacky that also won't work, because we still need
to remove the check added for page locks in the deadlock code path in
commit 3ba59ccc89 and may need to do something for group locking. Feel
free to share any ideas if you have them, I can try to evaluate those in
detail. I think in the worst case we need to remove the changes added
by 72e78d831a and 3ba59ccc89, which won't impact any existing feature
but will add a hurdle to parallelizing other write operations or even
improving the parallelism in vacuum (like allowing multiple workers
for an index).

-- 
With Regards,
Amit Kapila.




Re: Support to define custom wait events for extensions

2023-06-21 Thread Masahiro Ikeda

On 2023-06-20 18:26, Masahiro Ikeda wrote:

The following are TODO items.
* to check that meson.build works, since I had tested only with the old
  `make` command so far


I tested with meson and updated the patches to work with it.
My test procedure is the following.

```
export builddir=/mnt/tmp/build
export prefix=/mnt/tmp/master

# setup
meson setup $builddir --prefix=$prefix -Ddebug=true -Dcassert=true 
-Dtap_tests=enabled


# build and install with src/test/modules
ninja -C $builddir install install-test-files

# test
meson test -v -C $builddir
meson test -v -C $builddir --suite test_custom_wait_events  # run the 
test only

```

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From 901b75d31070ef0029557db6981c98e06f5c16c3 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda 
Date: Tue, 20 Jun 2023 17:59:34 +0900
Subject: [PATCH 2/2] Add test codes for custom wait events

---
 src/test/modules/Makefile |   1 +
 src/test/modules/meson.build  |   1 +
 .../test_custom_wait_events/.gitignore|   2 +
 .../modules/test_custom_wait_events/Makefile  |  23 +++
 .../test_custom_wait_events/meson.build   |  33 
 .../test_custom_wait_events/t/001_basic.pl|  34 
 .../test_custom_wait_events--1.0.sql  |  14 ++
 .../test_custom_wait_events.c | 182 ++
 .../test_custom_wait_events.control   |   3 +
 9 files changed, 293 insertions(+)
 create mode 100644 src/test/modules/test_custom_wait_events/.gitignore
 create mode 100644 src/test/modules/test_custom_wait_events/Makefile
 create mode 100644 src/test/modules/test_custom_wait_events/meson.build
 create mode 100644 src/test/modules/test_custom_wait_events/t/001_basic.pl
 create mode 100644 src/test/modules/test_custom_wait_events/test_custom_wait_events--1.0.sql
 create mode 100644 src/test/modules/test_custom_wait_events/test_custom_wait_events.c
 create mode 100644 src/test/modules/test_custom_wait_events/test_custom_wait_events.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 6331c976dc..84312b7e57 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -17,6 +17,7 @@ SUBDIRS = \
 		  test_bloomfilter \
 		  test_copy_callbacks \
 		  test_custom_rmgrs \
+		  test_custom_wait_events \
 		  test_ddl_deparse \
 		  test_extensions \
 		  test_ginpostinglist \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 17d369e378..83a1d8fd19 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -14,6 +14,7 @@ subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
+subdir('test_custom_wait_events')
 subdir('test_ddl_deparse')
 subdir('test_extensions')
 subdir('test_ginpostinglist')
diff --git a/src/test/modules/test_custom_wait_events/.gitignore b/src/test/modules/test_custom_wait_events/.gitignore
new file mode 100644
index 00..716e17f5a2
--- /dev/null
+++ b/src/test/modules/test_custom_wait_events/.gitignore
@@ -0,0 +1,2 @@
+# Generated subdirectories
+/tmp_check/
diff --git a/src/test/modules/test_custom_wait_events/Makefile b/src/test/modules/test_custom_wait_events/Makefile
new file mode 100644
index 00..dda365d523
--- /dev/null
+++ b/src/test/modules/test_custom_wait_events/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_custom_wait_events/Makefile
+
+MODULE_big = test_custom_wait_events
+OBJS = \
+	$(WIN32RES) \
+	test_custom_wait_events.o
+PGFILEDESC = "test_custom_wait_events - test custom wait events"
+
+EXTENSION = test_custom_wait_events
+DATA = test_custom_wait_events--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_custom_wait_events
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_custom_wait_events/meson.build b/src/test/modules/test_custom_wait_events/meson.build
new file mode 100644
index 00..8ad073e577
--- /dev/null
+++ b/src/test/modules/test_custom_wait_events/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) PostgreSQL Global Development Group
+
+test_custom_wait_events = files(
+  'test_custom_wait_events.c',
+)
+
+if host_system == 'windows'
+  test_custom_wait_events += rc_lib_gen.process(win32ver_rc, extra_args: [
+'--NAME', 'test_custom_wait_events',
+'--FILEDESC', 'test_custom_wait_events - test custom wait events',])
+endif
+
+test_custom_wait_events = shared_module('test_custom_wait_events',
+  test_custom_wait_events,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_custom_wait_events
+
+test_install_data += files(
+  'test_custom_wait_events.control',
+  'test_custom_wait_events--1.0.sql',
+)
+
+tests += {
+  'name': 'test_custom_wait_events',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+

Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2023-06-21 Thread Nathan Bossart
On Thu, Jun 22, 2023 at 10:46:41AM +0900, Michael Paquier wrote:
> On Wed, Jun 21, 2023 at 10:16:24AM -0700, Nathan Bossart wrote:
>>> I think that there is a testing gap with the coverage of CLUSTER.
>>> "Ownership of partitions is checked" is a test that looks for the case
>>> where regress_ptnowner owns the partitioned table and one of its
>>> partitions, checking that the leaf not owned is skipped, but we don't
>>> have a test where we attempt a CLUSTER on the partitioned table with
>>> regress_ptnowner *not* owning the partitioned table, only one or more
>>> of its partitions owned by regress_ptnowner.  In this case, the
>>> command would fail.
>>  
>> We could add something for this, but it'd really just exercise the checks
>> in RangeVarCallbackMaintainsTable(), which already has a decent amount of
>> coverage.
> 
> It seems to me that this has some value for the CLUSTER path, so I
> would add a small thing for it.

Done.

> -   /*
> -* We already checked that the user has privileges to CLUSTER the
> -* partitioned table when we locked it earlier, so there's no need to
> -* check the privileges again here.
> -*/
> +   if (!cluster_is_permitted_for_relation(relid, GetUserId()))
> +   continue;
> I would add a comment here that this ACL recheck for the leaves is an
> important thing to keep around as it impacts the case where the leaves
> have a different owner than the parent, and the owner of the parent
> clusters it.  The only place in the tests where this has an influence
> is the isolation test cluster-conflict-partition.

Done.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 941371500f66690bbf200dfb555cc17d947cf0d1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 19 Jun 2023 13:57:57 -0700
Subject: [PATCH v9 1/2] partial revert of ff9618e82a

---
 doc/src/sgml/ref/analyze.sgml |  5 +--
 doc/src/sgml/ref/cluster.sgml |  5 +--
 doc/src/sgml/ref/lock.sgml|  5 +--
 doc/src/sgml/ref/reindex.sgml |  6 +---
 doc/src/sgml/ref/vacuum.sgml  |  5 +--
 src/backend/commands/cluster.c| 12 ---
 src/backend/commands/indexcmds.c  | 27 +++
 src/backend/commands/lockcmds.c   |  8 -
 src/backend/commands/tablecmds.c  | 34 +++
 src/backend/commands/vacuum.c | 10 ++
 src/include/commands/tablecmds.h  |  1 -
 .../expected/cluster-conflict-partition.out   | 14 
 .../specs/cluster-conflict-partition.spec |  7 ++--
 src/test/regress/expected/cluster.out |  7 +++-
 src/test/regress/expected/create_index.out|  4 +--
 src/test/regress/expected/privileges.out  |  8 ++---
 src/test/regress/expected/vacuum.out  | 18 ++
 src/test/regress/sql/cluster.sql  |  3 ++
 18 files changed, 74 insertions(+), 105 deletions(-)

diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml
index 20c6f9939f..30a893230e 100644
--- a/doc/src/sgml/ref/analyze.sgml
+++ b/doc/src/sgml/ref/analyze.sgml
@@ -190,10 +190,7 @@ ANALYZE [ VERBOSE ] [ table_and_columnsANALYZE can only be performed by superusers and roles
-   with privileges of pg_maintain.)  If a role has
-   permission to ANALYZE a partitioned table, it is also
-   permitted to ANALYZE each of its partitions, regardless
-   of whether the role has the aforementioned privileges on the partition.
+   with privileges of pg_maintain.)
ANALYZE will skip over any tables that the calling user
does not have permission to analyze.
   
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 29f0f1fd90..f0dd7faed5 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -137,10 +137,7 @@ CLUSTER [VERBOSE]
 on the table or be the table's owner, a superuser, or a role with
 privileges of the
 pg_maintain
-role.  If a role has permission to CLUSTER a partitioned
-table, it is also permitted to CLUSTER each of its
-partitions, regardless of whether the role has the aforementioned
-privileges on the partition.  CLUSTER will skip over any
+role.  CLUSTER will skip over any
 tables that the calling user does not have permission to cluster.

 
diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml
index 5b3b2b793a..8524182211 100644
--- a/doc/src/sgml/ref/lock.sgml
+++ b/doc/src/sgml/ref/lock.sgml
@@ -177,10 +177,7 @@ LOCK [ TABLE ] [ ONLY ] name [ * ]
 MODE (or a less-conflicting mode as described in ) is permitted. If a user has
 SELECT privileges on the table, ACCESS SHARE
-MODE is permitted.  If a role has permission to lock a
-partitioned table, it is also permitted to lock each of its partitions,
-regardless of whether the role has the aforementioned privileges on the
-partition.
+MODE is permitted.

 

diff 

Re: DROP DATABASE is interruptible

2023-06-21 Thread Andres Freund
Hi,

On 2023-05-09 15:41:36 +1200, Thomas Munro wrote:
> +# FIXME: It'd be good to test the actual interruption path. But it's not
> +# immediately obvious how.
> 
> I wonder if there is some way to incorporate something based on
> SIGSTOP signals into the test, but I don't know how to do it on
> Windows and maybe that's a bit weird anyway.  For a non-OS-specific
> way to do it, I was wondering about having a test module function that
> has a wait loop that accepts ^C but deliberately ignores
> ProcSignalBarrier, and leaving that running in a background psql for a
> similar effect?

I found a way to test it reliably, albeit partially. However, I'm not sure
where to do so / if it's worth doing so.

The problem occurs once remove_dbtablespaces() starts working. The fix does a
heap_inplace_update() before that. So to reproduce the problem one session can
lock pg_tablespace, another can drop a database. Then the second session can
be cancelled by the first.
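
Roughly, the manual reproduction looks like this (the database name is just a
placeholder):

```sql
-- Session 1: hold a lock that remove_dbtablespaces() will conflict with
BEGIN;
LOCK TABLE pg_tablespace IN ACCESS EXCLUSIVE MODE;

-- Session 2: gets past the heap_inplace_update(), then blocks
DROP DATABASE somedb;

-- Session 1: cancel the blocked DROP DATABASE
SELECT pg_cancel_backend(pid) FROM pg_stat_activity
WHERE query LIKE 'DROP DATABASE%';
```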

Waiting for locks to be acquired etc is somewhat cumbersome in a tap
test. It'd be easier in an isolation test. But I don't think we want to do
this as part of the normal isolation schedule?

So just open coding it in a tap test seems to be the best way?

Is it worth doing?

Greetings,

Andres Freund




Re: bgwriter doesn't flush WAL stats

2023-06-21 Thread Kyotaro Horiguchi
At Wed, 21 Jun 2023 18:52:26 +0300, Nazir Bilal Yavuz  
wrote in 
> I attached a WIP patch for showing WAL stats in pg_stat_io.

Yeah, your diagnosis appears accurate. I managed to trigger an
assertion failure quite easily when I added
"Assert(!pgstat_have_pending_wal())" just after the call to
pgstat_report_bgwriter(). Good find!

I'm slightly inclined to place the added call after smgrcloseall(), but
that doesn't seem to cause any io-stats updates, so the proposed first
patch as-is looks good to me.


Regarding the second patch, it introduces WAL IO time as an
IOCONTEXT_NORMAL/IOOBJECT_WAL combination, but it doesn't seem to follow the
convention or design of the pgstat_io component, which primarily
focuses on shared buffer IOs.

There was a brief mention about WAL IO during the development of
pgstat_io [1].

>> It'd be different if we tracked WAL fsyncs more granularly - which would be
>> quite interesting - but that's something for another day^Wpatch.
>>
>>
> I do have a question about this.
> So, if we were to start tracking WAL IO would it fit within this
> paradigm to have a new IOPATH_WAL for WAL or would it add a separate
> dimension?


[1] 
https://www.postgresql.org/message-id/CAAKRu_bM55pj3pPRW0nd_-paWHLRkOU69r816AeztBBa-N1HLA%40mail.gmail.com


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2023-06-21 Thread Michael Paquier

On Wed, Jun 21, 2023 at 10:16:24AM -0700, Nathan Bossart wrote:
>> I think that there is a testing gap with the coverage of CLUSTER.
>> "Ownership of partitions is checked" is a test that looks for the case
>> where regress_ptnowner owns the partitioned table and one of its
>> partitions, checking that the leaf not owned is skipped, but we don't
>> have a test where we attempt a CLUSTER on the partitioned table with
>> regress_ptnowner *not* owning the partitioned table, only one or more
>> of its partitions owned by regress_ptnowner.  In this case, the
>> command would fail.
>  
> We could add something for this, but it'd really just exercise the checks
> in RangeVarCallbackMaintainsTable(), which already has a decent amount of
> coverage.

It seems to me that this has some value for the CLUSTER path, so I
would add a small thing for it.

> On Tue, Jun 20, 2023 at 09:15:18PM -0700, Nathan Bossart wrote:
>> Perhaps we should add something like
>> 
>>  Note that while REINDEX on a partitioned index or table requires
>>  MAINTAIN on the partitioned table, such commands skip the privilege
>>  checks when processing the individual partitions.
>> 
>> Thoughts?  I'm trying to keep the privilege documentation for maintenance
>> commands as simple as possible, so I'm hoping to avoid adding too much text
>> dedicated to these special cases.
> 
> Here is a new patch set that includes this new sentence.

-   aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_INDEX,
-  relation->relname);
Interesting that the previous code assumed ACLCHECK_NOT_OWNER all the
time in the reindex RangeVar callback.

-   /*
-* We already checked that the user has privileges to CLUSTER the
-* partitioned table when we locked it earlier, so there's no need to
-* check the privileges again here.
-*/
+   if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+   continue;
I would add a comment here that this ACL recheck for the leaves is an
important thing to keep around as it impacts the case where the leaves
have a different owner than the parent, and the owner of the parent
clusters it.  The only place in the tests where this has an influence
is the isolation test cluster-conflict-partition.

The documentation changes seem in line with the code changes,
particularly for VACUUM and REINDEX where we have some special
handling for shared catalogs with ownership.
--
Michael




[Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-06-21 Thread Hayato Kuroda (Fujitsu)
Dear hackers,
(CC: Önder because he owned the related thread)

This is a follow-up thread of [1]. That commit allowed subscribers to use
indexes other than the PK and REPLICA IDENTITY index when REPLICA IDENTITY is
FULL on the publisher, but the index must be a B-tree. In this proposal, I aim
to extend this functionality to allow hash indexes and potentially other types.
I would like to share an initial patch to start the discussion.

# Current problem

The current limitation comes from the function build_replindex_scan_key(), 
specifically these lines:


```
/*
 * Load the operator info.  We need this to get the equality 
operator
 * function for the scan key.
 */
optype = get_opclass_input_type(opclass->values[index_attoff]);
opfamily = get_opclass_family(opclass->values[index_attoff]);
operator = get_opfamily_member(opfamily, optype,
   
optype,
   
BTEqualStrategyNumber);
```

These lines retrieve an operator for equality comparisons. The "strategy number"
[2] identifies this operator. For B-tree indexes, an equal-comparison operator
is always associated with a fixed number (BTEqualStrategyNumber, 3). However,
this approach fails for other types of indexes because the strategy number is
determined by the operator class, not fixed.

# Proposed solution

I propose a patch that chooses the correct strategy number based on the index
access method. We would extract the access method from the pg_opclass system
catalog, similar to the approach used for data types and operator families.
Also, this patch changes the condition for using the index on the subscriber
(see IsIndexUsableForReplicaIdentityFull()).
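
To make that concrete, here is a rough sketch of the intended lookup. The
get_opclass_method() helper is hypothetical and just stands for "fetch the
opclass's access method OID from pg_opclass"; the actual patch may do this
differently:

```c
	/* Sketch only: pick the equality strategy for the index's access method
	 * instead of hard-coding BTEqualStrategyNumber. */
	Oid			opclassoid = opclass->values[index_attoff];
	Oid			amoid = get_opclass_method(opclassoid);	/* hypothetical helper */
	StrategyNumber eq_strategy;

	if (amoid == BTREE_AM_OID)
		eq_strategy = BTEqualStrategyNumber;	/* always 3 for btree */
	else if (amoid == HASH_AM_OID)
		eq_strategy = HTEqualStrategyNumber;	/* always 1 for hash */
	else
		elog(ERROR, "unsupported index access method");

	operator = get_opfamily_member(opfamily, optype, optype, eq_strategy);
```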

However, this solution does not yet extend to GiST, SP-GiST, GIN, BRIN due to
implementation constraints.

## Current difficulties

The challenge with supporting other index types is that they lack a fixed set
of strategies, making it difficult to choose the correct strategy number based
on the index access method. Even within the same index type, different operator
classes can use different strategy numbers for the same operation.
E.g. [2] shows that number 6 can be used for the purpose, but other operator
classes added by btree_gist [3] seem to use number 3 for the equality
comparison.


I've looked into using ExecIndexBuildScanKeys(), which is used for normal index 
scans.
However, in this case, the operator and its family seem to be determined by the 
planner.
Based on that, the associated strategy number is extracted. This is the opposite
of what I am trying to achieve, so it doesn't seem helpful in this context.



The current patch only includes tests for hash indexes. These are separated into
the file 032_subscribe_use_index.pl for convenience, but will be integrated in a
later version.


What do you think? I want to know your opinion.

[1]: 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=89e46da5e511a6970e26a020f265c9fb4b72b1d2
[2]: https://www.postgresql.org/docs/devel/xindex.html#XINDEX-STRATEGIES
[3]: https://www.postgresql.org/docs/devel/btree-gist.html


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



0001-Allow-to-use-Hash-index-on-subscriber.patch
Description: 0001-Allow-to-use-Hash-index-on-subscriber.patch


Re: extended statistics n-distinct on multiple columns not used when join two tables

2023-06-21 Thread David Rowley
On Tue, 13 Jun 2023 at 23:29, Pavel Stehule  wrote:
>> I think it's probably worth adjusting the docs to mention this. It
>> seems like it might be something that could surprise someone.
>>
>> Something like the attached, maybe?
>
> +1

Ok, I pushed that patch.  Thanks.

David




Re: Adding SHOW CREATE TABLE

2023-06-21 Thread Kirk Wolak
On Mon, Jun 5, 2023 at 7:43 AM Jelte Fennema  wrote:

> On Thu, 1 Jun 2023 at 18:57, Kirk Wolak  wrote:
> > Can this get turned into a Patch?  Were you offering this code up for
> others (me?) to pull, and work into a patch?
> > [If I do the patch, I am not sure it gives you the value of reducing
> what CITUS has to maintain.  But it dawns on
> > me that you might be pushing a much bigger patch...  But I would take
> that, as I think there is other value in there]
>
> Attached is a patch which adds the relevant functions from Citus to
> Postgres (it compiles without warnings on my machine). I checked with
> a few others on the Citus team at Microsoft and everyone thought that
> upstreaming this was a good idea, because it's quite a bit of work to
> update with every postgres release.
>
> To set expectations though, I don't really have time to work on this
> patch. So if you can take it over from here that would be great.
>
> The patch only contains the C functions which generate SQL based on
> some oids. The wrappers such as the master_get_table_ddl_events
> function were too hard for me to pull out of Citus code, because they
> integrated a lot with other pieces. But the bulk of the logic is in
> the functions in this patch. Essentially all that
> master_get_table_ddl_events does is call the functions in this patch
> in the right order.
>
> > The ONLY thing I did not see was "CREATE TEMPORARY " syntax?  If you did
> this on a  TEMP table,
> > does it generate normal table syntax or TEMPORARY TABLE syntax???
>
> Yeah, the Citus code only handles things that Citus supports in
> distributed tables. Which is quite a lot, but indeed not everything
> yet. Temporary and inherited tables are not supported in this code
> afaik. Possibly more. See the commented out
> EnsureRelationKindSupported for what should be supported (normal
> tables and partitioned tables afaik).
>

Jelte,
  Thank you for this.
  Let me see what I can do with this, it seems like a solid starting point.
  At this point, based on previous feedback, finding a way to make
get_tabledef() etc. work as functions is my goal.
I will see how inherited tables and temporary tables will be dealt with.

  Hopefully, this transfer works to please anyone concerned with
integrating this code into our project from the Citus code.

Kirk...


Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2023-06-21 Thread Michael Paquier
On Wed, Jun 21, 2023 at 09:26:09AM -0700, Jeff Davis wrote:
> What I meant is that if you do:
> 
>   CREATE TABLE p(i INT, j INT) PARTITION BY RANGE (i);
>   CREATE TABLE p0 PARTITION OF p FOR VALUES FROM (00) TO (10);
>   CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (10) TO (20);
>   CREATE INDEX p_idx ON p (i);
>   CREATE INDEX special_idx ON p0 (j);
>   GRANT MAINTAIN ON p TO foo;
>   \c - foo
>   REINDEX TABLE p;
> 
> That would reindex p0_i_idx and p1_i_idx, but skip special_idx. That
> might be too confusing, but feels a bit more consistent permissions-
> wise.

FWIW, the current behavior of reindexing special_idx in this case feels
more natural to me, as the user requests a REINDEX at table level, not
at index level.  This would mean to me that all the indexes of all the
partitions should be rebuilt, not just the partitioned indexes that
are defined in the partitioned table requested for rebuild.
--
Michael




Re: vac_truncate_clog()'s bogus check leads to bogusness

2023-06-21 Thread Andres Freund
Hi,

On 2023-06-21 15:12:08 -0700, Andres Freund wrote:
> When vac_truncate_clog() returns early, due to one of these paths:
>
> [...]
>
> Separately, I think it's quite bad that we *silently* return from
> vac_truncate_clog() when finding a bogus xid. That's a quite severe condition,
> we should at least tell the user about it.

A related issue is that as far as I can tell the determination of what is
bogus is bogus.

The relevant cutoffs are determined in vac_update_datfrozenxid() using:

/*
 * Identify the latest relfrozenxid and relminmxid values that we could
 * validly see during the scan.  These are conservative values, but it's
 * not really worth trying to be more exact.
 */
lastSaneFrozenXid = ReadNextTransactionId();
lastSaneMinMulti = ReadNextMultiXactId();

but doing checks based on those is bogus, because:

a) a concurrent create table / truncate / vacuum can update
   pg_class.relfrozenxid of some relation in the current database to a newer
   value, after lastSaneFrozenXid already has been determined. If that
   happens, we skip updating pg_database.datfrozenxid.

b) A concurrent vacuum in another database, ending up in vac_truncate_clog(),
   can compute a newer datfrozenxid. In that case the vac_truncate_clog() with
   the outdated lastSaneFrozenXid will not truncate the clog (and also forget
   to release WrapLimitsVacuumLock currently, as reported upthread) and not
   call SetTransactionIdLimit(). The latter is particularly bad, because that
   means we might not come out of "database is not accepting commands" land.

I think in both cases a later call might fix the issue, but that could be some
way out, if autovacuum doesn't see further writes being necessary, and no
further write activity happens, because of "database is not accepting
commands".


It's not entirely obvious to me how to best fix these. For a second I thought
we just need to acquire a snapshot before determining the sane values, but
that doesn't work, since we update the relevant fields with
heap_inplace_update().

I guess we could just recompute the boundaries before actually believing the
catalog values are bogus?

I think we should also add warnings to these paths, so we actually have a
chance to find problems in the field.

Greetings,

Andres Freund




Re: Adding further hardening to nbtree page deletion

2023-06-21 Thread Peter Geoghegan
On Tue, Jun 20, 2023 at 11:13 PM Peter Geoghegan  wrote:
> FWIW, I'm almost certain that I'll completely run out of ERRORs to
> demote to LOGs before too long. In fact, this might very well be the
> last ERROR that I ever have to demote to a LOG to harden nbtree
> VACUUM.

Pushed this just now, backpatching all the way.

-- 
Peter Geoghegan




Re: Remove deprecation warnings when compiling PG ~13 with OpenSSL 3.0~

2023-06-21 Thread Michael Paquier
On Wed, Jun 21, 2023 at 10:11:33AM +0200, Peter Eisentraut wrote:
> Backpatching the OPENSSL_API_COMPAT change would set the minimum OpenSSL
> version to 1.0.1, which is newer than what was so far required in those
> branches.  That is the reason we didn't do this.

Looking at the relevant thread from 2020, this was still at the point
where we did not consider supporting 3.0 for all the stable branches
because 3.0 was in alpha:
https://www.postgresql.org/message-id/3d4afcfc-0930-1389-b9f7-59bdf11fb...@2ndquadrant.com

However, recent fixes like cab553a have made that possible, and we do
build with OpenSSL 3.0 across the whole set of stable branches.
Regarding the versions of OpenSSL supported:
- REL_13_STABLE requires 1.0.1 since 7b283d0e1.
- REL_12_STABLE and REL_11_STABLE require 0.9.8.

For 0.9.8, OPENSSL_API_COMPAT needs to be set at 0x00908000L (see
upstream's CHANGES.md).  So I don't see a reason not to do as
suggested by Andres?

I have tested the attached patches across 11~13 with various versions
of OpenSSL (OPENSSL_API_COMPAT exists since 1.1.0), and this is
working here.  Note that I don't have a MSVC environment at hand to
test this change on Windows, still `perl -cw Solution.pm` is OK with
it.

What do you think about the attached patch set (one for each branch)?
--
Michael
From 469d8c07cfcf22245b59cfe4573a70ea1720b8c1 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 19 Jul 2020 12:14:42 +0200
Subject: [PATCH v2] Define OPENSSL_API_COMPAT

This avoids deprecation warnings from newer OpenSSL versions (3.0.0 in
particular).

Discussion: https://www.postgresql.org/message-id/flat/FEF81714-D479-4512-839B-C769D2605F8A%40yesql.se
---
 src/include/pg_config.h.in|  4 
 src/include/pg_config.h.win32 |  4 
 configure |  6 +-
 configure.in  |  3 +++
 src/tools/msvc/Solution.pm| 10 +-
 5 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 912132dbc5..157b504ea6 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -778,6 +778,10 @@
 /* Define bytes to use libc memset(). */
 #undef MEMSET_LOOP_LIMIT
 
+/* Define to the OpenSSL API version in use. This avoids deprecation warnings
+   from newer OpenSSL versions. */
+#undef OPENSSL_API_COMPAT
+
 /* Define to the address where bug reports for this package should be sent. */
 #undef PACKAGE_BUGREPORT
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 9510b98216..7fa151f41b 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -628,6 +628,10 @@
 /* Define bytes to use libc memset(). */
 #define MEMSET_LOOP_LIMIT 1024
 
+/* Define to the OpenSSL API version in use. This avoids deprecation warnings
+   from newer OpenSSL versions. */
+#define OPENSSL_API_COMPAT 0x00908000L
+
 /* Define to the address where bug reports for this package should be sent. */
 #define PACKAGE_BUGREPORT "pgsql-b...@postgresql.org"
 
diff --git a/configure b/configure
index 1577cf7ad3..dae02c8687 100755
--- a/configure
+++ b/configure
@@ -12063,7 +12063,11 @@ fi
 fi
 
 if test "$with_openssl" = yes ; then
-if test "$PORTNAME" != "win32"; then
+# Minimum required OpenSSL version is 0.9.8
+
+$as_echo "#define OPENSSL_API_COMPAT 0x00908000L" >>confdefs.h
+
+  if test "$PORTNAME" != "win32"; then
  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for CRYPTO_new_ex_data in -lcrypto" >&5
 $as_echo_n "checking for CRYPTO_new_ex_data in -lcrypto... " >&6; }
 if ${ac_cv_lib_crypto_CRYPTO_new_ex_data+:} false; then :
diff --git a/configure.in b/configure.in
index 0b44e2119f..29de083fe8 100644
--- a/configure.in
+++ b/configure.in
@@ -1269,6 +1269,9 @@ fi
 
 if test "$with_openssl" = yes ; then
   dnl Order matters!
+  # Minimum required OpenSSL version is 0.9.8
+  AC_DEFINE(OPENSSL_API_COMPAT, [0x00908000L],
+[Define to the OpenSSL API version in use. This avoids deprecation warnings from newer OpenSSL versions.])
   if test "$PORTNAME" != "win32"; then
  AC_CHECK_LIB(crypto, CRYPTO_new_ex_data, [], [AC_MSG_ERROR([library 'crypto' is required for OpenSSL])])
  AC_CHECK_LIB(ssl,SSL_new, [], [AC_MSG_ERROR([library 'ssl' is required for OpenSSL])])
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 984d63f5d7..c823655ed9 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -151,6 +151,8 @@ sub GenerateFiles
 {
 	my $self = shift;
 	my $bits = $self->{platform} eq 'Win32' ? 32 : 64;
+	my $openssl_api_compat;
+	my $ac_define_openssl_api_compat_found = 0;
 
 	# Parse configure.in to get version numbers
 	open(my $c, '<', "configure.in")
@@ -167,10 +169,15 @@ sub GenerateFiles
 			$self->{numver} = sprintf("%d%04d", $1, $2 ? $2 : 0);
 			$self->{majorver} = sprintf("%d", $1);
 		}
+		elsif (/\bAC_DEFINE\(OPENSSL_API_COMPAT, \[([0-9xL]+)\]/)
+		{
+			$ac_define_openssl_api_compat_found = 1;
+			

Re: ProcessStartupPacket(): database_name and user_name truncation

2023-06-21 Thread Michael Paquier
On Wed, Jun 21, 2023 at 12:55:15PM -0700, Nathan Bossart wrote:
> LGTM.  I think this can wait for v17 since the current behavior has been
> around since 2001 and AFAIK this is the first report.  While it's arguably
> a bug fix, the patch also breaks some cases that work today.

Agreed that anything discussed on this thread does not warrant a
backpatch.
--
Michael




Re: Why does pg_bsd_indent need to be installed?

2023-06-21 Thread Bruce Momjian
On Tue, Jun 20, 2023 at 06:54:56PM +0200, Álvaro Herrera wrote:
> On 2023-May-31, Bruce Momjian wrote:
> 
> > I guess we could try looking for pg_bsd_indent-$MAJOR_VERSION first,
> > then pg_bsd_indent.
> 
> Do you mean with $MAJOR_VERSION being Postgres' version?  That means we
> need to install a new symlink every year, even when pg_bsd_indent
> doesn't change.  I'd rather have it call pg_bsd_indent-$INDENT_VERSION
> first, and pg_bsd_indent if that doesn't exist.  I already have it that
> way.

Yes, your idea makes more sense.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: [PATCH] doc: add missing mention of MERGE in MVCC

2023-06-21 Thread Will Mortensen
I saw, thanks again!

On Wed, Jun 21, 2023 at 4:08 PM Bruce Momjian  wrote:
>
> On Mon, Jun 19, 2023 at 11:32:46PM -0700, Will Mortensen wrote:
> > MERGE is now a data-modification command too.
>
> Yes, this has been applied too.
>
> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
>
>   Only you can decide what is important to you.




Re: [PATCH] doc: add missing mention of MERGE in MVCC

2023-06-21 Thread Bruce Momjian
On Mon, Jun 19, 2023 at 11:32:46PM -0700, Will Mortensen wrote:
> MERGE is now a data-modification command too.

Yes, this has been applied too.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




vac_truncate_clog()'s bogus check leads to bogusness

2023-06-21 Thread Andres Freund
Hi,

When vac_truncate_clog() returns early, due to one of these paths:

	/*
	 * Do not truncate CLOG if we seem to have suffered wraparound already;
	 * the computed minimum XID might be bogus.  This case should now be
	 * impossible due to the defenses in GetNewTransactionId, but we keep the
	 * test anyway.
	 */
	if (frozenAlreadyWrapped)
	{
		ereport(WARNING,
				(errmsg("some databases have not been vacuumed in over 2 billion transactions"),
				 errdetail("You might have already suffered transaction-wraparound data loss.")));
		return;
	}

	/* chicken out if data is bogus in any other way */
	if (bogus)
		return;

we haven't released the lwlock that we acquired earlier:

	/* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
	LWLockAcquire(WrapLimitsVacuumLock, LW_EXCLUSIVE);

as this isn't a path raising an error, the lock isn't released during abort.
Until there's some cause for the session to call LWLockReleaseAll(), the lock
is held. Until then neither the process holding the lock, nor any other
process, can finish vacuuming.  We don't even have an assert against a
self-deadlock with an already held lock, oddly enough.


This is somewhat nasty - there's no real way to get out of this without an
immediate restart, and it's hard to pinpoint the problem as well :(.


Ok, the subject line is not the most precise, but it was just too good an
opportunity.


To reproduce (only on a throwaway system please!):

CREATE DATABASE invalid;
UPDATE pg_database SET datfrozenxid = '10002' WHERE datname = 'invalid';
DROP TABLE IF EXISTS foo_tbl; CREATE TABLE foo_tbl(); DROP TABLE foo_tbl; VACUUM FREEZE;
DROP TABLE IF EXISTS foo_tbl; CREATE TABLE foo_tbl(); DROP TABLE foo_tbl; VACUUM FREEZE;



Found this while writing a test for the fix for partial dropping of
databases [1].


Separately, I think it's quite bad that we *silently* return from
vac_truncate_clog() when finding a bogus xid. That's quite a severe condition;
we should at least tell the user about it.
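
A minimal sketch of what that could look like for the second early return --
releasing the lock and warning instead of silently bailing out; the identifiers
are from the code quoted above, the message wording is made up for illustration:

	/* chicken out if data is bogus in any other way */
	if (bogus)
	{
		ereport(WARNING,
				(errmsg("bogus pg_database contents, not truncating commit log")));
		LWLockRelease(WrapLimitsVacuumLock);	/* the wraparound branch above needs the same release */
		return;
	}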


Greetings,

Andres Freund

[1] https://postgr.es/m/20230621190204.nsaelabojxppiuix%40awork3.anarazel.de




Re: Preventing non-superusers from altering session authorization

2023-06-21 Thread Nathan Bossart
On Wed, Jun 21, 2023 at 04:28:43PM -0400, Joseph Koshakow wrote:
> Currently, a user is allowed to execute SET SESSION AUTHORIZATION [1]
> if the role they connected to PostgreSQL with was a superuser at the
> time of connection. Even if the role is later altered to no longer be a
> superuser, the session can still execute SET SESSION AUTHORIZATION, as
> long as the session isn't disconnected. As a consequence, if that role
> is altered to no longer be a superuser, then the user can use SET
> SESSION AUTHORIZATION to switch to another role that is a superuser and
> regain superuser privileges. They can even re-grant themselves the
> superuser attribute.

I suspect most users aren't changing the superuser attribute on roles very
often, so it's unlikely to be a problem.  But it might still be worth
revisiting.

> It is possible that the user had already run SET SESSION AUTHORIZATION
> to set their session to a superuser before their connecting role lost
> the superuser attribute. In this case there's not much we can do.

Right.

> Also, from looking at the code and documentation, it looks like SET
> SESSION AUTHORIZATION works this way intentionally. However, I'm unable
> to figure out why we'd want it to work this way.

I found a brief mention in the archives about this implementation decision
[0], but I don't think it explains the reasoning.

> I've attached a patch that would fix this issue by checking the catalog
> to see if the connecting role is currently a superuser every time SET
> SESSION AUTHORIZATION is run. However, according to the comment I
> deleted there's something invalid about reading the catalog from that
> function, though I wasn't able to understand it fully.

This comment was added in e5d6b91.  I see that RESET SESSION AUTHORIZATION
with a concurrently dropped role will FATAL with your patch but succeed
without it, which could be part of the reason.

> One downside is that if a user switches their session authorization to
> some role, then loses the superuser attribute on their connecting role,
> they may be stuck in that role with no way to reset their session
> authorization without disconnecting and reconnecting.

It looks like SetSessionAuthorization() skips the privilege checks if the
target role is the authenticated role, so I don't think they'll get stuck.

[0] 
https://postgr.es/m/Pine.LNX.4.30.0104182119290.762-10%40peter.localdomain

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Preventing non-superusers from altering session authorization

2023-06-21 Thread Joseph Koshakow
Hi all,

I briefly mentioned this issue in another mailing thread [0].

Currently, a user is allowed to execute SET SESSION AUTHORIZATION [1]
if the role they connected to PostgreSQL with was a superuser at the
time of connection. Even if the role is later altered to no longer be a
superuser, the session can still execute SET SESSION AUTHORIZATION, as
long as the session isn't disconnected. As a consequence, if that role
is altered to no longer be a superuser, then the user can use SET
SESSION AUTHORIZATION to switch to another role that is a superuser and
regain superuser privileges. They can even re-grant themselves the
superuser attribute.
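
For illustration, the sequence described above looks like this (role names are
made up for the example):

    -- as an existing superuser:
    CREATE ROLE alice SUPERUSER LOGIN;
    -- alice connects; the session remembers "authenticated as a superuser"
    ALTER ROLE alice NOSUPERUSER;        -- superuser later revoked
    -- in alice's still-open session:
    SET SESSION AUTHORIZATION postgres;  -- still allowed: the check uses the flag captured at connect time
    ALTER ROLE alice SUPERUSER;          -- superuser regained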

It is possible that the user had already run SET SESSION AUTHORIZATION
to set their session to a superuser before their connecting role lost
the superuser attribute. In this case there's not much we can do.

Also, from looking at the code and documentation, it looks like SET
SESSION AUTHORIZATION works this way intentionally. However, I'm unable
to figure out why we'd want it to work this way.

I've attached a patch that would fix this issue by checking the catalog
to see if the connecting role is currently a superuser every time SET
SESSION AUTHORIZATION is run. However, according to the comment I
deleted there's something invalid about reading the catalog from that
function, though I wasn't able to understand it fully.

One downside is that if a user switches their session authorization to
some role, then loses the superuser attribute on their connecting role,
they may be stuck in that role with no way to reset their session
authorization without disconnecting and reconnecting.

Thanks,
Joe Koshakow

[0]
https://www.postgresql.org/message-id/CAAvxfHco7iGw4NarymhfLWN6PjzYRrbYFt2BnSFeSD5sFzqEJQ%40mail.gmail.com
[1] https://www.postgresql.org/docs/15/sql-set-session-authorization.html
From b5f7d42ea325b2be46b7c93e5404792046f1befc Mon Sep 17 00:00:00 2001
From: Joseph Koshakow 
Date: Thu, 15 Jun 2023 14:53:11 -0400
Subject: [PATCH] Prevent non-superusers from altering session auth

Previously, if a user connected as a role that had the superuser
attribute, then they could always execute a SET SESSION AUTHORIZATION
statement for the duration of their session. Even if the role was
altered to set superuser to false, the user was still allowed to
execute SET SESSION AUTHORIZATION. This allowed them to set their
session role to some other superuser and effectively regain the
superuser privileges. They could even reset their own superuser
attribute to true.

This commit alters the privilege checks for SET SESSION AUTHORIZATION
such that a user can only execute it if the role they connected with is
currently a superuser. This prevents users from regaining superuser
privileges after it has been revoked.
---
 doc/src/sgml/ref/set_session_auth.sgml |  2 +-
 src/backend/utils/init/miscinit.c  | 33 +++---
 2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/ref/set_session_auth.sgml b/doc/src/sgml/ref/set_session_auth.sgml
index f8fcafc194..94adab2468 100644
--- a/doc/src/sgml/ref/set_session_auth.sgml
+++ b/doc/src/sgml/ref/set_session_auth.sgml
@@ -51,7 +51,7 @@ RESET SESSION AUTHORIZATION
 
   
The session user identifier can be changed only if the initial session
-   user (the authenticated user) had the
+   user (the authenticated user) has the
superuser privilege.  Otherwise, the command is accepted only if it
specifies the authenticated user name.
   
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index a604432126..459af11691 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -467,7 +467,7 @@ ChangeToDataDir(void)
  * AuthenticatedUserId is determined at connection start and never changes.
  *
  * SessionUserId is initially the same as AuthenticatedUserId, but can be
- * changed by SET SESSION AUTHORIZATION (if AuthenticatedUserIsSuperuser).
+ * changed by SET SESSION AUTHORIZATION.
  * This is the ID reported by the SESSION_USER SQL function.
  *
  * OuterUserId is the current user ID in effect at the "outer level" (outside
@@ -492,8 +492,6 @@ static Oid	OuterUserId = InvalidOid;
 static Oid	CurrentUserId = InvalidOid;
 static const char *SystemUser = NULL;
 
-/* We also have to remember the superuser state of some of these levels */
-static bool AuthenticatedUserIsSuperuser = false;
 static bool SessionUserIsSuperuser = false;
 
 static int	SecurityRestrictionContext = 0;
@@ -731,6 +729,7 @@ InitializeSessionUserId(const char *rolename, Oid roleid)
 	HeapTuple	roleTup;
 	Form_pg_authid rform;
 	char	   *rname;
+	bool	   is_superuser;
 
 	/*
 	 * Don't do scans if we're bootstrapping, none of the system catalogs
@@ -770,10 +769,10 @@ InitializeSessionUserId(const char *rolename, Oid roleid)
 	rname = NameStr(rform->rolname);
 
 	AuthenticatedUserId = roleid;
-	AuthenticatedUserIsSuperuser = rform->rolsuper;
+	is_superuser = 

Re: Can JoinFilter condition be pushed down into IndexScan?

2023-06-21 Thread Tomas Vondra
On 6/21/23 20:37, Bəxtiyar Neyman wrote:
> Thanks Tomas for the lengthy write-up!
> 
> Pardon the noise in the queries (LATERAL, AND true etc): they were
> autogenerated by the library we wrote.
> 

I know, but it makes them harder to read for people. If you want people
to respond it's generally a good idea to make it easy to understand the
question. Don't make them waste their time - they'll just skip the
message entirely.

>> Because those queries are not doing the same thing. In the first query
>> you sort by t3_0 columns, while the "id = 4732455" condition is on the
>> other table. And so it can't use the index scan for sorting.
>>
>> While in the second query it can do that, and it doesn't need to do the
>> explicit sort (which needs to fetch all the rows etc.).
> 
> Let me try to explain what both of my queries do:
> 1) Get the rank of the user using its id (id = 4732455 in this example,
> but it could have been one that exists, e.g. id = 500). This is LATERAL
> t3_1 in the first query and subquery in the WHERE clause of the second
> query.
> 2) Using that rank, get the next 10 users by rank. This is t3_0.
> 
> Thus I can't just change the first query to "ORDER BY t3_1."rank" DESC,
> t3_1."id" DESC" as you suggest, because then the order of returned rows
> will not be guaranteed. In fact, such a clause will have no effect
> because there is going to be at most one row supplied by t3_1 anyway.
> 

Ah, OK. I got this wrong.

> My question thus still stands. The planner knows that t3_1 has at most
> one row, and it knows that t3_0 can produce up to 5000 rows. Yet, it
> doesn't figure out that it could have lowered the Join Filter condition
> from the first plan as an Index Cond of the Index Scan of t3_1. Is there
> a fundamental reason for this, or is this something worth improving in
> the planner?
> 

As I tried to explain before, I don't think the problem is in the
planner not being able to do this transformation, but more likely in not
being able to cost it correctly.

Consider this (with 1M rows in the user_ranks table):

1) subquery case
=

 Limit  (cost=8.87..9.15 rows=10 width=8) (actual time=0.032..0.037
rows=10 loops=1)
   Output: t3_0.id, t3_0.rank
   InitPlan 1 (returns $0,$1)
 ->  Index Scan using user_ranks_pkey on public.user_ranks t4_0
(cost=0.42..8.44 rows=1 width=8) (actual time=0.017..0.019 rows=1 loops=1)
   Output: t4_0.rank, t4_0.id
   Index Cond: (t4_0.id = 33)
   ->  Index Only Scan Backward using "by (rank, id)" on
public.user_ranks t3_0  (cost=0.42..9493.75 rows=33 width=8) (actual
time=0.031..0.033 rows=10 loops=1)
 Output: t3_0.id, t3_0.rank
 Index Cond: (ROW(t3_0.rank, t3_0.id) <= ROW($0, $1))
 Heap Fetches: 0
 Planning Time: 0.072 ms
 Execution Time: 0.055 ms
(12 rows)
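
The query shape behind this plan is presumably something like the following
(table, column and index names taken from the plan, the rest reconstructed):

    SELECT id, rank
      FROM user_ranks
     WHERE (rank, id) <= (SELECT rank, id FROM user_ranks WHERE id = 33)
     ORDER BY rank DESC, id DESC
     LIMIT 10;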


2) join
===

 Limit  (cost=0.85..2.15 rows=10 width=8) (actual time=464.662..464.672
rows=10 loops=1)
   Output: t3_0.id, t3_0.rank
   ->  Nested Loop  (cost=0.85..43488.87 rows=33 width=8) (actual
time=464.660..464.667 rows=10 loops=1)
 Output: t3_0.id, t3_0.rank
 Inner Unique: true
 Join Filter: (ROW(t3_0.rank, t3_0.id) <= ROW(t4_0.rank, t4_0.id))
 Rows Removed by Join Filter: 67
 ->  Index Only Scan Backward using "by (rank, id)" on
public.user_ranks t3_0  (cost=0.42..25980.42 rows=100 width=8)
(actual time=0.015..93.703 rows=77 loops=1)
   Output: t3_0.rank, t3_0.id
   Heap Fetches: 0
 ->  Materialize  (cost=0.42..8.45 rows=1 width=8) (actual
time=0.000..0.000 rows=1 loops=77)
   Output: t4_0.rank, t4_0.id
   ->  Index Scan using user_ranks_pkey on public.user_ranks
t4_0  (cost=0.42..8.44 rows=1 width=8) (actual time=0.010..0.011 rows=1
loops=1)
 Output: t4_0.rank, t4_0.id
 Index Cond: (t4_0.id = 33)
 Planning Time: 0.092 ms
 Execution Time: 464.696 ms
(17 rows)


3) join (with LEFT JOIN)


 Limit  (cost=20038.73..20038.76 rows=10 width=8) (actual
time=180.714..180.720 rows=10 loops=1)
   Output: t3_0.id, t3_0.rank
   ->  Sort  (cost=20038.73..20872.06 rows=33 width=8) (actual
time=180.712..180.715 rows=10 loops=1)
 Output: t3_0.id, t3_0.rank
 Sort Key: t3_0.rank DESC, t3_0.id DESC
 Sort Method: top-N heapsort  Memory: 26kB
 ->  Nested Loop Left Join  (cost=0.85..12835.52 rows=33
width=8) (actual time=0.033..122.000 rows=33 loops=1)
   Output: t3_0.id, t3_0.rank
   ->  Index Scan using user_ranks_pkey on public.user_ranks
t4_0  (cost=0.42..8.44 rows=1 width=8) (actual time=0.018..0.020 rows=1
loops=1)
 Output: t4_0.id, t4_0.rank
 Index Cond: (t4_0.id = 33)
   ->  Index Only Scan using "by (rank, id)" on
public.user_ranks t3_0  (cost=0.42..9493.75 rows=33 width=8) (actual
time=0.013..49.759 rows=33 loops=1)

Re: ProcessStartupPacket(): database_name and user_name truncation

2023-06-21 Thread Nathan Bossart
On Wed, Jun 21, 2023 at 09:02:49PM +0200, Drouvot, Bertrand wrote:
> Please find attached a patch doing so (which is basically a revert of 
> d18c1d1f51).

LGTM.  I think this can wait for v17 since the current behavior has been
around since 2001 and AFAIK this is the first report.  While it's arguably
a bug fix, the patch also breaks some cases that work today.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Use of additional index columns in rows filtering

2023-06-21 Thread Tomas Vondra



On 6/21/23 18:17, James Coleman wrote:
> On Wed, Jun 21, 2023 at 11:28 AM Tomas Vondra
>  wrote:
>>
>>
>>
>> On 6/21/23 14:45, James Coleman wrote:
>>> Hello,
>>>
>>> I've cc'd Jeff Davis on this due to a conversation we had at PGCon
>>> about applying filters on index tuples during index scans.
>>>
>>> I've also cc'd Andres Freund because I think this relates to his
>>> musing in [1] that:
 One thing I have been wondering around this is whether we should not have
 split the code for IOS and plain indexscans...
>>>
>>> I think I remember Peter Geoghegan also wondering (I can't remember if
>>> this was in conversation at PGCon about index skip scans or in a
>>> hackers thread) about how we compose these various index scan
>>> optimizations.
>>>
>>> To be certain this is probably a thing to tackle as a follow-on to
>>> this patch, but it does seem to me that what we are implicitly
>>> realizing is that (unlike with bitmap scans, I think) it doesn't
>>> really make a lot of conceptual sense to have index only scans be a
>>> separate node from index scans. Instead it's likely better to consider
>>> it an optimization to index scans that can dynamically kick in when
>>> it's able to be of use. That would allow it to compose with e.g.
>>> prefetching in the aforelinked thread. At the very least we would need
>>> pragmatic (e.g., cost of dynamically applying optimizations) rather
>>> than conceptual reasons to argue they should continue to be separate.
>>>
>>
>> I agree it seems a bit weird to have IOS as a separate node. In a way, I
>> think there are two dimensions for "index-only" scans - which pages can
>> be scanned like that, and which clauses can be evaluated with only the
>> index tuple. The current approach focuses on page visibility, but
>> ignores the other aspect entirely. Or more precisely, it disables IOS
>> entirely as soon as there's a single condition requiring heap tuple.
>>
>> I agree it's probably better to see this as a single node with various
>> optimizations that can be applied when possible / efficient (based on
>> planner and/or dynamically).
>>
>> I'm not sure I see a direct link to the prefetching patch, but it's true
>> that needs to deal with tids (instead of slots), just like IOS. So if
>> the node worked with tids, maybe the prefetching could be done at that
>> level (which I now realize may be what Andres meant by doing prefetching
>> in the executor).
> 
> The link to prefetching is that IOS (as a separate node) won't benefit
> from prefetching (I think) with your current prefetching patch (in the
> case where the VM doesn't allow us to just use the index tuple),
> whereas if the nodes were combined that would more naturally be
> composable.
> 

Yeah, mostly. Although just unifying "regular" indexscans and IOS would
not allow prefetching for IOS.

The reason why the current design does not allow doing prefetching for
IOS is that the prefetching happens deep in indexam.c, and down there we
don't know which TIDs are not from all-visible pages and would need
prefetching. Which is another good reason to do the prefetching at the
executor level, I believe.

>>> Apologies for that lengthy preamble; on to the patch under discussion:
>>>
>>> On Thu, Jun 8, 2023 at 1:34 PM Tomas Vondra
>>>  wrote:

 Hi,

 I took a stab at this and implemented the trick with the VM - during
 index scan, we also extract the filters that only need the indexed
 attributes (just like in IOS). And then, during the execution we:

   1) scan the index using the scan keys (as before)

   2) if the heap page is all-visible, we check the new filters that can
  be evaluated on the index tuple

   3) fetch the heap tuple and evaluate the filters
>>>
>>> Thanks for working on this; I'm excited about this class of work
>>> (along with index prefetching and other ideas I think there's a lot of
>>> potential for improving index scans).
>>>
 This is pretty much exactly the same thing we do for IOS, so I don't see
 why this would be incorrect while IOS is correct.

 This also adds "Index Filter" to explain output, to show which filters
 are executed on the index tuple (at the moment the filters are a subset
 of "Filter"), so if the index tuple matches we'll execute them again on
 the heap tuple. I guess that could be fixed by having two "filter"
 lists, depending on whether we were able to evaluate the index filters.
>>>
>>> Given that we show index filters and heap filters separately it seems
>>> like we might want to maintain separate instrumentation counts of how
>>> many tuples were filtered by each set of filters.
>>>
>>
>> Yeah, separate instrumentation counters would be useful. What I was
>> talking about was more about the conditions themselves, because right now we
>> re-evaluate the index-only clauses on the heap tuple.
>>
>> Imagine an index on t(a) and a query that has WHERE (a = 1) AND (b = 2).
>> the patch splits 

Re: ProcessStartupPacket(): database_name and user_name truncation

2023-06-21 Thread Drouvot, Bertrand



On 6/21/23 4:22 PM, Drouvot, Bertrand wrote:

Hi,

On 6/21/23 3:43 PM, Tom Lane wrote:

Kyotaro Horiguchi  writes:

At Wed, 21 Jun 2023 09:43:50 +0200, "Drouvot, Bertrand" 
 wrote in

Trying to connect with the 64-byte name:
$ psql -d 
psql: error: connection to server on socket "/tmp/.s.PGSQL.55448"
failed: FATAL: database "äää" does not
exist



IMHO, I'm not sure we should allow connections without the exact name
being provided. In that sense, I think we might want to consider
outright rejecting the establishment of a connection when the given
database name doesn't fit the startup packet, since the database with
the exact given name cannot be found.


I think I agree.  I don't like the proposed patch at all, because it's
making completely unsupportable assumptions about what encoding the
names are given in.  Simply failing to match when a name is overlength
sounds safer.



Yeah, that's another and "cleaner" option.

I'll propose a patch to make it fail even for the non-multibyte case then (so
that multibyte and non-multibyte behave the same, i.e. the connection fails when
an overlength name is detected).


Please find attached a patch doing so (which is basically a revert of 
d18c1d1f51).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 1a40e13385752ef05b9602d1040e73dbb14d0c7e Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Wed, 21 Jun 2023 18:28:13 +
Subject: [PATCH v1] Reject incoming username and database name in case of
 overlength

---
 src/backend/postmaster/postmaster.c | 9 -
 1 file changed, 9 deletions(-)
 100.0% src/backend/postmaster/

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 4c49393fc5..0b1de9efb2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2290,15 +2290,6 @@ retry1:
}
}
 
-   /*
-* Truncate given database and user names to length of a Postgres name.
-* This avoids lookup failures when overlength names are given.
-*/
-   if (strlen(port->database_name) >= NAMEDATALEN)
-   port->database_name[NAMEDATALEN - 1] = '\0';
-   if (strlen(port->user_name) >= NAMEDATALEN)
-   port->user_name[NAMEDATALEN - 1] = '\0';
-
if (am_walsender)
MyBackendType = B_WAL_SENDER;
else
-- 
2.34.1



Re: DROP DATABASE is interruptible

2023-06-21 Thread Andres Freund
Hi,

I'm hacking on this bugfix again, thanks to Evgeny's reminder on the other
thread [1].


I've been adding checks for partially-dropped databases to the following places
so far:
- vac_truncate_clog(), as autovacuum can't process it anymore. Otherwise a
  partially dropped database could easily lead to shutdown-due-to-wraparound.
- get_database_list() - so autovacuum workers don't error out when connecting
- template database used by CREATE DATABASE
- pg_dumpall, so we don't try to connect to the database
- vacuumdb, clusterdb, reindexdb, same

It's somewhat annoying that there is no shared place for the relevant query
for the client-side cases.
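
For illustration, the query each of those client-side tools ends up embedding is
something along these lines, assuming the partially-dropped state is flagged in
pg_database via a special datconnlimit value (the -2 marker below is only my
shorthand for that, see further down):

    -- skip databases whose DROP DATABASE was interrupted partway through
    SELECT datname FROM pg_database WHERE datconnlimit <> -2;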


I haven't yet added checks to pg_upgrade, even though that's clearly
needed. I'm waffling a bit between erroring out and just ignoring the
database? pg_upgrade already fails when datallowconn is set "wrongly", see
check_proper_datallowconn().  Any opinions?


I'm not sure what should be done for psql. It's probably not a good idea to
change tab completion, that'd just make it appear the database is gone. But \l
could probably show dropped databases more prominently?


We don't really have a good place for database-specific
code. dbcommands.[ch] are for commands (duh), but already contain a bunch of
functions that don't really belong there.  Seems we should add a
catalog/pg_database.c or catalog/database.c (tbh, I don't really know which we
use for what). But that's probably for HEAD only.


dbcommands.c's get_db_info() seems to have gone completely off the deep
end. It returns information in 14 separate out parameters, and the variables
for that need to all exhaustively be declared.  And of course that differs
heavily between releases, making it a pain to backpatch any change.  ISTM we
should just define a struct for the parameters - alternatively we could just
return a copy of the pg_database tuple, but it looks like the variable-width
attributes would make that *just* about a loss.

I guess that's once more something better dealt with on HEAD, but damn, I'm
not relishing having to deal with backpatching anything touching it - I think
it might be reasonable to just open-code fetching datconnlimit :/.


This patch is starting to be a bit big, particularly once adding tests for all
the checks mentioned above - but I haven't heard of or thought of a better
proposal :(.

Greetings,

Andres Freund

[1] 
https://postgr.es/m/01020188d31d0a86-16af92c0-4466-4cb6-a2e8-0e5898aab800-00%40eu-west-1.amazonses.com




Re: Memory leak in incremental sort re-scan

2023-06-21 Thread James Coleman
On Thu, Jun 15, 2023 at 6:35 PM Tomas Vondra
 wrote:
>
>
>
> On 6/15/23 22:36, Tom Lane wrote:
> > Tomas Vondra  writes:
> >> On 6/15/23 22:11, Tom Lane wrote:
> >>> I see zero leakage in that example after applying the attached quick
> >>> hack.  (It might be better to make the check in the caller, or to just
> >>> move the call to ExecInitIncrementalSort.)
> >
> >> Thanks for looking. Are you planning to work on this and push the fix,
> >> or do you want me to finish this up?
> >
> > I'm happy to let you take it -- got lots of other stuff on my plate.
> >
>
> OK, will do.

I think the attached is enough to fix it -- rather than nulling out
the sort states in rescan, we can reset them (as the comment says),
but not set them to null (we also have the same mistake with
presorted_keys). That avoids unnecessary recreation of the sort
states, but it also fixes the problem Tom noted as well: the call to
preparePresortedCols() is already guarded by a test on fullsort_state
being NULL, so with this change we also won't unnecessarily redo that
work.
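
To make that concrete, here is a sketch of the described change in
ExecReScanIncrementalSort() -- field names as in nodeIncrementalSort.c, not the
literal patch:

    /* keep and reset the sort states instead of discarding them */
    if (node->fullsort_state != NULL)
        tuplesort_reset(node->fullsort_state);    /* no longer: node->fullsort_state = NULL; */
    if (node->prefixsort_state != NULL)
        tuplesort_reset(node->prefixsort_state);  /* no longer: node->prefixsort_state = NULL; */
    /* and likewise, keep node->presorted_keys instead of clearing it */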

Regards,
James Coleman


v2-0001-Fix-memory-leak-in-incremental-sort-rescan.patch
Description: Binary data


Re: Support TZ format code in to_timestamp()

2023-06-21 Thread David Steele

On 6/21/23 20:07, Bruce Momjian wrote:

On Tue, Jun 13, 2023 at 12:20:42PM -0400, Tom Lane wrote:

It's annoyed me for some time that to_timestamp() doesn't implement
the TZ format code that to_char() has.  I finally got motivated to
do something about that after the complaint at [1] that jsonpath's
datetime() method can't read typical JSON.stringify() output like
"2023-05-22T03:09:37.825Z".  We do already understand "Z" as a
time zone abbreviation for UTC; we just need to get formatting.c
to support this.


I have to admit I tend to prefer actual time zone names like
"America/New_York" over acronyms or offset values.  However, I can see
the dump/restore problem with such names.


I think the abbreviations are worse than useless -- dangerously 
misleading even. I was converting a timestamp I had pulled from the 
internet the other day in IST (India Standard Time) using Postgres to 
test some new code I was working on. I got a rather surprising result, so I 
changed it to Asia/Kolkata and got what I expected.


Turns out IST is *also* Israel Standard Time and Irish Standard Time. I 
think Postgres gave me the conversion in Irish time. At any rate, it was 
not offset by 30 minutes, which was the dead giveaway.
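
For example (the first result depends entirely on how the
timezone_abbreviations setting resolves "IST"; the second form is unambiguous):

    SELECT '2023-06-21 12:00 IST'::timestamptz;
    SELECT '2023-06-21 12:00'::timestamp AT TIME ZONE 'Asia/Kolkata';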


Offsets are fine when you just need an absolute date to feed into 
something like recovery and it doesn't much matter what timezone you 
were in.


Z and UTC also seem fine since they are unambiguous as far as I know.

Regards,
-David




Re: Support TZ format code in to_timestamp()

2023-06-21 Thread Bruce Momjian
On Tue, Jun 13, 2023 at 12:20:42PM -0400, Tom Lane wrote:
> It's annoyed me for some time that to_timestamp() doesn't implement
> the TZ format code that to_char() has.  I finally got motivated to
> do something about that after the complaint at [1] that jsonpath's
> datetime() method can't read typical JSON.stringify() output like
> "2023-05-22T03:09:37.825Z".  We do already understand "Z" as a
> time zone abbreviation for UTC; we just need to get formatting.c
> to support this.

I have to admit I tend to prefer actual time zone names like
"America/New_York" over acronyms or offset values.  However, I can see
the dump/restore problem with such names.

Parenthetically, I often use airport codes that map to time zones in my
own calendar.  I would argue that on a global scale airport codes are
actually more useful than abbreviations like EST, assuming you don't
need to designate whether daylight saving time was active, e.g. EST vs.
EDT.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: add non-option reordering to in-tree getopt_long

2023-06-21 Thread Nathan Bossart
On Tue, Jun 20, 2023 at 02:12:44PM +0900, Kyotaro Horiguchi wrote:
> The argv elements get shuffled around many times with the
> patch. However, I couldn't find a way to decrease the count without
> resorting to a forward scan.  So I've concluded the current approach
> is the most efficient, considering the complexity.

Yeah, I'm not sure it's worth doing anything more sophisticated.

> I tried some patterns with the patch and it generates the same results
> as the glibc version.
> 
> The TAP test looks fine and it catches the change.
> 
> Everything looks fine to me.

Thanks for reviewing.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: collation-related loose ends before beta2

2023-06-21 Thread Jonathan S. Katz

On 6/20/23 5:02 AM, Jeff Davis wrote:


Status on collation loose ends:

1. There's an open item "Switch to ICU for 17". It's a little bit
confusing exactly what that means, and the CF entry refers to two
items, one of which is the build-time default to --with-icu. As far as
I know, building with ICU by default is a settled issue with no
objections. The second issue is the initdb default, which is covered by
the other open item. So I will just close that open item unless someone
thinks I'm missing something.


[RMT Hat]

No objections. The RMT had interpreted this as "Punt on making ICU the 
building default to v17" but it seems the consensus is to continue to 
leave it in as the default for v16.



2. Open item about the unfriendly rules for choosing an ICU locale at
initdb time. Tom, Robert, and Daniel Verite have expressed concerns
(and at least one objection) to initdb defaulting to  icu for --locale-
provider. Some of the problems have been addressed, but the issue about
C and C.UTF-8 locales is not settled. Even if it were settled I'm not
sure we'd have a clear consensus on all the details. I don't think this
should proceed to beta2 in this state, so I intend to revert back to
libc as the default for initdb. [ I believe we do have a general
consensus that ICU is better, but we can signal it other ways: through
documentation, packaging, etc. ]


[Personal hat]

(Building...)

I do think this raises a good point: it's really the packaging that will 
guide what users are using for v16. I don't know if we want to 
discuss/poll the packagers to see what they are thinking about this?



3. The ICU conversion from "C" to "en-US-u-va-posix": cut out this code
(it was a small part of a larger change). Its only purpose was
consistency between ICU versions, and nobody liked it. It's only here
right now to avoid test failures due to an order-of-commits issue; but
if the initdb default goes back to libc it won't matter and I can
remove it.

4. icu_validation_level WARNING or ERROR: right now an invalid ICU
locale raises a WARNING, but Peter Eisentraut would prefer an ERROR.
I'm still inclined to leave it as a WARNING for one release and
increase it to ERROR later. But if the default collation provider goes
back to libc, the risk of ICU validation errors goes way down, so I
don't object if Peter would like to change it back to an ERROR.


[Personal hat]

I'd be inclined toward "WARNING" until we get a sense of which collation 
provider the packagers who run initdb as part of the installation process 
are going to use.


Thanks,

Jonathan





Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2023-06-21 Thread Nathan Bossart
On Tue, Jun 20, 2023 at 09:15:18PM -0700, Nathan Bossart wrote:
> Perhaps we should add something like
> 
>   Note that while REINDEX on a partitioned index or table requires
>   MAINTAIN on the partitioned table, such commands skip the privilege
>   checks when processing the individual partitions.
> 
> Thoughts?  I'm trying to keep the privilege documentation for maintenance
> commands as simple as possible, so I'm hoping to avoid adding too much text
> dedicated to these special cases.

Here is a new patch set that includes this new sentence.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 6a18f0ab9706eac0295c2e1e53dc2433df10e66a Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 19 Jun 2023 13:57:57 -0700
Subject: [PATCH v8 1/2] partial revert of ff9618e82a

---
 doc/src/sgml/ref/analyze.sgml |  5 +--
 doc/src/sgml/ref/cluster.sgml |  5 +--
 doc/src/sgml/ref/lock.sgml|  5 +--
 doc/src/sgml/ref/reindex.sgml |  6 +---
 doc/src/sgml/ref/vacuum.sgml  |  5 +--
 src/backend/commands/cluster.c| 10 ++
 src/backend/commands/indexcmds.c  | 27 +++
 src/backend/commands/lockcmds.c   |  8 -
 src/backend/commands/tablecmds.c  | 34 +++
 src/backend/commands/vacuum.c |  8 +
 src/include/commands/tablecmds.h  |  1 -
 .../expected/cluster-conflict-partition.out   | 14 
 .../specs/cluster-conflict-partition.spec |  7 ++--
 src/test/regress/expected/cluster.out |  3 +-
 src/test/regress/expected/create_index.out|  4 +--
 src/test/regress/expected/privileges.out  |  8 ++---
 src/test/regress/expected/vacuum.out  | 18 ++
 17 files changed, 62 insertions(+), 106 deletions(-)

diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml
index 20c6f9939f..30a893230e 100644
--- a/doc/src/sgml/ref/analyze.sgml
+++ b/doc/src/sgml/ref/analyze.sgml
@@ -190,10 +190,7 @@ ANALYZE [ VERBOSE ] [ table_and_columnsANALYZE can only be performed by superusers and roles
-   with privileges of pg_maintain.)  If a role has
-   permission to ANALYZE a partitioned table, it is also
-   permitted to ANALYZE each of its partitions, regardless
-   of whether the role has the aforementioned privileges on the partition.
+   with privileges of pg_maintain.)
ANALYZE will skip over any tables that the calling user
does not have permission to analyze.
   
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 29f0f1fd90..f0dd7faed5 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -137,10 +137,7 @@ CLUSTER [VERBOSE]
 on the table or be the table's owner, a superuser, or a role with
 privileges of the
 pg_maintain
-role.  If a role has permission to CLUSTER a partitioned
-table, it is also permitted to CLUSTER each of its
-partitions, regardless of whether the role has the aforementioned
-privileges on the partition.  CLUSTER will skip over any
+role.  CLUSTER will skip over any
 tables that the calling user does not have permission to cluster.

 
diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml
index 5b3b2b793a..8524182211 100644
--- a/doc/src/sgml/ref/lock.sgml
+++ b/doc/src/sgml/ref/lock.sgml
@@ -177,10 +177,7 @@ LOCK [ TABLE ] [ ONLY ] name [ * ]
 MODE (or a less-conflicting mode as described in ) is permitted. If a user has
 SELECT privileges on the table, ACCESS SHARE
-MODE is permitted.  If a role has permission to lock a
-partitioned table, it is also permitted to lock each of its partitions,
-regardless of whether the role has the aforementioned privileges on the
-partition.
+MODE is permitted.

 

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 71455dfdc7..23f8c7630b 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -306,11 +306,7 @@ REINDEX [ ( option [, ...] ) ] { DA
indexes on shared catalogs will be skipped unless the user owns the
catalog (which typically won't be the case), has privileges of the
pg_maintain role, or has the MAINTAIN
-   privilege on the catalog.  If a role has permission to
-   REINDEX a partitioned table, it is also permitted to
-   REINDEX each of its partitions, regardless of whether the
-   role has the aforementioned privileges on the partition.  Of course,
-   superusers can always reindex anything.
+   privilege on the catalog.  Of course, superusers can always reindex anything.
   
 
   
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 57bc4c23ec..445325e14c 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -452,10 +452,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ relname)));
 
 	/* Check permissions */
-	if 

Re: RFC: logical publication via inheritance root?

2023-06-21 Thread Jacob Champion
On Wed, Jun 21, 2023 at 3:28 AM Amit Kapila  wrote:
> On Tue, Jun 20, 2023 at 10:39 PM Jacob Champion  
> wrote:
> > Making it a subscriber-side feature
> > requires tight coupling between the two peers, though. (For the
> > timescaledb case, how does the subscriber know which new partitions
> > belong to which root?
>
> Yeah, the subscriber can't figure that out automatically. Users need
> to provide the mapping manually.

Right. For that reason, I think subscriber-side mappings probably
won't help this particular use case. This patchset is pushing more in
the direction of publisher-side mappings.

Thanks,
--Jacob




Re: EBCDIC sorting as a use case for ICU rules

2023-06-21 Thread Jonathan S. Katz

On 6/21/23 12:14 PM, Jeff Davis wrote:

On Wed, 2023-06-21 at 15:28 +0200, Daniel Verite wrote:

At a conference this week I was asked if ICU could sort like EBCDIC [2].
It turns out it has already been asked on -general a few years ago [3]
with no satisfactory answer at the time, and that it can be implemented
with rules in v16.


Interesting, thank you!


+1 -- this is very helpful framing the problem, thank you!


This can be useful for people who migrate from mainframes to Postgres
and need their migration tests to produce the same sorted results as
the
original system.
Since rules can be defined at the database level with the icu_rules
option,
they don't even need to tweak their queries to add COLLATE clauses,
which surely is appreciable in that kind of project.


I still had some technical concerns about the ICU rules feature,
unfortunately, and one option is to only allow it for the collation
objects and not the database level collation. How much would that hurt
this use case?



I'm open to suggestions on whether this EBCDIC example is worth being
in the
doc in some form or putting this in the wiki would be good enough.


I like the idea of having a real example. Ideally, we could add some
explanation along the way about how the rule is constructed to match
EBCDIC, which would reduce the shock of a long rule like that.

I wonder why the rule syntax is such that it cannot be broken up? Would
it be incorrect for us to allow some whitespace in there?


I'll give the unhelpful comment of "yes, I agree we should have a real 
world example", especially one that seems relevant to helping more 
people adopt PostgreSQL.






Re: pg_collation.collversion for C.UTF-8

2023-06-21 Thread Daniel Verite
Thomas Munro wrote:

> What could we do that would be helpful here, without affecting users
> of the "true" C.UTF-8 for the rest of time?  This is a Debian (+
> downstream distro) only problem as far as we know so far, and only
> for Debian 11 and older.

It seems to include RedHat-based distros as well.

According to https://bugzilla.redhat.com/show_bug.cgi?id=902094
C.utf8 was added in 2015 and backported down to Fedora 22.
RHEL8 / CentOS 8 / Rocky8 provide glibc 2.28 with a C.utf8
locale. We can reasonably suspect that they've been using the same
kind of patches as Debian before version 12, with not all codepoints
being sorted bytewise.

RHEL9 comes with glibc 2.34 according to distrowatch [1] and the
announcement [2], so presumably it also lacks the "real" C.utf8
with bytewise sorting that glibc 2.35 upstream added.


[1] https://distrowatch.com/table.php?distribution=redhat
[2]
https://developers.redhat.com/articles/2022/05/18/whats-new-red-hat-enterprise-linux-9


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Remove deprecation warnings when compiling PG ~13 with OpenSSL 3.0~

2023-06-21 Thread Andres Freund
Hi,

On 2023-06-21 10:11:33 +0200, Peter Eisentraut wrote:
> On 21.06.23 09:43, Michael Paquier wrote:
> > On Wed, Jun 21, 2023 at 09:16:38AM +0200, Daniel Gustafsson wrote:
> > > Agreed, I'd be more inclined to go with OPENSSL_API_COMPAT.  If we still 
> > > get
> > > warnings with that set then I feel those warrant special consideration 
> > > rather
> > > than a blanket suppression.
> > 
> > 4d3db136 seems to be OK on REL_13_STABLE with a direct cherry-pick.
> > REL_11_STABLE and REL_12_STABLE require two different changes:
> > - pg_config.h.win32 needs to list OPENSSL_API_COMPAT.
> > - Solution.pm needs an extra #define OPENSSL_API_COMPAT in
> > GenerateFiles() whose value can be retrieved from configure.in like in
> > 13~.
> > 
> > Anything I am missing perhaps?
> 
> Backpatching the OPENSSL_API_COMPAT change would set the minimum OpenSSL
> version to 1.0.1, which is newer than what was so far required in those
> branches.  That is the reason we didn't do this.

What's the problem with just setting a different version in those branches?
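
(Presumably that would just mean the same mechanism with an older target in the
back branches, i.e. something along the lines of the following ending up in
pg_config.h -- the exact value is illustrative:)

    #define OPENSSL_API_COMPAT 0x00908000L	/* target 0.9.8 rather than 1.0.1 */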

Greetings,

Andres Freund




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2023-06-21 Thread Jeff Davis
On Tue, 2023-06-20 at 15:52 -0700, Nathan Bossart wrote:
> At the moment, I think I'm inclined to call this "existing behavior"
> since
> we didn't check privileges for each partition in this case even
> before
> MAINTAIN was introduced.  IIUC we still process the individual
> partitions
> in v15 regardless of whether the calling user owns the partition.

That's fine with me. I just wanted to bring it up in case someone else
thought it was a problem.

> However, I do agree that it feels inconsistent.  Besides the options
> you
> proposed, we might also consider making REINDEX work a bit more like
> VACUUM
> and ANALYZE and emit a WARNING for any relations that the user is not
> permitted to process.  But this probably deserves its own thread, and
> it
> might even need to wait until v17.

Yes, we can revisit for 17.

Regards,
Jeff Davis






Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2023-06-21 Thread Jeff Davis
On Wed, 2023-06-21 at 07:53 +0900, Michael Paquier wrote:
> I am not sure to understand this last sentence.  REINDEX on a
> partitioned table builds a list of the indexes to work on in the
> first
> transaction processing the command in ReindexPartitions(), and there
> is no need to process partitioned indexes as these have no storage,
> so
> your suggestion is a no-op?

What I meant is that if you do:

  CREATE TABLE p(i INT, j INT) PARTITION BY RANGE (i);
  CREATE TABLE p0 PARTITION OF p FOR VALUES FROM (00) TO (10);
  CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (10) TO (20);
  CREATE INDEX p_idx ON p (i);
  CREATE INDEX special_idx ON p0 (j);
  GRANT MAINTAIN ON p TO foo;
  \c - foo
  REINDEX TABLE p;

That would reindex p0_i_idx and p1_i_idx, but skip special_idx. That
might be too confusing, but feels a bit more consistent permissions-
wise.

Regards,
Jeff Davis





Re: Use of additional index columns in rows filtering

2023-06-21 Thread James Coleman
On Wed, Jun 21, 2023 at 11:28 AM Tomas Vondra
 wrote:
>
>
>
> On 6/21/23 14:45, James Coleman wrote:
> > Hello,
> >
> > I've cc'd Jeff Davis on this due to a conversation we had at PGCon
> > about applying filters on index tuples during index scans.
> >
> > I've also cc'd Andres Freund because I think this relates to his
> > musing in [1] that:
> >> One thing I have been wondering around this is whether we should not have
> >> split the code for IOS and plain indexscans...
> >
> > I think I remember Peter Geoghegan also wondering (I can't remember if
> > this was in conversation at PGCon about index skip scans or in a
> > hackers thread) about how we compose these various index scan
> > optimizations.
> >
> > To be certain this is probably a thing to tackle as a follow-on to
> > this patch, but it does seem to me that what we are implicitly
> > realizing is that (unlike with bitmap scans, I think) it doesn't
> > really make a lot of conceptual sense to have index only scans be a
> > separate node from index scans. Instead it's likely better to consider
> > it an optimization to index scans that can dynamically kick in when
> > it's able to be of use. That would allow it to compose with e.g.
> > prefetching in the aforelinked thread. At the very least we would need
> > pragmatic (e.g., cost of dynamically applying optimizations) rather
> > than conceptual reasons to argue they should continue to be separate.
> >
>
> I agree it seems a bit weird to have IOS as a separate node. In a way, I
> think there are two dimensions for "index-only" scans - which pages can
> be scanned like that, and which clauses can be evaluated with only the
> index tuple. The current approach focuses on page visibility, but
> ignores the other aspect entirely. Or more precisely, it disables IOS
> entirely as soon as there's a single condition requiring heap tuple.
>
> I agree it's probably better to see this as a single node with various
> optimizations that can be applied when possible / efficient (based on
> planner and/or dynamically).
>
> I'm not sure I see a direct link to the prefetching patch, but it's true
> that needs to deal with tids (instead of slots), just like IOS. So if
> the node worked with tids, maybe the prefetching could be done at that
> level (which I now realize may be what Andres meant by doing prefetching
> in the executor).

The link to prefetching is that IOS (as a separate node) won't benefit
from prefetching (I think) with your current prefetching patch (in the
case where the VM doesn't allow us to just use the index tuple),
whereas if the nodes were combined that would more naturally be
composable.

> > Apologies for that lengthy preamble; on to the patch under discussion:
> >
> > On Thu, Jun 8, 2023 at 1:34 PM Tomas Vondra
> >  wrote:
> >>
> >> Hi,
> >>
> >> I took a stab at this and implemented the trick with the VM - during
> >> index scan, we also extract the filters that only need the indexed
> >> attributes (just like in IOS). And then, during the execution we:
> >>
> >>   1) scan the index using the scan keys (as before)
> >>
> >>   2) if the heap page is all-visible, we check the new filters that can
> >>  be evaluated on the index tuple
> >>
> >>   3) fetch the heap tuple and evaluate the filters
> >
> > Thanks for working on this; I'm excited about this class of work
> > (along with index prefetching and other ideas I think there's a lot of
> > potential for improving index scans).
> >
> >> This is pretty much exactly the same thing we do for IOS, so I don't see
> >> why this would be incorrect while IOS is correct.
> >>
> >> This also adds "Index Filter" to explain output, to show which filters
> >> are executed on the index tuple (at the moment the filters are a subset
> >> of "Filter"), so if the index tuple matches we'll execute them again on
> >> the heap tuple. I guess that could be fixed by having two "filter"
> >> lists, depending on whether we were able to evaluate the index filters.
> >
> > Given that we show index filters and heap filters separately it seems
> > like we might want to maintain separate instrumentation counts of how
> > many tuples were filtered by each set of filters.
> >
>
> Yeah, separate instrumentation counters would be useful. What I was
> talking about was more about the conditions themselves, because right now we
> re-evaluate the index-only clauses on the heap tuple.
>
> Imagine an index on t(a) and a query that has WHERE (a = 1) AND (b = 2).
> the patch splits this into two lists:
>
> index-only clauses: (a=1)
> clauses: (a=1) AND (b=1)
>
> So we evaluate (a=1) first, and then we fetch the heap tuple and check
> "clauses" again, which however includes the (a=1) again. For cheap
> clauses (or when (a=1) eliminated a lot of tuples using just the index),
> but for expensive clauses it might hurt.
>
> It's fixable, we'd just need to keep two versions of the "clauses" list,
> one for IOS mode (when index-only clauses were checked) and a complete
> one when we 

Re: EBCDIC sorting as a use case for ICU rules

2023-06-21 Thread Jeff Davis
On Wed, 2023-06-21 at 15:28 +0200, Daniel Verite wrote:
> At a conference this week I was asked if ICU could sort like EBCDIC [2].
> It turns out it has already been asked on -general a few years ago [3]
> with no satisfactory answer at the time, and that it can be implemented
> with rules in v16.

Interesting, thank you!

> This can be useful for people who migrate from mainframes to Postgres
> and need their migration tests to produce the same sorted results as
> the
> original system.
> Since rules can be defined at the database level with the icu_rules
> option,
> they don't even need to tweak their queries to add COLLATE clauses,
> which surely is appreciable in that kind of project.

I still had some technical concerns about the ICU rules feature,
unfortunately, and one option is to only allow it for the collation
objects and not the database level collation. How much would that hurt
this use case?
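
With that restriction, the migration scenario would presumably look like an
explicit collation object plus COLLATE clauses (or per-column collations)
instead of a database-wide default, along these lines -- the long EBCDIC rule
string is elided here:

    CREATE COLLATION ebcdic (provider = icu, locale = 'und', rules = '...');
    SELECT * FROM t ORDER BY c COLLATE ebcdic;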


> I'm open to suggestions on whether this EBCDIC example is worth being
> in the
> doc in some form or putting this in the wiki would be good enough.

I like the idea of having a real example. Ideally, we could add some
explanation along the way about how the rule is constructed to match
EBCDIC, which would reduce the shock of a long rule like that.

I wonder why the rule syntax is such that it cannot be broken up? Would
it be incorrect for us to allow some whitespace in there?

Regards,
Jeff Davis





Re: bgwriter doesn't flush WAL stats

2023-06-21 Thread Nazir Bilal Yavuz
Hi,

Thanks for the explanation.

On Wed, 21 Jun 2023 at 18:03, Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:
>
> On Wed, 21 Jun 2023 at 13:04, Nazir Bilal Yavuz 
wrote:
> > I was trying to add WAL stats to pg_stat_io. While doing that I was
comparing pg_stat_wal and pg_stat_io's WAL stats and there was some
inequality between the total number of WALs. I found that the difference
comes from bgwriter's WALs. bgwriter generates WAL but it doesn't flush
them because the pgstat_report_wal() function isn't called in bgwriter. I
attached a small patch for calling the pgstat_report_wal() function in
bgwriter.
> >
> > bgwriter generates WAL by calling functions in this order:
> > bgwriter.c -> BackgroundWriterMain() -> BgBufferSync() ->
SyncOneBuffer() -> FlushBuffer() -> XLogFlush() -> XLogWrite()
>
> I was quite confused here, as XLogWrite() does not generate any WAL;
> it only writes existing WAL from buffers to disk.
> In a running PostgreSQL instance, WAL is only generated through
> XLogInsert(xloginsert.c) and serialized / written to buffers in its
> call to XLogInsertRecord(xlog.c); XLogFlush and XLogWrite are only
> responsible for writing those buffers to disk.

Yes, you are right. The correct explanation is "bgwriter writes existing
WAL from buffers to disk but pg_stat_wal doesn't count them because
bgwriter doesn't call pgstat_report_wal() to update WAL statistics".

> I also got confused with your included views; they're not included in
> the patch and the current master branch doesn't emit object=wal, so I
> can't really check that the patch works as intended.

I attached a WIP patch for showing WAL stats in pg_stat_io.

After applying the patch, I used these queries to get the views I shared
in the first mail:

Query for the first view:
SELECT
  'pg_stat_wal' AS view_name,
  SUM(wal_write) AS total_wal_write
FROM
  pg_stat_wal
UNION ALL
SELECT
  'pg_stat_io' AS view_name,
  SUM(writes) AS total_wal_write
FROM
  pg_stat_io
WHERE
  object = 'wal';

Query for the second view:
SELECT backend_type, object, writes FROM pg_stat_io where object = 'wal';

I also changed the description on the patch file and attached it.

Regards,
Nazir Bilal Yavuz
Microsoft
From 674b54adc3a92093f0cc25cd9117bf32956c6f2c Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 21 Jun 2023 13:39:13 +0300
Subject: [PATCH v2 2/2] Flush WAL in bgwriter

bgwriter writes existing WAL from buffers to disk but pg_stat_wal
doesn't count them because bgwriter doesn't call pgstat_report_wal()
to update WAL statistics. Call pgstat_report_wal() to flush WAL in
bgwriter.
---
 src/backend/postmaster/bgwriter.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index caad642ec9..f2e4f23d9f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -241,6 +241,7 @@ BackgroundWriterMain(void)
 
 		/* Report pending statistics to the cumulative stats system */
 		pgstat_report_bgwriter();
+		pgstat_report_wal(true);
 
 		if (FirstCallSinceLastCheckpoint())
 		{
-- 
2.40.1

From 45f0dc4ee660baceb91a96f1e5aafab3adc9bf13 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 21 Jun 2023 18:36:25 +0300
Subject: [PATCH v2 1/2] [WIP] Show WAL stats on pg_stat_io

---
 src/backend/access/transam/xlog.c  | 11 +--
 src/backend/utils/activity/pgstat_io.c |  9 -
 src/include/pgstat.h   |  3 ++-
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8b0710abe6..37028429c7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2177,7 +2177,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
+			instr_time	start, io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2185,6 +2185,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			nleft = nbytes;
 			do
 			{
+io_start = pgstat_prepare_io_time();
 errno = 0;
 
 /* Measure I/O timing to write WAL data */
@@ -2209,6 +2210,8 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 	INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, duration, start);
 }
 
+pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_WRITE, io_start, 1);
+
 PendingWalStats.wal_write++;
 
 if (written <= 0)
@@ -8163,7 +8166,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	start, io_start;
 
 	Assert(tli != 0);
 
@@ -8176,6 +8179,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		sync_method == SYNC_METHOD_OPEN_DSYNC)
 		return;
 
+	io_start = pgstat_prepare_io_time();
+
 	/* Measure I/O timing to sync the WAL file */
 	if (track_wal_io_timing)
 		

Re: EBCDIC sorting as a use case for ICU rules

2023-06-21 Thread Joe Conway

On 6/21/23 09:28, Daniel Verite wrote:

In the "Order changes in PG16 since ICU introduction" discussion, one
sub-thread [1] was about having a credible use case for tailoring collations
with custom rules, a new feature in v16.

At a conference this week I was asked if ICU could sort like EBCDIC [2].
It turns out it has already been asked on -general a few years ago [3]
with no satisfactory answer at the time, and that it can be implemented
with rules in v16.


Oh, very cool! I have seen the requirement for EBCDIC come up multiple 
times over the years.





Maybe this example could be added to the documentation except for
the problem that the rule is very long and dollar-quoting cannot be split
into several lines. Literals enclosed by single quotes can be split that
way, but would require escaping the single quotes in the rule, which
would lead to scary-looking over-quoted contents.

I'm open to suggestions on whether this EBCDIC example is worth being in the
doc in some form or putting this in the wiki would be good enough.


I would definitely favor adding to the docs, but no idea how to deal 
with the length issue.



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Partial aggregates pushdown

2023-06-21 Thread Bruce Momjian
On Tue, Jun 20, 2023 at 09:59:11AM +0300, Alexander Pyhalov wrote:
> > Therefore, it seems like it would be near-zero cost to just call conn =
> > GetConnection() and then PQserverVersion(conn), and ReleaseConnection().
> > You can then use the return value of PQserverVersion() to determine if
> > you can push down partial aggregates.
> 
> Hi.
> Currently we don't get a remote connection while planning if
> use_remote_estimate is not set.
> Such a change would require getting a remote connection in the planner, not in
> the executor.
> This can lead to a change of behavior (like errors in EXPLAIN when the user
> mapping is wrong - e.g. a bad password is specified).
> Also, this can potentially lead to establishing connections even when the plan
> node is not actually used
> (an extreme example: select sum(score) from t limit 0).
> I'm not saying we shouldn't do it - just hinting at possible consequences.

Agreed.  I noticed it was doing FDW connections during optimization, but
didn't see the postgres_fdw option that would turn it off. 
Interestingly, it is disabled by default.

After considering the options, I think we should have a postgres_fdw
option called "planner_version_check", and default it to false.  When
false, a remote server version check will not be performed during
planning and partial aggregates will always be considered.  When
true, a version check will be performed during planning and partial
aggregate pushdown will be disabled for pre-PG 17 foreign servers during
the query.

If we want to be more specific, we can call it
"check_partial_aggregate_support".

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Use of additional index columns in rows filtering

2023-06-21 Thread Tomas Vondra



On 6/21/23 14:45, James Coleman wrote:
> Hello,
> 
> I've cc'd Jeff Davis on this due to a conversation we had at PGCon
> about applying filters on index tuples during index scans.
> 
> I've also cc'd Andres Freund because I think this relates to his
> musing in [1] that:
>> One thing I have been wondering around this is whether we should not have
>> split the code for IOS and plain indexscans...
> 
> I think I remember Peter Geoghegan also wondering (I can't remember if
> this was in conversation at PGCon about index skip scans or in a
> hackers thread) about how we compose these various index scan
> optimizations.
> 
> To be certain this is probably a thing to tackle as a follow-on to
> this patch, but it does seem to me that what we are implicitly
> realizing is that (unlike with bitmap scans, I think) it doesn't
> really make a lot of conceptual sense to have index only scans be a
> separate node from index scans. Instead it's likely better to consider
> it an optimization to index scans that can dynamically kick in when
> it's able to be of use. That would allow it to compose with e.g.
> prefetching in the aforelinked thread. At the very least we would need
> pragmatic (e.g., cost of dynamically applying optimizations) rather
> than conceptual reasons to argue they should continue to be separate.
> 

I agree it seems a bit weird to have IOS as a separate node. In a way, I
think there are two dimensions for "index-only" scans - which pages can
be scanned like that, and which clauses can be evaluated with only the
index tuple. The current approach focuses on page visibility, but
ignores the other aspect entirely. Or more precisely, it disables IOS
entirely as soon as there's a single condition requiring heap tuple.

I agree it's probably better to see this as a single node with various
optimizations that can be applied when possible / efficient (based on
planner and/or dynamically).

I'm not sure I see a direct link to the prefetching patch, but it's true
that it needs to deal with tids (instead of slots), just like IOS. So if
the node worked with tids, maybe the prefetching could be done at that
level (which I now realize may be what Andres meant by doing prefetching
in the executor).

> Apologies for that lengthy preamble; on to the patch under discussion:
> 
> On Thu, Jun 8, 2023 at 1:34 PM Tomas Vondra
>  wrote:
>>
>> Hi,
>>
>> I took a stab at this and implemented the trick with the VM - during
>> index scan, we also extract the filters that only need the indexed
>> attributes (just like in IOS). And then, during the execution we:
>>
>>   1) scan the index using the scan keys (as before)
>>
>>   2) if the heap page is all-visible, we check the new filters that can
>>  be evaluated on the index tuple
>>
>>   3) fetch the heap tuple and evaluate the filters
> 
> Thanks for working on this; I'm excited about this class of work
> (along with index prefetching and other ideas I think there's a lot of
> potential for improving index scans).
> 
>> This is pretty much exactly the same thing we do for IOS, so I don't see
>> why this would be incorrect while IOS is correct.
>>
>> This also adds "Index Filter" to explain output, to show which filters
>> are executed on the index tuple (at the moment the filters are a subset
>> of "Filter"), so if the index tuple matches we'll execute them again on
>> the heap tuple. I guess that could be fixed by having two "filter"
>> lists, depending on whether we were able to evaluate the index filters.
> 
> Given that we show index filters and heap filters separately it seems
> like we might want to maintain separate instrumentation counts of how
> many tuple were filtered by each set of filters.
> 

Yeah, separate instrumentation counters would be useful. What I was
talking about was more about the conditions themselves, because right now
we re-evaluate the index-only clauses on the heap tuple.

Imagine an index on t(a) and a query that has WHERE (a = 1) AND (b = 2).
The patch splits this into two lists:

index-only clauses: (a=1)
clauses: (a=1) AND (b=2)

So we evaluate (a=1) first, and then we fetch the heap tuple and check
"clauses" again, which however includes the (a=1) again. For cheap
clauses (or when (a=1) eliminated a lot of tuples using just the index)
that's not a big deal, but for expensive clauses it might hurt.

It's fixable, we'd just need to keep two versions of the "clauses" list,
one for IOS mode (when index-only clauses were checked) and a complete
one when we need to check all clauses.
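
For instance, a minimal setup matching the example above (table and index
names are only for illustration):

  CREATE TABLE t (a int, b int);
  CREATE INDEX ON t (a);
  -- (a = 1) can be checked on the index tuple when the heap page is
  -- all-visible; (b = 2) always requires fetching the heap tuple:
  SELECT * FROM t WHERE a = 1 AND b = 2;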

>> Most of the patch is pretty mechanical - particularly the planning part
>> is about identifying filters that can be evaluated on the index tuple,
>> and that code was mostly shamelessly copied from index-only scan.
>>
>> The matching of filters to index is done in check_index_filter(), and
>> it's simpler than match_clause_to_indexcol() as it does not need to
>> consider operators etc. (I think). But maybe it should be careful about
>> other things, not sure.
> 
> This would end up 

Re: ProcessStartupPacket(): database_name and user_name truncation

2023-06-21 Thread Nathan Bossart
On Wed, Jun 21, 2023 at 09:43:38AM -0400, Tom Lane wrote:
> Kyotaro Horiguchi  writes:
>> IMHO, I'm not sure we should allow connections without the exact name
>> being provided. In that sense, I think we might want to consider
>> outright rejecting the establishment of a connection when the given
>> database name doesn't fit the startup packet, since the database with
>> the exact given name cannot be found.
> 
> I think I agree.  I don't like the proposed patch at all, because it's
> making completely unsupportable assumptions about what encoding the
> names are given in.  Simply failing to match when a name is overlength
> sounds safer.

+1.  Even if these assumptions were supportable, IMHO it's probably not
worth the added complexity to keep the truncation consistent with CREATE
ROLE/DATABASE.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: bgwriter doesn't flush WAL stats

2023-06-21 Thread Matthias van de Meent
On Wed, 21 Jun 2023 at 13:04, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> I was trying to add WAL stats to pg_stat_io. While doing that I was comparing 
> pg_stat_wal and pg_stat_io's WAL stats and there was some inequality between 
> the total number of WALs. I found that the difference comes from bgwriter's 
> WALs. bgwriter generates WAL but it doesn't flush them because the 
> pgstat_report_wal() function isn't called in bgwriter. I attached a small 
> patch for calling the pgstat_report_wal() function in bgwriter.
>
> bgwriter generates WAL by calling functions in this order:
> bgwriter.c -> BackgroundWriterMain() -> BgBufferSync() -> SyncOneBuffer() -> 
> FlushBuffer() -> XLogFlush() -> XLogWrite()

I was quite confused here, as XLogWrite() does not generate any WAL;
it only writes existing WAL from buffers to disk.
In a running PostgreSQL instance, WAL is only generated through
XLogInsert(xloginsert.c) and serialized / written to buffers in its
call to XLogInsertRecord(xlog.c); XLogFlush and XLogWrite are only
responsible for writing those buffers to disk.
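
For reference, generating WAL normally follows a pattern like this (an
illustrative sketch of the XLogInsert() API, not code taken from bgwriter):

  XLogBeginInsert();
  XLogRegisterData((char *) &xlrec, sizeof(xlrec));
  recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INSERT);
  PageSetLSN(page, recptr);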

The only path that I see in XLogWrite() that could potentially put
anything into WAL is through RequestCheckpoint(), but that only writes
out a checkpoint when it is not in a postmaster environment - in all
other cases it will wake up the checkpointer and wait for that
checkpoint to finish.

I also got confused with your included views; they're not included in
the patch and the current master branch doesn't emit object=wal, so I
can't really check that the patch works as intended.

But on the topic of reporting the WAL stats in bgwriter; that seems
like a good idea to fix, yes.

+1

Kind regards,

Matthias van de Meent
Neon, Inc.




Re: Allow pg_archivecleanup to remove backup history files

2023-06-21 Thread torikoshia

On 2023-06-21 11:59, Kyotaro Horiguchi wrote:

At Tue, 20 Jun 2023 22:27:36 +0900, torikoshia
 wrote in

On 2023-06-19 14:37, Michael Paquier wrote:
> On Mon, Jun 19, 2023 at 11:24:29AM +0900, torikoshia wrote:
>> Thanks, now I understand what you meant.
> If I may ask, why is the refactoring of 0003 done after the feature in
> 0002?  Shouldn't the order be reversed?  That would make for a cleaner
> git history.
> --
> Michael

Agreed.
Reversed the order of patches 0002 and 0003.


Yeah, that is a possible division. However, I meant that we have room
to refactor and decrease the nesting level even further, considering
that 0003 already does this to some extent, when I suggested it. In
that sense, moving the nest-reduction part of 0003 into 0002 makes it
possible for us to focus on the point of this patch.


Thanks for the comment, it seems better than the v9 patch.


What do you think about the attached version?


--v10-0002-Preliminary-refactoring-for-a-subsequent-patch.patch
+* Also we skip backup history files when --clean-backup-history
+* is not specified.
+*/
+   if (!IsXLogFileName(walfile) && !IsPartialXLogFileName(walfile))
+   continue;

I think this comment should be located in 0003.

Attached updated patches.


--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

From eb99f581b2990c80329f2baf96ed4d5e9c00dda5 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Wed, 21 Jun 2023 21:42:08 +0900
Subject: [PATCH v11 1/3] Introduce pg_archivecleanup into getopt_long

This patch is a preliminary step to add an easy-to-understand option
to delete backup history files, but it also adds long options to
the existing options.
---
 doc/src/sgml/ref/pgarchivecleanup.sgml|  5 -
 src/bin/pg_archivecleanup/pg_archivecleanup.c | 22 +--
 2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/ref/pgarchivecleanup.sgml b/doc/src/sgml/ref/pgarchivecleanup.sgml
index 635e7c7685..09991c2fcd 100644
--- a/doc/src/sgml/ref/pgarchivecleanup.sgml
+++ b/doc/src/sgml/ref/pgarchivecleanup.sgml
@@ -95,6 +95,7 @@ pg_archivecleanup:  removing file "archive/00010037000E"
 
  
   -d
+  --debug
   

 Print lots of debug logging output on stderr.
@@ -104,6 +105,7 @@ pg_archivecleanup:  removing file "archive/00010037000E"
 
  
   -n
+  --dry-run
   

 Print the names of the files that would have been removed on stdout (performs a dry run).
@@ -122,7 +124,8 @@ pg_archivecleanup:  removing file "archive/00010037000E"
  
 
  
-  -x extension
+  -x extension
+  --strip-extension=extension
   

 Provide an extension
diff --git a/src/bin/pg_archivecleanup/pg_archivecleanup.c b/src/bin/pg_archivecleanup/pg_archivecleanup.c
index 7726d05149..fc0dca9856 100644
--- a/src/bin/pg_archivecleanup/pg_archivecleanup.c
+++ b/src/bin/pg_archivecleanup/pg_archivecleanup.c
@@ -17,7 +17,7 @@
 
 #include "access/xlog_internal.h"
 #include "common/logging.h"
-#include "pg_getopt.h"
+#include "getopt_long.h"
 
 const char *progname;
 
@@ -252,11 +252,13 @@ usage(void)
 	printf(_("Usage:\n"));
 	printf(_("  %s [OPTION]... ARCHIVELOCATION OLDESTKEPTWALFILE\n"), progname);
 	printf(_("\nOptions:\n"));
-	printf(_("  -d generate debug output (verbose mode)\n"));
-	printf(_("  -n dry run, show the names of the files that would be removed\n"));
-	printf(_("  -V, --version  output version information, then exit\n"));
-	printf(_("  -x EXT clean up files if they have this extension\n"));
-	printf(_("  -?, --help show this help, then exit\n"));
+	printf(_("  -d, --debug generate debug output (verbose mode)\n"));
+	printf(_("  -n, --dry-run   dry run, show the names of the files that would be\n"
+			 "  removed\n"));
+	printf(_("  -V, --version   output version information, then exit\n"));
+	printf(_("  -x, --strip-extension=EXT   strip this extention before identifying files for\n"
+			 "  clean up\n"));
+	printf(_("  -?, --help  show this help, then exit\n"));
 	printf(_("\n"
 			 "For use as archive_cleanup_command in postgresql.conf:\n"
 			 "  archive_cleanup_command = 'pg_archivecleanup [OPTION]... ARCHIVELOCATION %%r'\n"
@@ -274,6 +276,12 @@ usage(void)
 int
 main(int argc, char **argv)
 {
+	static struct option long_options[] = {
+		{"debug", no_argument, NULL, 'd'},
+		{"dry-run", no_argument, NULL, 'n'},
+		{"strip-extension", required_argument, NULL, 'x'},
+		{NULL, 0, NULL, 0}
+	};
 	int			c;
 
 	pg_logging_init(argv[0]);
@@ -294,7 +302,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt(argc, argv, "dnx:")) != -1)
+	while ((c = getopt_long(argc, argv, "dnx:", long_options, NULL)) != -1)
 	{
 		switch (c)
 		{
-- 
2.39.2

From d760567e6974a5b3618aba228f92c188529f6a75 Mon Sep 17 

Re: pgindent vs. pgperltidy command-line arguments

2023-06-21 Thread Peter Eisentraut

On 21.06.23 13:35, Andrew Dunstan wrote:
If not, part of my patch would still be useful.  Maybe I should commit 
my posted patch for PG16, to keep consistency with pgindent, and then 
your work would presumably be considered for PG17.


That sounds like a good plan.


done





Re: ProcessStartupPacket(): database_name and user_name truncation

2023-06-21 Thread Drouvot, Bertrand

Hi,

On 6/21/23 3:43 PM, Tom Lane wrote:

Kyotaro Horiguchi  writes:

At Wed, 21 Jun 2023 09:43:50 +0200, "Drouvot, Bertrand" 
 wrote in

Trying to connect with the 64 bytes name:
$ psql -d 
psql: error: connection to server on socket "/tmp/.s.PGSQL.55448"
failed: FATAL: database "äää" does not
exist



IMHO, I'm not sure we should allow connections without the exact name
being provided. In that sense, I think we might want to consider
outright rejecting the establishment of a connection when the given
database name doesn't fit the startup packet, since the database with
the exact given name cannot be found.


I think I agree.  I don't like the proposed patch at all, because it's
making completely unsupportable assumptions about what encoding the
names are given in.  Simply failing to match when a name is overlength
sounds safer.



Yeah, that's another and "cleaner" option.

I'll propose a patch to make it fail even for the non-multibyte case then
(so that multibyte and non-multibyte names behave the same, i.e., failing
when an overlength name is detected).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Stack overflow issue

2023-06-21 Thread Egor Chindyaskin
Hello! In continuation of the topic, I would like to suggest a solution.
This patch adds several checks to the vulnerable functions above.

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d85e313908..102d0e1574 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -3043,6 +3043,9 @@ CommitTransactionCommand(void)
 	TransactionState s = CurrentTransactionState;
 	SavedTransactionCharacteristics savetc;
 
+	/* since this function recurses, it could be driven to stack overflow */
+	check_stack_depth();
+
 	SaveTransactionCharacteristics();
 
 	switch (s->blockState)
@@ -5500,6 +5503,9 @@ ShowTransactionStateRec(const char *str, TransactionState s)
 {
 	StringInfoData buf;
 
+	/* since this function recurses, it could be driven to stack overflow */
+	check_stack_depth();
+
 	initStringInfo(&buf);
 
 	if (s->nChildXids > 0)
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 7acf654bf8..58a1d70b8a 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -76,6 +76,7 @@
 #include "commands/trigger.h"
 #include "commands/typecmds.h"
 #include "funcapi.h"
+#include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "parser/parsetree.h"
 #include "rewrite/rewriteRemove.h"
@@ -524,6 +525,11 @@ findDependentObjects(const ObjectAddress *object,
 	if (stack_address_present_add_flags(object, objflags, stack))
 		return;
 
+	/* since this function recurses, it could be driven to stack overflow,
+	 * because of the deep dependency tree, not only due to dependency loops.
+	 */
+	check_stack_depth();
+
 	/*
 	 * It's also possible that the target object has already been completely
 	 * processed and put into targetObjects.  If so, again we just add the
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 4f006820b8..a88550913d 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -552,6 +552,9 @@ CheckAttributeType(const char *attname,
 	char		att_typtype = get_typtype(atttypid);
 	Oid			att_typelem;
 
+	/* since this function recurses, it could be driven to stack overflow */
+	check_stack_depth();
+
 	if (att_typtype == TYPTYPE_PSEUDO)
 	{
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7c697a285b..f22da79c65 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6684,6 +6684,9 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	TupleDesc	tupdesc;
 	FormData_pg_attribute *aattr[] = {};
 
+	/* since this function recurses, it could be driven to stack overflow */
+	check_stack_depth();
+
 	/* At top level, permission check was done in ATPrepCmd, else do it */
 	if (recursing)
 		ATSimplePermissions((*cmd)->subtype, rel, ATT_TABLE | ATT_FOREIGN_TABLE);
@@ -8383,6 +8386,10 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 
 	/* Initialize addrs on the first invocation */
 	Assert(!recursing || addrs != NULL);
+
+	/* since this function recurses, it could be driven to stack overflow */
+	check_stack_depth();
+
 	if (!recursing)
 		addrs = new_object_addresses();
 
@@ -10839,6 +10846,9 @@ ATExecAlterConstrRecurse(Constraint *cmdcon, Relation conrel, Relation tgrel,
 	Oid			refrelid;
 	bool		changed = false;
 
+	/* since this function recurses, it could be driven to stack overflow */
+	check_stack_depth();
+
 	currcon = (Form_pg_constraint) GETSTRUCT(contuple);
 	conoid = currcon->oid;
 	refrelid = currcon->confrelid;
@@ -11839,6 +11849,9 @@ ATExecDropConstraint(Relation rel, const char *constrName,
 	bool		is_no_inherit_constraint = false;
 	char		contype;
 
+	/* since this function recurses, it could be driven to stack overflow */
+	check_stack_depth();
+
 	/* At top level, permission check was done in ATPrepCmd, else do it */
 	if (recursing)
 		ATSimplePermissions(AT_DropConstraint, rel, ATT_TABLE | ATT_FOREIGN_TABLE);
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index aa584848cf..77a5eb526c 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -2318,6 +2318,10 @@ static Node *
 eval_const_expressions_mutator(Node *node,
 			   eval_const_expressions_context *context)
 {
+
+	/* since this function recurses, it could be driven to stack overflow */
+	check_stack_depth();
+
 	if (node == NULL)
 		return NULL;
 	switch (nodeTag(node))
diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index b561f0e7e8..dc7ab387ea 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -1232,6 +1232,9 @@ executeBoolItem(JsonPathExecContext *cxt, JsonPathItem *jsp,
 	JsonPathBool res;
 	JsonPathBool res2;
 
+	/* since this function recurses, it could be driven to stack overflow */
+	check_stack_depth();
+
 	if (!canHaveNext && jspHasNext(jsp))
 		elog(ERROR, "boolean jsonpath item 

Re: ProcessStartupPacket(): database_name and user_name truncation

2023-06-21 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Wed, 21 Jun 2023 09:43:50 +0200, "Drouvot, Bertrand" 
>  wrote in 
>> Trying to connect with the 64 bytes name:
>> $ psql -d 
>> psql: error: connection to server on socket "/tmp/.s.PGSQL.55448"
>> failed: FATAL: database "äää" does not
>> exist

> IMHO, I'm not sure we should allow connections without the exact name
> being provided. In that sense, I think we might want to consider
> outright rejecting the establishment of a connection when the given
> database name doesn't fit the startup packet, since the database with
> the exact given name cannot be found.

I think I agree.  I don't like the proposed patch at all, because it's
making completely unsupportable assumptions about what encoding the
names are given in.  Simply failing to match when a name is overlength
sounds safer.

(Our whole story about what is the encoding of names in shared catalogs
is a mess.  But this particular point doesn't seem like the place to
start if you want to clean that up.)

regards, tom lane




EBCDIC sorting as a use case for ICU rules

2023-06-21 Thread Daniel Verite
 Hi,

In the "Order changes in PG16 since ICU introduction" discussion, one
sub-thread [1] was about having a credible use case for tailoring collations
with custom rules, a new feature in v16.

At a conference this week I was asked whether ICU could be made to
sort like EBCDIC [2]. It turns out it had already been asked on
-general a few years ago [3] with no satisfactory answer at the time,
and that it can be implemented with rules in v16.

A collation like the following seems to work (the rule simply enumerates
US-ASCII letters in the EBCDIC alphabet order, with adequate quoting)

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$&'
'<'.'<'<'<'('<'+'<\|<'&'<'!'<'$'<'*'<')'<';'<'-'<'/'<','<'%'<'_'<'>'<'?'<'`'<':'<'#'<'@'<\'<'='<'"'?`:#@'="abcdefghijklmnopqr~stuvwxyz[^]{ABCDEFGHI}JKLMNOPQR\ST
UVWXYZ0123456789

Maybe this example could be added to the documentation except for
the problem that the rule is very long and dollar-quoting cannot be split
into several lines. Literals enclosed by single quotes can be split that
way, but would require escaping the single quotes in the rule, which
would lead to scary-looking over-quoted contents.
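
For what it's worth, the line-splitting mechanism itself is simple, since
adjacent single-quoted constants separated by a newline are concatenated;
e.g. with a trivial placeholder rule (not the EBCDIC rule):

  CREATE COLLATION demo (provider = 'icu', locale = 'und',
      rules = '&b < a '
              '< c');

The difficulty is only the escaping of the many single quotes the EBCDIC
rule itself needs.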

I'm open to suggestions on whether this EBCDIC example is worth being in the
doc in some form or putting this in the wiki would be good enough.



[1]
https://www.postgresql.org/message-id/flat/a28aba5fa6bf1abfff96e40b6d6acff8412edb15.camel%40j-davis.com

[2] https://en.wikipedia.org/wiki/EBCDIC

[3]
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F84A7AD%40G01JPEXMBYT05


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Can JoinFilter condition be pushed down into IndexScan?

2023-06-21 Thread Tomas Vondra


On 6/21/23 05:37, Bəxtiyar Neyman wrote:
> I define a table user_ranks as such:
> 
> CREATE TABLE user_ranks (
>   id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
>   rank INTEGER NOT NULL,
>   CONSTRAINT "by (rank, id)" UNIQUE (rank, id)
> );
> 
> INSERT INTO user_ranks (user_id, rank) SELECT generate_series(1, 1),
> generate_series(1, 1);
> 

This doesn't work, the INSERT needs to only insert into (rank).

> Here's a query I'd like to optimize:
> 
> explain (analyze,verbose)
> SELECT
>   t3_0."id" AS "id",
>   t3_0."rank" AS "rank"
> FROM
>   LATERAL (
>     SELECT
>       t4_0."rank" AS "rank"
>     FROM
>       user_ranks AS t4_0
>     WHERE
>       (t4_0."id" = 4732455)
>   ) AS t3_1
>   INNER JOIN user_ranks AS t3_0 ON true
> WHERE
>   (
>     ((t3_0."rank", t3_0."id") <= (t3_1."rank", 4732455))
>     AND true
>   )
> ORDER BY
>   t3_0."rank" DESC,
>   t3_0."id" DESC
> LIMIT
>   10
> 

Not sure why you make the query unnecessarily complicated - the LATERAL
is pointless I believe, and the "AND true" just makes it harder to read.
Let's rewrite it like this to make discussion easier:

explain (analyze,verbose)
SELECT
  t3_0."id" AS "id",
  t3_0."rank" AS "rank"
FROM
  user_ranks AS t3_1
  INNER JOIN user_ranks AS t3_0
  ON ((t3_0."rank", t3_0."id") <= (t3_1."rank", t3_1."id"))
WHERE
  t3_1."id" = 4732455
ORDER BY
  t3_0."rank" DESC,
  t3_0."id" DESC
LIMIT
  10

Same query, but perhaps easier to read.

> It compiles to the following plan:
> 
>  Limit  (cost=0.56..250.94 rows=10 width=12) (actual time=8.078..8.078
> rows=1 loops=1)
>    Output: t3_0.id , t3_0.rank
>    ->  Nested Loop  (cost=0.56..41763.27 rows=1668 width=12) (actual
> time=8.075..8.076 rows=1 loops=1)
>          Output: t3_0.id , t3_0.rank
>          Inner Unique: true
>          Join Filter: (ROW(t3_0.rank, t3_0.id ) <=
> ROW(t4_0.rank, 4732455))
>          Rows Removed by Join Filter: 5002
>          ->  Index Only Scan Backward using "by (rank,id)" on
> public.user_ranks t3_0  (cost=0.28..163.33 rows=5003 width=12) (actual
> time=0.023..0.638 rows=5003 loops=1)
>                Output: t3_0.rank, t3_0.id 
>                Heap Fetches: 0
>          ->  Index Scan using "by id" on public.user_ranks t4_0
>  (cost=0.28..8.30 rows=1 width=8) (actual time=0.001..0.001 rows=1
> loops=5003)
>                Output: t4_0.id , t4_0.rating, t4_0.rank
>                Index Cond: (t4_0.id  = 4732455)
> 
> As you can see, there are a lot of rows returned by t3_0, which are then
> filtered by Join Filter. But it would have been better if instead of the
> filter, the  t3_0 table would have an Index Cond. Similar to how it
> happens when a correlated subquery is used (or a CTE)
> 
> explain (analyze,verbose)
> SELECT
>   t3_0."id" AS "id",
>   t3_0."rank" AS "rank"
> FROM
>   user_ranks AS t3_0
> WHERE
>   (
>     ((t3_0."rank", t3_0."id") <= (
>     SELECT
>       t4_0."rank" AS "rank",
>       t4_0."id" AS "id"
>     FROM
>       user_ranks AS t4_0
>     WHERE
>       (t4_0."id" = 4732455)
>     ))
>     AND true
>   )
> ORDER BY
>   t3_0."rank" DESC,
>   t3_0."id" DESC
> LIMIT
>   10
> 
>  Limit  (cost=8.58..8.95 rows=10 width=12) (actual time=0.062..0.064
> rows=1 loops=1)
>    Output: t3_0.id , t3_0.rank
>    InitPlan 1 (returns $0,$1)
>      ->  Index Scan using "by id" on public.user_ranks t4_0
>  (cost=0.28..8.30 rows=1 width=12) (actual time=0.024..0.025 rows=1 loops=1)
>            Output: t4_0.rank, t4_0.id 
>            Index Cond: (t4_0.id  = 4732455)
>    ->  Index Only Scan Backward using "by (rank,id)" on
> public.user_ranks t3_0  (cost=0.28..61.47 rows=1668 width=12) (actual
> time=0.061..0.062 rows=1 loops=1)
>          Output: t3_0.id , t3_0.rank
>          Index Cond: (ROW(t3_0.rank, t3_0.id ) <=
> ROW($0, $1))
>          Heap Fetches: 0
> 
> 
> I'm an opposite of a PostgreSQL expert, but it was surprising to me to
> see that a correlated subquery behaves better than a join. Is this
> normal? Is it something worth fixing/easy to fix?
> 

Because those queries are not doing the same thing. In the first query
you sort by t3_0 columns, while the "id = 4732455" condition is on the
other table. And so it can't use the index scan for sorting.

While in the second query it can do that, and it doesn't need to do the
explicit sort (which needs to fetch all the rows etc.). If you alter the
first query to do

  ORDER BY
t3_1."rank" DESC,
t3_1."id" DESC

it'll use the same plan as the second query. Well, not exactly the same,
but much closer to it.
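
That is, the first query rewritten with that ORDER BY (using the simplified
form from above):

  SELECT t3_0."id", t3_0."rank"
  FROM user_ranks AS t3_1
  INNER JOIN user_ranks AS t3_0
    ON ((t3_0."rank", t3_0."id") <= (t3_1."rank", t3_1."id"))
  WHERE t3_1."id" = 4732455
  ORDER BY t3_1."rank" DESC, t3_1."id" DESC
  LIMIT 10;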


Nevertheless, these example queries have other estimation issues, which
might result in poor plan choices. There's no row for (id = 4732455),
and the cross-table inequality estimate is just some default estimate
(33%). In reality, this produces no rows.

Secondly, for LIMIT, the cost is 

Re: Support logical replication of DDLs

2023-06-21 Thread Jelte Fennema
(to be clear I only skimmed the end of this thread and did not look at
all the previous messages)

I took a quick look at the first patch (about deparsing table ddl) and
it seems like this would also be very useful for a SHOW CREATE TABLE,
like command. Which was suggested in this thread:
https://www.postgresql.org/message-id/flat/CAFEN2wxsDSSuOvrU03CE33ZphVLqtyh9viPp6huODCDx2UQkYA%40mail.gmail.com

On that thread I sent a patch with Citus's CREATE TABLE deparsing as a
starting point. It looks like this thread went quite a different route
with some JSON intermediary representation. Still, it might be useful
to look at the patch with Citus's logic for some inspiration or to copy
things from. I re-attached that patch here for ease of finding it.
From ddb375afc74339bd0eaf0c272d06805637fd85cc Mon Sep 17 00:00:00 2001
From: Jelte Fennema 
Date: Mon, 5 Jun 2023 12:32:12 +0200
Subject: [PATCH v1] Add initial code to generate table definition SQL

This patch adds various functions which generate table DDL statements
based on a table oid. These functions originated in Citus source code,
but are now contributed to upstream Postgres by means of this patch.
It currently contains no wrappers around these C functions that can be
called from SQL.
---
 src/backend/utils/adt/acl.c   |2 +-
 src/backend/utils/adt/ruleutils.c | 1187 -
 src/include/utils/acl.h   |1 +
 src/include/utils/ruleutils.h |   51 ++
 src/tools/pgindent/typedefs.list  |2 +
 5 files changed, 1240 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/acl.c b/src/backend/utils/adt/acl.c
index c660fd3e701..abe329d4c83 100644
--- a/src/backend/utils/adt/acl.c
+++ b/src/backend/utils/adt/acl.c
@@ -1702,7 +1702,7 @@ convert_any_priv_string(text *priv_type_text,
 }
 
 
-static const char *
+const char *
 convert_aclright_to_string(int aclright)
 {
 	switch (aclright)
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index d3a973d86b7..30bb4e7ffd9 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -24,12 +24,16 @@
 #include "access/relation.h"
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/toast_compression.h"
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_attrdef.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_constraint.h"
 #include "catalog/pg_depend.h"
+#include "catalog/pg_extension.h"
+#include "catalog/pg_foreign_data_wrapper.h"
 #include "catalog/pg_language.h"
 #include "catalog/pg_opclass.h"
 #include "catalog/pg_operator.h"
@@ -39,10 +43,13 @@
 #include "catalog/pg_trigger.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
+#include "commands/extension.h"
+#include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "common/keywords.h"
 #include "executor/spi.h"
 #include "funcapi.h"
+#include "foreign/foreign.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -59,6 +66,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "rewrite/rewriteManip.h"
 #include "rewrite/rewriteSupport.h"
+#include "utils/acl.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -507,7 +515,6 @@ static Node *processIndirection(Node *node, deparse_context *context);
 static void printSubscripts(SubscriptingRef *sbsref, deparse_context *context);
 static char *get_relation_name(Oid relid);
 static char *generate_relation_name(Oid relid, List *namespaces);
-static char *generate_qualified_relation_name(Oid relid);
 static char *generate_function_name(Oid funcid, int nargs,
 	List *argnames, Oid *argtypes,
 	bool has_variadic, bool *use_variadic_p,
@@ -521,6 +528,10 @@ static void get_reloptions(StringInfo buf, Datum reloptions);
 
 #define only_marker(rte)  ((rte)->inh ? "" : "ONLY ")
 
+#define CREATE_SEQUENCE_COMMAND \
+	"CREATE %sSEQUENCE IF NOT EXISTS %s AS %s INCREMENT BY " INT64_FORMAT \
+	" MINVALUE " INT64_FORMAT " MAXVALUE " INT64_FORMAT \
+	" START WITH " INT64_FORMAT " CACHE " INT64_FORMAT " %sCYCLE"
 
 /* --
  * pg_get_ruledef		- Do it all and return a text
@@ -12110,7 +12121,7 @@ generate_relation_name(Oid relid, List *namespaces)
  *
  * As above, but unconditionally schema-qualify the name.
  */
-static char *
+char *
 generate_qualified_relation_name(Oid relid)
 {
 	HeapTuple	tp;
@@ -12606,3 +12617,1175 @@ get_range_partbound_string(List *bound_datums)
 
 	return buf->data;
 }
+
+/*
+ * pg_get_extensiondef_string finds the foreign data wrapper that corresponds to
+ * the given foreign tableId, and checks if an extension owns this foreign data
+ * wrapper. If it does, the function returns the extension's definition. If not,
+ * the function returns null.
+ */
+char *
+pg_get_extensiondef_string(Oid tableRelationId)
+{
+	ForeignTable *foreignTable = GetForeignTable(tableRelationId);
+	ForeignServer *server 

Opportunistically pruning page before update

2023-06-21 Thread James Coleman
Hello,

While at PGCon I was chatting with Andres (and I think Peter G. and a
few others who I can't remember at the moment, apologies) and Andres
noted that while we opportunistically prune a page when inserting a
tuple (before deciding we need a new page) we don't do the same for
updates.

Attached is a patch series to do the following:

0001: Make it possible to call heap_page_prune_opt already holding an
exclusive lock on the buffer.
0002: Opportunistically prune pages on update when the current tuple's
page has no free space. If this frees up enough space, then we
continue to put the new tuple on that page; if not, then we take the
existing code path and get a new page.
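
Very roughly, 0002 amounts to something like the following sketch (not the
actual patch; it assumes we already hold an exclusive lock on the buffer of
the old tuple's page, which is what 0001 makes possible):

  if (PageGetHeapFreeSpace(page) < newtupsize)
  {
      /* try to reclaim space before falling back to a new page */
      heap_page_prune_opt(relation, buffer);
  }
  if (PageGetHeapFreeSpace(page) >= newtupsize)
      use_same_page = true;   /* otherwise take the existing path and get a new page */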

One would plausibly expect the following improvements:
- Reduced table bloat
- Increased HOT update rate
- Improved performance on updates

I started to work on benchmarking this, but haven't had time to devote
properly to that, so I'm wondering if there's anyone who might be
interested in collaborating on that part.

Other TODOs:
- Audit other callers of RelationSetTargetBlock() to ensure they don't
hold pointers into the page.

Regards,
James Coleman


v1-0001-Allow-getting-lock-before-calling-heap_page_prune.patch
Description: Binary data


v1-0002-Opportunistically-prune-to-avoid-building-a-new-p.patch
Description: Binary data


Re: Use of additional index columns in rows filtering

2023-06-21 Thread James Coleman
Hello,

I've cc'd Jeff Davis on this due to a conversation we had at PGCon
about applying filters on index tuples during index scans.

I've also cc'd Andres Freund because I think this relates to his
musing in [1] that:
> One thing I have been wondering around this is whether we should not have
> split the code for IOS and plain indexscans...

I think I remember Peter Geoghegan also wondering (I can't remember if
this was in conversation at PGCon about index skip scans or in a
hackers thread) about how we compose these various index scan
optimizations.

To be certain this is probably a thing to tackle as a follow-on to
this patch, but it does seem to me that what we are implicitly
realizing is that (unlike with bitmap scans, I think) it doesn't
really make a lot of conceptual sense to have index only scans be a
separate node from index scans. Instead it's likely better to consider
it an optimization to index scans that can dynamically kick in when
it's able to be of use. That would allow it to compose with e.g.
prefetching in the aforelinked thread. At the very least we would need
pragmatic (e.g., cost of dynamically applying optimizations) rather
than conceptual reasons to argue they should continue to be separate.

Apologies for that lengthy preamble; on to the patch under discussion:

On Thu, Jun 8, 2023 at 1:34 PM Tomas Vondra
 wrote:
>
> Hi,
>
> I took a stab at this and implemented the trick with the VM - during
> index scan, we also extract the filters that only need the indexed
> attributes (just like in IOS). And then, during the execution we:
>
>   1) scan the index using the scan keys (as before)
>
>   2) if the heap page is all-visible, we check the new filters that can
>  be evaluated on the index tuple
>
>   3) fetch the heap tuple and evaluate the filters

Thanks for working on this; I'm excited about this class of work
(along with index prefetching and other ideas I think there's a lot of
potential for improving index scans).

> This is pretty much exactly the same thing we do for IOS, so I don't see
> why this would be incorrect while IOS is correct.
>
> This also adds "Index Filter" to explain output, to show which filters
> are executed on the index tuple (at the moment the filters are a subset
> of "Filter"), so if the index tuple matches we'll execute them again on
> the heap tuple. I guess that could be fixed by having two "filter"
> lists, depending on whether we were able to evaluate the index filters.

Given that we show index filters and heap filters separately it seems
like we might want to maintain separate instrumentation counts of how
many tuple were filtered by each set of filters.

> Most of the patch is pretty mechanical - particularly the planning part
> is about identifying filters that can be evaluated on the index tuple,
> and that code was mostly shamelessly copied from index-only scan.
>
> The matching of filters to index is done in check_index_filter(), and
> it's simpler than match_clause_to_indexcol() as it does not need to
> consider operators etc. (I think). But maybe it should be careful about
> other things, not sure.

This would end up requiring some refactoring of the existing index
matching code (or alternative caching on IndexOptInfo), but
match_filter_to_index() calling check_index_filter() results in
constructing a bitmapset of index columns for every possible filter,
which seems wasteful (I recognize this is a bit of a proof-of-concept
level v1).

> The actual magic happens in IndexNext (nodeIndexscan.c). As mentioned
> earlier, the idea is to check VM and evaluate the filters on the index
> tuple if possible, similar to index-only scans. Except that we then have
> to fetch the heap tuple. Unfortunately, this means the code can't use
> index_getnext_slot() anymore. Perhaps we should invent a new variant
> that'd allow evaluating the index filters in between.

It does seem there are some refactoring opportunities there.

> With the patch applied, the query plan changes from:
>
> ...
>
> to
>
>  QUERY PLAN
> ---
>  Limit  (cost=0.42..3662.15 rows=1 width=12)
> (actual time=13.663..13.667 rows=0 loops=1)
>Buffers: shared hit=544
>->  Index Scan using t_a_include_b on t
>(cost=0.42..3662.15 rows=1 width=12)
>(actual time=13.659..13.660 rows=0 loops=1)
>  Index Cond: (a > 100)
>  Index Filter: (b = 4)
>  Rows Removed by Index Recheck: 197780
>  Filter: (b = 4)
>  Buffers: shared hit=544
>  Planning Time: 0.105 ms
>  Execution Time: 13.690 ms
> (10 rows)
>
> ...

I did also confirm that this properly identifies cases Jeff had
mentioned to me like "Index Filter: (((a * 2) > 50) AND ((b % 10)
= 4))".

I noticed also you still had questions/TODOs about handling index
scans for join clauses.

Regards,
James Coleman

1: 

Re: Making empty Bitmapsets always be NULL

2023-06-21 Thread Ranier Vilela
Hi,

David Rowley wrote:
>I've adjusted the attached patch to do that.

I think there was room for more improvements.

1. bms_member_index's Bitmapset parameter can be const.
2. Only compute BITNUM when necessary.
3. Avoid enlargement when nwords is equal to wordnum.
That can save cycles in corner cases.
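
For point 1, the change would presumably just be a const qualifier on the
parameter (based on the current declaration; purely illustrative):

  /* current */
  int bms_member_index(Bitmapset *a, int x);
  /* proposed */
  int bms_member_index(const Bitmapset *a, int x);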

Just for convenience I made a new version of the patch,
if you want to use it.

regards,
Ranier Vilela


remove_zero_trailing_words_from_bitmapsets_v5.patch
Description: Binary data


Can JoinFilter condition be pushed down into IndexScan?

2023-06-21 Thread Bəxtiyar Neyman
I define a table user_ranks as such:

CREATE TABLE user_ranks (
  id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  rank INTEGER NOT NULL,
  CONSTRAINT "by (rank, id)" UNIQUE (rank, id)
);

INSERT INTO user_ranks (user_id, rank) SELECT generate_series(1, 1),
generate_series(1, 1);

Here's a query I'd like to optimize:

explain (analyze,verbose)
SELECT
  t3_0."id" AS "id",
  t3_0."rank" AS "rank"
FROM
  LATERAL (
SELECT
  t4_0."rank" AS "rank"
FROM
  user_ranks AS t4_0
WHERE
  (t4_0."id" = 4732455)
  ) AS t3_1
  INNER JOIN user_ranks AS t3_0 ON true
WHERE
  (
((t3_0."rank", t3_0."id") <= (t3_1."rank", 4732455))
AND true
  )
ORDER BY
  t3_0."rank" DESC,
  t3_0."id" DESC
LIMIT
  10

It compiles to the following plan:

 Limit  (cost=0.56..250.94 rows=10 width=12) (actual time=8.078..8.078
rows=1 loops=1)
   Output: t3_0.id, t3_0.rank
   ->  Nested Loop  (cost=0.56..41763.27 rows=1668 width=12) (actual
time=8.075..8.076 rows=1 loops=1)
 Output: t3_0.id, t3_0.rank
 Inner Unique: true
 Join Filter: (ROW(t3_0.rank, t3_0.id) <= ROW(t4_0.rank, 4732455))
 Rows Removed by Join Filter: 5002
 ->  Index Only Scan Backward using "by (rank,id)" on
public.user_ranks t3_0  (cost=0.28..163.33 rows=5003 width=12) (actual
time=0.023..0.638 rows=5003 loops=1)
   Output: t3_0.rank, t3_0.id
   Heap Fetches: 0
 ->  Index Scan using "by id" on public.user_ranks t4_0
 (cost=0.28..8.30 rows=1 width=8) (actual time=0.001..0.001 rows=1
loops=5003)
   Output: t4_0.id, t4_0.rating, t4_0.rank
   Index Cond: (t4_0.id = 4732455)

As you can see, there are a lot of rows returned by t3_0, which are then
filtered by Join Filter. But it would have been better if instead of the
filter, the  t3_0 table would have an Index Cond. Similar to how it happens
when a correlated subquery is used (or a CTE)

explain (analyze,verbose)
SELECT
  t3_0."id" AS "id",
  t3_0."rank" AS "rank"
FROM
  user_ranks AS t3_0
WHERE
  (
((t3_0."rank", t3_0."id") <= (
SELECT
  t4_0."rank" AS "rank",
  t4_0."id" AS "id"
FROM
  user_ranks AS t4_0
WHERE
  (t4_0."id" = 4732455)
))
AND true
  )
ORDER BY
  t3_0."rank" DESC,
  t3_0."id" DESC
LIMIT
  10

 Limit  (cost=8.58..8.95 rows=10 width=12) (actual time=0.062..0.064 rows=1
loops=1)
   Output: t3_0.id, t3_0.rank
   InitPlan 1 (returns $0,$1)
 ->  Index Scan using "by id" on public.user_ranks t4_0
 (cost=0.28..8.30 rows=1 width=12) (actual time=0.024..0.025 rows=1 loops=1)
   Output: t4_0.rank, t4_0.id
   Index Cond: (t4_0.id = 4732455)
   ->  Index Only Scan Backward using "by (rank,id)" on public.user_ranks
t3_0  (cost=0.28..61.47 rows=1668 width=12) (actual time=0.061..0.062
rows=1 loops=1)
 Output: t3_0.id, t3_0.rank
 Index Cond: (ROW(t3_0.rank, t3_0.id) <= ROW($0, $1))
 Heap Fetches: 0


I'm an opposite of a PostgreSQL expert, but it was surprising to me to see
that a correlated subquery behaves better than a join. Is this normal? Is
it something worth fixing/easy to fix?

Sincerely,
Bakhtiyar


Re: pgindent vs. pgperltidy command-line arguments

2023-06-21 Thread Andrew Dunstan


On 2023-06-21 We 05:09, Peter Eisentraut wrote:

On 20.06.23 17:38, Andrew Dunstan wrote:

+1, although I wonder if we shouldn't follow pgindent's new lead
and require some argument(s).


That makes sense to me.  Here is a small update with this behavior 
change and associated documentation update.


I'm intending to add some of the new pgindent features to pgperltidy. 
Preparatory to that here's a rewrite of pgperltidy in perl - no new 
features yet but it does remove the hardcoded path, and requires you 
to pass in one or more files / directories as arguments.


Are you planning to touch pgperlcritic and pgperlsyncheck as well? 



Yeah, it would make sense to.


If not, part of my patch would still be useful.  Maybe I should commit 
my posted patch for PG16, to keep consistency with pgindent, and then 
your work would presumably be considered for PG17.



That sounds like a good plan.


cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


bgwriter doesn't flush WAL stats

2023-06-21 Thread Nazir Bilal Yavuz
Hi,

I was trying to add WAL stats to pg_stat_io. While doing that I was
comparing pg_stat_wal and pg_stat_io's WAL stats and there was some
inequality between the total number of WALs. I found that the difference
comes from bgwriter's WALs. bgwriter generates WAL but it doesn't flush
them because the pgstat_report_wal() function isn't called in bgwriter. I
attached a small patch for calling the pgstat_report_wal() function in
bgwriter.

bgwriter generates WAL by calling functions in this order:
bgwriter.c -> BackgroundWriterMain() -> BgBufferSync() -> SyncOneBuffer()
-> FlushBuffer() -> XLogFlush() -> XLogWrite()

I used a query like BEGIN; followed by lots of (3000 in my case) INSERT,
DELETE, or UPDATE, followed by a COMMIT while testing.
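
A workload along those lines (purely illustrative; the table and the use of
generate_series() instead of individual statements are my own):

  BEGIN;
  CREATE TABLE wal_test (id int, val text);
  INSERT INTO wal_test SELECT g, 'x' FROM generate_series(1, 3000) g;
  UPDATE wal_test SET val = 'y';
  DELETE FROM wal_test;
  COMMIT;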

Example output before patch applied:

┌─┬─┐
│  view_name  │ total_wal_write │
├─┼─┤
│ pg_stat_wal │   10318 │
│ pg_stat_io  │   10321 │
└─┴─┘

┌─┬┬┐
│backend_type │ object │ writes │
├─┼┼┤
│ autovacuum launcher │ wal│  0 │
│ autovacuum worker   │ wal│691 │
│ client backend  │ wal│   8170 │
│ background worker   │ wal│  0 │
│ background writer   │ wal│  3 │
│ checkpointer│ wal│  1 │
│ standalone backend  │ wal│737 │
│ startup │ wal│  0 │
│ walsender   │ wal│  0 │
│ walwriter   │ wal│719 │
└─┴┴┘

After the patch has been applied, there are no differences between
pg_stat_wal and pg_stat_io.

I appreciate any comment/feedback on this patch.

Regards,
Nazir Bilal Yavuz
Microsoft
From 9317e86e4b2b3ba202ce090f383ac48526b97955 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 21 Jun 2023 13:39:13 +0300
Subject: [PATCH v1] Flush WAL in bgwriter

bgwriter generates WAL but it doesn't flush them, so they can't be seen
in pg_stat_wal. Call pgstat_report_wal() to flush WAL in bgwriter.
---
 src/backend/postmaster/bgwriter.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index caad642ec9..f2e4f23d9f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -241,6 +241,7 @@ BackgroundWriterMain(void)
 
 		/* Report pending statistics to the cumulative stats system */
 		pgstat_report_bgwriter();
+		pgstat_report_wal(true);
 
 		if (FirstCallSinceLastCheckpoint())
 		{
-- 
2.40.1



Re: RFC: logical publication via inheritance root?

2023-06-21 Thread Amit Kapila
On Tue, Jun 20, 2023 at 10:39 PM Jacob Champion  wrote:
>
> On Fri, Jun 16, 2023 at 9:24 PM Amit Kapila  wrote:
>
> > The other idea that came across my mind was to provide some schema
> > mapping kind of feature on subscribers where we could route the tuples
> > from table X to table Y provided they have the same or compatible
> > schema. I don't know if this is feasible or how generally it will be
> > useful and whether any other DB (replication solution) provides such a
> > feature but I guess something like that would have helped your use
> > case.
>
> Yes, that may have also worked. Making it a subscriber-side feature
> requires tight coupling between the two peers, though. (For the
> timescaledb case, how does the subscriber know which new partitions
> belong to which root?
>

Yeah, the subscriber can't figure that out automatically. Users need
to provide the mapping manually.

-- 
With Regards,
Amit Kapila.




Re: [BUG] recovery of prepared transactions during promotion can fail

2023-06-21 Thread Michael Paquier
On Wed, Jun 21, 2023 at 11:11:55AM +0200, Julian Markwort wrote:
> I see you've already undone it.
> Attached is a patch for 009_twophase.pl to just try this corner case at the 
> very end, so as not to influence other
> existing tests in suite.
> 
> When I run this on REL_14_8 I get the error again, sort of as a sanity 
> check...

+$cur_primary->enable_archiving;

enable_archiving is a routine aimed at being used internally by
Cluster.pm, so this does not sound like a good idea to me.

Relying on a single node to avoid the previous switchover problem is a
much better idea than what I have tried, but wouldn't it be better to
just move the new test to a separate script and parallelize more?  The
runtime of 009_twophase.pl is already quite long.

It is worth noting that I am not going to take any bets with the
buildfarm before 16beta2.  Doing that after REL_16_STABLE is created
will limit the risks.
--
Michael


signature.asc
Description: PGP signature


Re: postgres_fdw: wrong results with self join + enable_nestloop off

2023-06-21 Thread Nishant Sharma
Looks good to me. Tested on master and it works.
The new patch uses a bool flag to avoid the calls for both the FDW and
custom-scan hooks, and makes a slight change to the comment of the
"has_pseudoconstant_clauses" function.

Regards,
Nishant.

On Wed, Jun 14, 2023 at 12:19 PM Etsuro Fujita 
wrote:

> On Mon, Jun 5, 2023 at 10:19 PM Etsuro Fujita 
> wrote:
> > To avoid this issue, I am wondering if we should modify
> > add_paths_to_joinrel() in back branches so that it just disallows the
> > FDW to consider pushing down joins when the restrictlist has
> > pseudoconstant clauses.  Attached is a patch for that.
>
> I think that custom scans have the same issue, so I modified the patch
> further so that it also disallows custom-scan providers to consider
> join pushdown in add_paths_to_joinrel() if necessary.  Attached is a
> new version of the patch.
>
> Best regards,
> Etsuro Fujita
>


Re: [BUG] recovery of prepared transactions during promotion can fail

2023-06-21 Thread Julian Markwort
First off, thanks for the quick reaction and reviews, I appreciate it.

On Wed, 2023-06-21 at 14:14 +0900, Michael Paquier wrote:
> But that won't connect work as the segment requested is now a partial
> one in the primary's pg_wal, still the standby wants it.

I think since 009_twophase.pl doesn't use archiving so far, it's not a good 
idea to enable it generally, for all those
tests. It changes too much of the behaviour.

> I think that it is better
> for now to just undo has_archiving in has_archiving, and tackle the
> coverage with a separate test, perhaps only for HEAD.

I see you've already undone it.
Attached is a patch for 009_twophase.pl to just try this corner case at the 
very end, so as not to influence other
existing tests in suite.

When I run this on REL_14_8 I get the error again, sort of as a sanity check...

Kind regards
Julian
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
index 900d181788..706fc1bc10 100644
--- a/src/test/recovery/t/009_twophase.pl
+++ b/src/test/recovery/t/009_twophase.pl
@@ -7,7 +7,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 24;
+use Test::More tests => 26;
 
 my $psql_out = '';
 my $psql_rc  = '';
@@ -480,3 +480,34 @@ $cur_standby->psql(
 is( $psql_out,
 	qq{27|issued to paris},
 	"Check expected t_009_tbl2 data on standby");
+
+###
+# Check recovery of prepared transaction with DDL inside after a hard restart
+# of the primary, now with archiving enabled, to test handling of .partial files.
+###
+
+$cur_primary->enable_archiving;
+$cur_primary->restart;
+
+$cur_primary->psql(
+	'postgres', "
+	BEGIN;
+	CREATE TABLE t_009_tbl7 (id int, msg text);
+	SAVEPOINT s1;
+	INSERT INTO t_009_tbl7 VALUES (31, 'issued to ${cur_primary_name}');
+	PREPARE TRANSACTION 'xact_009_18';");
+
+$cur_primary->stop('immediate');
+$cur_primary->set_standby_mode;
+$cur_primary->start;
+
+$cur_primary->promote;
+
+my $logfile = slurp_file($cur_primary->logfile());
+ok( $logfile =~
+	  qr/recovering prepared transaction .* from shared memory/,
+	'want to see that a prepared transaction was recovered');
+
+# # verify that recovery and promotion finished and that the prepared transaction still exists.
+$psql_rc = $cur_primary->psql('postgres', "COMMIT PREPARED 'xact_009_18'");
+is($psql_rc, '0', 'Commit prepared transaction after crash recovery');


Re: pgindent vs. pgperltidy command-line arguments

2023-06-21 Thread Peter Eisentraut

On 20.06.23 17:38, Andrew Dunstan wrote:

+1, although I wonder if we shouldn't follow pgindent's new lead
and require some argument(s).


That makes sense to me.  Here is a small update with this behavior 
change and associated documentation update.


I'm intending to add some of the new pgindent features to pgperltidy. 
Preparatory to that here's a rewrite of pgperltidy in perl - no new 
features yet but it does remove the hardcoded path, and requires you to 
pass in one or more files / directories as arguments.


Are you planning to touch pgperlcritic and pgperlsyncheck as well?  If 
not, part of my patch would still be useful.  Maybe I should commit my 
posted patch for PG16, to keep consistency with pgindent, and then your 
work would presumably be considered for PG17.





Re: ProcessStartupPacket(): database_name and user_name truncation

2023-06-21 Thread Kyotaro Horiguchi
At Wed, 21 Jun 2023 09:43:50 +0200, "Drouvot, Bertrand" 
 wrote in 
> Trying to connect with the 64 bytes name:
> 
> $ psql -d 
> psql: error: connection to server on socket "/tmp/.s.PGSQL.55448"
> failed: FATAL: database "äää" does not
> exist

IMHO, I'm not sure we should allow connections without the exact name
being provided. In that sense, I think we might want to consider
outright rejecting the establishment of a connection when the given
database name doesn't fit the startup packet, since the database with
the exact given name cannot be found.

While it is somewhat off-topic, I cannot establish a connection if the
console encoding differs from the template database even if I provide
the identical database name. (I don't mean I want that behavior to be
"fix"ed.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Remove deprecation warnings when compiling PG ~13 with OpenSSL 3.0~

2023-06-21 Thread Peter Eisentraut

On 21.06.23 09:43, Michael Paquier wrote:

On Wed, Jun 21, 2023 at 09:16:38AM +0200, Daniel Gustafsson wrote:

Agreed, I'd be more inclined to go with OPENSSL_API_COMPAT.  If we still get
warnings with that set then I feel those warrant special consideration rather
than a blanket suppression.


4d3db136 seems to be OK on REL_13_STABLE with a direct cherry-pick.
REL_11_STABLE and REL_12_STABLE require two different changes:
- pg_config.h.win32 needs to list OPENSSL_API_COMPAT.
- Solution.pm needs an extra #define OPENSSL_API_COMPAT in
GenerateFiles() whose value can be retrieved from configure.in like in
13~.

Anything I am missing perhaps?


Backpatching the OPENSSL_API_COMPAT change would set the minimum OpenSSL 
version to 1.0.1, which is newer than what was so far required in those 
branches.  That is the reason we didn't do this.






Re: remap the .text segment into huge pages at run time

2023-06-21 Thread John Naylor
On Wed, Jun 21, 2023 at 10:42 AM Andres Freund  wrote:

> So I am wondering if you're encountering a different kind of problem. As I
> mentioned, I have observed that the pages need to be clean for this to
> work. For me adding a "sync path/to/postgres" makes it work on 6.3.8.
Without
> the sync it starts to work a while later (presumably when the kernel got
> around to writing the data back).

Hmm, then after rebooting today, it shouldn't have that problem until a
build links again, but I'll make sure to do that when building. Still same
failure, though. Looking more closely at the manpage for madvise, it has
this under MADV_HUGEPAGE:

"The  MADV_HUGEPAGE,  MADV_NOHUGEPAGE,  and  MADV_COLLAPSE  operations  are
available only if the kernel was configured with
CONFIG_TRANSPARENT_HUGEPAGE and file/shmem memory is only supported if the
kernel was configured with CONFIG_READ_ONLY_THP_FOR_FS."

Earlier, I only checked the first config option but didn't know about the
second...

$ grep CONFIG_READ_ONLY_THP_FOR_FS /boot/config-$(uname -r)
# CONFIG_READ_ONLY_THP_FOR_FS is not set

Apparently, it's experimental. That could be the explanation, but now I'm
wondering why the fallback

madvise(addr, advlen, MADV_HUGEPAGE);

didn't also give an error. I wonder if we could mremap to some anonymous
region and call madvise on that. That would be more similar to the hack I
shared last year, which may be more fragile, but now it wouldn't
need explicit huge pages.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Remove deprecation warnings when compiling PG ~13 with OpenSSL 3.0~

2023-06-21 Thread Michael Paquier
On Wed, Jun 21, 2023 at 09:16:38AM +0200, Daniel Gustafsson wrote:
> Agreed, I'd be more inclined to go with OPENSSL_API_COMPAT.  If we still get
> warnings with that set then I feel those warrant special consideration rather
> than a blanket suppression.

4d3db136 seems to be OK on REL_13_STABLE with a direct cherry-pick.
REL_11_STABLE and REL_12_STABLE require two different changes:
- pg_config.h.win32 needs to list OPENSSL_API_COMPAT.
- Solution.pm needs an extra #define OPENSSL_API_COMPAT in
GenerateFiles() whose value can be retrieved from configure.in like in
13~.

Anything I am missing perhaps?
--
Michael




ProcessStartupPacket(): database_name and user_name truncation

2023-06-21 Thread Drouvot, Bertrand

Hi hackers,

Please find attached a patch to truncate (in ProcessStartupPacket())
the port->database_name and port->user_name in such a way as to not break
multibyte character boundaries.

Indeed, currently one can create a database this way:

postgres=# create database ;
NOTICE:  identifier "" will be truncated to 
"äää"
CREATE DATABASE

The database name has been truncated from 64 bytes to 62 bytes thanks to
pg_mbcliplen(), which ensures that multibyte character boundaries are not broken.

postgres=# select datname, OCTET_LENGTH(datname),encoding from pg_database;
 datname | octet_length | encoding
-+--+--
 äää |   62 |6

Trying to connect with the 64-byte name:

$ psql -d 
psql: error: connection to server on socket "/tmp/.s.PGSQL.55448" failed: FATAL:  
database "äää" does not exist


It fails because the truncation done in ProcessStartupPacket():

"
if (strlen(port->database_name) >= NAMEDATALEN)
	port->database_name[NAMEDATALEN - 1] = '\0';
"

does not take care of multibyte character boundaries.

On the other hand, it works when no multibyte characters are involved:

postgres=# create database 
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijke;
NOTICE:  identifier "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijke" 
will be truncated to "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk"
CREATE DATABASE

postgres=# select datname, OCTET_LENGTH(datname),encoding from pg_database;
 datname | octet_length | encoding
-+--+--
 abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk |   63 |6

The database name is truncated to 63 bytes, and connecting with the 64-byte
name then works:

$ psql -d abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijke
psql (16beta1)
Type "help" for help.

The comment in ProcessStartupPacket() states:

"
/*
 * Truncate given database and user names to length of a Postgres name.
 * This avoids lookup failures when overlength names are given.
 */
"

The last sentence does not hold in the multibyte case (as seen in the
first example).

About the patch:

As the database encoding is not yet known in ProcessStartupPacket()
(and we cannot even be sure that the provided database exists), the proposed
patch does not rely on pg_mbcliplen() but on pg_encoding_mbcliplen().

The proposed patch uses the client encoding, which it determines as follows:

- use the one requested in the startup packet (if we come across it)
- use the one from the locale (if we did not find a client encoding request
in the startup packet)
- use PG_SQL_ASCII (if neither of the above is available)
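
To make the idea more concrete, here is a rough sketch of the clipping
(not the patch itself; the helper name and its arguments are made up for
illustration):

#include "postgres.h"

#include "mb/pg_wchar.h"

/*
 * Clip "name" to at most NAMEDATALEN - 1 bytes without splitting a
 * multibyte character.  client_encoding is -1 when the startup packet
 * did not request one; fall back to the locale, then to SQL_ASCII.
 */
static void
clip_name_mbaware(char *name, int client_encoding)
{
	int			encoding = client_encoding;

	if (encoding < 0)
		encoding = pg_get_encoding_from_locale(NULL, false);
	if (encoding < 0)
		encoding = PG_SQL_ASCII;

	if (strlen(name) >= NAMEDATALEN)
		name[pg_encoding_mbcliplen(encoding, name, strlen(name),
								   NAMEDATALEN - 1)] = '\0';
}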

Happy to discuss any other thoughts or suggestions.

With the proposed patch in place, using the first example above (and the
64-byte name) we would get:

$ PGCLIENTENCODING=LATIN1 psql -d 
psql: error: connection to server on socket "/tmp/.s.PGSQL.55448" failed: FATAL:  
database "äää" does not exist

but this one would allow us to connect:

$ PGCLIENTENCODING=UTF8 psql -d 
psql (16beta1)
Type "help" for help.

The patch does not provide a documentation update or a related TAP test
(but those could be added if we feel the need).

Looking forward to your feedback,

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 6ccb4f03f56dda9e9a2616c6e0875a97a8711d72 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 20 Jun 2023 10:26:16 +
Subject: [PATCH v1] multibyte truncation for database and user name

The database_name and user_name truncation done in ProcessStartupPacket() does
not take care of multibyte character boundaries.  In the multibyte case, this
defeats the initial goal of avoiding lookup failures when overlength names are
given.
---
 src/backend/postmaster/postmaster.c | 48 ++---
 1 file changed, 44 insertions(+), 4 deletions(-)
 100.0% src/backend/postmaster/

diff --git a/src/backend/postmaster/postmaster.c 
b/src/backend/postmaster/postmaster.c
index 4c49393fc5..72991c4eab 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1951,6 +1951,7 @@ ProcessStartupPacket(Port *port, bool ssl_done, bool 
gss_done)
char   *buf;
ProtocolVersion proto;
MemoryContext oldcontext;
+   int client_encoding = -1;
 
pq_startmsgread();
 
@@ -2238,6 +2239,18 @@ retry1:
{

Re: Consistent coding for the naming of LR workers

2023-06-21 Thread Alvaro Herrera
On 2023-Jun-21, Peter Smith wrote:

> Except, please note that there are already multiple message format
> strings in the HEAD code that look like "%s for subscription ...",
> that are using the get_worker_name() function for the name
> substitution.
> 
> e.g.
> - "%s for subscription \"%s\" will stop because the subscription was removed"
> - "%s for subscription \"%s\" will stop because the subscription was disabled"
> - "%s for subscription \"%s\" will restart because of a parameter change"
> - "%s for subscription %u will not start because the subscription was
> removed during startup"
> - "%s for subscription \"%s\" will not start because the subscription
> was disabled during startup"
> - "%s for subscription \"%s\" has started"

That is a terrible pattern in relatively new code.  Let's get rid of it
entirely rather than continue to propagate it.

> So, I don't think it is fair to say that these format strings are OK
> for the existing HEAD code, but not OK for the patch code, when they
> are both the same.

Agreed.  Let's remove them all.

BTW this is documented:
https://www.postgresql.org/docs/15/nls-programmer.html#NLS-GUIDELINES
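
To spell out what that guideline means for these messages (get_worker_name()
as in HEAD; "subname" is just a placeholder):

/* hard to translate: the sentence is assembled from a run-time worker name */
ereport(LOG,
		(errmsg("%s for subscription \"%s\" has started",
				get_worker_name(), subname)));

/* preferred: one complete, separately translatable sentence per worker type */
ereport(LOG,
		(errmsg("logical replication apply worker for subscription \"%s\" has started",
				subname)));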

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I suspect most samba developers are already technically insane...
Of course, since many of them are Australians, you can't tell." (L. Torvalds)




Re: Remove deprecation warnings when compiling PG ~13 with OpenSSL 3.0~

2023-06-21 Thread Daniel Gustafsson
> On 21 Jun 2023, at 07:44, Andres Freund  wrote:
> On 2023-06-21 11:53:44 +0900, Michael Paquier wrote:

>> I have been annoyed by these in the past when doing backpatches, as
>> this creates some noise, and the only place where this counts is
>> sha2_openssl.c.  Thoughts about doing something like the attached for
>> ~13?
> 
> Wouldn't the proper fix be to backpatch 4d3db13621b?

Agreed, I'd be more inclined to go with OPENSSL_API_COMPAT.  If we still get
warnings with that set then I feel those warrant special consideration rather
than a blanket suppression.

--
Daniel Gustafsson





Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

2023-06-21 Thread Michael Paquier
On Tue, Jun 20, 2023 at 11:39:31PM -0400, Tom Lane wrote:
> I'd be okay with adding \v to the set of whitespace characters in
> scan.l and scanner_isspace (and other affected places) for v17.
> Don't want to back-patch it though.

Okay.  No idea where this will lead, but for now I have sent a patch
that adds \v to the parser paths where it would be needed, as far as I
checked:
https://www.postgresql.org/message-id/zjkcjnwwhhvw9...@paquier.xyz
--
Michael




Consider \v to the list of whitespace characters in the parser

2023-06-21 Thread Michael Paquier
Hi all,
(Adding Evan in CC as he has reported the original issue with hstore.)

$subject has shown up as a topic for discussion when looking at the
set of whitespace characters that we use in the parsers:
https://www.postgresql.org/message-id/ca+hwa9btrdf52dhyu+jooqealgrgro5uhuytfuduoj3cbfe...@mail.gmail.com

On HEAD, these are \t, \n, \r and \f, which is consistent with the list
that we use in scanner_isspace().

This has quite some history: 9ae2661 first dealt with an old issue in
BSD's isspace(), where whitespace may not be detected correctly.
hstore was recently changed to fix the same problem in d522b05, but
since it still depends on scanner_isspace() for the job, its handling
of \v remains kind of strange.

That's not the end of the story.  There is an inconsistency with the
way array values handle the same problem: 95cacd1 added \v to the list
of what they consider whitespace.

Attached is a patch to bring a bit more consistency across the board,
by adding \v to the set of characters that are considered as
whitespace by the parser.  Here are a few things that I have noticed
in passing:
- JSON should not escape \v, as defined in RFC 7159.
- syncrep_scanner.l already considered \v as whitespace.  Its
neighbor repl_scanner.l did not.
- There are a few more copies that would need a refresh of what is
considered as a whitespace in their respective lex scanners:
psqlscan.l, psqlscanslash.l, cubescan.l, segscan.l, ECPG's pgc.l.

One thing I was wondering: has the SQL specification anything specific
about the way vertical tabs should be parsed?

Thoughts and comments are welcome.
Thanks,
--
Michael
diff --git a/src/backend/parser/parse_type.c b/src/backend/parser/parse_type.c
index be75dc6ab0..63b4e96962 100644
--- a/src/backend/parser/parse_type.c
+++ b/src/backend/parser/parse_type.c
@@ -742,7 +742,7 @@ typeStringToTypeName(const char *str, Node *escontext)
 	ErrorContextCallback ptserrcontext;
 
 	/* make sure we give useful error for empty input */
-	if (strspn(str, " \t\n\r\f") == strlen(str))
+	if (strspn(str, " \t\n\r\f\v") == strlen(str))
 		goto fail;
 
 	/*
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index b2216a9eac..a6c67c7151 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -219,7 +219,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  * to agree.
  */
 
-space			[ \t\n\r\f]
+space			[ \t\n\r\f\v]
 horiz_space		[ \t\f]
 newline			[\n\r]
 non_newline		[^\n\r]
@@ -1414,6 +1414,8 @@ unescape_single_char(unsigned char c, core_yyscan_t yyscanner)
 			return '\r';
 		case 't':
 			return '\t';
+		case 'v':
+			return '\v';
 		default:
 			/* check for backslash followed by non-7-bit-ASCII */
 			if (c == '\0' || IS_HIGHBIT_SET(c))
diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c
index ed67f5f5fe..4f0005a114 100644
--- a/src/backend/parser/scansup.c
+++ b/src/backend/parser/scansup.c
@@ -121,6 +121,7 @@ scanner_isspace(char ch)
 		ch == '\t' ||
 		ch == '\n' ||
 		ch == '\r' ||
+		ch == '\v' ||
 		ch == '\f')
 		return true;
 	return false;
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index cb467ca46f..1cc7fb858c 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -73,7 +73,7 @@ static void addlitchar(unsigned char ychar);
 %x xd
 %x xq
 
-space			[ \t\n\r\f]
+space			[ \t\n\r\f\v]
 
 quote			'
 quotestop		{quote}
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index 9000f83a83..4359dbd83d 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -24,6 +24,7 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/supportnodes.h"
 #include "optimizer/optimizer.h"
+#include "parser/scansup.h"
 #include "port/pg_bitutils.h"
 #include "utils/array.h"
 #include "utils/arrayaccess.h"
@@ -89,7 +90,6 @@ typedef struct ArrayIteratorData
 	int			current_item;	/* the item # we're at in the array */
 }			ArrayIteratorData;
 
-static bool array_isspace(char ch);
 static int	ArrayCount(const char *str, int *dim, char typdelim,
 	   Node *escontext);
 static bool ReadArrayStr(char *arrayStr, const char *origStr,
@@ -254,7 +254,7 @@ array_in(PG_FUNCTION_ARGS)
 		 * Note: we currently allow whitespace between, but not within,
 		 * dimension items.
 		 */
-		while (array_isspace(*p))
+		while (scanner_isspace(*p))
 			p++;
 		if (*p != '[')
 			break;/* no more dimension items */
@@ -338,7 +338,7 @@ array_in(PG_FUNCTION_ARGS)
 	 errdetail("Missing \"%s\" after array dimensions.",
 			   ASSGN)));
 		p += strlen(ASSGN);
-		while (array_isspace(*p))
+		while (scanner_isspace(*p))
 			p++;
 
 		/*
@@ -434,27 +434,6 @@ array_in(PG_FUNCTION_ARGS)
 	PG_RETURN_ARRAYTYPE_P(retval);
 }
 
-/*
- * array_isspace() --- a non-locale-dependent isspace()
- *
- * We used to use isspace() for parsing array values, but that 

Re: Assert while autovacuum was executing

2023-06-21 Thread Peter Geoghegan
On Tue, Jun 20, 2023 at 10:27 PM Andres Freund  wrote:
> As far as I can tell 72e78d831a as-is is just bogus. Unfortunately that likely
> also means 3ba59ccc89 is not right.

Quite possibly. But I maintain that ginInsertCleanup() is probably
also bogus in a way that's directly relevant.

Did you know that ginInsertCleanup() is the only code that uses
heavyweight page locks these days? Though only on the index metapage!

Isn't this the kind of thing that VACUUM's relation level lock is
supposed to take care of?


--
Peter Geoghegan




Re: Adding further hardening to nbtree page deletion

2023-06-21 Thread Peter Geoghegan
On Tue, Jun 20, 2023 at 10:39 PM Andres Freund  wrote:
> But the further we go down this path, the more important it is that we provide
> some way to monitor stuff like this. IME it's not particularly practical to
> rely on scanning logs to find such issues at scale. I suspect we ought to add
> at least something that makes such "ignored errors" visible from stats.

I'm in favor of that, of course. We do at least use
ERRCODE_INDEX_CORRUPTED for all of the ERRORs that became LOG messages
in the past several years. That's a start.

FWIW, I'm almost certain that I'll completely run out of ERRORs to
demote to LOGs before too long. In fact, this might very well be the
last ERROR that I ever have to demote to a LOG to harden nbtree
VACUUM. There just aren't that many ERRORs that would benefit from
similar treatment. And most of the individual cases that I've
addressed come up very infrequently in practice.

-- 
Peter Geoghegan